Profiling Wikibase APIs and import speed

There has been renewed discussion in the Wikibase Telegram groups about importing, and about the best approach for getting a large amount of data into a Wikibase instance. Two years ago I started a small GitHub project to profile loading speed through the action API under various settings, DB versions and so on, and to try out a bulk load API. I have just taken the opportunity to look at it again and to visualize some of the comparisons, given the changes over the last 2 years.

In case you don’t want to read and follow everything below, the key takeaways are:

  • EPS (edits per second) of around 150 are achievable on a single laptop
  • When testing imports, you really need to test at least 50k items to get some good figures
  • The 2 ID generation related settings are VERY IMPORTANT if you want to maximise import speed
  • Make async requests, but not too many; the number is likely best tuned to the number of CPUs you have serving web requests. You want near 100% utilization
  • A batch API, such as FrozenMink/batchingestionextension, would dramatically increase import speed
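The "async, but not too many" advice can be sketched as a bounded pool of concurrent requests. This is a minimal illustration and not the code from the profiling project: `fake_create_item` is a hypothetical stand-in for one item-creation request, and the concurrency limit is assumed to be set by hand to roughly the number of CPUs serving web requests on the Wikibase host.

```python
import asyncio
import time


async def fake_create_item(n: int) -> int:
    """Hypothetical stand-in for a single item-creation request."""
    await asyncio.sleep(0.01)  # simulated request latency
    return n


async def import_items(count: int, concurrency: int, create=fake_create_item) -> dict:
    """Create `count` items with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def worker(n: int) -> int:
        nonlocal in_flight, peak
        async with semaphore:  # blocks once `concurrency` requests are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await create(n)
            finally:
                in_flight -= 1

    start = time.perf_counter()
    results = await asyncio.gather(*(worker(n) for n in range(count)))
    elapsed = time.perf_counter() - start
    return {
        "created": len(results),
        "peak_in_flight": peak,
        "eps": count / elapsed,  # edits per second for this run
    }


def run_import(count: int, concurrency: int) -> dict:
    """Synchronous entry point around the async import."""
    return asyncio.run(import_items(count, concurrency))


if __name__ == "__main__":
    print(run_import(count=200, concurrency=8))
```

Raising `concurrency` beyond what the web servers can process should stop improving the measured EPS, which is one way to find the point of near 100% utilization.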

Some napkin math estimates for smallish items, which I would hope are achievable:

  • 1 million items, 2 hours (validated)
  • 10 million items, 1 day
  • Wikidata (116 million items), 14+ days
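The arithmetic behind these estimates is simple enough to check. The sketch below assumes a constant sustained rate, which a real import is unlikely to hold exactly; the function names are illustrative only.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def import_duration_seconds(items: int, eps: float) -> float:
    """Time needed to create `items` items at a sustained rate of `eps`."""
    return items / eps


def implied_eps(items: int, seconds: float) -> float:
    """Sustained edits per second implied by a target duration."""
    return items / seconds


if __name__ == "__main__":
    # 1 million items in 2 hours
    print(implied_eps(1_000_000, 2 * SECONDS_PER_HOUR))
    # 10 million items at 150 EPS, expressed in hours
    print(import_duration_seconds(10_000_000, 150) / SECONDS_PER_HOUR)
    # 116 million items in 14 days
    print(implied_eps(116_000_000, 14 * SECONDS_PER_DAY))
```

Under that assumption, 1 million items in 2 hours works out to roughly 139 EPS, 10 million items at 150 EPS takes about 18.5 hours, and 14 days for 116 million items implies roughly 96 EPS sustained, so the Wikidata figure leaves some headroom below the 150 EPS peak.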


What happens in Wikibase when you make a new Item?

A recent post on the Wikibase mailing list about Wikibase and bulk imports prompted me to write up a mostly human readable version of what happens, in what order, and when, for Wikibase action API edits, for the specific case of item creation.
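For context, an item creation through the action API is a single `wbeditentity` request with `new=item`. The sketch below only builds the request parameters; the helper name is hypothetical, and it assumes a CSRF token has already been obtained from the wiki and that the parameters are then POSTed to the wiki's `api.php`.

```python
import json


def build_new_item_params(label: str, language: str, csrf_token: str) -> dict:
    """Build the POST parameters for creating one item with a single label."""
    data = {"labels": {language: {"language": language, "value": label}}}
    return {
        "action": "wbeditentity",
        "new": "item",
        "data": json.dumps(data),  # entity data is sent as a JSON string
        "token": csrf_token,       # assumed to be fetched beforehand
        "format": "json",
    }


if __name__ == "__main__":
    print(build_new_item_params("Example item", "en", "dummy-token"))
```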

There are a fair few areas in the existing APIs and code that could be improved and optimized for a bulk import use case. Some are actively being worked on today (T285987), some are on the roadmap, such as the new REST APIs for Wikibase, and others are out there waiting to be considered.

This post is written looking at Wikibase and MediaWiki 1.36, with links to GitHub for code references. Some areas may be glossed over or even slightly inaccurate, so take everything here with a pinch of salt.

Reach out to me on Twitter if you have questions or fancy another deep dive.
