Profiling Wikibase APIs and import speed

March 29, 2025, by addshore

There has been some recent chat once again on the Wikibase telegram groups around importing, and the best approach to import a large amount of data into a Wikibase instance. 2 years ago I started a little GitHub project aimed at profiling the speed of loading using the action API, and various settings, DB versions etc, as well as trying out a bulk load API. And I have just taken the opportunity to take another look at it and try to visualize some of the comparisons given changes through the last 2 years.

In case you don’t want to read and follow everything below, the key takeaways are:

  • An EPS (edits per second) of around 150 is achievable on a single laptop
  • When testing imports, you really need to test at least 50k items to get some good figures
  • The 2 ID-generation-related settings are VERY IMPORTANT if you want to maximise import speed
  • Make async requests, but not too many, likely tuned to the number of CPUs you have serving web requests. You want near 100% utilization
  • A batch API, such as FrozenMink/batchingestionextension, would dramatically increase import speed

Some napkin-math benchmarks I would hope for with smallish items (the rough arithmetic follows this list):

  • 1 million items, 2 hours (validated)
  • 10 million items, 1 day
  • Wikidata (116 million items), 14+ days
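For reference, the arithmetic behind those estimates, assuming a sustained rate of roughly 150 edits per second (which, as the later tests show, drops off at larger volumes):

```bash
# Back-of-the-envelope timings at ~150 edits per second
echo $(( 1000000 / 150 ))    # ~6666 s,   just under 2 hours for 1 million items
echo $(( 10000000 / 150 ))   # ~66666 s,  roughly 18.5 hours for 10 million items
echo $(( 116000000 / 150 ))  # ~773333 s, roughly 9 days for 116 million items;
                             # lower sustained rates push this toward the 14+ day estimate
```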

General method

The profiling code makes use of the Wikibase docker images, and some bash scripts. The docker images provide an installation of MediaWiki and Wikibase that has some default configuration, which I can easily alter, and also mount additional files into (such as a batch import API). There is a YAML file that generally shows a bunch of different variables that were changed throughout the tests.

These include (a rough sketch of how one such run might be driven follows the list):

  • Instances: 1 or 2 instances of Wikibase being run. Though I didn’t detect much difference, so I generally stuck with 1
  • Async: How many requests to try and make at once (don’t wait for the last request to finish before starting the next)
  • Images: Which Wikibase image, and thus version to use. Possible versions were… wikibase:1.39.1-wmde.11, wikibase:1.40.3-wmde.19, wikibase:1.41.1-wmde.20, wikibase:mw1.42.5, wikibase:mw1.43.0
  • SQL: Which DB image to use. Not much difference was detected, so mariadb:11.7 was generally used
  • Settings: What settings to alter for MediaWiki and or Wikibase
  • Loads: How to load the data (which bash script)
  • Entity Groups: In the case of batching, how many items to create in a single request
  • Entity Counts: How many entities to create during the test (Generally leaning toward 5000)
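To make the setup a little more concrete, here is a minimal sketch of how a matrix of runs like this might be driven. The script names, compose setup, and WIKIBASE_IMAGE variable are illustrative, not the actual code in the repository:

```bash
#!/usr/bin/env bash
# Illustrative only: loop over a couple of the variables and time each load.
images=("wikibase:1.39.1-wmde.11" "wikibase:mw1.43.0")
asyncs=(1 30 60)
entity_count=5000

for image in "${images[@]}"; do
  for async in "${asyncs[@]}"; do
    # Assumes a compose file that reads WIKIBASE_IMAGE, and a load script per "load" type
    WIKIBASE_IMAGE="$image" docker compose up -d --wait
    start=$(date +%s)
    ./loads/create-items.sh "$entity_count" "$async"
    stop=$(date +%s)
    echo "$image,$async,$entity_count,$((stop - start))" >> results.csv
    docker compose down -v
  done
done
```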

The data for this post was generated on my laptop within a VM that had 20 cores and 30GB memory allocated to it, although not much of this was actually used during the tests. (At some point in the future I might do further profiling while being more restrictive on the CPU and memory allocation for the VM, as I imagine that will have some impact).

At some stage I was checking the MediaWiki statistics to see how the imports had gone, however I had to stop doing that as I didn’t seem to be able to reliably retrieve the final edit or page count from there (perhaps some jobs needed to run etc). So instead, I counted the number of items the APIs claimed to have successfully created as validation.

The raw data is accessible in CSV form at https://github.com/addshore/wikibase-profile/blob/9c8025ff7b0c7aa0f78f7de5bba532c57fcdb411/archive/2025-03-29.csv, with the following columns:

  • runtime: When the run happened
  • instance, async, image, sql, setting, load, entity_count (see above)
  • start_time & stop_time: Timing points around the main load to profile
  • time: Seconds taken for the main load
  • success_line: Number of successful writes (though the generation of this changed throughout, and some earlier values might be wrong)

The key number that I’ll be referring to in this post is EPS, or edits per second.
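As a quick worked example, per-run EPS can be pulled out of that CSV with awk, looking the columns up by the header names listed above (assuming the header row matches those names exactly):

```bash
# Compute EPS (entity_count / time) per run from the archived CSV.
# Column positions are resolved from the header row, not hard-coded.
awk -F, '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $(col["time"]) > 0 { print $(col["runtime"]), $(col["entity_count"]) / $(col["time"]) }
' archive/2025-03-29.csv
```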

idGeneratorSeparateDbConnection

I dabbled with a bunch of different settings across tests, deciding that the only one that made a noticeable difference during mass bulk fast and async imports was idGeneratorSeparateDbConnection. You can see the others I played with in this pinned settings file.

The separate DB connection for ID generator setting is something that was added to Wikibase specifically for Wikidata, to alleviate DB issues with transactions when lots of new entity IDs were needed. You can find the docs on doc.wikimedia.org.

In simple terms, when the setting is false (the default), an ID is minted as part of the main transaction for creating an entity. This can lead to the next request that wants an ID having to wait for the prior transaction to complete before it can get one. When the setting is true, the ID allocation happens in a very short transaction on its own DB connection.
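If you want to flip this yourself, it is a single Wikibase repo setting; a minimal sketch, assuming you have some way of appending to the LocalSettings.php that your MediaWiki actually loads (for the docker images, typically a mounted or included settings file):

```bash
# Append to the LocalSettings.php used by your MediaWiki/Wikibase install
cat >> LocalSettings.php <<'PHP'
// Mint new entity IDs on their own DB connection, outside the main edit transaction
$wgWBRepoSettings['idGeneratorSeparateDbConnection'] = true;
PHP
```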

In fact, when trying to use async requests with a bulk API that allows creating multiple items in a single request, this setting had to be enabled, otherwise requests would fail for me during testing, which led to me carrying out the rest of my profiling with this setting turned on!

Across 11 tests, creating 2000 items with this setting either ON or OFF, the results were clear…

| idGeneratorSeparateDbConnection | min s | max s | mean s | mean EPS |
|---------------------------------|-------|-------|--------|----------|
| true                            | 15    | 17    | 15.75  | 126.98   |
| false                           | 28    | 34    | 30.85  | 64.82    |

mysql-upsert idGenerator

I had totally forgotten about this setting until I was reading the documentation for the setting above, so at the last minute I figured I’d give it a little test as well, as in theory an UPSERT results in fewer MySQL queries during the ID minting process, and thus less back and forth between the PHP code of Wikibase and the DB server…

The alternative to the UPSERT, and the default, can be seen here; it SELECTs the current value, updates it in PHP code, and then INSERTs the new value into the DB table.
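Switching generators is likewise a one-line repo setting; the same caveat about how you get config into your install applies:

```bash
cat >> LocalSettings.php <<'PHP'
// Use the upsert-based ID generator: the counter is bumped in a single upsert
// statement rather than a SELECT, a PHP-side increment, and a separate write
$wgWBRepoSettings['idGenerator'] = 'mysql-upsert';
PHP
```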

There was a noticeable improvement in speed with the upsert id generator being used for a bulk import, as ID generation is certainly one of the pinch points when going via the APIs.

While creating 5000 items, with an async setting of 30, across 50 different tests, I gathered the following data which shows the upsert ID generator to be marginally faster.

| idGenerator | min s | max s | mean s | mean EPS |
|-------------|-------|-------|--------|----------|
| upsert      | 44    | 52    | 46.69  | 107.08   |
| default     | 44    | 65    | 49.96  | 100.08   |

However, the rest of my profiles will not use the upsert generator, as I didn’t notice this until the end!

Database comparison

The DB is an easy thing to change, and in the past I had compared MySQL and MariaDB to determine that seemingly MariaDB was marginally faster. In this test I simply compared MariaDB 10.9 vs 11.7, and generally found any difference to be negligible…

While loading 2000 items, in individual requests, trying to make up to 60 requests at a time on Wikibase 1.39, mariadb:11.7 marginally won across 9 tests.

| DB           | min s | max s | mean s | mean EPS |
|--------------|-------|-------|--------|----------|
| mariadb:10.9 | 15    | 17    | 15.75  | 126.98   |
| mariadb:11.7 | 14    | 17    | 15.20  | 131.57   |

However, loading 5000 items, in individual requests, trying to make up to 30 requests at a time on Wikibase 1.43, mariadb 10.9 marginally won across 30 tests.

| DB           | min s | max s | mean s | mean EPS |
|--------------|-------|-------|--------|----------|
| mariadb:10.9 | 45    | 48    | 46     | 108.69   |
| mariadb:11.7 | 44    | 65    | 50.4   | 99.20    |

So you are unlikely to get much performance gain at the surface level just by changing your DB version.

I do imagine there is other tweaking that you could do further down in the DB to optimize however.

Throughout the rest of the tests, I’ll stick to mariadb:11.7 for consistency…

Action vs REST

As I upgraded the profiling stack to work with newer versions of Wikibase, including 1.43, I could finally test the REST API speed for item creation alongside the action API.

I was expecting the REST API to come out marginally on top, which would correlate with what I had seen in the wild, however it looks like the REST API was marginally slower.

Across 30 tests on the mw1.43.0 image, trying to make 30 async requests, the action API won while creating 5000 items.

| API                   | min s | max s | mean s | mean EPS |
|-----------------------|-------|-------|--------|----------|
| action (wbeditentity) | 44    | 65    | 50.4   | 99.20    |
| rest                  | 57    | 81    | 69.42  | 72.02    |
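For reference, these are roughly the shapes of the two requests being compared. The host, port, and auth handling are placeholders, and the REST route prefix has changed between Wikibase versions (v0 vs v1), so check both against your own install:

```bash
# Action API: create an item via wbeditentity (form-encoded, needs a CSRF token)
curl -s -X POST "http://localhost:8080/w/api.php" \
  --data-urlencode "action=wbeditentity" \
  --data-urlencode "new=item" \
  --data-urlencode 'data={"labels":{"en":{"language":"en","value":"Example"}}}' \
  --data-urlencode "token=${CSRF_TOKEN}" \
  --data-urlencode "format=json"

# REST API: create an item (JSON body; auth shown here as an OAuth bearer token)
curl -s -X POST "http://localhost:8080/w/rest.php/wikibase/v1/entities/items" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${OAUTH_TOKEN}" \
  -d '{"item":{"labels":{"en":"Example"}}}'
```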

Async requests

This one doesn’t really need fancy comparisons. Waiting for one request to complete before starting your next one will result in a much slower creation rate…

Creating 2000 items one at a time might take 1064 seconds. If you up that to 5 simultaneous requests, you can do it in 217 seconds, roughly 1/5th of the time. Try 60 at a time, and you can do it in around 34 seconds, roughly 1/6th of that again. Here you can start to see the diminishing returns.

Finding the right number will likely take some time for your individual setup, but certainly don’t do one at a time if you want speed!
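A minimal way to get this kind of concurrency from plain bash is xargs -P; the sketch below fires single-item wbeditentity creates with up to $ASYNC requests in flight. The URL, port, and CSRF token handling are placeholders for your own setup.

```bash
# Create $TOTAL items via wbeditentity, keeping up to $ASYNC requests in flight.
TOTAL=2000
ASYNC=30
seq "$TOTAL" | xargs -P "$ASYNC" -I{} curl -s -X POST "http://localhost:8080/w/api.php" \
  --data-urlencode "action=wbeditentity" \
  --data-urlencode "new=item" \
  --data-urlencode 'data={"labels":{"en":{"language":"en","value":"Item {}"}}}' \
  --data-urlencode "token=${CSRF_TOKEN}" \
  --data-urlencode "format=json" \
  > /dev/null
```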

Bulk requests

Providing multiple items in a single request removes lots of the overhead and reduces the overall number of DB transactions needed to create items. For testing, I hacked the wbeditentity API to accept multiple items and create them in a single request. However, if you are looking for a nicer solution for your Wikibase, you might find FrozenMink/batchingestionextension useful (which should enable you to see the same performance gains).

How’s that for foreshadowing?
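I won’t pretend to know the exact parameter names any given batch extension uses, so purely as a shape sketch: the idea is to send one request whose payload contains many items, rather than one request per item. The "wbbatchingest" module name and its "entities" parameter below are hypothetical:

```bash
# Build a JSON array of 50 small items and send it in ONE request.
# "wbbatchingest" and "entities" are hypothetical names; check the docs of
# whichever batch extension or hacked API module you actually install.
batch=$(for i in $(seq 50); do
  printf '{"labels":{"en":{"language":"en","value":"Batch item %s"}}},' "$i"
done)
batch="[${batch%,}]"   # strip the trailing comma and wrap in a JSON array

curl -s -X POST "http://localhost:8080/w/api.php" \
  --data-urlencode "action=wbbatchingest" \
  --data-urlencode "entities=${batch}" \
  --data-urlencode "token=${CSRF_TOKEN}" \
  --data-urlencode "format=json"
```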

I don’t need as many runs to show how fast batching can be, but here are some tests made up of 5000 item creations using a batch API on Wikibase 1.39.1, trying to reach 60 async requests at a time.

| Batched | min s | max s | mean EPS |
|---------|-------|-------|----------|
| 1       | 49    | 50    | 101.01   |
| 2       | 30    | 34    | 156.25   |
| 5       | 24    | 24    | 208.33   |
| 10      | 17    | 20    | 270.27   |
| 25      | 13    | 17    | 333.33   |
| 50      | 13    | 17    | 333.33   |

So, being able to create multiple items in a single request appears to give us 3x faster item creation than creating a single item at a time.

I asked an AI to plot this on a simple chart for me, and this is what we get…

This is primarily meant to highlight the possible gains that can come from using a batch API instead of creating items one by one, and I imagine you will not actually reach this rate of creation, as you’ll have larger, more varied and complex items.

Pushing for speed

Dialling in what we can tweak, let’s see how fast it can go…

So, having IDs use their own connection and the UPSERT method is a given, as is using the batch API, but what about the other settings? And maybe more instances?

I tried running a bunch of tests on 1 vs 2 instances of Wikibase (running on the same hardware) and there really were not many differences for 5000 items and differing batch sizes, so I’ll lock that in at 1.

Moving from 30 to 60 async requests as a target, and from 50 to 100 to 150 items per batch, I could see marginal improvements in various tests, but I started to realize that testing with only 5000 items would no longer work, as these combinations would result in all items being created in only a handful of requests… So, time to up our item target.

Aiming for 25k meant far more requests running at the same time, and some started failing; that’s when I realized that my curl timeout was set to 20 seconds, so I upped it to 5 minutes.

As I ramped the batch sizes up to 300 I started encountering curl: Argument list too long, due to how I was making the requests, so 250 is where I had to leave it, which seems to be OK.
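Both issues have straightforward workarounds in curl itself, sketched below (reusing the hypothetical batch module and payload from the earlier sketch): --max-time raises the per-request timeout, and --data-urlencode’s name@file form reads the payload from a file instead of the command line, so the batch size is no longer limited by the OS argument-length limit.

```bash
# Allow slow responses (up to 5 minutes) and pass the large payload via a file.
payload_file=$(mktemp)
printf '%s' "$batch" > "$payload_file"   # $batch as built in the earlier sketch

curl -s --max-time 300 -X POST "http://localhost:8080/w/api.php" \
  --data-urlencode "action=wbbatchingest" \
  --data-urlencode "entities@${payload_file}" \
  --data-urlencode "token=${CSRF_TOKEN}" \
  --data-urlencode "format=json"

rm -f "$payload_file"
```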

But here is a summary of the runs toward the end…

| Async | Batch | Items     | Time (s)      | EPS |
|-------|-------|-----------|---------------|-----|
| 150   | 250   | 5,000     | 12            | 416 |
| 150   | 250   | 10,000    | 23            | 434 |
| 150   | 250   | 25,000    | 56            | 446 |
| 150   | 250   | 50,000    | 273           | 183 |
| 150   | 250   | 100,000   | 585           | 170 |
| 150   | 250   | 250,000   | 1516          | 164 |
| 150   | 250   | 500,000   | 3282 (0.9 h)  | 152 |
| 30    | 250   | 100,000   | 572           | 174 |
| 30    | 250   | 500,000   | 3069 (0.8 h)  | 163 |
| 30    | 250   | 1,000,000 | 6390 (1.7 h)  | 156 |
| 30    | 250   | 2,000,000 | 13348 (3.7 h) | 150 |
| 16    | 250   | 500,000   | 3417 (0.9 h)  | 146 |

So, on my hardware at least, you need to do a test of at least 50k items in order to determine some sort of real longer-term import speed.

Final thoughts and observations

I’m pretty sure I could go faster with multiple processes, and if I rewrote this to not be in bash I could likely hit even larger batch sizes at once, and potentially at this scale multiple Wikibase instances might come back into play?

The next observation is that CPU was a limiting factor on the import speed. Both PHP and MariaDB ended up being very visible in my process list, and my CPU was pegged, with a load of 130…

So, having a separate set of DB server hardware, and possibly also multiple instances of Wikibase running on multiple different sets of hardware, would likely improve speed further (but this is probably for another blog post…)

My last test using the batch API was run with 16 async requests, and my CPU was still pegged. So, as expected, if batching you need less async, but for single requests, more async helps fill the CPU idle time.

I also noticed that the moment I tried to open the Wikibase in a web browser, my import API requests would start failing, likely due to timeouts, especially when running at high levels of async, so doing this on a live site would likely lead to very different outcomes.

Off the back of this, I feel a next round of import profiling would make sense if split out over a few different bits of hardware, where resources could scale more. It’s likely such a setup could be used to initially generate a DB, and then actually run that on the hardware you desire.

Of course, things such as RaiseWikibase still exist and can be used, which write directly to the DB (removing over half of this CPU usage); however, they likely come with a bunch of follow-up PHP scripts and processes that need to run to correctly populate other DBs and secondary stores etc (but I am saying this without having tried them).