Profiling Wikibase APIs and import speed

There has been some recent chat once again on the Wikibase Telegram groups around importing, and the best approach to import a large amount of data into a Wikibase instance. Two years ago I started a little GitHub project aimed at profiling the speed of loading using the action API with various settings, database versions, etc., as well as trying out a bulk load API. I have just taken the opportunity to take another look at it and visualize some of the comparisons, given the changes over the last two years.

In case you don’t want to read and follow everything below, the key takeaways are:

  • An EPS (edits per second) of around 150 is achievable on a single laptop
  • When testing imports, you really need to test at least 50k items to get some good figures
  • The two ID generation related settings are VERY IMPORTANT if you want to maximise import speed
  • Make async requests, but not too many, likely tuned to the number of CPUs you have serving web requests. You want near 100% utilization (see the sketch after this list)
  • A batch API, such as FrozenMink/batchingestionextension, would dramatically improve import speed
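To make the async point concrete, here is a minimal sketch of concurrent item creation through the action API, assuming Python with aiohttp; the endpoint URL, CSRF token, and item payloads are placeholders, authentication is left out, and the concurrency limit is the knob to tune against the number of CPUs serving web requests:

```python
import asyncio
import json

import aiohttp

API_URL = "https://my-wikibase.example/w/api.php"  # assumption: your Wikibase endpoint
CONCURRENCY = 8  # tune towards the number of CPUs serving web requests

async def create_item(session: aiohttp.ClientSession, sem: asyncio.Semaphore,
                      csrf_token: str, item_data: dict) -> dict:
    """Create one item via wbeditentity, limited by the semaphore."""
    async with sem:
        async with session.post(API_URL, data={
            "action": "wbeditentity",
            "new": "item",
            "data": json.dumps(item_data),
            "token": csrf_token,
            "format": "json",
            "bot": "1",
        }) as resp:
            return await resp.json()

async def import_items(items: list[dict], csrf_token: str) -> list[dict]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        # NOTE: a real import also needs login cookies or OAuth on this session,
        # and a freshly fetched CSRF token rather than a placeholder.
        tasks = [create_item(session, sem, csrf_token, item) for item in items]
        return await asyncio.gather(*tasks)

# Example usage (token and labels are placeholders):
# results = asyncio.run(import_items(
#     [{"labels": {"en": {"language": "en", "value": f"Item {i}"}}} for i in range(1000)],
#     csrf_token="+\\",
# ))
```

Raising the concurrency beyond what the web servers can actually process in parallel tends to just queue requests rather than increase throughput, which is why tuning it to CPU count matters.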

Some napkin-math estimates I would hope to see for smallish items:

  • 1 million items, 2 hours (validated)
  • 10 million items, 1 day
  • Wikidata (116 million items), 14+ days
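These estimates roughly follow from the ~150 edits per second figure above, with some extra headroom since large imports rarely sustain peak throughput. A quick back-of-envelope check (the constant-rate assumption is mine):

```python
# Rough import-time estimates at a sustained 150 edits per second.
EDITS_PER_SECOND = 150

for items in (1_000_000, 10_000_000, 116_000_000):
    hours = items / EDITS_PER_SECOND / 3600
    print(f"{items:>11,} items: ~{hours:,.1f} hours (~{hours / 24:.1f} days)")

# ~1.9 hours for 1M, ~18.5 hours for 10M, ~9 days for 116M at a constant rate;
# real imports tend to fall short of the peak rate, hence the headroom above.
```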


Profiling a Wikibase item creation on test.wikidata.org

Today I was in a Wikibase Stakeholder group call, and one of the discussions was around Wikibase importing speed, data loading, and the APIs. My previous blog post covering what happens when you make a new Wikibase item was raised, and we also got onto the topic of profiling.

So here comes another post looking at some of the internals of Wikibase, through the lens of profiling on test.wikidata.org.

The tools used to write this blog post are both open source and publicly available for Wikimedia infrastructure. You can do similar profiling on your own Wikibase, or on requests that you suspect are slow on Wikimedia sites such as Wikidata.

Wikimedia Profiling

Profiling of Wikimedia sites is managed and maintained by the Wikimedia performance team. They have a blog, and one of the most recent posts actually covers profiling PHP at scale in production, so if you want to know the details of how this is achieved, give it a read.

Throughout this post I will be looking at data collected from a production Wikimedia request, by setting the X-Wikimedia-Debug header in my request. This header has a few options, and you can find the docs on wikitech.wikimedia.org. There are also browser extensions available to easily set this header on your requests.
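As a minimal sketch, assuming Python with the requests library, a profiled request might look like the following; the header attributes and the debug backend hostname are assumptions to check against the X-Wikimedia-Debug documentation on wikitech.wikimedia.org:

```python
import requests

# Assumption: the "profile" attribute asks for the request to be profiled and
# made available in XHGui, and the backend hostname below is only an example;
# the wikitech.wikimedia.org docs list the currently valid attributes and backends.
response = requests.get(
    "https://test.wikidata.org/wiki/Special:NewItem",
    headers={"X-Wikimedia-Debug": "backend=mwdebug1001.eqiad.wmnet; profile"},
)
print(response.status_code)
```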

I will be using the Wikimedia hosted XHGui to visualize the profile data. Wikimedia specific documentation for this interface also exists on wikitech.wikimedia.org. This interface contains a randomly sampled set of profiled requests, as well as any requests for which profiling was explicitly requested.

Profiling PHP & MediaWiki

If you want to profile your own MediaWiki or Wikibase install, or PHP in general, then you should take a look at the mediawiki.org documentation page for this. You’ll likely want to use either Tideways or XDebug, but probably want to avoid having to set up any extra UI to visualize the data.

This profiling only covered the main PHP application (MediaWiki & Wikibase extension). Other services such as the query service would require separate profiling.
