Profiling Wikibase APIs and import speed

There has been some recent chat once again in the Wikibase Telegram groups around importing, and the best approach for importing a large amount of data into a Wikibase instance. Two years ago I started a little GitHub project aimed at profiling load speed using the action API across various settings, DB versions, and so on, as well as trying out a bulk load API. I have just taken the opportunity to look at it again and visualize some of the comparisons, given the changes of the last two years.

In case you don’t want to read and follow everything below, the key takeaways are:

  • EPS (edits per second) of around 150 is achievable on a single laptop
  • When testing imports, you really need to test at least 50k items to get meaningful figures
  • The 2 ID-generation-related settings are VERY IMPORTANT if you want to maximise import speed
  • Make async requests, but not too many; tune the concurrency to roughly the number of CPUs you have serving web requests, as you want near 100% utilization (see the sketch after this list)
  • A batch API, such as FrozenMink/batchingestionextension, would dramatically improve import speed
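
To make the async point concrete, here is a minimal sketch of a concurrent importer using the action API's wbeditentity module. The API URL, bot credentials, item count, and concurrency value are assumptions for a local test instance, not values taken from the profiling project itself.

```python
import json
from concurrent.futures import ThreadPoolExecutor

import requests

# Assumed values for a local test Wikibase; adjust for your own instance.
API = "http://localhost/w/api.php"
BOT_USER = "ImportBot@import"        # created via Special:BotPasswords (assumed name)
BOT_PASS = "botpassword-secret"
CONCURRENCY = 4                      # roughly the CPUs serving web requests


def make_session() -> requests.Session:
    """Log in with a bot password and return a session holding the cookies."""
    s = requests.Session()
    login_token = s.get(API, params={
        "action": "query", "meta": "tokens", "type": "login", "format": "json",
    }).json()["query"]["tokens"]["logintoken"]
    s.post(API, data={
        "action": "login", "lgname": BOT_USER, "lgpassword": BOT_PASS,
        "lgtoken": login_token, "format": "json",
    })
    return s


def create_item(s: requests.Session, token: str, label: str) -> dict:
    """Create one new item via wbeditentity (one edit, i.e. one EPS unit)."""
    data = {"labels": {"en": {"language": "en", "value": label}}}
    return s.post(API, data={
        "action": "wbeditentity", "new": "item", "data": json.dumps(data),
        "token": token, "bot": 1, "format": "json",
    }).json()


def import_chunk(labels: list) -> int:
    """Each worker gets its own session and CSRF token, then loops over its share."""
    s = make_session()
    token = s.get(API, params={
        "action": "query", "meta": "tokens", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]
    return sum("entity" in create_item(s, token, label) for label in labels)


if __name__ == "__main__":
    labels = [f"Test item {i}" for i in range(1000)]
    chunks = [labels[i::CONCURRENCY] for i in range(CONCURRENCY)]
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        print(sum(pool.map(import_chunk, chunks)), "items created")
```

In practice you would also want error handling and maxlag back-off, but even this much is usually enough to saturate the web-serving CPUs on a small instance.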

Some napkin math benchmarks for smallish items (at roughly 150 edits per second), I would hope:

  • 1 million items, 2 hours (validated)
  • 10 million items, 1 day
  • Wikidata (116 million items), 14+ days

Read more

Vuetify app with Wikimedia OAuth login

Do you often find yourself wanting to make a basic (or complex) web app that is client-side only and will log users into Wikimedia sites with ease? Me too!

I have been trying this every year or so, and it’s gradually been getting easier. This year it didn’t take long to get a really nice template web app set up using Vue, Vuetify and an OAuth 2.0 Wikimedia consumer (thanks to the OAuth extension).

Firstly, some links that you’ll find useful:

Starting off with a default Vuetify app install using vite (commit c3edb0f), you’ll end up with a basic web page that just says welcome to Vuetify. You can copy the code in my commit, or just follow the Vuetify instructions.

Read more

Wikibase Phrase Entity, Viewing

This entry is part 7 of 7 in the series Wikibase Entities

In my previous post, we got to the point of being able to create a new Wikibase entity. It is stored in the MediaWiki database as a page; however, we can’t currently view it via any interface.

In this post, we will work through another set of code changes, tackling each issue as we see it arise, until we can see the entity represented in the various places that users might expect.

Viewing the page

The provided entity serialization is neither legacy nor current

When clicking on one of the links on Special:RecentChanges to a phrase page that we have created, we get our first error.

/wiki/Phrase:Phrase66900b01937842.29097733 MWContentSerializationException: The provided entity serialization is neither legacy nor current
from /var/www/html/w/extensions/Wikibase/lib/includes/Store/EntityContentDataCodec.php(253)

The full stack trace is a little large, but you can find it in a paste bin.

This error is very similar to an issue we saw in the creation blog post, but this time the codec class cannot deserialize what we have stored in the database, as we have not registered a deserializer for phrases.

Adding a deserializer to the entity registration file is very simple:

Read more

Wikimedia Enterprise: A first look

Wikimedia Enterprise is a new (now 1-year-old) service offered by the Wikimedia Foundation, via Wikimedia, LLC.

This is a wholly-owned LLC that provides opt-in services for third-party content reuse, delivered via API services.

In essence, Wikimedia Enterprise is an optional product that third parties can choose to use. It repackages data from within Wikimedia projects into a more useful, reliable, and stable format, presented primarily via data downloads and APIs, with profits going to the Wikimedia Foundation.

Want to find out more? Read the FAQ.

The project and APIs are well documented, and access can be requested for free, but I wanted to spend a little bit of time hands-on with the APIs to get a full understanding of what is offered, the formats, and how it differs from things I know are exposed elsewhere in Wikimedia projects.

Account Creation

Wikimedia Enterprise accounts are separate from any other Wikimedia related accounts, so you’ll need a new one.

In order to get an account you need to fill out a pretty straightforward form (username, password, email, and accept terms). You then need to verify your email address. Tada, you are in!

Read more

Finding the most liked tweets for a topic in a year

I’m nearly halfway through writing a month of daily blog posts. I wanted to write some posts covering the history of both Wikidata and Wikibase on Twitter. Being a developer, I looked for APIs, but it seems tweets are not as accessible as they once were.

This is a short write-up of my adventure, covering APIs, scraping thoughts, and finally my working solution, albeit with a quirk or two that I can’t explain.

Read more

What happens in Wikibase when you make a new Item?

A recent Wikibase mailing list post on the topic of Wikibase and bulk imports prompted me to write up a mostly human-readable version of what happens, in what order and when, for Wikibase action API edits, for the specific case of item creation.

There are a fair few areas that could be improved and optimized for a bulk import use case in the existing APIs and code. Some of which are actively being worked on today (T285987). Some of which are on the roadmap, such as the new REST APIs for Wikibase. And others which are out there, waiting to be considered.

This post is written looking at Wikibase and MediaWiki 1.36, with links to GitHub for code references. Some areas may be glossed over or even slightly inaccurate, so take everything here with a pinch of salt.

Reach out to me on Twitter if you have questions or fancy another deep dive.

Read more

Using Hue & Hive to quickly determine Wikidata API maxlag usage

Hue, or Hadoop User Experience, is described by its documentation pages as “a Web application that enables you to easily interact with an Hadoop cluster”.

The Wikimedia Foundation has a Hue frontend for their Hadoop cluster, which contains various datasets including web requests, API usage and the MediaWiki edit history for all hosted sites. The install can be accessed at https://hue.wikimedia.org/ using Wikimedia LDAP for authentication.

Once logged in, Hue can be used to write Hive queries with syntax highlighting, auto-suggestions and formatting, as well as allowing users to save queries with names and descriptions, run queries from the browser, and watch Hadoop job execution state.

The Wikidata & maxlag bit

MediaWiki has a maxlag API parameter that can be passed alongside API requests in order to cause errors / stop writes from happening when the DB servers are lagging behind the master. Within MediaWiki this lag can also be raised when the JobQueue is very full. Recently, Wikibase introduced the ability to raise this lag when the dispatching of changes to client projects is also lagging behind. In order to see how effective this will be, we can take a look at previous API calls.
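
To make the client side of this concrete, here is a minimal sketch of how a well-behaved client passes maxlag and backs off when the servers report lag. The endpoint, maxlag value, and retry count are illustrative assumptions, not anything specific to the Hive analysis in this post.

```python
import time

import requests

API = "https://www.wikidata.org/w/api.php"  # illustrative endpoint


def api_get(params: dict, maxlag: int = 5, retries: int = 5) -> dict:
    """Call the action API with maxlag set, sleeping and retrying on lag errors."""
    params = dict(params, format="json", maxlag=maxlag)
    for _ in range(retries):
        resp = requests.get(API, params=params,
                            headers={"User-Agent": "maxlag-example/0.1"})
        data = resp.json()
        if data.get("error", {}).get("code") != "maxlag":
            return data
        # The servers are lagged more than `maxlag` seconds; wait and retry.
        time.sleep(int(resp.headers.get("Retry-After", maxlag)))
    raise RuntimeError("gave up: servers stayed lagged")


if __name__ == "__main__":
    print(api_get({"action": "query", "meta": "siteinfo"}))
```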

Read more

The break in Wikidata edits on 28 Jan 2016

On the 28th of January 2016, all Wikimedia MediaWiki APIs had two short outages, documented on Wikitech here. The outages didn’t have much of an impact on most projects hosted by Wikimedia. However, because most Wikidata editing happens through the API, even when using the UI, the project basically stopped for roughly … Read more

Github release download count – Chrome Extension

GitHub tracks the number of downloads for all assets (files) that are attached to a release, but it currently makes it very hard for users to get at this information: the number of downloads is only accessible through the API.

I noticed this many months ago when wondering how many people were downloading the new C++ version of Huggle. At the time I started coming up with some odd little script that I could set running somewhere, checking the download counts and updating a static list; I even thought about just uploading the build files to some other site that tracked this information.

A few days ago I took my first look at developing Chrome extensions, for a referencing tool for Wikidata, and thus today I thought: why not just make an extension for Chrome that adds the download counts to the GitHub releases page!
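
Under the hood, such an extension only needs the public releases endpoint. A minimal sketch of pulling per-release download counts from the GitHub REST API, with an illustrative repository name rather than any repo mentioned in the post, might look like this:

```python
import requests

# Illustrative repository; any owner/repo with release assets will do.
OWNER, REPO = "octocat", "Hello-World"


def release_download_counts(owner: str, repo: str) -> dict:
    """Return {release tag: total asset downloads} from the GitHub REST API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/releases"
    releases = requests.get(
        url, headers={"Accept": "application/vnd.github+json"}
    ).json()
    return {
        release["tag_name"]: sum(a["download_count"] for a in release["assets"])
        for release in releases
    }


if __name__ == "__main__":
    for tag, downloads in release_download_counts(OWNER, REPO).items():
        print(f"{tag}: {downloads} downloads")
```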

Read more

Media Hack Day 2014

The Media Hack Day is an annual event held at the Axel Springer | Plug & Play Accelerator in Berlin. The event for March 2014 can be found on hackerleague.org. I attended representing the Wikidata API. Also in attendance were Axel Springer, storyful, Der Spiegel, sanoma, watchmi, Getty Images and embed.ly. It was in a great location, and the food and drinks were … Read more