Wikidata query service updater evolution

The Wikidata Query Service (WDQS) sits in front of Wikidata and provides a SPARQL API for querying its data. The query service itself is built on top of Blazegraph, but in many regards it is very similar to any other triple store that provides a SPARQL API.
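To make the SPARQL API a little more concrete, here is a minimal Python sketch of how a client might talk to it. The endpoint URL is the public Wikidata one; the query, the helper names `build_request` and `run_query`, and the User-Agent string are made up for illustration, and a self-hosted query service would have its own URL and prefixes.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public Wikidata endpoint; a self-hosted query service exposes its own URL.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: five items that are an instance of (P31) house cat (Q146).
EXAMPLE_QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q146 .
}
LIMIT 5
"""


def build_request(query, endpoint=WDQS_ENDPOINT):
    """Build (but do not send) a GET request carrying the SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical client name; use something that identifies your tool.
        "User-Agent": "wdqs-example/0.1 (illustrative sketch)",
    }
    return Request(url, headers=headers)


def run_query(query, endpoint=WDQS_ENDPOINT):
    """Send the query and return the list of result bindings."""
    with urlopen(build_request(query, endpoint), timeout=30) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_query(EXAMPLE_QUERY)` needs network access; `build_request` on its own only assembles the URL and headers, which makes it easy to inspect what would be sent.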

In the early days of the query service (circa 2015), the service was only run by Wikidata, hence the name. However, as interest in and usage of Wikibase continued to grow, more people started running a query service of their own for the data in their own Wikibase. You’ll notice, though, that most people still refer to it as WDQS today.

Whereas most core Wikibase functionality is developed by Wikimedia Deutschland, the query service is developed by the search platform team at the Wikimedia Foundation, with a focus on wikidata.org but also with the goal of keeping it usable outside of Wikimedia infrastructure.

The query service itself currently works as a whole application rather than just a database. Under the surface, it can roughly be split into 2 key parts:

  • Backend Blazegraph database that stores and indexes data
  • Updater process that takes data from a Wikibase and puts it in the database

This actually means that you can run your own query service without running a Wikibase at all. For example, you can load the whole of Wikidata into a query service that you operate, and have it stay up to date with current events. In practice, though, this takes quite some work, plus the expense of storage and indexing, so I expect not many folks do this.

Over time, the updater element of the query service has gone through several iterations. The updater packaged with Wikibase, as used by most folks outside of the Wikimedia infrastructure, is now 2 steps behind the updater used for Wikidata itself.

The updater generations look something like this:

  • HTTP API Recent Changes polling updater (used by most Wikibases)
  • Kafka based Recent Changes polling updater
  • Streaming updater (used on Wikidata)
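As a rough illustration of the first generation in that list, the Python sketch below polls a wiki's Recent Changes over the MediaWiki HTTP API and collects the titles of pages that changed. It is not the query service's actual updater code; the function names are hypothetical, and the step that would fetch each entity's data and write it into Blazegraph is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def recent_changes_params(since, cont=None, limit=100):
    """Query parameters for the MediaWiki API's list=recentchanges module."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        "rcdir": "newer",    # oldest first, starting at `since`
        "rcstart": since,    # ISO 8601 timestamp, e.g. 2022-01-01T00:00:00Z
        "rclimit": str(limit),
        "format": "json",
    }
    # The API returns a "continue" object that is sent back to get the next page.
    params.update(cont or {})
    return params


def changed_titles(response):
    """Unique page titles from one API response, in first-seen order."""
    titles = []
    for change in response.get("query", {}).get("recentchanges", []):
        if change["title"] not in titles:
            titles.append(change["title"])
    return titles


def fetch_json(api_url, params):
    """Perform the HTTP request (needs network access)."""
    request = Request(
        api_url + "?" + urlencode(params),
        headers={"User-Agent": "rc-poll-example/0.1 (illustrative sketch)"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def poll_once(api_url, since, fetch=fetch_json):
    """Follow continuation until all changes since `since` have been seen."""
    titles, cont = [], None
    while True:
        response = fetch(api_url, recent_changes_params(since, cont))
        for title in changed_titles(response):
            if title not in titles:
                titles.append(title)
        cont = response.get("continue")
        if not cont:
            return titles
```

An updater built this way would call something like `poll_once` on a timer, then refresh each returned entity in the database. Deduplicating titles matters because an entity edited several times between polls only needs to be refreshed once.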

Let’s take a look at a high-level overview of these updaters, what has changed and why. I’ll also be applying some pretty arbitrary / gut feeling scores to 4 categories for each updater.


Infrastructure as Code for wbstack deployments

This entry is part 12 of 12 in the series WBStack

For most of its life, wbstack was mostly a one-man operation. This certainly sped up the decision-making process around features, requests, communication and prioritization, but it also meant I had to maintain a complex and young project supporting hundreds of sites alongside my regular 8-hour day job.

To ensure that I’d feel comfortable with this extra context, be able to support the platform for multiple years, have a platform that could grow and scale from day one, and leave the future of the platform as open as possible, I roughly followed a few principles throughout implementation and operation.

  • Scalability: Think about scale at multiple levels. Everything was either already horizontally scalable, or the path to get to horizontal scalability had been thought out
  • Automation: Automate actions. If you have 2 of something now, pretend you have 1000 of them instead and develop the solution to fit
  • Infrastructure as code: All infrastructure configuration was contained somehow in the deploy repository
  • Cloud agnostic: Things would be cloud-agnostic where possible, resulting in most things being in Kubernetes or using other external services
  • Own fewer things: Try to not create many new services or codebases, or take ownership of forks that should not exist, as this will become too much work

The one part of the above list that I want to dive into more in this post is infrastructure as code and how it worked for the multi-year lifespan of wbstack, before the move to wikibase.cloud.


WikiCrowd at 50k answers

In January 2022 I published a new Wikimedia tool called WikiCrowd.

This tool allows people to answer simple questions to contribute edits to Wikimedia projects such as Wikimedia Commons and Wikidata.

It’s designed to be able to deal with a wide variety of questions, but due to time constraints, the current questions only cover aliases for Wikidata and depicts statements for Wikimedia Commons.
