Wikidata Architecture Overview (diagrams)

December 19, 2018 By addshore

Over the years, diagrams have appeared in a variety of forms covering various areas of the architecture of Wikidata. Now, as the current tech lead for Wikidata, it is my turn.

Wikidata has slowly become a more and more complex system, including multiple extensions, services and storage backends. Those of us who work with it on a day-to-day basis have a pretty good idea of the full system, but it can be challenging for others to get up to speed. Hence, diagrams!

All diagrams can currently be found on Wikimedia Commons using this search, and are released under CC-BY-SA 4.0. The layout of the diagrams with extra whitespace is intended to allow easy comparison of diagrams that feature the same elements.

High level overview

High level overview of the Wikidata architecture

This overview shows the Wikidata website, running Mediawiki with the Wikibase extension, in the blue box on the left. Various other extensions are also run, such as WikibaseLexeme, WikibaseQualityConstraints, and PropertySuggester.

Wikidata is accessed through a Varnish caching and load balancing layer provided by the WMF. Users, tools and any 3rd parties interact with Wikidata through this layer.

Off to the right are various other external services provided by the WMF. Hadoop, Hive, Oozie and Spark make up part of the WMF analytics cluster used for creating pageview datasets. Graphite and Grafana provide live monitoring. There are many other general WMF services that are not listed in the diagram.

Finally, we have our semi-persistent and persistent storage, used directly by Mediawiki and Wikibase. This includes Memcached and Redis for caching, SQL (MariaDB) for primary metadata, Blazegraph for triples, Swift for files and Elasticsearch for search indexing.

Getting data into Wikidata

There are two ways to interact with Wikidata: the UI or the API.

The primary UI is JS based and itself interacts with the API. The JS UI covers most of the core functionality of Wikibase, with the exception of some small features such as merging of entities (T140124, T181910).

A non-JS UI also exists, covering most features; it is made up of a series of Mediawiki SpecialPages. Due to the complexities around editing statements, there is currently no non-JS UI for that.

The API and UIs interact with Wikidata entities stored as Mediawiki pages, saving changes to persistent storage and doing any other necessary work.
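For readers who want to see what the API side looks like, here is a minimal sketch using Python and the requests library (neither of which is part of the Wikidata stack itself). The module names wbgetentities and wbeditentity are real Wikibase API modules; the login step is elided and the write half is a shape-only sketch, not something to run against Q42.

```python
# A minimal sketch of talking to the Wikidata API with `requests`.
# Reading is anonymous; writing needs an authenticated session and a CSRF token.
import json
import requests

API = "https://www.wikidata.org/w/api.php"

# Read: fetch an entity as JSON (wbgetentities is part of the Wikibase API).
resp = requests.get(API, params={
    "action": "wbgetentities",
    "ids": "Q42",
    "props": "labels|descriptions|claims",
    "format": "json",
})
entity = resp.json()["entities"]["Q42"]
print(entity["labels"]["en"]["value"])

# Write (sketch only): edits go through the same API, e.g. wbeditentity.
# A logged-in session is assumed here; the CSRF token comes from action=query&meta=tokens.
session = requests.Session()
# ... log the session in here (OAuth or action=login) before editing ...
token = session.get(API, params={
    "action": "query", "meta": "tokens", "format": "json",
}).json()["query"]["tokens"]["csrftoken"]
session.post(API, data={
    "action": "wbeditentity",
    "id": "Q42",
    "data": json.dumps({"labels": {"en": {"language": "en", "value": "Douglas Adams"}}}),
    "token": token,
    "format": "json",
})
```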

Wikidata data getting to Wikipedia

Wikidata clients within the Wikimedia cluster can use data from Wikidata in a variety of ways. The most common and automatic way is the generation of the “Languages” sidebar on projects, linking to the same article in other languages.

Data can also be accessed through the property parser function and various Lua functions.

Once entities are updated on wikidata.org, that data needs to be pushed to client sites that are subscribed to the entity. This happens using various subscription metadata tables on both the clients and the repo (wikidata.org) itself. The Mediawiki job queue is used to process the updates outside of a regular web request, and the whole process is controlled by a cron job running the dispatchChanges.php maintenance script.

For wikidata.org, multiple copies of the dispatchChanges script run simultaneously. Each copy looks at the list of client sites and the changes that have happened since updates were last pushed, determines whether updates need to be pushed, and queues jobs to actually update the data where needed, causing a page purge on the client. When these jobs run, the changes are also added to the client's recent changes table so that they appear next to other changes for users of that site.
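To make that flow easier to follow, here is a purely conceptual sketch in Python of what a single dispatch pass does. It is not the actual PHP implementation in dispatchChanges.php, and every function name below is invented for illustration.

```python
# Conceptual sketch of one dispatch pass (NOT the real dispatchChanges.php,
# which is a PHP maintenance script; all names below are illustrative only).

def dispatch_pass(repo, clients):
    for client in clients:
        # Where did dispatching get to last time for this client wiki? (hypothetical)
        last_seen = repo.get_dispatch_position(client)
        # Which entity changes have happened on the repo since then? (hypothetical)
        changes = repo.get_changes_since(last_seen)
        # Keep only changes to entities this client is subscribed to,
        # based on the subscription metadata tables. (hypothetical)
        relevant = [c for c in changes if client.is_subscribed_to(c.entity_id)]
        if not relevant:
            continue
        # Queue a job on the client wiki; the job purges affected pages and
        # injects the changes into the client's recent changes table. (hypothetical)
        client.queue_change_notification_job(relevant)
        repo.set_dispatch_position(client, changes[-1].id)
```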

The Query Service

The Wikidata query service, powered by Blazegraph, listens to a stream of changes happening on wikidata.org. There are two possible modes: polling Special:RecentChanges, or using a Kafka queue of EventLogging data. Whenever an entity changes, the query service requests new turtle data for the entity from Special:EntityData, munges it (does further processing) and adds it to the triple store.
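As an illustration of the "fetch turtle from Special:EntityData" step, here is a small sketch using Python and requests. The real updater is a separate service, so this only mimics the kind of request it makes; the flavor=dump parameter is my assumption about which flavour of the entity data is fetched.

```python
# Sketch: fetch the current RDF (Turtle) for an entity from Special:EntityData,
# the same page the query service updater pulls from when an entity changes.
import requests

entity_id = "Q42"
url = f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.ttl"
# flavor=dump is an assumption about which flavour the updater requests.
turtle = requests.get(url, params={"flavor": "dump"}).text

print(turtle[:500])  # raw Turtle, ready to be munged and loaded into the triple store
```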

Data can also be loaded into the query service from the RDF dumps. More details can be found here.
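Once the data is in the triple store it can be queried with SPARQL through the public endpoint at query.wikidata.org. A small example using Python and requests follows; the query itself (cats, via wdt:P31 / wd:Q146) is just a stock illustration and the result shape is the standard SPARQL JSON results format.

```python
# Minimal SPARQL query against the Wikidata Query Service endpoint.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
query = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .                     # items that are instances of house cat (Q146)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""
resp = requests.get(ENDPOINT,
                    params={"query": query, "format": "json"},
                    headers={"User-Agent": "architecture-overview-example/0.1"})
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row.get("itemLabel", {}).get("value"))
```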

Data Dumps

Wikidata data is dumped in a variety of formats using a couple of different PHP-based dump scripts.

More can be read about this here.
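For anyone who wants to poke at the dumps programmatically, here is a hedged sketch of streaming the JSON dump with Python. The file name and the one-entity-per-line layout are my assumptions about the current dump format; check the dumps documentation for the canonical details.

```python
# Sketch: stream the Wikidata JSON dump and iterate over entities without
# loading the whole multi-GB file into memory. The path and the
# one-entity-per-line layout are assumptions; see the dumps docs for details.
import gzip
import json

DUMP_PATH = "latest-all.json.gz"  # e.g. downloaded from dumps.wikimedia.org/wikidatawiki/entities/

with gzip.open(DUMP_PATH, "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(",")   # entities are assumed to be one per line, comma-terminated
        if not line or line in ("[", "]"):
            continue                      # skip the enclosing JSON array brackets
        entity = json.loads(line)
        print(entity["id"], entity.get("labels", {}).get("en", {}).get("value"))
        break                             # remove this to process the whole dump
```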