Wikibase and reconciliation

Over the years I have created a few little side projects, as well as worked on other folks’ Wikibases, and of course Wikidata. And the one thing that I still wish worked better out of the box is reconciliation.

What is reconciliation

In the context of Wikibase, reconciliation refers to the process of matching or aligning external data sources with items in a Wikibase instance. It involves comparing the data from external sources with the existing data in Wikibase to identify potential matches or associations.

The reconciliation process typically follows these steps:

  1. Data Source Identification: Identify and select the external data sources that you want to reconcile with your Wikibase instance. These sources can include databases, spreadsheets, APIs, or other structured datasets.
  2. Data Comparison: Compare the data from the external sources with the existing data in your Wikibase. This step involves matching the relevant attributes or properties of the external data with the corresponding properties in Wikibase.
  3. Record Matching: Determine the level of similarity or matching criteria to identify potential matches between the external data and items in Wikibase. This can include exact matches, fuzzy matching, or other techniques based on specific properties or identifiers (a code sketch of this matching step follows the list).
  4. Reconciliation Workflow: Develop a workflow or set of rules to reconcile the identified potential matches. This may involve manual review and confirmation or automated processes to validate the matches based on predefined criteria.
  5. Data Integration: Once the matches are confirmed, integrate the reconciled data from the external sources into your Wikibase instance. This may include creating new items, updating existing items, or adding additional statements or qualifiers to enrich the data.
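
To make steps 2 and 3 concrete, below is a minimal Python sketch of label-based matching against a Wikibase search API. The wbsearchentities module is standard Wikibase, but the 0.85 threshold and the use of difflib for fuzzy scoring are illustrative assumptions you would tune for real data.

    # Minimal sketch of steps 2-3: compare an external label against a
    # Wikibase and keep candidates above a (hypothetical) score threshold.
    import requests
    from difflib import SequenceMatcher

    API = "https://www.wikidata.org/w/api.php"  # swap in your own Wikibase's api.php

    def candidate_matches(external_label, language="en", threshold=0.85):
        response = requests.get(API, params={
            "action": "wbsearchentities",  # standard Wikibase search module
            "search": external_label,
            "language": language,
            "type": "item",
            "format": "json",
        })
        response.raise_for_status()
        matches = []
        for result in response.json().get("search", []):
            # Fuzzy similarity between the external label and the item label.
            score = SequenceMatcher(None, external_label.lower(),
                                    result.get("label", "").lower()).ratio()
            if score >= threshold:
                matches.append((result["id"], result.get("label"), score))
        return sorted(matches, key=lambda m: m[2], reverse=True)

    print(candidate_matches("Three Gorges Dam"))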

Reconciliation plays a crucial role in data integration, data quality enhancement, and ensuring consistency between external data sources and the data stored in Wikibase. It enables users to leverage external data while maintaining control over data accuracy, completeness, and alignment with their knowledge base.

Existing reconciliation

One of my favourite ways to reconcile data for Wikidata is with OpenRefine. I have two previous posts looking at my first time using it, and a follow-up, both of which take a look at the reconciliation interface (you can also read the docs).
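
Under the hood, OpenRefine speaks a reconciliation service API. A rough sketch of a single call is below; the wikidata.reconci.link endpoint is the one commonly configured for Wikidata, but treat the URL and the exact response fields as assumptions that may change over time.

    # Sketch of a reconciliation service call in the style OpenRefine uses.
    # The endpoint URL is an assumption; point it at whichever service you use.
    import json
    import requests

    ENDPOINT = "https://wikidata.reconci.link/en/api"

    queries = {
        # Ask for items matching a label, constrained to type dam (Q12323).
        "q0": {"query": "Three Gorges Dam", "type": "Q12323"},
    }
    response = requests.post(ENDPOINT, data={"queries": json.dumps(queries)})
    response.raise_for_status()
    for key, result in response.json().items():
        for candidate in result["result"]:
            print(key, candidate["id"], candidate["name"],
                  candidate["score"], candidate["match"])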

Read more

A first Wikidata query service JNL file for public use

This entry is part 2 of 3 in the series Your own Wikidata Query Service

Back in 2019 I wrote a blog post called Your own Wikidata Query Service, with no limits, which documented loading a Wikidata TTL dump into your own Blazegraph instance running within Google Cloud, a nearly two-week process.

I ended that post speculating that part 2 might be using a “pre-generated Blazegraph journal file to deploy a fully loaded Wikidata query service in a matter of minutes”. This post should take us a step closer to that eventuality.

Wikidata Production

There are many production Wikidata query service instances, all kept up to date with Wikidata, and all powered by Blazegraph using open source code that anyone can use.

Per the wikitech documentation, there are currently at least 17 Wikidata query service backends:

  • public cluster, eqiad: wdqs1004, wdqs1005, wdqs1006, wdqs1007, wdqs1012, wdqs1013
  • public cluster, codfw: wdqs2001, wdqs2002, wdqs2003, wdqs2004, wdqs2007
  • internal cluster, eqiad: wdqs1003, wdqs1008, wdqs1011
  • internal cluster, codfw: wdqs2005, wdqs2006, wdqs2008

These servers all have hardware specs that look something like dual Intel(R) Xeon(R) E5-2620 v3 CPUs, 1.6TB of raw RAIDed SSD space, and 128GB RAM.

When you run a query it may end up in any one of the backends powering the public clusters.

All of these servers therefore hold an up-to-date JNL file full of Wikidata data that anyone wanting to set up their own Blazegraph instance could use. This file is currently 1.1TB.

So let’s try and get that out of the cluster for folks to use, rather than having people rebuild their own JNL files.
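
To show what you would do with the file once you have it: point a Blazegraph instance at the JNL and it serves SPARQL immediately. Here is a minimal sketch, assuming the localhost endpoint path that the WDQS flavour of Blazegraph typically exposes.

    # Query a local Blazegraph instance loaded from the JNL file. The
    # endpoint path below is the usual WDQS default; adjust it for your setup.
    import requests

    ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"

    query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

    response = requests.get(ENDPOINT, params={"query": query},
                            headers={"Accept": "application/sparql-results+json"})
    response.raise_for_status()
    for row in response.json()["results"]["bindings"]:
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])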

Read more

Wikibase: a history

I have had the pleasure of being part of the Wikibase journey one way or another since 2013, when I first joined Wikimedia Germany to work on Wikidata. That long-running relationship with the project should put me in a fairly good position to give a high-level overview of its history, from both a technical and a higher-level perspective. So here it goes.

For those that don’t know, Wikibase is the software that powers wikidata.org and a growing number of other sites. If you want to know more, read about it on Wikipedia or the Wikibase website.

For this reason, a lot of the early timeline is quite heavy on the Wikidata side. There are certainly some key points missing; if you think they are worth mentioning, leave a comment or reach out!

Read more

WBStack Infrastructure

This entry is part 7 of 12 in the series WBStack

WBStack is a platform allowing shared scalable hosting of Wikibase and surrounding services.

A year ago I made an initial post covering the state of WBStack infrastructure. Since then some things have changed, and I have also had more time to create a clearer diagram. So it is time for the 2021 edition.

WBStack currently runs on a single Google Cloud Kubernetes Engine cluster, now made up of 3 virtual machines, one e2-standard-2 and two e2-medium. This results in roughly 4 vCPUs and 12GB of allocatable memory.

Read more

How can I get data on all the dams in the world? Use Wikidata

During my first week at Newspeak House, while explaining Wikidata and Wikibase to some folks on the terrace, the topic of dams came up in a discussion of an old project that someone had worked on. Back in the day, collecting information about dams would have been quite an effort, compiling a bunch of different data from different sources to try to get a complete worldwide view on the topic. Perhaps it is easier with Wikidata now?

Below is a very brief walkthrough of topic discovery and exploration using various Wikidata features and the SPARQL query service.

A typical known dam

In order to get an idea of the data space for the topic within Wikidata, I start with a dam that I know about already: the Three Gorges Dam (Q12514). Using this example I can see how dams are typically described.
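
From that starting item, a natural next step is to ask the query service for everything typed the same way. A minimal sketch, assuming dams are reachable as instances of (P31) dam (Q12323), possibly via subclasses (P279):

    # List a few items that are an instance of (P31) dam (Q12323),
    # following subclass of (P279) so specialised dam types are included.
    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    query = """
    SELECT ?dam ?damLabel WHERE {
      ?dam wdt:P31/wdt:P279* wd:Q12323 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10
    """

    response = requests.get(ENDPOINT, params={"query": query, "format": "json"})
    response.raise_for_status()
    for row in response.json()["results"]["bindings"]:
        print(row["dam"]["value"], row["damLabel"]["value"])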

Read more

WBStack Infrastructure (2020)

This entry is part 3 of 12 in the series WBStack

UPDATE: You can find an up-to-date 2021 version of this post here.

WBStack currently runs on a Google Cloud Kubernetes cluster made up of 2 virtual machines, one e2-medium and one e2-standard-2. This adds up to a current total of 4 vCPUs and 12GB of memory. No Google-specific services make up any part of the core platform at this stage, meaning WBStack can run wherever there is a Kubernetes cluster with little to no modification.

A simplified overview of the internals can be seen in the diagram below, where blue represents the Google-provided services and green represents everything running within the Kubernetes cluster.

Read more

WBStack – November review

This entry is part 2 of 12 in the series WBStack

It’s been roughly one month since WBStack appeared online, and it’s time for a quick review of what has been happening. If you don’t already know what WBStack is, then head to my introduction post.

The number of users and wikis has slowly been increasing. In my last post I stated “20 users on the project with 30 Wikibase installs”. Three weeks after that post, WBStack now sits at roughly 38 users with roughly 65 wikibases. Many of these wikibases are primarily users’ test wikis, but that’s great; the barrier to trying out Wikibase is definitely lowered.

If you would like an invite code to try WBStack, or have any related thoughts or ideas, then please get in touch.

What’s changed

As WBStack is a shared platform, all changes mentioned in this blog post are immediately visible on all hosted Wikibases. In the future there will be various options to turn things on and off, but at this early stage things are being kept simple.

Read more