Addshore


Your own Wikidata Query Service, with no limits (part 1)

The Wikidata Query Service allows anyone to use SPARQL to query the continuously evolving data contained within the Wikidata project, which currently holds nearly 65 million data items (concepts) and over 7,000 properties, translating to roughly 8.4 billion triples.

Screenshot of the Wikidata Query Service home page including the example query which returns all Cats on Wikidata.

You can find a great write-up introducing SPARQL, Wikidata, the query service and what it can do here. This post, however, will assume that you already know all of that.

Guide

Here we will focus on creating a copy of the query service using data from one of the regular TTL data dumps and the query service Docker image provided by the wikibase-docker Git repository maintained by WMDE.

Somewhere to work

First, we need a machine to hold the data and do the necessary processing. This blog post will use an “n1-highmem-16” (16 vCPUs, 104 GB memory) virtual machine on the Google Cloud Platform with 3 local SSDs combined in a RAID 0 array.

This should provide us with enough fast storage to hold the raw TTL data, the munged TTL files (where extra triples are added), as well as the journal (JNL) file that the Blazegraph query service uses to store its data.

This entire guide will work on any instance size with more than ~4 GB of memory and adequate disk space of any speed. For me, this setup seemed to be the fastest for the entire process.

If you are using the cloud console UI to create an instance then you can use the following options:

  • Select Machine type n1-highmem-16 (16 vCPU, 104 GB memory)
  • Boot disk: Google Drawfork Debian GNU/Linux 9
  • Firewall (optional): Allow HTTP and HTTPS traffic if you want to access your query service copy externally
  • Add disk: Local SSD scratch disk, NVMe (select 3 of them)

If you are creating the instance via the command line your command will look something like the one below. All other defaults should be fine as long as the user you are using can create instances, and you are in the correct project and region.
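A minimal sketch of such a command, matching the options above; the instance name is illustrative and the zone is an assumption (pick whichever is closest to you):

```shell
gcloud compute instances create wdqs-loader \
    --zone=us-east4-c \
    --machine-type=n1-highmem-16 \
    --image-family=debian-9 \
    --image-project=debian-cloud \
    --local-ssd=interface=NVME \
    --local-ssd=interface=NVME \
    --local-ssd=interface=NVME
```

Each repeated --local-ssd flag attaches one NVMe scratch disk, giving us the three disks we will combine with RAID below.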

Initial setup

Setting up my SSDs in RAID

As explained above, I will be using some SSDs that need to be set up in a RAID array, so I’ll start with that, following the docs provided by GCE. If you have no need for RAID then skip this step.
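Following the GCE local-SSD docs, the setup looks roughly like this. The NVMe device names and the mount point are assumptions — check yours with `lsblk` first:

```shell
# Install the software RAID tools
sudo apt-get update && sudo apt-get install -y mdadm

# Combine the three NVMe local SSDs into a single RAID 0 array
sudo mdadm --create /dev/md0 --level=0 --raid-devices=3 \
    /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3

# Format the array and mount it where we will keep all of our data
sudo mkfs.ext4 -F /dev/md0
sudo mkdir -p /mnt/disks/ssd0
sudo mount /dev/md0 /mnt/disks/ssd0
sudo chmod a+w /mnt/disks/ssd0
```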

Packages & User

Docker will be needed to run the wdqs Docker image, so follow the official installation instructions and install it.

This process will require some long-running scripts, so it’s best to install tmux so that we don’t lose them if our SSH session drops.

Now let’s create a user in the docker group to run our remaining commands as. We will also use this user’s home directory for file storage.
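Something like the following should do; the username is a hypothetical choice:

```shell
# Install tmux for keeping long-running scripts alive
sudo apt-get install -y tmux

# Create a user that can talk to the Docker daemon, and switch to it
sudo useradd -m -s /bin/bash -G docker wdqs
sudo su - wdqs
```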

Then start a tmux session with the following command
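For example, a named session that we can later get back to with `tmux attach -t wdqs`:

```shell
tmux new -s wdqs
```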

And download a copy of the wdqs Docker image from Docker Hub.

Here we use version 0.3.6, but this guide should work just the same for future versions (though perhaps not for 0.4.x+ when that comes out).
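Assuming the wikibase/wdqs image name from the wikibase-docker repo, the pull is simply:

```shell
docker pull wikibase/wdqs:0.3.6
```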

Download the TTL dump

The main location to find the Wikidata dumps is https://dumps.wikimedia.org, partially buried in the wikidatawiki/entities/ directory. Unfortunately downloads from there are speed-restricted, so you are better off using one of the mirrors.

I found that the your.org mirror was the fastest from the us-east4-c GCE region. To download, simply run the following and wait. (At the end of 2018 this was a 47 GB download.)

First, make sure we are in the directory backed by the local SSD storage that we mounted above.

And download the latest TTL dump file which can take around 30 minutes to an hour.
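Assuming the RAID array is mounted at /mnt/disks/ssd0 and that the your.org mirror follows the same directory layout as dumps.wikimedia.org (both assumptions), this looks something like:

```shell
# Work from the fast local SSD storage
cd /mnt/disks/ssd0

# Fetch the latest full TTL dump from the your.org mirror
wget https://dumps.wikimedia.your.org/wikidatawiki/entities/latest-all.ttl.gz
```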

Munge the data

Munging the data involves running a script that is packaged with the query service over the TTL dump we have just downloaded. We will use Docker and the wdqs Docker image to do this, starting a new container running bash.

And then run the munge script within the container’s bash shell. This will take around 20 hours and will create many files.

The -c option was introduced for use in this blog post and allows the chunk size to be selected. If the chunk size is too big you may run into import issues. 50000 is half of the default.
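Those two steps might look like this. The host mount path, the container mount point and the script’s location inside the image are all assumptions here; the -f (input file), -d (output directory) and -c (chunk size) flags are those of the munge.sh script shipped with the service:

```shell
# Start a bash shell in a wdqs container with our data directory mounted
docker run --rm -it --entrypoint=/bin/bash \
    -v /mnt/disks/ssd0:/stuff wikibase/wdqs:0.3.6

# Inside the container: munge the dump into chunked TTL files
./munge.sh -f /stuff/latest-all.ttl.gz -d /stuff/munged -c 50000
```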

Once the munge has completed you can exit the container and return to the virtual machine.

Run the service

In order to populate the service we first need to run it using the command below. This mounts the directory containing the munged data, as well as the directory where the service’s JNL file will be stored. It also exposes the service on port 9999, which will be writable, so if you don’t want other people to access your copy, check your firewall rules. Don’t worry if you don’t have a dockerData directory; it will be created when you run the command.
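A sketch of such a command — the host paths and the container-side mount points are assumptions (check the wikibase-docker docs for the paths your image version expects):

```shell
docker run -d --name wdqs \
    -p 9999:9999 \
    -v /mnt/disks/ssd0/munged:/munged \
    -v /mnt/disks/ssd0/dockerData:/wdqs/data \
    wikibase/wdqs:0.3.6
```

The -d flag keeps Blazegraph running in the background so we can attach to the same container later to load the data.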

You can verify that the service is running with curl.
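For example, by sending a trivial URL-encoded query to the standard WDQS SPARQL endpoint path:

```shell
curl -s 'http://localhost:9999/bigdata/namespace/wdq/sparql?query=SELECT%20%2A%20WHERE%20%7B%7D%20LIMIT%201'
```

Any well-formed XML or JSON response (rather than a connection error) means Blazegraph is up.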

Load the data

Using the ID of the container that is now running, which you can find by running “docker ps”, start a new bash shell alongside the service.

Now that we are in the container running our query service we can load the data using the loadData script and the previously munged files.
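Assuming the munged files are visible inside the container at /munged, and that loadData.sh takes -n (namespace) and -d (directory) flags as in the scripts shipped with the service, the two steps look roughly like:

```shell
# Attach a shell to the running service container (ID from "docker ps")
docker exec -it <container-id> bash

# Inside the container: load every munged chunk into the wdq namespace
./loadData.sh -n wdq -d /munged
```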

Once the data appears to be loaded, you can try out a shorter version of the cats query from the service examples.
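The shortened cats query would be roughly the following, where wd:Q146 is the Wikidata item for “house cat” and wdt:P31 is the “instance of” property:

```sparql
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q146 .
}
```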

Timings

Loading data from a totally fresh TTL dump into a blank query service is currently not a quick task. In production (wikidata.org) it takes roughly a week, and I had a similar experience while trying to streamline the process as best I could on GCE.

For a dump taken at the end of 2018 the timings for each stage were as follows:

  • Data dump download: 2 hours
  • Data Munge: 20 hours
  • Data load: 4.5 days
  • Total time: ~5.5 days

Various parts of the process lead me to believe that this could be done faster, as throughout the run CPU usage was pretty low and not all memory was utilized. Loading the data into Blazegraph was by far the slowest step, but digging into this would require someone more familiar with Blazegraph’s internals.

Blog post series

This blog post, and future blog posts on the topic of running a Wikidata query service on Google cloud, are supported by Google and written by WMDE.

Part 2 will use the docker images and a pre generated Blazegraph journal file to deploy a fully loaded Wikidata query service in a matter of minutes.

2 Comments

  1. Your loading wasn’t maximizing your CPU or Memory because you forgot to add the following into the loadData.sh file and configure properly:

    com.bigdata.rdf.store.DataLoader.flush=false
    com.bigdata.rdf.store.DataLoader.bufferCapacity=100000
    com.bigdata.rdf.store.DataLoader.queueCapacity=10

    Read more here: https://wiki.blazegraph.com/wiki/index.php/Bulk_Data_Load#Configuring

    Also you can even tweak the database instance:

    /opt/RWStore.properties

    See more options here: https://wiki.blazegraph.com/wiki/index.php/Configuring_Blazegraph
    and here: https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization#Batch_is_beautiful

    • Thanks!
      I knew that I was missing things but really needed to get this long overdue post up.
      I look forward trying it out with some alternate settings!
      I’ll be sure to follow up here when I do.


© 2019 Addshore
