A first Wikidata query service JNL file for public use

This entry is part 2 of 3 in the series Your own Wikidata Query Service

Back in 2019 I wrote a blog post called Your own Wikidata Query Service, with no limits, which documented loading a Wikidata TTL dump into your own Blazegraph instance running within Google Cloud, a nearly 2-week process.

I ended that post speculating that part 2 might be using a “pre-generated Blazegraph journal file to deploy a fully loaded Wikidata query service in a matter of minutes”. This post should take us a step closer to that eventuality.

Wikidata Production

There are many production Wikidata query service instances, all kept up to date with Wikidata, and all powered by open source code that anyone can use, built around Blazegraph.

Per wikitech documentation there are currently at least 17 Wikidata query service backends:

  • public cluster, eqiad: wdqs1004, wdqs1005, wdqs1006, wdqs1007, wdqs1012, wdqs1013
  • public cluster, codfw: wdqs2001, wdqs2002, wdqs2003, wdqs2004, wdqs2007
  • internal cluster, eqiad: wdqs1003, wdqs1008, wdqs1011
  • internal cluster, codfw: wdqs2005, wdqs2006, wdqs2008

These servers all have hardware specs that look something like dual Intel(R) Xeon(R) E5-2620 v3 CPUs, 1.6TB of raw raided SSD space, and 128GB of RAM.

When you run a query, it may end up on any one of the backends powering the public clusters.

Each of these servers therefore has an up-to-date JNL file full of Wikidata data that anyone wanting to set up their own Blazegraph instance with Wikidata data could use. This file is currently 1.1TB.

So let’s try to get that file out of the cluster for folks to use, rather than having people rebuild their own JNL files.
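
To sketch why this matters: once a copy of that JNL file is available to download, standing up your own query service should mostly be a matter of pointing Blazegraph at it, rather than spending weeks loading a dump. Purely as an illustration (the download URL is hypothetical; the property name is the one used in the RWStore.properties file that ships with the query service package):

  # Fetch the pre-built journal file (hypothetical URL; at ~1.1TB this still takes a while)
  wget -O /data/wikidata.jnl https://example.org/wikidata.jnl

  # Point Blazegraph at the existing journal instead of creating an empty one
  sed -i 's|^com.bigdata.journal.AbstractJournal.file=.*|com.bigdata.journal.AbstractJournal.file=/data/wikidata.jnl|' RWStore.properties

  # Start Blazegraph as usual
  ./runBlazegraph.sh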

Read more

Google Cloud Storage upload with s3cmd

I recently had to upload a large (1TB) file from a Wikimedia server into a Google Cloud Storage bucket. gsutil was not available to me, but s3cmd was. So here is a little how-to post for uploading to a Google Cloud Storage bucket using the S3 API and s3cmd.

S3cmd is a free command line tool and client for uploading, retrieving and managing data in Amazon S3 and other cloud storage service providers that use the S3 protocol, such as Google Cloud Storage or DreamHost DreamObjects.

https://s3tools.org/s3cmd

s3cmd was already installed on my system, but if you want to get it, see the download page.
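
The short version of the trick, as a hedged sketch (bucket name and keys below are placeholders): Google Cloud Storage exposes an S3-compatible endpoint at storage.googleapis.com, and HMAC interoperability keys generated in the GCS settings act as the S3 access key and secret key, so s3cmd only needs a small config file to talk to it. A minimal ~/.s3cfg might look like this:

  [default]
  access_key = GOOG1EXAMPLEACCESSKEY
  secret_key = EXAMPLESECRETKEY
  host_base = storage.googleapis.com
  host_bucket = %(bucket)s.storage.googleapis.com

The upload itself is then a single command:

  # Depending on GCS's S3 interoperability support you may need to tweak or
  # disable multipart uploads (e.g. s3cmd's --disable-multipart flag)
  s3cmd put ./bigfile s3://my-example-bucket/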

Read more

WBStack Infrastructure

This entry is part 7 of 12 in the series WBStack

WBStack is a platform allowing shared scalable hosting of Wikibase and surrounding services.

A year ago I made an initial post covering the state of WBStack infrastructure. Since then some things have changed, and I have also had more time to create a clear diagram. So it is time for the 2021 edition.

WBStack currently runs on a single Google Kubernetes Engine cluster, now made up of 3 virtual machines: one e2-standard-2 and two e2-medium. This results in roughly 4 vCPUs and 12GB of allocatable memory.
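
For anyone curious what provisioning a cluster of that shape looks like, here is a rough gcloud sketch. This is not the actual WBStack setup; the cluster name, zone and pool name are made up, and it only shows the two machine types involved:

  # Small zonal cluster with one e2-standard-2 node (illustrative names/zone)
  gcloud container clusters create wbstack-example \
    --zone europe-west1-b \
    --machine-type e2-standard-2 \
    --num-nodes 1

  # Second node pool with two e2-medium nodes
  gcloud container node-pools create medium-pool \
    --cluster wbstack-example \
    --zone europe-west1-b \
    --machine-type e2-medium \
    --num-nodes 2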

Read more

Testing WDQS Blazegraph data load performance

Toward the end of 2020 I spent some time blackbox testing data load times for WDQS and Blazegraph to try to find out which setting tweaks might make things faster.

I didn’t come to any major conclusions as part of this effort, but will write up the approach and data nonetheless in case it is useful for others.

I expect the next step toward making this go faster would be some whitebox testing, consulting with some of the original developers or with people who have taken a deep dive into the code (something I started but didn’t complete).
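
For a flavour of what a blackbox run looks like in shape (not the exact harness or settings used), the idea is simply to load the same munged chunks under different Blazegraph configurations and compare wall-clock times. A minimal sketch, assuming the loadData.sh script from the query service package and munged chunks sitting in ./munged:

  # Time one full load of the munged chunks against the running Blazegraph
  # instance, keeping the output for comparison across configuration tweaks
  { time ./loadData.sh -n wdq -d ./munged ; } 2>&1 | tee load-time-baseline.txt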

Read more

Creating a new replica after purging binlogs with bitnami mariadb docker images

I have been using the bitnami mariadb docker images and helmfiles for just over a year now in a personal project (wbstack). I have a 1 master, 1 replica setup in a cluster serving all of my SQL needs. As the project grew, disk space became pressing, and from early on I had to start automatically purging the binlogs by setting expire_logs_days to 14. This meant that I could no longer easily scale up the cluster, as new replicas would not be able to entirely build themselves.

This blog post walks through the way that I ended up creating a new replica from my master after my replica corrupted itself and I was all out of binlogs. This directly relates to the GitHub issue on the bitnami docker images: https://github.com/bitnami/bitnami-docker-mariadb/issues/177

The walkthrough was performed on a Google Kubernetes Engine cluster using the 7.3.16 bitnami/mariadb helm chart, which contains the 10.3.22-debian-10-r92 bitnami/mariadb docker image. If you are using something newer, expect some differences, but in principle it should all work the same.
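
One common way to rebuild a replica when the binlogs are gone is to dump the master along with its binlog coordinates and point the new replica at them; the post covers the bitnami- and Kubernetes-specific details on top of that. A hedged sketch, assuming the bitnami images' MARIADB_ROOT_PASSWORD environment variable, with hostnames, users, passwords and binlog coordinates as placeholders:

  # On the master: take a consistent dump that records the binlog coordinates
  mysqldump -u root -p"$MARIADB_ROOT_PASSWORD" \
    --all-databases --single-transaction --master-data=2 --routines --triggers \
    > master-dump.sql

  # On the fresh replica: import the dump
  mysql -u root -p"$MARIADB_ROOT_PASSWORD" < master-dump.sql

  # The dump contains a commented-out CHANGE MASTER line with the coordinates;
  # point the replica at the master using those values and start replication
  mysql -u root -p"$MARIADB_ROOT_PASSWORD" -e "
    CHANGE MASTER TO
      MASTER_HOST='my-mariadb-master',
      MASTER_USER='replicator',
      MASTER_PASSWORD='replicator-password',
      MASTER_LOG_FILE='mysql-bin.000123',
      MASTER_LOG_POS=456;
    START SLAVE;"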

Read more

Faster munging for the Wikidata Query Service using Hadoop

The Wikidata query service is a public SPARQL endpoint for querying all of the data contained within Wikidata. In a previous blog post I walked through how to set up a complete copy of this query service. One of the steps in this process is the munge step. This performs some pre-processing on the RDF dump that comes directly from Wikidata.

Back in 2019 this step took 20 hours, and it now takes somewhere between 1 and 2 days as Wikidata has continued to grow. The original munge step (munge.sh) makes use of only a single CPU. The WMF has been experimenting for some time with performing this step in their Hadoop cluster as part of their modern update mechanism (the streaming updater). An additional patch has now also made this useful for the current default load process (using loadData.sh).

This post walks through using the new Hadoop-based munge step with the latest Wikidata TTL dump on Google Cloud's Dataproc service. This cuts the munge time down from 1-2 days to just 2 hours using an 8 worker cluster. Even faster times can be expected with more workers, all the way down to ~20 minutes.
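
To give an idea of the moving parts on the Dataproc side, here is a sketch only; the cluster name, region, machine type, jar location, main class and job arguments are all placeholders, and the post itself walks through the real values:

  # Spin up an 8-worker Dataproc cluster
  gcloud dataproc clusters create munge-cluster \
    --region europe-west1 \
    --num-workers 8 \
    --worker-machine-type n1-standard-8

  # Submit the Spark-based munge job built from the wikidata-query-rdf repo
  # (the jar path and main class below are placeholders)
  gcloud dataproc jobs submit spark \
    --cluster munge-cluster \
    --region europe-west1 \
    --jars gs://my-bucket/rdf-spark-tools.jar \
    --class org.example.MungeDumpJob \
    -- --input gs://my-bucket/latest-all.ttl.gz --output gs://my-bucket/munged/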

Read more

Automatic cleanup of old gcloud container images

I have been using Google Cloud Build for a budget project for roughly a year now. Cloud Build stores built images in a storage bucket, which you are of course billed for. Within the first weeks of using it I realized that I needed some automated way to clean up unused and old images that were built there.

At the time I had a quick search around on the web for something already implemented that I could copy, but I came up blank, and decided putting my problem off would be the best solution. I filed issue number 6 for my project and left it for future me.

Now it’s time to finally close that issue, and I hope others might also find the small bash script useful.
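
The gist of the approach, as a hedged sketch (image name and cutoff date are placeholders, and this is not necessarily the exact logic of the script in the post):

  # Delete untagged Container Registry images older than a cutoff date
  IMAGE="gcr.io/my-project/my-image"
  CUTOFF="2021-01-01"

  gcloud container images list-tags "$IMAGE" \
    --filter="-tags:* AND timestamp.datetime < '$CUTOFF'" \
    --format='get(digest)' | while read -r digest; do
      gcloud container images delete "$IMAGE@$digest" --quiet
  done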

Read more

WBStack Infrastructure (2020)

This entry is part 3 of 12 in the series WBStack

UPDATE: You can find an up-to-date 2021 version of this post here.

WBStack currently runs on a Google Cloud Kubernetes cluster made up of 2 virtual machines, one e2-medium and one e2-standard-2. This adds up to a current total of 4 vCPUs and 12GB of memory. No Google-specific services make up any part of the core platform at this stage, meaning WBStack can run wherever there is a Kubernetes cluster with little to no modification.

A simplified overview of the internals can be seen in the diagram below, where blue represents the Google provided services and green represents everything running within the Kubernetes cluster.

Read more

Your own Wikidata Query Service, with no limits

This entry is part 1 of 3 in the series Your own Wikidata Query Service

The Wikidata Query Service allows anyone to use SPARQL to query the continuously evolving data contained within the Wikidata project, currently standing at nearly 65 million data items (concepts) and over 7000 properties, which translates to roughly 8.4 billion triples.

Screenshot of the Wikidata Query Service home page including the example query which returns all Cats on Wikidata.

You can find a great write-up introducing SPARQL, Wikidata, the query service and what it can do here. But this post will assume that you already know all of that.

EDIT 2022: You may also want to read https://addshore.com/2022/10/a-wikidata-query-service-jnl-file-for-public-use/

Guide

Here we will focus on creating a copy of the query service using data from one of the regular TTL data dumps and the query service docker image provided by the wikibase-docker git repo, supported by WMDE.
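
As a preview of the overall shape of the process, here is a sketch only; the guide below covers the exact commands, flags and docker image usage, and the script names here are the ones from the query service package that the image wraps:

  # 1. Download the latest Wikidata TTL dump
  wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz

  # 2. Munge the dump into load-ready chunks (optionally filtering labels to
  #    a few languages to shrink the data)
  ./munge.sh -f latest-all.ttl.gz -d ./munged -l en

  # 3. Start Blazegraph, then load the munged chunks into the wdq namespace
  ./runBlazegraph.sh &
  ./loadData.sh -n wdq -d ./munged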

Read more