WBStack is a platform allowing shared scalable hosting of Wikibase and surrounding services.
A year ago I made an initial post covering the state of WBStack infrastructure. Since then some things have changed, and I have also had more time to create a clear diagram. So it is time for the 2021 edition.
Toward the end of 2020 I spent some time blackbox testing data load times for WDQS and Blazegraph to try and find out which possible setting tweaks might make things faster.
I didn’t come to any major conclusions as part of this effort but will write up the approach and data nonetheless incase it is useful for others.
I expect the next step toward trying to make this go faster would be via some whitebox testing, consulting with some of the original developers or with people that have taken a deep dive into the code (which I started but didn’t complete).
I have been using the bitnami mariadb docker images and helmfiles for just over a year now in a personal project (wbstack). I have 1 master and 1 replica setup in a cluster serving all of my SQL needs. As the project grew disk space became pressing and from an early time I has to start automatically purging the bin logs setting expire_logs_days to 14. This meant that I could no longer easily scale up the cluster, as new replicas would not be able to entirely build themselves.
The walkthrough was performed on a Google Kubernetes Engine cluster using the 7.3.16 bitnami/mariadb helm charts which contain the 10.3.22-debian-10-r92 bitnami/mariadb docker image. So if you are using something newer expect some differences, but in principle it should all work the same.
The Wikidata query service is a public SPARQL endpoint for querying all of the data contained within Wikidata. In a previous blog post I walked through how to set up a complete copy of this query service. One of the steps in this process is the munge step. This performs some pre-processing on the RDF dump that comes directly from Wikidata.
This post walks through using the new Hadoop based munge step with the latest Wikidata TTL dump on Google cloudsDataproc service. This cuts the munge time down from 1-2 days to just 2 hours using an 8 worker cluster. Even faster times can be expected with more workers, all the way down to ~20 minutes.
I have been using Google Cloud Build for a budget project for roughly a year now. Cloud Build stores built images in a storage bucket which you are of course billed for. Within the first weeks of using it I realized that I needed some automated way to cleanup unused and old images that were built there.
At the time I had a quick search around on the web for something already implemented that I could copy, but I came up blank, and decided putting my problem off would be the best solution. I filed issue number 6 for my project and left it for future me.
Now it’s time to finally close that issue, and I hope others might also find the small bash script useful.
UPDATE: You can find an up to date 2021 version of this post here.
WBStack currently runs on a Google Cloud Kubernetes cluster made up of 2 virtual machines, one e2-medium and one e2-standard-2. This adds up to a current total of 4 vCPUs and 12GB of memory. No Google specific services make up any part of the core platform at this stage meaning WBStack can run wherever there is a Kubernetes cluster with little to no modification.
A simplified overview of the internals can be seen in the diagram below where blue represents the Google provided services, with green representing everything running within the kubernetes cluster.