A first Wikidata query service JNL file for public use
Back in 2019 I wrote a blog post called Your own Wikidata Query Service, with no limits which documented loading a Wikidata TTL dump into your own Blazegraph instance running within Google cloud, a near 2 week process. I ended that post speculating that part 2 might be using a “pre-generated Blazegraph journal file to…
Google Cloud Storage upload with s3cmd
I recently had to upload a large (1TB) file from a Wikimedia server into a Google Cloud Storage bucket. gsutil was not available to me, but s3cmd was. So here is a little how to post for uploading to a Google Cloud Storage bucket using the S3 API and s3cmd. S3cmd is a free command line…
WBStack Infrastructure
WBStack is a platform allowing shared scalable hosting of Wikibase and surrounding services. A year ago I made an initial post covering the state of WBStack infrastructure. Since then some things have changed, and I have also had more time to create a clear diagram. So it is time for the 2021 edition. WBStack currently…
Testing WDQS Blazegraph data load performance
Toward the end of 2020 I spent some time blackbox testing data load times for WDQS and Blazegraph to try and find out which possible setting tweaks might make things faster. I didn’t come to any major conclusions as part of this effort but will write up the approach and data nonetheless incase it is…
Creating a new replica after purging binlogs with bitnami mariadb docker images
I have been using the bitnami mariadb docker images and helmfiles for just over a year now in a personal project (wbstack). I have 1 master and 1 replica setup in a cluster serving all of my SQL needs. As the project grew disk space became pressing and from an early time I has to…
Faster munging for the Wikidata Query Service using Hadoop
The Wikidata query service is a public SPARQL endpoint for querying all of the data contained within Wikidata. In a previous blog post I walked through how to set up a complete copy of this query service. One of the steps in this process is the munge step. This performs some pre-processing on the RDF…
Automatic cleanup of old gcloud container images
I have been using Google Cloud Build for a budget project for roughly a year now. Cloud Build stores built images in a storage bucket which you are of course billed for. Within the first weeks of using it I realized that I needed some automated way to cleanup unused and old images that were…
WBStack Infrastructure (2020)
UPDATE: You can find an up to date 2021 version of this post here. WBStack currently runs on a Google Cloud Kubernetes cluster made up of 2 virtual machines, one e2-medium and one e2-standard-2. This adds up to a current total of 4 vCPUs and 12GB of memory. No Google specific services make up any…
Your own Wikidata Query Service, with no limits
The Wikidata Query Service allows anyone to use SPARQL to query the continuously evolving data contained within the Wikidata project, currently standing at nearly 65 millions data items (concepts) and over 7000 properties, which translates to roughly 8.4 billion triples. You can find a great write up introducing SPARQL, Wikidata, the query service and what…