It's a blog

Tag: Wikidata (Page 1 of 6)

Tech Lead Digest – Q1 2021

At some point last year I started sending a weekly internal digest to the Wikidata Wikibase team with my tech lead hat on.

The emails are internal only, but contain lots of links to reading, podcasts and general goings-on that could be useful to everyone.

So here is my first Wikidata Wikibase tech lead digest digest!

🧑‍🤝‍🧑 Wikidata & Wikibase

Continue reading

WBStack setting changes, Federated properties, Wikidata entity mapping & more

During the first 3 months of 2021, some Wikimedia Deutschland engineers from the Wikidata / Wikibase team spent some time working on WBStack, as part of an effort to explore the WBaaS (Wikibase as a Service) topic during the year, as outlined in the development plan.

We want to make it easier for non-Wikimedia projects to set up Wikibase for the first time and to evaluate the viability of Wikibase as a Service.

Wikibase 2021 Development plan

This has led to a few new Wikibase features being exposed through the WBStack dashboard for sites that run on the platform. These are primarily features developed by the Wikibase team in 2020 and 2021. The work also brought some other quality-of-life improvements to the settings pages.

Here is a quick rundown of what’s new and improved.

Continue reading

Twitter bot powered by Github Actions (WikidataMeter)

Recently, two new Twitter bots appeared in my feed, fullyjabbed & fullyjabbedUK, created by iamdanw and powered entirely by Github Actions (code).

I have been thinking about writing a Twitter bot for some time and decided to copy this pattern, running a cron-based Twitter bot on Github Actions, with an added bit of free persistence using jsonstorage.net.

This post is my quick walkthrough of my new bot, WikidataMeter: what it does and how it works. You can find the code as it was when writing this blog post here, and the current version here.
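
As a taste of the pattern, here is a minimal Python sketch of this kind of counter bot (not the actual WikidataMeter code): a scheduled job fetches a current statistic from Wikidata, compares it with the value saved on the previous run, and would then tweet the difference. The jsonstorage.net bin URL is a placeholder, and the tweet is printed rather than posted.

```python
# Minimal sketch of a cron-driven counter bot, not the actual WikidataMeter code.
# STORAGE_URL is a placeholder jsonstorage.net bin; the tweet is printed, not posted.
import requests

STORAGE_URL = "https://api.jsonstorage.net/v1/json/<your-bin-id>"  # hypothetical bin


def current_item_count() -> int:
    """Fetch the current number of content pages (roughly, items) on Wikidata."""
    r = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "query", "meta": "siteinfo", "siprop": "statistics", "format": "json"},
        headers={"User-Agent": "WikidataMeter-sketch/0.1"},
    )
    r.raise_for_status()
    return r.json()["query"]["statistics"]["articles"]


def main() -> None:
    previous = requests.get(STORAGE_URL).json().get("count", 0)  # free persistence
    current = current_item_count()
    tweet = f"Wikidata now has {current:,} items ({current - previous:+,} since the last run)"
    print(tweet)  # a real bot would post this via the Twitter API here
    requests.put(STORAGE_URL, json={"count": current})  # save state for the next cron run


if __name__ == "__main__":
    main()
```

On Github Actions the scheduling itself is then just a cron trigger in the workflow file running a script like this on a timer.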

Continue reading

Testing WDQS Blazegraph data load performance

Toward the end of 2020 I spent some time blackbox-testing data load times for WDQS and Blazegraph, to find out which setting tweaks might make things faster.

I didn’t come to any major conclusions as part of this effort, but will write up the approach and data nonetheless, in case it is useful for others.

I expect the next step toward making this faster would be some whitebox testing, consulting some of the original developers, or talking to people who have taken a deep dive into the code (which I started but didn’t complete).

Continue reading

Faster munging for the Wikidata Query Service using Hadoop

The Wikidata query service is a public SPARQL endpoint for querying all of the data contained within Wikidata. In a previous blog post I walked through how to set up a complete copy of this query service. One of the steps in this process is the munge step. This performs some pre-processing on the RDF dump that comes directly from Wikidata.

Back in 2019 this step took 20 hours; it now takes somewhere between 1 and 2 days, as Wikidata has continued to grow. The original munge step (munge.sh) makes use of only a single CPU. The WMF has been experimenting for some time with performing this step in their Hadoop cluster as part of their modern update mechanism (the streaming updater). An additional patch has now also made this useful for the current default load process (using loadData.sh).

This post walks through using the new Hadoop-based munge step with the latest Wikidata TTL dump on Google Cloud's Dataproc service. This cuts the munge time down from 1–2 days to just 2 hours using an 8-worker cluster. Even faster times can be expected with more workers, all the way down to ~20 minutes.

Continue reading

How can I get data on all the dams in the world? Use Wikidata

During my first week at Newspeak House, while explaining Wikidata and Wikibase to some folks on the terrace, the topic of dams came up in a discussion about an old project that someone had worked on. Back in the day, collecting information about dams would have been quite an effort: compiling a bunch of different data from different sources to try to get a complete worldwide view on the topic. Perhaps it is easier with Wikidata now?

Below is a very brief walkthrough of topic discovery and exploration using various Wikidata features and the SPARQL query service.

A typical known Dam

To get an idea of the data space for the topic within Wikidata, I start with a dam that I already know about: the Three Gorges Dam (Q12514). Using this example I can see how dams are typically described.
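
To go from that one known example to the wider data space, a query along these lines pulls back dams with coordinates from the query service. This is just a sketch: it assumes the class item for "dam" is Q12323, which you would want to confirm against what Q12514 actually uses; P31 (instance of) and P625 (coordinate location) are the usual properties.

```python
# Rough exploratory query: items that are an instance of "dam" with a coordinate
# location. Q12323 ("dam") is an assumption; check the class used on Q12514 first.
import requests

QUERY = """
SELECT ?dam ?damLabel ?coord WHERE {
  ?dam wdt:P31 wd:Q12323 .   # instance of: dam (assumed class item)
  ?dam wdt:P625 ?coord .     # coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "dam-exploration-sketch/0.1"},
)
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    print(row["damLabel"]["value"], "->", row["coord"]["value"])
```

Dropping the LIMIT and pulling in more properties (country P17, for example) is then all it takes to start building that worldwide view.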

Continue reading

Creating new Wikidata items with OpenRefine and Quickstatements

Following on from my blog post about using OpenRefine for the first time, I continued my journey to fill Wikidata with all of the tors on Dartmoor.

This post assumes you already have some knowledge of Wikidata and Quickstatements, and have OpenRefine set up.

Note: If you are having problems with the reconciliation service it might be worth giving this mailing list post a read!

Getting some data

I searched around for a while, looking at various lists of tors on Dartmoor. Slowly I compiled a list from a variety of sources into a Google Sheet, which seemed to be quite complete. This list included some initial names and rough OS map grid coordinates (P613).

To load the data into OpenRefine I exported the sheet as a CSV and dragged it into OpenRefine, using the same process as detailed in my previous post.
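
For a sense of where this workflow ends up, here is a small sketch that skips OpenRefine entirely and just turns CSV rows into a Quickstatements v1 CREATE batch. The tor class Q-id is a made-up placeholder and the CSV column names are assumptions; the real post goes via OpenRefine's schema instead.

```python
# Sketch: turn a CSV of tor names and grid references into a Quickstatements v1 batch.
# TOR_CLASS is a placeholder Q-id and the CSV columns (name, grid_ref) are assumed.
import csv

TOR_CLASS = "Q0000000"  # placeholder; look up the real item for "tor" before running


def rows_to_quickstatements(path: str) -> str:
    lines = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            lines.append("CREATE")
            lines.append(f'LAST\tLen\t"{row["name"]}"')       # English label
            lines.append(f"LAST\tP31\t{TOR_CLASS}")           # instance of: tor
            lines.append(f'LAST\tP613\t"{row["grid_ref"]}"')  # OS grid reference
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_quickstatements("tors.csv"))
```

Pasting output in this shape into the Quickstatements tool would create one new item per row, with a label, a class statement and a grid reference.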

Continue reading

Using OpenRefine with Wikidata for the first time

I have long known about OpenRefine (previously Google Refine), a tool for working with data: manipulating and cleaning it. As of version 3.0 (May 2018), OpenRefine includes a Wikidata extension, allowing for reconciliation against Wikidata and also editing of Wikidata directly (as far as I understand it). You can find some documentation on this topic on Wikidata itself.

This post serves as a summary of my initial experiences with OpenRefine, including some very basic reconciliation from a Wikidata Query Service SPARQL query, and making edits on Wikidata.

In order to follow along you should already know a little about what Wikidata is.

Starting OpenRefine

I tried out OpenRefine in two different setups, both of which were easy to get going by following the installation docs: on my actual machine, and in a VM. For the VM I also had to use the -i option to make the service listen on a different IP: refine -i 172.23.111.140

Continue reading

Wikidata Map May – November 2019

It’s time for another blog post in my Wikidata map series, this time comparing the item maps that were generated on the 13th May 2019 and 11th November 2019 (roughly 6 months apart). I’ll again be using Resemble.js to generate a difference image highlighting changed areas in pink, and will break down the areas that have had the greatest change throughout the 6-month period. The full comparison image can be found here.

Differences in the Wikidata map, highlighted in pink, between May 2019 and November 2019

If you don’t know what Wikidata is, or what items are, then give this page a read. This map shows every item that has a “coordinate location” as a light pixel on a black canvas. The more items with coordinates in a single pixel, the brighter that pixel. This map is generated using code that can be found here.
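
The core idea is simple enough to sketch in a few lines of Python (the real generation code linked above differs): project each coordinate onto an equirectangular canvas, count items per pixel, and turn the counts into brightness. The log scaling here is my own choice to stop dense areas washing out.

```python
# Sketch of the map idea: one pixel per coordinate bucket, brighter where more
# items land. Not the actual generation code, which is linked above.
import numpy as np
from PIL import Image

WIDTH, HEIGHT = 2000, 1000  # simple equirectangular projection


def render(coords, out_path="wikidata-map.png"):
    """coords: iterable of (latitude, longitude) pairs from items with a coordinate location."""
    canvas = np.zeros((HEIGHT, WIDTH), dtype=np.float64)
    for lat, lon in coords:
        x = int((lon + 180.0) / 360.0 * (WIDTH - 1))
        y = int((90.0 - lat) / 180.0 * (HEIGHT - 1))
        canvas[y, x] += 1.0
    canvas = np.log1p(canvas)                        # keep dense areas from washing out
    canvas = (canvas / canvas.max() * 255).astype(np.uint8)
    Image.fromarray(canvas, mode="L").save(out_path)


# a couple of made-up test points
render([(30.82, 111.00), (51.50, -0.12)])
```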

Continue reading

Covid-19 Wikipedia pageviews, a first look

World events often have a dramatic impact on online services. A past example would be the death of Michael Jackson, which brought down Twitter and Wikipedia, and made Google believe it was under attack, according to the BBC.

Events like the COVID-19 (Coronavirus) pandemic have a less instantaneous effect, but trends can still be seen to change. Cloudflare recently posted about some of the internet-wide traffic changes due to the pandemic and various government announcements, quarantines and lockdowns.

Currently the main English Wikipedia article for the COVID-19 pandemic is receiving roughly 1.2 million page views per day (14 per second). This article has already gone through 4 different names over the past months, and the pageview rate continues to climb.

Wikipedia pageviews tool showing English Wikipedia COVID-19 pandemic article views up to 21 March 2020 (source)
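
The daily numbers behind a chart like that can also be pulled straight from the Wikimedia pageviews REST API; here is a minimal sketch. The article title and date range are just examples, and because the article has been renamed several times, counts for older dates live under whichever title was current at the time.

```python
# Sketch: fetch daily pageview counts for one article from the Wikimedia REST API.
# Title and dates are examples; renamed articles keep their old counts under old titles.
import requests

URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia/all-access/all-agents/{title}/daily/{start}/{end}"
)


def daily_views(title: str, start: str, end: str):
    r = requests.get(
        URL.format(title=title, start=start, end=end),
        headers={"User-Agent": "pageviews-sketch/0.1"},
    )
    r.raise_for_status()
    return [(item["timestamp"][:8], item["views"]) for item in r.json()["items"]]


for day, views in daily_views("COVID-19_pandemic", "20200301", "20200321"):
    print(day, views)
```
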
Continue reading
« Older posts

© 2021 Addshore
