WikiCrowd at 50k answers

In January 2022 I published a new Wikimedia tool called WikiCrowd.

This tool allows people to answer simple questions to contribute edits to Wikimedia projects such as Wikimedia Commons and Wikidata.

It’s designed to be able to deal with a wide variety of questions, but due to time constraints the current questions only cover Aliases for Wikidata and Depicts statements for Wikimedia Commons.

Read more

Wikidata maxlag, via the ApiMaxLagInfo hook

Wikidata tinkers with the concept of maxlag that has existed in MediaWiki for some years in order to slow automated editing at times of lag in various systems.

Here you will find a little introduction to MediaWiki maxlag, and the ways that Wikidata hooks into the value, altering it for its needs.

Screenshot of the “Wikidata Edits” grafana dashboard showing increased maxlag and decreased edits

As you can see above, a high maxlag can cause automated editing to reduce or stop on wikidata.org
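A minimal sketch of how an automated editor might respect maxlag, assuming the standard MediaWiki Action API behaviour: the client sends a `maxlag` parameter, and when lag is too high the API answers with an error whose code is `maxlag`, usually alongside a `Retry-After` header. The helper names and the default values here are my own, not something taken from Wikidata's code.

```python
def with_maxlag(params, maxlag=5):
    """Return a copy of the API parameters with a maxlag value added."""
    updated = dict(params)
    updated["maxlag"] = maxlag
    return updated


def maxlag_wait_seconds(response, headers, default_wait=5.0):
    """Return how long to wait if the API reported a maxlag error, else None.

    `response` is the decoded JSON body and `headers` the response headers.
    """
    error = response.get("error", {})
    if error.get("code") != "maxlag":
        return None
    try:
        # Prefer the server's suggestion when it is present and numeric.
        return float(headers.get("Retry-After"))
    except (TypeError, ValueError):
        # Header missing or not a plain number: fall back to our own default.
        return default_wait
```

A bot would call `maxlag_wait_seconds` after every write request, sleep for the returned number of seconds when it is not `None`, and then retry the same request.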

Read more

Wikibase a history

I have had the pleasure of being part of the Wikibase journey one way or another since 2013 when I first joined Wikimedia Germany to work on Wikidata. That long-running relation to the project should put me in a fairly good position to give a high-level overview of the history, from both a technical and higher-level perspective. So here it goes.

For those that don’t know, Wikibase is the code that powers wikidata.org and a growing number of other sites. If you want to know more, read about it on Wikipedia or the Wikibase website.

For this reason, a lot of the early timeline is quite heavy on the Wikidata side. There are certainly some key points missing. If you think they are worth mentioning, then leave a comment or reach out!

Read more

Profiling a Wikibase item creation on test.wikidata.org

Today I was in a Wikibase Stakeholder group call, and one of the discussions was around Wikibase importing speed, data loading, and the APIs. My previous blog post covering what happens when you make a new Wikibase item was raised, and we also got onto the topic of profiling.

So here comes another post looking at some of the internals of Wikibase, through the lens of profiling on test.wikidata.org.

The tools used to write this blog post for Wikimedia infrastructure are both open source and public. You can do similar profiling on your own Wikibase, or for requests that you suspect are slow on Wikimedia sites such as Wikidata.

Wikimedia Profiling

Profiling of Wikimedia sites is managed and maintained by the Wikimedia performance team. They have a blog, and one of the most recent posts was actually covering profiling PHP at scale in production, so if you want to know the details of how this is achieved give it a read.

Throughout this post I will be looking at data collected from a production Wikimedia request, by setting the X-Wikimedia-Debug header in my request. This header has a few options, and you can find the docs on wikitech.wikimedia.org. There are also browser extensions available to easily set this header on your requests.
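As a rough sketch of what setting the header from a script could look like: the helper below only builds the header value. The `backend=...; profile` layout and the `profile` option name are my assumptions about the format and should be checked against the wikitech.wikimedia.org docs; the host name is a made-up placeholder.

```python
def wikimedia_debug_header(backend, options=()):
    """Build an X-Wikimedia-Debug header dict.

    Assumption: the value is a '; '-separated list starting with
    'backend=<host>' followed by bare option names.
    """
    parts = [f"backend={backend}"]
    parts.extend(options)
    return {"X-Wikimedia-Debug": "; ".join(parts)}


# Hypothetical backend host, for illustration only.
headers = wikimedia_debug_header("mwdebug.example", ["profile"])
```

The resulting dict can be passed as extra request headers to whichever HTTP client you use.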

I will be using the Wikimedia hosted XHGui to visualize the profile data. Wikimedia specific documentation for this interface also exists on wikitech.wikimedia.org. This interface contains a random set of profiled requests, as well as any requests that were specifically requested to be profiled.

Profiling PHP & MediaWiki

If you want to profile your own MediaWiki or Wikibase install, or PHP in general, then you should take a look at the mediawiki.org documentation page for this. You’ll likely want to use either Tideways or XDebug, but probably want to avoid having to set up any extra UI to visualize the data.

This profiling only covered the main PHP application (MediaWiki & Wikibase extension). Other services such as the query service would require separate profiling.

Read more

Wikidata ontological tree of Trains

While working on my recent WikiCrowd project I ended up looking at the ontological tree of both Wikidata entities and Wikimedia Commons categories.

In this post, I’ll look at some of the ontology mappings that happen between projects, some of the SPARQL that can help you use this ontology in tools, and also some tools to help you explore this complex tree.

I’m using trains as I think they are fairly easy for most folks to relate to, and also don’t have a massively complex tree.
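To give a flavour of the kind of SPARQL involved, here is a small sketch that builds a query for everything in the subclass tree below train (Q870). It assumes the tree is walked via P279 (subclass of) and that the query goes to the public Wikidata Query Service endpoint; the function names and the limit are mine, and this is not necessarily the query WikiCrowd uses.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def subclass_tree_query(root_qid, limit=50):
    """Build a SPARQL query for items that are (transitively) subclasses of root_qid."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        # P279* follows zero or more 'subclass of' hops up to the root.
        f"  ?item wdt:P279* wd:{root_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def query_url(query):
    """Return a GET URL for the query, asking for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


train_query = subclass_tree_query("Q870")
```

Because `P279*` matches zero hops as well, the root item itself is included in the results.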

Commons & Wikidata mapping

Depicts questions in WikiCrowd are entirely generated from these Wikimedia Commons categories, such as Category:Trains & Category:Steam locomotives. These are then mapped to items on Wikidata such as Q870 (train) & Q171043 (steam locomotive).

Wikimedia Commons categories quite often contain infoboxes on the right-hand side that link to a variety of resources for the thing the category is covering. Quite often a Wikidata item ID is present too, as is the case for the categories above.

Likewise, on Wikidata, statements for P373 (Commons category) will often exist for entities that are depicted on Commons.
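The Wikidata side of this mapping can be read with a short SPARQL query. The sketch below builds such a query for one item and pulls the category names out of a result in the standard SPARQL JSON format. The helper names are hypothetical, and the sample response is hand-written to show the shape only; it is not real query output.

```python
def commons_category_query(qid):
    """Build a SPARQL query for the P373 (Commons category) values of one item."""
    return f"SELECT ?category WHERE {{ wd:{qid} wdt:P373 ?category . }}"


def category_names(sparql_json):
    """Extract category names from a SPARQL JSON results document."""
    bindings = sparql_json["results"]["bindings"]
    return [row["category"]["value"] for row in bindings]


# Hand-written example of the response shape, for illustration only.
sample_response = {
    "head": {"vars": ["category"]},
    "results": {"bindings": [{"category": {"type": "literal", "value": "Trains"}}]},
}
names = category_names(sample_response)
```

P373 values are plain strings without the `Category:` prefix, so a tool would add the prefix itself when linking back to Commons.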

Read more

A first look at WikiCrowd

I have quite enjoyed the odd contribution to an app by Google called Crowdsource. You can find it on the web, or as an app.

Crowdsource allows people to throw data at Google in controlled ways to add to the massive pile of data that Google uses to improve its services and at the end of the day beat its competition.

It does this by providing a collection of micro contribution tasks in a marginally gamified way, similar to how Google Maps contributions get you Local Guide points etc. In Crowdsource you get a contribution count, a level, and a metric for agreements.

While I enjoy making the odd contribution when bored out of my mind and enjoy looking at the new challenges (currently at 2625 contributions), I always think that data like this should just be going out into the world under a free licence to benefit everyone.

So finally, introducing WikiCrowd, an interface, and soon to be app, that I developed over the new year period.

WikiCrowd Overview

WikiCrowd is hosted on toolforge and can be found at https://wikicrowd.toolforge.org/ (Source code on Github)

In order to contribute, you need some knowledge of the world, a Wikimedia account and that’s it!

Screenshot showing the wikicrowd application, listing various groups of questions users can contribute to

Read more

Wikidata user and project talk page connection graph

Talk pages are a pretty key part of how wikis have worked over the years. Realtime chat apps and services are probably changing this dynamic somewhat, but they are still used, and also most of the history of these pages is still recorded.

I started up an IPython Notebook to try and take a look at some of the connections between different users on Wikidata over the years. Below you’ll find a few representations of these connections, as well as notable things I spotted along the way, the generating code, SQL query and more!

The data

MediaWiki maintains links tables for all pages, so getting all of the current links out of Wikidata is very easy. I made use of the Wikimedia Cloud Quarry service to run this query and host a CSV of the results.

SELECT
  SUBSTRING_INDEX(page_title, '/', 1) AS t1,
  pl_from_namespace AS t1ns,
  SUBSTRING_INDEX(pl_title, '/', 1) AS t2,
  pl_namespace AS t2ns
FROM pagelinks, page
WHERE pl_namespace IN (3,5) AND pl_from_namespace IN (3,5)
AND page_id = pl_from AND page_title != pl_title
GROUP BY t1, t2

I then loaded this data directly into an IPython Notebook and did some cleaning, such as removing all IP addresses. I then spent quite some time applying more filtering and twiddling knobs to try and get some graphics out that are easy to look at. The first attempts looked like solid blobs as you can see in this tweet.
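The IP address cleaning step could look something like the sketch below. It assumes the CSV has the `t1` and `t2` columns from the query above and uses Python's standard `ipaddress` module to spot anonymous users; the user names in the sample are made up, and this is not the actual Notebook code.

```python
import csv
import io
import ipaddress


def is_ip_address(name):
    """Return True if a page title is an IPv4 or IPv6 address (an anonymous user)."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def drop_ip_rows(rows):
    """Keep only links where neither end is an IP address."""
    return [
        row for row in rows
        if not (is_ip_address(row["t1"]) or is_ip_address(row["t2"]))
    ]


# Made-up sample in the same column layout as the Quarry export.
sample_csv = (
    "t1,t1ns,t2,t2ns\n"
    "ExampleUserA,3,ExampleUserB,3\n"
    "192.0.2.1,3,ExampleUserA,3\n"
    "ExampleUserB,3,2001:DB8:0:0:0:0:0:1,3\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
cleaned = drop_ip_rows(rows)
```

Filtering on both columns matters because an IP address can appear as either the source or the target of a talk page link.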

You can find a copy of the Notebook on notebooksharing.space.

Read more

Most liked Wikidata tweets

Wikidata is 9, and Twitter has been around for the entire Wikidata lifespan. So let’s take a look back through time at some of the most liked Wikidata tweets (according to Twitter free search) since creation.

Personally, I think it’s rather cool that half of the tweets are in languages other than English!

Want this list but for Wikibase (the software that runs Wikidata)? Check out my Wikibase focused post!

2021, @wikidata 412 💕s

Announcement of the new Wikidata Query Builder by @wikidata!

Read more

Tech Lead Digest – Q3/4 2021

This entry is part 5 of 5 in the series Tech Lead Digest (wmde)

It’s time for the 5th instalment of my tech lead digest posts. I switched to monthly for 2 months, but decided to back down to quarterlyish. You can find the other digests by checking out the series.

🧑‍🤝‍🧑Wikidata & Wikibase

The biggest event of note in the past months was WikidataCon 2021, which took place toward the end of October 2021. Spread over 3 days, the event celebrated Wikidata’s 9th birthday. We are still awaiting the report from the event to know how many folks participated, and recordings of talks will likely not be available until early 2022, at which point I’ll try to write another blog post.

Just before WikidataCon the updated strategy for Linked Open Data was published by Wikimedia Deutschland which includes sub-strategies for Wikidata and the Wikibase Ecosystem. This strategy is much easier to digest than the strategy papers published in 2019 and I highly recommend the read. Part of the Wikidata strategy talks about “sharing workload” which reminds me of some thoughts I recently had comparing Wikipedia and Wikidata editing. Wikibase has a focus on Ecosystem enablement, which I am looking forward to working on.

The Wikibase stakeholder group continues to grow and organize. A Twitter account (@wbstakeholders) now exists, tweeting relevant updates. The group now has over 14 organizational members and 15 individual members, its budget is public, and it is working on getting some desired features implemented. If you are an organization or individual working in the Wikibase space, be sure to check them out! The group recently published a prioritized list of institutional requirements, and I’m happy to say that some parts of the “Automatic maintenance processes and updating cascades should work out of the box” area that scored 4 have already been tackled by the Wikidata / Wikibase teams.

Read more