mwcli (a MediaWiki focused command line tool targeting developers) over the years

mwcli includes the third or so generation of “developer environments” that I have made for MediaWiki over the years. You can see the backstory in this earlier post.

Since the early days of 2022, there has been optional metric collection included within the mwcli tool.

This metric collection simply records which command you run and when you run it (without any parameters or inputs), so that the data can be aggregated and the usage of the various commands can be graphed.

Each command run includes something like:

Command: docker mediawiki exec
DateTime: 2025-01-07T12:45:18.213Z
Version: 0.25.1

I used to have live(ish) graphs in the Wikimedia Superset installation, but the queries there now appear to fail. So I took some time to export the dataset as a CSV and shove it around a bit in a Python notebook.
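As a rough illustration of that kind of aggregation, here is a minimal sketch that counts command runs per month from a CSV export. The column names (Command, DateTime, Version) mirror the example event above, but the real export's layout is an assumption, and every sample row other than the first is made up, including the `docker example-command` name.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical CSV export; only the first data row comes from the example event.
SAMPLE_CSV = """Command,DateTime,Version
docker mediawiki exec,2025-01-07T12:45:18.213Z,0.25.1
docker mediawiki exec,2025-01-09T08:00:00.000Z,0.25.1
docker example-command,2025-02-01T10:30:00.000Z,0.25.1
"""


def runs_per_month(csv_text):
    """Count runs of each command per calendar month."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Timestamps look like 2025-01-07T12:45:18.213Z
        when = datetime.strptime(row["DateTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
        counts[(when.strftime("%Y-%m"), row["Command"])] += 1
    return counts


for (month, command), total in sorted(runs_per_month(SAMPLE_CSV).items()):
    print(month, command, total)
```

Counts keyed by month and command like this are easy to feed into whatever plotting library the notebook uses.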


2 years of wikibase.cloud by WMDE

It’s been somewhere between 2 and 3 years since WMDE took over WBStack and turned it into wikibase.cloud. During this time, my techy focus has slowly shifted away from the world of Wikibase, though I still enjoy following along and working on other Wikimedia areas.

Here I will ramble on about what I saw in terms of potential for wikibase.cloud within the Wikibase ecosystem, as well as what developments have happened within the past years.

The initial problems, goals and dreams

From An introduction to WBStack, I said:

The idea behind the project is to provide Wikibase and surrounding services, such as a blazegraph query service, query service ui, quick statements, and others on a shared platform where installs, upgrades and maintenance are handled centrally.

Now, this is fairly obvious, and clearly something that wikibase.cloud still offers today. However, I didn’t write why!? And this is potentially something that has gotten lost through the years of multiple PMs, multiple engineers, multiple project names etc.


COVID-19 Wikipedia pageview spikes, 2019-2022

Back in 2019 at the start of the COVID-19 outbreak, Wikipedia saw large spikes in page views on COVID-19 related topics while people were hunting for information.

I briefly looked at some of the spikes in March 2020 using the easy-to-use pageview tool for Wikimedia sites. But the problem with viewing the spikes through this tool is that you can only look at 10 pages at a time on a single site, when in reality you’d want to look at many pages relating to a topic, across multiple sites at once.

I wrote a notebook to do just this, submitted it for privacy review, and I am finally getting around to putting some of those moving parts and visualizations in public view.

Methodology

It certainly isn’t perfect, but the representation of spikes is much more accurate than looking at a single Wikipedia or a set of hand-selected pages.

  1. Find statements on Wikidata that relate to COVID-19 items
  2. Find Wikipedia site links for these items
  3. Find previous names of these pages if they have been moved
  4. Lookup pageviews for all titles in the pageview_hourly dataset
  5. Compile into a gigantic table and make some graphs using plotly
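The final step, compiling everything into one table, might look roughly like the sketch below. It assumes the hourly rows have already been fetched as (timestamp, project, title, views) tuples. That row shape, the function name, and the sample numbers are illustrative assumptions, not the notebook's actual code.

```python
from collections import defaultdict


def daily_totals(rows):
    """Sum views per day across all projects and titles.

    rows: iterable of (iso_hour, project, title, views), where iso_hour
    looks like "2020-03-11T14" (an assumed format for this sketch).
    """
    totals = defaultdict(int)
    for iso_hour, _project, _title, views in rows:
        day = iso_hour[:10]  # keep only the YYYY-MM-DD part
        totals[day] += views
    return dict(sorted(totals.items()))


# Made-up sample rows, purely to show the shape of the data.
sample = [
    ("2020-03-10T09", "en.wikipedia", "Example_page", 120),
    ("2020-03-10T10", "de.wikipedia", "Beispielseite", 30),
    ("2020-03-11T09", "en.wikipedia", "Example_page", 400),
]
print(daily_totals(sample))
```

Summing across projects and titles is what lets a topic-wide spike show up, rather than only the spike on a single page.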

I’ll come onto the details later, but first for the…

Graphics

All graphics generally show an initial peak in the run-up to the WHO declaring an international public health emergency (30 Jan 2020), and another peak starting prior to the WHO declaring a pandemic.

Be sure to have a look at the interactive views of each diagram to really see the details.

COVID-19 related Wikimedia pageviews (interactive view)


Verifying Wikimedia user page links on Mastodon

While reviewing the ongoings of the 2023 Wikimedia hackathon, I learned about the RealMe MediaWiki extension, which is already deployed to Wikimedia sites and allows verification of URLs that appear on user pages within other software or platforms, such as Mastodon.

Link verification for dummies

Imagine you want to show that your online profiles, like on Mastodon, truly belong to you. One way to do this is by using a special code called “rel=me”. It’s like saying, “Hey, this link over here is connected to me.” However, there’s a catch: both the link and the page it points to need to say they’re connected.

On platforms like Mastodon, you can add links to your other profiles. The platform then checks if those profiles also point back to your original page using the same “rel=me” code. If they do, your link gets a stamp of approval, showing it’s really yours.

The RealMe extension allows you to configure a set of links on your user page that include this “rel=me” special code that other systems, such as Mastodon, can check.
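To make the “both sides must point at each other” idea concrete, here is a minimal sketch of the check a verifier might perform on one side: does a given page contain a rel="me" link to a given profile URL? The class and function names, the sample markup, and the `mastodon.example` URL are hypothetical. Real verifiers are likely to do more, such as normalising URLs, and this is not Mastodon's or RealMe's actual code.

```python
from html.parser import HTMLParser


class RelMeCollector(HTMLParser):
    """Collect href values of <a> and <link> tags whose rel includes "me"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated values, e.g. "me nofollow"
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "me" in rel_values and attr_map.get("href"):
            self.links.append(attr_map["href"])


def links_back(page_html, profile_url):
    """Return True if page_html has a rel="me" link to profile_url."""
    collector = RelMeCollector()
    collector.feed(page_html)
    return profile_url in collector.links


# Hypothetical user page markup and profile URL, for illustration only.
user_page = (
    '<p><a rel="me nofollow" href="https://mastodon.example/@someone">'
    "Mastodon</a></p>"
)
print(links_back(user_page, "https://mastodon.example/@someone"))
```

Running the same check in the other direction (from the Mastodon profile back to the user page) completes the two-way verification described above.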

Configuring it

This one took me a few minutes to get working after reading the instructions, but on meta.wikimedia.org I added a link to my Mastodon profile, enabled the setting on meta, headed over to my Mastodon profile to add the link, and tada, it is verified!


Wikidata query service Blazegraph JNL file on Cloudflare R2 and Internet Archive

This entry is part 3 of 3 in the series Your own Wikidata Query Service

At the end of 2022, I published a Blazegraph JNL file for Wikidata in a Google Cloud bucket for 1 month for folks to download and determine if it was useful.

Thanks to Arno from weblyzard, inflatador from the WMF search platform team, and Mark from the Internet Archive for the recent conversations around this topic.

You can now grab some new JNL files from a few days ago, hosted on either the Internet Archive or Cloudflare R2.


What is Wikibase Cloud (according to Bing AI)

A few days ago I wrote a post called What is Wikibase Cloud (according to ChatGPT). Now I also have access to the new Bing AI chatbot preview, so time to put it to the same test.

This is one of the first times I’m going to have purposefully used Bing since its launch!

As before, Messages to Bing AI will be in bold, things that are wrong, in red, and things that are right in green.

I’ll try to steer the conversation through the same questions that I used with ChatGPT.

Firstly, “What is Bing AI?”

Bing AI is a term that refers to various features and capabilities of Bing that are powered by artificial intelligence. For example, Bing AI can help you find answers, chat with you, generate content, and personalize your web experience²³.

Source: Conversation with Bing, 06/03/2023
(1) Reinventing search with a new AI-powered Microsoft Bing and Edge, your …. Accessed 06/03/2023.
(2) Factbox: How to get Microsoft’s new AI-enhanced Bing. Accessed 06/03/2023.
(3) Bing. Accessed 06/03/2023.
(4) Microsoft ignores Bing fails by planning to let AI loose on Windows 11 desktops. Accessed 06/03/2023.
(5) Bing AI chatbot now lets you fine-tune its personality. Accessed 06/03/2023.
(6) AI-powered Bing Chat gains three distinct personalities. Accessed 06/03/2023.

I’m really liking the fact that the responses come with references, whereas with ChatGPT you need to ask for the resources that were used to generate the answers. Also, it looks like the “Accessed” date for all of these resources is much more recent. Perhaps this means Bing will be better at conversations about more recent ongoings?

This is all also nicely displayed in the UI.

So, “What is Wikibase Cloud?”

Wikibase Cloud is a cloud-based platform that hosts instances of Wikibase, an open-source software suite for creating collaborative knowledge bases¹². It’s currently an invite-only beta service provided by Wikimedia Deutschland¹⁴.

Are you interested in applying for early access to Wikibase Cloud?

Source: Conversation with Bing, 06/03/2023
(1) Wikibase/Wikibase.cloud – MediaWiki. Accessed 06/03/2023.
(2) Wikibase/Wikibase.cloud – Meta – Wikimedia. Accessed 06/03/2023.
(3) wikibase.cloud. Accessed 06/03/2023.
(4) What is Wikibase Cloud (according to ChatGPT) – addshore. Accessed 06/03/2023.

So, now we are apparently reaching the point of self-referential AI, as Bing is referencing the blog post that I wrote a few days ago pointing out where ChatGPT was incorrect about various points regarding Wikibase Cloud.

There is also an interesting prompt by Bing to continue the conversation asking if I want to apply for early access, so let’s give it a go with a simple “Yes”.


What is Wikibase Cloud (according to ChatGPT)

In this post, I’ll be asking ChatGPT a bunch of questions about Wikibase Cloud, mainly to see how accurate it manages to be, and how helpful the information is.

Messages to ChatGPT will be in bold, things that are wrong in red, and things that are right in green.

Firstly, “What is Chat GPT?”

But I needed to follow up, asking “Can you answer again but in the third person?”

ChatGPT is a large language model developed by OpenAI that has been trained on a vast corpus of text data. The acronym GPT stands for “Generative Pretrained Transformer,” which refers to the model’s ability to generate natural language responses to text-based prompts. ChatGPT uses a transformer neural network architecture that enables it to process and comprehend long sequences of text. Its training data is sourced from a diverse range of materials, including books, articles, and websites, and covers a broad spectrum of topics. ChatGPT has been designed to assist users with natural language processing tasks such as answering questions, summarizing text, and generating human-like responses in conversation.

ChatGPT on 4 March 2023

But when asking about Wikibase Cloud it gets a little confused and answers incorrectly…

What is Wikibase Cloud?

Wikibase Cloud is a hosting service provided by Wikimedia Cloud Services that allows users to create and manage their own Wikibase instances in the cloud. …

ChatGPT on 4 March 2023
The full ChatGPT response for “What is Wikibase cloud”

Wikibase Cloud is a hosting service provided by Wikimedia Deutschland, so the reference to Wikimedia Cloud Services here is incorrect.

So naturally, I need to ask “Are you sure it is provided by Wikimedia Cloud Services?”


Wikimedia Enterprise: A first look

Wikimedia Enterprise is a new (now 1-year-old) service offered by the Wikimedia Foundation, via Wikimedia, LLC.

This is a wholly-owned LLC that provides opt-in services for third-party content reuse, delivered via API services.

In essence, this means that Wikimedia Enterprise is an optional product that third parties can choose to use. It repackages data from within Wikimedia projects in a more useful, more reliable, and more stable format, presenting it primarily via data downloads and APIs, with profits going into the Wikimedia Foundation.

Want to find out more? Read the FAQ.

The project and APIs are well documented, and access can be requested for free, but I wanted to spend a little bit of time hands-on with the APIs to get a full understanding of what is offered, the formats, and how it differs from things I know are exposed elsewhere in Wikimedia projects.

Account Creation

Wikimedia Enterprise accounts are separate from any other Wikimedia related accounts, so you’ll need a new one.

In order to get an account you need to fill out a pretty straightforward form (username, password, email, and accept terms). You then need to verify your email address. Tada, you are in!


A first Wikidata query service JNL file for public use

This entry is part 2 of 3 in the series Your own Wikidata Query Service

Back in 2019 I wrote a blog post called Your own Wikidata Query Service, with no limits, which documented loading a Wikidata TTL dump into your own Blazegraph instance running within Google Cloud, a near 2 week process.

I ended that post speculating that part 2 might be using a “pre-generated Blazegraph journal file to deploy a fully loaded Wikidata query service in a matter of minutes”. This post should take us a step closer to that eventuality.

Wikidata Production

There are many production Wikidata query service instances all up to date with Wikidata and all of which are powered using open source code that anyone can use, making use of Blazegraph.

Per wikitech documentation there are currently at least 17 Wikidata query service backends:

  • public cluster, eqiad: wdqs1004, wdqs1005, wdqs1006, wdqs1007, wdqs1012, wdqs1013
  • public cluster, codfw: wdqs2001, wdqs2002, wdqs2003, wdqs2004, wdqs2007
  • internal cluster, eqiad: wdqs1003, wdqs1008, wdqs1011
  • internal cluster, codfw: wdqs2005, wdqs2006, wdqs2008

These servers all have hardware specs that look something like dual Intel(R) Xeon(R) E5-2620 v3 CPUs, 1.6TB of raw RAIDed SSD space, and 128GB RAM.

When you run a query it may end up in any one of the backends powering the public clusters.

All of these servers also have an up-to-date JNL file full of Wikidata data that anyone wanting to set up their own Blazegraph instance with Wikidata data could use. This is currently 1.1TB.

So let’s try and get that out of the cluster for folks to use, rather than having people rebuild their own JNL files.


Wikidata query service updater evolution

The Wikidata Query Service (WDQS) sits in front of Wikidata and provides access to query its data via a SPARQL API. The query service itself is built on top of Blazegraph, but in many regards is very similar to any other triple store that provides a SPARQL API.
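For a sense of what using that SPARQL API looks like, here is a minimal sketch that builds a GET request URL for the public endpoint at query.wikidata.org. The helper name and the example query are my own choices for illustration. The sketch only constructs the URL, so it runs without network access.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(sparql, endpoint=WDQS_ENDPOINT):
    """Build a GET URL asking the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": sparql, "format": "json"})


# A small example query: five items that are an instance of (P31) house cat (Q146).
query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5"
url = build_query_url(query)
print(url)
```

The same helper could be pointed at a query service running alongside your own Wikibase by passing a different `endpoint`, though the exact path will depend on how that service is deployed.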

In the early days of the query service (circa 2015), the service was only run by Wikidata, hence the name. However, as interest in and usage of Wikibase continued to grow, more people started running a query service of their own, for data in their own Wikibase. But you’ll notice most people still refer to it as WDQS today.

Whereas most core Wikibase functionality is developed by Wikimedia Deutschland, the query service is developed by the search platform team at the Wikimedia Foundation, with a focus on wikidata.org, but also a goal of keeping it useable outside of Wikimedia infrastructure.

The query service itself currently works as a whole application rather than just a database. Under the surface, this can roughly be split into 2 key parts:

  • Backend Blazegraph database that stores and indexes data
  • Updater process that takes data from a Wikibase and puts it in the database

This actually means that you can run your own query service, without running a Wikibase at all. For example, you can load the whole of Wikidata into a query service that you operate, and have it stay up to date with current events. Though in practice this takes quite some work, plus expense on storage and indexing, and I expect not many folks do this.

Over time the updater element of the query service has iterated through some changes. The updater packaged with Wikibase, as used by most folks outside of the Wikimedia infrastructure, is now 2 steps behind the updater used for Wikidata itself.

The updater generations look something like this:

  • HTTP API Recent Changes polling updater (used by most Wikibases)
  • Kafka based Recent Changes polling updater
  • Streaming updater (used on Wikidata)
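To make the first generation a little more concrete, here is a minimal sketch of the polling idea: ask for recent changes, work out which entities were touched, and hand each one to something that refreshes the database. The `fetch_changes` and `update_entity` callables, the shape of the change records, and the function name are all hypothetical stand-ins. This is not the actual updater code.

```python
def poll_once(fetch_changes, update_entity, last_seen_id):
    """Process changes newer than last_seen_id; return the new high-water mark.

    fetch_changes(after_id) -> list of {"id": int, "title": str} (assumed shape)
    update_entity(title)    -> refreshes that entity's data (stand-in)
    """
    changes = fetch_changes(last_seen_id)
    # Several edits to one entity only need a single refresh.
    for title in sorted({change["title"] for change in changes}):
        update_entity(title)
    return max((change["id"] for change in changes), default=last_seen_id)


# Stand-ins so the sketch runs without a Wikibase or a triple store.
fake_feed = [
    {"id": 1, "title": "Q1"},
    {"id": 2, "title": "Q2"},
    {"id": 3, "title": "Q1"},
]
updated = []
new_mark = poll_once(
    lambda after_id: [c for c in fake_feed if c["id"] > after_id],
    updated.append,
    last_seen_id=0,
)
print(updated, new_mark)
```

A real updater would call something like this in a loop, sleeping between polls and carrying the returned mark into the next call.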

Let’s take a look at a high-level overview of these updaters, what has changed and why. I’ll also be applying some pretty arbitrary / gut feeling scores to 4 categories for each updater.
