Wikibase: What is an entity?

This entry is part 1 of 7 in the series Wikibase Entities

I left the Wikidata and Wikibase teams roughly a year ago, and at the time there were some long and deep discussions going on inside the team trying to define what an entity was, and what should and should not be an entity.

At the recent Hackathon in Tallinn, this topic resurfaced for me, as current and former members of the Wikidata and Wikibase teams were in attendance alongside me.

I have opinions, others have opinions, and I feel that a short blog post summarizing the details that are currently written down in public, as well as some of the more on-point things I have heard people say, may help further the discussion, or perhaps bring it to some kind of conclusion.

What I actually found when pulling the various written details together is that they mostly describe what I would say is the ideal path forward without rewriting the world (of Wikibase), but it’s taken me a while to sit back, relax, and actually reread all the things that we have written over the years.

Read more

Review & Removal of Wikidata query service Blazegraph JNL on Cloudflare R2

Back in August, I uploaded a new Wikidata query service Blazegraph JNL file to both Cloudflare and the Internet Archive. 4 months on, it is time for me to remove the R2 version of this file, which is costing me around 18 USD per month to store, and fall back to the Internet Archive version … Read more

COVID-19 Wikipedia pageview spikes, 2019-2022

Back in 2019, at the start of the COVID-19 outbreak, Wikipedia saw large spikes in page views on COVID-19 related topics while people were hunting for information.

I briefly looked at some of the spikes in March 2020 using the easy-to-use pageview tool for Wikimedia sites. But the problem with viewing the spikes through this tool is that you can only look at 10 pages at a time on a single site, when in reality you’d want to look at many pages relating to a topic, across multiple sites at once.

I wrote a notebook to do just this, submitted it for privacy review, and I am finally getting around to putting some of those moving parts and visualizations in public view.

Methodology

It certainly isn’t perfect, but the representation of spikes is much more accurate than looking at a single Wikipedia or a set of hand-selected pages.

  1. Find statements on Wikidata that relate to COVID-19 items
  2. Find Wikipedia site links for these items
  3. Find previous names of these pages if they have been moved
  4. Look up pageviews for all titles in the pageview_hourly dataset
  5. Compile into a gigantic table and make some graphs using plotly
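As a rough illustration of steps 3 to 5, here is a minimal sketch of how pageview rows for moved pages might be folded into their current titles and summed. It is not the notebook itself: the row shape, the `RENAMES` mapping, the page titles, and the view counts are all hypothetical and made up for the example, and the real data comes from the pageview_hourly dataset rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical mapping of (site, previous title) -> current title (step 3).
RENAMES = {
    ("en.wikipedia", "Example old title"): "Example page",
}

# Hypothetical rows shaped as (site, title, date, views); the numbers are invented.
ROWS = [
    ("en.wikipedia", "Example old title", "2020-03-11", 100),
    ("en.wikipedia", "Example page", "2020-03-11", 50),
    ("de.wikipedia", "Beispielseite", "2020-03-11", 30),
    ("en.wikipedia", "Example page", "2020-03-12", 70),
]


def views_by_page(rows, renames):
    """Sum views per (site, current title, date), folding old titles into new ones."""
    totals = defaultdict(int)
    for site, title, date, views in rows:
        current = renames.get((site, title), title)
        totals[(site, current, date)] += views
    return dict(totals)


def views_by_day(rows):
    """Sum views per date across every site and title (one line on a graph)."""
    totals = defaultdict(int)
    for _site, _title, date, views in rows:
        totals[date] += views
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    for key, views in views_by_page(ROWS, RENAMES).items():
        print(key, views)
    print(views_by_day(ROWS))
```

The per-day totals are the kind of series that could then be handed to plotly for graphing.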

I’ll come onto the details later, but first for the…

Graphics

All graphics generally show an initial peak in the run-up to the WHO declaring an international public health emergency (30 Jan 2020), and another peak starting prior to the WHO declaring a pandemic.

Be sure to have a look at the interactive views of each diagram to really see the details.

COVID-19 related Wikimedia pageviews (interactive view)

Read more

Wikidata query service Blazegraph JNL file on Cloudflare R2 and Internet Archive

At the end of 2022, I published a Blazegraph JNL file for Wikidata in a Google Cloud bucket for 1 month for folks to download and determine if it was useful.

Thanks to Arno from weblyzard, inflatador from the WMF search platform team, and Mark from the Internet Archive for the recent conversations around this topic.

You can now grab some new JNL files from a few days ago, hosted on either the Internet Archive or Cloudflare R2.

Read more

Creating properties with statements using Wikidata Integrator

Wikidata Integrator is a Python library that simplifies data integration from Wikidata (and other Wikibases). It is focused on Wikibase concepts (as opposed to some libraries which are MediaWiki focused) and has a user-friendly interface.

I’m currently working on a demo Wikibase and decided to bring all of the data into the Wikibase using a Jupyter notebook, with Wikidata Integrator as my library of choice for this task. (Jupyter notebooks are interactive coding environments that allow users to create and share documents containing live code, visualizations, and explanations.)

Along that journey I found the Wikidata Integrator documentation slightly lacking, but I managed to get initial property and item creation working with little effort. However, I couldn’t get properties to be created with statements already on them (a subsequent edit was needed instead).

Read more

Wikidata Map in 2023

It’s been 2 years since the 2021 Wikidata map. Yesterday I was sitting in the WMDE office and Lydia raised the point that we hadn’t made a map in quite some time (T331124).

Maps used to be generated in a somewhat automated fashion, but the process was rewritten in 2021 and still needs to be run by hand by someone with access to the WMF analytics platform.

Thankfully the documentation of the updates still works perfectly, and the whole process of the map generation only took a few minutes!

Read more

What is Wikibase Cloud (According to Bard)

This is the third in a series of posts where I ask language models, or modern conversational chat bots, the question “what is wikibase cloud” to see how they like to answer, and to be able to make some sort of comparison between them. The previous posts were for Chat GPT and Bing, but now I find myself with access to Bard by Google.

Firstly, “What is Bard?”

Bard is a large language model, also known as a conversational AI or chatbot trained to be informative and comprehensive. I am trained on a massive amount of text data, and I am able to communicate and generate human-like text in response to a wide range of prompts and questions. For example, I can provide summaries of factual topics or create stories.

Bard, 24 March 2023

On to the main event….

“What is wikibase cloud?”

As a language model, I’m not able to assist you with that.

Bard, 24 March 2023

So not off to a very promising start.

Read more

Asking Bing Chat AI to reference Wikidata

I previously compared Chat GPT and the Bing chat AI when it came to the question of “What is Wikibase Cloud”.

This comparison and further discussion highlighted problems with using Chat GPT alone. It seems to like inventing URLs that look right but have never existed, as it primarily wants to have a good conversation. And it seems to also do this for things such as Wikidata identifiers when referenced.

So, let’s take this same problem, test it out with the Bing chat AI, and see how it fares.

Firstly, how do we make the Bing chat AI actually reference Wikidata?

If we simply ask questions like “What is the capital of Germany?” or “Who is the prime minister of the UK?” we get mostly accurate responses referencing a variety of sources, including Wikipedia but not Wikidata.

Note I say mostly accurate here, as the answer about Rishi Sunak is inaccurate: he became prime minister after Liz Truss resigned, not Boris Johnson!

Asking the same questions with a follow-up request to “Please reference your answer with Wikidata Items!” seems to get us part of the way.

Read more

Wikimedia Enterprise: A first look

Wikimedia Enterprise is a new (now 1-year-old) service offered by the Wikimedia Foundation, via Wikimedia, LLC.

This is a wholly-owned LLC that provides opt-in services for third-party content reuse, delivered via API services.

In essence, this means that Wikimedia Enterprise is an optional product that third parties can choose to use. It repackages data from within Wikimedia projects in a more useful, more reliable, and more stable format, presented primarily via data downloads and APIs, with profits going to the Wikimedia Foundation.

Want to find out more? Read the FAQ.

The project and APIs are well documented, and access can be requested for free, but I wanted to spend a little bit of time hands-on with the APIs to get a full understanding of what is offered, the formats, and how it differs from things I know are exposed elsewhere in Wikimedia projects.

Account Creation

Wikimedia Enterprise accounts are separate from any other Wikimedia related accounts, so you’ll need a new one.

In order to get an account you need to fill out a pretty straightforward form (username, password, email, and accept terms). You then need to verify your email address. Tada, you are in!

Read more

A first Wikidata query service JNL file for public use

Back in 2019 I wrote a blog post called Your own Wikidata Query Service, with no limits, which documented loading a Wikidata TTL dump into your own Blazegraph instance running within Google Cloud, a nearly two-week process.

I ended that post speculating that part 2 might be using a “pre-generated Blazegraph journal file to deploy a fully loaded Wikidata query service in a matter of minutes”. This post should take us a step closer to that eventuality.

Wikidata Production

There are many production Wikidata query service instances, all up to date with Wikidata and all powered by open source code that anyone can use, built on Blazegraph.

Per wikitech documentation there are currently at least 17 Wikidata query service backends:

  • public cluster, eqiad: wdqs1004, wdqs1005, wdqs1006, wdqs1007, wdqs1012, wdqs1013
  • public cluster, codfw: wdqs2001, wdqs2002, wdqs2003, wdqs2004, wdqs2007
  • internal cluster, eqiad: wdqs1003, wdqs1008, wdqs1011
  • internal cluster, codfw: wdqs2005, wdqs2006, wdqs2008
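As a quick sanity check on the "at least 17" figure, the list above can be written out as a small data structure and tallied. This is only a sketch using the host names from the list; the `BACKENDS` name and its layout are my own choice for illustration, not anything taken from wikitech.

```python
# Host names copied from the list above, keyed by (cluster, datacenter).
BACKENDS = {
    ("public", "eqiad"): ["wdqs1004", "wdqs1005", "wdqs1006", "wdqs1007", "wdqs1012", "wdqs1013"],
    ("public", "codfw"): ["wdqs2001", "wdqs2002", "wdqs2003", "wdqs2004", "wdqs2007"],
    ("internal", "eqiad"): ["wdqs1003", "wdqs1008", "wdqs1011"],
    ("internal", "codfw"): ["wdqs2005", "wdqs2006", "wdqs2008"],
}

# Total number of backends across all clusters and datacenters.
total = sum(len(hosts) for hosts in BACKENDS.values())

# Hosts in the public clusters only.
public_hosts = [
    host
    for (cluster, _datacenter), hosts in BACKENDS.items()
    if cluster == "public"
    for host in hosts
]

if __name__ == "__main__":
    print(total, len(public_hosts))
```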

These servers all have hardware specs that look something like dual Intel(R) Xeon(R) E5-2620 v3 CPUs, 1.6TB of raw RAIDed SSD space, and 128GB RAM.

When you run a query it may end up in any one of the backends powering the public clusters.

All of these servers also have an up-to-date JNL file full of Wikidata data that anyone wanting to set up their own Blazegraph instance with Wikidata data could use. This is currently 1.1TB.

So let’s try and get that out of the cluster for folks to use, rather than having people rebuild their own JNL files.

Read more