Wikidata, instance of and subclass of through time (P31 & P279)
Last month I looked at all Wikimedia Commons revisions and managed to generate some data and graphs for the usage of depicts statements since they were introduced on the project.
This month, I have applied the same analysis to Wikidata, looking at instance of (P31) and subclass of (P279) statements on items. A slightly bigger data set, but essentially the same process.
This will enable easy updating of various pie charts that have been published over the years, such as:
- https://commons.wikimedia.org/wiki/File:WikidataStatisticsofWikipediaType_of_content.png from 2015
- Wikidata:Statistics pie chart, which is generated by Module:Statistical_data/by_project/classes, but has not been updated since 2020
- https://commons.wikimedia.org/wiki/File:Wikidata_content_2024.svg which was generated in 2024
In the future, this could easily be adapted to show per-Wikipedia-project graphs, such as those that are currently at Wikidata:Statistics/Wikipedia
Method
The details of the method can be seen in the code in my previous post about depicts statements, and the method mostly stays the same here.
In words:
- Look at every revision of Wikidata ever
- Parse the entity JSON of each revision to determine its P31 and P279 values
- Find the latest revision of each item in each given month, and thus the state of all items in that month
- Plot the data by the number of items that have each value item as a P31 or P279 value (a rough sketch of these steps follows this list)
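That sketch, in PySpark: it assumes access to the wmf.mediawiki_wikitext_history table in the Wikimedia data lake (field names approximate), is not the exact notebook code, and skips carrying an item’s state forward through months in which it had no edits.

```python
import json
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

def extract_p31_p279(text):
    """Return the item IDs used as P31/P279 values in one revision's entity JSON."""
    values = []
    try:
        claims = json.loads(text).get("claims", {})
        for prop in ("P31", "P279"):
            for statement in claims.get(prop, []):
                datavalue = statement.get("mainsnak", {}).get("datavalue", {})
                if datavalue.get("type") == "wikibase-entityid":
                    values.append(datavalue["value"]["id"])
    except (ValueError, TypeError, AttributeError):
        # Not the current entity JSON format (e.g. very old serializations).
        pass
    return values

extract_udf = F.udf(extract_p31_p279, T.ArrayType(T.StringType()))

revisions = (
    spark.table("wmf.mediawiki_wikitext_history")
    .where(F.col("wiki_db") == "wikidatawiki")
    .where(F.col("page_namespace") == 0)  # items live in the main namespace
    .select(
        F.col("page_title").alias("item_id"),
        F.col("revision_timestamp").alias("ts"),
        extract_udf("revision_text").alias("values"),
    )
    # Assuming ISO-8601 timestamps: "2016-03-01T12:34:56Z" -> "2016-03".
    .withColumn("month", F.substring("ts", 1, 7))
)

# Keep only the latest revision of each item within each month.
w = Window.partitionBy("item_id", "month").orderBy(F.col("ts").desc())
latest_per_month = (
    revisions.withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") == 1)
    .drop("rn")
)

# Count items per P31/P279 value per month; explode_outer keeps items with
# no values at all as a null row, which later becomes the UNALLOCATED bucket.
counts = (
    latest_per_month.select("month", F.explode_outer("values").alias("item_value"))
    .groupBy("month", "item_value")
    .count()
)
```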
There are some minor defects in this logic that could be cleaned up in future iterations:
- Deleted items will continue to be counted, as I don’t take into account the point at which items are deleted
- Things will be double counted in this data, as one item may have multiple P31 and P279 values, and I make no attempt to merge these into higher-level concepts
We make an OTHER and an UNALLOCATED count as part of the final data summarization. OTHER accounts for values that did not make it into the top 20 by item count, and UNALLOCATED means that an item didn’t have a P31 or P279 value in its latest revision. A minimal sketch of this bucketing is below.
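This pandas sketch assumes a per-month DataFrame `counts` with columns `item_value` (None where an item had no P31/P279 value) and `count`; the names are illustrative rather than the notebook’s own.

```python
import pandas as pd

def summarise(counts: pd.DataFrame, top_n: int = 20) -> pd.DataFrame:
    """Bucket a month's per-value counts into top-N, OTHER and UNALLOCATED."""
    # Items with no P31/P279 value in their latest revision (item_value is None).
    unallocated = counts.loc[counts["item_value"].isna(), "count"].sum()
    allocated = counts.dropna(subset=["item_value"]).sort_values("count", ascending=False)
    summary = pd.concat(
        [
            allocated.head(top_n),
            pd.DataFrame(
                {
                    "item_value": ["OTHER", "UNALLOCATED"],
                    "count": [allocated["count"].iloc[top_n:].sum(), unallocated],
                }
            ),
        ],
        ignore_index=True,
    )
    summary["percentage"] = (100 * summary["count"] / summary["count"].sum()).round(2)
    return summary.sort_values("count", ascending=False, ignore_index=True)
```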
2025
For August 2025 (or at least part way through it), this is the current state of Wikidata per the above method.
You can now find a PNG of this pie chart on Wikimedia Commons https://commons.wikimedia.org/wiki/File:Wikidata_P31_%26_P279_analysis_August_2025.png
This pie chart would be easy to generate for any given month, or even point in time, given the method used.
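For illustration, a minimal matplotlib sketch that renders one month’s summary table (as produced by the summarise() sketch above; all names are assumptions):

```python
import matplotlib.pyplot as plt

def plot_month(summary, month):
    """Render one month's summary table as a pie chart and save it as a PNG."""
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.pie(summary["count"], labels=summary["item_value"], autopct="%1.2f%%")
    ax.set_title(f"Wikidata P31 & P279 analysis, {month}")
    fig.savefig(f"wikidata_p31_p279_{month}.png", bbox_inches="tight")

# e.g. plot_month(summarise(counts_for_august), "August 2025")
```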
The raw data is below, where we can see that some of the values included in the chart do indeed have an overlapping conceptual space, such as star, infrared source, and galaxy.
So astronomical objects likely actually make up around 6.5% of Wikidata.
| P31 / P279 value | Count | Percentage |
|---|---|---|
| “scholarly article” | 45,192,839 | 35.26 |
| OTHER | 32,587,231 | 25.42 |
| “human” | 12,496,296 | 9.75 |
| UNALLOCATED | 6,673,336 | 5.21 |
| “Wikimedia category” | 5,679,651 | 4.43 |
| “taxon” | 3,783,682 | 2.95 |
| “star” | 3,635,045 | 2.84 |
| “infrared source” | 2,621,809 | 2.05 |
| “galaxy” | 2,133,424 | 1.66 |
| “protein” | 1,782,975 | 1.39 |
| “gene” | 1,679,022 | 1.31 |
| “Wikimedia disambiguation page” | 1,513,507 | 1.18 |
| “type of chemical entity” | 1,278,804 | 1.00 |
| “chemical compound” | 1,065,619 | 0.83 |
| “painting” | 1,057,459 | 0.83 |
| “protein-coding gene” | 969,272 | 0.76 |
| “Wikimedia template” | 824,881 | 0.64 |
| “street” | 708,719 | 0.55 |
| “family name” | 658,502 | 0.51 |
| “encyclopedia article” | 640,305 | 0.50 |
| “version, edition or translation” | 599,485 | 0.47 |
| “village of the People’s Republic of China” | 592,629 | 0.46 |
Over time
Rather than creating hundreds of pie charts to show the view of Wikidata over time, we can display this on a single chart, similar to what we have done before.
It’s rather interesting seeing the large dips in OTHER and UNALLOCATED back in 2016. I assume a lot of statements were added then? Or perhaps this dates back to the time before P31 and P279 etc. were fully adopted?
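For reference, a sketch of how such a stacked over-time view can be produced, using a toy stand-in for the real monthly summary data (all names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the real per-month summary (month, item_value, count).
monthly_summary = pd.DataFrame(
    {
        "month": ["2016-01", "2016-01", "2016-02", "2016-02"],
        "item_value": ["human", "OTHER", "human", "OTHER"],
        "count": [100, 50, 120, 40],
    }
)

# One stacked area per P31/P279 value, one x-axis point per month.
pivot = (
    monthly_summary.pivot(index="month", columns="item_value", values="count")
    .fillna(0)
    .sort_index()
)
ax = pivot.plot.area(figsize=(14, 8))
ax.set_title("Wikidata items by P31 / P279 value over time")
ax.set_ylabel("Items")
plt.savefig("wikidata_p31_p279_over_time.png", bbox_inches="tight")
```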
Continuing this
We can always regenerate this sort of data, as every revision of Wikidata is still available. We can also decide to change how we evaluate certain things, and reevaluate the whole data set with ease.
One part of integrating this more closely with Wikimedia and Wikidata itself would be T341649 (Provide an easy way for MediaWiki to fetch aggregate data from the data lake), where I left a comment. This data is all generated from the Wikimedia data lake, and the tables still exist there. Doing this in a structured and automated way each month would likely be easy enough, and the results could then, for example, be included on wiki within WikiProjects and/or on pages for individual properties too.
addshore/wikimedia-notebooks on GitHub includes all the notebook code used to generate this data, in commit e67ca7d394ced5c82530b3d25718610bd8fc649a. This is currently split across 3 notebooks, each of which saves data into Hadoop, in tables under my user, as intermediary steps.
- history_wikidata.ipynb: Extraction of data on a per revision level
- history_wikidata_monthly.ipynb: Aggregation into monthly snapshots
- history_wikidata_monthly_export_instance_subclass.ipynb: Analysis of P31 and P279 specifically
If anyone wants to recreate this as part of airflow to be progressively built, let me know! (Or if you just like looking at the pretty graphs…)
For the benefit of web scraping and previews… I’ll include the PNG of the pie chart here at the bottom! :)
