Wikidata, instance of and subclass of through time (P31 & P279)
Last month I looked at all Wikimedia Commons revisions and managed to generate some data and graphs for the usage of depicts statements since they were introduced on the project.
This month, I have applied the same analysis to Wikidata, looking at instance of (P31) and subclass of (P279) statements on items. A slightly bigger data set, but essentially the same process.
This will enable easy updating of various pie charts that have been published over the years, such as:
- https://commons.wikimedia.org/wiki/File:WikidataStatisticsofWikipediaType_of_content.png from 2015
- Wikidata:Statistics pie chart, which is generated by Module:Statistical_data/by_project/classes, but has not been updated since 2020
- https://commons.wikimedia.org/wiki/File:Wikidata_content_2024.svg which was generated in 2024
In the future, this could easily be adapted to show per-Wikipedia-project graphs, such as those that are currently at Wikidata:Statistics/Wikipedia
Method
The details of the method can be seen in the code in my previous post about depicts statements, and the method mostly stays the same here.
In words:
- Look at every revision of Wikidata ever
- Parse the entity JSON of each revision to determine its P31 and P279 values
- Find the latest revision of each item in each given month, and thus the state of all items in that month
- Plot the data by the number of items that have each value item as a P31 or P279 value (a rough sketch of these steps follows this list)
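That sketch, in PySpark: it assumes access to the wmf.mediawiki_wikitext_history table in the Wikimedia data lake (field names approximate), is not the exact notebook code, and skips carrying an item’s state forward through months in which it had no edits.

```python
import json
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

def extract_p31_p279(text):
    """Return the item IDs used as P31/P279 values in one revision's entity JSON."""
    values = []
    try:
        claims = json.loads(text).get("claims", {})
        for prop in ("P31", "P279"):
            for statement in claims.get(prop, []):
                datavalue = statement.get("mainsnak", {}).get("datavalue", {})
                if datavalue.get("type") == "wikibase-entityid":
                    values.append(datavalue["value"]["id"])
    except (ValueError, TypeError, AttributeError):
        # Not the current entity JSON format (e.g. very old serializations).
        pass
    return values

extract_udf = F.udf(extract_p31_p279, T.ArrayType(T.StringType()))

revisions = (
    spark.table("wmf.mediawiki_wikitext_history")
    .where(F.col("wiki_db") == "wikidatawiki")
    .where(F.col("page_namespace") == 0)  # items live in the main namespace
    .select(
        F.col("page_title").alias("item_id"),
        F.col("revision_timestamp").alias("ts"),
        extract_udf("revision_text").alias("values"),
    )
    # Assuming ISO-8601 timestamps: "2016-03-01T12:34:56Z" -> "2016-03".
    .withColumn("month", F.substring("ts", 1, 7))
)

# Keep only the latest revision of each item within each month.
w = Window.partitionBy("item_id", "month").orderBy(F.col("ts").desc())
latest_per_month = (
    revisions.withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") == 1)
    .drop("rn")
)

# Count items per P31/P279 value per month; explode_outer keeps items with
# no values at all as a null row, which later becomes the UNALLOCATED bucket.
counts = (
    latest_per_month.select("month", F.explode_outer("values").alias("item_value"))
    .groupBy("month", "item_value")
    .count()
)
```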
There are some minor defects in this logic that could be cleaned up in future iterations:
- Deleted items will continue to be counted, as I don’t take into account the point at which items are deleted
- Things will be double counted in this data, as one item may have multiple P31 and P279 values, and I make no attempt to merge these into higher-level concepts
We make an OTHER and an UNALLOCATED count as part of the final data summarization. OTHER accounts for values that did not make it into the top 20 by item count, and UNALLOCATED means that an item didn’t have a P31 or P279 value in its latest revision. A minimal sketch of this bucketing is below.
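This pandas sketch assumes a per-month DataFrame `counts` with columns `item_value` (None where an item had no P31/P279 value) and `count`; the names are illustrative rather than the notebook’s own.

```python
import pandas as pd

def summarise(counts: pd.DataFrame, top_n: int = 20) -> pd.DataFrame:
    """Bucket a month's per-value counts into top-N, OTHER and UNALLOCATED."""
    # Items with no P31/P279 value in their latest revision (item_value is None).
    unallocated = counts.loc[counts["item_value"].isna(), "count"].sum()
    allocated = counts.dropna(subset=["item_value"]).sort_values("count", ascending=False)
    summary = pd.concat(
        [
            allocated.head(top_n),
            pd.DataFrame(
                {
                    "item_value": ["OTHER", "UNALLOCATED"],
                    "count": [allocated["count"].iloc[top_n:].sum(), unallocated],
                }
            ),
        ],
        ignore_index=True,
    )
    summary["percentage"] = (100 * summary["count"] / summary["count"].sum()).round(2)
    return summary.sort_values("count", ascending=False, ignore_index=True)
```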
2025
For August 2025 (or at least part way through it), this is the current state of Wikidata per the above method.
You can now find a PNG of this pie chart on Wikimedia Commons https://commons.wikimedia.org/wiki/File:Wikidata_P31_%26_P279_analysis_August_2025.png
This pie chart would be easy to generate for any given month, or even point in time, given the method used.
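For illustration, a minimal matplotlib sketch that renders one month’s summary table (as produced by the summarise() sketch above; all names are assumptions):

```python
import matplotlib.pyplot as plt

def plot_month(summary, month):
    """Render one month's summary table as a pie chart and save it as a PNG."""
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.pie(summary["count"], labels=summary["item_value"], autopct="%1.2f%%")
    ax.set_title(f"Wikidata P31 & P279 analysis, {month}")
    fig.savefig(f"wikidata_p31_p279_{month}.png", bbox_inches="tight")

# e.g. plot_month(summarise(counts_for_august), "August 2025")
```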
The raw data is below, where we can see that some of the values included in the chart do indeed have an overlapping conceptual space, such as star, infrared source, and galaxy.
So astronomical objects likely actually make up around 6.5% of Wikidata.
| P31 / P279 value | Count | Percentage |
|---|---|---|
| “scholarly article” | 45,192,839 | 35.26 |
| OTHER | 32,587,231 | 25.42 |
| “human” | 12,496,296 | 9.75 |
| UNALLOCATED | 6,673,336 | 5.21 |
| “Wikimedia category” | 5,679,651 | 4.43 |
| “taxon” | 3,783,682 | 2.95 |
| “star” | 3,635,045 | 2.84 |
| “infrared source” | 2,621,809 | 2.05 |
| “galaxy” | 2,133,424 | 1.66 |
| “protein” | 1,782,975 | 1.39 |
| “gene” | 1,679,022 | 1.31 |
| “Wikimedia disambiguation page” | 1,513,507 | 1.18 |
| “type of chemical entity” | 1,278,804 | 1.00 |
| “chemical compound” | 1,065,619 | 0.83 |
| “painting” | 1,057,459 | 0.83 |
| “protein-coding gene” | 969,272 | 0.76 |
| “Wikimedia template” | 824,881 | 0.64 |
| “street” | 708,719 | 0.55 |
| “family name” | 658,502 | 0.51 |
| “encyclopedia article” | 640,305 | 0.50 |
| “version, edition or translation” | 599,485 | 0.47 |
| “village of the People’s Republic of China” | 592,629 | 0.46 |
Over time
Rather than creating hundreds of pie charts to show the view of Wikidata over time, we can display this on a single chart, similar to what we have done before.
It’s rather interesting seeing the large dips in OTHER and UNALLOCATED back in 2016. I assume a lot of statements were added then? Or perhaps this dates back to the time before P31 and P279 etc. were fully adopted?
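For reference, a sketch of how such a stacked over-time view can be produced, using a toy stand-in for the real monthly summary data (all names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the real per-month summary (month, item_value, count).
monthly_summary = pd.DataFrame(
    {
        "month": ["2016-01", "2016-01", "2016-02", "2016-02"],
        "item_value": ["human", "OTHER", "human", "OTHER"],
        "count": [100, 50, 120, 40],
    }
)

# One stacked area per P31/P279 value, one x-axis point per month.
pivot = (
    monthly_summary.pivot(index="month", columns="item_value", values="count")
    .fillna(0)
    .sort_index()
)
ax = pivot.plot.area(figsize=(14, 8))
ax.set_title("Wikidata items by P31 / P279 value over time")
ax.set_ylabel("Items")
plt.savefig("wikidata_p31_p279_over_time.png", bbox_inches="tight")
```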
Continuing this
We can always regenerate this sort of data, as every revision of Wikidata is still available. We can also decide to change how we evaluate certain things, and reevaluate the whole data set with ease.
One part of integrating this more closely with Wikimedia and Wikidata itself would be T341649 (Provide an easy way for MediaWiki to fetch aggregate data from the data lake), where I left a comment. This data is all generated from the Wikimedia data lake, and the tables still exist there. Doing this in a structured and automated way each month would likely be easy enough, and the results could then, for example, be included on wiki within WikiProjects and/or on pages for individual properties too.
addshore/wikimedia-notebooks on GitHub includes all the notebook code used to generate this data, in commit e67ca7d394ced5c82530b3d25718610bd8fc649a. This is currently split across 3 notebooks, each of which saves data into Hadoop, in tables under my user, as intermediary steps.
- history_wikidata.ipynb: Extraction of data on a per revision level
- history_wikidata_monthly.ipynb: Aggregation into monthly snapshots
- history_wikidata_monthly_export_instance_subclass.ipynb: Analysis of P31 and P279 specifically
If anyone wants to recreate this as part of airflow to be progressively built, let me know! (Or if you just like looking at the pretty graphs…)
For the benefit of web scraping and previews… I’ll include the PNG of the pie chart here at the bottom! :)
