Wikimedia Commons Depicts statements over time

July 22, 2025, by addshore

Wikimedia Commons now uses Structured Data on Commons (SDC) to make media information multilingual and machine-readable. A core part of SDC is the ‘depicts’ statement (P180), which identifies items clearly visible in a file. Depicts statements are crucial for MediaSearch, enabling it to find relevant results in any language by using Wikidata labels, as well as having a more precise definition and structure than the existing category system.

SDC functionalities began to roll out in 2019. Multilingual file captions were introduced early that year, enabling broader accessibility, followed by the ability to add depicts statements directly on file pages and through the UploadWizard.

Although there are numbers floating around showing a general increase in usage of structured data on Commons, there didn’t seem to be any concrete numbers around the growth in use of depicts statements.

I was particularly interested in this, as my tool WikiCrowd is steadily becoming a more and more efficient way of adding these statements en masse. So I decided to see what data I could come up with.

Getting the data

When going through historic data and generating previous versions of the Wikidata Map, I made use of JSON dumps of Wikidata that were stored on archive.org. Fortunately there were many saved for Wikidata; however, I didn’t have such luck for the Wikibase JSON dumps of Commons, which only seem to be preserved up until 2022.

After a bit of digging around, I found the wmf_content.mediawiki_content_history_v1 dataset that is stored within the Wikimedia analytics infrastructure (which I fortunately have access to). This dataset provides the full content of all revisions, past and present, from all Wikimedia wikis.

Knowing that this includes the Wikibase mediainfo content that stores the depicts statements, and that the structure of the internal JSON representation hasn’t changed much (if at all) throughout the years it has been deployed on Commons, this seemed like the right place to look.
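
Before writing anything bigger, it is worth peeking at the raw mediainfo JSON for a single revision. Here is a minimal sketch of doing so, assuming access to the analytics cluster (the table and column names match those used in the queries below):

-- Peek at the raw Wikibase mediainfo JSON for one file revision, to see
-- the structure (including $.statements.P180) that the later queries parse.
SELECT
  revision_id,
  revision_dt,
  revision_content_slots['mediainfo'].content_body AS mediainfo_json
FROM wmf_content.mediawiki_content_history_v1
WHERE wiki_id = 'commonswiki'
  AND page_namespace_id = 6
  AND revision_content_slots['mediainfo'].content_body IS NOT NULL
LIMIT 1;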

Depicts per revision

The data can all be accessed via Spark (which I had to jab and poke at the config for until it worked), and after much iteration I came to a query that would extract the base set of information I needed:

  • revision_id – Unique ID of the revision on Wikimedia Commons
  • revision_dt – Point in time that the revision was created
  • id – MediaInfo ID for the file
  • depicts_qids – Extracted Wikidata item IDs for depicts statements within the revision

The query:

  1. Looks for the mediainfo content element only, as this is where the Wikibase JSON actually lives.
  2. Decodes the content as a JSON object, extracting the $.id element and the $.statements.P180 element, and saving them for another query (parsed_data).
  3. This parsed_data, particularly the P180 statements, is then exploded into a list of mainsnak values.
  4. This explosion is then collected again as a list, using collect_list.
WITH parsed_data AS (
  SELECT
    revision_id,
    revision_dt,
    -- MediaInfo ID (e.g. M75908279) and the raw P180 statements JSON
    get_json_object(content, '$.id') AS id,
    get_json_object(content, '$.statements.P180') AS depicts_json
  FROM (
    SELECT
      revision_id,
      revision_dt,
      revision_content_slots['mediainfo'].content_body AS content
    FROM wmf_content.mediawiki_content_history_v1
    WHERE wiki_id = 'commonswiki'
      AND page_namespace_id = 6 -- File namespace
      AND revision_content_slots['mediainfo'].content_body IS NOT NULL
      AND CARDINALITY(revision_content_slots) = 2 -- main + mediainfo slots
  )
),
exploded_depicts AS (
  SELECT
    revision_id,
    revision_dt,
    id,
    -- one row per P180 statement, keeping only the mainsnak item ID
    explode(from_json(depicts_json, 'array<struct<mainsnak:struct<datavalue:struct<value:struct<id:string>>>>>')) AS depicts_statement
  FROM parsed_data
  WHERE depicts_json IS NOT NULL
)
SELECT
  revision_id,
  revision_dt,
  id,
  collect_list(depicts_statement.mainsnak.datavalue.value.id) AS depicts_qids
FROM exploded_depicts
GROUP BY revision_id, revision_dt, id
ORDER BY revision_dt;

And the result looks something like this:

revision_id | revision_dt         | id        | depicts_qids
346989938   | 2019-04-23 12:31:02 | M75908279 | [Q4022]
347004424   | 2019-04-23 15:08:17 | M75908279 | [Q4022, Q4022]
347004425   | 2019-04-23 15:08:19 | M75908279 | [Q4022, Q4022, Q12280]

Depicts per file each month

From here, we further reduce our dataset, which currently includes every revision of every file, into a set that includes only the latest revision of each file as of each month, with the following columns:

  • month – The month in question for the data
  • qid – Wikidata item ID that is being depicted
  • qid_count – Number of images that depicted the qid in the given month
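
The query below reads from a revisions_temp view holding the output of the per-revision query above. As a rough sketch of that wiring in Spark SQL (per_revision_depicts is a hypothetical placeholder for the full query shown earlier):

-- Register the per-revision results as a temporary view so the monthly
-- query can refer to them as revisions_temp. 'per_revision_depicts' is a
-- hypothetical stand-in for the full per-revision query above.
CREATE OR REPLACE TEMPORARY VIEW revisions_temp AS
SELECT revision_id, revision_dt, id, depicts_qids
FROM per_revision_depicts;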

When broken down, this query:

  1. Finds the earliest revision_dt, which marks the start of the window and the first depicts revision ever created.
  2. Generates a list of months from this earliest point in time to the current month, our date_spine.
  3. For each month, joins all prior revisions, assigns a row number, and keeps only the latest revision per file id.
  4. Splits the depicts_qids array into individual qid rows, so each qid appears on its own row for the corresponding month.
  5. Groups by month and qid, counts the occurrences, and orders the result.
WITH date_spine AS (
  -- one row per month, from the first depicts revision to the current month
  SELECT explode(sequence(
    DATE_TRUNC('MONTH', (SELECT MIN(revision_dt) FROM revisions_temp)),
    DATE_TRUNC('MONTH', CURRENT_DATE()),
    INTERVAL 1 MONTH
  )) AS month
),
latest_per_month AS (
  SELECT
    month,
    id,
    depicts_qids,
    -- rank each file's revisions within each month, newest first
    ROW_NUMBER() OVER (
      PARTITION BY month, id
      ORDER BY revision_dt DESC
    ) as rn
  FROM date_spine ds
  CROSS JOIN (
    SELECT
      id,
      revision_dt,
      depicts_qids,
      DATE_TRUNC('MONTH', revision_dt) as revision_month
    FROM revisions_temp
  ) r
  WHERE r.revision_dt < ds.month -- only revisions before the month starts
),
final_states AS (
  -- the latest revision of each file as of each month
  SELECT
    month,
    id,
    depicts_qids
  FROM latest_per_month
  WHERE rn = 1 AND depicts_qids IS NOT NULL
),
exploded_qids AS (
  -- one row per depicted item per file per month
  SELECT
    month,
    explode(depicts_qids) AS qid
  FROM final_states
)
SELECT
  month,
  qid,
  COUNT(*) AS qid_count
FROM exploded_qids
GROUP BY month, qid
ORDER BY month, qid;

Finally leading to a result that looks something like this…

month               | qid      | qid_count
2019-05-01 00:00:00 | Q100     | 11
2019-05-01 00:00:00 | Q1003545 | 190
2019-05-01 00:00:00 | Q1010400 | 2

And if you are interested, and have access, this dataset is currently stored in addshore.commons_depicts_monthly, and the raw notebook is stored in a GitHub gist.
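
If you do have access, pulling a plot-ready time series for a single item is then a one-liner. A minimal sketch, using Q34442 (road, one of the items discussed below) as an example:

-- Monthly depicts counts for a single Wikidata item, ready for plotting.
SELECT month, qid_count
FROM addshore.commons_depicts_monthly
WHERE qid = 'Q34442'
ORDER BY month;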

Looking at the graphs

I would embed the HTML of these graphs to make them interactive, but it seems I kept too much data in some of them, so pictures will have to suffice…

The first graph looks at all Wikidata items that are depicted more than 2500 times (a sketch of this selection against the monthly dataset follows the list). The top 5 items are:

  • Q34442 (road) – wide way leading from one place to another, especially one with a specially prepared surface which vehicles can use
  • Q5004679 (path) – small road or street
  • Q532 (village) – small clustered human settlement smaller than a town
  • Q3947 (house) – building usually intended for living in
  • Q11451 (agriculture) – cultivation of plants and animals to provide useful products
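
A selection like this can be expressed directly against the monthly dataset. A small sketch, assuming the 2500 threshold is applied to the most recent month:

-- Items depicted more than 2500 times in the most recent month; a rough
-- approximation of how the set of qids for the first graph could be chosen.
SELECT qid, qid_count
FROM addshore.commons_depicts_monthly
WHERE month = (SELECT MAX(month) FROM addshore.commons_depicts_monthly)
  AND qid_count > 2500
ORDER BY qid_count DESC;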

When looking at the same data on a log scale, it’s even easier to see a dip in depicts statements for various items between February 2024 and 2025. I believe this is due to the mass revert of computer-aided tagging by SchlurcherBot, after the experiment was evaluated and deemed a failure. See T339902 for the evaluation details.

It’s clear to see that road has been winning since July 2021, when it overtook Q527 (sky), which was then in the top spot, Q828144 (floor exercise), which was second, and Q623270 (horizontal bar), which was third.

Having a look at the earlier days of the structured data, up until August 2021, we can see that a very different set of Wikidata items was taking the top spots.

  • Q41176 (building) – structure, typically with a roof and walls, standing more or less permanently in one place
  • Q10884 (tree) – perennial woody plant
  • Q527 (sky) – everything that is above the surface of a planet, typically the Earth
  • Q3947 (house) – building usually intended for living in
  • Q16970 (church building) – building for Christian worship

But most of these have since been left in the dust of others.

I thought it might be a nice idea to look at the first ~2025 Wikidata items, to see how the original set of items on Wikidata is being used as part of the Commons depicts statements.

Again we see Q532 (village) here, which takes the clear top spot. There was a sharp spike in Q525 (sun) around March 2024, and, interestingly, in Q801 (Israel).

I also wanted to have a look at some of the Wikidata items that have been used as part of my WikiCrowd tool over the years.

Back in early 2022 I wrote my first blog post on the tool after initially putting it online. In May 2025 I reworked the tool a fair bit at and around the Wikimedia Hackathon, adding a grid view and making it even easier to add statements en masse. I think both of these events can be seen to have caused spikes in depicts statements for these items.

However, what on earth happened in April and May of 2023? :D

So…

I really like this approach of reviewing historical structured data on Wikimedia Commons; it was far less complex than downloading a bunch of JSON dumps, and I could likely have arrived at the same dataset using the downloadable XML dumps, with some additional time and extraction work.

A similar method won’t be quite as easy for the Wikidata data stored in the same wmf_content.mediawiki_content_history_v1 table, as the JSON structure stored in MediaWiki content has changed at least once; however, adapting the queries to extract both formats would likely be trivial.

This new history table (which I have not used before) is stored using the Iceberg table format. One consequence of this is that new data is simply appended to the dataset, rather than the data needing to be entirely reloaded, which I have previously seen happen with other analytics datasets.

I can see how this sort of analytical data could now be easily incrementally calculated month by month as part of the WMF data lake, and exposed via some very nice APIs, looking across the structured data exposed by the Wikibases used within Wikimedia sites. Currently, tools such as commonswalkabout provide similar point-in-time snapshots of the state of statement usage on the projects via SPARQL queries, but these queries are often slow.
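
As a rough illustration of what such an incremental monthly step could look like (the table names and layout are assumptions based on my dataset above, not an existing WMF pipeline):

-- Sketch of an incremental monthly step: compute just the current month's
-- snapshot and append it, rather than rebuilding the whole history.
-- Assumes revisions_temp is kept up to date; names are illustrative only.
INSERT INTO addshore.commons_depicts_monthly
SELECT
  DATE_TRUNC('MONTH', CURRENT_DATE()) AS month,
  qid,
  COUNT(*) AS qid_count
FROM (
  SELECT explode(depicts_qids) AS qid
  FROM (
    SELECT
      depicts_qids,
      -- latest revision of each file before the start of this month,
      -- matching the windowing used in the full monthly query above
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY revision_dt DESC) AS rn
    FROM revisions_temp
    WHERE revision_dt < DATE_TRUNC('MONTH', CURRENT_DATE())
  ) latest
  WHERE rn = 1 AND depicts_qids IS NOT NULL
) exploded
GROUP BY qid;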

Ultimately, if this could be productionised in some way, it would be very nice to display this kind of usage information on the Wikidata properties and items that are being used, and also somewhere on Wikimedia Commons.