Wikimedia Commons Depicts statements over time

July 22, 2025, by addshore

Wikimedia Commons now uses Structured Data on Commons (SDC) to make media information multilingual and machine-readable. A core part of SDC is the ‘depicts’ statement (P180), which identifies items clearly visible in a file. Depicts statements are crucial for MediaSearch, enabling it to find relevant results in any language by using Wikidata labels, as well as having a more precise definition and structure than the existing category system.

SDC functionalities began to roll out in 2019. Multilingual file captions were introduced early that year, enabling broader accessibility, followed by the ability to add depicts statements directly on file pages and through the UploadWizard.

Although there are numbers floating around showing a general increase in usage of structured data on Commons, there didn’t seem to be any concrete numbers around the growth in use of depicts statements.

I was particularly interested in this, as my tool WikiCrowd is steadily becoming a more and more efficient way of adding these statements en masse. So I decided to see what data I could come up with.

Getting the data

When going through historic data and generating previous versions of the Wikidata Map, I made use of JSON dumps of Wikidata that were stored on archive.org. Fortunately there were many saved for Wikidata; however, I didn’t have such luck for the Wikibase JSON dumps of Commons, which only seem to be preserved up until 2022.

After a bit of digging around, I found the wmf_content.mediawiki_content_history_v1 dataset that is stored within the Wikimedia analytics infrastructure (which I fortunately have access to). This dataset provides the full content of all revisions, past and present, from all Wikimedia wikis.

Knowing that this includes the Wikibase mediainfo content that stores the depicts statements, and that the structure of the internal JSON representation hasn’t changed much (if at all) throughout the years it has been deployed on Commons, this seemed like the right place to look.
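
Before writing anything bigger, it is worth peeking at the raw mediainfo JSON for a single revision. Here is a minimal sketch of doing so, assuming access to the analytics cluster (the table and column names match those used in the queries below):

-- Peek at the raw Wikibase mediainfo JSON for one file revision, to see
-- the structure (including $.statements.P180) that the later queries parse.
SELECT
  revision_id,
  revision_dt,
  revision_content_slots['mediainfo'].content_body AS mediainfo_json
FROM wmf_content.mediawiki_content_history_v1
WHERE wiki_id = 'commonswiki'
  AND page_namespace_id = 6
  AND revision_content_slots['mediainfo'].content_body IS NOT NULL
LIMIT 1;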

Depicts per revision

The data can all be accessed via Spark (which I had to jab and poke at the config for until it worked), and after much iteration I came to a query that would extract the base set of information I needed:

  • revision_id – Unique ID of the revision on Wikimedia Commons
  • revision_dt – Point in time that the revision was created
  • id – MediaInfo ID for the file
  • depicts_qids – Extracted Wikidata item IDs for depicts statements within the revision

The query:

  1. Looks for the mediainfo content element only, as this is where the Wikibase JSON actually lives.
  2. Decodes the content as a JSON object, extracting the $.id element and the $.statements.P180 element, and saving them for another query (parsed_data).
  3. This parsed_data, particularly the P180 statements, is then exploded into a list of mainsnak values.
  4. This explosion is then collected again as a list, using collect_list.
WITH parsed_data AS (
  SELECT
    revision_id,
    revision_dt,
    -- MediaInfo ID (e.g. M75908279) and the raw P180 statements JSON
    get_json_object(content, '$.id') AS id,
    get_json_object(content, '$.statements.P180') AS depicts_json
  FROM (
    SELECT
      revision_id,
      revision_dt,
      revision_content_slots['mediainfo'].content_body AS content
    FROM wmf_content.mediawiki_content_history_v1
    WHERE wiki_id = 'commonswiki'
      AND page_namespace_id = 6 -- File namespace
      AND revision_content_slots['mediainfo'].content_body IS NOT NULL
      AND CARDINALITY(revision_content_slots) = 2 -- main + mediainfo slots
  )
),
exploded_depicts AS (
  SELECT
    revision_id,
    revision_dt,
    id,
    -- one row per P180 statement, keeping only the mainsnak item ID
    explode(from_json(depicts_json, 'array<struct<mainsnak:struct<datavalue:struct<value:struct<id:string>>>>>')) AS depicts_statement
  FROM parsed_data
  WHERE depicts_json IS NOT NULL
)
SELECT
  revision_id,
  revision_dt,
  id,
  collect_list(depicts_statement.mainsnak.datavalue.value.id) AS depicts_qids
FROM exploded_depicts
GROUP BY revision_id, revision_dt, id
ORDER BY revision_dt;

And the result looks something like this:

revision_id | revision_dt         | id        | depicts_qids
346989938   | 2019-04-23 12:31:02 | M75908279 | [Q4022]
347004424   | 2019-04-23 15:08:17 | M75908279 | [Q4022, Q4022]
347004425   | 2019-04-23 15:08:19 | M75908279 | [Q4022, Q4022, Q12280]

Depicts per file each month

From here, we further reduce our dataset, which currently includes every revision of every file, into a set that includes only the latest revision of each file as of each month, with the following columns:

  • month – The month in question for the data
  • qid – Wikidata item ID that is being depicted
  • qid_count – Number of images that depicted the qid in the given month
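
The query below reads from a revisions_temp view holding the output of the per-revision query above. As a rough sketch of that wiring in Spark SQL (per_revision_depicts is a hypothetical placeholder for the full query shown earlier):

-- Register the per-revision results as a temporary view so the monthly
-- query can refer to them as revisions_temp. 'per_revision_depicts' is a
-- hypothetical stand-in for the full per-revision query above.
CREATE OR REPLACE TEMPORARY VIEW revisions_temp AS
SELECT revision_id, revision_dt, id, depicts_qids
FROM per_revision_depicts;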

When broken down, this query:

  1. Finds the earliest revision_dt, which marks the start of the window and the first depicts revision ever created.
  2. Generates a list of months from this earliest point in time to the current month, our date_spine.
  3. For each month, joins all prior revisions, assigns a row number, and keeps only the latest revision per file id.
  4. Splits the depicts_qids array into individual qid rows, so each qid appears on its own row for the corresponding month.
  5. Groups by month and qid, counts the occurrences, and orders the result.
WITH date_spine AS (
  -- one row per month, from the first depicts revision to the current month
  SELECT explode(sequence(
    DATE_TRUNC('MONTH', (SELECT MIN(revision_dt) FROM revisions_temp)),
    DATE_TRUNC('MONTH', CURRENT_DATE()),
    INTERVAL 1 MONTH
  )) AS month
),
latest_per_month AS (
  SELECT
    month,
    id,
    depicts_qids,
    -- rank each file's revisions within each month, newest first
    ROW_NUMBER() OVER (
      PARTITION BY month, id
      ORDER BY revision_dt DESC
    ) as rn
  FROM date_spine ds
  CROSS JOIN (
    SELECT
      id,
      revision_dt,
      depicts_qids,
      DATE_TRUNC('MONTH', revision_dt) as revision_month
    FROM revisions_temp
  ) r
  WHERE r.revision_dt < ds.month -- only revisions before the month starts
),
final_states AS (
  -- the latest revision of each file as of each month
  SELECT
    month,
    id,
    depicts_qids
  FROM latest_per_month
  WHERE rn = 1 AND depicts_qids IS NOT NULL
),
exploded_qids AS (
  -- one row per depicted item per file per month
  SELECT
    month,
    explode(depicts_qids) AS qid
  FROM final_states
)
SELECT
  month,
  qid,
  COUNT(*) AS qid_count
FROM exploded_qids
GROUP BY month, qid
ORDER BY month, qid;

Finally leading to a result that looks something like this…

month               | qid      | qid_count
2019-05-01 00:00:00 | Q100     | 11
2019-05-01 00:00:00 | Q1003545 | 190
2019-05-01 00:00:00 | Q1010400 | 2

And if you are interested, and have access, this dataset is currently stored in addshore.commons_depicts_monthly, and the raw notebook is stored in a GitHub gist.
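
If you do have access, pulling a plot-ready time series for a single item is then a one-liner. A minimal sketch, using Q34442 (road, one of the items discussed below) as an example:

-- Monthly depicts counts for a single Wikidata item, ready for plotting.
SELECT month, qid_count
FROM addshore.commons_depicts_monthly
WHERE qid = 'Q34442'
ORDER BY month;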

Looking at the graphs

I would embed the HTML of these graphs to make them interactive, but it seems I kept too much data in some of them, so pictures will have to suffice…

The first graph looks at all Wikidata items that are depicted more than 2500 times (a sketch of this selection against the monthly dataset follows the list). The top 5 items are:

  • Q34442 (road) – wide way leading from one place to another, especially one with a specially prepared surface which vehicles can use
  • Q5004679 (path) – small road or street
  • Q532 (village) – small clustered human settlement smaller than a town
  • Q3947 (house) – building usually intended for living in
  • Q11451 (agriculture) – cultivation of plants and animals to provide useful products
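
A selection like this can be expressed directly against the monthly dataset. A small sketch, assuming the 2500 threshold is applied to the most recent month:

-- Items depicted more than 2500 times in the most recent month; a rough
-- approximation of how the set of qids for the first graph could be chosen.
SELECT qid, qid_count
FROM addshore.commons_depicts_monthly
WHERE month = (SELECT MAX(month) FROM addshore.commons_depicts_monthly)
  AND qid_count > 2500
ORDER BY qid_count DESC;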

When looking at the same data on a log scale, it’s even easier to see a dip in depicts statements for various items between February 2024 and 2025. I believe this is due to the mass revert of computer-aided tagging by SchlurcherBot, after the experiment was evaluated and deemed a failure. See T339902 for the evaluation details.

It’s clear to see that road has been winning since July 2021, when it overtook Q527 (sky), which was then in the top spot, Q828144 (floor exercise), which was second, and Q623270 (horizontal bar), which was third.

Having a look at the earlier days of the structured data, up until August 2021, we can see that a very different set of Wikidata items was taking the top spots.

  • Q41176 (building) – structure, typically with a roof and walls, standing more or less permanently in one place
  • Q10884 (tree) – perennial woody plant
  • Q527 (sky) – everything that is above the surface of a planet, typically the Earth
  • Q3947 (house) – building usually intended for living in
  • Q16970 (church building) – building for Christian worship

But most of these have since been left in the dust of others.

I thought it might be a nice idea to look at the first ~2025 Wikidata items, to see how the original set of items on Wikidata is being used as part of the Commons depicts statements.

Again we see Q532 (village) here, which takes the clear top spot. There was a sharp spike in Q525 (sun) around March 2024, and, interestingly, in Q801 (Israel).

I also wanted to have a look at some of the Wikidata items that have been used as part of my WikiCrowd tool over the years.

Back in early 2022 I wrote my first blog post on the tool after initially putting it online. In May 2025 I reworked the tool a fair bit at and around the Wikimedia Hackathon, adding a grid view and making it even easier to add statements en masse. I think both of these events can be seen to have caused spikes in depicts statements for these items.

However, what on earth happened in April and May of 2023? :D

So…

I really like this approach of reviewing historical structured data on Wikimedia Commons; it was far less complex than downloading a bunch of JSON dumps, and I could likely have arrived at the same dataset using the downloadable XML dumps, with some additional time and extraction work.

A similar method won’t be quite as easy for the Wikidata data stored in the same wmf_content.mediawiki_content_history_v1 table, as the JSON structure stored in MediaWiki content has changed at least once; however, adapting the queries to extract both formats would likely be trivial.

This new history table (which I have not used before) is stored using the Iceberg table format. One consequence of this is that new data is simply appended to the dataset, rather than the data needing to be entirely reloaded, which I have previously seen happen with other analytics datasets.

I can see how this sort of analytical data could now be easily incrementally calculated month by month as part of the WMF data lake, and exposed via some very nice APIs, looking across the structured data exposed by the Wikibases used within Wikimedia sites. Currently, tools such as commonswalkabout provide similar point-in-time snapshots of the state of statement usage on the projects via SPARQL queries, but these queries are often slow.
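
As a rough illustration of what such an incremental monthly step could look like (the table names and layout are assumptions based on my dataset above, not an existing WMF pipeline):

-- Sketch of an incremental monthly step: compute just the current month's
-- snapshot and append it, rather than rebuilding the whole history.
-- Assumes revisions_temp is kept up to date; names are illustrative only.
INSERT INTO addshore.commons_depicts_monthly
SELECT
  DATE_TRUNC('MONTH', CURRENT_DATE()) AS month,
  qid,
  COUNT(*) AS qid_count
FROM (
  SELECT explode(depicts_qids) AS qid
  FROM (
    SELECT
      depicts_qids,
      -- latest revision of each file before the start of this month,
      -- matching the windowing used in the full monthly query above
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY revision_dt DESC) AS rn
    FROM revisions_temp
    WHERE revision_dt < DATE_TRUNC('MONTH', CURRENT_DATE())
  ) latest
  WHERE rn = 1 AND depicts_qids IS NOT NULL
) exploded
GROUP BY qid;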

Ultimately, if this could be productionised in some way, it would be very nice to display this kind of usage information on the Wikidata properties and items that are being used, and also somewhere on Wikimedia Commons.