Wikidata ontological tree of Trains

January 7, 2022, by addshore

While working on my recent WikiCrowd project I ended up looking at the ontological tree of both Wikidata entities and Wikimedia Commons categories.

In this post, I’ll look at some of the ontology mappings that happen between projects, some of the SPARQL that can help you use this ontology in tools, and also some tools to help you explore this complex tree.

I’m using trains as I think they are fairly easy for most folks to relate to, and also don’t have a massively complex tree.

Commons & Wikidata mapping

Depicts questions in WikiCrowd are generated entirely from Wikimedia Commons categories, such as Category:Trains & Category:Steam locomotives. These are then mapped to items on Wikidata such as Q870 (train) & Q171043 (steam locomotive).

Wikimedia Commons categories quite often contain infoboxes on the right-hand side that link to a variety of resources for the thing the category is covering. Quite often a Wikidata item ID is present too, as is the case for the categories above.

Likewise, on Wikidata, P373 (Commons category) statements will often exist for entities that are depicted on Commons.

In theory, this means that I could run a simple SPARQL query to find all Wikidata items that have matching categories on Commons, throw them into the WikiCrowd app and let people answer depicts questions.
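
As a rough illustration of that "simple query" idea, the Python sketch below builds a SPARQL query string asking for every item in the subclass tree of a root item that also has a P373 (Commons category) statement. The function name `build_commons_category_query` and the variable names are hypothetical and not taken from WikiCrowd; the function only builds a string and does not contact the query service.

```python
import re

# Wikidata item IDs look like "Q" followed by digits, e.g. Q870 (train).
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def build_commons_category_query(root_qid: str) -> str:
    """Return a SPARQL query for items under root_qid that have a Commons category.

    Hypothetical helper for illustration; it only builds a string.
    """
    if not QID_PATTERN.fullmatch(root_qid):
        raise ValueError(f"not a Wikidata item ID: {root_qid!r}")
    return (
        "SELECT DISTINCT ?i ?category\n"
        "WHERE {\n"
        # The root item itself, or anything below it via P279 (subclass of).
        f"  ?i wdt:P279* wd:{root_qid} .\n"
        # Keep only items that have a P373 (Commons category) statement.
        "  ?i wdt:P373 ?category .\n"
        "}"
    )


if __name__ == "__main__":
    print(build_commons_category_query("Q870"))
```

The resulting string can be pasted into the Wikidata Query Service to see which items come back for Q870 (train).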

Unfortunately, the world is a little more complex than that. On both sides of the equation, the ontological tree has some hidden surprises. For example, Category:Trains includes Category:Views from trains as a subcategory, which will mostly not contain images of trains.

Wikidata ontology

Generally speaking, you want to keep depicts statements to the most specific description of the thing being depicted. You can read more about this in the depicts project page on Commons. So if an image depicts a steam locomotive, it probably shouldn’t also say that it depicts a more generic locomotive.

But here we hit our first problem, as I had to say locomotive above, and not train!

As you can see on the right, locomotives make up a part of a train, rather than being an instance of or subclass of train. The description of train is currently “form of rail transport consisting of a series of connected vehicles”, implying multiple carriages.

So, a steam train is probably powered by a steam locomotive, but it in itself doesn’t constitute a train unless it’s made of multiple carriages? Or does it? This is a question for another time…

Another part of the Wikimedia Commons advice on creating depicts statements is also relevant here. I drew wheels in this diagram! Does that mean wheels should be depicted? Imagine how many more parts of a train or locomotive there are in one of these photos!

Visually exploring the wider tree

You need to understand the area of data that you are working with; otherwise you’ll end up making incorrect changes or incorrect assumptions.

Many tools exist to help you explore this space, but the one that I’m going to suggest today is a Wikidata visualization tool made by metaphacts.

This tool enables easy creation of diagrams and exploration of relations, such as the expanded view of trains & locomotives, all the way up to vehicles shown below. (Permalink to this diagram on the tool)

You can start on the tool homepage, and search for a single Wikidata item, such as Q870 (train).

This will load the item as the first node in your diagram, and prompt you to explore connections.

If you search for P279 (subclass of) you’ll find relations in both directions. On one side, you’ll find the more generic class of Q1301433 (land vehicle). The other side doesn’t contain locomotive, for the reasons stated above. For that relation you’ll have to take a look at P527 (has part).
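
The difference between the two properties can be made concrete with a toy example. The sketch below stores a handful of statements, as this post describes them, as (subject, property, object) tuples in memory and looks up neighbours in both directions. Items are written as plain labels to keep it readable, and the helper names are hypothetical, not part of any Wikidata tool.

```python
# Toy statements based on the relationships described in this post.
# Labels are used instead of item IDs to keep the example readable.
STATEMENTS = [
    ("train", "P279", "land vehicle"),           # subclass of
    ("train", "P527", "locomotive"),             # has part
    ("steam locomotive", "P279", "locomotive"),  # subclass of
]


def outgoing(item: str, prop: str) -> set[str]:
    """Objects of statements of the form: item --prop--> object."""
    return {o for s, p, o in STATEMENTS if s == item and p == prop}


def incoming(item: str, prop: str) -> set[str]:
    """Subjects of statements of the form: subject --prop--> item."""
    return {s for s, p, o in STATEMENTS if o == item and p == prop}


if __name__ == "__main__":
    print(outgoing("train", "P279"))  # more generic classes
    print(incoming("train", "P279"))  # more specific classes (none in this toy data)
    print(outgoing("train", "P527"))  # parts of a train
```

Locomotive only shows up when asking about P527, never in either direction of P279 from train.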

In SPARQL

The above methods are all well and good for people, but what if you want to write a tool making use of this tree in some way?

This is exactly what WikiCrowd currently does for depicts statement questions, in order to check whether a less or more specific statement already exists before making an edit.
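
To make that kind of check concrete, here is a minimal sketch of how a less or more specific existing statement could be detected. It assumes the transitive superclasses of each item have already been fetched (for example via SPARQL) into a plain dictionary. The function and variable names are hypothetical, the data is a toy example, and this is not WikiCrowd's actual code.

```python
def related_depicts(candidate, existing, superclasses):
    """Split existing depicts values into those less and more specific than candidate.

    superclasses maps an item to the set of all its (transitive) superclasses.
    Returns a (less_specific, more_specific) pair of sets.
    """
    # Existing values that are superclasses of the candidate are less specific.
    less_specific = {e for e in existing if e in superclasses.get(candidate, set())}
    # Existing values that have the candidate as a superclass are more specific.
    more_specific = {e for e in existing if candidate in superclasses.get(e, set())}
    return less_specific, more_specific


if __name__ == "__main__":
    # Toy data using labels rather than item IDs.
    toy_superclasses = {
        "steam locomotive": {"locomotive", "vehicle"},
        "locomotive": {"vehicle"},
    }
    print(related_depicts("steam locomotive", {"locomotive"}, toy_superclasses))
    print(related_depicts("locomotive", {"steam locomotive"}, toy_superclasses))
```

If either returned set is non-empty, a tool could skip the edit or ask a human to decide which statement should stay.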

Looking up the tree at the superclasses, or less specific items, you can do something like this:

SELECT DISTINCT ?i ?iLabel
WHERE {
  # Everything reachable from train (Q870) by one or more "subclass of" steps
  wd:Q870 wdt:P279+ ?i .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
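
A tool would typically send a query like this to the Wikidata Query Service and read back the standard SPARQL JSON results. The sketch below uses only the Python standard library. The helper names and the User-Agent string are placeholders, and error handling and rate limiting are left out.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Same superclass lookup as above, without the label service.
SUPERCLASS_QUERY = """
SELECT DISTINCT ?i
WHERE {
  wd:Q870 wdt:P279+ ?i .
}
"""


def run_query(query: str) -> dict:
    """Send a SPARQL query to the endpoint and return the decoded JSON results."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # Placeholder User-Agent; replace it with one identifying your own tool.
    request = urllib.request.Request(url, headers={"User-Agent": "example-tool/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_qids(results: dict, variable: str = "i") -> list[str]:
    """Pull item IDs out of SPARQL JSON results for one variable."""
    qids = []
    for binding in results["results"]["bindings"]:
        uri = binding[variable]["value"]  # e.g. http://www.wikidata.org/entity/Q1301433
        qids.append(uri.rsplit("/", 1)[-1])
    return qids
```

With network access, `extract_qids(run_query(SUPERCLASS_QUERY))` would return the superclasses of Q870 (train) as a list of item IDs.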

Or, looking down the tree for all instances and subclasses (including instances of subclasses, subclasses of subclasses, and so on), you can do something like this:

SELECT DISTINCT ?i ?iLabel
WHERE {
  # One "instance of" or "subclass of" step, then any number of "subclass of" steps, ending at train (Q870)
  ?i wdt:P31/wdt:P279*|wdt:P279/wdt:P279* wd:Q870 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
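
The property path in that query can be hard to read at first. As an aid, the sketch below evaluates the same pattern over a small in-memory set of statements: take one P31 or P279 step up from an item, then follow zero or more P279 steps, and see whether the target class is reached. The toy data and helper names are invented for illustration, and "example express" stands in for any individual item.

```python
# Toy statements, invented for illustration: item -> set of direct targets.
P31 = {"example express": {"steam train"}}  # instance of
P279 = {                                    # subclass of
    "steam train": {"train"},
    "high-speed train": {"train"},
    "train": {"land vehicle"},
}


def superclass_closure(start: set[str]) -> set[str]:
    """Follow P279 zero or more times from every item in start (the P279* part)."""
    seen = set(start)
    todo = list(start)
    while todo:
        current = todo.pop()
        for parent in P279.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen


def matches_path(item: str, target: str) -> bool:
    """True if item reaches target via P31/P279* or P279/P279*."""
    first_step = P31.get(item, set()) | P279.get(item, set())
    return target in superclass_closure(first_step)


def items_under(target: str) -> set[str]:
    """All items in the toy data that the path links to target."""
    candidates = set(P31) | set(P279)
    return {item for item in candidates if matches_path(item, target)}


if __name__ == "__main__":
    print(items_under("train"))
```

Because the path always takes at least one step, the target itself is not part of its own result set.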

Finally

I want to write more about some of the ontological challenges on Wikidata, particularly how different types of people perceive the world. The case that came up in 2021 was how to use depicts statements on Commons to aid searching. But it turns out that searching for “Cat”, which could be either Q146 (house cat) or Q20980826 (cat – species of mammal), makes things hard.

I also might queue up some interesting ontological diagrams from Wikidata on Twitter, so be sure to follow! (Yes Tom, if you are reading this, I wrote this for you. Time to like and subscribe.)

Thanks to Lucas (@LucasWerkmeistr) for always being happy to talk to me about SPARQL :)