Wikibase: What is an entity?
I left the Wikidata and Wikibase teams roughly a year ago, and at the time there were some long and deep discussions going on inside the team trying to define what an entity was, and what should and should not be an entity.
At the recent Hackathon in Tallinn, this topic resurfaced for me, as current and previous members of the Wikidata and Wikibase teams were in attendance, myself included.
I have opinions, others have opinions, and I feel that a short blog post summarizing the details that are currently written down in public, as well as some of the more on-point things I have heard people say, may help further the discussion, or perhaps bring it to some kind of conclusion.
What I actually found when pulling the various written details together is that they mostly describe what I would say is the ideal path forward without rewriting the world (of Wikibase), but it’s taken me a while to sit back, relax, and actually reread all the things that we have written over the years.
The written definitions
(Or at least, some of them…)
Wikidata & Wikibase Architecture documentation
The architecture documentation was started back in 2020 after a series of internal workshops. Part of the goal was to make things within Wikidata and Wikibase clearer, to reach a common understanding of the sprawling projects and codebases that the team was writing and maintaining, and to provide a base from which the teams could continue to move forward.
These docs haven’t seen a meaningful commit since November 2021, but the glossary does contain the following…
Entities are the top level concepts of Wikibase’s data model. Items and Properties are the core Entity types of Wikibase. Other types can be added through Wikibase extensions, such as Lexemes.
Wikidata & Wikibase Architecture docs: Glossary, Last Updated: 4/4/2022, 8:05:53 AM
Wikidata:Glossary
This page contains many terms used on Wikidata itself, which, given the history, of course overlap a lot with Wikibase as well. However, these words and phrases are often described specifically in the Wikidata context, as you can see below.
Entity is the content of a Wikidata page in one of the data namespaces, such as an item (in the main namespace), property (in the Property namespace) or lexeme (in Lexeme namespace). Every entity is uniquely identified by an entity ID, which is a number with a prefix; for example, starting with the prefix Q for an item and P for a property.
I have trimmed the quote a little, as it goes on to state some things that are not entirely true, specifically about labels, descriptions and aliases, which only appear on some entities.
An entity is also identified by a unique combination of label and description in each language. An entity may have alternate aliases in multiple languages (something similar to synonyms).
Code documentation
The git repository contains a collection of documentation files which are maintained alongside code, and published to doc.wikimedia.org.
My chosen quote from these docs comes from the Entitytypes page and reads…
Entities as defined by Wikibase have a unique identifier and a type. As Wikibase is an extension of MediaWiki, every entity object is stored on its own page in a given namespace that can hold entities of one type.
The EntityDocument interface describes this construct and adds some methods around it that allow getting the type, getting and setting the id, creating a copy and checking if an entity has content and whether two entities are equal. It is important that the identifier does not count as content and neither affects emptiness nor equality.
All entities must implement this interface. The two entity types ‘item’ and ‘property’ are defined in Wikibase by default. They can be enabled by defining their namespace.
The actual content of an entity can be anything. However, Wikibase defines some basic structures including labels, descriptions, aliases and statements. If an entity holds one of these structures, it has to implement the corresponding provider interface (eg. LabelsProvider).
The code…
Of course, the code powering the current system also has something to say. Currently, what an entity is within the PHP code is defined in the WikibaseDataModel package, specifically by the EntityDocument interface.
It exposes the following public interface:
```php
interface EntityDocument {
    public function getType();
    public function getId();
    public function setId( $id );
    public function isEmpty();
    public function equals( $target );
    public function copy();
}
```
This suggests that an entity has a type and an ID, and can have some basic operations performed on it.
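As a rough illustration of how little that requires, here is a minimal sketch of an entity whose only content is a string. It assumes a local stand-in for EntityDocument that mirrors the six methods quoted above (the real interface lives in the WikibaseDataModel package and is not loaded here). The StringEntity class, its 'string-entity' type name and the S-prefixed IDs are all hypothetical, and IDs are plain strings purely for brevity. Following the Entitytypes docs quoted earlier, the ID is left out of both emptiness and equality.

```php
<?php
declare(strict_types=1);

// Local stand-in mirroring the six methods quoted above.
// The real interface is part of WikibaseDataModel and is not loaded here.
interface EntityDocument {
    public function getType();
    public function getId();
    public function setId( $id );
    public function isEmpty();
    public function equals( $target );
    public function copy();
}

// Hypothetical entity type whose only content is a single string.
final class StringEntity implements EntityDocument {
    private $id;
    private $text;

    public function __construct( ?string $id = null, string $text = '' ) {
        $this->id = $id;
        $this->text = $text;
    }

    public function getType() {
        return 'string-entity';
    }

    public function getId() {
        return $this->id;
    }

    public function setId( $id ) {
        $this->id = $id;
    }

    // Emptiness looks at the content only, never at the ID.
    public function isEmpty() {
        return $this->text === '';
    }

    // Equality compares type and content, deliberately ignoring the ID.
    public function equals( $target ) {
        return $target instanceof self && $target->text === $this->text;
    }

    public function copy() {
        return new self( $this->id, $this->text );
    }
}

// Two distinct entities (different IDs) holding the same content.
$a = new StringEntity( 'S1', 'hello' );
$b = new StringEntity( 'S2', 'hello' );
```

Here `$a->equals( $b )` is true even though the two IDs differ, which is the distinction between identity and content that the docs insist on.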
At the datamodel level, not much else is suggested within the definition of an entity. There are, of course, already implementations of this interface that add more, such as Item, which includes the following parts:
```php
class Item implements
    StatementListProvidingEntity,
    FingerprintProvider,
    StatementListHolder,
    LabelsProvider,
    DescriptionsProvider,
    AliasesProvider,
    ClearableEntity
```
But these are not required by the definition of an entity.
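One consequence is that code consuming entities should probably check for a capability rather than assume it. The sketch below shows that pattern under some loud assumptions: LabelsProvider here is a simplified local stand-in (labels are a plain language-to-string array, not the data model's own term structures), and LabelledThing, BareThing and displayText are hypothetical names invented for this example.

```php
<?php
declare(strict_types=1);

// Simplified local stand-in; the real provider interface in WikibaseDataModel
// is not loaded here and does not return a plain array.
interface LabelsProvider {
    /** @return array<string,string> language code => label (simplified) */
    public function getLabels(): array;
}

// Hypothetical entity that chose to have labels.
final class LabelledThing implements LabelsProvider {
    private $labels;

    public function __construct( array $labels ) {
        $this->labels = $labels;
    }

    public function getLabels(): array {
        return $this->labels;
    }
}

// Hypothetical entity that has no labels at all.
final class BareThing {
}

// Hypothetical helper: use a label when the entity can provide one in the
// requested language, otherwise fall back to the ID.
function displayText( object $entity, string $id, string $lang ): string {
    if ( $entity instanceof LabelsProvider ) {
        $labels = $entity->getLabels();
        if ( isset( $labels[$lang] ) ) {
            return $labels[$lang];
        }
    }
    return $id;
}
```

For example, `displayText( new LabelledThing( [ 'en' => 'Berlin' ] ), 'Q64', 'en' )` gives the label, while an entity without labels simply falls back to its ID.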
Lastly comes the integration into the Wikibase extension itself, which has never really been cleaned up in a big way (just progressively worked on). I tried to set out how an entity with just a string as its content could be added to Wikibase in this branch, but really this should be summarized in its own blog post for full clarity.
TL;DR: In order to integrate with Wikibase and MediaWiki there are some additional things you currently must, or optionally can, do, and all of this could do with lots of tidying up. Minimally, from the branch, this is something like:
```php
Def::CONTENT_MODEL_ID
Def::CONTENT_HANDLER_FACTORY_CALLBACK
Def::STORAGE_SERIALIZER_FACTORY_CALLBACK
Def::ENTITY_DIFFER_STRATEGY_BUILDER
Def::ARTICLE_ID_LOOKUP_CALLBACK
Def::TITLE_TEXT_LOOKUP_CALLBACK
Def::DESERIALIZER_FACTORY_CALLBACK
Def::VIEW_FACTORY_CALLBACK
Def::PREFETCHING_TERM_LOOKUP_CALLBACK
Def::URL_LOOKUP_CALLBACK
Def::EXISTENCE_CHECKER_CALLBACK
Def::LINK_FORMATTER_CALLBACK
Def::ENTITY_FACTORY_CALLBACK
Def::SERIALIZER_FACTORY_CALLBACK
Def::RDF_BUILDER_FACTORY_CALLBACK
```
You can read more about what each of these is in the entitytypes docs, and if you look at the branch, you’ll see that lots of these are not really needed, or should not be needed, but Wikibase has never been fixed to not require extenders to provide them when providing a new entity type to the system.
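To give a feel for the general shape, here is a hedged sketch of an entity type definition as a plain array of keys and callbacks, plus a check for which expected keys are missing. Nothing here is taken from the branch or from Wikibase itself: the Def class is a local stand-in holding three of the names listed above with placeholder string values, and missingDefinitionKeys and the 'string-entity' definition are hypothetical.

```php
<?php
declare(strict_types=1);

// Local stand-in: three of the names listed above, with placeholder values.
// These are NOT the values used by Wikibase.
final class Def {
    public const CONTENT_MODEL_ID = 'placeholder-content-model-id';
    public const SERIALIZER_FACTORY_CALLBACK = 'placeholder-serializer-factory';
    public const VIEW_FACTORY_CALLBACK = 'placeholder-view-factory';
}

// Hypothetical helper: which of the required keys does a definition lack?
function missingDefinitionKeys( array $definition, array $required ): array {
    return array_values( array_diff( $required, array_keys( $definition ) ) );
}

// Hypothetical definition for an entity type that only holds a string.
$stringEntityDefinition = [
    Def::CONTENT_MODEL_ID => 'string-entity',
    Def::SERIALIZER_FACTORY_CALLBACK => static function () {
        // A real callback would return a serializer object;
        // a closure stands in for one here.
        return static function ( string $text ): array {
            return [ 'type' => 'string-entity', 'text' => $text ];
        };
    },
];

$missing = missingDefinitionKeys(
    $stringEntityDefinition,
    [
        Def::CONTENT_MODEL_ID,
        Def::SERIALIZER_FACTORY_CALLBACK,
        Def::VIEW_FACTORY_CALLBACK,
    ]
);
```

In this sketch `$missing` contains only the view factory key. The point is that "required" is a property of the consuming system, so shrinking that list is something Wikibase could do without touching the definition of an entity.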
Some opinions
Now, of course I think my opinion is the right one, otherwise I wouldn’t have it.
Though, as with all opinions, as time passes and lenses change focus and shift around, the way of describing them can change, and I certainly haven’t articulated my thoughts before quite as I have written them below.
My opinion
To me, if you boil it down to the simplest view, a Wikibase entity is an identifiable thing.
In essence, that means it has a way of being uniquely identified.
Within a single Wikibase this is essentially the ID of the entity, so for items, something like Q64.
That’s it, that is an entity, identifiable.
In order to make that thing useful to humans, it needs to be human-understandable, probably via some human-readable text attached to it so that we know what it means. This could be a name, maybe it has more than one name that people might refer to it as? Maybe lots of things have the same name?
In many cases for Wikibase, this is done via labels, descriptions and aliases. Though the Wikidata glossary incorrectly says all entities have these, they do not. For example, Lexemes and MediaInfo have taken alternative approaches. But all entities to date have something that fulfils the need to be human-understandable.
Lastly, remember what Wikibase is actually trying to do: create structured information, and more specifically, a knowledge graph of linked data.
This is where some form of statements comes into the equation in all existing cases of entities, allowing two or more entities to be connected together via the use of Properties.
Now, I’m not saying that all entities should necessarily have statements as they were developed on Wikidata for Items back in 2014 or so, but linking needs to be possible in some way.
All of this, other than the identifiable part, should be taken with a pinch of salt in my opinion.
The details above are the pinch of salt that you need when reading all existing documents on the topic, or looking at the current state of code. But in my opinion, these are the base requirements of the system that is Wikibase, or rather, the Wikibase ecosystem.
Knowledge graphs
The domain of Wikidata and Wikibase is the knowledge graph.
To use generic terms, the knowledge graph consists of nodes, which can also have attributes. Some of these attributes take the form of edges between nodes. There are many types of nodes in the knowledge graph.
Wikidata/Wikibase supports about five of them:
- There are nodes that describe things/concepts in the real world. In the Wikidata ecosystem, those nodes are called “Items”.
- There are nodes that describe words. What forms they can take and their meanings. In the Wikidata ecosystem, those nodes are called “Lexemes”.
- There are nodes that describe the attributes that nodes can have. In the Wikidata ecosystem, those nodes are called “Properties”.
- There are nodes that describe the shapes other nodes ought to have. What attributes with what kind of values, etc. In the Wikidata ecosystem, those nodes are called “(Entity) Schemas”.
- There are nodes that describe media-files on a specific file system. In the Wikidata ecosystem, those nodes are called “MediaInfo”.
There are many other types of nodes out there. For example, scholarly articles, could, in principle, have their own type of node to describe them. But in practice, they are currently described by Items in the Wikidata/Wikibase ecosystem.
Michael Große in 2024
RDF
An entity is a thing that has “identity” beyond its properties – it stays the same “thing” as its properties change, and you can have two distinct entities with identical properties.
That means, in the model, an entity has a permanent unique ID – in contrast to a value, which doesn’t.
In Wikibase, an entity is something that one can make statements/claims about. In RDF, the equivalent of an entity is a “resource” or “subject”, the left-hand side of a triple.
An RDF “blank node” is not a proper entity – it’s a hack.
Daniel Kinzler in 2024
A summary
In summary, the ongoing discussions about the definition and role of entities within Wikidata and Wikibase are essential. Reflecting on the various written documents, opinions, and architectural guidelines from past and present team members, it becomes clear that entities form the backbone of the Wikibase system. They are identifiable objects with unique identifiers that may have human-readable attributes to enhance understanding and utility. The need for a structured approach to linking these entities is fundamental to maintaining the integrity and functionality of the knowledge graph that Wikibase, and the ecosystem of Wikibases, aims to build.
While opinions vary on the specifics, the consensus aligns on some core principles: entities must be identifiable, their attributes must be understandable, and they must be capable of forming meaningful connections within the knowledge graph.
As we move forward, it is crucial to keep these principles in mind, ensuring that the evolution of Wikibase continues to support a coherent and usable knowledge graph. This understanding will not only aid in furthering internal discussions but also provide a stable foundation for future developments and extensions within the Wikibase framework.
Ultimately, I see two forward approaches:
The first, which feels like the way the teams are heading, would in my opinion be the complete rewrite. I don’t really want to add a bunch of memes to this blog post about complete rewrites and multiple standards, but I hope you all know what I’m talking about.
The second, which to me feels like the “right” way, is to use the existing framework which is at the heart of Wikibase and its entities, but actually bring it up to standard with the expectations the teams have. (I feel they are still plagued by, and not able to move past, seeing things hardcoded to ye olde ways of Items and Properties from 2013.)
I’ll definitely be writing some more on this soon…
The data model chose to use the term “Entity” for the top-level Thing/class in the hierarchy of the data model. But in reality, a better term would have been “Document” or “Record”. In general, the confusion is often due simply to folks that are more familiar with one of the domains than the other between OOP Objects and Semantic Web Objects. https://en.wikipedia.org/wiki/Object_(computer_science)
But for the data model? Yeah, another term besides “Entity” should have been chosen for the top-level Thing. Looks like the code itself leans toward “Document” via EntityDocument. This is not surprising since it borrows from the semantic web terminology where it had already borrowed the “web” terminology of just “Document” as in HTML document. Just like an Owl Document or RDF Document has any number of class axioms, property axioms etc. https://www.w3.org/TR/owl-ref/#OWLDocument
Personally, I’d like to see all the places where “Entity” is used to describe the Document from the model and call it a “Component” instead. The Wikibase data model is made of multiple “Components”. The top-level component is called a “Document”. “Resources are described in a Document”. The “Document” component is the top-level store of knowledge. This then makes sense for folks on BOTH domain sides.
What a perfect comment to follow on from that post with.
To quote Daniel again…
And indeed, in the early days of the PHP code, everything (Items and Properties at the time) extended from an Entity base class. As early as Jul 7, 2014 the idea of the EntityDocument interface was introduced, but the work that was started back then to decouple “the old way” from this new idea was never really completed, in my opinion.
Now that two people have mentioned documents, I might have to write a follow-up focusing on that and on alternative terminology. I quite like Components as well, and this works quite well with one of my half dreams, where all components within the Wikibase data model across all Wikibases are also individually and uniquely addressable.