Wikibase, from Entity to EntityDocument

June 3, 2024 By addshore
This entry is part 2 of 7 in the series Wikibase ecosystem

The term document has already come up a few times while discussing what a Wikibase entity is, and whether that should change (be that in name only, in code, or in structures), including in my first post of this series.

Looking at the very first definition of entity in the DuckDuckGo search that I performed 6 seconds ago, an entity is:

Something that exists as a particular and discrete unit.

The American Heritage® Dictionary of the English Language, 5th Edition

At the most basic level, it’s fairly straightforward to say that a Wikibase doesn’t hold the actual entities (such as a type of tree), but rather data about said entities.

And in a nutshell, this data is collected within a document.

Image from “What is the semantic web” by onotext.com

Quoting a few choice people again, before diving deeper into this topic…

The “entities” in the Wikibase base are not Entities. They are descriptions of entities. The entity is the thing in the world not the data we have about it, even though colloquially, we don’t make the distinction. But we have separate URIs for the thing and the description in the abstract and for specific renderings.
I think that’s important to mention when discussing what an entity “is”.

Daniel Kinzler in conversation, June 2024

The data model chose to use the term “Entity” for the top-level Thing/class in the hierarchy of the data model. But in reality, a better term would have been “Document” or “Record”. In general, the confusion is often due simply to folks that are more familiar with one of the domains than the other, between OOP Objects and Semantic Web Objects.

Thad Guidry in a comment, June 2024
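Daniel’s point about separate URIs is easier to see with a concrete case. The sketch below uses Wikidata’s URI scheme as the example: one URI for the thing, one URL for the description in the abstract, and one per specific rendering. The function names conceptUri and documentUrl are made up for this illustration and are not part of Wikibase.

```php
<?php
// Illustrative helpers only; these function names are not Wikibase APIs.

// URI of the concept: identifies the thing in the world.
function conceptUri( string $id ): string {
    return 'http://www.wikidata.org/entity/' . $id;
}

// URL of the description in the abstract (no format given), or of one
// specific rendering when a format such as "json" or "ttl" is given.
function documentUrl( string $id, ?string $format = null ): string {
    $url = 'https://www.wikidata.org/wiki/Special:EntityData/' . $id;
    return $format === null ? $url : $url . '.' . $format;
}

echo conceptUri( 'Q42' ), "\n";          // the thing
echo documentUrl( 'Q42' ), "\n";         // the description, in the abstract
echo documentUrl( 'Q42', 'json' ), "\n"; // one specific rendering
```

The same ID appears in all three, which is probably part of why the distinction gets lost colloquially.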

Semantic Web

OWL, or Web Ontology Language, is a semantic web standard used to represent complex information about things, groups of things, and relationships between things, enabling better data interoperability and understanding across diverse systems. It’s built on RDF, or Resource Description Framework, which is another W3C standard.

Both the OWL reference and the RDF concepts spec include references to OWL documents and RDF documents respectively.

Moving further up through the W3C documentation, we can find some other text that talks about documents briefly as we move our way up to the DOM.

As Thad pointed out in a comment on my last post, all of these, and Wikibase today with its idea of an EntityDocument, borrow from semantic web terminology in calling things a document.

I’m not sure if this is necessarily a super useful connection to make, as again, using the same or similar terms to refer to similar or different things is one of the issues with the now overloaded term entity within Wikibase in the first place.

EntityDocument in Wikibase

When the idea of EntityDocument was first introduced to Wikibase back in 2014, it was a fairly bare and generic concept. This falls in line with the idea that whatever we call these things, a key attribute is an identity.

From memory, the idea was primarily introduced to try to reduce the coupling and legacy cruft that made up the Entity base class in PHP that both Item and Property shared at the time, though I have no doubt that some thought went into the naming too.

Having a quick look at an old version of this Entity class which I believe has long since departed, we see many things that can confuse and conflate the area of entities within Wikibase, and “if X should be an entity” discussions while introducing new things to Wikibase.

// Abridged outline of the old class; method bodies are elided.
abstract class Entity implements \Comparable, ClaimAggregate, \Serializable, FingerprintProvider {
    // Type and identity
    public abstract function getType();
    public function getId() {}
    public function setId( $id ) {}

    // Serialization
    public function serialize() {}
    public function unserialize( $value ) {}

    // Generic content handling
    public function clear() {}
    public function isEmpty() {}
    public function equals( $that ) {}
    public function copy() {}

    // Labels
    public function setLabel( $languageCode, $value ) {}
    public function setLabels( array $labels ) {}
    public function removeLabel( $languageCodes = array() ) {}
    public function getLabels( array $languageCodes = null ) {}
    public function getLabel( $languageCode ) {}

    // Descriptions
    public function setDescription( $languageCode, $value ) {}
    public function setDescriptions( array $descriptions ) {}
    public function removeDescription( $languageCodes = array() ) {}
    public function getDescriptions( array $languageCodes = null ) {}
    public function getDescription( $languageCode ) {}

    // Aliases
    public function getAliases( $languageCode ) {}
    public function getAllAliases( array $languageCodes = null ) {}
    public function setAliases( $languageCode, array $aliases ) {}
    public function setAllAliases( array $aliasLists ) {}
    public function addAliases( $languageCode, array $aliases ) {}
    public function removeAliases( $languageCode, array $aliases ) {}

    // Claims
    public function setClaims( Claims $claims ) {}
    public function addClaim( Claim $claim ) {}
    public function getClaims() {}
    public function hasClaims() {}
    public function newClaim( Snak $mainSnak ) {}

    // Fingerprint (labels, descriptions and aliases as one object)
    public function getFingerprint() {}
    public function setFingerprint( Fingerprint $fingerprint ) {}
}

When the class above was in use, all Wikibase entities did indeed have labels, descriptions, aliases, and claims. The notable thing missing here is sitelinks, which were already implemented separately within the Item domain.

Today’s EntityDocument interface massively simplifies this, but also includes a few more methods that were deemed useful or needed for all EntityDocuments to work within the Wikibase system.

interface EntityDocument {
    public function getType();
    public function getId();
    public function setId( $id );
    public function isEmpty();
    public function equals( $target );
    public function copy();
}
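To get a feel for how little the interface asks of an implementer, here is a minimal sketch of a made-up entity type. NoteDocument, its "note" type and its single text field are all hypothetical and do not exist in Wikibase. The interface is repeated in simplified form only so that the snippet runs on its own, and the choice to ignore the ID when comparing is an assumption of this sketch.

```php
<?php
// Simplified copy of the interface shown above, so the sketch is self-contained.
interface EntityDocument {
    public function getType();
    public function getId();
    public function setId( $id );
    public function isEmpty();
    public function equals( $target );
    public function copy();
}

// Hypothetical entity type, invented for illustration; not part of Wikibase.
class NoteDocument implements EntityDocument {
    private $id;
    private $text;

    public function __construct( $id = null, $text = '' ) {
        $this->id = $id;
        $this->text = $text;
    }

    public function getType() {
        return 'note';
    }

    public function getId() {
        return $this->id;
    }

    public function setId( $id ) {
        $this->id = $id;
    }

    // Empty means "no content"; the ID is identity, not content.
    public function isEmpty() {
        return $this->text === '';
    }

    // Assumption of this sketch: equality compares content and ignores the ID.
    public function equals( $target ) {
        return $target instanceof self && $target->text === $this->text;
    }

    // Returns an independent document with the same ID and content.
    public function copy() {
        return new self( $this->id, $this->text );
    }
}

$original = new NoteDocument( 'N1', 'a type of tree' );
$copy = $original->copy();
$copy->setId( 'N2' );

var_dump( $original->equals( $copy ) ); // same content, different ID
var_dump( $original->getId() );         // unchanged by editing the copy
```

Everything beyond identity and these few generic operations is left to the individual entity type.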

EntityContent

The EntityDocument interface hints that EntityDocuments probably have some form of content: they can be empty (or not), can be copied, and have an equality check. However, the interface at this level doesn’t force any particular content upon the user, whereas the old Entity class did. This is where EntityContent comes into play within Wikibase, and also where the binding to MediaWiki begins.

EntityContent is the connection between the EntityDocument, and the Content concept that lives within Pages on a MediaWiki site. And within EntityContent and its respective ContentHandler probably live many of the legacy “bad” things that still exist for Wikibase entities from ye olde early Wikibase days.

To roughly summarize, the content and the handler bring along the following traits, which Wikibase currently provides in some way for all entities (even if the implementation may not be the tidiest):

  • Content contains an EntityDocument (which has an ID, and whatever else that entity decides to have)
  • Content has a size
  • Content can be empty
  • Content can be copied
  • Content can have some meta properties
  • Content can be compared to other content, and information about the differences can be deduced
  • Content can have patches applied to it
  • Content can be equal to other content
  • Content validation and/or filtering might be a thing, so content can be validated, and can be in a valid or invalid state
  • Content can be a redirect to some other content, and that redirect must have a textual representation
  • Content can be countable, and perhaps if it is empty or a redirect it should not be counted
  • Content can be searched using some textual representation of itself
  • Content can have a short textual representation of itself, suitable for use in edit summaries and log messages

With some of these in mind, you can see why EntityDocuments have some methods, such as copy, isEmpty and equals. And you can likely also tie many of these bullet points into having a brief and human-understandable representation of otherwise meaningless structured information.
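A few of the traits listed above can be sketched in code to show how they lean on the document’s own methods. Everything here is a hypothetical, heavily simplified stand-in: LabelledDocument and SketchContent are invented names, the redirect text format is made up, and the real EntityContent and MediaWiki Content classes have different and much larger APIs.

```php
<?php
// Hypothetical stand-ins for illustration; not the real Wikibase classes.

// Minimal document: an ID plus a single label as its only content.
class LabelledDocument {
    public $id;
    public $label;

    public function __construct( string $id, string $label = '' ) {
        $this->id = $id;
        $this->label = $label;
    }

    public function isEmpty(): bool {
        return $this->label === '';
    }

    public function equals( $target ): bool {
        return $target instanceof self && $target->label === $this->label;
    }

    public function copy(): self {
        return new self( $this->id, $this->label );
    }
}

// Content holds a document, or a redirect target (expected without a document).
class SketchContent {
    private $document;
    private $redirectTarget;

    public function __construct( ?LabelledDocument $document, ?string $redirectTarget = null ) {
        $this->document = $document;
        $this->redirectTarget = $redirectTarget;
    }

    public function isRedirect(): bool {
        return $this->redirectTarget !== null;
    }

    // Delegates to the document; content without a document is empty.
    public function isEmpty(): bool {
        return $this->document === null || $this->document->isEmpty();
    }

    // Empty content and redirects are not counted.
    public function isCountable(): bool {
        return !$this->isRedirect() && !$this->isEmpty();
    }

    public function equals( SketchContent $that ): bool {
        if ( $this->isRedirect() || $that->isRedirect() ) {
            return $this->redirectTarget === $that->redirectTarget;
        }
        if ( $this->document === null || $that->document === null ) {
            return $this->document === $that->document;
        }
        return $this->document->equals( $that->document );
    }

    public function copy(): self {
        $doc = $this->document === null ? null : $this->document->copy();
        return new self( $doc, $this->redirectTarget );
    }

    // Short text for edit summaries and logs (format invented for this sketch).
    public function getTextForSummary(): string {
        if ( $this->isRedirect() ) {
            return '#REDIRECT ' . $this->redirectTarget;
        }
        return $this->isEmpty() ? '' : $this->document->label;
    }
}
```

Note how isEmpty, equals and copy on the content side simply delegate to the document, while redirects and countability are concerns that only exist at the content level.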

Diving deeper into the Content concepts within MediaWiki can be helpful in some cases, but not really for this post. If you are curious, take a closer look at EntityHandler in Wikibase, which contains some more details.

In Summary

In summary, this blog post delves into the terminology and structure of entities in the Wikibase system, emphasizing the distinction between real-world entities and their data descriptions. The term “Entity” can be misleading, as it merges the concept of the actual object with its data representation. Alternatives like “Document” or “Record” could provide more clarity, but also pose their own risks.

The post traces the evolution from the initial Entity class to the streamlined EntityDocument interface used today, which simplifies the structure while maintaining essential attributes.

Maybe EntityDocument could just be renamed IdentifiableContentDocument if Entity is still too overloaded? *shrug*

In many ways, I’d be of the opinion that the name doesn’t really matter. What matters is what the thing is, what it is trying to do, and how it works within a system, and also in new and alternative domains.
