Lexeme and MediaInfo, implementing EntityDocument

June 6, 2024, by addshore
This entry is part 3 of 4 in the series Wikibase ecosystem

As we continue the journey through Entity and EntityDocument within Wikibase, it is also useful to look at the third and fourth most widely used (at least within the Wikimedia space) entity types for Wikibase: MediaInfo and Lexeme.

Both of these entity types implement EntityDocument, with none of the old assumptions that were baked into the Entity base class that used to exist.

MediaWiki extensions

As these entity types were decoupled from the main body of Wikibase, they were developed as MediaWiki extensions: WikibaseMediaInfo (https://www.mediawiki.org/wiki/Extension:WikibaseMediaInfo) and WikibaseLexeme (https://www.mediawiki.org/wiki/Extension:WikibaseLexeme).

This was the easy choice at the time, and probably still makes perfect sense, as Wikibase itself is a MediaWiki extension, and there is already a common pattern of extensions extending extensions. This saves some work around coding an extension mechanism, though we should remember that the Wikibase codebase ultimately has free choice when it comes to choosing how it can be extended.

In terms of adding entity types, this is mainly documented in the entitytypes documentation, but this is very much surface level documentation of a system that is still really in its first version, and was originally created for Items and Properties to be somewhat pluggable within Wikibase itself.

As with many of the entity types around Wikibase, although they were primarily developed for the Wikimedia use cases, in this case Wikimedia Commons and Wiktionary, today both extensions (I believe) see use in other Wikibases too.

MediaInfo

The MediaInfo entity type took the approach of sticking quite close to the internal representations that older Wikibase entity types, such as Item, used.

You can see this by looking at the PHP class for the entity type, which includes many of the same interfaces.

class MediaInfo
	implements StatementListProvidingEntity,
		LabelsProvider,
		DescriptionsProvider,
		ClearableEntity

A notable difference compared with the Item definition is the lack of AliasesProvider, so MediaInfo entities do not have any concept of aliases.
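To make the "which interfaces does it implement" point a little more concrete, here is a minimal sketch. The interfaces below are simplified stand-ins that I have made up for illustration (plain arrays instead of the real Wikibase term objects), and SketchItem, SketchMediaInfo and supportedTermKinds are hypothetical names, not anything that exists in Wikibase.

```php
<?php
// Simplified stand-ins for illustration only; the real Wikibase interfaces
// work with dedicated term objects rather than plain arrays.
interface LabelsProvider {
	public function getLabels(): array;
}

interface AliasesProvider {
	public function getAliasGroups(): array;
}

// Hypothetical Item-like entity: has both labels and aliases.
final class SketchItem implements LabelsProvider, AliasesProvider {
	public function getLabels(): array {
		return [ 'en' => 'Berlin' ];
	}

	public function getAliasGroups(): array {
		return [ 'en' => [ 'An example alias' ] ];
	}
}

// Hypothetical MediaInfo-like entity: labels only, no aliases.
final class SketchMediaInfo implements LabelsProvider {
	public function getLabels(): array {
		return [ 'en' => 'Maiden in Mindelo, Cape Verde, 2022' ];
	}
}

/** Lists which term kinds an entity supports, based only on its interfaces. */
function supportedTermKinds( object $entity ): array {
	$kinds = [];
	if ( $entity instanceof LabelsProvider ) {
		$kinds[] = 'labels';
	}
	if ( $entity instanceof AliasesProvider ) {
		$kinds[] = 'aliases';
	}
	return $kinds;
}
```

Generic code can then ask what an entity supports rather than what type it is, which is roughly the idea behind splitting these capabilities into separate interfaces.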

In fact, you could argue that from a user perspective, MediaInfo entities also have no concept of labels, as within user interfaces these are referred to as captions (though the API, among other things, does allow the phrase “label” to spill out into the user land).

            "labels": {
                "en": {
                    "language": "en",
                    "value": "Maiden in Mindelo, Cape Verde, 2022"
                }
            }
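As a small illustration of the naming mismatch, the sketch below reads a "caption" out of the "labels" structure shown above. The function name captionFor is hypothetical, and the JSON is just the fragment above wrapped in braces, not a full API response.

```php
<?php
// The fragment from above, wrapped so that it is valid JSON on its own.
$json = '{"labels":{"en":{"language":"en","value":"Maiden in Mindelo, Cape Verde, 2022"}}}';

/**
 * Hypothetical helper: what the user interface calls a caption is stored
 * under "labels" in the serialized entity.
 */
function captionFor( array $entity, string $languageCode ): ?string {
	return $entity['labels'][$languageCode]['value'] ?? null;
}

$entity = json_decode( $json, true );
$englishCaption = captionFor( $entity, 'en' );
$germanCaption = captionFor( $entity, 'de' ); // no German caption in this fragment
```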

Connecting this back to my last post, these labels, or captions (depending on where you are looking), fulfil the human-understandable aspect of what is probably a requirement of an entity within Wikibase.

Arguably, there could be some extra abstraction here: Wikibase could provide a mechanism for human-readable attributes, which are going to be some form of short text, available in multiple languages, and exposed in various interfaces in whatever ways they need to be.

Also, arguably, this is what labels are. But it’s easy to see how, through another lens, labels feel “very Item and Property ‘ye olde’ Wikibase”, and possibly come with some debt associated with them that could be cleaned up.

I’ll briefly touch on the identifiable and linked data attributes, and say that MediaInfo entities are identified through unique and stable IDs, similar to Items and Properties, for example M127101037. They are also linked via statements, ripped right from Items and Properties, too.

Lexeme

The Lexeme entity type ended up in a slightly different place in terms of code implementation, and I hope you’ll excuse me if I gloss over the concept of sub entities here, but I’d be happy to talk about them in a future post.

Immediately looking at the PHP code definition of a Lexeme entity…

class Lexeme implements StatementListProvidingEntity, ClearableEntity

You won’t see any mention of labels, descriptions, or aliases in terms of a human-understandable aspect. Lexeme instead has custom parts that fulfil this.

These custom parts follow similar ideas to labels, and arguably, with some of the legacy stuff surrounding labels removed, perhaps these are just labels? (Or perhaps we just start calling them human-readable representations?)

The side effects of reimplementing this human-readable element for Lexemes are far-reaching: the amount of additional code increases, and plenty of systems and modules need to learn about this slightly different, but also similar (or even identical), component.

Within the Wikibase codebases we can, for example, compare Lexeme with MediaInfo. Lexeme must implement its own LINK_FORMATTER_CALLBACK (which tells MediaWiki how to format links to Lexeme entities) so that the lemma etc. can be shown. MediaInfo gets this for free, as labels come with a default that displays the label.
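The difference can be sketched roughly as follows. This is not how Wikibase actually wires up LINK_FORMATTER_CALLBACK; every function and variable name here is hypothetical, and plain arrays stand in for real entities. It only shows the shape of the problem: label-bearing types share one default, while a type without labels has to bring its own formatter.

```php
<?php
// Hypothetical sketch: none of these names are Wikibase's real API.

/** Default: show the label in the requested language, else fall back to the ID. */
function defaultLinkText( string $id, array $labels, string $languageCode ): string {
	return $labels[$languageCode] ?? $id;
}

/** Lexeme-style custom formatter: there are no labels, so join the lemmas instead. */
function lexemeLinkText( string $id, array $lemmas ): string {
	return $lemmas === [] ? $id : implode( '/', $lemmas ) . " ($id)";
}

// Registry keyed by entity type; only types without labels need an entry.
$customFormatters = [
	'lexeme' => fn ( string $id, array $data, string $languageCode ): string =>
		lexemeLinkText( $id, $data['lemmas'] ?? [] ),
];

/** Picks a custom formatter when one is registered, otherwise the label default. */
function linkText( array $customFormatters, string $type, string $id, array $data, string $languageCode ): string {
	if ( isset( $customFormatters[$type] ) ) {
		return $customFormatters[$type]( $id, $data, $languageCode );
	}
	return defaultLinkText( $id, $data['labels'] ?? [], $languageCode );
}

$itemLink = linkText( $customFormatters, 'item', 'Q64', [ 'labels' => [ 'en' => 'Berlin' ] ], 'en' );
$mediaInfoLink = linkText( $customFormatters, 'mediainfo', 'M127101037', [ 'labels' => [] ], 'en' );
$lexemeLink = linkText( $customFormatters, 'lexeme', 'L4', [ 'lemmas' => [ 'en' => 'windsurf' ] ], 'en' );
```

Every new entity type that skips labels needs another entry in that registry, plus the same kind of special case in every other system that wants to show a name for an ID.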

My opinion

You won’t get away from my opinion in this series of blog posts, so here’s a more concrete take on my opinion that entities need a human-readable aspect.

Labels were version 1, and used in Items and Properties on Wikidata. This tightly scoped project (Wikidata) has slowly evolved into a much wider ecosystem than some thought in the early days, and thus Items and Properties are used far and wide.

Many lessons have been learned along the way, and lots of these lessons are built into labels, as they are known and used on Items on Wikidata. For example, in order to integrate with the many tools MediaWiki provides, Wikibase needs to be able to format a link to something like Q64 as something human-readable, such as “Berlin”. This is no different a use case to displaying L4 as “windsurf”. And in all cases, these human-readable texts are short, and multilingual.

The support within Wikibase for using these in a generic way is lacking, and has been ever since they were shoehorned into MediaInfo, and even prior to this. When Properties were first introduced to Wikibase, you could already see the cracks and assumptions that were baked into labels (and other parts of entities) that slowly had to be teased out (and even today should be further removed than they are). For example, the uniqueness constraints across labels and descriptions. From the docs:

If an Item has a label and a description for a given language, no other Item may have that same combination for the same language.
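To show how Item-specific that rule is, here is a minimal sketch of such a check. It is an illustration under my own assumptions, not the real Wikibase validator: the function name is hypothetical, plain arrays stand in for entities, and the IDs and terms are made up.

```php
<?php
/**
 * Hypothetical check: returns the IDs of existing Items that share the same
 * label AND description with the candidate in at least one language.
 */
function findLabelDescriptionConflicts( array $candidate, array $existingItems ): array {
	$conflicts = [];
	foreach ( $existingItems as $id => $item ) {
		foreach ( $candidate['labels'] ?? [] as $languageCode => $label ) {
			$description = $candidate['descriptions'][$languageCode] ?? null;
			if ( $description === null ) {
				continue; // the rule only applies when both terms are present
			}
			if (
				( $item['labels'][$languageCode] ?? null ) === $label &&
				( $item['descriptions'][$languageCode] ?? null ) === $description
			) {
				$conflicts[] = $id;
				continue 2; // one conflicting language per Item is enough
			}
		}
	}
	return $conflicts;
}

// Made-up data for illustration.
$existingItems = [
	'Q100' => [ 'labels' => [ 'en' => 'Foo' ], 'descriptions' => [ 'en' => 'a thing' ] ],
	'Q101' => [ 'labels' => [ 'en' => 'Foo' ], 'descriptions' => [ 'en' => 'another thing' ] ],
];

$sameLabelAndDescription = findLabelDescriptionConflicts(
	[ 'labels' => [ 'en' => 'Foo' ], 'descriptions' => [ 'en' => 'a thing' ] ],
	$existingItems
);
$sameLabelOnly = findLabelDescriptionConflicts(
	[ 'labels' => [ 'en' => 'Foo' ] ],
	$existingItems
);
```

A rule like this makes sense for Items, but there is no obvious reason it should come along for the ride when labels are reused on a different entity type.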

It’s a shame that the V1 of the new Wikibase REST API has taken so long to appear within the Wikibase space, as I’d love to see endpoints for /entity/L1/humanReadable/name/en and /entity/Q2/humanReadable/detail/de appear for labels and descriptions respectively. Hell, call them what you want, we already have labels and descriptions, we might as well stick with that, just don’t assume that everything about them has to be exactly the same on every entity type, they are in essence just a human-readable representation of the document, and that representation should consistently be exposed in a bunch of places.

I want to ramble further about automated human-readable names and summaries here, and how this further highlights the potential separation between implementations of this human-readable aspect in current entities and what might exist in the future, but perhaps that’s for another post.

The pinch of salt

The pinch of salt here, again, is that I haven’t spent much of the past year neck deep in these codebases, and I may be misremembering some things, such as whether the default link formatter actually works with labels alone, or whether it also requires descriptions. My counterpoint to this is that it doesn’t matter: just make it work with labels alone.

I’m also only scratching the surface, and choosing to primarily focus on the human-readable aspect of these entity types. Though, I feel the same principles can be carried across most attributes of the entities that exist today within Wikibase. According to a previous post, the human-readable aspect is 1/3 of the problem anyway.

Looking particularly at the REST API endpoints that I half joke about above: I love the fact that the Wikibase API is hopefully soon getting a revamp to improve usability for modern developers and to facilitate modern tooling. I just wish that Wikibase was already at the point where new APIs that change behaviours and bring additional functionality and abstractions to the framework / platform were being developed. Also, I hate the fact that I put humanReadable, name and detail in those paths…

In summary

Labels are a pretty good starting point (in my opinion) for a human-readable aspect of something that is identifiable by some non human-readable ID (entity), especially when the whole idea is for these things to integrate into MediaWiki and be exposed in a consistent manner for users. Many lessons have been learnt throughout their current ~11 year lifespan, and reimplementing the idea of human-readable text for an ID from scratch will lead to things being missed, and years down the line ultimately (and hopefully) ending up in the same place, just now with 2 systems to maintain.

To quote a half wise German man I was recently chatting to on Telegram regarding rewrites:

I think part of the attraction of rewrites is that it saves you the trouble of understanding the old system. But in my experience, that just means that you’ll repeat the mistakes because you didn’t take the time to learn from the past.

Wise-ish German man

I don’t personally care about the name, but labels are already in use, so why not just call them that? You can see that MediaInfo, which ultimately ended up in the middle ground (and probably could have done with more support from the core of Wikibase), now has an ongoing confusing state to deal with.
