EntitySchema, and the entity flip-flop

July 11, 2024 By addshore
This entry is part 5 of 9 in the series Wikibase ecosystem

The EntitySchema extension, previously called WikibaseSchema, has had an interesting life since its initial creation back in early 2019.

The main point this story is intended to highlight is that EntitySchema started off its planned life as an Entity within a Wikibase. As the development team started work on an initial version, it flipped away from an entity. And in continued development, it has slowly inched its way back towards perhaps being an Entity.

Background

As is noted in the first ADR of the extension (which was actually written in 2023), the team initially decided to try to develop the extension entirely separate from Wikibase:

Although Entity Schemas relate to Wikibase entities by name and purpose, the implementation of the EntitySchema extension, at the time of this decision, is completely decoupled from Wikibase, and the concept of Entities that it adds to MediaWiki. Thus, a MediaWiki instance can theoretically operate with only the EntitySchema extension, and without the Wikibase extension installed.

Keeping EntitySchema separate from Wikibase, and the idea of an Entity it provides altogether, was a conscious decision to not marry its implementation to the inherent complexity of Wikibase itself. As well as an attempt to avoid overloading EntitySchema with unnecessary functionality so that its ongoing implementation could be done iteratively and in a more flexible, organic manner, to answer user’s needs as they are brought to us.

0001 Extend Entity Schema to support additional “traits” ADR

In a nutshell, this extension, and the developments and discussions about it over the past years (which are still happening today), is one of the things that led me to recently write a series of blog posts about what I think an “entity” is from my perspective, as well as looking at some other entities, and the use of EntityDocument in the codebase.

Project kick-off

Internally within WMDE, the extension started off (having already been planned and discussed for some time) with a series of kick-off meetings in December 2018. The first was deemed to have too many open questions, hence the follow-up of a second. Ultimately, a team formed around the creation of the extension, and this started further discussions.

I feel it is worth noting that during these years at WMDE, “The Journey Model” was being used, which is a modification of the Spotify model for making agile work at scale. A key part of this, which likely ties into some of the interesting presentation and development along this journey, is the desire that a team working on a feature would “not last more than 1Q(uarter) to avoid long running teams”. This short running team would be called a hike.

As I remember it, Product had presented a high level overview of a problem:

  • enable humans and machines to find items that do not fit a certain shape in order to find mistakes and omissions in our data
  • increase the confidence in our data

As well as some key usage scenarios for a first version:

  • I’d like to find existing schemas that are relevant to me
  • I’d like to understand what an existing schema does
  • I’d like to discuss an existing schema and form modelling consensus
  • I’d like to adapt an existing schema
  • I’d like to be able to store a new schema
  • I’d like to be able to test a set of entities against an existing schema

And some vision into the future:

  • First: Allow storing of schemas
  • Next: Allow storing of explanatory text and categorizing schemas
  • Later: Allow checking the current entity against a schema

It was clear to see, through the mockups of what this thing might look like, that Product’s intention back then was for this to be an Entity (whatever that meant in the Wikibase of 2018). The page mockup explicitly references a “termbox”, something that to date only appears on Entity pages.

Even ahead of this kick-off meeting, other elements of discussion certainly pointed toward a Schema being an Entity, such as “schemas will be identified by a sequential number, prefixed by the letter “O” (wd:O123)”.

But this strongly ties back to what you think an entity is, of course.

Scoping

After the kick-off, discussions within the implementing team led to the conclusion that the currently scoped work could be carried out entirely separate from Wikibase. I don’t believe this was documented publicly anywhere, but it wouldn’t surprise me if there is some documentation hidden in an internal-only WMDE Google Doc.

Development

Thanks to the ever-open records that are contained within the Wikimedia Phabricator instance, Gerrit UI, and git repos (mirrored to GitHub), we can get a pretty good idea of how the early development happened.

You can find all Phabricator tasks in the order that they were created, the Gerrit code reviews that were happening, and all commits that were ultimately made to the extension – although some of these links may need tweaking as the number of pages grows!

2019 hike

I may have missed things in the below summary, but I think it gives a rather nice condensed view on how extension development may work, and also how EntitySchema evolved in the past.

Roughly speaking, and glossing over small unimportant patches, development looked something like this:

So a working extension was created and released to the community in a single quarter!

Many of the things that the team had to work through above already existed, one way or another, within Wikibase, which either forces you to use them or optionally provides them for you to use:

Edit conflict handling, ID generation, undos, restores and patches, translatable edit summaries (as well as basic out-of-the-box edit summaries), search index integration, figuring out how to configure MediaWiki content correctly (not editable directly, immovable namespaces etc.), special pages for basic interactions, diffing, storage.

This list, and the first iteration of development above, will become more relevant as I continue this series of blog posts, so watch out for what is next!

The intermediate

Between 2019 and 2023, a series of small pokes, prods, adjustments and minor updates happened to the extension.

2023 focus

In 2023, a second round of work started on the extension. This was tracked in a series of milestones (M1-M5) on Phabricator, with the overall goals of making EntitySchemas linkable from statements on Entities, having them appear as formatted text instead of IDs in many MediaWiki interfaces, and having them use a standard termbox (as is done on Items and Properties).

In summary:

And this is the time period that saw a series of ADRs start to be written about how the extension was going to continue being developed.

Ultimately these overall goals, I would argue, come for free from Wikibase and being an Entity, as did many of the code patches that needed to be implemented in the 2019 effort. And this is again something I hope to explore in future posts.

Complexity of Wikibase

ADR1 does reference the inherent complexity of Wikibase in its introduction, but I’ll refer back to some quotes from a prior post here.

Many lessons have been learnt throughout their current ~11 year lifespan, and reimplementing the idea of human-readable text for an ID from scratch will lead to things being missed, and years down the line ultimately (and hopefully) ending up in the same place, just now with 2 systems to maintain.

addshore.com – Lexeme and MediaInfo, implementing EntityDocument – June 2024

The quote above is specifically talking about labels, descriptions and aliases, and how they are handled in Wikibase and mapped to MediaWiki, in particular how they are stored and exposed to users.

Roughly, what I see reflecting on the ~5 year lifetime of the EntitySchema extension is exactly this.

Something minimal was implemented, which met neither the requirements that Product was already aware of, nor the feature level that Wikibase and Entities already provided across the board. And many years later, M3 is finally delivering this baseline functionality to the extension.

I think part of the attraction of rewrites is that it saves you the trouble of understanding the old system. But in my experience, that just means that you’ll repeat the mistakes because you didn’t take the time to learn from the past.

Wise-ish German man

I don’t think it’s necessarily only mistakes that end up being repeated, but also core functionality that ends up being missed because it is assumed by some, or not understood by others, and seen as unnecessary complexity.

EntitySchema is now kind of an Entity?

Now to talk about the main reason that I wrote this post at all…

The reason is the “Support additional types in wbsearchentities” Gerrit change that was merged in the past weeks.

In a nutshell, this patch adds EntitySchemas, which are not entities in code, to the wbsearchentities action API module of Wikibase so that they can be found by their ID.

https://www.wikidata.org/w/api.php?action=wbsearchentities&search=E24&language=en&format=jsonfm&type=entity-schema
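The same request can be assembled in code. This is a minimal sketch in plain PHP that reuses only the parameters visible in the URL above. The helper name buildSchemaSearchUrl is hypothetical, and format=json is swapped in for the human-readable jsonfm on the assumption that a script, rather than a person, will read the response.

```php
<?php
// Hypothetical helper (not part of Wikibase or EntitySchema): builds a
// wbsearchentities URL that searches for EntitySchemas.
function buildSchemaSearchUrl( string $search, string $language = 'en' ): string {
	$params = [
		'action' => 'wbsearchentities',
		'search' => $search,
		'language' => $language,
		// Assumption: json instead of jsonfm, for machine consumption.
		'format' => 'json',
		// The "type" value is the one used in the URL above.
		'type' => 'entity-schema',
	];
	return 'https://www.wikidata.org/w/api.php?' . http_build_query( $params );
}

echo buildSchemaSearchUrl( 'E24' ), "\n";
```

Swapping the type value back to a registered entity type such as item leaves the rest of the request unchanged, which is exactly why the API makes EntitySchemas look like entities from the outside.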

So, according to wbsearchentities, EntitySchemas are now Entities: they have an entityId, and also labels, descriptions and aliases that share at least some commonality with those of all other entities.

Reasoning & Approach

This change relates to the M2 milestone in the last batch of work done on the extension, “Linking to EntitySchemas in statements”.

What I believe happened is the team implemented their own “expert” (a UI element in Wikibase that is used for editing a particular type of data, that is normally the target value of a statement).

But along part of that journey, likely to save time, they decided to make use of the existing wbsearchentities API to return the results for the expert to use. Some other modifications were also needed to Wikibase to enable this “non entity id” value to be used as a statement value.

This has ultimately led to a working feature, but I’d argue this comes at some rather confusing cost.

Implications

EntitySchema is now in a weird middle ground. As time progresses, it slowly looks more and more like an entity, but it still doesn’t quack like one, and when looking under the surface, the complexity required to maintain this second not-an-entity system alongside the existing entity system is going up and up.

Take a look specifically at the way EntitySchema now hooks into the Wikibase entity search. The wiring for search for all entities used to look like this:

'WikibaseRepo.EntitySearchHelperCallbacks' => function ( MediaWikiServices $services ): array {
	return WikibaseRepo::getEntityTypeDefinitions( $services )
		->get( EntityTypeDefinitions::ENTITY_SEARCH_CALLBACK );
},

Ultimately this looks at all registered entities, and uses their defined ENTITY_SEARCH_CALLBACK to add things to the Wikibase entity search results.

Because EntitySchema is not an entity, it cannot make use of entity registration, so adding it to this search has introduced another hook at this point for things that are not entities to use.

'WikibaseRepo.EntitySearchHelperCallbacks' => function ( MediaWikiServices $services ): array {
	$callbacks = WikibaseRepo::getEntityTypeDefinitions( $services )
		->get( EntityTypeDefinitions::ENTITY_SEARCH_CALLBACK );

	$services->getHookContainer()->run( 'WikibaseRepoEntitySearchHelperCallbacks', [ &$callbacks ] );
	return $callbacks;
},

It may look small, but this is ultimately the start of a second entity registration system.
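To make the “second registration system” point concrete, here is a toy model in plain PHP with no MediaWiki or Wikibase dependencies. Every name in it ($entityTypeDefinitions, $hookHandlers, collectSearchCallbacks) is hypothetical and only mimics the shape of the wiring above: one list of callbacks comes from entity type definitions, and a second, hook-style path lets anything append to that list.

```php
<?php
// Toy model only: none of these names exist in Wikibase.

// Path 1: entity types registered through definitions, each with a search callback.
$entityTypeDefinitions = [
	'item' => [ 'search-callback' => fn ( string $text ): array => [ "item result for $text" ] ],
	'property' => [ 'search-callback' => fn ( string $text ): array => [ "property result for $text" ] ],
];

// Path 2: hook-style handlers that may add callbacks for types with no definition.
$hookHandlers = [
	function ( array &$callbacks ): void {
		$callbacks['entity-schema'] = fn ( string $text ): array => [ "schema result for $text" ];
	},
];

function collectSearchCallbacks( array $definitions, array $handlers ): array {
	// Start from what the registered entity types provide...
	$callbacks = array_map( fn ( array $def ) => $def['search-callback'], $definitions );
	// ...then let every handler modify the list by reference, as a hook would.
	foreach ( $handlers as $handler ) {
		$handler( $callbacks );
	}
	return $callbacks;
}

$callbacks = collectSearchCallbacks( $entityTypeDefinitions, $hookHandlers );
// 'entity-schema' is searchable, yet it never appears in $entityTypeDefinitions.
print_r( array_keys( $callbacks ) );
```

In this sketch, search knows about three types while the definitions know about two, so any code that asks the definitions which entity types exist will never see the third.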

More explanation is now included alongside some parts of Wikibase code, explaining this complexity too.

/**
 * @internal
 * @return string[] List of entity type identifiers for search.
 * This includes all the {@link self::getEnabledEntityTypes() enabled entity types},
 * and potentially additional types that are not registered with Wikibase’s entity registration yet.
 * Such “types” must be used with caution, as they may not support anything other than search.
 */
public static function getEnabledEntityTypesForSearch( ContainerInterface $services = null ): array {
	return ( $services ?: MediaWikiServices::getInstance() )
		->get( 'WikibaseRepo.EnabledEntityTypesForSearch' );
}

Wording in many places along this search path just no longer makes sense in terms of what an Entity is known to be in code within the Wikibase extension.

What if?

My general hope is that as the team comes to need more and more things that are provided by Wikibase Entities “for free” via existing interfaces, or as the team needs EntitySchemas to exist within the Wikibase ecosystem, rather than next to it (such as adding to the query service), EntitySchema will eventually end up an Entity.

Of course there is work to be done around Entity registration, and tidying up the legacy that has existed for over 10 years at this point.

I highlighted many of these issues some years ago in a badly named branch to Wikibase where I added a new Entity type. Ultimately, this will be the branch that I will rewrite in my next few blog posts, stepping through issues the code changes identify.
