Wikidata maxlag, via the ApiMaxLagInfo hook

March 4, 2022, by addshore

Wikidata builds on the concept of maxlag, which has existed in MediaWiki for many years, in order to slow automated editing when various systems are lagged.

Here you will find a little introduction to MediaWiki maxlag, and the ways that Wikidata hooks into the value, altering it for its needs.

Screenshot of the “Wikidata Edits” grafana dashboard showing increased maxlag and decreased edits

As you can see above, a high maxlag can cause automated editing to reduce or stop on wikidata.org.

MediaWiki maxlag

Maxlag was introduced in MediaWiki 1.10 (2007), moving to an api.php-only implementation in 1.27 (2016).

If you are running MediaWiki on a replicated database cluster, then maxlag will indicate the number of seconds of replication database lag.

Since MediaWiki 1.29 (2017) you have also been able to increase the maxlag reported to users based on the number of jobs in the job queue, using the $wgJobQueueIncludeInMaxLagFactor setting.
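As a minimal LocalSettings.php sketch (the factor value here is illustrative):

// Include the job queue in the maxlag calculation.
// The job count is divided by this factor, so with a factor of 100
// a backlog of 500 jobs would be reported as 5 seconds of lag.
$wgJobQueueIncludeInMaxLagFactor = 100;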

The maxlag parameter can be passed to api.php through a URL parameter or POST data. It is an integer number of seconds.
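For example, as a URL parameter on any API request (shown here against wikidata.org for illustration):

https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&maxlag=5&format=json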

If the specified lag is exceeded at the time of the request, the API returns an error (with a 200 status code, see task T33156) like the following:

{
  "error": {
    "code": "maxlag",
    "info": "Waiting for 10.64.48.35: 0.613402 seconds lagged.",
    "host": "10.64.48.35",
    "lag": 0.613402,
    "type": "db"
  },
  "servedby": "mw1375"
}

(See Manual:Maxlag parameter on mediawiki.org for more details.)

Users can retrieve the current maxlag at any time by sending a value of -1 (the lag is always 0 or more, so this always triggers the error).
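For example, an illustrative probe (any api.php endpoint works):

https://www.wikidata.org/w/api.php?action=query&maxlag=-1&format=json

The response will be the maxlag error shown earlier, with the current value in the lag field.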

Generally speaking, the maxlag on most sites with replication should be below a second.

Users of Wikimedia sites are recommended to use a maxlag=5 value, and this is the default in many tools such as Pywikibot. You can read more about this on the API Etiquette page.

Modifying maxlag

Back in 2018, in response to ongoing Wikidata dispatch-lag-related issues, we implemented a way for extensions to modify how and when the maxlag error was shown to users. Conveniently, the maxlag error already included a type value, and we planned to add another!

The new hook ApiMaxLagInfo was born (gerrit change, documentation).

Receivers of the hook get the MediaWiki-calculated $lagInfo and can decide whether a system they care about is more lagged than the value they have been passed. If it is, they can overwrite $lagInfo and pass it on (a minimal handler sketch follows the diagram below).

In the diagram below we can see this in action:

  • MediaWiki determines the maxlag to be 0.7s from its own sources (replication and the optional job queue)
  • Hook implementation 1 determines its own maxlag to be 0.1s, and decides not to change the already existing 0.7s
  • Hook implementation 2 determines its own maxlag to be 1.5s, so it replaces the lower 0.7s
  • The final maxlag is 1.5s, and this is what is used when checking the maxlag parameter provided by the user, or when displaying the lagged value to the user
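For illustration, here is a minimal handler sketch, registered via an ApiMaxLagInfo entry in an extension's Hooks configuration; the class context and the getOwnLag() helper are hypothetical:

// Hypothetical ApiMaxLagInfo handler: only overwrite the lag info
// if our own system is more lagged than the value passed in.
public static function onApiMaxLagInfo( array &$lagInfo ): void {
	$ownLag = self::getOwnLag(); // hypothetical: this system's lag in seconds
	if ( $ownLag > $lagInfo['lag'] ) {
		$lagInfo = [
			'host' => wfHostname(), // where the lag was measured
			'lag' => $ownLag,
			'type' => 'my-system', // a new maxlag type, as described above
		];
	}
}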

Factors

As with the optional MediaWiki job queue maxlag integration, all usages of the ApiMaxLagInfo hook generally come with their own configurable factor.

This is due to the expectation that ~5 seconds of lag is when automated clients should back off.

For some systems, we actually want that to happen at, say, 2 minutes of lag, or when the job queue has 1000 entries. The factor allows this translation.

// e.g. 1000 jobs with a factor of 200 gives a reported lag of 5 seconds
$lag = $jobs / $factor;

Dispatching

Dispatching or change propagation has been part of Wikibase since the early days. This mechanism keeps track of changes that happen on Wikibase, emitting events in the form of MediaWiki Jobs to any clients (such as Wikipedia) that need to be notified about the change for one reason or another.

Historically dispatching has had some issues with being slow, which in turn leads to updates not reaching sites such as Wikipedia in a reasonable amount of time. This is deemed to be bad as it means that things such as vandalism fixes take longer to appear.

Before dispatch lag was added to maxlag, the Wikidata team already had monitoring for the lag, and would often run additional copies of the dispatch process to clear backlogs.

You can find numerous issues relating to dispatch lag, and before it was added to maxlag, Wikidata admins would normally go around talking to editors making large numbers of edits, or blocking bots.

Dispatch lag was originally added as a type of maxlag in 2018, and was the first usage of the ApiMaxLagInfo hook.

The value used was the median lag for all clients with a factor applied. This was calculated during each request, as this median lag value was readily available in Wikibase code.

$dispatchLag = $medianLag / $dispatchLagToMaxLagFactor;

Later in 2018 we inadvertently made dispatching much faster. The whole system was rewritten in 2021 as part of T48643: [Story] Dispatching via job queue (instead of cron script), and a decision was made to no longer include dispatch lag in maxlag.

Query service updates

The query service update lag was added as a type of maxlag on Wikidata in 2019, due to the ongoing issues that the query service was having staying up to date with Wikidata. You can find an ADR on the decision in the Wikidata.org MediaWiki extension.

The value used is calculated in a maintenance script that runs every minute. This script takes the lastUpdated values of the query service backends, as recorded in Prometheus, and looks for the backend that is the next most lagged after the median. This value is then stored in a cache with a TTL of around 70 seconds. (code)

public function getLag(): ?int {
	// getLags() returns the per-backend lag values collected from Prometheus
	$lags = $this->getLags();
	// Sort, then take the entry one position above the median
	sort( $lags );
	return $lags[(int)floor( count( $lags ) / 2 + 1 )] ?? null;
}

During every request, this cached value is then checked and a factor applied. (code)

$dispatchLag = $store->getLag() / $factor;

In 2021 the Wikidata Query Service fully switched over to its new streaming updater, which should mostly tackle the lag issues.