Using Hue & Hive to quickly determine Wikidata API maxlag usage

July 3, 2018 By addshore

Hue, or Hadoop User Experience, is described by its documentation pages as “a Web application that enables you to easily interact with an Hadoop cluster”.

The Wikimedia Foundation runs a Hue frontend for its Hadoop cluster, which contains various datasets including web requests, API usage and the MediaWiki edit history for all hosted sites. The installation can be accessed at https://hue.wikimedia.org/ using Wikimedia LDAP for authentication.

Once logged in, you can use Hue to write Hive queries with syntax highlighting, auto suggestions and formatting. It also lets users save queries with names and descriptions, run queries from the browser and watch the execution state of the resulting Hadoop jobs.

The Wikidata & maxlag bit

MediaWiki has a maxlag API parameter that can be passed alongside API requests. When the DB servers are lagging behind the master by more than the given number of seconds, the request fails with an error instead of performing a write. Within MediaWiki this lag value can also be raised when the JobQueue is very full. Recently Wikibase introduced the ability to raise this lag when the dispatching of changes to client projects is also lagging behind. In order to see how effective this will be, we can take a look at previous API calls.
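From the client side, the behaviour described above can be sketched roughly as follows. This is a minimal Python sketch, not code from MediaWiki or Wikibase: the send callable, the retry count and the wait time are hypothetical stand-ins for whatever HTTP client and back-off policy a bot actually uses. The only API detail relied on is that a request refused because of lag comes back with the error code "maxlag".

```python
import time
from typing import Any, Callable, Dict

# Hypothetical transport: takes request parameters, returns the decoded JSON reply.
Send = Callable[[Dict[str, str]], Dict[str, Any]]


def is_maxlag_error(response: Dict[str, Any]) -> bool:
    """Return True if the API refused the request because of lag."""
    error = response.get("error")
    return isinstance(error, dict) and error.get("code") == "maxlag"


def call_with_maxlag(
    send: Send,
    params: Dict[str, str],
    maxlag: int = 5,
    retries: int = 3,              # assumption: give up after three retries
    wait_seconds: float = 5.0,     # assumption: fixed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Send a request with maxlag set, pausing and retrying while the servers lag."""
    request = dict(params, maxlag=str(maxlag), format="json")
    response = send(request)
    for _ in range(retries):
        if not is_maxlag_error(response):
            break
        sleep(wait_seconds)  # give the servers time to catch up
        response = send(request)
    return response  # may still be a maxlag error if every attempt was refused
```

A client that leaves maxlag out, or sets it very high, never enters the retry loop and keeps writing no matter how far behind the servers are.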

Within the Hadoop DataLake there is an apiaction table that contains all API calls to Wikimedia sites (which includes Wikidata) along with the parameters used (with some data redacted) and other details such as user agent.

The query below counts the successful calls to the Wikidata API in June 2018 that used one of the Wikidata writing API actions and did not come from internal services, grouped by the maxlag value that was passed.

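-- Count successful Wikidata write calls per maxlag value
SELECT COUNT(*) AS COUNT,
       params["maxlag"] AS maxlag
FROM apiaction
WHERE wiki = "wikidatawiki"
  -- only calls that did not end in an error
  AND haderror = FALSE
  -- only the Wikidata writing API actions
  AND params["action"] RLIKE '^wbl?(create|edit|set|add|remove|link|merge)'
  -- exclude calls from internal services
  AND useragent != '127.0.0.1'
  -- restrict to June 2018
  AND YEAR = 2018
  AND MONTH = 6
GROUP BY params["maxlag"]
ORDER BY COUNT DESC LIMIT 25;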

In Hue this looks like:

The play button to the left of the query can be used to start the job, and the running query will then appear in the “Query History” section of the page:

Once the query has completed, the raw results can be viewed in the browser under the “Results” tab:

And the results can even be quickly visualized in the browser (using more than just a pie chart…):

This shows us that the majority of writing API calls use a maxlag of 5. However, around 1/3 of calls in June 2018 passed either no maxlag value or a maxlag value so high that it would probably never be reached during regular operation.
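The same split can be computed from the raw result rows rather than read off the chart. The Python sketch below assumes the rows have been exported from Hue as (count, maxlag) pairs, with a missing maxlag represented as None. The bucket names and the threshold of 5 for what counts as a usable value are assumptions made for illustration, not definitions from MediaWiki.

```python
from typing import Dict, Iterable, Optional, Tuple


def summarise_maxlag(
    rows: Iterable[Tuple[int, Optional[str]]],
    threshold: float = 5.0,  # assumption: values above this rarely trigger
) -> Dict[str, float]:
    """Return the share of calls per bucket: 'none', 'effective', 'too_high'."""
    totals = {"none": 0, "effective": 0, "too_high": 0}
    for count, maxlag in rows:
        try:
            value = float(maxlag) if maxlag is not None else None
        except ValueError:
            value = None  # unparseable values are treated as missing
        if value is None:
            totals["none"] += count
        elif value <= threshold:
            totals["effective"] += count
        else:
            totals["too_high"] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: total / grand_total for name, total in totals.items()}
```

Adding the 'none' and 'too_high' shares gives the fraction of write calls that would not back off under lag. Because the query is limited to the 25 most common maxlag values, shares computed this way cover only those rows.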