Addshore

It's a blog

Tag: Wikimedia (page 1 of 2)

Using Hue & Hive to quickly determine Wikidata API maxlag usage

Hue, or Hadoop User Experience, is described by its documentation pages as “a Web application that enables you to easily interact with an Hadoop cluster”.

The Wikimedia Foundation has a Hue frontend for their Hadoop cluster, which contains various datasets including web requests, API usage and the MediaWiki edit history for all hosted sites. The install can be accessed at https://hue.wikimedia.org/ using Wikimedia LDAP for authentication.

Once logged in, Hue can be used to write Hive queries with syntax highlighting, auto suggestions and formatting. It also allows users to save queries with names and descriptions, run queries from the browser and watch Hadoop job execution state.
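For a flavour of what such a query can look like when run programmatically rather than through the Hue UI, here is a minimal Python / PyHive sketch. The host, credentials, database and table/column names are illustrative assumptions, not details from the post.

    # Minimal sketch: running a Hive query from Python with PyHive.
    # The host, credentials, database and table/column names are assumptions
    # for illustration only; they are not taken from the post.
    from pyhive import hive

    conn = hive.Connection(
        host="analytics-hive.example.org",  # placeholder Hive server
        port=10000,
        username="someuser",
        database="wmf",
    )
    cursor = conn.cursor()

    # Count API requests that carried a maxlag parameter, per day.
    cursor.execute("""
        SELECT year, month, day, COUNT(*) AS maxlag_requests
        FROM webrequest
        WHERE uri_path = '/w/api.php'
          AND uri_query LIKE '%maxlag=%'
          AND year = 2018 AND month = 6
        GROUP BY year, month, day
        ORDER BY year, month, day
    """)
    for row in cursor.fetchall():
        print(row)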

The Wikidata & maxlag bit

MediaWiki has a maxlag API parameter that can be passed alongside API requests in order to cause errors / stop writes from happening when the DB servers are lagging behind the master. Within MediaWiki this lag can also be raised when the JobQueue is very full. Recently Wikibase introduced the ability to raise this lag when the dispatching of changes to client projects falls too far behind. In order to see how effective this will be, we can take a look at previous API calls.
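For context, this is roughly how a well-behaved API client uses maxlag: send the parameter with each request and back off whenever the API responds with a maxlag error. A minimal sketch in Python (illustrative, not code from the post):

    # Minimal sketch of a maxlag-aware API client (illustrative, not code from the post).
    import time
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def api_get(params, max_retries=5):
        """Perform an API GET, backing off whenever a maxlag error is returned."""
        params = dict(params, format="json", maxlag=5)  # ask the API to refuse work if lag > 5 seconds
        for _ in range(max_retries):
            response = requests.get(API, params=params)
            data = response.json()
            if data.get("error", {}).get("code") == "maxlag":
                # The servers are lagged; wait (Retry-After is usually set) and retry.
                time.sleep(int(response.headers.get("Retry-After", 5)))
                continue
            return data
        raise RuntimeError("API stayed lagged for too long")

    print(api_get({"action": "query", "meta": "siteinfo"}))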

Continue reading

The Wikimedia Server Admin Logs

The Wikimedia Server Admin Log, or SAL for short, is a timestamped log of actions performed on the Wikimedia cluster by users such as roots and deployers. The log is stored on the Wikitech Wikimedia project and can be found at the following URL: https://wikitech.wikimedia.org/wiki/Server_Admin_Log

Each entry in the log simply pairs a timestamp and a username with the action that was performed.

As well as the main cluster SAL, there are also logs for release engineering (Jenkins, Zuul and other CI things) and individual logs for each project that uses Wikimedia Cloud VPS.

A tool has been created for easy SAL navigation which can be found at https://tools.wmflabs.org/sal

Each SAL can be selected at the top of the tool, with ‘Other’ providing you with a list of all Cloud VPS SALs.

The search and date filters can then be used to find entries throughout history.

Continue reading

WMDE: CI, Deploys & Config changes intro

This is an internal WMDE presentation made to introduce people to the land of Wikimedia CI, MediaWiki deployment and config changes.

This briefly covers:

  • Jenkins & Zuul
  • CI Config
  • Beta Cluster
  • SWAT
  • The MediaWiki train
  • Monitoring
  • mediawiki-config
  • The Docker & Kubernetes future?

From 0 to Kubernetes cluster with Ingress on custom VMs

While working on a new MediaWiki project, and trying to set up a Kubernetes cluster on Wikimedia Cloud VPS to run it on, I hit a couple of snags. These were mainly to do with ingress into the cluster through a single static IP address and some sort of load balancer, which is usually provided by your cloud provider. I faffed around with various NodePort setups, custom load balancer setups and ingress configurations before finally getting to a solution that worked for me, using an ingress and a Traefik load balancer.

Below you’ll find my walkthrough, which works on Wikimedia Cloud VPS. Cloud VPS is an OpenStack-powered public cloud solution. The walkthrough should also work for any other VPS host, or a bare-metal setup, with few or no alterations.

Continue reading

Wikibase docker images

This is a belated post about the Wikibase docker images that I recently created for the Wikidata 5th birthday. You can find the various images on Docker Hub and matching Dockerfiles on GitHub. Combined, these images allow you to quickly create Docker containers for Wikibase backed by MySQL, with a SPARQL query service running alongside and updating live from the Wikibase install.

A setup was demoed at the first WikidataCon event in Berlin on the 29th of October 2017; it can be seen at roughly 41:10 in the “demo of presents” video embedded below.

Continue reading

Wikimedia Hackathon 2017: mediawiki-docker-dev showcase presentation

This presentation was used during the 2017 Wikimedia Hackathon showcase presentations.

The code shown can be found @ https://github.com/addshore/mediawiki-docker-dev

Wikimedia Commons Android App Pre-Hackathon

Wikimedia Commons Logo

The Wikimedia Commons Android App allows users to upload photos to Commons directly from their phone.

The website for the app details some of the features and the code can be found on GitHub.

A hackathon was organized in Prague to work on the app in the run-up to the yearly Wikimedia Hackathon, which is in Vienna this year.

A group of seven developers worked on the app over a few days; as well as meeting and learning from each other, they managed to work on various improvements, which I have summarised below.

2 factor authentication (nearly)

Work has been done towards allowing 2FA logins in the app.

Lots of the login & authentication code has been refactored, and the app now uses the clientlogin API module provided by MediaWiki instead of the older login module.

In debug builds the 2FA input box will appear if you have 2FA login enabled; however, the current production build will not show this box and will simply display a message saying that 2FA is not currently supported. This is due to a small amount of session handling work that the app still needs.
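For reference, a rough sketch of what a clientlogin flow with a 2FA step looks like against the MediaWiki action API. This is illustrative Python rather than the app’s own Android code, and the credentials are obviously placeholders:

    # Rough sketch of a MediaWiki clientlogin flow with a 2FA (OATH) step.
    # Illustrative Python only; the Commons app itself is an Android client.
    import requests

    API = "https://commons.wikimedia.org/w/api.php"
    session = requests.Session()

    # 1. Fetch a login token.
    token = session.get(API, params={
        "action": "query", "meta": "tokens", "type": "login", "format": "json",
    }).json()["query"]["tokens"]["logintoken"]

    # 2. Submit username and password via clientlogin.
    result = session.post(API, data={
        "action": "clientlogin", "format": "json",
        "username": "ExampleUser", "password": "example-password",  # placeholders
        "loginreturnurl": "https://commons.wikimedia.org/",
        "logintoken": token,
    }).json()["clientlogin"]

    # 3. If 2FA is enabled the API answers with status UI and asks for a code.
    if result["status"] == "UI":
        result = session.post(API, data={
            "action": "clientlogin", "format": "json",
            "logincontinue": 1,
            "OATHToken": "123456",  # the 6 digit code from the authenticator app
            "logintoken": token,
        }).json()["clientlogin"]

    print(result["status"])  # "PASS" on success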

Better menu & Logout

As development on the app was fairly non-existent between mid-2013 and 2016, the UI generally fell behind. This is visible in forms and buttons as well as the app layout.

One significant push was made to drop the old style ‘burger’ menu from the top right of the app and replace it with a new slide-out menu drawer, including a feature image and icons for menu items.

Uploaded images display limit

Some users have run into issues with the number of upload contributions that the app loads by default in the contributions activity. The default has always been 500, and this can cause memory exhaustion / OOM and a crash on some memory-limited phones.

In an attempt to fix this and generally speed up the app, a recent upload limit has been added to the settings which limits the number of images and image details that are displayed; however, the app will still fetch and store more than this on the device.

Nearby places enhancements

The nearby places enhancements probably account for the largest portion of development time at the pre-hackathon. The app has always had a list of nearby places that don’t have images on Commons, but now the app also has a map!

The map is powered by the Mapbox SDK and the current beta uses the Mapbox tiles; however, part of the plan for the Vienna hackathon is to switch this to using the Wikimedia-hosted map tiles at https://maps.wikimedia.org.

The map also contains clickable pins that provide a small pop-up pulling information from Wikidata, including the label and description of the item, as well as two buttons to get directions to the place or read the Wikipedia article.
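The label and description shown in those pop-ups come from the Wikidata API; fetching them looks roughly like the Python sketch below (illustrative only, the app itself does this from its Android code):

    # Sketch: fetching an item's label and description from the Wikidata API,
    # roughly the information the map pop-ups display. Python for illustration;
    # the app itself does this from its Android code.
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def label_and_description(item_id, lang="en"):
        data = requests.get(API, params={
            "action": "wbgetentities",
            "ids": item_id,
            "props": "labels|descriptions",
            "languages": lang,
            "format": "json",
        }).json()
        entity = data["entities"][item_id]
        label = entity.get("labels", {}).get(lang, {}).get("value")
        description = entity.get("descriptions", {}).get(lang, {}).get("value")
        return label, description

    print(label_and_description("Q64"))  # e.g. ('Berlin', 'capital of Germany')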

Image info coordinates & image date

Extra information has also been added to the image details view, and the date and coordinates of an image can now be seen in the app.

Summary of hackathon activity

The contributions and authors that worked on the app during the pre-hackathon can be found on GitHub.

Roughly 66 commits were made between the 11th and 19th of May 2017 by 9 contributors.

Screenshot Gallery

The RevisionSlider

The RevisionSlider is an extension for MediaWiki that has just been deployed on all Wikipedias and other Wikimedia websites as a beta feature. The extension was developed by Wikimedia Germany as part of their focus on the technical wishes of the German-speaking Wikimedia community. This post will look at the RevisionSlider’s design, development and use so far.

Continue reading

WMDE: Metrics & Data Gatherings

Below you will find an internal WMDE presentation covering the general area of WMDE Metrics & Data Gatherings from 2016.

This presentation follows on from the initial introduction to engineering analytics activities.

The presentation skims through:

  • WMDE Grafana dashboards
  • The Wikimedia Analytics landscape
  • Grafana & graphite
  • Hadoop, Kafka, Hive & Oozie
  • EventLogging, Mysql replicas & MediaWiki logs
  • Our Analytics scripts
  • How to get access

Un-deleting 500,000 Wikidata items

Since some time in January of this year, I have been on a mission to un-delete all Wikidata items that were merged into other items before the redirect functionality of Wikidata existed. Finally I am done (well, nearly). This is the short story…

Reasoning

Earlier this year I pointed out the importance of redirects on Wikidata in a blog post. At the time I was amazed that the community very nearly decided not to create redirects for merged items… but thank the higher powers that the discussion swung in favour of redirects.

Redirects are needed to maintain the persistent identifiers that Wikidata provides. When two items relate to the same concept they are merged, and one of the identifiers must then be left pointing to the identifier that now holds the data for the concept.

Listing approach

Since Wikidata began there have been around 1,000,000 log entries deleting pages, which equates to roughly the same number of items deleted, although some deleted items may also have been restored. This was a great starting point. The basic query used to get this result can be found below.
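The query itself is not reproduced in this excerpt; a reconstruction of its general shape, run against a replica of the wikidatawiki database, might look something like this (connection details are placeholders):

    # A reconstruction of the general shape of the listing query; the original
    # query is not reproduced in this excerpt and the connection details below
    # are placeholders.
    import pymysql

    connection = pymysql.connect(
        host="wikidatawiki.replica.example",  # placeholder replica host
        user="someuser",
        password="somepassword",
        database="wikidatawiki_p",
    )

    with connection.cursor() as cursor:
        # Deletion log entries for pages in the item namespace (namespace 0 on Wikidata).
        cursor.execute("""
            SELECT log_title, log_comment, log_timestamp
            FROM logging
            WHERE log_type = 'delete'
              AND log_action = 'delete'
              AND log_namespace = 0
        """)
        print(cursor.rowcount, "deletion log entries")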

I removed quite a few items from this initial list by looking at items that had already been restored and were already redirects. To do this I had to find all of the redirects!

At this stage I could probably have tried to remove more items depending on whether they currently exist, but there was very little point. In fact it turned out that there was very little point in the above query either, as prior to my run very few items had been un-deleted in order to create redirects.

The next step was to determine which of the logged deletions were actually due to the item being merged into another item. This is fairly easy, as most merges used the merge gadget on Wikidata.org. So if the summary matched a regular expression looking for merge or duplicate wording, I would assume the item was deleted due to being merged into / being a duplicate of another item.

And of course, in order to create a redirect I would have to be able to identify a target, so the summary also had to contain a matching Q item ID link.
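The exact patterns are not included in this excerpt, but the filtering can be sketched roughly as follows; the regular expressions below are illustrative rather than the ones actually used:

    # Illustrative sketch of filtering deletion summaries; not the exact patterns used.
    import re

    # Summaries left by the merge gadget typically mention merging / duplication
    # and link the target item, e.g. "Merged into [[Q42]]".
    MERGE_SUMMARY = re.compile(r"(merge|duplicate)", re.IGNORECASE)
    TARGET_ITEM = re.compile(r"\[\[(Q\d+)\]\]")

    def merge_target(summary):
        """Return the target item ID if the summary looks like a merge deletion."""
        if not MERGE_SUMMARY.search(summary):
            return None
        match = TARGET_ITEM.search(summary)
        return match.group(1) if match else None

    print(merge_target("Merged into [[Q42]]"))     # Q42
    print(merge_target("Not notable, test item"))  # None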

I then had a fairly nice list, although it was still large, and it was time to actually start trying to create these redirects!

Editing approach

Firstly, I should point out that such a task is only possible using an admin account, as you need to be able to see deleted revisions and un-delete items. Secondly, it is not possible to create a redirect over a deleted item, and it is also not possible to restore an item when doing so would create a conflict on the site, for example due to duplicate sitelinks or duplicate label and description combinations.

I split the list up into 104 different sections, each containing exactly 10,000 item IDs. I could then fire up multiple processes to create these redirects and make the task go as quickly as possible.

The process of touching a single ID was as follows (a rough sketch in code comes after the list):

  1. Make sure that the target of the merge exists. If it does not, log to a file; if it does, continue.
  2. Try to un-delete the item. If the un-deletion fails, log to a file; if it succeeds, continue.
  3. Try to clear the item (as you can only create redirects over empty items). This either results in an edit or no edit; it doesn’t really matter which.
  4. Try to create the redirect; this should never fail! If it does, log to a fail file that I can clean up afterwards.
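That loop, sketched against the MediaWiki / Wikibase action API (the real run used my own admin-authenticated tooling, so treat this as an outline rather than the actual script):

    # Rough outline of the per-item loop (not the actual script that was run).
    # Assumes an already authenticated admin session and a valid CSRF token.
    import requests

    API = "https://www.wikidata.org/w/api.php"
    session = requests.Session()  # would need to be logged in as an admin
    csrf_token = "..."            # placeholder; fetched via meta=tokens in reality

    def exists(item_id):
        data = session.get(API, params={
            "action": "wbgetentities", "ids": item_id, "format": "json",
        }).json()
        return "missing" not in data["entities"][item_id]

    def process(item_id, target_id, log):
        # 1. Make sure the merge target exists.
        if not exists(target_id):
            log.write(f"{item_id}: target {target_id} missing\n")
            return
        # 2. Try to un-delete the item.
        undeleted = session.post(API, data={
            "action": "undelete", "title": item_id,
            "token": csrf_token, "format": "json",
        }).json()
        if "error" in undeleted:
            log.write(f"{item_id}: un-deletion failed: {undeleted['error']['code']}\n")
            return
        # 3. Clear the item (redirects can only be created over empty items).
        session.post(API, data={
            "action": "wbeditentity", "id": item_id, "clear": 1, "data": "{}",
            "token": csrf_token, "format": "json",
        })
        # 4. Create the redirect; log to a "fail file" if it somehow fails.
        redirect = session.post(API, data={
            "action": "wbcreateredirect", "from": item_id, "to": target_id,
            "token": csrf_token, "format": "json",
        }).json()
        if "error" in redirect:
            log.write(f"{item_id}: redirect failed: {redirect['error']['code']}\n")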

The approach on the whole worked very well. As far as I know there were no incorrect un-deletions, and nothing failed in the middle.

The first of two snags that I hit was that the rate at which I was editing was causing the dispatch lag on Wikidata to increase. There was no real solution to this other than to keep an eye on the lag and stop editing whenever it rose above a certain level.

The second snag was that I caused multiple database deadlocks during the final day of running, although again this was not really a snag, as all of the transactions recovered. The deadlocks can be seen in the graph below.

The result

  • 500,000 more item IDs now point to the correct locations.
  • We have an accurate idea of how many items have actually been deleted due to not being notable / being test items.
  • The reasoning for redirects has been reinforced in the community.

Final note

One of the steps in the editing approach was to attempt to un-delete an item and, if the un-deletion failed, to log the item ID to a log file.

As a result I have now identified a list of roughly 6,000 items that should be redirects but cannot currently be un-deleted in order to create them.

See https://phabricator.wikimedia.org/T71166

It looks like there is still a bit of work to be done!

Again, sorry for the lack of images :/

Older posts

© 2018 Addshore
