Addshore

It's a blog

Tag: Wikidata (page 2 of 3)

Impact of Wikimania Mexico 2015 on Wikidata

Recently Wikidata celebrated its third birthday. For the occasion I again ran the map generation script that I have talked about before, to see what had changed in the geo-coordinate landscape of Wikidata!

I found, well, Mexico blossomed!

The image on the left is from June 2015, the one on the right from October 2015, and Wikimania was in July 2015!

I will be keeping an eye out for what happens on the map around Esino Lario in 2016 to see what impact the event has on Wikidata again.

Full maps

Un-deleting 500,000 Wikidata items

Since some time in January of this year I have been on a mission to un-delete all Wikidata items that were merged into other items before the redirect functionality of Wikidata existed. Finally I am done (well nearly). This is the short story…

Reasoning

Earlier this year I pointed out the importance of redirects on Wikidata in a blog post. At the time I was amazed that the community nearly decided not to create redirects for merged items… but thank the higher powers the discussion swung in favour of redirects.

Redirects are needed to maintain the persistent identifiers that Wikidata has. When two items describe the same concept they are merged, and the identifier that is emptied must then be left pointing to the identifier that now holds the data for that concept.

Listing approach

Since Wikidata began there have been around 1,000,000 log entries deleting pages, which equates to roughly the same number of items deleted, although some deleted items may also have been restored. This was a great starting point. The basic query to get this figure can be found below.
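A minimal sketch of the kind of query involved, run from Python against the Labs replica of the wikidatawiki database (the host and connection details here are placeholders, not exactly what I used):

    import pymysql

    # Connect to the wikidatawiki replica (placeholder host / credentials file).
    conn = pymysql.connect(
        host="wikidatawiki.analytics.db.svc.wikimedia.cloud",
        db="wikidatawiki_p",
        read_default_file="~/replica.my.cnf",
    )

    with conn.cursor() as cur:
        # Count deletion log entries for pages in the item (main) namespace.
        cur.execute(
            """
            SELECT COUNT(*)
            FROM logging
            WHERE log_type = 'delete'
              AND log_action = 'delete'
              AND log_namespace = 0
            """
        )
        print(cur.fetchone()[0])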

I removed quite a few items from this initial list by excluding items that had already been restored and were already redirects. To do this I had to find all of the existing redirects!
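Again as a rough sketch (reusing the conn opened in the snippet above), existing item redirects and their targets can be pulled from the page and redirect tables on the replica:

    with conn.cursor() as cur:
        # Items live in the main namespace; a redirect is flagged on its page row
        # and its target lives in the redirect table.
        cur.execute(
            """
            SELECT page_title, rd_title
            FROM page
            JOIN redirect ON rd_from = page_id
            WHERE page_namespace = 0
              AND page_is_redirect = 1
            """
        )
        existing_redirects = dict(cur.fetchall())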

At this stage I could probably have tried to remove more items based on whether they currently exist, but there was little point. In fact it turned out that there was also little point in the above query, as prior to my run very few items had been un-deleted in order to create redirects.

The next step was to determine which of the logged deletions were actually due to the item being merged into another item. This is fairly easy, as most merges used the merge gadget on Wikidata.org, so if the deletion summary matched a suitable regular expression I could assume the item was deleted because it had been merged into, or was a duplicate of, another item.

And of course, in order to create a redirect I would have to be able to identify a target, so the pattern also had to match the Q-id link in the summary.
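An illustrative pattern doing both of these things (not the exact expression I used; the real merge gadget summaries may be worded differently):

    import re

    # Match deletion summaries that mention a merge or a duplicate and capture
    # the Q-id of the linked target item. Purely illustrative.
    MERGE_SUMMARY = re.compile(
        r"(?:merge[ds]?|duplicate).*?\[\[(?:.*?:)?(Q\d+)\]\]",
        re.IGNORECASE,
    )

    def merge_target(summary):
        """Return the target Q-id if the summary looks like a merge deletion."""
        match = MERGE_SUMMARY.search(summary)
        return match.group(1) if match else None

    print(merge_target("Merged into [[Q42]]"))  # -> Q42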

I then had a fairly nice, though still large, list, and it was time to actually start trying to create these redirects!

Editing approach

Firstly, I should point out that such a task is only possible with an admin account, as you need to be able to see deleted revisions and un-delete items. Secondly, it is not possible to create a redirect over a deleted item, nor to restore an item when doing so would create a conflict on the site, for example due to duplicate sitelinks or duplicate label-and-description pairs.

I split the list up into 104 sections, each containing exactly 10,000 item IDs. I could then fire up multiple processes working through these sections in parallel, to create the redirects as quickly as possible.

The process for handling a single item ID (roughly sketched in code after the list) was:

  1. Make sure that the target of the merge exists. If it does not, log to a file; if it does, continue.
  2. Try to un-delete the item. If the un-deletion fails, log to a file; if it succeeds, continue.
  3. Try to clear the item (as you can only create redirects over empty items). This either results in an edit or no edit; it doesn’t really matter which.
  4. Try to create the redirect. This should never fail, but if it does, log to a file that I can clean up from afterwards.
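In code, each worker roughly did something like the sketch below. This is not the actual script: it assumes an already-authenticated admin session and a CSRF token, and leaves out login, throttling and most error handling.

    import requests

    API = "https://www.wikidata.org/w/api.php"
    session = requests.Session()          # assumed to be logged in as an admin
    CSRF_TOKEN = "..."                    # from action=query&meta=tokens

    def log(kind, *ids):
        with open(kind + ".log", "a") as f:
            f.write(" ".join(ids) + "\n")

    def process(deleted_id, target_id):
        # 1. Make sure that the target of the merge exists.
        r = session.get(API, params={
            "action": "wbgetentities", "ids": target_id,
            "props": "info", "format": "json",
        }).json()
        if "missing" in r.get("entities", {}).get(target_id, {}):
            log("missing-target", deleted_id, target_id)
            return

        # 2. Try to un-delete the item.
        r = session.post(API, data={
            "action": "undelete", "title": deleted_id,
            "reason": "Restoring to create a redirect",
            "token": CSRF_TOKEN, "format": "json",
        }).json()
        if "error" in r:
            log("undelete-failed", deleted_id, target_id)
            return

        # 3. Clear the item, as redirects can only be created over empty items.
        session.post(API, data={
            "action": "wbeditentity", "id": deleted_id,
            "clear": 1, "data": "{}",
            "token": CSRF_TOKEN, "format": "json",
        })

        # 4. Create the redirect; anything failing here goes to a clean-up log.
        r = session.post(API, data={
            "action": "wbcreateredirect",
            "from": deleted_id, "to": target_id,
            "token": CSRF_TOKEN, "format": "json",
        }).json()
        if "error" in r:
            log("redirect-failed", deleted_id, target_id)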

The approach on the whole worked very well. As far as I know there were no incorrect un-deletions and nothing failed in the middle.

The first of two snags I hit was that the rate at which I was editing caused the dispatch lag on Wikidata to increase. There was no real solution to this other than to keep an eye on the lag and stop editing whenever it rose above a certain level.

The second snag was that I caused multiple database deadlocks during the final day of running, although again this was not really a problem, as all the transactions recovered. The deadlocks can be seen in the graph below:

The result

  • 500,000 more item IDs now point to the correct locations.
  • We have an accurate idea of how many items have actually been deleted due to not being notable / being test items.
  • The reasoning for redirects has been reinforced in the community.

Final note

One of the steps in the editing approach was to attempt to un-delete an item and, if the un-deletion failed, to log the item ID to a file.

As a result I have now identified a list of roughly 6,000 items that should be redirects but cannot currently be un-deleted in order to create them.

See https://phabricator.wikimedia.org/T71166

It looks like there is still a bit of work to be done!

Again, sorry for the lack of images :/

Wikimedia Grafana graphs of Wikidata profiling information

I recently discovered the Wikimedia Grafana instance. After poking at it for a little while, here are some slightly interesting graphs that I managed to extract.

Continue reading

Barack Obama GeneaWiki, 1 year later

GeneaWiki is a tool created by Magnus Manske to visualize the family of a person using data pulled from Wikidata.
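Under the hood, the data it walks is just ordinary Wikidata claims. As a rough illustration (not the tool's actual code), the family-related statements of an item can be fetched like this, using the father (P22), mother (P25), spouse (P26) and child (P40) properties:

    import requests

    FAMILY_PROPS = {"P22": "father", "P25": "mother", "P26": "spouse", "P40": "child"}

    def family_of(item_id):
        """Return (relation, Q-id) pairs for an item's family-related claims."""
        r = requests.get("https://www.wikidata.org/w/api.php", params={
            "action": "wbgetentities", "ids": item_id,
            "props": "claims", "format": "json",
        }).json()
        claims = r["entities"][item_id]["claims"]
        relations = []
        for prop, name in FAMILY_PROPS.items():
            for claim in claims.get(prop, []):
                value = claim["mainsnak"].get("datavalue", {}).get("value", {})
                if isinstance(value, dict) and "id" in value:
                    relations.append((name, value["id"]))
        return relations

    print(family_of("Q76"))  # Barack Obama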

I used the GeneaWiki tool as an example use of Wikidata in a presentation a year ago (2014), and below you can see the screenshot I took from it. It shows 10 people in Barack Obama's family tree / web.

GeneaWiki Q76 2014

 

When creating a new presentation this year (2015) I went back to GeneaWiki to take another screenshot and this is what I found!

GeneaWiki Q76 2015

Around 30 people now! :)

Yay, more data!

https://tools.wmflabs.org/magnus-toolserver/ts2/geneawiki/?q=Q76

Review of the big Interwiki link migration

Wikidata was launched on 30 October 2012 and was the first new project of the Wikimedia Foundation since 2006. The first phase enabled items to be created and filled with basic information: a label – a name or title, aliases – alternative terms for the label, a description, and links to articles about the topic in all the various language editions of Wikipedia.
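As a hand-written illustration (not a real API response), an item from that first phase looked roughly like this in Wikidata's JSON serialisation:

    # Illustrative only: the rough shape of a phase-one item.
    example_item = {
        "id": "Q64",  # Berlin
        "labels": {"en": {"language": "en", "value": "Berlin"}},
        "descriptions": {"en": {"language": "en", "value": "capital city of Germany"}},
        "aliases": {"en": [{"language": "en", "value": "Berlin, Germany"}]},
        "sitelinks": {
            "enwiki": {"site": "enwiki", "title": "Berlin"},
            "huwiki": {"site": "huwiki", "title": "Berlin"},
        },
    }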

On 14 January 2013, the Hungarian Wikipedia became the first to enable the provision of interlanguage links via Wikidata. This functionality was slowly enabled on more sites until it was enabled on all Wikipedias on 6 March 2013.

The sidebar that these interlanguage links are used to generate can be seen to the right.

Continue reading

Wikidata Map – 19 months on

The last generation of the Wikidata map, as last discussed here and as originally created by Denny Vrandečić, was on the 7th of November 2013. Recently I have started rewriting the code that generates the maps, stored on GitHub, and boom, a new map!

The old code

The old version of the wikidata-analysis repo, which generated the maps (along with other things), was terribly inefficient. The whole task of analysing the dump and generating data for various visualisations was tied together by a bash script which ran multiple Python scripts in turn.

  • The script took somewhere between 6 and 12 hours to run.
  • At some points this script needed over 6GB of memory to run. And that was when Wikidata was much smaller; it probably wouldn’t even run any more.
  • All of the code was hard to read, follow and understand.
  • The code was not maintained and thus didn’t actually run any more.

The Rewrite

The initial code that generated the map can mainly be found in the following two repositories, which were included as sub-modules in the main repo:

The code worked on the MediaWiki page dumps for Wikidata and relied on the internal representation of Wikidata items, and thus, as that representation changed, everything broke.

The wda repository pointed toward the Wikidata-Toolkit, which is written in Java and actively maintained, and thus the rewrite began! The rewrite is much faster, more easily understood and more easily expandable (maybe I will make another post about it once it is done)!

The change to the map in 19 months

Unfortunately, due to my blog's current settings, I cannot upload the two versions of the map, so I will instead link to the Twitter post announcing the new map as well as the images used there (not full size).

The tweet can be found here.

Wikidata map 7 Nov 2013

Wikidata map 3 June 2015

As you can see, the bottom map contains MORE DOTS! Yay!

Still to do

  • Stop the rewrite of the dump analyser from using somewhere between 1 and 2GB of RAM.
    • Problem: Currently the rewrite takes the data it wants and collects it in a Java JSON object, only writing to disk once the entire dump has been read. Because of this, lots of data ends up in the JSON object and thus in memory, and as we analyse more things this problem is only going to get worse.
    • Solution: Write all the data we want directly to disk as it is found. After the dump has been fully analysed, read each of these output files individually and put them into the format we want (probably JSON). A rough sketch of this streaming idea can be found just after this list.
  • Make all of the analysis run whenever a new JSON dump is available!
  • Keep all of the old data that is generated! This will mean we will be able to look at past maps. Previously the maps were overwritten every day.
  • Fix the interactive map!
    • Problem: Due to the large amount of data that is now loaded (compared with when the interactive map last worked, 19 months ago), the interactive map crashes all browsers that try to load it.
    • Solution: Optimise the JS code for the interactive map!
  • Add more data to the interactive map! (of course once the task above is done)
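For what it's worth, here is a rough sketch of the streaming idea from the first point above. The real analyser is Java code built on the Wikidata-Toolkit; this is just a language-agnostic illustration using the standard Wikidata JSON dump (one entity per line inside a JSON array), the coordinate location property P625, and illustrative file names:

    import json

    with open("wikidata-dump.json") as dump, open("coordinates.tsv", "w") as out:
        for line in dump:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # skip the enclosing array brackets and blank lines
            entity = json.loads(line)
            for claim in entity.get("claims", {}).get("P625", []):
                value = claim["mainsnak"].get("datavalue", {}).get("value")
                if value:
                    # Write each coordinate straight to disk instead of keeping
                    # everything in memory until the end of the dump.
                    out.write("{}\t{}\t{}\n".format(
                        entity["id"], value["latitude"], value["longitude"]))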

Maps Maps Maps!

Redirects on Wikidata

Redirects in Wikidata are basically the same as redirects on normal wiki pages. However, they have a slightly different meaning and intention.

The main reason we need redirects is that we want to provide stable identifiers.

Merging two items has been a commonplace task since Wikidata gained momentum, as two different people or bots often create items representing the same topic or concept; the data from one then needs to be moved into the other and the emptied item removed.

The problem with this is that removing the empty item of course means that one of the two identifiers no longer points to the topic or concept that it represents. That identifier is therefore no longer stable.

Continue reading

Wikidata map visualizations

In 2013 and 2014 I gave a few presentations about Wikidata to various groups of people.

When creating those presentations I used as many graphical representations of the data in Wikidata as possible, to try and give people a clearer picture of what is already stored.

One of the best visualisations at the time was the Wikidata map created by Denny Vrandečić which came after the introduction of coordinate locations to Wikidata.

Below you can see a GIF showing the addition of the coordinate location property to Wikidata items over roughly the first 40 days after the coordinate data type was enabled.

By Denny Vrandecic and Lydia Pintscher (Own work) [CC0], via Wikimedia Commons

 Below are some of the images that I extracted from the full map for use in my presentations. Although they are now quite outdated they are still great to look at!

Continue reading

Wikimania Open Data Weekend

The Open Data Weekend was a fringe event of the Wikimania conference this year. It took place in Frobisher Room One at The Barbican Centre in London on the 5th and 6th of July and was well attended.

The weekend included:

  • Discussion about how open data, specifically Wikidata, is helping the Wikimedia movement as a whole, covering its current integration with sister projects such as Wikipedia and Wikisource, as well as future integration with these projects and with Wikimedia Commons.
  • A general discussion of open data, its philosophy, and Semantic Web technologies.
  • Exploring various tools and applications that run on and depend on the data stored within Wikidata and other Wikimedia projects.

Unfortunately I am writing this post in 2015 and various details have fallen out of my mind… Luckily there is an EtherPad that contains lots of notes!

Continue reading

Zürich Wikimedia Hackathon

This year the Wikimedia Hackathon was held in Zürich, Switzerland from the 9th to 11th May 2014. The organization of the event was great: from lanyards and badges that included a USB memory stick to a city map and a ticket for public transport, Wikimedia Switzerland had prepared a fantastic hackathon.

More than 150 developers, engineers, sysadmins, and technology enthusiasts from more than 30 countries gathered, aiming to share knowledge about new and existing technologies, fix bugs, come up with new ideas, and work together on tools and systems relating to the Wikimedia movement.

As the name suggests, a lot of time at a hackathon is spent ‘hacking’ (coding and such), but there are also workshops available on all days. This year these workshops and talks included multiple sessions on ‘Vagrant’, working toward a production-like development system, and on ‘Open data’, looking at Wikidata and government open data, as well as sessions on ‘Phabricator’ and ‘Jenkins’.

Hackathons are not just a place to hack; they also give people with different specialisms and interests a crucial chance to meet each other in person, put faces to names and names to pseudonyms, build relationships, and in turn build the movement.

Until next time!

Image Credits:

  • Logo: By Original: Trevor Parscal Modification: Lokal_Profil [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
  • Photo: By Christian Meixner (Own work) [CC BY 3.0 (http://creativecommons.org/licenses/by/3.0)], via Wikimedia Commons
