Un-deleting 500,000 Wikidata items
Since some time in January of this year I have been on a mission to un-delete all Wikidata items that were merged into other items before the redirect functionality of Wikidata existed. Finally I am done (well nearly). This is the short story…
Earlier this year I pointed out the importance of redirects on Wikidata in a blog post. At the time I was amazed at how the community nearly said that they were not going to create redirects for merged items…. but thank the higher powers that the discussion just swung in favour of redirects.
Redirects are needed to maintain the persistent identifiers that Wikidata has. When two items relate to the same concept, they are merged and one of the identifiers must then be left pointing to the identifier now holding the data of the concept.
Since Wikidata began there have been around 1,000,000 log entries deleting pages, which equates to roughly the same number of items deleted, although some deleted items may also have been restored. This was a great starting point. The basic query to get this result was can be found below.
SELECT * FROM logging where log_type= 'delete' and log_action = 'delete' and log_namespace = 0
I removed quite a few items from this initial list by looking at at items that had already been restored and were already redirects. To do this I had to find all of the redirects!
SELECT p1.page_title AS title, rd_title, p1.page_namespace as namespace, rd_namespace FROM page as p1 LEFT JOIN redirect ON ((rd_from=p1.page_id)) LEFT JOIN page as p2 ON ((p2.page_namespace=rd_namespace) AND (p2.page_title=rd_title)) WHERE p1.page_is_redirect = '1' AND p1.page_namespace = 0
At this stage I could have probably tried and remove more items depending on if they currently exist, but there was very little point. In fact it turned out that there was very little point in the above query as prior to my run very few items were un-deleted in order to create redirects.
The next step was to determine which of the logged deletions were actually due to the item being merged into another item. This is fairly easy as most cases of merges used the merge gadget on Wikidata.org. So if the summary matched the following regular expression! I would therefore assume it was deleted due to being merged / a duplicate of another item.
And of course in order to create a redirect I would have to be able to identify a target, so, match Q id links.
I then had a fairly nice list, although it was still large, but it was time to actually start trying to create these redirects!
So firstly I should point out that such a task is only possible while using an Admin account, as you need to be able to see deleted revisions / un-delete items. Secondly it is not possible to create a redirect over a deleted item and also not possible to restore an item when that would create a conflict on the site, for example due to duplicate site links on items or duplicate joined labels and descriptions.
I split the list up into 104 different sections, each containing exactly 10,000 item IDs. I could then fire up multiple processes to try and create these redirects to make the task go as quickly as possible.
The process of touching a single ID was:
- Make sure that the target of the merge exists. If it does not then log to a file, if it does, continue.
- Try to un-delete the item. If the deletion fails log to a file, if it is successful continue.
- Try to clear the item (as you can only create redirects over empty items). This either results in an edit or no edit, it doesn’t really matter.
- Try to create the redirect, this should never fail! If it does log to a fail file that I can clean up after.
The approach on the whole worked very well. As far as I know there were no incorrect un-deletions and nothing failing in the middle.
The first of 2 snags that I hit was the rate at which I was trying to edit was causing the dispatch lag on wikidata to increase. There was no real solution to this other than to keep an eye on the lag and if it ever increased above a certain level to stop editing.
The second snag was causing multiple database locks during the final day of running, although again this was not really a snag as all the transactions recovered. The deadlocks can be seen in the graph below:
- 500,000 more item IDs now point to the correct locations.
- We have an accurate idea of how many items have actually been deleted due to not being notable / being test items.
- The reasoning for redirects has been reinforced in the community.
One of the steps in the editing approach was to attempt to un-elete an item and if un-deleting were to fail to log the item ID to a log file.
As a result I have now identified a list of roughly 6000 items that should be redirects but and not currently be un-deleted in order to be created.
It looks like there is still a bit of work to be done!
Again, sorry for the lack of images :/