Wikidata Map – 19 months on
The last Wikidata map generation, as last discussed here and as originally created by Denny Vrandečić, was on the 7th of November 2013. Recently I have started rewriting the code that generates the maps, stored on GitHub, and boom, a new map!
The old code
The old version of the wikidata-analysis repo, which generated the maps (along with other things), was terribly inefficient. The whole task of analysing the dump and generating data for the various visualisations was tied together by a bash script which ran multiple Python scripts in turn.
- The script took somewhere between 6 and 12 hours to run.
- At some points this script needed over 6GB of memory to run, and that was when Wikidata was much smaller; it probably wouldn’t even run any more.
- All of the code was hard to read, follow and understand.
- The code was not maintained and thus didn’t actually run any more.
The initial code that generated the map can mainly be found in two repositories, which were included as sub-modules in the main repo.
The code worked on the MediaWiki page dumps for Wikidata and relied on the internal representation of Wikidata items, so as that representation changed, everything broke.
The wda repository pointed toward the Wikidata-Toolkit, which is written in Java and is actively maintained, and thus the rewrite began! The rewrite is much faster, easier to understand and easier to extend (maybe I will write another post about it once it is done)!
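To give a rough idea of what working with the Wikidata-Toolkit looks like, here is a minimal sketch of a processor that streams a JSON dump and prints the coordinate location (P625) of every item that has one. The class name `MapDataProcessor` and the CSV-style output are my own illustration, not necessarily how the actual rewrite does it, and the unit returned for coordinates depends on the toolkit version.

```java
import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
import org.wikidata.wdtk.datamodel.interfaces.GlobeCoordinatesValue;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;
import org.wikidata.wdtk.datamodel.interfaces.Snak;
import org.wikidata.wdtk.datamodel.interfaces.Statement;
import org.wikidata.wdtk.datamodel.interfaces.StatementGroup;
import org.wikidata.wdtk.datamodel.interfaces.ValueSnak;
import org.wikidata.wdtk.dumpfiles.DumpProcessingController;

/**
 * Rough sketch only: stream a Wikidata JSON dump with Wikidata-Toolkit and
 * print the "coordinate location" (P625) of every item that has one.
 */
public class MapDataProcessor implements EntityDocumentProcessor {

    public static void main(String[] args) {
        DumpProcessingController controller = new DumpProcessingController("wikidatawiki");
        // null = no model filter, true = only current revisions
        controller.registerEntityDocumentProcessor(new MapDataProcessor(), null, true);
        // Fetches (or reuses) the most recent JSON dump and streams it item by item
        controller.processMostRecentJsonDump();
    }

    @Override
    public void processItemDocument(ItemDocument item) {
        for (StatementGroup group : item.getStatementGroups()) {
            if (!"P625".equals(group.getProperty().getId())) {
                continue; // only the "coordinate location" property matters for the map
            }
            for (Statement statement : group.getStatements()) {
                Snak mainSnak = statement.getClaim().getMainSnak();
                if (mainSnak instanceof ValueSnak
                        && ((ValueSnak) mainSnak).getValue() instanceof GlobeCoordinatesValue) {
                    GlobeCoordinatesValue coords =
                            (GlobeCoordinatesValue) ((ValueSnak) mainSnak).getValue();
                    // Note: whether getLatitude()/getLongitude() return degrees or
                    // nanodegrees depends on the toolkit version in use.
                    System.out.println(item.getItemId().getId() + ","
                            + coords.getLatitude() + "," + coords.getLongitude());
                }
            }
        }
    }

    @Override
    public void processPropertyDocument(PropertyDocument property) {
        // Property documents are not needed for the map
    }
}
```

Because the toolkit streams the dump rather than loading it all at once, a processor like this can run over the whole of Wikidata without needing anywhere near the memory of the old scripts.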
The change to the map in 19 months
Unfortunately, due to the settings of my blog, I currently cannot upload the two versions of the map, so I will instead link to the Twitter post announcing the new map as well as the images used there (not full size).
The tweet can be found here.
As you can see, the bottom map contains MORE DOTS! Yay!
Still to do
- Stop the rewrite of the dump analyser from using somewhere between 1 and 2GB of RAM.
- Problem: Currently the rewrite takes the data it wants and collects it in a Java JSON object, only writing it to disk once the entire dump has been read. Because of this, lots of data ends up in this JSON object and thus in memory, and as we analyse more things this problem is only going to get worse.
- Solution: Write all of the data we want directly to disk as it is found. After the dump has been fully analysed, read each of these output files individually and convert them into the format we want (probably JSON). (See the sketch after this list.)
- Make all of the analysis run whenever a new JSON dump is available!
- Keep all of the old data that is generated! This will mean we will be able to look at past maps. Previously the maps were overwritten every day.
- Fix the interactive map!
- Problem: Due to the large amount of data that is now loaded (compared with when the interactive map last worked, 19 months ago), the interactive map crashes all browsers that try to load it.
- Solution: Optimise the JS code for the interactive map!
- Add more data to the interactive map! (of course once the task above is done)
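As a rough illustration of the "write directly to disk" idea from the first point above, here is a minimal sketch; the class name `GeoCoordinateWriter` and the CSV-style record format are mine, not the actual code of the rewrite.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/**
 * Sketch of the "write directly to disk" idea: instead of growing one big
 * in-memory JSON object while the dump is read, each record is appended to
 * a file as soon as it is seen, so memory use stays flat.
 */
public class GeoCoordinateWriter implements AutoCloseable {

    private final BufferedWriter writer;

    public GeoCoordinateWriter(String outputFile) throws IOException {
        this.writer = Files.newBufferedWriter(Paths.get(outputFile), StandardCharsets.UTF_8);
    }

    /** Append one item's coordinate; nothing is kept in memory afterwards. */
    public void write(String itemId, double latitude, double longitude) {
        try {
            writer.write(itemId + "," + latitude + "," + longitude);
            writer.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        writer.close(); // flush once the whole dump has been processed
    }
}
```

Each analysis would get its own writer like this, and a separate post-processing step could then turn the output files into whatever format the visualisations need (probably JSON).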
Maps Maps Maps!