Wikidata references from Microdata
Recently some articles appeared on the English Wikipedia Signpost about Wikidata (1, 2, 3). Reading these articles, especially the second and third, pushed me to try to make a dent in the ‘problem’ of references on Wikidata. It turns out that this is actually not that hard!
I have written a script as part of my addwiki libraries and the ‘aww’ command line tool (still to be fully released). The main code for this specific command, in its current version, can be found here.
The script can be passed either a single Item ID or some SPARQL matchers as shown below:
aww wm:wd:ref --item Q464933
aww wm:wd:ref --sparql P31:Q11424 --sparql P161:?
The script then either acts on the single Item it was passed, or performs a SPARQL query to retrieve a list of Item IDs.
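As a rough illustration of how matchers like the ones above could become a query, here is a minimal Python sketch. It assumes each matcher is a `property:value` pair and that `?` means "any value"; the function name `build_query` is made up for this example, and the real command may well build its query differently.

```python
def build_query(matchers):
    """Turn 'P31:Q11424'-style matchers into a SPARQL query string.

    Assumption: 'P161:?' means "has any value for P161".
    """
    patterns = []
    for index, matcher in enumerate(matchers):
        prop, _, value = matcher.partition(":")
        # '?' is treated as a wildcard and bound to a throwaway variable.
        obj = f"?value{index}" if value == "?" else f"wd:{value}"
        patterns.append(f"?item wdt:{prop} {obj} .")
    return "SELECT DISTINCT ?item WHERE { " + " ".join(patterns) + " }"


print(build_query(["P31:Q11424", "P161:?"]))
```

The `wd:` and `wdt:` prefixes are the ones predefined by the Wikidata Query Service, so the resulting string can be pasted there as-is.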
Each Item is then loaded and its type is checked (using instance of) against a list of configured values, currently Q5 (human) and Q11424 (film), which are in turn mapped to the schema.org types Person and Movie. For each type there is then a further mapping of Wikidata properties to schema.org properties, for example P19 (place of birth) to ‘birthPlace’ for humans and P57 (director) to ‘director’ for films. These mappings can be used to check microdata on webpages against the data contained in Wikidata.
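The two levels of mapping can be pictured as a pair of lookup tables. The sketch below is in Python purely for illustration; it contains only the example entries named above, and the names `TYPE_MAP`, `PROPERTY_MAP` and `schema_properties` are hypothetical rather than taken from the script.

```python
# Wikidata "instance of" values -> schema.org types.
TYPE_MAP = {
    "Q5": "Person",      # human
    "Q11424": "Movie",   # film
}

# Per schema.org type: Wikidata property -> schema.org property.
PROPERTY_MAP = {
    "Person": {"P19": "birthPlace"},  # place of birth
    "Movie": {"P57": "director"},     # director
}


def schema_properties(instance_of_ids):
    """Return (schema.org type, property map) for the first supported type."""
    for item_id in instance_of_ids:
        schema_type = TYPE_MAP.get(item_id)
        if schema_type is not None:
            return schema_type, PROPERTY_MAP[schema_type]
    # Items of an unsupported type are skipped.
    return None, {}


print(schema_properties(["Q11424"]))
```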
Microdata is collected by loading all of the external links used on all of the Wikipedia articles for the loaded Item and parsing the HTML. When all of the checks succeed and the data on Wikidata matches the microdata, a reference is added.
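To give a feel for the parse-and-compare step, here is a deliberately simplified Python sketch using only the standard library. It is not how the script does it: it ignores item nesting and `itemtype`, reads values only from element text or a `content` attribute, and compares labels case-insensitively. The class and function names, and the sample HTML, are invented for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ItempropCollector(HTMLParser):
    """Collect itemprop -> [values] from a page, ignoring item nesting."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._open = []  # stack of [tag, itemprop or None, text parts]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop and attrs.get("content"):
            # e.g. <meta itemprop="name" content="...">
            self.values.setdefault(prop, []).append(attrs["content"].strip())
            prop = None
        if tag not in VOID_TAGS:
            self._open.append([tag, prop, []])

    def handle_data(self, data):
        # Text counts towards every enclosing element that has an itemprop.
        for entry in self._open:
            if entry[1]:
                entry[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in [e[0] for e in self._open]:
            return
        # Pop back to the matching start tag, tolerating unclosed elements.
        while self._open:
            open_tag, prop, parts = self._open.pop()
            text = " ".join("".join(parts).split())
            if prop and text:
                self.values.setdefault(prop, []).append(text)
            if open_tag == tag:
                break


def page_confirms(html, schema_property, expected_label):
    """True if the page's microdata has this label for the property."""
    collector = ItempropCollector()
    collector.feed(html)
    found = collector.values.get(schema_property, [])
    return expected_label.casefold() in (v.casefold() for v in found)


SAMPLE = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Example Film</h1>
  <span itemprop="director" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Example Director</span>
  </span>
</div>
"""

print(page_confirms(SAMPLE, "director", "Example Director"))
```

A reference would only be worth adding when a check like this succeeds for the value already stored on Wikidata; a proper microdata parser would be a better choice than this toy one for anything beyond a demonstration.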
As you can see, the total number of references added for the three items shown in the example above was 55; the diffs are linked below.
- “The Decline of the American Empire” (Q1197742)
- “Ocean’s 11” (Q464933)
- “Les liaisons dangereuses” (Q1498136)
Some ideas for further development:
- More types: As explained above, the script currently only works for people and films, but both Wikidata and schema.org cover far more data than this, so the script could likely be easily expanded in these areas.
- More property maps: Currently there are still many properties on both schema.org and Wikidata for the enabled types that lack a mapping.
- Better sourcing of microdata: The current approach to finding microdata is to simply load all Wikipedia external links and hope that some of them will have some microdata. This is network intensive and currently the slowest part of the script. It is currently possible to create custom Google search engines that match a specific schema.org type, for example films, and search the web for pages containing microdata. However, there is not actually any ‘nice’ API for search queries like this (hint hint Google).
- Why stop at microdata: Other standards of structured data in webpages exist, so others could also be covered?
This is another step in the right direction in terms of fixing things on a large scale. This is the beauty of having machine-readable data in Wikidata and the larger web.
Being able to add references en masse has reminded me how much duplicate information the current reference system includes. For example, a single Item could have 100 statements, each of which can be referenced to a single web page. This reference data must then be included 100 times!
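A toy Python comparison makes the duplication concrete. The data below is invented, the "store once and point at it" layout is purely hypothetical and not something Wikidata offers, and P854 (reference URL) and P813 (retrieved) are used only as familiar reference properties.

```python
import json

# One reference: a URL plus the date it was retrieved (illustrative values).
reference = {"P854": "https://example.org/film-page", "P813": "2016-01-01"}

# Current situation: every statement carries its own full copy.
statements = [
    {"statement": n, "references": [dict(reference)]} for n in range(100)
]
copied = sum(len(json.dumps(s["references"])) for s in statements)

# Hypothetical alternative: keep the reference once and point at it by key.
shared = {"ref-1": reference}
pointers = [{"statement": n, "references": ["ref-1"]} for n in range(100)]
pointed = len(json.dumps(shared)) + sum(
    len(json.dumps(p["references"])) for p in pointers
)

# Serialised size of the reference data under each layout.
print(copied, pointed)
```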
Well done sire, well done.
I can haz codes? :)
It is all on github in the addwiki org. If you want pointing to anything in particular then let me know!
Yes!! Let’s go ahead, change our data model and allow defining references per item or even making references entities on their own.
Yes, although thinking about this again, there is a distinction between source and reference. A URL is the source, as is a book (which would be an item). The reference is the source with additional data including retrieval URL etc.
It may be quite hard to remove the apparent duplication, as it may actually be needed….
Nice work! How could one help with mapping? (Wouldn’t that be suitable to actually store in the property items?)
Currently you can find the mappings in the command class which can be found on github @ https://github.com/addwiki/wikimedia-commands/blob/master/src/WikidataReferencer/WikidataReferencerCommand.php
I haven’t really had any time to work on pushing anything relating to this forward in the past months but I am keen to do so!
All help is greatly appreciated!
Good work, Adam. I like your idea. Another aspect of the reference problem that you illustrate is that an external reference could perhaps be more effectively represented as some form of WD entity (or data node), rather than just a URI. Then, the reference could contain “back reference” metadata about what is referenced from it. The representation of references for an Entity would then just be a question of asking the reference node to return a graph of “has referenced statements” rather than duplicating statement reference pointers ad infinitum to a single source.