Wikidata references from Microdata

December 29, 2015 By addshore

Recently some articles appeared on the English Wikipedia Signpost about Wikidata (1, 2, 3). Reading these articles, especially the second and third, pushed me to try to make a dent in the ‘problem’ of references on Wikidata. It turns out that this is actually not that hard!

Script overview

I have written a script as part of my addwiki libraries and the ‘aww’ command line tool (still to be fully released). The main code for this specific command, in its current version, can be found here.

The script can be passed either a single Item ID or some SPARQL matchers as shown below:

aww wm:wd:ref --item Q464933

OR

aww wm:wd:ref --sparql P31:Q11424 --sparql P161:?

The script then either acts on the single Item it was passed, or runs a SPARQL query built from the matchers and retrieves a list of Item IDs.
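
As a rough illustration of the second mode, here is a minimal Python sketch of how matchers such as `P31:Q11424` and `P161:?` could be turned into a SPARQL query. This is not the code used by the script: the function name `build_query` is made up, and treating `?` as "any value" is my assumption based on the example above. The `wd:` and `wdt:` prefixes are the ones the Wikidata query service predefines.

```python
def build_query(matchers):
    """Turn 'P31:Q11424'-style matchers into a SPARQL query string.

    Hypothetical helper: the real script's query building is not shown
    in this post.
    """
    patterns = []
    for index, matcher in enumerate(matchers):
        prop, _, value = matcher.partition(":")
        # Assumption: '?' means "any value", so bind it to a throwaway variable.
        obj = f"?v{index}" if value == "?" else f"wd:{value}"
        patterns.append(f"?item wdt:{prop} {obj} .")
    body = "\n  ".join(patterns)
    return f"SELECT DISTINCT ?item WHERE {{\n  {body}\n}}"


if __name__ == "__main__":
    print(build_query(["P31:Q11424", "P161:?"]))
```

Each matcher becomes one triple pattern on the same `?item` variable, so an Item has to satisfy all of them to be returned.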

Each Item is then loaded and its type is checked (using instance of) against a list of configured values, currently Q5 (human) and Q11424 (film), which are in turn mapped to the schema.org types Person and Movie. For each type there is then a further mapping of Wikidata properties to schema.org properties, for example P19 (place of birth) to ‘birthPlace’ for humans and P57 (director) to ‘director’ for films. These mappings can be used to check microdata on webpages against the data contained in Wikidata.
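
The two levels of mapping can be pictured as a pair of lookup tables. The Python sketch below only contains the examples mentioned in this post; the names `TYPE_MAP`, `PROPERTY_MAP` and `schema_properties` are invented for illustration and are not taken from the script.

```python
# Wikidata 'instance of' value -> schema.org type
TYPE_MAP = {
    "Q5": "Person",     # human
    "Q11424": "Movie",  # film
}

# schema.org type -> {Wikidata property: schema.org property}
PROPERTY_MAP = {
    "Person": {"P19": "birthPlace"},  # place of birth
    "Movie": {"P57": "director"},     # director
}


def schema_properties(instance_of_ids):
    """Return (schema.org type, property map) for the first supported type.

    Returns (None, {}) when none of the Item's types is configured.
    """
    for item_id in instance_of_ids:
        schema_type = TYPE_MAP.get(item_id)
        if schema_type is not None:
            return schema_type, PROPERTY_MAP[schema_type]
    return None, {}


if __name__ == "__main__":
    print(schema_properties(["Q11424"]))
```

An Item whose type is not in the first table is simply skipped, which is why adding support for a new type means adding an entry to both tables.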

Microdata is collected by loading all of the external links used on all of the Wikipedia articles for the loaded Item and parsing the HTML of each page. When all of the checks succeed and the data on Wikidata matches the microdata, a reference is added.
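
To give an idea of what the parsing and checking step involves, here is a deliberately simplified Python sketch using only the standard library. It is not the script's implementation: the class `ItempropCollector` and the function `microdata_confirms` are hypothetical names, nested items are flattened rather than handled properly, and comparing plain strings case-insensitively is an assumption about how a match might be decided. The HTML snippet and the name "Jane Doe" are made up.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are never kept "open".
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class ItempropCollector(HTMLParser):
    """Collect values of elements that carry an itemprop attribute.

    Simplification: reads the `content` attribute or the text content only,
    and ignores itemscope boundaries.
    """

    def __init__(self):
        super().__init__()
        self.values = {}  # itemprop name -> set of string values
        self._open = []   # itemprop elements whose end tag is still pending

    def _record(self, prop, value):
        value = " ".join(value.split())  # normalise whitespace
        if value:
            self.values.setdefault(prop, set()).add(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Count nested same-name tags so the matching end tag can be found.
        for entry in self._open:
            if entry["tag"] == tag:
                entry["depth"] += 1
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if attrs.get("content") is not None:
            self._record(prop, attrs["content"])
        elif tag not in VOID_TAGS:
            self._open.append({"tag": tag, "prop": prop, "depth": 0, "text": []})

    def handle_data(self, data):
        for entry in self._open:
            entry["text"].append(data)

    def handle_endtag(self, tag):
        still_open = []
        for entry in self._open:
            if entry["tag"] != tag:
                still_open.append(entry)
            elif entry["depth"] > 0:
                entry["depth"] -= 1
                still_open.append(entry)
            else:
                self._record(entry["prop"], "".join(entry["text"]))
        self._open = still_open


def microdata_confirms(html, schema_property, wikidata_label):
    """True if the page's microdata has this property with this label."""
    collector = ItempropCollector()
    collector.feed(html)
    found = {v.casefold() for v in collector.values.get(schema_property, ())}
    return wikidata_label.casefold() in found


if __name__ == "__main__":
    page = (
        '<div itemscope itemtype="http://schema.org/Movie">'
        '<span itemprop="director" itemscope itemtype="http://schema.org/Person">'
        '<span itemprop="name">Jane Doe</span></span></div>'
    )
    print(microdata_confirms(page, "director", "Jane Doe"))
```

A reference would only be added when a check like this succeeds for a statement that already exists on the Item, with the page the microdata came from as the source.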

Example command line output

As you can see, a total of 55 references were added for the three items shown in the example above. The diffs are linked below.

 

Further development

  • More types: As explained above, the script currently only works for people and films, but both Wikidata and schema.org cover far more data than this, so the script could likely be easily expanded in this area.
  • More property maps: Currently there are still many properties on both schema.org and Wikidata for the enabled types that lack a mapping.
  • Better sourcing of microdata: The current approach to finding microdata is simply to load all Wikipedia external links and hope that some of them contain microdata. This is network intensive and currently the slowest part of the script. It is currently possible to create custom Google search engines that match a specific schema.org type, for example films, and search the web for pages containing microdata. However, there is not actually any ‘nice’ API for search queries like this (hint hint Google).
  • Why stop at microdata: Other standards for structured data in webpages exist, so these could also be covered.

Other thoughts

This is another step in the right direction in terms of fixing things on a large scale. This is the beauty of having machine-readable data in Wikidata and the larger web.

Being able to add references en masse has reminded me how much duplicate information the current reference system includes. For example, a single Item could have 100 statements, each of which can be referenced to a single web page. This reference data must then be included 100 times!
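
To make the duplication concrete, here is a small Python sketch. The data structure is a simplified stand-in and not the real Wikibase JSON, and the URL is a placeholder; P854 is the Wikidata property for reference URL.

```python
import json

# Hypothetical, simplified reference; real Wikibase references are richer.
reference = {"P854": "https://example.org/some-film"}  # P854 = reference URL

# 100 statements, each carrying its own copy of the same reference.
statements = [
    {"id": f"statement-{n}", "references": [dict(reference)]}
    for n in range(100)
]

# How many reference copies are stored, and how many are actually distinct?
copies = sum(len(s["references"]) for s in statements)
unique = {
    json.dumps(r, sort_keys=True)
    for s in statements
    for r in s["references"]
}

if __name__ == "__main__":
    print(f"{copies} reference copies, {len(unique)} distinct reference")
```

If references could be stored once and pointed to from each statement, only the single distinct reference would need to be kept.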