Addshore

It's a blog

Wikidata references from Microdata

Recently some articles appeared on the English Wikipedia Signpost about Wikidata (1, 2, 3). Reading these articles, especially the second and third, pushed me to try to make a dent in the ‘problem’ of references on Wikidata. It turns out that this is actually not that hard!

Script overview

I have written a script as part of my addwiki libraries and the ‘aww’ command line tool (still to be fully released). The main code for this specific command in its current version can be found here.

The script can be passed either a single Item ID or some SPARQL matchers as shown below:

OR

The script will then either act on the single Item, if one was passed, or perform a SPARQL query to retrieve a list of Item IDs.
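To give a feel for the SPARQL route, here is a minimal sketch in Python (not the PHP of the addwiki libraries). It builds a query from property/value matchers and pulls Item IDs out of a result in the standard SPARQL JSON results format. The function names and the tuple format for matchers are assumptions made for illustration, not the script's actual interface.

```python
import re


def build_query(matchers, limit=100):
    """Build a SPARQL query selecting items that match every (property, value) pair.

    `matchers` is a hypothetical format: a list of tuples such as ("P31", "Q5").
    The wdt: and wd: prefixes are predefined on the Wikidata Query Service.
    """
    triples = " ".join(f"?item wdt:{prop} wd:{value} ." for prop, value in matchers)
    return f"SELECT ?item WHERE {{ {triples} }} LIMIT {limit}"


def item_ids_from_results(results):
    """Extract Item IDs (Q-numbers) from a SPARQL JSON results document."""
    ids = []
    for binding in results["results"]["bindings"]:
        uri = binding["item"]["value"]
        match = re.search(r"(Q\d+)$", uri)
        if match:
            ids.append(match.group(1))
    return ids
```

Sending the query over HTTP is left out on purpose; the two functions only cover the parts before and after the network call.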

Each Item is then loaded and its type is checked (using instance of) against a list of configured values, currently Q5 (human) and Q11424 (film), which are in turn mapped to the schema.org types Person and Movie. For each type there is then a further mapping of Wikidata properties to schema.org properties, for example P19 (place of birth) to ‘birthPlace’ for humans and P57 (director) to ‘director’ for films. These mappings can be used to check microdata on webpages against the data contained in Wikidata.
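The two levels of mapping could be sketched roughly like this in Python. Only the IDs and property names mentioned above are included; the variable and function names are made up for the example and are not taken from the script.

```python
# Wikidata class -> schema.org type (the two pairs named in the post)
TYPE_MAP = {"Q5": "Person", "Q11424": "Movie"}

# schema.org type -> {Wikidata property: schema.org property}
PROPERTY_MAP = {
    "Person": {"P19": "birthPlace"},
    "Movie": {"P57": "director"},
}


def schema_properties_for(instance_of_ids):
    """Return (schema_type, property_map) for the first recognised class, else None."""
    for qid in instance_of_ids:
        schema_type = TYPE_MAP.get(qid)
        if schema_type is not None:
            return schema_type, PROPERTY_MAP[schema_type]
    return None
```

An Item whose instance of values are not in the type map is simply skipped, which is why adding more types and property maps is mostly a matter of extending these two tables.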

Microdata is collected by loading all of the external links used on all of the Wikipedia articles for the loaded Item and parsing the HTML. When all of the checks succeed and the data on Wikidata matches the microdata, a reference is added.
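As an illustration of the parsing and matching step, here is a small Python sketch using only the standard library. It is deliberately simplified: it ignores itemscope nesting, uses the text content for every non-void element (real microdata takes the value of some elements from attributes such as href), and compares labels case-insensitively. The class and function names are hypothetical and this is not how the script itself is implemented.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect itemprop -> list of values, ignoring itemscope nesting (a simplification)."""

    VOID = {"meta", "link", "img", "br", "hr", "input"}

    def __init__(self):
        super().__init__()
        self.props = {}
        self._open = []  # stack of (tag, itemprop name or None, text parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if tag in self.VOID:
            # Void elements carry their value in an attribute
            value = attrs.get("content") or attrs.get("href") or attrs.get("src")
            if name and value:
                self.props.setdefault(name, []).append(value)
            return
        self._open.append((tag, name, []))

    def handle_data(self, data):
        # Text belongs to every open element that declares an itemprop
        for _tag, name, parts in self._open:
            if name:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.VOID or tag not in [t for t, _, _ in self._open]:
            return
        # Pop up to the matching open tag, tolerating unclosed tags
        while self._open:
            open_tag, name, parts = self._open.pop()
            if name:
                text = " ".join("".join(parts).split())
                if text:
                    self.props.setdefault(name, []).append(text)
            if open_tag == tag:
                break


def microdata_matches(html, schema_property, expected_label):
    """True if any value of schema_property in the page equals the expected label."""
    collector = ItempropCollector()
    collector.feed(html)
    collector.close()
    values = collector.props.get(schema_property, [])
    return any(v.casefold() == expected_label.casefold() for v in values)
```

In this sketch a reference would only be considered when `microdata_matches` returns true for the label of the value already stored on Wikidata, so the page confirms existing data rather than supplying new data.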

Example command line output

As you can see, a total of 55 references were added for the three items shown in the example above. The diffs are linked below.

Further development

  • More types: As explained above, the script currently only works for people and films, but both Wikidata and schema.org cover far more data than this, so the script could likely be expanded easily in these areas.
  • More property maps: There are still many properties on both schema.org and Wikidata for the enabled types that lack a mapping.
  • Better sourcing of microdata: The current approach to finding microdata is simply to load all Wikipedia external links and hope that some of them will have some microdata. This is network intensive and currently the slowest part of the script. It is currently possible to create custom Google search engines that match a specific schema.org type, for example films, and search the web for pages containing microdata. However, there is not actually any ‘nice’ API for search queries like this (hint hint Google).
  • Why stop at microdata: Other standards for structured data in webpages exist, so could these also be covered?

Other thoughts

This is another step in the right direction in terms of fixing things on a large scale. This is the beauty of having machine-readable data in Wikidata and the larger web.

Being able to add references en masse has reminded me how much duplicate information the current reference system includes. For example, a single Item could have 100 statements, each of which can be referenced to a single web page. This reference data must then be included 100 times!
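The duplication is easy to see with a toy example. The Python sketch below uses a heavily simplified, made-up statement structure (not the real Wikibase JSON) in which every statement carries its own copy of the same reference, built from P854 (reference URL) and P813 (retrieved). The statement ids and the URL are placeholders.

```python
import json


def reference_overhead(statements):
    """Return (total reference copies, distinct references) across statements."""
    total = 0
    distinct = set()
    for statement in statements:
        for reference in statement.get("references", []):
            total += 1
            # Canonical JSON makes equal references compare equal
            distinct.add(json.dumps(reference, sort_keys=True))
    return total, len(distinct)


# One web page used as the reference for 100 statements (placeholder values)
reference = {"P854": "https://example.org/page", "P813": "2016-01-26"}
statements = [
    {"id": f"statement-{n}", "references": [dict(reference)]}
    for n in range(1, 101)
]

print(reference_overhead(statements))  # (100, 1)
```

One hundred copies are stored where a single shared reference would carry the same information.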

 

7 Comments

  1. Well done sire, well done.

    I can haz codes? :)

  2. > Being able to add references on mass has reminded me how much duplicate information the current reference system includes.

    Yes!! Let’s go ahead, change our data model and allow defining references per item or even making references entities on their own.

    • addshore

      January 26, 2016 at 2:53 pm

      Yes, although thinking about this again, there is a distinction between source and reference. A URL is the source, as is a book (which would be an item). The reference is the source with additional data, including the retrieval URL etc.
      It may be quite hard to remove the apparent duplication, as it may actually be needed….

  3. Nice work! How could one help with mapping? (Wouldn’t that be suitable to actually store in the property items?)

  4. Good work, Adam. I like your idea. Another aspect of the reference problem that you illustrate is that an external reference could perhaps be more effectively represented as some form of WD entity (or data node), rather than just a URI. Then, the reference could contain “back reference” metadata about what is referenced from it. The representation of references for an Entity would then just be a question of asking the reference node to return a graph of “has referenced statements” rather than duplicating statement reference pointers ad infinitum to a single source.

© 2017 Addshore