Wikidata user and project talk page connection graph

December 12, 2021 By addshore

Talk pages have been a key part of how wikis work for many years. Realtime chat apps and services are probably changing this dynamic somewhat, but talk pages are still used, and most of their history is still recorded.

I started up an IPython Notebook to try and take a look at some of the connections between different users on Wikidata over the years. Below you’ll find a few representations of these connections, as well as notable things I spotted along the way, the generating code, SQL query and more!

The data

MediaWiki maintains links tables for all pages, so getting all of the current links out of Wikidata is very easy. I made use of the Wikimedia Cloud Quarry service to run this query and host a CSV of the results.

SELECT
  SUBSTRING_INDEX(page_title, '/', 1) AS t1,
  pl_from_namespace AS t1ns,
  SUBSTRING_INDEX(pl_title, '/', 1) AS t2,
  pl_namespace AS t2ns
FROM pagelinks, page
WHERE pl_namespace IN (3,5) AND pl_from_namespace IN (3,5)
AND page_id = pl_from AND page_title != pl_title
GROUP BY t1, t2

I then loaded this data directly into an IPython Notebook and did some cleaning, such as removing all IP addresses. I then spent quite some time applying more filtering and twiddling knobs to try and get some graphics out that are easy to look at. The first attempts looked like solid blobs as you can see in this tweet.
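As a rough illustration of the IP address cleaning step, here is a minimal sketch using only the Python standard library. It is not the code from the notebook. It assumes the CSV has a header row with the column names from the query (`t1`, `t1ns`, `t2`, `t2ns`); the helper names `is_ip_address` and `load_edges` and the sample rows are made up for this example.

```python
import csv
import io
import ipaddress


def is_ip_address(title: str) -> bool:
    """Return True if a page title is a bare IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(title)
        return True
    except ValueError:
        return False


def load_edges(csv_file):
    """Yield (t1, t2) pairs from the query result, skipping IP talk pages."""
    for row in csv.DictReader(csv_file):
        t1, t2 = row["t1"], row["t2"]
        if is_ip_address(t1) or is_ip_address(t2):
            continue
        yield t1, t2


# Tiny made-up sample in the same shape as the query output (assumed header).
sample = io.StringIO(
    "t1,t1ns,t2,t2ns\n"
    "Alice,3,Requests_for_comment,5\n"
    "192.0.2.7,3,Alice,3\n"
    "Bob,3,2001:DB8:0:0:0:0:0:1,3\n"
)
edges = list(load_edges(sample))
print(edges)
```

With the sample above, only the Alice to Requests_for_comment row survives, since the other two rows each involve an anonymous (IP) talk page.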

You can find a copy of the Notebook on notebooksharing.space.

The graphs

For all of these graphs, nodes are user talk pages and project talk pages on Wikidata. An edge exists between two nodes if one page (or one of its subpages) links to the other. Various filtering is then applied (see notebook) so that the graph is readable.

The first graph tries to show as much of the community as possible. Generally speaking, names toward the middle of the graph, whether user names or project page names, have the most connections to other nodes. This centre section includes many long-time Wikidata users, as well as key project pages such as “Request for comment”, “Property Proposal”, “Notability” and more.

Each edge must connect to a node with 200 other potential edges, and all nodes must have at least 25 potential edges. Everything else is hidden.
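One way to read that filtering rule is sketched below. This is my interpretation of the caption, not the notebook's code: "potential edges" is taken to mean a node's degree in the unfiltered graph, an edge is kept if at least one end is a hub, and both ends must meet the lower threshold. The function name `filter_edges` and its parameters `hub_min` and `node_min` are hypothetical, and the edge list is assumed to contain each connection once.

```python
from collections import Counter


def filter_edges(edges, hub_min, node_min):
    """Keep edges that touch a hub and whose ends are both well connected.

    Degrees are counted on the unfiltered graph ("potential edges").
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    kept = []
    for a, b in edges:
        touches_hub = degree[a] >= hub_min or degree[b] >= hub_min
        both_connected = degree[a] >= node_min and degree[b] >= node_min
        if touches_hub and both_connected:
            kept.append((a, b))
    return kept


# Toy data with toy thresholds; the first graph would use 200 and 25.
edges = [("Hub", "A"), ("Hub", "B"), ("Hub", "C"), ("A", "B"), ("C", "D")]
print(filter_edges(edges, hub_min=3, node_min=2))
```

On the toy data, only the three edges touching "Hub" are kept. The A to B edge is dropped because neither end is a hub, and C to D is dropped for the same reason (D also falls below the lower threshold).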

The next graph moves toward highlighting the hubs of these link graphs, now requiring hubs with 900 links rather than 200. Ten or so very well linked users pop out at this point.

The names that appear near the centre of this graph probably make up a core part of the community over the years.

Each edge must connect to a node with 900 other potential edges, and all nodes must have at least 10 potential edges. Everything else is hidden.

The final graph focuses on these key hubs once again, filtering out the rest of the cruft. We see that there are 5 hubs that have over 1500 potential edges.

There are now also some key connectors between these hubs that can be easily identified in the middle, even if some of the names are hard to read.

Each edge must connect to a node with 1500 other potential edges, and all nodes must have at least 5 potential edges. Everything else is hidden.