WikiCrowd at 50k answers

April 2, 2022 By addshore

In January 2022 I published a new Wikimedia tool called WikiCrowd.

This tool allows people to answer simple questions to contribute edits to Wikimedia projects such as Wikimedia Commons and Wikidata.

It’s designed to handle a wide variety of questions, but due to time constraints the current questions only cover Aliases for Wikidata and Depicts statements for Wikimedia Commons.

The tool has just surpassed 55k questions, 50k answers, 32k edits and 75 users.

Thanks to @pmgpmgpmgpmg (Twitter, GitHub) and @waldyrious (Twitter, GitHub) for their sustained contributions to the project, filing issues as well as contributing code and question definitions.

User Leaderboard

Though I haven’t implemented a leaderboard as part of the tool, the number of questions answered and the resulting edits are tracked in the backend.

Thus, of the 50k answers, we can take a look at who contributed to the crowd!

  1. PMG: 35,581 answers resulting in 21,084 edits at a 59% edit rate
  2. I dream of horses: 4,543 answers resulting in 3,184 edits at a 70% edit rate
  3. Tiefenschaerfe: 3,749 answers resulting in 3,207 edits at an 85% edit rate
  4. Addshore: 3,049 answers resulting in 2,133 edits at a 69% edit rate
  5. OutdoorAcorn: 708 answers resulting in 526 edits at a 74% edit rate
  6. Waldyrious: 443 answers resulting in 310 edits at a 69% edit rate
  7. Fences and windows: 409 answers resulting in 242 edits at a 59% edit rate
  8. Amazomagisto: 328 answers resulting in 211 edits at a 64% edit rate
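The edit rate in each row is just edits divided by answers. Here is a minimal Python sketch that recomputes it from the figures in the list. It assumes the percentages were truncated rather than rounded, which is inferred from the numbers (3,207 / 3,749 is roughly 85.5%, listed as 85%) and is not something the tool documents.

```python
# (answers, edits) per user, copied from the leaderboard above.
leaderboard = {
    "PMG": (35581, 21084),
    "I dream of horses": (4543, 3184),
    "Tiefenschaerfe": (3749, 3207),
    "Addshore": (3049, 2133),
    "OutdoorAcorn": (708, 526),
    "Waldyrious": (443, 310),
    "Fences and windows": (409, 242),
    "Amazomagisto": (328, 211),
}


def edit_rate(answers: int, edits: int) -> int:
    """Whole-number percentage of answers that led to an edit.

    Uses integer division, so the result is truncated, not rounded
    (an assumption chosen to match the listed figures).
    """
    return edits * 100 // answers


for user, (answers, edits) in leaderboard.items():
    print(f"{user}: {answers} answers, {edits} edits, {edit_rate(answers, edits)}% edit rate")
```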

Thanks to all of the 75 users that have given the tool a go in the past months.

Answer overview

  • Yes is the favourite answer with 32,192 occurrences
  • No comes second with 13,473 occurrences
  • And a total of 3,818 questions were skipped altogether
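As a quick check, the three counts above can be summed and turned into shares of all recorded answers. This is only a sketch of the arithmetic; the variable names are made up for the example.

```python
# Answer counts copied from the list above.
answer_counts = {"yes": 32192, "no": 13473, "skip": 3818}

# Total number of recorded answers, skips included.
total = sum(answer_counts.values())

# Share of each answer as a percentage of the total, to one decimal place.
shares = {answer: round(count * 100 / total, 1) for answer, count in answer_counts.items()}

print(total)
for answer, share in shares.items():
    print(f"{answer}: {share}%")
```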

In the future, skipped questions will likely be presented to a user a second time.

Question overview

Depicts questions have been by far the most popular, and they are also the easiest type to generate more interesting groups of questions for.

  • 48,236 Depicts questions
  • 776 Alias questions
  • 471 Depicts refinement questions

The question mega groups were split into subgroups.

  • Depicts has had 45 different things that could be depicted
  • Aliases can be added from 3 different language Wikipedias
  • Depicts refinement has been used on 19 of the 45 depicted things

Question success rate

Some questions are harder than others, and some questions have better filtering in terms of candidate answers than others.

For this reason, I suspect that some questions will have a much higher success rate than others, and that some will be skipped more often.

At a high level, the groups of questions have quite different yes rates.

  • Depicts: 65% yes, 27% no, 8% skip
  • Alias: 54% yes, 23% no, 21% skip
  • Depicts refinement: 95% yes, 2% no, 2% skip

If we take a deeper dive into the depicts questions, we can see that some depictions are probably hard to spot, and that some Commons categories possibly include a wider variety of media around a core subject.

An example of this would be categories for US presidents that also include whole categories for election campaigns, or demonstrations, neither of which would normally feature the president.

Depicts               Yes      No       Skip
jet aircraft          95.19%   3.48%    1.33%
steam locomotive      85.24%   7.48%    7.28%
house cat             74.26%   16.31%   9.43%
electric toothbrush   48.79%   34.76%   16.45%
Barack Obama          28.29%   70.23%   1.49%
pie chart             21.13%   61.76%   17.11%
covered bridge        3.51%    79.61%   16.88%
Summary of depict questions (where over ~1000 questions exist) ordered by yes %

The percentage of yes answers could be used to gauge how easy a group of questions is, allowing some users to pick harder categories, or forcing new users to try easy questions first.
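To make that idea concrete, here is a small Python sketch that buckets the depicts groups from the summary above by their yes rate. The cut-offs (`EASY_MIN`, `HARD_MAX`) and the level names are hypothetical; nothing like this exists in the tool.

```python
# Yes rates (%) copied from the summary of depicts questions above.
yes_rate = {
    "jet aircraft": 95.19,
    "steam locomotive": 85.24,
    "house cat": 74.26,
    "electric toothbrush": 48.79,
    "Barack Obama": 28.29,
    "pie chart": 21.13,
    "covered bridge": 3.51,
}

# Hypothetical thresholds, picked only for illustration.
EASY_MIN = 70.0  # at or above this yes rate, treat the group as easy
HARD_MAX = 30.0  # at or below this yes rate, treat the group as hard


def difficulty(rate: float) -> str:
    """Map a yes rate (in percent) to a rough difficulty level."""
    if rate >= EASY_MIN:
        return "easy"
    if rate <= HARD_MAX:
        return "hard"
    return "medium"


# Group the question subjects by level, keeping the table order.
by_level: dict[str, list[str]] = {}
for name, rate in yes_rate.items():
    by_level.setdefault(difficulty(rate), []).append(name)

for level, names in by_level.items():
    print(f"{level}: {', '.join(names)}")
```

New users could then be served only the "easy" bucket, while the "hard" bucket doubles as a list of groups whose question generation probably needs tuning.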

As question generation is tweaked, particularly for depicts questions where categories can be excluded, we should also see the yes % change over time. Slowly tuning question generation to get to an 80% yes range could be fun!

Of course, none of this is implemented yet ;)…

Queries behind this data

Just in case this needs to be generated again, here are the queries used.

For the user leaderboards…

	->select('username', DB::raw('count(*) as answers'))
	->orderBy('answers', 'desc')

	->select('username', DB::raw('count(*) as edits'))
	->orderBy('edits', 'desc')
	->get();

And the question yes rate data came from the following query and a pivot table…

	->select('','answer',DB::raw('count(*) as counted'))
	->join('answers','answers.question_id','=','', 'left outer')
	->join('edits','edits.question_id','=','', 'left outer')
	->get();
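The pivot step itself is not shown above, so here is a rough Python sketch of one way to do it. It assumes the query returns rows of (group, answer, counted), with the first column identifying the question group. The group names and counts below are invented purely to show the shape of the data.

```python
from collections import defaultdict

# Illustrative rows only: (group, answer, counted). Names and counts are made up.
rows = [
    ("group-a", "yes", 80),
    ("group-a", "no", 15),
    ("group-a", "skip", 5),
    ("group-b", "yes", 10),
    ("group-b", "no", 30),
    ("group-b", "skip", 10),
]

ANSWERS = ("yes", "no", "skip")


def pivot_percentages(rows):
    """Pivot (group, answer, counted) rows into per-group answer percentages."""
    counts = defaultdict(lambda: defaultdict(int))
    for group, answer, counted in rows:
        counts[group][answer] += counted

    table = {}
    for group, by_answer in counts.items():
        total = sum(by_answer.values())
        # Answers with no rows for a group count as zero.
        table[group] = {a: round(by_answer.get(a, 0) * 100 / total, 2) for a in ANSWERS}
    return table


table = pivot_percentages(rows)

# Print the groups ordered by yes %, highest first.
for group, pct in sorted(table.items(), key=lambda item: item[1]["yes"], reverse=True):
    print(f"{group}: {pct['yes']}% yes, {pct['no']}% no, {pct['skip']}% skip")
```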

Looking forward

Come and contribute code, issues, or ideas on the GitHub repo.

Next blog post at 100k? Or maybe now that there are cron jobs for question generation (people don’t have to wait for me) 250k is a more sensible next step.