Wikidata query service Blazegraph JNL file on Cloudflare R2 and Internet Archive

August 24, 2023 By addshore
This entry is part 3 of 3 in the series Your own Wikidata Query Service

At the end of 2022, I published a Blazegraph JNL file for Wikidata in a Google Cloud bucket for 1 month for folks to download and determine if it was useful.

Thanks to Arno from weblyzard, inflatador from the WMF search platform team, and Mark from the Internet Archive for the recent conversations around this topic.

You can now grab some new JNL files from a few days ago, hosted on either the Internet Archive or Cloudflare R2.

If you want to use these files, check out the relevant section in my previous post.

Host changes

The use of a Google Cloud bucket came with a large downside: data egress costs of around 121 Euros per download when leaving Google Cloud. Some folks still found the file useful despite this cost, and some folks also just used it inside Google Cloud.

After Arno reached out to me, I thought about who might be up for hosting multiple terabytes of data for some time with free data transfer (and ideally covering hosting costs), and ended up emailing Mark from the Internet Archive, who said I should just go ahead and upload it.

Conveniently, the Internet Archive has an S3-like API, so sending the data there should be as easy as sending it to Google Cloud.

While talking to inflatador about extracting a new JNL file from a Wikimedia production host, they informed me that Cloudflare R2 (S3-compatible buckets) has free data egress. 🥳

So, moving forward I’ll try to provide the latest JNL file on Cloudflare R2, and can store a few historical copies as well.

Uploading the files

I struggled for multiple days to make the uploads to both targets actually complete to 100%, as can be seen by the rather network-intensive period for the wdqs1009 host in Wikimedia production that I was using to retrieve the JNL file from.

The first transfer was to the Internet Archive, which worked first time, though I only used curl, and I am unsure whether the validity of the transfer would actually have been checked.

curl --location --header 'x-amz-auto-make-bucket:1' \
     --header 'x-archive-meta01-subject:wikidata' \
     --header 'x-archive-meta02-subject:jnl' \
     --header 'x-archive-meta-mediatype:data' \
     --header 'x-archive-meta-licenseurl:' \
     --header "authorization: LOW aaa:bbb" \
     --upload-file /srv/wdqs/wikidata.jnl \
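One way to remove the doubt about transfer validity, regardless of what either end checks, is to record a checksum of the file on the source host and compare it against a downloaded copy. The following is a minimal sketch of that idea, not something that was done for the uploads described here; the `verify_sha256` function name and the `wikidata.jnl.sha256` file are hypothetical, and it assumes GNU coreutils `sha256sum` is available.

```shell
#!/usr/bin/env bash
# Return success only if FILE's SHA-256 digest matches EXPECTED (a hex string).
# verify_sha256 is a hypothetical helper name used for illustration.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual=$(sha256sum "$file" | awk '{ print $1 }')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (expected $expected, got $actual)" >&2
    return 1
  fi
}

# Usage sketch (file names are assumptions):
# record the digest on the source host before uploading ...
#   sha256sum /srv/wdqs/wikidata.jnl > wikidata.jnl.sha256
# ... then check a downloaded copy against it:
#   verify_sha256 ./wikidata.jnl "$(awk '{ print $1 }' wikidata.jnl.sha256)"
```

Hashing a 1.2TB file takes a while, but it only has to be done once per published file, and the digest can be shared alongside the download.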

It took roughly 15 hours to transfer the 1.2TB file, which is ≈ 80 GB/hour (it could probably go faster with an S3 CLI tool). The Internet Archive then seemingly does a bunch of post-processing, so the file was not available to download for another day or so.
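For comparing this against speeds quoted in MB/s, the same figure can be converted with a little arithmetic. This is a small sketch assuming decimal units (1 TB = 1000 GB, 1 GB = 1000 MB); the `transfer_rate` function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the average rate of a transfer of SIZE_GB gigabytes that took HOURS hours,
# both in GB/hour and in MB/s. Decimal units are assumed (1 GB = 1000 MB).
transfer_rate() {
  local size_gb="$1" hours="$2"
  awk -v s="$size_gb" -v h="$hours" \
    'BEGIN { printf "%.2f GB/hour, %.2f MB/s\n", s / h, s * 1000 / (h * 3600) }'
}

# The 1.2TB upload that took roughly 15 hours:
transfer_rate 1200 15   # prints "80.00 GB/hour, 22.22 MB/s"
```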

R2 couldn’t just take a 1.2TB file in a single chunk through a curl file upload, so I looked at tools that would work with it.

I couldn’t figure out how to make s3cmd actually work (which is the tool I used previously), but settled on rclone, which is provided as an example in the R2 docs.

The second transfer, which hit 120MB/s, was to R2 using the cat command, though this failed with some chunk errors at the end. This could be due to the lack of chunk validity checks, but could also be due to the number of concurrent uploaders I was trying to use at the time (32). (See the issue on GitHub.)

The third transfer, which used copyto, also failed: it seemingly uploaded 1TB and then failed right at the end. This could have been because the query service process on the host restarted and was touching the JNL file.

Finally, the fourth upload worked, taking around 4.5 hours, which is ≈ 266.67 GB/hour.

rclone copyto -P /srv/wdqs/wikidata.jnl cloudflare:addshore-wikidata-jnl/2023-08-22.jnl --s3-upload-cutoff=2G --s3-chunk-size=2G --transfers=4 --s3-upload-concurrency=4

Download speeds

I did a quick check on the download speed from the two new targets with a simple wget, and as expected, R2 is significantly faster to download from than the Internet Archive over HTTP.

Is it possible that using some S3 tool to download these files might result in faster download speeds?
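Much of what such tools do to speed up large downloads is fetch several byte ranges of the file in parallel. The following is a rough sketch of that technique using only curl, to make it easy to experiment with. It is untested against either host; the `byte_ranges` and `parallel_download` names are hypothetical, it assumes the server reports a Content-Length and honours HTTP Range requests, and it temporarily needs about twice the file's size in free disk space while the parts are joined.

```shell
#!/usr/bin/env bash
# Print PARTS contiguous byte ranges ("start-end", inclusive) covering TOTAL bytes.
# Assumes TOTAL >= PARTS. The last range absorbs any remainder.
byte_ranges() {
  local total="$1" parts="$2" size start end i
  size=$(( total / parts ))
  for (( i = 0; i < parts; i++ )); do
    start=$(( i * size ))
    if (( i == parts - 1 )); then
      end=$(( total - 1 ))
    else
      end=$(( start + size - 1 ))
    fi
    echo "${start}-${end}"
  done
}

# Download URL into OUT using PARTS parallel HTTP range requests (default 8).
# The URL is supplied by the caller; no real download URL is assumed here.
parallel_download() {
  local url="$1" out="$2" parts="${3:-8}" total range i n=0
  # Follow redirects and keep the Content-Length of the final response.
  total=$(curl -sIL "$url" | tr -d '\r' \
    | awk 'tolower($1) == "content-length:" { len = $2 } END { print len }')
  for range in $(byte_ranges "$total" "$parts"); do
    curl -sL --range "$range" -o "${out}.part${n}" "$url" &
    n=$(( n + 1 ))
  done
  wait
  # Join the parts in numeric order, then clean up.
  for (( i = 0; i < parts; i++ )); do cat "${out}.part${i}"; done > "$out"
  rm -f "${out}".part*
}

# Example of how the ranges are split:
byte_ranges 10 3
```

Whether this actually beats a single wget will depend on where the bottleneck is, so it is worth timing both before relying on it.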

Moving forward

Feel free to reach out to me if you want a newer copy of the JNL file published. The whole process should now take less than 12 hours to get a file to the Internet Archive for it to start processing, and to get a live file on R2.

There are discussions about providing these JNL files more regularly via a defined process, and I have opened a ticket to focus that conversation: T344905.

Hosting these files on R2 does cost, so feel free to Buy me a coffee to support the cost.
