Wikidata query service Blazegraph JNL file on Cloudflare R2 and Internet Archive
At the end of 2022, I published a Blazegraph JNL file for Wikidata in a Google Cloud bucket for 1 month for folks to download and determine if it was useful.
You can now grab some new JNL files from a few days ago, hosted on either the Internet Archive or Cloudflare R2.
Cloudflare R2 (faster, won’t be around forever)Deleted December 2023
- Internet Archive (slower, will be around for quite some time)
If you want to use these files, check out the relevant section in my previous post.
The use of a Google Cloud bucket came with a large downside of data egress cost overheads being around 121 Euros per download when exiting Google Cloud. Some folks still found the file useful despite this cost, and some folks also just used it inside Google Cloud.
After Arno reached out to me, I thought about who might be up for hosting multiple terabytes of data for some time with free data transfer (and ideally covering hosting costs), and ended up emailing Mark from the Internet Archive, who said I should just go ahead and upload it.
Conveniently archive.org has an S3 like API to use, so sending the data there should be as easy as sending it to Google Cloud.
So, moving forward I’ll try to provide the latest JNL file on Cloudflare R2, and archive.org can store a few historical copies as well.
Uploading the files
I struggled for multiple days to make the uploads to both targets actually complete to 100%, as can be seen by the rather network-intensive period for the wdqs1009 host in Wikimedia production that I was using to retrieve the JNL file from.
The first transfer was to archive.org which worked first time, though I only used CURL, and am unsure if the validity of the transfer would have actually been checked?
curl --location --header 'x-amz-auto-make-bucket:1' \
--header 'x-archive-meta01-subject:wikidata' \
--header 'x-archive-meta02-subject:jnl' \
--header 'x-archive-meta-mediatype:data' \
--header 'x-archive-meta-licenseurl:https://creativecommons.org/publicdomain/zero/1.0/' \
--header "authorization: LOW aaa:bbb" \
--upload-file /srv/wdqs/wikidata.jnl \
It took roughly 15 hours to transfer the 1.2TB file, which is ≈ 80 GB/hour (could probably go faster if using an S3 CLI tool)
Archive.org then seemingly has a bunch of post-processing that happens, so the file was not available to download for another day or so.
R2 couldn’t just take a 1.2TB file in a single chunk through a curl file upload, so I looked at tools that would work with it.
The second transfer, which hit 120MB/s was to R2 using the cat command, though this failed with some chunk errors at the end. This could be due to the lack of chunk validity checks, but also could be due to the number of concurrent uploaders I was trying to use at the time (32). (See issue on Github)
As did the third transfer used copyto, which seemingly uploaded 1TB and then failed right at the end. This could have been due to the fact that the query service process on the host restarted and was touching the JNL file.
Finally, the fourth upload worked, taking around 4.5 hours which is ≈ 266.67 GB/hour.
rclone copyto -P /srv/wdqs/wikidata.jnl cloudflare:addshore-wikidata-jnl/2023-08-22.jnl --s3-upload-cutoff=2G --s3-chunk-size=2G --transfers=4 --s3-upload-concurrency=4
I did a quick check on the download speed from the two new targets with a simple wget, and as expected R2 is significantly faster to download from than archive.org over HTTP.
Is it possible that using some S3 tool to download these files might result in faster download speeds?
Feel free to reach out to me if you want a newer copy of the JNL file published. The whole process should now take less than 12 hours to get a file to archive.org for it to start processing and to get a live file on R2.
There are discussions about providing these JNL files more regularly via a defined process perhaps on dumps.wikimedia.org, and I have opened a ticket to focus that conversation. T344905
Hosting these files on R2 does cost, so feel free to Buy me a coffee to support the cost.