Cloudflare workers for wikibase.cloud uptime & status
Recently I wanted to create a live status page for wikibase.cloud that also tracks the status of the various services and their response times, so that people in the Telegram group could try to correlate their experiences (such as slow behaviour) with what others were seeing in other locations and on other sites, without needing to ask in the group.
In a way, this could be seen as an iteration on the current status page for the service, which is maintained as a static site on GitHub, making use of cState, a static status page generator.
I initially chose to experiment with Cloudflare Workers for the per-minute checks, after looking around at the current offerings for running code online for free (Heroku-style platforms and the like).
Why Workers?
The pricing model is very simple, and everything I wanted to do fits within the free plan.
That is 100,000 requests per day, and 10 milliseconds of CPU time per invocation.
Checking every minute, I should be running 1,440 times a day. And as Workers are billed on CPU time, and most of each run is spent waiting on IO (performing the checks, then writing the data), everything should stay within the 10ms limit.
There are also detailed docs and examples for using crons within Workers.
Workers now also come with an analytics DB (Workers Analytics Engine), which can easily be enabled and which you can throw data into, such as the status checks. My usage of this would also fall within the free plan.
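Putting those two pieces together, the skeleton of the Worker looks roughly like this (a minimal sketch using the modules syntax and the types from @cloudflare/workers-types; the binding name matches the wrapper shown later, everything else is illustrative):

// wrangler.toml would contain something like:
//   [triggers]
//   crons = ["* * * * *"]
// so that scheduled() is invoked once a minute.

export interface Env {
    // Analytics Engine dataset binding, declared under [[analytics_engine_datasets]] in wrangler.toml
    WBCLOUD_STATUS?: AnalyticsEngineDataset;
}

export default {
    async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext): Promise<void> {
        // waitUntil keeps the invocation alive until all checks have finished
        ctx.waitUntil(runAllChecks(env));
    },
};

async function runAllChecks(env: Env): Promise<void> {
    // ...fetch each service, time the responses, and record the results...
}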
The code
The final state of the code while the checks were run from Cloudflare can be seen in this commit on Github.
The cron, in essence (one such check is sketched after this list):
- Looked at various services' response times, expecting a 200.
- Checked that other services such as SPARQL and Elasticsearch actually returned data that we expected to exist.
- Checked and reported maxlag as reported by the MediaWiki API.
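For example, a single HTTP check could look something like the following (a rough sketch only; the check name, URL handling and column layout are my guesses, lined up with the writeData() wrapper and the SQL query shown later):

// One HTTP check: fetch a URL, time it, and record status, body size and latency.
async function checkHttp(env: Env, checkName: string, url: string): Promise<void> {
    const start = Date.now();
    let status = 0;
    let bytes = 0;
    try {
        const response = await fetch(url, { redirect: "follow" });
        status = response.status;
        bytes = (await response.text()).length;
    } catch (e) {
        console.log(`Check ${checkName} failed: ${e}`);
    }
    const time = Date.now() - start;
    // index1 = check name, doubles = status, bytes, time,
    // roughly matching the aliases used in the SQL query further down
    await writeData(env, [checkName], [status, bytes, time], [checkName]);
}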
Writing data into the analytics DB was a breeze from the worker, and I wrote this small wrapper method so that I could see what data would be recorded while running locally in the dev setup (as an analytics DB is not currently provided there).
async function writeData(env: Env, blobs: string[], doubles: number[], indexes: string[]) {
    // The Analytics Engine binding only exists when deployed, not when running in wrangler dev
    if (env.WBCLOUD_STATUS) {
        env.WBCLOUD_STATUS.writeDataPoint({ 'blobs': blobs, 'doubles': doubles, 'indexes': indexes });
    } else {
        // Analytics engine not currently supported locally
        // https://github.com/cloudflare/workers-sdk/issues/4383
        console.log("In wrangler dev, skipping writeDataPoint");
        console.log({ 'blobs': blobs, 'doubles': doubles, 'indexes': indexes });
    }
}
Having this covered by the local dev SDK itself is tracked under issue 4383 and/or issue 5532.
Display
I also wrote a little HTTP endpoint for the worker that returns cached data from the analytics DB.
Ultimately, this DB lets you query it over HTTP using SQL:
SELECT toStartOfInterval(timestamp, INTERVAL '1' MINUTE) AS timestamp, index1 AS check, double1 AS status, double2 AS bytes, double3 AS time, double4 AS extra
FROM wbc_status
WHERE blob1 = 'dev_wbc_check_0001'
AND timestamp >= toDateTime(toUnixTimestamp(now()) - 7*24*60*60)
AND timestamp <= now()
ORDER BY timestamp DESC
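The Worker side of that endpoint can run such a query against the Analytics Engine SQL API over HTTP, along these lines (a sketch only; ACCOUNT_ID and API_TOKEN are assumed to be configured as vars/secrets rather than names taken from the project, and the exact response shape is best checked against the Analytics Engine docs):

// Run a SQL query against the Workers Analytics Engine SQL API and return the JSON result.
async function queryStatusData(env: { ACCOUNT_ID: string; API_TOKEN: string }, sql: string): Promise<unknown> {
    const response = await fetch(
        `https://api.cloudflare.com/client/v4/accounts/${env.ACCOUNT_ID}/analytics_engine/sql`,
        {
            method: "POST",
            headers: { Authorization: `Bearer ${env.API_TOKEN}` },
            body: sql,
        }
    );
    if (!response.ok) {
        throw new Error(`Analytics Engine query failed with HTTP ${response.status}`);
    }
    return response.json();
}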
A very simple, but ugly, HTML page then loaded all of this data in JS and munged it together into some graphs using Plotly.
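The plotting itself needs very little code; something in this spirit is enough to draw a latency graph (the endpoint path, field names and element id are made up for the sketch):

// Fetch rows from the status endpoint and draw a simple response-time line with Plotly.
declare const Plotly: any; // Plotly is loaded from its CDN on the page

async function drawResponseTimeGraph(): Promise<void> {
    const response = await fetch("/data?check=dev_wbc_check_0001");
    const rows: { timestamp: string; time: number }[] = (await response.json()).data;
    Plotly.newPlot("response-time-graph", [{
        x: rows.map((row) => row.timestamp),
        y: rows.map((row) => row.time),
        type: "scatter",
        mode: "lines",
        name: "response time (ms)",
    }]);
}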
Move to Toolforge
Now, this little project certainly taught me to love Cloudflare Workers, and in fact I now make use of them in other projects. The fact that they are billed on CPU time is excellent, the execution cost is low, and they come with access to the caching API, which in some use cases is a killer feature.
I ultimately decided that I wanted to use my Cloudflare free plan for other things, as well as open up the status page a little more for possible contributions from other Wikimedia folks, so off to Toolforge!
In principle, much of the idea is the same, except there are now additional checks for editing speed, as well as how long edits take to appear in both SPARQL and Elasticsearch. Rather than writing to a fancy DB, I instead just write to CSV files (one per day) and load and process these in JS.
For example, time, latency, status…
00:03:56,394,1
00:10:36,413,1
00:15:18,313,1
00:18:06,414,1
00:19:49,334,1
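Loading these on the page is then just a matter of fetching each day's file and splitting lines, something like this (the field meanings follow the example above; the path and file naming are my guesses):

// Load and parse one day's CSV of check results in the browser.
interface CheckSample {
    time: string;    // HH:MM:SS
    latency: number; // milliseconds
    status: number;  // 1 = ok, 0 = failed
}

async function loadDay(date: string): Promise<CheckSample[]> {
    const response = await fetch(`./data/${date}.csv`);
    if (!response.ok) {
        return []; // a missing day simply contributes no points
    }
    const text = await response.text();
    return text.trim().split("\n").map((line) => {
        const [time, latency, status] = line.split(",");
        return { time, latency: Number(latency), status: Number(status) };
    });
}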
It’s now all written in Python and makes use of threads. I have seen some decrease in reliability since the rewrite; things were certainly easier to deal with, and to reason about, when each run was an individual isolated Worker starting up fresh every minute.
Anyway, the graphs still look nice!
Can you provide your toolforge link or source there?
GitHub code: https://github.com/addshore/wikibase-cloud-status
This is what is running on Toolforge: https://github.com/addshore/wikibase-cloud-status/blob/main/py/index.py
The HTML is currently only on GitHub Pages: https://addshore.github.io/wikibase-cloud-status
Now that the dataset has grown, I think I need to reduce the number of CSV files that get downloaded, as it quite often fails on the first load and requires a refresh.