Google Cloud Storage upload with s3cmd

I recently had to upload a large (1TB) file from a Wikimedia server into a Google Cloud Storage bucket. gsutil was not available to me, but s3cmd was. So here is a little how-to post for uploading to a Google Cloud Storage bucket using the S3 API and s3cmd.

S3cmd is a free command line tool and client for uploading, retrieving and managing data in Amazon S3 and other cloud storage service providers that use the S3 protocol, such as Google Cloud Storage or DreamHost DreamObjects.

https://s3tools.org/s3cmd

s3cmd was already installed on my system, but if you want to get it, see the download page.
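For context, s3cmd talks to Google Cloud Storage through the bucket's S3-compatible "interoperability" endpoint, using HMAC keys generated in the GCS interoperability settings. A minimal configuration sketch (the key values are placeholders, not real credentials):

```ini
; Minimal ~/.s3cfg for Google Cloud Storage (interoperability mode).
; access_key / secret_key are HMAC keys from the GCS "Interoperability" settings page.
[default]
access_key = GOOG1EXAMPLEACCESSKEY
secret_key = example-secret-value
host_base = storage.googleapis.com
host_bucket = %(bucket)s.storage.googleapis.com
; GCS's S3-compatible API has not always accepted S3-style multipart
; uploads, so disabling multipart may be necessary for large files:
enable_multipart = False
```

With something like that in place, the upload itself is just `s3cmd put ./large-file s3://my-bucket/large-file`.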

Read more

Hunting YouTube Crypto Scams

Back in April 2022 I got annoyed by how prevalent cryptocurrency scams still were on YouTube, years after I had first seen them. I spent a few minutes going through the scams that I easily found with a search for live streams mentioning either “ETH” or “BTC”, and reporting them via the YouTube flag / report system. Many hours later they were eventually taken down, but not before more scam live streams were already running to take their place.

Really I wanted (and still want) YouTube to do a better job… They have all of the information needed to shut these down in the first seconds of them being live. But I figured I’d see how easy this would be to automate as a system using only the public APIs etc.

This post covers the initial prototype, followed by the scam-hunter web app which ran for some months before I sunset it last week. TL;DR: lots of money was stolen while I was looking at these scam streams.

Example of the scam

When running, these streams are very easy to find by just searching for them (live streams that mention “BTC” or “ETH”). You’ll either end up with streams displaying charts of the values compared with other crypto assets, or scam streams.

The scam streams take a variety of different forms, but most of them make use of pre-recorded videos of conversations with folks such as Elon Musk talking about cryptocurrencies, while also promoting a website such as MuskLiveNow.Tech (I made this one up) which claims to be running a giveaway event.
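The automation I started with boils down to searching live streams for those keywords and then scoring the results. A toy sketch of the kind of heuristic involved (the keyword lists and scoring are mine for illustration, not the real scam-hunter code):

```python
# Toy heuristic for flagging likely crypto-scam live streams.
# Keyword lists are illustrative only, not the real scam-hunter logic.
SCAM_HINTS = ("giveaway", "double your", "elon", "musk", "100% legit")
CRYPTO_HINTS = ("btc", "eth", "bitcoin", "ethereum", "crypto")

def looks_like_scam(title: str, description: str) -> bool:
    """Flag a live stream whose text mixes crypto terms with giveaway bait."""
    text = f"{title} {description}".lower()
    mentions_crypto = any(hint in text for hint in CRYPTO_HINTS)
    mentions_bait = any(hint in text for hint in SCAM_HINTS)
    return mentions_crypto and mentions_bait
```

A chart stream mentioning only “ETH/BTC” would pass, while a “Musk BTC giveaway” stream would be flagged for a closer look.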

Screenshot of a crypto scam on YouTube from April 2022

Read more

Sailing: Month 1

I’m writing a lot of content over on sailinghannahpenn.co.uk, and I want to share some of that here, linking to most of the posts and adding a little more.

We actually wrote most of the initial blog posts in Month 1, as we had already set off before creating the blog. I’m looking forward to being able to look back on the blog in the future.

I tried writing a diary while driving my old van through Europe, but I stopped halfway through. These blogs feel like something I can continue.

Month 1 covers the initial splash of the boat after having lots of work done, all the way through to the day before the Biscay crossing (day 1 to day 31).

Read more

A digital nomad boat experience

I have always been somewhat of a digital nomad in my working life, an opportunity that mainly exists due to my very flexible job as a software engineer at Wikimedia Germany.

Working in Isla Mujeres, Quintana Roo, Mexico 2019

Over the years I have been primarily based in the UK but have travelled with work to California, South Africa, Israel and many places in Europe, among others.

As well as these work trips, I managed an extended vacation in 2019 through Central America where I worked around 10 hours per week, as well as other hops to Portugal etc.

This is what I find myself attempting again in 2022. But rather than Central America, it will be “the world” on a boat, with a slightly less regular mobile data connection.

On previous trips, I didn’t really blog much, at least not about the travel. The one post I have from around 6 months in Central America was a post detailing travel between 2 places. And for work trips, if I blog I focus on the work aspects, such as this post on the Lyon Wikimedia Hackathon. I want that to change with this sailing adventure.

In fact, as I write this I am in the middle of the Bay of Biscay, and I just listened to Between the Brackets episode 117 where Yaron and Brian mentioned my little adventure (partly the reason I decided to write this post). So here are some more details about the plan.

For more sailing details and to follow along, you might also want to check out sailinghannahpenn.co.uk where there will be more sailing content this year. There are already posts covering the first 50+ days of sailing!

Read more

Modifying default width of WordPress pages using new themes like Blockbase

I switched to the Blockbase WordPress theme a few weeks ago as it supports full-site editing, which brings blocks to all parts of your site rather than just posts and pages.

I found the content to be quite narrow out of the box at 620px.

Screenshot showing narrow text out of the box with the Blockbase theme
Screenshot of the default content width (620px) from my 4K display

I really wanted the default view for folks with wider screens such as myself to be a little wider, so I went on an adventure to change this.
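In block themes, the content width lives in theme.json, so a child theme can override Blockbase's default along these lines (the pixel values here are just examples, not what I settled on):

```json
{
  "version": 2,
  "settings": {
    "layout": {
      "contentSize": "840px",
      "wideSize": "1100px"
    }
  }
}
```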

Read more

Wikidata query service updater evolution

The Wikidata Query Service (WDQS) sits in front of Wikidata and provides access to query its data via a SPARQL API. The query service itself is built on top of Blazegraph, but in many regards is very similar to any other triple store that provides a SPARQL API.

In the early days of the query service (circa 2015), the service was only run by Wikidata, hence the name. However, as interest in and usage of Wikibase continued to grow, more people started running a query service of their own, for data in their own Wikibase. But you’ll notice most people still refer to it as WDQS today.

Whereas most core Wikibase functionality is developed by Wikimedia Deutschland, the query service is developed by the search platform team at the Wikimedia Foundation, with a focus on wikidata.org, but also a goal of keeping it useable outside of Wikimedia infrastructure.

The query service itself currently works as a whole application rather than just a database. Under the surface, this can roughly be split into 2 key parts:

  • Backend Blazegraph database that stores and indexes data
  • Updater process that takes data from a Wikibase and puts it in the database

This actually means that you can run your own query service, without running a Wikibase at all. For example, you can load the whole of Wikidata into a query service that you operate, and have it stay up to date with current events. Though in practice this takes quite some work and expense on storage and indexing, so I expect not many folks do this.

Over time the updater element of the query service has iterated through some changes. The updater packaged with Wikibase, as used by most folks outside of the Wikimedia infrastructure, is now 2 steps behind the updater used for Wikidata itself.

The updater generations look something like this:

  • HTTP API Recent Changes polling updater (used by most Wikibases)
  • Kafka based Recent Changes polling updater
  • Streaming updater (used on Wikidata)
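At a high level, the first-generation updater polls recent changes and re-fetches the touched entities. A rough sketch of the core batching step (the shape of the change records here is my assumption for illustration, not the real updater's code):

```python
def entities_to_refresh(changes: list[dict]) -> dict[str, int]:
    """From a batch of recent-change records, pick each entity once,
    keeping only the newest revision seen for it."""
    latest: dict[str, int] = {}
    for change in changes:
        entity_id, rev = change["title"], change["revid"]
        if rev > latest.get(entity_id, -1):
            latest[entity_id] = rev
    return latest
```

Deduplicating like this matters because a busy entity can appear many times in one polling window, but only needs to be re-fetched once.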

Let’s take a look at a high-level overview of these updaters, what has changed and why. I’ll also be applying some pretty arbitrary / gut feeling scores to 4 categories for each updater.

Read more

Infrastructure as Code for wbstack deployments

This entry is part 12 of 12 in the series WBStack

For most of its life wbstack was mostly a one-man operation. This certainly sped up the decision-making process around features, requests, communication and prioritization, but it also meant I had to maintain a complex and young project supporting hundreds of sites on the side of my regular 8 hour day job.

In order to ensure that I’d feel comfortable with this extra context, be able to support the platform for multiple years, have a platform that could grow and scale from day one, and leave the future of the platform with as many possibilities as possible, I roughly followed a few principles throughout implementation and operation.

  • Scalability: Think about scale at multiple levels. Everything was either already horizontally scalable, or the path to get to horizontal scalability had been thought out
  • Automation: Automate actions, if you have 2 of something now, pretend you have 1000 of them instead and develop the solution to fit
  • Infrastructure as code: All infrastructure configuration was contained somehow in the deploy repository
  • Cloud agnostic: Things would be cloud-agnostic where possible, resulting in most things being in Kubernetes or using other external services
  • Own fewer things: Try to not create many new services or codebases, or take ownership of forks that should not exist, as this will become too much work

The one part of the above list that I want to dive into more in this post is infrastructure as code and how it worked for the multi-year lifespan of wbstack, before the move to wikibase.cloud.
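For a flavour of what “contained in the deploy repository” means in practice, this is the shape of a Kubernetes manifest such a repository holds (names and values are invented for illustration, not an actual wbstack manifest):

```yaml
# Illustrative only: not an actual wbstack manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: platform-api
spec:
  replicas: 2          # horizontal scalability is just a number to change
  selector:
    matchLabels:
      app: platform-api
  template:
    metadata:
      labels:
        app: platform-api
    spec:
      containers:
        - name: api
          image: example/platform-api:1.0.0
```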

Read more

WikiCrowd at 50k answers

In January 2022 I published a new Wikimedia tool called WikiCrowd.

This tool allows people to answer simple questions to contribute edits to Wikimedia projects such as Wikimedia Commons and Wikidata.

It’s designed to be able to deal with a wide variety of questions, but due to time constraints, the current questions cover aliases for Wikidata and depicts statements for Wikimedia Commons.

Read more

Wikidata maxlag, via the ApiMaxLagInfo hook

Wikidata tinkers with the concept of maxlag, which has existed in MediaWiki for some years, in order to slow automated editing at times of lag in various systems.

Here you will find a little introduction to MediaWiki maxlag, and the ways that Wikidata hooks into the value, altering it for its needs.

Screenshot of the “Wikidata Edits” grafana dashboard showing increased maxlag and decreased edits

As you can see above, a high maxlag can cause automated editing to reduce or stop on wikidata.org.

Read more

Altering a Gerrit change (git-review workflow)

I don’t use git-review for Gerrit interactions. This is primarily because back in 2012/2013 I couldn’t get git-review installed, and someone presented me with an alternative that worked. Years later I realized that this was actually the documented way of pushing changes to Gerrit.

As a little introduction to what this workflow looks like, and as a comparison with git-review, I have created 2 overview posts on altering a Gerrit change on the Wikimedia Gerrit install. I’m not trying to convince you that either way is better, merely to show the similarities / differences and what is happening behind the scenes.

Be sure to take a look at the other post, “Altering a Gerrit change (git workflow)”.

One prerequisite of this workflow is that you have git-review installed and a .gitreview file in your repository!

I’ll be taking a change from the middle of last year, rebasing it, making a change, and pushing it back for review. Fundamentally the 2 approaches do the same thing; just one (git-review) requires an external tool.
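For reference, the git-review version of that loop looks roughly like this as a command sketch (the change number 123456 is made up):

```shell
# Download the existing change into a local branch (123456 is a made-up number)
git review -d 123456

# Rebase onto the current target branch, then make the edit
git rebase origin/master
git commit --all --amend --no-edit   # keep the Change-Id footer intact

# Push the new patchset back to Gerrit for review
git review
```

The Change-Id footer in the commit message is what lets Gerrit attach the new patchset to the existing change rather than opening a new one.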

Read more