Recently I have been spending a lot of time looking at the Wikimedia Graphite setup while working on Grafana dashboards. In return for the help some people had been giving me, I decided to take a quick look down the list of open Graphite tickets and found T116031. It is great when such a small fix can have such a big impact!
After digging through the code I eventually discovered that the method which sends MediaWiki metrics to StatsD is SamplingStatsdClient::send. This method overrides StatsdClient::send, which is provided by liuggio/statsd-php-client. However, a bug has existed in the sampling client ever since its creation!
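The fix for the bug can be found on Gerrit and is only a +10 -4 line change (only 2 of those lines were actually code). In the old code the result of reduceCount, which combines metrics into fewer packets, was thrown away and the unreduced $messages were sent instead.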
// Before
$data = $this->sampleData( $data );
$messages = array_map( 'strval', $data );
$data = $this->reduceCount( $data );
$this->send( $messages );

// After
$data = $this->sampleData( $data );
$data = array_map( 'strval', $data );
$data = $this->reduceCount( $data );
$this->send( $data );
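To make the effect of those two lines concrete, here is a minimal standalone sketch. It is not the real client code: batchMetrics() is a hypothetical stand-in for reduceCount(), the 512-byte limit is an assumed value, and the metric names are made up for illustration. It only shows why sending the unreduced array produces one packet per metric, while sending the reduced array produces far fewer.

```php
<?php
// Hypothetical stand-in for reduceCount(): joins metric strings with "\n"
// until adding another one would exceed $maxBytes, then starts a new packet.
// The 512-byte default is an assumption made for this sketch.
function batchMetrics( array $metrics, int $maxBytes = 512 ): array {
	$packets = [];
	foreach ( $metrics as $metric ) {
		$last = count( $packets ) - 1;
		if ( $last >= 0 && strlen( $packets[$last] ) + 1 + strlen( $metric ) <= $maxBytes ) {
			// Still room in the current packet, so append to it.
			$packets[$last] .= "\n" . $metric;
		} else {
			// First metric, or the current packet is full.
			$packets[] = $metric;
		}
	}
	return $packets;
}

// Made-up metrics in StatsD line format.
$metrics = [ 'example.edits:1|c', 'example.saves:1|c', 'example.parse:35|ms' ];

// Old flow: the strings are copied first, the batched result is ignored,
// and every metric goes out as its own packet.
$messages = $metrics;
$ignored = batchMetrics( $metrics );
$buggyPackets = $messages;

// New flow: the batched result is what gets sent.
$fixedPackets = batchMetrics( $metrics );

echo count( $buggyPackets ), " packets before, ", count( $fixedPackets ), " after\n";
```

With these three short metrics the old flow sends three packets and the new flow sends one, carrying exactly the same metrics.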
The result of deploying this fix on the Wikimedia cluster can be seen below.
You can see a reduction from roughly 85kpps to 25kpps at the point of deployment. That is a decrease of roughly 70%!
A decrease in bytes received can also be seen, a drop of roughly 1MBps at deployment, even though the same number of metrics is being sent. This is due to the reduction in packet overhead.
The little things really are great. Now to see if we can reduce that packet count even more!