tl;dr: In its next version (to be released Soon™), Scap will deploy to canary 
servers and check for error-log spikes.

In light of recent incidents[0] that caused outages accompanied by large, 
easily detectable error-rate spikes, a patch has recently landed in Scap[1] 
that will:

   1. Push changes to a set of canary servers[2] before syncing to proxy servers
   2. Wait a configurable length of time (currently 20 seconds[3]) so that any 
errors have time to surface
   3. Query Logstash (using a script written by Gabriel Wicke[4]) to determine 
if the error rate has increased over a configurable threshold (currently 

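For the curious, the steps above can be sketched roughly as follows. This is 
an illustration only, not the code that landed in Scap: the function and 
parameter names are hypothetical, the Logstash query is stood in for by a 
caller-supplied `error_rate` callable, the threshold is assumed to be a 
multiplier on the pre-deploy error rate, and aborting the sync when the 
threshold is exceeded is an assumption as well.

```python
import time
from typing import Callable, Sequence


class CanaryCheckFailed(Exception):
    """Raised when the error rate after the canary push exceeds the threshold."""


def canary_sync(
    canaries: Sequence[str],
    proxies: Sequence[str],
    push: Callable[[Sequence[str]], None],   # hypothetical: syncs code to hosts
    error_rate: Callable[[], float],         # hypothetical: stands in for the Logstash query
    threshold: float,                        # assumed: allowed multiple of the baseline rate
    wait_seconds: float = 20.0,              # the announcement's current wait time
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    # Record the error rate before touching anything, so there is a baseline.
    baseline = error_rate()

    # Step 1: push the change to the canary servers only.
    push(canaries)

    # Step 2: give errors time to show up in the logs.
    sleep(wait_seconds)

    # Step 3: compare the new error rate against the baseline.
    # Note: with a baseline of zero, any errors at all trip the check.
    current = error_rate()
    if current > baseline * threshold:
        # Assumption: a failed check stops the sync before it reaches the proxies.
        raise CanaryCheckFailed(
            f"error rate went from {baseline} to {current} "
            f"(allowed factor: {threshold})"
        )

    # The check passed, so continue on to the proxy servers.
    push(proxies)
```

Passing `push`, `error_rate` and `sleep` in as callables keeps the sketch 
independent of any real deployment or logging backend, and makes it easy to 
exercise with fakes.
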
Big thanks to the folks that helped in this effort: Gabriel Wicke, Filippo 
Giunchedi and Giuseppe Lavagetto, Bryan Davis and Erik Bernhardson (for their 
mad Logstash skillz)!

Note that when expedience is required (we're in the middle of an outage and 
who cares what Logstash has to say), the `--force` flag can be added to skip 
canary checks altogether (e.g. `scap sync-file --force 
wmf-config/InitialiseSettings 'Panic!!'`).

The RelEng team's eventual goal is still to move MediaWiki deployments to the 
more robust and resilient Scap3 deployment framework. There is some 
high-priority work that has to happen before the Scap3 move. In the interim, we 
are taking steps (like this one) to respond to incidents and keep deployments 

Hopefully, this work and the error-rate alert work from Ori last week[6] will 
allow everyone to be more conscientious and more keenly aware of deployments 
that cause large aberrations in the rate of errors.

Your Friendly Neighborhood Release Engineering Team

is the most recent example I could find, but there have been others.

Wikitech-l mailing list