Note to deployers: when syncing certain config changes (e.g. adding a new variable) that touch both InitialiseSettings and CommonSettings, you will now need to use sync-dir wmf-config, because individual sync-files will likely fail if the intermediate state throws notices/errors.
(It was a good idea to do this before, but it'll be more strongly enforced now.)

On Jul 25, 2016 12:35, "Tyler Cipriani" wrote:

> tl;dr: Scap will deploy to canary servers and check for error-log spikes
> in the next version (to be released Soon™).
>
> In light of recent incidents[0] that caused outages accompanied by
> large, easily detectable error-rate spikes, a patch has recently landed in
> Scap[1] that will:
>
> 1. Push changes to a set of canary servers[2] before syncing to proxy
>    servers
> 2. Wait a configurable length of time (currently 20 seconds[3]) for any
>    errors to make themselves known
> 3. Query Logstash (using a script written by Gabriel Wicke[4]) to
>    determine whether the error rate has increased over a configurable
>    threshold (currently 10-fold[5])
>
> Big thanks to the folks that helped in this effort: Gabriel Wicke, Filippo
> Giunchedi and Giuseppe Lavagetto, Bryan Davis and Erik Bernhardson (for
> their mad Logstash skillz)!
>
> Note that in instances where expedience is required (we're in the middle
> of an outage and who cares what Logstash has to say), the `--force` flag
> can be added to skip canary checks altogether, i.e.
> `scap sync-file --force wmf-config/InitialiseSettings 'Panic!!'`.
>
> The RelEng team's eventual goal is still to move MediaWiki deployments to
> the more robust and resilient Scap3 deployment framework. There is some
> high-priority work that has to happen before the Scap3 move. In the
> interim, we are taking steps (like this one) to respond to incidents and
> keep deployments safe.
>
> Hopefully, this work and the error-rate alert work from Ori last week[6]
> will allow everyone to be more conscientious and more keenly aware of
> deployments that cause large aberrations in the rate of errors.
>
> <3,
> Your Friendly Neighborhood Release Engineering Team
>
> [0]. https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWiki
>      is the most recent example I could find, but there have been others.
> [1]. https://phabricator.wikimedia.org/D248
> [2]. https://gerrit.wikimedia.org/r/#/c/294742/
> [3]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L19
> [4]. https://gerrit.wikimedia.org/r/#/c/292505/
> [5]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L18
> [6]. https://gerrit.wikimedia.org/r/#/c/300327/
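
For anyone who wants a concrete picture of the canary check described in the quoted announcement, here is a rough Python sketch of the three steps (push to canaries, wait, compare error rates against a threshold). This is not Scap's actual code: the function names, the callables passed in, and the handling of a zero baseline are all assumptions made for illustration. Only the defaults (20 seconds, 10-fold) come from the announcement, and the behaviour of `force` here (skipping everything, including the canary push) is a guess at what "skip canary checks" means.

```python
import time
from typing import Callable, Tuple


def exceeds_threshold(before: float, after: float, threshold: float = 10.0) -> bool:
    """Return True if the error rate grew by more than `threshold`-fold."""
    if before <= 0:
        # Assumption: with no baseline errors, any errors at all count as a spike.
        return after > 0
    return after / before > threshold


def canary_check(
    push_to_canaries: Callable[[], None],                       # hypothetical hook
    fetch_error_rates: Callable[[float], Tuple[float, float]],  # hypothetical hook
    wait_seconds: float = 20.0,
    threshold: float = 10.0,
    force: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if it looks safe to continue syncing to the remaining servers."""
    if force:
        # Assumption: forcing bypasses the canary stage entirely.
        return True
    # Step 1: deploy to the canary servers only.
    push_to_canaries()
    # Step 2: give errors time to show up in the logs.
    sleep(wait_seconds)
    # Step 3: compare the error rate before and after the push.
    before, after = fetch_error_rates(wait_seconds)
    return not exceeds_threshold(before, after, threshold)
```

In this sketch `fetch_error_rates` stands in for the Logstash query; a caller would abort the sync when `canary_check(...)` returns False.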
