If the intermediate state throws notices/errors, wouldn't it be a better idea to sync-file in the correct order to prevent such notices/errors?
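
To make the concern concrete, here is a rough Python sketch of the canary check as the quoted announcement describes it: push to canaries, wait, then compare the error rate against a threshold. This is not Scap's actual code. The names `canary_check`, `deploy_to_canaries`, `get_error_rate` and `min_baseline` are hypothetical, and `get_error_rate` merely stands in for the Logstash query. Only the defaults (20 seconds, 10-fold) come from the announcement. How `--force` behaves beyond "skip the canary checks" is also an assumption here.

```python
import time
from typing import Callable


def canary_check(
    deploy_to_canaries: Callable[[], None],
    get_error_rate: Callable[[], float],
    wait_seconds: float = 20.0,   # "currently 20 seconds" in the announcement
    threshold: float = 10.0,      # "currently 10-fold" in the announcement
    force: bool = False,
    sleep: Callable[[float], None] = time.sleep,
    min_baseline: float = 1.0,    # assumption: floor so a zero baseline is usable
) -> bool:
    """Return True if it looks safe to continue syncing beyond the canaries."""
    if force:
        # Assumption: --force bypasses the whole canary step.
        return True

    # Error rate before the change reaches any server.
    baseline = get_error_rate()

    # Step 1: push the change to the canary servers only.
    deploy_to_canaries()

    # Step 2: give errors time to show up in the logs.
    sleep(wait_seconds)

    # Step 3: compare the new error rate against the allowed increase.
    current = get_error_rate()
    return current <= threshold * max(baseline, min_baseline)
```

If a check along these lines runs after every individual sync-file, any intermediate state that emits notices could push the error rate over the threshold and abort the deploy, which is why the order of the syncs (or using sync-dir) matters.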

On 25 July 2016 at 21:54, Roan Kattouw <[email protected]> wrote:

> Note to deployers: when syncing certain config changes (e.g. adding a new
> variable) that touch both InitialiseSettings and CommonSettings, you will
> now need to use sync-dir wmf-config, because individual sync-files will
> likely fail if the intermediate state throws notices/errors.
>
> (It was a good idea to do this before, but it'll be more strongly enforced
> now.)
>
> On Jul 25, 2016 12:35, "Tyler Cipriani" <[email protected]> wrote:
>
>> tl;dr: Scap will deploy to canary servers and check for error-log spikes
>> in the next version (to be released Soon™).
>>
>> In light of recent incidents[0] which have created outages accompanied by
>> large, easily detectable, error-rate spikes, a patch has recently landed in
>> Scap[1] that will:
>>
>>    1. Push changes to a set of canary servers[2] before syncing to proxy
>>    servers
>>    2. Wait a configurable length of time (currently 20 seconds[3]) for
>>    any errors to have time to make themselves known
>>    3. Query Logstash (using a script written by Gabriel Wicke[4]) to
>>    determine if the error rate has increased over a configurable threshold
>>    (currently 10-fold[5])
>>
>> Big thanks to the folks that helped in this effort: Gabriel Wicke,
>> Filippo Giunchedi and Giuseppe Lavagetto, Bryan Davis and Erik Bernhardson
>> (for their mad Logstash skillz)!
>>
>> It is worth noting that in instances where expedience is required (we're in
>> the middle of an outage and who cares what Logstash has to say), the
>> `--force` flag can be added to skip canary checks altogether (i.e. `scap
>> sync-file --force wmf-config/InitialiseSettings 'Panic!!'`).
>>
>> The RelEng team's eventual goal is still to move MediaWiki deployments to
>> the more robust and resilient Scap3 deployment framework. There is some
>> high-priority work that has to happen before the Scap3 move. In the
>> interim, we are taking steps (like this one) to respond to incidents and
>> keep deployments safe.
>>
>> Hopefully, this work and the error-rate alert work from Ori last week[6]
>> will allow everyone to be more conscientious and more keenly aware of
>> deployments that cause large aberrations in the rate of errors.
>>
>> <3,
>> Your Friendly Neighborhood Release Engineering Team
>>
>> [0].
>> https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWiki
>> is the recent example I could find, but there have been others.
>> [1]. https://phabricator.wikimedia.org/D248
>> [2]. https://gerrit.wikimedia.org/r/#/c/294742/
>> [3]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L19
>> [4]. https://gerrit.wikimedia.org/r/#/c/292505/
>> [5]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L18
>> [6]. https://gerrit.wikimedia.org/r/#/c/300327/
>>
>> _______________________________________________
>> Ops mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/ops
>>

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
