If the intermediate state throws notices/errors, wouldn't it be a better
idea to sync-file in the correct order to prevent such notices/errors?

On 25 July 2016 at 21:54, Roan Kattouw <[email protected]> wrote:

> Note to deployers: when syncing certain config changes (e.g. adding a new
> variable) that touch both InitialiseSettings and CommonSettings, you will
> now need to use sync-dir wmf-config, because individual sync-files will
> likely fail if the intermediate state throws notices/errors.
>
> (It was a good idea to do this before, but it'll be more strongly enforced
> now.)
>
> On Jul 25, 2016 12:35, "Tyler Cipriani" <[email protected]> wrote:
>
>> tl;dr: Scap will deploy to canary servers and check for error-log spikes
>> in the next version (to be released Soon™).
>>
>> In light of recent incidents[0] which have created outages accompanied by
>> large, easily detectable, error-rate spikes, a patch has recently landed in
>> Scap[1] that will:
>>
>>    1. Push changes to a set of canary servers[2] before syncing to proxy
>> servers
>>    2. Wait a configurable length of time (currently 20 seconds[3]) for
>> any errors to have time to make themselves known
>>    3. Query Logstash (using a script written by Gabriel Wicke[4]) to
>> determine if the error rate has increased over a configurable threshold
>> (currently 10-fold[5])
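
The three steps above can be sketched roughly as follows. This is a minimal illustration, not Scap's actual code: `canary_check`, `error_rate_increased`, `push_to_canaries` and `fetch_error_rate` are hypothetical names, and comparing against a baseline sampled just before the push is an assumption (the announcement does not say how the increase is measured). Only the 20-second wait and the 10-fold threshold come from the message.

```python
import time
from typing import Callable


def error_rate_increased(baseline: float, current: float,
                         threshold: float = 10.0) -> bool:
    """Return True if `current` is more than `threshold` times `baseline`."""
    if baseline <= 0:
        # Assumption: with no baseline errors, any error counts as a spike.
        return current > 0
    return current / baseline > threshold


def canary_check(push_to_canaries: Callable[[], None],
                 fetch_error_rate: Callable[[], float],
                 wait_seconds: float = 20.0,
                 threshold: float = 10.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if it looks safe to continue syncing past the canaries."""
    baseline = fetch_error_rate()   # hypothetical: error rate before the push
    push_to_canaries()              # step 1: sync to the canary servers only
    sleep(wait_seconds)             # step 2: give errors time to show up
    current = fetch_error_rate()    # step 3: query the log store again
    return not error_rate_increased(baseline, current, threshold)
```

A caller would abort the sync when `canary_check(...)` returns `False`, unless the deployer has chosen to skip the check.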
>>
>> Big thanks to the folks that helped in this effort: Gabriel Wicke,
>> Filippo Giunchedi and Giuseppe Lavagetto, Bryan Davis and Erik Bernhardson
>> (for their mad Logstash skillz)!
>>
>> Note that when expedience is required (we're in the middle of an outage
>> and who cares what Logstash has to say), the `--force` flag can be added
>> to skip canary checks altogether (e.g. `scap
>> sync-file --force wmf-config/InitialiseSettings 'Panic!!'`).
>>
>> The RelEng team's eventual goal is still to move MediaWiki deployments to
>> the more robust and resilient Scap3 deployment framework. There is some
>> high-priority work that has to happen before the Scap3 move. In the
>> interim, we are taking steps (like this one) to respond to incidents and
>> keep deployments safe.
>>
>> Hopefully, this work and the error-rate alert work from Ori last week[6]
>> will allow everyone to be more conscientious and more keenly aware of
>> deployments that cause large aberrations in the rate of errors.
>>
>> <3,
>> Your Friendly Neighborhood Release Engineering Team
>>
>> [0].
>> https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWiki
>> is the most recent example I could find, but there have been others.
>> [1]. https://phabricator.wikimedia.org/D248
>> [2]. https://gerrit.wikimedia.org/r/#/c/294742/
>> [3]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L19
>> [4]. https://gerrit.wikimedia.org/r/#/c/292505/
>> [5]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L18
>> [6]. https://gerrit.wikimedia.org/r/#/c/300327/
>>
>> _______________________________________________
>> Ops mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/ops
>>
>
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l