One thing that impressed me when I started working with WMF is that
reverting in production is as safe as I have seen in any production
environment.  In the 20 months or so I've been here, I remember only one
change that left behind corrupt data in prod. That change was made by a
volunteer; the bug manifested in beta labs, but we failed to recognize
its importance; and then the code change was merged on Thanksgiving Day
by someone not on the team affected by the change. It was one of those
perfect-storm sorts of problems.

We're good at reverting.


On Thu, Oct 31, 2013 at 12:26 PM, Toby Negrin <tneg...@wikimedia.org> wrote:

> How easy is it to roll back production changes? Is this something that can
> be done consistently and easily with our current tools? At other high-traffic
> sites I've worked at, this has been an important component of production
> engineering.
>
> -Toby
>
>
> On Wed, Oct 30, 2013 at 6:12 PM, Greg Grossmeier <g...@wikimedia.org> wrote:
>
>> First: Thanks for responding to this and writing it up.
>>
>> <quote name="Yuri Astrakhan" date="2013-10-31" time="04:53:44 +0400">
>> > == Recommendations ==
>> > * Allow a bit more time between deployments and observe fatalmonitor
>> > before and after
>>
>> Agreed.
>>
>> I put a ton of blame on myself for not slowing down the cadence of LD
>> slots when a bunch of people are trying to get in on the same day.
>>
>> For future LDs I am going to explicitly ask everyone to do what Yuri
>> suggests (monitor fatals after your deploy) before saying that you're
>> done. 5 minutes post-deploy of watching the fatalmonitor isn't
>> unreasonable, I don't think.
>>
>> Relatedly, I think we should reassess the Lightning Deploys :)
>>
>> Not necessarily to get rid of them (probably not), but:
>> 1) how many deploys can go in one LD? How many do we *want* to go?
>>
>> 2) from 1, is the length of the LD long enough/too long?
>>
>> 3) LD management is still pretty high-communication ("Alright, who's in
>> line? Who's up next? Are you done yet?") There are basic tools that can
>> help with this (Etsy has an IRC "pushbot" that manages the queue mostly
>> automatically, for instance); I'll look into those/test them.
>>
>> 4) probably more, aka: your thoughts?
>>
>>
>> Greg
>>
>> PS: graph of the fatals attached, just for completeness' sake.
>>
>> --
>> | Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
>> | identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |
>>
>> _______________________________________________
>> Engineering mailing list
>> engineer...@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/engineering
>>
>>
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
