Below I have finished out the thoughts I was writing before the
early send.


On Oct 27, 2015 7:29 AM, "Risker" <[email protected]> wrote:
>
> On 27 October 2015 at 09:57, Brad Jorsch (Anomie) <[email protected]>
> wrote:
>
> > On Tue, Oct 27, 2015 at 8:02 AM, Risker <[email protected]> wrote:
> >
> > > The incident report does not go far enough back into the history of
> > > the incident.  It does not explain how this code managed to get into
> > > the deployment chain with a fatal error in it.
> >
> >
> > Actually, it does. Erik writes "This occured because the patch for the
> > CirrusSearch repository that removed the schema should have been
> > deployed before the change that adds it to the WikimediaEvents
> > repository."
> >
> > In other words, there was nothing wrong with the code itself. The
> > problem was that the multiple pieces of the change needed to be done in
> > a particular order during the manual backporting process, but they were
> > not done in that order.
> >
> > If this had waited for the train deployment, both pieces would have been
> > done simultaneously and it wouldn't have been an issue, just as it
> > wasn't an issue when these changes were done in master and automatically
> > deployed to Beta Labs.
> >
> >
> That's a start, Brad.  But even as someone who has limited experience with
>  software deployment, I can think of at least half a dozen questions that
> I'd be asking here:
>
>    - Why wasn't it part of the deployment train


This was a fix for something that broke during the previous deployment
train. Specifically, a hook was changed in core, and the breakage was not
noticed in the extension until events from JavaScript stopped arriving in
our logging tables.


>    - As a higher level question, what are the thresholds for using a SWAT
>    deployment as opposed to the regular deployment train, are these
>    standards being followed, and are they the right standards. (Even I
>    notice that most of the big problems seem to come with deployments
>    outside of the deployment train.)
>    - How was the code reviewed and tested before deployment


The code was reviewed and tested as normal, and that process worked as I
would expect. What was missing was clear documentation of the order in
which the patches needed to go out. As an aside, the way this is solved in
other organizations, such as Google, is to have a single repository contain
all the code. A monorepo has various other problems associated with it, but
it provides much stronger guarantees against patches being applied in the
wrong order. Because of the nature of our project that kind of solution is
a non-starter, but perhaps something between where we are now and a
monorepo would make sense.
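To make the ordering problem concrete, here is a toy sketch. The patch
names and the idea of declaring dependencies between backports are invented
for illustration; no such tooling existed here. If each backport declared
what must land before it, a deploy tool could compute (or enforce) the
correct order automatically:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency declarations between backports. The patch
# names below are shorthand for this incident's two changes, not real
# change titles: the WikimediaEvents change depends on the CirrusSearch
# schema removal landing first.
deps = {
    "WikimediaEvents: add schema use": {"CirrusSearch: remove schema"},
    "CirrusSearch: remove schema": set(),
}

# static_order() emits dependencies before the patches that need them.
order = list(TopologicalSorter(deps).static_order())
print(order)  # the CirrusSearch removal comes first
```

With the order computed up front, deploying the two pieces in the wrong
sequence becomes a tooling error rather than something a human has to
remember during a manual backport.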

>    - Why did it appear to work in some contexts (indicated in your
>    response as master and Beta Labs) but not in the production context


Because, as stated in the report and by Brad, the code itself works. The
code was redeployed after the outage with no errors, because the second
time it was deployed in the correct order. This is why code review didn't
catch the fatal error and why the error didn't show up in Beta Labs: this
was primarily a deployment-process issue.

>    - How are we ensuring that deployments that require multiple sequential
>    steps are (a) identified and (b) implemented in a way that those steps
>    are followed in the correct order
>
>
> Notice how none of the questions are "what was wrong with the code" or
> "who screwed up".  They're all systems questions. This is a systems
> problem.
> Even in situations where there *is* a problem with the code or someone
> *did* screw up, the root cause usually comes back to having single points
> of failure (e.g. one person having the ability to [unintentionally] get
> problem code deployed, or weaknesses in the code review and testing
> process).
>
> Risker/Anne
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

At a higher level, this was a 9-minute outage instead of a 2- or 3-minute
outage because of two mistakes I made while doing the revert. Both of these
are in the incident report.

First, the monitor I was watching on our log server, which tells me when a
rollback is needed, did not report this error, adding a minute or two
before the rollback started. Had these errors been included in the
`fatalmonitor` output, the revert would have started the same minute the
code went out. We also have other monitors, added in the past year, that I
should have been watching.
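To illustrate the failure mode (this is a toy sketch, not how
`fatalmonitor` is actually implemented; the match pattern and log lines
below are made up), a monitor that filters logs by pattern silently misses
any error whose text doesn't match:

```python
import re

# Invented pattern standing in for whatever a fatal-log monitor greps
# for; the real tool's filters are not reproduced here.
FATAL_PATTERN = re.compile(r"PHP Fatal error|Fatal exception")

def count_fatals(log_lines):
    """Count lines matching the fatal pattern. Any error whose text
    does not match is silently missed -- the gap in this incident."""
    return sum(1 for line in log_lines if FATAL_PATTERN.search(line))

log = [
    "14:02 PHP Fatal error: Class 'Foo' not found",   # counted
    "14:02 INFO request served",                       # not an error
    "14:03 Unhandled schema mismatch in deployment",   # an error, missed
]
print(count_fatals(log))  # prints 1, though two errors occurred
```

The fix in the actionables amounts to making sure the error class from this
incident falls inside the pattern the monitor actually watches.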

Second, I reverted multiple patches from within Gerrit (our code review
tool), which takes too long when the site is down. I can only point to
inexperience here. Others who have previously taken our sites down informed
me that the proper approach is to revert directly on the deployment server
and follow up with changes in Gerrit after the fire has been put out. I've
been deploying patches at WMF for a couple of years and have always
reverted through Gerrit in the past, but those reverts didn't need an
extra-speedy recovery because the site was not down. In every prior case in
my experience, the problem deployment was only logging errors or breaking
some specific piece of functionality.

Going up another level brings us to our deployment tooling specifically.
RelEng is working on a project called scap3, which brings our deployment
process closer to what you would expect from a top-10 website. It includes
canary deployments (e.g., to 1% of servers) along with a single command
that undoes the entire deployment. Canary deployments let us see an error
before it is deployed everywhere, and a one-command rollback would likely
have brought the site back 3 to 4 minutes faster than the way I reverted
the patches.
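As a rough sketch of the canary idea (this is not scap3's actual design or
API; the server names, threshold, and helper functions below are all
invented for illustration):

```python
def deploy(hosts):
    # Placeholder for the real sync step; scap3's internals differ.
    pass

def rollback(hosts):
    # Placeholder for the single-command revert described above.
    pass

def canary_deploy(servers, error_rate, canary_fraction=0.01,
                  threshold=0.05):
    """Deploy to ~1% of servers first; if their error rate is elevated,
    roll back and stop instead of breaking the whole fleet."""
    n = max(1, int(len(servers) * canary_fraction))
    canary, rest = servers[:n], servers[n:]
    deploy(canary)
    if error_rate(canary) > threshold:
        rollback(canary)
        return "rolled back"
    deploy(rest)
    return "deployed everywhere"

# A failing canary stops the rollout after only a handful of servers:
servers = [f"mw{i}" for i in range(300)]
print(canary_deploy(servers, error_rate=lambda hosts: 0.5))
```

The point of the sketch is the shape of the control flow: the fatal in this
incident would have been caught while it affected a few machines, and the
rollback would have been one operation instead of several Gerrit reverts.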

I did not list the scap3 work as an actionable because, in my mind, it is
not a single actionable thing. Scap3 is a major overhaul of our deploy
process, and it is already a priority for RelEng.
