Hi all,

Thanks for the discussion. As you can imagine, it is high on my list to
figure out why we've had two outages in the past couple of weeks caused
by config changes like this (more accurately, what we can do to prevent
them).
I think, after reading Brad's, Oliver's, and Erik's (partial, early
release due to train) responses, most of Risker's questions are
answered. I'll just give a bit more from my perspective.

<quote name="Brad Jorsch (Anomie)" date="2015-10-27" time="10:56:28 -0400">
> On Tue, Oct 27, 2015 at 10:29 AM, Risker <[email protected]> wrote:
>
> >    - Why wasn't it part of the deployment train
> >
> Good question, and one that needs someone involved in this backport to
> answer.

Erik and Oliver answered this.

> >    - As a higher level question, what are the thresholds for using a
> >    SWAT deployment as opposed to the regular deployment train, are
> >    these standards being followed, and are they the right standards?
> >    (Even I notice that most of the big problems seem to come with
> >    deployments outside of the deployment train.)
> >
> My understanding is that SWAT is supposed to be for WMF configuration
> changes (i.e. the operations/mediawiki-config repo, which this wasn't)
> and for urgent bug fixes that can't wait for the weekly train. But my
> understanding might be too strict, so I'd recommend waiting for a more
> official answer than mine.

Also answered by Erik, and by the guidelines at
https://wikitech.wikimedia.org/wiki/SWAT_deploys#Guidelines

> >    - How was the code reviewed and tested before deployment
> >
> First, it was reviewed before being merged into master. Then the SWAT
> deployer is supposed to review the backport for potential issues,
> although they may lack the domain-specific knowledge that the original
> reviewers have to spot issues like the one here.
All code that is pushed out via a SWAT window has these review points:

* Pre-commit review and testing done in Gerrit/Jenkins
* Post-commit testing on the Beta Cluster (which is automatically
  updated from master every 10 minutes)
* The backport (what is deployed in the SWAT) is again reviewed/tested
  in Gerrit/Jenkins (this will catch stupid errors at this point)
* The SWAT deployer does their own review of the patch before
  committing and deploying.

> >    - Why did it appear to work in some contexts (indicated in your
> >    response as master and Beta Labs) but not in the production
> >    context
> >
> You're assuming this code wouldn't have worked in the production
> context if deployed correctly. It's like asking "Why does it work to
> change a lightbulb normally, but it doesn't work if the bulb-changer
> forgets to remove the burned-out bulb before trying to put the new one
> in?"

This has been answered a few times already and is answered in your next
question; the ordering was the issue.

> >    - How are we ensuring that deployments that require multiple
> >    sequential steps are (a) identified and (b) implemented in a way
> >    that those steps are followed in the correct order
> >
> It requires that the people proposing/implementing the change identify
> the prerequisites required. There's currently no automated way to do
> this, and even if some automated mechanism such as "Depends-On" tags
> on the git commits were implemented it would require that people
> correctly use the mechanism and that the mechanism can be
> automatically tracked during backports as well as normal development
> merges.

Really, this is why we have people do this work instead of machines:
people know the (ever-evolving) complex system that makes up the
production Wikimedia "servers" (already a mix of bare metal and
virtualization, tons of interconnected services, etc.) and thus people
are the ones who can make the choices.
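For illustration only, a "Depends-On" style footer of the sort Brad
mentions could look like the commit message below. This is a sketch of
the cross-change dependency convention used by Gerrit/Zuul elsewhere;
the Change-Ids and task number here are invented, and as noted above
such a mechanism only helps if people actually use it and the tooling
can track it through backports as well as normal merges:

```
Purge thumbnails under the new storage layout

This change assumes the new configuration variable is already
deployed; it must not be backported ahead of its prerequisite.

Bug: T12345
Depends-On: I0123456789abcdef0123456789abcdef01234567
Change-Id: I89abcdef0123456789abcdef0123456789abcdef
```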
We don't have the funds and person-hours to have a complete, 1:1 mirror
of production as a test environment. It would literally cost twice as
much for the hardware, plus another non-zero number of people (FTEs).

> There's also the possibility that unit testing could catch such issues
> when the changes are merged to the deployment branches before being
> deployed, and our Release Engineering team has been working on
> increasing the number of extension unit tests run. But that requires
> we have unit tests that cover everything, which we don't, so things
> can still slip through. It also wouldn't handle the case where the
> individual files of the change are individually deployed out of order,
> although at a glance it doesn't seem like that was the issue here.

Unit tests, which I think are related but not the whole solution here,
are important to RelEng, and I hope we (RelEng) can work with the rest
of the engineering staff and community to improve our coverage ASAP.
This isn't a task that "QA" or "RelEng" can do alone; it needs to be
owned by all engineers. See, for example:
https://integration.wikimedia.org/cover/mediawiki-core/master/php/

That's not a great place to be. And that is just MediaWiki core.

> Taking this further to discuss plans, implementation, and mitigation
> of the remaining process issues is a discussion for the Release
> Engineering team, and may already be happening somewhere. Once people
> in SF get into work they might have further comments along these
> lines.

This mostly touches on scap3, which Erik described a little in his
email. Scap3 (it'll just be called 'scap'; don't worry, scap3 is only
the working name of the new feature set we're adding to it) will allow
us to catch this kind of error much faster, and before many people see
it, through the use of canary deploys with automated health checks
(those health checks are configurable).
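To make the canary-deploy idea concrete, here is a minimal sketch in
Python. This is not scap3's actual code; all function names and
signatures are hypothetical. The shape of the idea: deploy to a small
canary subset, run the configured health checks, and either continue
fleet-wide or roll back so that only the canaries ever saw the bad code.

```python
def canary_deploy(hosts, deploy, health_check, rollback, canary_count=1):
    """Deploy to a few canary hosts first; continue only if they stay healthy.

    deploy, health_check, and rollback are callables taking a hostname;
    in a real tool these would push code, hit a service endpoint, and
    restore the previous version, respectively.
    """
    canaries, rest = hosts[:canary_count], hosts[canary_count:]

    # Step 1: push the new code to the canary subset only.
    for host in canaries:
        deploy(host)

    # Step 2: run health checks against the canaries.
    if not all(health_check(host) for host in canaries):
        # Step 3a: a check failed -- undo the canaries and stop.
        # The blast radius is limited to canary_count hosts.
        for host in canaries:
            rollback(host)
        return False

    # Step 3b: canaries look healthy -- roll out to the rest of the fleet.
    for host in rest:
        deploy(host)
    return True
```

The point of the design is bounding the blast radius: a failing health
check stops the deploy after one host instead of after the whole fleet,
which is how this class of error gets caught before most users see it.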
We (RelEng) will soon talk with Erik more directly about why it was a
9-minute outage rather than a 2-minute one, and see what else we are
missing from scap3 or the higher-level process.

Thanks all,

Greg

-- 
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg           A18D 1138 8E47 FAC8 1C7D      |

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
