On Thu, Sep 13, 2018 at 7:49 AM Bryan Davis <bd...@wikimedia.org> wrote:

>
> Everyone involved worked hard to make this happen, but I'd like to
> give a special shout out to Giuseppe Lavagetto for taking the time to
> follow up on a VisualEditor problem that affected Wikitech
> (<https://phabricator.wikimedia.org/T163438>). We noticed during the
> April 2017 switchover that the client side code for VE was failing to
> communicate with the backend component while the wikis were being
> served from the Dallas datacenter. We guessed that this was a
> configuration error of some sort, but did not take the time to debug
> in depth. When the issue reoccurred during the current datacenter
> switch, Giuseppe took a deep dive into the code and configuration,
> identified the configuration difference that triggered the problem,
> and made a patch for the Parsoid backend that fixes Wikitech.
>
>
While I'm flattered by the compliments, I think it's fair to underline that the
problem was partly caused by a patch I made to Parsoid some time ago. So I
mostly cleaned up a problem I caused myself - does this count towards getting a
new t-shirt, even if the fix arrived with more than a year of delay? :P

On the other hand, I want to join the choir praising the work that has been
done for the switchover, and take the time to list the things we've done
collectively to make it as uneventful and fast as it was (read-only time was
less than 8 minutes this time):
- MediaWiki now fetches its read-only state, and which datacenter is the
master, from etcd, eliminating the need for a code deployment (a small sketch
of this kind of lookup follows after this list)
- We now connect to our per-datacenter distributed cache via mcrouter, which
allows us to keep the caches in the various datacenters consistent. This
eliminated the need to wipe the cache during the read-only phase, resulting in
a big reduction of the time we spent in read-only (see the second sketch after
this list)
- Our old jobqueue not only gave me innumerable debugging nightmares, but was
hard and tricky to handle in a multi-datacenter environment. We have replaced
it with a more modern system which needed no intervention during the
switchover
- Our media storage system (Swift + Thumbor) is now active-active, and we read
and write from both datacenters
- We created a framework for easily automating complex orchestration tasks
(like a switchover) called "spicerack", which will benefit our operations
in general and has the potential to reduce toil on the SRE team, as proven,
automated procedures can be coded for most events (a minimal example cookbook
follows after this list)
- Last but not least, the Dallas datacenter (codenamed "codfw") needed
little to no tuning when we moved all traffic to it, and we had to fix
virtually nothing that had gone out of sync during the last 1.4 years. I know
this might sound unimpressive, but keeping a datacenter that's not really used
in good shape and in sync is a huge accomplishment in itself; I've never
before seen such a show of flawless execution and collective discipline.
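
To give an idea of what the etcd point means in practice: MediaWiki
periodically looks up a small piece of state that says which datacenter is the
master and whether we are read-only. The following is only a rough Python
sketch against etcd's v2 HTTP API; the endpoint, key and value format are made
up for illustration (the real ones are managed via conftool), not what we
actually run:

    import json
    import requests

    # Hypothetical etcd endpoint and key, purely illustrative; the real
    # keys and their schema are managed by conftool.
    ETCD_URL = "https://etcd.example.org:2379/v2/keys/mediawiki/dc-state"

    def fetch_dc_state():
        """Read the current master datacenter and read-only flag from etcd."""
        resp = requests.get(ETCD_URL, timeout=2)
        resp.raise_for_status()
        # etcd v2 returns the stored value as a string inside "node"
        state = json.loads(resp.json()["node"]["value"])
        return state["master_dc"], state["read_only"]

    master_dc, read_only = fetch_dc_state()
    print("master:", master_dc, "read-only:", read_only)

The point is that flipping the master datacenter becomes a data change in etcd
rather than a code deployment.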
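
For the mcrouter point, the nice property is that the application keeps
speaking the plain memcached protocol to a local mcrouter proxy, and mcrouter
takes care of routing sets and replicating deletes to the other datacenter's
pool according to its own configuration. A tiny illustration, where the port
and keys are made up and I'm using the pymemcache library just for the
example:

    from pymemcache.client.base import Client

    # The application only talks to the local mcrouter proxy, exactly as if
    # it were a plain memcached server; the port here is illustrative.
    cache = Client(("localhost", 11213))

    cache.set("example:revision:12345", b"cached blob", expire=3600)
    print(cache.get("example:revision:12345"))

    # Deletes can be replicated by mcrouter to the remote datacenter's
    # pool, which is what keeps the caches consistent without having to
    # wipe them during the switchover.
    cache.delete("example:revision:12345")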
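
As for spicerack, orchestration steps are written as small Python "cookbooks"
that the framework runs, handing them ready-made accessors to the production
environment. This is only a simplified sketch from memory of what a cookbook
can look like; the host query and command are invented for the example:

    """Example cookbook: run a trivial command on some hosts in one DC."""
    import argparse


    def argument_parser():
        """Arguments the cookbook accepts on the command line."""
        parser = argparse.ArgumentParser(description=__doc__)
        parser.add_argument("--dc", choices=("eqiad", "codfw"), required=True)
        return parser


    def run(args, spicerack):
        """Entry point called by the cookbook runner."""
        # Query a (made-up) set of hosts and run a command on all of them,
        # via the remote-execution accessor provided by spicerack.
        hosts = spicerack.remote().query("A:example-role and A:" + args.dc)
        hosts.run_sync("echo switching to " + args.dc)
        return 0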

So I want to congratulate everyone who was involved in the process, which
includes most of the people on the core platform, performance, search and
SRE teams, but a special personal thanks goes to:
- The whole SRE team, and really anyone working on our production
environment, for keeping the Dallas datacenter in good shape for more than
a year, so that we barely needed to adjust anything pre- or post-switchover
- Alexandros and Riccardo for driving most of the process, allowing me to
only have to care about the switchover for less than a week before it
happened and, yes, to take the time to fix that bug too :)

Cheers,

Giuseppe
P.S. I'm sure I forgot someone / something amazing we've done; I apologize
in advance.
-- 
Giuseppe Lavagetto
Principal Site Reliability Engineer, Wikimedia Foundation