Re: [Wikitech-l] [Wmfall] Datacenter Switchover recap

2018-09-14 Thread Mukunda Modell
This is great!

Thank you to everyone involved, for the really important work that you are
all doing, and thanks to Alexandros, Timo & Giuseppe for sharing the
highlights. It's great to know that so many pieces can come together in
just 8 minutes. This really is an impressive (and important!)
accomplishment. You've set the bar so high that it'll be a real challenge*
to do it any better next year!

* A challenge which I have no doubt will lead to many more improvements to
the infrastructure between now and the next DC-switchover.

On Fri, Sep 14, 2018 at 2:18 AM Giuseppe Lavagetto 
wrote:

> Sorry for the copy/paste fail, I meant
>
>
>
>> So I want to congratulate everyone who was involved in the process, that
>> includes most of the people on the core platform, performance, search and
>> SRE teams, but a special personal thanks goes to
>> Alexandros and Riccardo for driving most of the process and allowing me
>> to care about the switchover for less than a week before it happened and,
>> yes, to take the time to fix that bug too :)
>>
>>
> Cheers,
>
> Giuseppe
> --
> Giuseppe Lavagetto
> Principal Site Reliability Engineer, Wikimedia Foundation
> ___
> Wmfall mailing list
> wmf...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wmfall
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Wmfall] Datacenter Switchover recap

2018-09-14 Thread Giuseppe Lavagetto
Sorry for the copy/paste fail, I meant



> So I want to congratulate everyone who was involved in the process, that
> includes most of the people on the core platform, performance, search and
> SRE teams, but a special personal thanks goes to
> Alexandros and Riccardo for driving most of the process and allowing me to
> care about the switchover for less than a week before it happened and, yes,
> to take the time to fix that bug too :)
>
>
Cheers,

Giuseppe
-- 
Giuseppe Lavagetto
Principal Site Reliability Engineer, Wikimedia Foundation

Re: [Wikitech-l] [Wmfall] Datacenter Switchover recap

2018-09-14 Thread Giuseppe Lavagetto
On Thu, Sep 13, 2018 at 7:49 AM Bryan Davis  wrote:

>
> Everyone involved worked hard to make this happen, but I'd like to
> give a special shout out to Giuseppe Lavagetto for taking the time to
> follow up on a VisualEditor problem that affected Wikitech.
> We noticed during the
> April 2017 switchover that the client side code for VE was failing to
> communicate with the backend component while the wikis were being
> served from the Dallas datacenter. We guessed that this was a
> configuration error of some sort, but did not take the time to debug
> in depth. When the issue reoccurred during the current datacenter
> switch, Giuseppe took a deep dive into the code and configuration,
> identified the configuration difference that triggered the problem,
> and made a patch for the Parsoid backend that fixes Wikitech.
>
>
While I'm flattered by the compliments, I think it's fair to underline that
the problem was partly caused by a patch I made to Parsoid some time ago. So I
mostly cleaned up a problem I caused - does this count for getting a new
t-shirt, even if I fixed it with more than one year of delay? :P

On the other hand, I want to join the choir praising the work that has been
done for the switchover, and take the time to list all the things we've
done collectively to make it as uneventful and fast (read-only time was
less than 8 minutes this time) as it was:
- MediaWiki now fetches its read-only state and which datacenter is the
master from etcd, eliminating the need for a code deployment
- We now connect to our per-datacenter distributed cache via mcrouter,
which allows us to keep the caches in various datacenters consistent. This
eliminated the need to wipe the cache during the read-only phase, thus
resulting in a big reduction in the time we spent in read-only mode
- Our old jobqueue not only gave me innumerable debugging nightmares, but
was hard and tricky to handle in a multi-datacenter environment. We have
substituted it with a more modern system which needed no intervention
during the switchover
- Our media storage system (Swift + thumbor) is now active-active and we
write and read from both datacenters
- We created a framework to easily automate complex orchestration tasks
(like a switchover) called "spicerack", which will benefit our operations
in general and has the potential to reduce the toil on the SRE team, while
proven, automated procedures can be coded for most events.
- Last but not least, the Dallas datacenter (codenamed "codfw") needed
little to no tuning when we moved all traffic, and we had to fix virtually
nothing that went out of sync during the last 1.4 years. I know this might
sound unimpressive, but keeping a datacenter that's not really used in good
shape and in sync is a huge accomplishment in itself; I've never before
seen such a show of flawless execution and collective discipline.

So I want to congratulate everyone who was involved in the process, that
includes most of the people on the core platform, performance, search and
SRE teams, but a special personal thanks goes to
- The whole SRE team, and really anyone working on our production
environment, for keeping the Dallas datacenter in good shape for more than
a year, so that we didn't need to adjust almost anything pre or
post-switchover Alexandros and Riccardo for driving most of the process and
allowing me to care about the switchover for less than a week before it
happened and, yes, to take the time to fix that bug too :)

Cheers,

Giuseppe
P.S. I'm sure I forgot someone / something amazing we've done; I apologize
in advance.
-- 
Giuseppe Lavagetto
Principal Site Reliability Engineer, Wikimedia Foundation

Re: [Wikitech-l] [Wmfall] Datacenter Switchover recap

2018-09-13 Thread Victoria Coleman
Thank you Bryan and thank you Giuseppe. It is terrific to hear of such good 
work and even better to have it celebrated! Proud of you both!

Victoria 

> On Sep 13, 2018, at 1:49 AM, Bryan Davis  wrote:
> 
> On Wed, Sep 12, 2018 at 11:16 AM, Alexandros Kosiaris
>  wrote:
>> Hello all,
>> 
>> Today we've successfully migrated our wikis (MediaWiki and associated
>> services)
>> from our primary data center (eqiad) to our secondary (codfw), an exercise
>> we've done for the 3rd year in a row. During the most critical part of the
>> switch today, the wikis were in read-only mode for a duration of 7 and a
>> half minutes - a significant improvement from last year.
> 
> Everyone involved worked hard to make this happen, but I'd like to
> give a special shout out to Giuseppe Lavagetto for taking the time to
> follow up on a VisualEditor problem that affected Wikitech.
> We noticed during the
> April 2017 switchover that the client side code for VE was failing to
> communicate with the backend component while the wikis were being
> served from the Dallas datacenter. We guessed that this was a
> configuration error of some sort, but did not take the time to debug
> in depth. When the issue reoccurred during the current datacenter
> switch, Giuseppe took a deep dive into the code and configuration,
> identified the configuration difference that triggered the problem,
> and made a patch for the Parsoid backend that fixes Wikitech.
> 
> Wikitech is a low volume wiki for both edits and reads, and for
> various historical and technical reasons is different from all other
> wikis that we host. Keeping it available for reading is important to
> our technical teams because it hosts many of the troubleshooting
> playbooks that we use to diagnose and correct operational problems on
> the rest of the wikis. Taking the time to work on an editing bug that
> only impacted edits done using VisualEditor is awesome, but not the
> sort of thing I would normally expect to be worked on promptly. For
> me, Giuseppe's work on this bug is a sign that he cares about the
> small details, and also that the rest of the switchover went well,
> giving him the time to investigate lower impact edge cases like this.
> 
> 
> Bryan
> -- 
> Bryan Davis              Wikimedia Foundation
> [[m:User:BDavis_(WMF)]]  Manager, Technical Engagement    Boise, ID USA
> irc: bd808                                        v:415.839.6885 x6855
> 



Re: [Wikitech-l] [Wmfall] Datacenter Switchover recap

2018-09-13 Thread Gilles Dubuc
Congratulations, awesome work!

On Thu, Sep 13, 2018 at 7:49 AM Bryan Davis  wrote:

> On Wed, Sep 12, 2018 at 11:16 AM, Alexandros Kosiaris
>  wrote:
> > Hello all,
> >
> > Today we've successfully migrated our wikis (MediaWiki and associated
> > services)
> > from our primary data center (eqiad) to our secondary (codfw), an exercise
> > we've done for the 3rd year in a row. During the most critical part of the
> > switch today, the wikis were in read-only mode for a duration of 7 and a
> > half minutes - a significant improvement from last year.
>
> Everyone involved worked hard to make this happen, but I'd like to
> give a special shout out to Giuseppe Lavagetto for taking the time to
> follow up on a VisualEditor problem that affected Wikitech.
> We noticed during the
> April 2017 switchover that the client side code for VE was failing to
> communicate with the backend component while the wikis were being
> served from the Dallas datacenter. We guessed that this was a
> configuration error of some sort, but did not take the time to debug
> in depth. When the issue reoccurred during the current datacenter
> switch, Giuseppe took a deep dive into the code and configuration,
> identified the configuration difference that triggered the problem,
> and made a patch for the Parsoid backend that fixes Wikitech.
>
> Wikitech is a low volume wiki for both edits and reads, and for
> various historical and technical reasons is different from all other
> wikis that we host. Keeping it available for reading is important to
> our technical teams because it hosts many of the troubleshooting
> playbooks that we use to diagnose and correct operational problems on
> the rest of the wikis. Taking the time to work on an editing bug that
> only impacted edits done using VisualEditor is awesome, but not the
> sort of thing I would normally expect to be worked on promptly. For
> me, Giuseppe's work on this bug is a sign that he cares about the
> small details, and also that the rest of the switchover went well,
> giving him the time to investigate lower impact edge cases like this.
>
>
> Bryan
> --
> Bryan Davis              Wikimedia Foundation
> [[m:User:BDavis_(WMF)]]  Manager, Technical Engagement    Boise, ID USA
> irc: bd808                                        v:415.839.6885 x6855
>

Re: [Wikitech-l] [Wmfall] Datacenter Switchover recap

2018-09-12 Thread Bryan Davis
On Wed, Sep 12, 2018 at 11:16 AM, Alexandros Kosiaris
 wrote:
> Hello all,
>
> Today we've successfully migrated our wikis (MediaWiki and associated
> services)
> from our primary data center (eqiad) to our secondary (codfw), an exercise
> we've done for the 3rd year in a row. During the most critical part of the
> switch today, the wikis were in read-only mode for a duration of 7 and a
> half minutes - a significant improvement from last year.

Everyone involved worked hard to make this happen, but I'd like to
give a special shout out to Giuseppe Lavagetto for taking the time to
follow up on a VisualEditor problem that affected Wikitech.
We noticed during the
April 2017 switchover that the client side code for VE was failing to
communicate with the backend component while the wikis were being
served from the Dallas datacenter. We guessed that this was a
configuration error of some sort, but did not take the time to debug
in depth. When the issue reoccurred during the current datacenter
switch, Giuseppe took a deep dive into the code and configuration,
identified the configuration difference that triggered the problem,
and made a patch for the Parsoid backend that fixes Wikitech.

Wikitech is a low volume wiki for both edits and reads, and for
various historical and technical reasons is different from all other
wikis that we host. Keeping it available for reading is important to
our technical teams because it hosts many of the troubleshooting
playbooks that we use to diagnose and correct operational problems on
the rest of the wikis. Taking the time to work on an editing bug that
only impacted edits done using VisualEditor is awesome, but not the
sort of thing I would normally expect to be worked on promptly. For
me, Giuseppe's work on this bug is a sign that he cares about the
small details, and also that the rest of the switchover went well,
giving him the time to investigate lower impact edge cases like this.


Bryan
-- 
Bryan Davis              Wikimedia Foundation
[[m:User:BDavis_(WMF)]]  Manager, Technical Engagement    Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855


Re: [Wikitech-l] [Wmfall] Datacenter Switchover recap

2018-09-12 Thread Victoria Coleman
+1

Great, steady progress!

Best,

Victoria

> On Sep 12, 2018, at 3:54 PM, Toby Negrin  wrote:
> 
> Seriously -- this is some complicated, difficult stuff that one day may be 
> critical to keeping our projects available to everyone (but let's hope not)
> 
> Well done indeed!
> 
> -Toby
> 
> On Wed, Sep 12, 2018 at 12:53 PM, Greg Grossmeier wrote:
> 
> > Hello all,
> > 
> > Today we've successfully migrated our wikis (MediaWiki and associated
> > services)
> > from our primary data center (eqiad) to our secondary (codfw)
> 
> Well done, all.
> 
> Greg
> 
> -- 
> | Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
> | Release Team Manager            A18D 1138 8E47 FAC8 1C7D |
> 


Re: [Wikitech-l] [Wmfall] Datacenter Switchover recap

2018-09-12 Thread Toby Negrin
Seriously -- this is some complicated, difficult stuff that one day may be
critical to keeping our projects available to everyone (but let's hope not)

Well done indeed!

-Toby

On Wed, Sep 12, 2018 at 12:53 PM, Greg Grossmeier 
wrote:

> 
> > Hello all,
> >
> > Today we've successfully migrated our wikis (MediaWiki and associated
> > services)
> > from our primary data center (eqiad) to our secondary (codfw)
>
> Well done, all.
>
> Greg
>
> --
> | Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
> | Release Team Manager            A18D 1138 8E47 FAC8 1C7D |
>

Re: [Wikitech-l] [Wmfall] Datacenter Switchover recap

2018-09-12 Thread Greg Grossmeier

> Hello all,
> 
> Today we've successfully migrated our wikis (MediaWiki and associated
> services)
> from our primary data center (eqiad) to our secondary (codfw)

Well done, all.

Greg

-- 
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| Release Team Manager            A18D 1138 8E47 FAC8 1C7D |
