Congrats Mark and everyone else involved. This is a big step for reliability and performance of the sites and a difficult technical task to say the least.
Well done!

-Toby

On Thu, Apr 21, 2016 at 8:37 AM, Mark Bergsma <[email protected]> wrote:

> We've just completed the switch back, and all services are running from
> our main data center eqiad (Ashburn) again.
>
> The process went very smoothly this time around. In the past two days
> leading up to this, we've been able to either fix or work around the most
> important issues we encountered on Tuesday. This meant that we had no real
> setbacks or unanticipated delays today, and were therefore able to complete
> the most time-pressing and user-impacting part (during which MediaWiki is
> read-only) in 20 minutes, down from ~45 minutes two days ago.
>
> However, we'll be doing this again in the future, and until then we'll
> work on improving and further automating this process to get it down to
> hopefully much lower levels of impact and duration.
>
> Please let us know if you see any issues which may be caused by the
> switch-over(s).
>
> Thanks much to everyone involved!
>
> Mark
>
> On Thu, Apr 21, 2016 at 3:53 PM, Mark Bergsma <[email protected]> wrote:
>
>> Hi everyone,
>>
>> After successfully serving our sites from our backup data-center codfw
>> (Dallas) for the past two days, we're now starting our switch back to
>> eqiad (Ashburn) as planned[1].
>>
>> We've already moved cache traffic back to eqiad, and within the next
>> few minutes, we'll disable editing by going read-only for approximately
>> 30 minutes - hopefully a bit faster than 2 days ago.
>>
>> [1] http://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/
>>
>> On Tue, Apr 19, 2016 at 6:00 PM, Mark Bergsma <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Today the data center switch-over commenced as planned, and has just
>>> fully completed successfully. We are now serving our sites from codfw
>>> (Dallas, Texas) for the next 2 days if all stays well.
>>>
>>> We switched the wikis to read-only (editing disabled) at 14:02 UTC, and
>>> went back to read-write at 14:48 UTC - a little longer than planned.
>>> Although edits were possible from then on, Special:RecentChanges (and
>>> related change feeds) did not work until 15:10 UTC, when we found and
>>> fixed an unexpected configuration problem with our Redis servers. The
>>> site has stayed up and available for readers throughout the entire
>>> migration.
>>>
>>> Overall the procedure was a success with few problems along the way.
>>> However, we've also carefully kept track of any issues and delays we
>>> encountered, so that we can evaluate them to improve and speed up the
>>> procedure and reduce the impact on our users - some of these
>>> improvements will already be implemented for our switch back on Thursday.
>>>
>>> We're still expecting to find (possibly subtle) issues today, and would
>>> like everyone who notices anything to use the following channels to report
>>> them:
>>>
>>> 1. File a Phabricator issue with project #codfw-rollout
>>> 2. Report issues on IRC: Freenode channel #wikimedia-tech (if urgent)
>>> 3. Send an e-mail to the Operations list: [email protected]
>>>
>>> We're not done yet, but thanks to all who have helped so far. :-)
>>>
>>> Mark
>>>
>>
>> --
>> Mark Bergsma <[email protected]>
>> Lead Operations Architect
>> Director of Technical Operations
>> Wikimedia Foundation
>>
>
> _______________________________________________
> Ops mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/ops
>
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
