Re: [Wikitech-l] Datacenter Switchback recap

2018-10-11 Thread Alexandros Kosiaris
A minor correction:

> During the most critical part of the switch
> today, the wikis were in read-only mode for a duration of 4 minutes
> and 41 seconds.

This was yesterday, not today.

-- 
Alexandros Kosiaris 

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Datacenter Switchback recap

2018-10-11 Thread Alexandros Kosiaris
Hello everyone,

Today we've concluded the successful migration of our wikis (MediaWiki
and associated services) from our secondary datacenter (codfw) back to
the primary one (eqiad). During the most critical part of the switch
today, the wikis were in read-only mode for a duration of 4 minutes
and 41 seconds. That's a significant improvement over the 7 minutes and
34 seconds we achieved during the inverse process we concluded a month
ago, which was already significantly better than last year. I'd like
to believe that it's the result of the growing experience we are
building and the trust we are placing in the process and tools that
we have developed for this.

Although the switchback process itself has been largely automated and
went pretty smoothly, there have been some issues that we experienced:

- CentralNotice banners stayed online longer than necessary due to a
miscommunication. This has now been documented and will be avoided in
the future.

- After the switchback we've experienced increased load on all our
MediaWiki application servers. The root cause has been identified and
a mitigation will be put in place. In short, parsercache replication
between the two datacenters was not working.

- Last but not least, and probably the most important of all the
issues, a data inconsistency was detected in Wikidata (s8): some
articles that were present in codfw had not been replicated to eqiad.
We are still investigating the root cause of this while applying
corrective actions to mitigate the user impact as quickly as possible.

All wikis are now served from our primary datacenter again.

Should you experience any issue that you believe is related to the
switchover process, please feel free to file a ticket in Phabricator
and tag it with the Datacenter-Switchover-2018 project tag[1]. We will
monitor this tag closely and keep any and all issues updated.

We'd like to thank everyone for their hard work in ensuring any
(potential) issues were resolved promptly, for automating the process
whenever and wherever possible, and for making this datacenter
switchover and switchback a success!

-- 
Alexandros Kosiaris 
