How’d we do in striving for operational excellence last month? Read on to 
find out!

Incidents
6 documented incidents last month. That's above the two-year and five-year 
median of 4 per month (per Incident graphs 
<https://codepen.io/Krinkle/full/wbYMZK>).

2021-11-04 large file upload timeouts 
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-04_large_file_upload_timeouts>;
 Impact: For 9 months, editors were unable to upload large files (e.g. to 
Commons). Editors would receive generic error messages, typically after a 
timeout. In retrospect, a dozen distinct production errors had been reported 
and regularly observed that were related and provided different clues. 
However, most of these remained untriaged and uninvestigated for months. This 
may be related to the affected components having no active code steward.

2021-11-05 TOC language converter 
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-05_TOC_language_converter>;
 Impact: For 6 hours, wikis experienced a blank or missing table of contents on 
many pages. For up to 3 days prior, wikis that have multiple language variants 
(such as Chinese Wikipedia) displayed the table of contents in an incorrect or 
inconsistent language variant (which is not understandable to some readers).

2021-11-10 cirrussearch commonsfile outage 
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-10_cirrussearch_commonsfile_outage>;
 Impact: For ~2.5 hours, the Search results page was unavailable on many wikis 
(except English Wikipedia). On Wikimedia Commons the search suggestions feature 
was unresponsive as well.

2021-11-18 codfw ipv6 network 
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-18_codfw_ipv6_network>;
 Impact: For 8 minutes, the Codfw cluster experienced partial loss of IPv6 
connectivity for upload.wikimedia.org. This did not affect availability of the 
service because the "Happy Eyeballs 
<https://en.wikipedia.org/wiki/Happy_Eyeballs>" algorithm ensures browsers (and 
other clients) automatically fall back to IPv4. The Codfw cluster generally 
serves Mexico and parts of the US and Canada. The upload.wikimedia.org service 
serves photos and other media/document files, such as those displayed in 
Wikipedia articles.
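
To illustrate why the IPv6 loss did not affect availability, here is a minimal 
Python sketch of the fallback idea. It is a simplification and not the full 
Happy Eyeballs algorithm: RFC 8305 races connection attempts with a short 
stagger, while this version simply tries candidates one after another. The 
function names, the `connect` callback, and the example addresses are 
hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Address = Tuple[str, str]  # (family, ip), e.g. ("ipv6", "2001:db8::1")


def interleave_families(addresses: Iterable[Address]) -> List[Address]:
    """Order candidates IPv6 first, alternating with IPv4, so a broken
    address family costs at most one failed attempt before the other is tried."""
    candidates = list(addresses)
    v6 = [a for a in candidates if a[0] == "ipv6"]
    v4 = [a for a in candidates if a[0] == "ipv4"]
    ordered: List[Address] = []
    for i in range(max(len(v6), len(v4))):
        if i < len(v6):
            ordered.append(v6[i])
        if i < len(v4):
            ordered.append(v4[i])
    return ordered


def connect_with_fallback(
    addresses: Iterable[Address],
    connect: Callable[[Address], bool],
) -> Optional[Address]:
    """Return the first address that `connect` reports as reachable, or None."""
    for addr in interleave_families(addresses):
        if connect(addr):
            return addr
    return None


# Hypothetical scenario: IPv6 is unreachable, IPv4 still works.
candidates = [("ipv6", "2001:db8::1"), ("ipv4", "192.0.2.1")]
chosen = connect_with_fallback(candidates, lambda addr: addr[0] == "ipv4")
print(chosen)  # ('ipv4', '192.0.2.1')
```

In this scenario the client ends up on IPv4 after one failed IPv6 attempt, 
which is why a partial IPv6 loss can remain invisible to end users.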

2021-11-23 core network routing 
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-23_Core_Network_Routing>;
 Impact: For about 12 minutes, Eqiad was unable to reach hosts in other data 
centers via public IP addresses. This was due to a BGP routing error. There was 
no impact on end-user traffic, and impact on internal traffic was limited (only 
Icinga alerts themselves) because internal traffic generally uses local IP 
subnets which we currently route with OSPF instead of BGP.

2021-11-25 eventgate-main outage 
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-25_eventgate-main_outage>;
 Impact: For about 3 minutes, eventgate-main was down. This resulted in 25,000 
MediaWiki backend errors due to inability to queue new jobs. About 1000 
user-facing web requests failed (HTTP 500 Error). Event production briefly 
dropped from ~3000 per second to 0 per second.

Incident follow-up
Remember to review and schedule Incident Follow-up work 
<https://phabricator.wikimedia.org/project/view/4758/> in Phabricator. These 
tasks are preventive measures and tech debt mitigations written down after an 
incident is concluded. Read more about past incidents at Incident status 
<https://wikitech.wikimedia.org/wiki/Incident_status> on Wikitech.

Recently resolved incident follow-up:

Disable DPL on wikis that aren't using it 
<https://phabricator.wikimedia.org/T287916>
Filed after a July 2021 incident, done by Amir (Ladsgroup) and Kunal (Legoktm).

Create easy access to MySQL ports for faster incident response and maintenance 
<https://phabricator.wikimedia.org/T291352>
Filed in Sep 2021, and carried out by Stevie (Kormat).

Create paging alert for primary DB hosts 
<https://phabricator.wikimedia.org/T233684>
Filed after a Sep 2019 incident, done by Stevie (Kormat).


Trends
November saw 27 new production error reports, of which 14 were resolved and 13 
remain open and carry over to the next month.

Of the 301 errors still open from previous months, 16 were resolved. Together 
with the 13 carried over from November, that brings the workboard to 298 
unresolved tasks.
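
For readers who want to check the numbers, the workboard total can be 
reproduced with a few lines of Python. The variable names are my own; the 
figures are the ones quoted above.

```python
# Figures quoted in this section.
open_from_previous_months = 301
resolved_from_previous_months = 16
new_in_november = 27
resolved_in_november = 14

# New November reports that were not resolved carry over to next month.
carried_over_from_november = new_in_november - resolved_in_november

# Older reports still open, plus the November carry-over.
unresolved_total = (
    open_from_previous_months
    - resolved_from_previous_months
    + carried_over_from_november
)

print(carried_over_from_november)  # 13
print(unresolved_total)  # 298
```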

Figure 1: Unresolved error reports by month 
<https://phabricator.wikimedia.org/phame/post/view/261/production_excellence_38_november_2021/#trends>.


Outstanding errors
Take a look at the workboard and look for tasks that could use your help.

→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/

💡 Did you know:
To find your team's error reports, use the appropriate "Filter" link in the 
sidebar of the workboard.

Issues carried over from recent months:

Apr 2021: 9 of 42 issues left.
May 2021: 16 of 54 issues left.
Jun 2021: 9 of 26 issues left.
Jul 2021: 11 of 31 issues left.
Aug 2021: 10 of 46 issues left.
Sep 2021: 10 of 24 issues left.
Oct 2021: 20 of 49 issues left.
Nov 2021: 13 of 27 new issues 
<https://phabricator.wikimedia.org/maniphest/query/0W0Nuk9umBDc/#R> are carried 
forward.


Thanks!
Thank you to everyone who helped by reporting, investigating, or resolving 
problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof



🔗 Share or read later via https://phabricator.wikimedia.org/phame/post/view/261/
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
