I just want to say thank you so much for these emails. They're great on
their own, but together they paint a clear picture at a level usually
inaccessible to those of us outside everyday MediaWiki development.  Thank you!

On Sat, Dec 11, 2021 at 20:39 Krinkle <krin...@fastmail.com> wrote:

> How’d we do in our striving for operational excellence last month? Read on
> to find out!
>
> Incidents
>
> 6 documented incidents last month. That's above the two-year and five-year
> median of 4 per month (per Incident graphs
> <https://codepen.io/Krinkle/full/wbYMZK>).
>
> 2021-11-04 large file upload timeouts
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-04_large_file_upload_timeouts>;
> Impact: For 9 months, editors were unable to upload large files (e.g. to
> Commons). Editors would receive generic error messages, typically after a
> timeout. In retrospect, a dozen distinct production errors had
> been reported and regularly observed that were related and provided
> different clues; however, most of these remained untriaged and
> uninvestigated for months. This may be related to the affected components
> having no active code steward.
>
> 2021-11-05 TOC language converter
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-05_TOC_language_converter>;
> Impact: For 6 hours, wikis experienced a blank or missing table of contents
> on many pages. For up to 3 days prior, wikis that have multiple language
> variants (such as Chinese Wikipedia) displayed the table of contents in an
> incorrect or inconsistent language variant (which is not understandable to
> some readers).
>
> 2021-11-10 cirrussearch commonsfile outage
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-10_cirrussearch_commonsfile_outage>;
> Impact: For ~2.5 hours, the Search results page was unavailable on many
> wikis (except English Wikipedia). On Wikimedia Commons the search
> suggestions feature was unresponsive as well.
>
> 2021-11-18 codfw ipv6 network
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-18_codfw_ipv6_network>;
> Impact: For 8 minutes, the Codfw cluster experienced partial loss of IPv6
> connectivity for upload.wikimedia.org. This did not affect availability
> of the service because the "Happy Eyeballs
> <https://en.wikipedia.org/wiki/Happy_Eyeballs>" algorithm ensures
> browsers (and other clients) automatically fall back to IPv4. The Codfw
> cluster generally serves Mexico and parts of the US and Canada. The
> upload.wikimedia.org service serves photos and other media/document
> files, such as those displayed in Wikipedia articles.
>
> 2021-11-23 core network routing
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-23_Core_Network_Routing>;
> Impact: For about 12 minutes, Eqiad was unable to reach hosts in other data
> centers via public IP addresses. This was due to a BGP routing error. There
> was no impact on end-user traffic, and impact on internal traffic was
> limited (only Icinga alerts themselves) because internal traffic generally
> uses local IP subnets which we currently route with OSPF instead of BGP.
>
> 2021-11-25 eventgate-main outage
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-25_eventgate-main_outage>;
> Impact: For about 3 minutes, eventgate-main was down. This resulted in
> 25,000 MediaWiki backend errors due to inability to queue new jobs. About
> 1000 user-facing web requests failed (HTTP 500 Error). Event production
> briefly dropped from ~3000 per second to 0 per second.
>
> Incident follow-up
>
> Remember to review and schedule Incident Follow-up work
> <https://phabricator.wikimedia.org/project/view/4758/> in Phabricator.
> These are preventive measures and tech debt mitigations written down after
> an incident is concluded. Read more about past incidents at Incident
> status <https://wikitech.wikimedia.org/wiki/Incident_status> on Wikitech.
>
> Recently resolved incident follow-up:
>
> Disable DPL on wikis that aren't using it
> <https://phabricator.wikimedia.org/T287916>
> Filed after a July 2021 incident, done by Amir (Ladsgroup) and Kunal
> (Legoktm).
>
> Create easy access to MySQL ports for faster incident response and
> maintenance <https://phabricator.wikimedia.org/T291352>
> Filed in Sep 2021, and carried out by Stevie (Kormat).
>
> Create paging alert for primary DB hosts
> <https://phabricator.wikimedia.org/T233684>
> Filed after a Sep 2019 incident, done by Stevie (Kormat).
>
> Trends
>
> November saw 27 new production error reports of which 14 were resolved,
> and 13 remain open and carry over to the next month.
>
> Of the 301 errors still open from previous months, 16 were resolved.
> Together with the 13 carried over from November, that brings the workboard
> to 298 unresolved tasks.
>
> Figure 1: Unresolved error reports by month
> <https://phabricator.wikimedia.org/phame/post/view/261/production_excellence_38_november_2021/#trends>.
>
>
> Outstanding errors
>
> Take a look at the workboard and look for tasks that could use your help.
> →  https://phabricator.wikimedia.org/tag/wikimedia-production-error/
>
> 💡 Did you know:
> *To find your team's error reports, use the appropriate **"Filter" link
> in the sidebar of the workboard**.*
>
> Issues carried over from recent months:
>
> Apr 2021:
> 9 of 42 issues left.
> May 2021:
> 16 of 54 issues left.
> Jun 2021:
> 9 of 26 issues left.
> Jul 2021:
> 11 of 31 issues left.
> Aug 2021:
> 10 of 46 issues left.
> Sep 2021:
> 10 of 24 issues left.
> Oct 2021:
> 20 of 49 issues left.
> Nov 2021:
> 13 of 27 new issues
> <https://phabricator.wikimedia.org/maniphest/query/0W0Nuk9umBDc/#R> are
> carried forward.
>
> Thanks!
>
> Thank you to everyone who helped by reporting, investigating, or resolving
> problems in Wikimedia production. Thanks!
>
> Until next time,
>
> – Timo Tijhof
>
>
> 🔗 Share or read later via
> https://phabricator.wikimedia.org/phame/post/view/261/
>
> _______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/