How’d we do in striving for operational excellence last month? Read on to 
find out!

Incidents 
Last month we experienced 2 (public) incidents. This is below the three-year 
median of 3 incidents a month (Incident graphs 
<https://codepen.io/Krinkle/full/wbYMZK>).

2022-04-06 esams network 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-04-06_esams_network>
Impact: For 30 minutes, wikis were slow or unreachable for a portion of clients 
routed to the Esams data center. Esams is one of two DCs primarily serving 
Europe, the Middle East, and Africa.

2022-04-26 cr2-eqord down 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-04-26_cr2-eqord_down>
Impact: No external impact. Internally, for 2 hours we were unable to access 
our Eqord routers by any means. This was due to a fiber cut on a redundant link 
to Eqiad, which then coincided with planned vendor maintenance on the links to 
Ulsfo and Eqiad. See also Network design 
<https://wikitech.wikimedia.org/wiki/Network_design>.



Incident follow-up 
Remember to review and schedule Incident Follow-up work 
<https://phabricator.wikimedia.org/project/view/4758/> in Phabricator. These 
tasks are preventive measures and tech debt mitigations written down after an 
incident is concluded. Read more about past incidents at Incident status 
<https://wikitech.wikimedia.org/wiki/Incident_status> on Wikitech.

Recently resolved incident follow-up:

Reduce mysql grants for wikiadmin scripts 
<https://phabricator.wikimedia.org/T249683>
Filed in 2020 after the Wikidata drop-table incident (details 
<https://wikitech.wikimedia.org/wiki/Incidents/2020-04-07_Wikidata%27s_wb_items_per_site_table_dropped>).
Carried out over the last six months by Ladsgroup (SRE Data Persistence).

Improve reliability of Toolforge k8s cron jobs 
<https://phabricator.wikimedia.org/T308204> and Re-enable CronJobControllerV2 
<https://phabricator.wikimedia.org/T308205>
Filed earlier this week after a Toolforge incident and carried out by Majavah.


Trends 
During the month of April we reported 27 new production errors 
<https://phabricator.wikimedia.org/maniphest/query/OZ99DkeJf85D/#R>. Of these 
new errors, we resolved 14, and the remaining 13 are still open and have 
carried over to May.

Last month, the workboard totalled 298 unresolved error reports. Of these older 
reports that carried over from previous months, 16 were resolved. Most of these 
were reports from before 2019.

The new total, including some tasks for the current month of May, is 292. A 
slight decrease! (spreadsheet 
<https://docs.google.com/spreadsheets/d/e/2PACX-1vTrUCAI10hIroYDU-i5_8s7pony8M71ATXrFRiXXV7t5-tITZYrTRLGch-3iJbmeG41ZMcj1vGfzZ70/pubhtml>).

Take a look at the workboard and find tasks that could use your help.

→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/ 




Thanks! 

Thank you to everyone who helped by reporting, investigating, or resolving 
problems in Wikimedia production.

Until next time,

– Timo Tijhof



🔗 Share or read later via https://phabricator.wikimedia.org/phame/post/view/284/
