How are we doing in our strive for operational excellence? Read on to find out!

Incidents
There were 6 incidents in June this year. That's double the median of three per 
month, over the past two years (Incident graphs 
<https://codepen.io/Krinkle/full/wbYMZK>).

2022-06-01 cloudelastic 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-06-01_Lost_index_in_cloudelastic>
Impact: For 41 days, Cloudelastic was missing search results about files from 
commons.wikimedia.org.

2022-06-10 overload varnish haproxy 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-06-10_overload_varnish_haproxy>
Impact: For 3 minutes, wiki traffic was disrupted in multiple regions for 
cached and logged-in responses.

2022-06-12 appserver latency 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-06-12_appserver_latency>
Impact: For 30 minutes, wiki backends were intermittently slow or unresponsive, 
affecting a portion of logged-in requests and uncached page views.

2022-06-16 MariaDB password 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-06-16_MariaDB_password_leak>
Impact: For 2 hours, a current production database password was publicly known. 
Other measures ensured that no data could be compromised (e.g. firewalls and 
selective IP grants).

2022-06-21 asw-a2-codfw power 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-06-21_asw-a2-codfw_accidental_power_cycle>
Impact: For 11 minutes, one of the Codfw server racks lost network 
connectivity. Among the affected servers was an LVS host. Another LVS host in 
Codfw automatically took over its load balancing responsibility for wiki 
traffic. During the transition, there was a brief increase in latency for 
regions served by Codfw (Mexico, and parts of US/Canada).

2022-06-30 asw-a4-codfw power 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-06-30_asw-a4-codfw_accidental_power_cycle>
Impact: For 18 minutes, servers in the A4-codfw rack lost network connectivity. 
Little to no external impact.

Incident follow-up
Recently completed incident follow-up:


Audit database usage of GlobalBlocking extension 
<https://phabricator.wikimedia.org/T307648>
Filed by Amir (Ladsgroup) in May following an outage due to db load from 
GlobalBlocking. Amir reduced the extensions' DB load by 10%, through avoiding 
checks for edit traffic from WMCS and Toolforge. And he implemented stats for 
monitoring GlobalBlocking DB queries going forward.


Reduce Lilypond shellouts from VisualEditor 
<https://phabricator.wikimedia.org/T312319>
Filed by Reuven (RLazarus) and Kunal (Legoktm) after a shellbox incident. Ed 
(Esanders) and Sammy (TheresNoTime) improved the Score extension's VisualEditor 
plugin to increase its debounce duration.

Remember to review and schedule Incident Follow-up work 
<https://phabricator.wikimedia.org/project/view/4758/> in Phabricator! These 
are preventive measures and tech debt mitigations written down after an 
incident is concluded. Read more about past incidents at Incident status 
<https://wikitech.wikimedia.org/wiki/Incident_status> on Wikitech.

Trends

In June and July (which is almost over), we reported 27 new production errors 
<https://phabricator.wikimedia.org/maniphest/query/WDqlrITVmIoX/#R> and 25 
production errors 
<https://phabricator.wikimedia.org/maniphest/query/pzOAOpbnF3PX/#R> 
respectively. Of these 52 new issues, 27 were closed in weeks since then, and 
25 remain unresolved and will carry over to August.

We also addressed 25 stagnant problems that we carried over from previous 
months, thus the workboard overall remains at exactly 299 unresolved production 
errors.

Take a look at the Wikimedia-production-error 
<https://phabricator.wikimedia.org/tag/wikimedia-production-error/> workboard 
and look for tasks that could use your help.

💡 *Did you know?* To zoom in and find your team's error reports, use the 
appropriate "Filter" link in the sidebar of the workboard .
For the month-over-month numbers, refer to the spreadsheet data 
<https://docs.google.com/spreadsheets/d/e/2PACX-1vTrUCAI10hIroYDU-i5_8s7pony8M71ATXrFRiXXV7t5-tITZYrTRLGch-3iJbmeG41ZMcj1vGfzZ70/pubhtml>.

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving 
problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof




🔗 Share or read later via 
https://phabricator.wikimedia.org/phame/post/view/292/ 

_______________________________________________
Wikitech-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/

Reply via email to