[Wikitech-l] Production Excellence #42: April 2022

2022-04-21 Thread Krinkle
How’d we do in our pursuit of operational excellence last month? Read on to 
find out!

Incidents
We've had quite the month, with 8 documented incidents. That's more than double 
the two-year median of three a month (Incident graphs).

2022-03-01 ulsfo network 

Impact: For 20 minutes, clients normally routed to Ulsfo were unable to reach 
our projects. This includes New Zealand, parts of Canada, and the United States 
west coast.

2022-03-04 esams availability banner sampling 

Impact: For 1.5 hours, all wikis were largely unreachable from Europe (via 
Esams), with more limited impact across the globe via other data centers as 
well.

2022-03-06 wdqs-categories 

Impact: For 1.5 hours, some requests to the public Wikidata Query Service API 
were sporadically blocked.

2022-03-10 site availability 

Impact: For 12 min, all wikis were unreachable to logged-in users, and to 
unregistered users trying to access uncached content.

2022-03-27 api 

Impact: For ~4 hours, in three segments of 1-2 hours each over two days, there 
were higher levels of failed or slow MediaWiki API requests.

2022-03-27 wdqs outage 

Impact: For 30 minutes, all WDQS queries failed due to an internal deadlock.

2022-03-29 network 

Impact: For approximately 5 minutes, Wikipedia and other Wikimedia sites were 
slow or inaccessible for many users, mostly in Europe/Africa/Asia. (Details not 
public at this time.)

2022-03-31 api errors 

Impact: For 22 minutes, API server and app server availability were slightly 
decreased (~0.1% errors, all for s7-hosted wikis such as Spanish Wikipedia), 
and the latency of API servers was elevated as well.




Incident follow-up

Remember to review and schedule Incident Follow-up (Sustainability) tasks in 
Phabricator, which are preventive measures and tech debt mitigations written 
down after an incident is concluded. Read more about past incidents at Incident 
status on Wikitech. Some recently completed sustainability work:

Add linecard diversity to router-to-router interconnect at Codfw 

Filed by Chris (SRE Infra) in 2020 after an incident where all hosts in the 
Codfw data center lost connectivity at once. Completed by Arzhel and Cathal 
(SRE Infra), and Papaul (DC Ops); including in Esams where the same issue 
existed.

Expand parser tests to cover language conversion variants in 
table-of-contents output 

Suggested and carried out by CScott (Parsoid) after reviewing an incident in 
November. The TOC on wikis that rely on the LanguageConverter service (such as 
Chinese Wikipedia) was no longer localized.

Fix unquoted URL parameters in Icinga health checks 

Suggested by Riccardo (SRE Infra) in response to an early warning signal for 
TLS certificate expiry. He realized that automated checks for a related cluster 
were still claiming to be in good health, when they in fact should have been 
firing a similar warning. Carried out by Filippo and Dzahn.

Provide automation to quickly show replication status when primary is down 

Filed in April by Jaime (SRE Data Persistence), carried out by John and 
Ladsgroup.




Trends

Since the last edition, we resolved 24 of the 301 unresolved errors that 
carried over from previous months.
In March, we created 54 new production errors. That's quite high compared to 
the twenty-odd reports we find most months. Of these, 17 remain open today, a 
month later.

In the month of April, so far, we reported 20 new errors, of which 17 also 
remain open today.

The production error workboard once again adds up to exactly 298 open tasks 
(spreadsheet).


Take a look at the workboard and look for tasks that could use your help.
→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/ 




Thanks!
Thank you to everyone who helped by reporting, investigating, or resolving 
problems in Wikimedia 

[Wikitech-l] 離Trainsperiment Survey Results

2022-04-21 Thread Tyler Cipriani
*tl;dr:* What We Learned from Trainsperiment Week


Release Engineering took the feedback from the Trainsperiment survey and
posted it on our blog; there are a lot of cool charts to see!

Trainsperiment week happened the week of March 21st, when we deployed
MediaWiki versions 1.39.0-wmf.1 through 1.39.0-wmf.4 in a single week.

Thank you to everyone who took the time to give us feedback, and worked
with us while we tried something new.

<3

Tyler Cipriani (he/him)
Engineering Manager, Release Engineering
Wikimedia Foundation