How’d we do in our pursuit of operational excellence last month? Read on to 
find out!

Incidents 
By golly, we've had quite the month! 10 documented incidents, which is more 
than three times the two-year median of 3. The last time we experienced ten or 
more incidents in one month was June 2019, when we had eleven (Incident graphs 
<https://codepen.io/Krinkle/full/wbYMZK>, Excellence monthly of June 2019 
<https://phabricator.wikimedia.org/phame/post/view/163/production_excellence_12_june_2019/>).

I'd like to draw your attention to something positive. As you read the below, 
take note of incidents that did *not* impact public services, and did *not* 
have lasting impact or data loss. For example, the Apache incident 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-24_Failed_Apache_restart>
 benefited from PyBal's automatic health-based depooling. The deployment server 
incident <https://wikitech.wikimedia.org/wiki/Incidents/2022-05-02_deployment> 
recovered without loss thanks to Bacula. The impact of the Etcd incident 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-01_etcd> was limited 
by serving stale data. And the Hadoop incident 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure>
 recovered by resuming from Kafka right where it left off.
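
As an aside, the Hadoop recovery illustrates the standard Kafka consumer-group 
pattern: offsets are committed only after data has been durably written, so a 
restarted consumer resumes from the last committed offset. Below is a minimal 
sketch of that pattern in Python using kafka-python; the topic, broker, group, 
and sink names are made up for illustration and are not the actual Data Lake 
ingestion pipeline.

    # Commit offsets only after a durable write, so a restart picks up
    # exactly where the previous run left off.
    from kafka import KafkaConsumer  # pip install kafka-python

    def write_to_hdfs(payload: bytes) -> None:
        # Stand-in for the real sink; the actual pipeline writes to HDFS.
        print('would write', len(payload), 'bytes')

    consumer = KafkaConsumer(
        'example-events',                    # hypothetical topic
        bootstrap_servers='localhost:9092',  # placeholder broker
        group_id='example-hdfs-ingest',      # group tracks committed offsets
        enable_auto_commit=False,            # commit explicitly, not on a timer
        auto_offset_reset='earliest',
    )

    for message in consumer:
        write_to_hdfs(message.value)
        consumer.commit()  # mark progress; a restart resumes from this point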

2022-05-01 etcd <https://wikitech.wikimedia.org/wiki/Incidents/2022-05-01_etcd>
Impact: For 2 hours, Conftool could not sync Etcd data between our core data 
centers. Puppet and some other internal services were unavailable or out of 
sync. The issue was isolated, with no impact on public services.

2022-05-02 deployment server 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-02_deployment>
Impact: For 4 hours, we could not update or deploy MediaWiki and other 
services, due to corruption on the active deployment server. No impact on 
public services.

2022-05-05 site outage 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-05_Wikimedia_full_site_outage>
Impact: For 20 minutes, all wikis were unreachable for logged-in users and 
non-cached pages. This was due to a GlobalBlocks schema change causing 
significant slowdown in a frequent database query.

2022-05-09 Codfw confctl 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-09_confctl>
Impact: For 5 minutes, all web traffic routed to Codfw received error 
responses. This affected central USA and South America (local time after 
midnight). The cause was human error and lack of CLI parameter validation.

2022-05-09 exim-bdat-errors 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-09_exim-bdat-errors>
Impact: Over a five-day period, about 14,000 incoming emails from Gmail users to 
wikimedia.org were rejected and returned to sender.

2022-05-21 varnish cache busting 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-21_varnish_cache_busting>
Impact: For 2 minutes, all wikis and services behind our CDN were unavailable 
to all users.

2022-05-24 failed Apache restart 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-24_Failed_Apache_restart>
Impact: For 35 minutes, numerous internal services that use Apache on the 
backend were down. This included Kibana (logstash) and Matomo (piwik). For 20 
of those minutes, there was also reduced MediaWiki server capacity, but no 
measurable end-user impact for wiki traffic.

2022-05-25 de.wikipedia.org 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-25_de.wikipedia.org>
Impact: For 6 minutes, a portion of logged-in users and non-cached pages 
experienced a slower response or an error. This was due to increased load on 
one of the databases.

2022-05-26 m1 database hardware 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-26_Database_hardware_failure>
Impact: For 12 minutes, internal services hosted on the m1 database (e.g. 
Etherpad) were unavailable or at reduced capacity.

2022-05-31 Analytics Hadoop failure 
<https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure>
Impact: For 1 hour, all HDFS writes and reads were failing. After recovery, 
ingestion from Kafka resumed and caught up. No data loss or other lasting 
impact on the Data Lake.

Incident follow-up 
Recently completed incident follow-up:

Invalid confctl selector should either error out or select nothing 
<https://phabricator.wikimedia.org/T308100>
Filed by Amir (@Ladsgroup <https://phabricator.wikimedia.org/p/Ladsgroup/>) 
after the confctl incident this past month. Giuseppe (@Joe 
<https://phabricator.wikimedia.org/p/Joe/>) implemented CLI parameter 
validation to prevent human error from causing a similar outage in the future.
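
The general shape of such a guard, sketched below in Python, is to validate the 
selector syntax up front and exit with an error instead of silently matching 
everything. This is illustrative only; the selector grammar and option names 
here are assumed, not conftool's actual ones.

    # Sketch: reject malformed selectors at argument-parsing time.
    import argparse
    import re

    # Assumed "key=value[,key=value...]" form; not the real conftool grammar.
    SELECTOR_RE = re.compile(r'^\w+=[\w.*-]+(,\w+=[\w.*-]+)*$')

    def selector(value: str) -> str:
        if not SELECTOR_RE.match(value):
            raise argparse.ArgumentTypeError(f'invalid selector: {value!r}')
        return value

    parser = argparse.ArgumentParser(prog='example-confctl')
    parser.add_argument('--select', type=selector, required=True,
                        help='selector in key=value[,key=value...] form')
    args = parser.parse_args()
    print('acting on hosts matching', args.select)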

Backup opensearch dashboards data <https://phabricator.wikimedia.org/T237224>
Filed back in 2019 by Filippo (@fgiunchedi 
<https://phabricator.wikimedia.org/p/fgiunchedi/>). The OpenSearch homepage 
dashboard (at logstash.wikimedia.org) was accidentally deleted last month. 
Bryan (@bd808 <https://phabricator.wikimedia.org/p/bd808/>) tracked down its 
content and re-created it. Cole (@colewhite 
<https://phabricator.wikimedia.org/p/colewhite/>) and Jaime (@jcrespo 
<https://phabricator.wikimedia.org/p/jcrespo/>) worked out a strategy and set 
up automated backups going forward.
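
For reference, OpenSearch Dashboards exposes a saved-objects export API that 
returns dashboards as NDJSON, which is one building block for this kind of 
backup. The sketch below is illustrative only: it is not necessarily the 
strategy chosen in T237224, and the hostname and filename are placeholders.

    # Export dashboards (and related objects) to a dated NDJSON file.
    import datetime
    import requests

    resp = requests.post(
        'https://dashboards.example.org/api/saved_objects/_export',
        headers={'osd-xsrf': 'true'},
        json={'type': ['dashboard', 'visualization', 'index-pattern']},
        timeout=30,
    )
    resp.raise_for_status()

    stamp = datetime.date.today().isoformat()
    with open(f'saved-objects-backup-{stamp}.ndjson', 'wb') as f:
        f.write(resp.content)  # keep dated copies so deletions can be undone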

Remember to review and schedule Incident Follow-up work 
<https://phabricator.wikimedia.org/project/view/4758/> in Phabricator! These 
are preventive measures and tech debt mitigations written down after an 
incident is concluded. Read more about past incidents at Incident status on 
Wikitech.

💡*Did you know?*: The form on the *Incident status 
<https://wikitech.wikimedia.org/wiki/Incident_status>* page now includes a 
date field, making it easier to create backdated reports.

Trends 
In May we discovered 28 new production errors 
<https://phabricator.wikimedia.org/maniphest/query/z7vLwJdXtLu2/#R>, of which 
20 remain unresolved and have come with us to June.

Last month the workboard totalled 292 tasks still open from prior months. Since 
the last edition, we completed 11 tasks from previous months, gained 11 
additional errors from May (part of May was already counted in last month's 
edition), and have 7 fresh errors in the current month of June. As of today, 
the workboard houses 299 open production error tasks (292 - 11 + 11 + 7 = 299; 
spreadsheet and graph 
<https://docs.google.com/spreadsheets/d/e/2PACX-1vTrUCAI10hIroYDU-i5_8s7pony8M71ATXrFRiXXV7t5-tITZYrTRLGch-3iJbmeG41ZMcj1vGfzZ70/pubhtml>,
 phab report <https://phabricator.wikimedia.org/project/reports/1055/>).

Take a look at the workboard and look for tasks that could use your help.
→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Thanks!
Thank you to everyone who helped by reporting, investigating, or resolving 
problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof



🔗 Share or read later via https://phabricator.wikimedia.org/phame/post/view/285/

