Great writeup, Timo — I'm loving the very long (and very nicely laid out)
list of helpers! 😁👍

Cheers,

Deb

--

deb tankersley

program manager, engineering

Wikimedia Foundation


On Tue, Sep 25, 2018 at 1:42 PM Krinkle <[email protected]> wrote:

> 📘 Read this post on Phabricator at
> https://phabricator.wikimedia.org/phame/live/1/post/119/
> ________________________________
>
> How’d we do in our striving for operational excellence last month? Read on
> to find out!
>
> - Month in numbers.
> - Current problems.
> - Highlighted stories.
>
> ## Month in numbers
>
> * 1 documented incident since August 9. [1]
> * 113 Wikimedia-prod-error tasks closed since August 9. [2]
> * 99 Wikimedia-prod-error tasks created since August 9. [3]
>
> ## Current problems
>
> Frequent:
> * [MediaWiki-Logging] Exception from Special:Log (public GET). –
> https://phabricator.wikimedia.org/T201411
> * [Graph] Warning "data error" from ApiGraph in gzdecode. –
> https://phabricator.wikimedia.org/T184128
> * [RemexHtml] Exception "backtrack_limit exhausted" from search index jobs.
> – https://phabricator.wikimedia.org/T201184
>
> Other:
> * [MediaWiki-Redirects] Exception from NS_MEDIA redirect (public GET). –
> https://phabricator.wikimedia.org/T203942
>
> This is an oldie: (Well..., it's an oldie where I come from... 🎸)
> * [FlaggedRevs] Exception from Special:ProblemChanges (since 2011). –
> https://phabricator.wikimedia.org/T176232
>
> Terminology:
> * An Exception (or fatal) causes user actions to be aborted. For example, a
> page would display "Exception: Unable to render page" instead of the article
> content.
> * A Warning (or non-fatal, or error) can produce page views that are
> technically unaware of a problem, but may show corrupt or incomplete
> information. For example, an article would display the word "null" instead
> of the actual content. Or, a user may be told "You have (null) new
> messages."
>
> The combined volume of infrequent non-fatal errors is high. This limits our
> ability to automatically detect whether a deployment caused problems. The
> “public GET” risks in particular can cause (and have caused) alerts to fire
> that notify Operations of wikis potentially being down. Such exceptions must
> not be publicly exposed.
>
> With that behind us... Let’s celebrate this month’s highlights!
>
> ## *️⃣ Quiz defect – "0" is not nothing!
>
> Tyler Cipriani (Release Engineering) reported an error in Quiz. Wikiversity
> uses Quiz for interactive learning. [4] Editors define quizzes in the source
> text (wikitext). The Quiz program processes this text, creates checkboxes
> with labels, and sends it to a user. When the sending part failed, "Error:
> Undefined index" appeared in the logs. Volunteer Umherirrender
> investigated.
>
> A line in the source text can: define a question, or an answer, or nothing
> at all. The code that creates checkboxes needs to decide between
> "something" and "nothing". The code utilised the PHP "if" statement for
> this, which compares a value to True and False. The answers to a quiz can
> be any text, which means PHP first transforms the text to one of True or
> False. In doing so, values like "0" became False. This meant the code
> thought "0" was not an answer. The code responsible for sending checkboxes
> did not have this problem. When the code tried to access the checkbox to
> send, it did not exist. Hence, "Error: Undefined index".
>
> Umherirrender fixed the problem by using a strict comparison. A strict
> comparison doesn't transform a value first; it only compares.
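
The difference between the two checks can be sketched as follows. This is an illustration in Python, not the Quiz extension's PHP code: `php_truthy` is a hypothetical helper that models how a PHP "if" converts a string to True or False (the empty string and "0" both become False), and the two `collect_answers_*` functions are made-up names. The fixed variant assumes that "nothing" means the empty string.

```python
def php_truthy(text: str) -> bool:
    """Model PHP's string-to-boolean conversion: '' and '0' are False."""
    return text not in ("", "0")


def collect_answers_loose(lines):
    """Buggy variant: the answer '0' is mistaken for 'nothing' and dropped."""
    return [line for line in lines if php_truthy(line)]


def collect_answers_strict(lines):
    """Fixed variant: only the empty string counts as 'nothing'."""
    return [line for line in lines if line != ""]


answers = ["42", "0", "", "7"]
print(collect_answers_loose(answers))   # ['42', '7']
print(collect_answers_strict(answers))  # ['42', '0', '7']
```

With the loose check, the "0" answer never gets a checkbox, so any later step that expects one fails to find it.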
>
> – https://phabricator.wikimedia.org/T196684
>
> ## *️⃣ PageTriage enters JobQueue for better performance
>
> Kosta Harlan (from Audiences' Growth team) investigated a warning for
> PageTriage. This extension provides the New Pages Feed tool on the English
> Wikipedia. Each page in the feed has metadata, usually calculated when an
> editor creates a page. Sometimes, this is not available. Then, it must be
> calculated on-demand, when a user triages pages. So far, so good. The
> information was then saved to the database for re-use by other triagers.
> This last part caused the serious performance warning: "Unexpected database
> writes".
>
> Database changes must not happen on page views. The database has many
> replicas for reading, but only one "master" for all writing. We avoid using
> the master during page views to make our systems independent. This is a key
> design principle for MediaWiki performance. [5] It lets a secondary data
> centre build pages without connecting to the primary (which can be far
> away).
>
> Kosta addressed the warning by improving the code that saves the calculated
> information. Instead of saving it immediately, an instruction is now sent
> via a job queue, after the page view is ready. This job queue then
> calculates and saves the information to the master database. The master
> synchronises it to replicas, and then page views can use it.
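
The deferral pattern described above can be sketched like this. It is a simplified model, not PageTriage or MediaWiki code: `primary_db`, `job_queue`, `compute_metadata`, `handle_page_view` and `run_jobs` are hypothetical names, and a dictionary and an in-memory queue stand in for the master database and the real job queue.

```python
from collections import deque

primary_db = {}      # stand-in for the single writable "master" database
job_queue = deque()  # stand-in for the job queue


def compute_metadata(page_id):
    """Placeholder for the on-demand metadata calculation."""
    return {"page_id": page_id, "length": 100 * page_id}


def handle_page_view(page_id):
    """Serve the view without writing; only queue an instruction."""
    metadata = compute_metadata(page_id)
    job_queue.append(page_id)  # the instruction, not the write itself
    return metadata


def run_jobs():
    """Runs outside the page view: calculate and save to the primary."""
    while job_queue:
        page_id = job_queue.popleft()
        primary_db[page_id] = compute_metadata(page_id)


view = handle_page_view(3)
print(primary_db)            # {} because nothing was written during the view
run_jobs()
print(primary_db[3] == view)  # True
```

The page view only reads and enqueues, so in this model it never needs a connection to the writable database.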
>
> – https://phabricator.wikimedia.org/T199699 /
> https://gerrit.wikimedia.org/r/455870
>
> ## *️⃣ Tomorrow may be sooner than you think
>
> After developers submit code to Gerrit, they eagerly await the result from
> Jenkins, an automated test runner. It sometimes incorrectly reported a
> problem with the MergeHistory feature. The code assumed that the tests
> would finish by "tomorrow".
>
> It might be safe to assume our tests will not take one day to finish.
> Unfortunately, the programming utility "strtotime" does not interpret
> "tomorrow" as "this time tomorrow". Instead, it means "the start of
> tomorrow". In other words, the next strike of midnight! The tests use UTC
> as the neutral timezone.
>
> Every day, in the 15 minutes before 5 PM in San Francisco (which is midnight
> UTC), code submitted to Code Review could have mysteriously failing tests.
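
The gap between the two readings of "tomorrow" can be shown with a small Python sketch. It does not call PHP's strtotime; `start_of_tomorrow` and `this_time_tomorrow` are hypothetical helpers that model the two interpretations described above, and the date is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone


def start_of_tomorrow(now):
    """Midnight at the start of the next day (the 'tomorrow' reading)."""
    next_day = (now + timedelta(days=1)).date()
    return datetime(next_day.year, next_day.month, next_day.day,
                    tzinfo=now.tzinfo)


def this_time_tomorrow(now):
    """Exactly 24 hours from now (what the test code assumed)."""
    return now + timedelta(days=1)


# Arbitrary example: 10 minutes before midnight UTC.
now = datetime(2018, 8, 14, 23, 50, tzinfo=timezone.utc)
print(start_of_tomorrow(now) - now)   # 0:10:00
print(this_time_tomorrow(now) - now)  # 1 day, 0:00:00
```

A test started at this moment has only ten minutes before "tomorrow" arrives, rather than a full day.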
>
> – Continue at https://gerrit.wikimedia.org/r/452873
>
> ## *️⃣ Continuous Whac-A-Mole
>
> In August, developers started to notice rare and mysterious failures from
> Jenkins. No obvious cause or solution was known at that time.
>
> Later that month, Dan Duvall (Release Engineering team) started exploring
> ways to run our tests faster. Before, we had many small virtual servers,
> where each server runs only one test at a time. The idea: Have a smaller
> group of much larger virtual servers where each server could run many tests
> at the same time. We hope that during busier times this will better share
> the resources between tests. And, during less busy times, allow a single
> test to use more resources.
>
> As implementation of this idea began, the mysterious test failures became
> commonplace. "No space left on device" was a common error. The test
> servers had their hard disk full. This was surprising. The new (larger)
> servers seemed to have enough space to accommodate the number of tests they
> ran at the same time. Together with Antoine Musso and Tyler Cipriani, Dan
> identified and resolved two problems:
> 1) Some automated tests did not clean up after themselves.
> 2) The test-templates were stored on the "root disk" (the hard drive for
> the operating system), instead of the hard drive with space reserved for
> tests. This root disk is quite small, and is the same size on small servers
> and large servers.
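
Both points can be illustrated with a generic Python sketch. It is not taken from the Wikimedia CI configuration: `run_test_with_scratch_space` and its `scratch_root` parameter are hypothetical. `scratch_root` stands for a directory on the disk reserved for tests, and leaving it as None uses the system default temporary directory.

```python
import os
import tempfile


def run_test_with_scratch_space(scratch_root=None):
    """Create scratch files for one test and remove them when it ends.

    scratch_root is a hypothetical path on the disk reserved for tests.
    """
    with tempfile.TemporaryDirectory(dir=scratch_root) as workdir:
        path = os.path.join(workdir, "fixture.txt")
        with open(path, "w") as handle:
            handle.write("test data")
        existed_during_test = os.path.exists(path)
    # Leaving the 'with' block deletes the directory and its contents.
    return workdir, existed_during_test


workdir, existed = run_test_with_scratch_space()
print(existed, os.path.exists(workdir))  # True False
```

Tying cleanup to the end of the test keeps leftovers from piling up, and choosing the parent directory explicitly keeps scratch data off the small root disk.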
>
> – https://phabricator.wikimedia.org/T202160 /
> https://phabricator.wikimedia.org/T202457
>
> ## 🎉 Thanks!
>
> Thank you to everyone who has helped report, investigate, or resolve
> production errors. Including:
>
> Tpt
> Ankry
> Daimona
> Legoktm
> Volker_E
> Pchelolo
> Dan Duvall
> Gilles Dubuc
> Daniel Kinzler
> Umherirrender
> Greg Grossmeier
> Gergő Tisza (Tgr)
> Sam Reed (Reedy)
> Giuseppe Lavagetto
> Brad Jorsch (Anomie)
> Tim Starling (tstarling)
> Kosta Harlan (kostajh)
> Jaime Crespo (jcrespo)
> Antoine Musso (hashar)
> Roan Kattouw (Catrope)
> Adam WMDE (Addshore)
> Stephane Bisson (SBisson)
> Niklas Laxström (Nikerabbit)
> Thiemo Kreuz (thiemowmde)
> Subramanya Sastry (ssastry)
> This, that and the other (TTO)
> Manuel Aróstegui (Marostegui)
> Bartosz Dziewoński (matmarex)
> James D. Forrester (Jdforrester-WMF)
>
> Thanks!
>
> Until next time,
>
> – Timo Tijhof
> ________________________________
>
> Further reading:
>
> * August 2018 edition. –
> https://lists.wikimedia.org/pipermail/wikitech-l/2018-August/090594.html
> * July 2018 edition. –
> https://lists.wikimedia.org/pipermail/wikitech-l/2018-July/090363.html
>
> Footnotes:
>
> [1] Incidents. –
> https://wikitech.wikimedia.org/wiki/Special:AllPages?from=Incident+documentation%2F20180809&to=Incident+documentation%2F20180922&namespace=0
> [2] Tasks closed. –
> https://phabricator.wikimedia.org/maniphest/query/wOuWkMNsZheu/#R
> [3] Tasks opened. –
> https://phabricator.wikimedia.org/maniphest/query/6HpdI76rfuDg/#R
> [4] Quiz on Wikiversity. –
> https://en.wikiversity.org/wiki/How_things_work_college_course/Conceptual_physics_wikiquizzes/Velocity_and_acceleration
> [5] Operate multiple datacenters. –
> https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l