Thanks for this report, Timo. It's great to celebrate successes like these,
and I appreciate the time you are clearly taking to write about them. We
don't do this enough.

Thanks to Umherirrender, Kosta, Dan, and all others involved in this
initiative!

On Tue, Sep 25, 2018, 11:42 AM Krinkle <[email protected]> wrote:

> 📘 Read this post on Phabricator at
> https://phabricator.wikimedia.org/phame/live/1/post/119/
> ________________________________
>
> How’d we do in our striving for operational excellence last month? Read on
> to find out!
>
> - Month in numbers.
> - Current problems.
> - Highlighted stories.
>
> ## Month in numbers
>
> * 1 documented incident since August 9. [1]
> * 113 Wikimedia-prod-error tasks closed since August 9. [2]
> * 99 Wikimedia-prod-error tasks created since August 9. [3]
>
> ## Current problems
>
> Frequent:
> * [MediaWiki-Logging] Exception from Special:Log (public GET). –
> https://phabricator.wikimedia.org/T201411
> * [Graph] Warning "data error" from ApiGraph in gzdecode. –
> https://phabricator.wikimedia.org/T184128
> * [RemexHtml] Exception "backtrack_limit exhausted" from search index jobs.
> – https://phabricator.wikimedia.org/T201184
>
> Other:
> * [MediaWiki-Redirects] Exception from NS_MEDIA redirect (public GET). –
> https://phabricator.wikimedia.org/T203942
>
> This is an oldie: (Well..., it's an oldie where I come from... 🎸)
> * [FlaggedRevs] Exception from Special:ProblemChanges (since 2011). –
> https://phabricator.wikimedia.org/T176232
>
> Terminology:
> * An Exception (or fatal) causes user actions to be aborted. For example, a
> page would display "Exception: Unable to render page" instead of the article
> content.
> * A Warning (or non-fatal, or error) can produce page views that are
> technically unaware of a problem, but may show corrupt or incomplete
> information.  For example, an article would display the word "null" instead
> of the actual content. Or, a user may be told "You have (null) new
> messages."
>
> The combined volume of infrequent non-fatal errors is high. This limits our
> ability to automatically detect whether a deployment caused problems. The
> “public GET” risks in particular can cause (and have caused) alerts to fire
> that notify Operations of wikis potentially being down. Such exceptions must
> not be publicly exposed.
>
> With that behind us... Let’s celebrate this month’s highlights!
>
> ## *️⃣ Quiz defect – "0" is not nothing!
>
> Tyler Cipriani (Release Engineering) reported an error in Quiz. Wikiversity
> uses Quiz for interactive learning. [4] Editors define quizzes in the source
> text (wikitext). The Quiz program processes this text, creates checkboxes
> with labels, and sends them to a user. When the sending part failed, "Error:
> Undefined index" appeared in the logs. Volunteer Umherirrender
> investigated.
>
> A line in the source text can: define a question, or an answer, or nothing
> at all. The code that creates checkboxes needs to decide between
> "something" and "nothing". The code utilised the PHP "if" statement for
> this, which compares a value to True and False. The answers to a quiz can
> be any text, which means PHP first transforms the text to one of True or
> False. In doing so, values like "0" became False. This meant the code
> thought "0" was not an answer. The code responsible for sending checkboxes
> did not have this problem. When the code tried to access the checkbox to
> send, it did not exist. Hence, "Error: Undefined index".
>
> Umherirrender fixed the problem by using a strict comparison. A strict
> comparison doesn't transform a value first; it only compares.
>
> – https://phabricator.wikimedia.org/T196684
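
A rough Python sketch of the pitfall described above, not the Quiz extension's actual code. Python itself treats the string "0" as true, so the hypothetical helper `php_truthy` models PHP's string-to-boolean conversion, in which only "" and "0" become false. The function names and the comparison against the empty string are assumptions made for illustration; the exact comparison used in the real fix is not shown here.

```python
def php_truthy(text: str) -> bool:
    """Model PHP's string-to-boolean conversion: only "" and "0" are false."""
    return text not in ("", "0")


def parse_answers_loose(lines: list[str]) -> list[str]:
    """Keep lines that pass a PHP-style `if ($line)` check (buggy for "0")."""
    return [line for line in lines if php_truthy(line)]


def parse_answers_strict(lines: list[str]) -> list[str]:
    """Keep lines that are not exactly the empty string (strict comparison)."""
    return [line for line in lines if line != ""]


# Hypothetical source lines: three answers and one empty line.
answers = ["0", "1", "", "42"]
loose = parse_answers_loose(answers)
strict = parse_answers_strict(answers)
print(loose)   # the answer "0" is silently dropped
print(strict)  # the answer "0" is kept
```

With the loose check, a later step that expects a checkbox for "0" would find none, which matches the "Undefined index" symptom in the story.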
>
> ## *️⃣ PageTriage enters JobQueue for better performance
>
> Kosta Harlan (from the Audiences Growth team) investigated a warning for
> PageTriage. This extension provides the New Pages Feed tool on the English
> Wikipedia. Each page in the feed has metadata, usually calculated when an
> editor creates a page. Sometimes, this is not available. Then, it must be
> calculated on-demand, when a user triages pages. So far, so good. The
> information was then saved to the database for re-use by other triagers.
> This last part caused the serious performance warning: "Unexpected database
> writes".
>
> Database changes must not happen on page views. The database has many
> replicas for reading, but only one "master" for all writing. We avoid using
> the master during page views to make our systems independent. This is a key
> design principle for MediaWiki performance. [5] It lets a secondary data
> centre build pages without connecting to the primary (which can be far
> away).
>
> Kosta addressed the warning by improving the code that saves the calculated
> information. Instead of saving it immediately, an instruction is now sent
> via a job queue, after the page view is ready. This job queue then
> calculates and saves the information to the master database. The master
> synchronises it to replicas, and then page views can use it.
>
> – https://phabricator.wikimedia.org/T199699 /
> https://gerrit.wikimedia.org/r/455870
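
The deferral described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not PageTriage or MediaWiki code: `master_db`, `job_queue`, `compute_metadata`, `handle_page_view`, and `run_jobs` are all hypothetical names, and an in-memory dict and deque stand in for the master database and the job queue.

```python
from collections import deque

master_db: dict[int, dict] = {}  # stand-in for the single writable master
job_queue: deque = deque()       # stand-in for the job queue


def compute_metadata(page_id: int) -> dict:
    """Hypothetical on-demand calculation of a page's metadata."""
    return {"page_id": page_id, "score": page_id * 2}


def handle_page_view(page_id: int) -> dict:
    """Serve the view without writing to the master; queue a job instead."""
    metadata = compute_metadata(page_id)
    job_queue.append(page_id)  # only the instruction is queued
    return metadata


def run_jobs() -> None:
    """Runs after the page view: calculate and save to the master."""
    while job_queue:
        page_id = job_queue.popleft()
        master_db[page_id] = compute_metadata(page_id)


view = handle_page_view(7)
writes_during_view = len(master_db)  # no write happened during the view
run_jobs()
```

The point of the design is that the page view only reads and enqueues, so it never needs a connection to the master.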
>
> ## *️⃣ Tomorrow may be sooner than you think
>
> After developers submit code to Gerrit, they eagerly await the result from
> Jenkins, an automated test runner. It sometimes incorrectly reported a
> problem with the MergeHistory feature. The code assumed that the tests
> would finish by "tomorrow".
>
> It might be safe to assume our tests will not take one day to finish.
> Unfortunately, the programming utility "strtotime" does not interpret
> "tomorrow" as "this time tomorrow". Instead, it means "the start of
> tomorrow". In other words, the next strike of midnight! The tests use UTC
> as the neutral timezone.
>
> Every day in the 15 minutes before 5 PM in San Francisco (which is midnight
> UTC), code submitted to Code Review could have mysteriously failing tests.
>
> – Continue at https://gerrit.wikimedia.org/r/452873
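
To illustrate the difference between the two readings of "tomorrow", here is a small Python sketch. It does not call PHP's `strtotime`; the helper `start_of_tomorrow` only models the "start of tomorrow" behaviour the post describes, and the timestamp is an arbitrary example chosen to fall shortly before midnight UTC.

```python
from datetime import datetime, timedelta, timezone


def start_of_tomorrow(now: datetime) -> datetime:
    """Next midnight in the timezone of `now` (the "start of tomorrow")."""
    next_day = now.date() + timedelta(days=1)
    return datetime(next_day.year, next_day.month, next_day.day,
                    tzinfo=now.tzinfo)


def this_time_tomorrow(now: datetime) -> datetime:
    """Exactly 24 hours after `now`."""
    return now + timedelta(days=1)


# Hypothetical test start time: ten minutes before midnight UTC.
now = datetime(2018, 8, 15, 23, 50, tzinfo=timezone.utc)
margin_midnight = start_of_tomorrow(now) - now
margin_full_day = this_time_tomorrow(now) - now
print(margin_midnight)  # time left until the start of tomorrow
print(margin_full_day)  # time left until this time tomorrow
```

A test that starts at 23:50 UTC has only ten minutes before "tomorrow" arrives, rather than a full day.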
>
> ## *️⃣ Continuous Whac-A-Mole
>
> In August, developers started to notice rare and mysterious failures from
> Jenkins. No obvious cause or solution was known at that time.
>
> Later that month, Dan Duvall (Release Engineering team) started exploring
> ways to run our tests faster. Before, we had many small virtual servers,
> where each server runs only one test at a time. The idea: Have a smaller
> group of much larger virtual servers where each server could run many tests
> at the same time. We hope that during busier times this will better share
> the resources between tests. And, during less busy times, allow a single
> test to use more resources.
>
> As implementation of this idea began, the mysterious test failures became
> commonplace. "No space left on device" was a common error. The test
> servers' hard disks were full. This was surprising. The new (larger)
> servers seemed to have enough space to accommodate the number of tests they
> ran at the same time. Together with Antoine Musso and Tyler Cipriani, Dan
> identified and resolved two problems:
> 1) Some automated tests did not clean up after themselves.
> 2) The test-templates were stored on the "root disk" (the hard drive for
> the operating system), instead of the hard drive with space reserved for
> tests. This root disk is quite small, and is the same size on small servers
> and large servers.
>
> – https://phabricator.wikimedia.org/T202160 /
> https://phabricator.wikimedia.org/T202457
>
> ## 🎉 Thanks!
>
> Thank you to everyone who has helped report, investigate, or resolve
> production errors, including:
>
> Tpt
> Ankry
> Daimona
> Legoktm
> Volker_E
> Pchelolo
> Dan Duvall
> Gilles Dubuc
> Daniel Kinzler
> Umherirrender
> Greg Grossmeier
> Gergő Tisza (Tgr)
> Sam Reed (Reedy)
> Giuseppe Lavagetto
> Brad Jorsch (Anomie)
> Tim Starling (tstarling)
> Kosta Harlan (kostajh)
> Jaime Crespo (jcrespo)
> Antoine Musso (hashar)
> Roan Kattouw (Catrope)
> Adam WMDE (Addshore)
> Stephane Bisson (SBisson)
> Niklas Laxström (Nikerabbit)
> Thiemo Kreuz (thiemowmde)
> Subramanya Sastry (ssastry)
> This, that and the other (TTO)
> Manuel Aróstegui (Marostegui)
> Bartosz Dziewoński (matmarex)
> James D. Forrester (Jdforrester-WMF)
>
> Thanks!
>
> Until next time,
>
> – Timo Tijhof
> ________________________________
>
> Further reading:
>
> * August 2018 edition. –
> https://lists.wikimedia.org/pipermail/wikitech-l/2018-August/090594.html
> * July 2018 edition. –
> https://lists.wikimedia.org/pipermail/wikitech-l/2018-July/090363.html
>
> Footnotes:
>
> [1] Incidents. –
> https://wikitech.wikimedia.org/wiki/Special:AllPages?from=Incident+documentation%2F20180809&to=Incident+documentation%2F20180922&namespace=0
> [2] Tasks closed. –
> https://phabricator.wikimedia.org/maniphest/query/wOuWkMNsZheu/#R
> [3] Tasks opened. –
> https://phabricator.wikimedia.org/maniphest/query/6HpdI76rfuDg/#R
> [4] Quiz on Wikiversity. –
> https://en.wikiversity.org/wiki/How_things_work_college_course/Conceptual_physics_wikiquizzes/Velocity_and_acceleration
> [5] Operate multiple datacenters. –
> https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

-- 
Jon Robson
twitter: @jdlrobson
linkedin: https://www.linkedin.com/in/jorobson/