Great writeup, Timo — I'm loving the very long (and very nicely laid out) list of helpers! 😁👍
Cheers,
Deb

--
deb tankersley
program manager, engineering
Wikimedia Foundation

On Tue, Sep 25, 2018 at 1:42 PM Krinkle <[email protected]> wrote:

> 📘 Read this post on Phabricator at
> https://phabricator.wikimedia.org/phame/live/1/post/119/
> ________________________________
>
> How’d we do in striving for operational excellence last month? Read on to
> find out!
>
> - Month in numbers.
> - Current problems.
> - Highlighted stories.
>
> ## Month in numbers
>
> * 1 documented incident since August 9. [1]
> * 113 Wikimedia-prod-error tasks closed since August 9. [2]
> * 99 Wikimedia-prod-error tasks created since August 9. [3]
>
> ## Current problems
>
> Frequent:
> * [MediaWiki-Logging] Exception from Special:Log (public GET). –
> https://phabricator.wikimedia.org/T201411
> * [Graph] Warning "data error" from ApiGraph in gzdecode. –
> https://phabricator.wikimedia.org/T184128
> * [RemexHtml] Exception "backtrack_limit exhausted" from search index jobs.
> – https://phabricator.wikimedia.org/T201184
>
> Other:
> * [MediaWiki-Redirects] Exception from NS_MEDIA redirect (public GET). –
> https://phabricator.wikimedia.org/T203942
>
> This is an oldie: (Well..., it's an oldie where I come from... 🎸)
> * [FlaggedRevs] Exception from Special:ProblemChanges (since 2011). –
> https://phabricator.wikimedia.org/T176232
>
> Terminology:
> * An Exception (or fatal) causes user actions to be aborted. For example, a
> page would display "Exception: Unable to render page", instead of the
> article content.
> * A Warning (or non-fatal, or error) can produce page views that are
> technically unaware of a problem, but may show corrupt or incomplete
> information. For example, an article would display the word "null" instead
> of the actual content. Or, a user may be told "You have (null) new
> messages."
>
> The combined volume of infrequent non-fatal errors is high. This limits our
> ability to automatically detect whether a deployment caused problems.
> The “public GET” risks in particular can cause (and have caused) alerts to
> fire that notify Operations of wikis potentially being down. Such exceptions
> must not be publicly exposed.
>
> With that behind us... Let’s celebrate this month’s highlights!
>
> ## *️⃣ Quiz defect – "0" is not nothing!
>
> Tyler Cipriani (Release Engineering) reported an error in Quiz. Wikiversity
> uses Quiz for interactive learning. Editors define quizzes in the source
> text (wikitext). The Quiz program processes this text, creates checkboxes
> with labels, and sends them to a user. When the sending part failed, "Error:
> Undefined index" appeared in the logs. Volunteer Umherirrender
> investigated.
>
> A line in the source text can define a question, an answer, or nothing
> at all. The code that creates checkboxes needs to decide between
> "something" and "nothing". The code utilised the PHP "if" statement for
> this, which compares a value to True and False. The answers to a quiz can
> be any text, which means PHP first transforms the text to one of True or
> False. In doing so, values like "0" became False. This meant the code
> thought "0" was not an answer. The code responsible for sending checkboxes
> did not have this problem. When the code tried to access the checkbox to
> send, it did not exist. Hence, "Error: Undefined index".
>
> Umherirrender fixed the problem by using a strict comparison. A strict
> comparison doesn't transform a value first; it only compares.
>
> – https://phabricator.wikimedia.org/T196684
>
> ## *️⃣ PageTriage enters JobQueue for better performance
>
> Kosta Harlan (from Audiences's Growth team) investigated a warning for
> PageTriage. This extension provides the New Pages Feed tool on the English
> Wikipedia. Each page in the feed has metadata, usually calculated when an
> editor creates a page. Sometimes, this is not available. Then, it must be
> calculated on-demand, when a user triages pages. So far, so good.
> The information was then saved to the database for re-use by other triagers.
> This last part caused the serious performance warning: "Unexpected database
> writes".
>
> Database changes must not happen on page views. The database has many
> replicas for reading, but only one "master" for all writing. We avoid using
> the master during page views to make our systems independent. This is a key
> design principle for MediaWiki performance. [5] It lets a secondary data
> centre build pages without connecting to the primary (which can be far
> away).
>
> Kosta addressed the warning by improving the code that saves the calculated
> information. Instead of saving it immediately, an instruction is now sent
> via a job queue, after the page view is ready. This job queue then
> calculates and saves the information to the master database. The master
> synchronises it to replicas, and then page views can use it.
>
> – https://phabricator.wikimedia.org/T199699 /
> https://gerrit.wikimedia.org/r/455870
>
> ## *️⃣ Tomorrow, may be sooner than you think
>
> After developers submit code to Gerrit, they eagerly await the result from
> Jenkins, an automated test runner. It sometimes incorrectly reported a
> problem with the MergeHistory feature. The code assumed that the tests
> would finish by "tomorrow".
>
> It might be safe to assume our tests will not take one day to finish.
> Unfortunately, the programming utility "strtotime" does not interpret
> "tomorrow" as "this time tomorrow". Instead, it means "the start of
> tomorrow". In other words, the next strike of midnight! The tests use UTC
> as the neutral timezone.
>
> Every day in the 15 minutes before 5 PM in San Francisco (which is midnight
> UTC), code submitted to Code Review could have mysteriously failing tests.
>
> – Continue at https://gerrit.wikimedia.org/r/452873
>
> ## *️⃣ Continuous Whac-A-Mole
>
> In August, developers started to notice rare and mysterious failures from
> Jenkins.
> No obvious cause or solution was known at that time.
>
> Later that month, Dan Duvall (Release Engineering team) started exploring
> ways to run our tests faster. Before, we had many small virtual servers,
> where each server runs only one test at a time. The idea: Have a smaller
> group of much larger virtual servers where each server could run many tests
> at the same time. We hope that during busier times this will better share
> the resources between tests. And, during less busy times, allow a single
> test to use more resources.
>
> As implementation of this idea began, the mysterious test failures became
> commonplace. "No space left on device" was a common error. The test
> servers' hard disks were full. This was surprising. The new (larger)
> servers seemed to have enough space to accommodate the number of tests they
> ran at the same time. Together with Antoine Musso and Tyler Cipriani, Dan
> identified and resolved two problems:
> 1) Some automated tests did not clean up after themselves.
> 2) The test-templates were stored on the "root disk" (the hard drive for
> the operating system), instead of the hard drive with space reserved for
> tests. This root disk is quite small, and is the same size on small servers
> and large servers.
>
> – https://phabricator.wikimedia.org/T202160 /
> https://phabricator.wikimedia.org/T202457
>
> ## 🎉 Thanks!
>
> Thank you to everyone who has helped report, investigate, or resolve
> production errors.
> Including:
>
> Tpt
> Ankry
> Daimona
> Legoktm
> Volker_E
> Pchelolo
> Dan Duvall
> Gilles Dubuc
> Daniel Kinzler
> Umherirrender
> Greg Grossmeier
> Gergő Tisza (Tgr)
> Sam Reed (Reedy)
> Giuseppe Lavagetto
> Brad Jorsch (Anomie)
> Tim Starling (tstarling)
> Kosta Harlan (kostajh)
> Jaime Crespo (jcrespo)
> Antoine Musso (hashar)
> Roan Kattouw (Catrope)
> Adam WMDE (Addshore)
> Stephane Bisson (SBisson)
> Niklas Laxström (Nikerabbit)
> Thiemo Kreuz (thiemowmde)
> Subramanya Sastry (ssastry)
> This, that and the other (TTO)
> Manuel Aróstegui (Marostegui)
> Bartosz Dziewoński (matmarex)
> James D. Forrester (Jdforrester-WMF)
>
> Thanks!
>
> Until next time,
>
> – Timo Tijhof
> ________________________________
>
> Further reading:
>
> * August 2018 edition. –
> https://lists.wikimedia.org/pipermail/wikitech-l/2018-August/090594.html
> * July 2018 edition. –
> https://lists.wikimedia.org/pipermail/wikitech-l/2018-July/090363.html
>
> Footnotes:
>
> [1] Incidents. –
> https://wikitech.wikimedia.org/wiki/Special:AllPages?from=Incident+documentation%2F20180809&to=Incident+documentation%2F20180922&namespace=0
> [2] Tasks closed. –
> https://phabricator.wikimedia.org/maniphest/query/wOuWkMNsZheu/#R
> [3] Tasks opened. –
> https://phabricator.wikimedia.org/maniphest/query/6HpdI76rfuDg/#R
> [4] Quiz on Wikiversity. –
> https://en.wikiversity.org/wiki/How_things_work_college_course/Conceptual_physics_wikiquizzes/Velocity_and_acceleration
> [5] Operate multiple datacenters. –
> https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
