TLDR: Those with shell access at WMF can now run maintenance scripts on mwdebug 
hosts, and use the *--profiler=text* option to produce a report detailing how 
long the script spent in each MediaWiki component, class, and function.

== *What* ==

The MediaWiki platform at WMF consists of broadly four different deployments: 
app servers, api_app servers, job runners, and maint servers (diagram 
<https://wikitech.wikimedia.org/wiki/MediaWiki_at_WMF#/media/File:MediaWiki_infrastructure_2022.png>).
 Each can be thought of as its own service for a specific purpose, composed of 
a subset of components from the MediaWiki codebase with fast local access to 
each. The largest of these is the appservers cluster, which is backed by 150 
dedicated hardware servers across the two application data centers 
<https://wikitech.wikimedia.org/wiki/Data_centers>, and is responsible for 
responding to index.php (i.e. page views) and load.php (CSS, JS, and 
localisation via ResourceLoader).

Today, we focus on the smallest of these: the mwmaint servers 
<https://wikitech.wikimedia.org/wiki/Maintenance_server>. This service is backed 
by two heavy-duty servers, one in each data center, that autonomously run 
essential tasks on a predefined schedule (i.e. not in direct response to a user 
action). 
Each of these ~50 different tasks is implemented as a MediaWiki maintenance 
script. Important examples include: sending email notifications (Echo 
extension), timely pruning of sensitive PII (CheckUser extension), computing 
mentee and link recommendation data (GrowthExperiments), and reclaiming disk 
space for expired caches (core/ParserCache).

== *Why* ==

We have detailed debug performance profiling in production for web requests via 
the WikimediaDebug extension, and we have detailed profiling in local 
development for both web requests and maintenance scripts (Docker recipe 
<https://www.mediawiki.org/wiki/MediaWiki-Docker/Configuration_recipes/Profiling>).

What was missing was a way to profile maintenance scripts in production. This 
is important, as maintenance scripts tend to take many minutes or hours to 
process
vast amounts of production data. While generally easy to debug locally for 
functional analysis, the performance bottlenecks individual teams care about 
are likely specific to the size of the data and the performance of other 
production components.

Thanks also to Ahmon Dancy (RelEng), Giuseppe Lavagetto (SRE), and Aaron Schulz 
(Performance Team) for making this work possible, and Niklas Laxström (LangEng) 
for coming up with the idea.

== *What's New* ==

Documentation: 
https://wikitech.wikimedia.org/wiki/WikimediaDebug#Plaintext_CLI_profile

To profile a Maintenance script, run the script from the shell with *mwscript* 
as you normally would, but instead of connecting your terminal to 
mwmaint1002.eqiad.wmnet, connect to one of the *mwdebug* hosts (such as 
mwdebug1001.eqiad.wmnet). Then pass the *--profiler=text* option to generate a 
report with the performance analysis, which will be printed after the task is 
finished. Like so:

> $ mwscript showSiteStats.php --wiki=nlwiki --profiler=text
> Number of articles:  2122688
> Number of users   :  1276507
> 
> <!--
>  100.00% 114.964   1 - main()
>  …
>  22.42% 25.776     1 - ShowSiteStats::execute
>  16.61% 19.096     2 - Wikimedia\Rdbms\LoadBalancer::getServerConnection
>   4.80% 5.522      1 - Maintenance::shutdown
>   4.41% 5.065      1 - Wikimedia\Rdbms\Database::initConnection
>   3.07% 3.530      1 - DeferredUpdates::doUpdates
>   2.66% 3.061      1 - Wikimedia\Rdbms\Database::select
>   2.38% 2.739      1 - Wikimedia\Rdbms\Database::query
>   1.95% 2.240      1 - section.SELECT * FROM `site_stats` LIMIT N 
>   1.48% 1.700      1 - Wikimedia\Rdbms\DatabaseMysqli::mysqlConnect
>   …
> -->
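
Note that the report is emitted inside an HTML-style comment at the end of the 
script's output. If you want to separate the report from the script's own 
output, something like the following sketch can help (the script, wiki, and 
marker-matching are illustrative assumptions on my part, not an official 
interface):

```shell
# Illustrative sketch: run a maintenance script with profiling enabled and
# keep only the profiler report, i.e. the lines between "<!--" and "-->".
mwscript showSiteStats.php --wiki=nlwiki --profiler=text 2>&1 \
  | sed -n '/<!--/,/-->/p'
```

The sed expression simply prints the inclusive range between the two comment 
markers; if a future report format drops those markers, the filter would need 
adjusting.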

== *A peek behind the curtain* ==

Read on if you'd like to learn what hurdles we had to overcome for this to 
"simply" work in production, like it did for local development. The journey 
started when Niklas (WMF LangEng) proposed the idea at 
https://phabricator.wikimedia.org/T253547.

*Firstly*, the profiler engine. In 2019 (blogpost 
<https://techblog.wikimedia.org/2019/12/16/wikimediadebug-v2-is-here/>), after 
we migrated from HHVM to PHP 7, we had to look for a new profiler engine for 
backend performance. We adopted the open source php-tideways package, and this 
has powered our browser-facing profiler since. Naturally, this was already 
installed on the mwdebug servers for that purpose. However, the package, and 
the accompanying *rdtsc* setting, were configured only for php-fpm (the web 
server); they were not yet enabled for php-cli (the command line).
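
Concretely, this amounts to giving php-cli the same ini settings that php-fpm 
already had. A hedged sketch of what such a snippet might contain (the 
extension filename, setting name, and target path are assumptions based on the 
public php-tideways packaging; the real values live in WMF's Puppet-managed 
configuration):

```shell
# Illustrative sketch, not the literal WMF configuration.
# In production this would land somewhere like /etc/php/7.2/cli/conf.d/
# (path assumed); here we just write the snippet locally to show its shape.
cat > tideways-cli.ini <<'EOF'
; Load the Tideways XHProf profiler engine for the CLI SAPI ...
extension = tideways_xhprof.so
; ... and use the CPU's rdtsc counter as the timer clock source.
tideways_xhprof.clock_use_rdtsc = 1
EOF
```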

*Secondly*, the Profiler component in MediaWiki core had gotten out of sync 
with the needs of the Skin and Maintenance components. Over years of 
refactoring, as more parts of each component gained an active owner, the parts 
that lacked one eroded and stopped working, including the integration between 
these components. We decided to take active ownership of the
remaining parts of MediaWiki-Core-Profiler and fix the disconnect. The meta 
work for that included identifying and re-triaging open issues under a new 
#mediawiki-core-profiler 
<https://phabricator.wikimedia.org/tag/mediawiki-core-profiler/> tag, 
automating discovery of new issues to our team inbox via a Phabricator Herald 
rule, enlisting ourselves on the Maintainers page 
<https://www.mediawiki.org/wiki/Developers/Maintainers>, and automating 
discovery of changesets to our code review dashboard 
<https://gerrit.wikimedia.org/r/p/wikimedia/+/dashboard/teams:performance>.

The "output" step of the profiler was originally the responsibility of a 
wfLogProfilingData function that both webpages (Skin) and CLI (Maintenance) 
called upon toward the end of the response. After this function was 
deprecated, and further refactoring in the Skin and Maintenance components took 
on (some of) its responsibilities directly, the "output" step got lost. Adding
this back in was non-trivial because by now, the code in question had gained an 
unintentional dependency on the WebRequest object, which is not valid in a CLI 
context. The code in question was simplified and decoupled in change 725152 
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/725152/> and change 725440 
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/725440>, which then made it 
possible to add back the "output" step in change 838884 
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/838884/>. We also removed 
various options 
<https://gerrit.wikimedia.org/r/q/message:profiler+project:mediawiki/core+branch:master+is:merged>
 (such as CMDLINE_CALLBACK <https://phabricator.wikimedia.org/T305422>) that we 
chose not to support and maintain going forward (undocumented, unused, broken, 
or without a known use case after research).

*Thirdly*, in parallel with our work above, another team refactored MediaWiki's 
SettingsLoader and MaintenanceRunner components. The Profiler dates back more 
than a decade and still relied on the fact that settings could be changed by 
CLI script options such as *--profiler=text*. The new SettingsLoader and the 
refactored MaintenanceRunner streamlined the order of operations during process 
startup, which had the side effect of initialising the profiler slightly 
earlier than it used to. This meant the profiler would initialise before the 
*--profiler=text* option was applied, and so the profiler that was initialised 
was unconditionally the "null" profiler. This did not produce an error, since 
the null profiler is a valid configuration: it is, in effect, the configuration 
of all traffic besides ad-hoc debugging. I remedied this 
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/875396/> by recognising the 
CLI option as its own setting (separate from the main one) and passing it down 
via dependency injection, so that it no longer relies on a subtle order of 
operations.

The Profiler component is now significantly leaner than it was before, and its 
requirements are either explicitly coded through dependency injection 
requirements, or simplified/refactored such that they do not place requirements 
on other components.

*Fourthly*, the wmf-config repository, where we control how and when the 
MediaWiki Profiler can be enabled, had been changed by us the year before to 
feature a new sampling profiler producing detailed flame graphs of production 
traffic 
(blog post 
<https://techblog.wikimedia.org/2021/03/03/profiling-php-in-production-at-scale/>).
 I realised that this configuration assumed a web context. Another tweak over 
there <https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/874419/> 
made it safe to enable in CLI/Maintenance script contexts.

*Fifthly*, in order to actually enable the wmf-config/Profiler code in CLI, we 
needed to set the same *php.ini* configuration for php-cli as we already did 
years earlier for php-fpm (web server). In doing so, we ran into the problem 
that one server in particular (deploy1002, from where the MediaWiki train runs 
via Scap) had two mutually exclusive copies of the MediaWiki deployment 
(/srv/mediawiki and /srv/mediawiki-staging). This meant that preloading a 
profiler for either of them from php-cli would break the other. I flagged this 
long-standing technical debt to Ahmon Dancy (RelEng), who then went on a 
whole journey of his own to radically simplify our deployment servers to not 
have these two separate installations (T329857 
<https://phabricator.wikimedia.org/T329857>).

*Finally*, with a one-line change in Puppet 
<https://gerrit.wikimedia.org/r/c/operations/puppet/+/910882/> config, the CLI 
profiler is now enabled in production (thanks, Giuseppe Lavagetto from SRE!). 
All in all, this work made the different parts of the codebase, and the 
different parts of our platform, less divergent and more unified than before.

Thanks again to Aaron Schulz, Ahmon Dancy, Giuseppe Lavagetto, and Niklas 
Laxström for their help!

--
Timo Tijhof,
Principal Engineer,
Performance Team,
Wikimedia Foundation.
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
