Let me clarify the reasoning behind the idea:
We realized that some schema changes (which used to be scheduled like other
deployments) no longer take 1 hour: they can take 1 month, running
continuously, like https://phabricator.wikimedia.org/T139090 , which
affects 3 of our largest tables. Also, they no longer require read-only
mode or affect code in any way (unless they are a prerequisite).
On the other hand, a schema change, combined with high read or write load
from long-running maintenance jobs, like those of the updateCollation
script or any other (those were just an example), could potentially make
lag a worse problem: a single transaction has to store pending changes
during its lifetime, and long-running reads can block and create pileups due
to metadata locking. We want to avoid those, as they have certainly caused
infrastructure issues in the past.
So, in summary, regular deployments are mutually exclusive, but
long-running maintenance tasks can still affect each other. This is a way
for me (and others) to have visibility into those potential negative
interactions, and make sure we can coordinate: "You are doing work on
enwiki? No problem, we will just run this task for commons". "You need to
do an emergency data recovery? I will do this other task later, since it
can wait". Even if only DBAs use it, it is already useful for avoiding
incompatible changes at the same time. But it will be even more useful if
everybody uses it!
On Thu, Sep 22, 2016 at 4:27 PM, Alex Monk <am...@wikimedia.org> wrote:
> I had been assuming that puppetised crons were not really relevant...
> On 22 September 2016 at 15:19, Guillaume Lederrey <gleder...@wikimedia.org> wrote:
>> Increasing visibility sounds like a great idea! How far do we want to
>> go in that direction? In particular, I'm thinking of a few of the
>> crons we have for Cirrus. For example, we do have daily crons on
>> terbium that re-generate the suggester indices. Those can run for >
>> My understanding is that those kinds of crons should not be considered
>> scripts, but standard working parts of the system. Adding them will
>> probably generate more noise than useful information. Is this a
>> reasonable understanding?
>> On Wed, Sep 21, 2016 at 12:29 AM, Greg Grossmeier <g...@wikimedia.org>
>> > In an effort to reduce surprises and potential mishaps it is now
>> > required to include any long running tasks in the deployment
>> > calendar.
>> > "Long running tasks" include any script that is run on production 'work
>> > machines' such as terbium and that lasts for longer than ~1 hour. Think:
>> > migration and maintenance scripts.
>> > This was discussed and proposed in T144661.
>> > Best,
>> > Greg
>> >  https://wikitech.wikimedia.org/wiki/Deployments
>> > Relevant diff:
>> > https://wikitech.wikimedia.org/w/index.php?diff=850923&oldid=850244
>> >  https://phabricator.wikimedia.org/T144661
>> > --
>> > | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E |
>> > | Release Team Manager A18D 1138 8E47 FAC8 1C7D |
>> > _______________________________________________
>> > Engineering mailing list
>> > engineer...@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/engineering
>> Guillaume Lederrey
>> Operations Engineer, Discovery
>> Wikimedia Foundation
>> UTC+2 / CEST
>> Wikitech-l mailing list
> Alex Monk
> VisualEditor/Editing team