https://bugzilla.wikimedia.org/show_bug.cgi?id=61882

--- Comment #1 from Gabriel Wicke <gwi...@wikimedia.org> ---
From Ryan's mail:
TL;DR:

Ensure batching doesn't occur by setting the batch value to 100%:

  service-restart --repo 'parsoid/deploy' --batch='100%'

Long version:

I spent some time tracking this down tonight. From the master the following
commands work reliably:

  1: salt -b '10%' -G 'deployment_target:parsoid' service.restart parsoid
  2: salt -b '10%' -G 'deployment_target:parsoid' deploy.restart 'parsoid/deploy'
  3: salt-run deploy.restart 'parsoid/deploy' '10%'

#2 is a wrapper function for service.restart that maps a repo to a service,
then restarts it (it has other future uses that make it necessary as well). #3
is a runner that can be easily referenced from peers like tin, which makes
handling security easier; it basically just calls #2.

From tin we call:

  sudo salt-call -l quiet --out json publish.runner deploy.restart 'parsoid/deploy','10%'

This calls #3 on the master via salt's publication system.

The problem here is that publish.runner has a default timeout. When that
timeout is reached, the runner stops executing. Part of the issue is that it
times out at all. The other part of this issue is that this command has no
timeout argument:
<http://docs.saltstack.com/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner>.

Since the command is run as a batch, it splits the list of minions into chunks
and calls salt deploy.restart on each chunk in turn. If the total time of all
restarts exceeds the timeout of publish.runner, any minions that haven't been
called yet never are. Since we currently have a small number of minions, this
usually results in a single minion being left out. If we increased the number
of minions, many more would be left out.
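
To make the failure mode concrete, here is a rough Python sketch of the
behaviour described above. It is not salt's actual code: the function name,
the minion names, the 3 second restart time and the 25 second timeout are all
made up for illustration, and the batch size rounding is an assumption.

```python
import math


def unreached_minions(minions, batch_percent, restart_seconds, runner_timeout):
    """Return the minions whose batch never starts before the runner stops.

    Hypothetical model: batches run one after another, each taking
    restart_seconds, and the runner stops once runner_timeout has elapsed.
    """
    # Assumption: batch size is the given percentage of all minions,
    # rounded up, and never smaller than one minion.
    batch_size = max(1, math.ceil(len(minions) * batch_percent / 100))
    elapsed = 0.0
    for start in range(0, len(minions), batch_size):
        if elapsed >= runner_timeout:
            # The runner has stopped, so this batch and every later
            # batch is never called.
            return minions[start:]
        elapsed += restart_seconds
    return []


# Ten made-up minion names; the real host list is not part of this sketch.
minions = ["minion%d" % i for i in range(1, 11)]

# 10% batches: ten sequential restarts of 3s each against a 25s timeout.
left_out_batched = unreached_minions(minions, 10, 3, 25)

# 100% batch: a single batch that starts immediately, so nothing is skipped.
left_out_unbatched = unreached_minions(minions, 100, 3, 25)
```

With these made-up numbers the 10% run leaves out only the last minion, and
the 100% run leaves out none, which matches the pattern seen in practice.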

I've opened a bug for this:

  https://github.com/saltstack/salt/issues/10814

I've also added a pull request to allow a timeout argument for publish.runner:

  https://github.com/saltstack/salt/pull/10815

In the meantime, it's possible to work around this issue by ensuring that
batching doesn't occur, which is done by setting the batch value to 100%:

  service-restart --repo 'parsoid/deploy' --batch='100%'

I'm assuming that, since parsoid is doing graceful restarts, this shouldn't be
a problem.
