On Mon, 2018-11-26 at 14:24 +0200, Klecho wrote:
> Hi again,
>
> Just made one simple "parallel shutdown" test with a strange result,
> confirming the problem I've described.
>
> Created a few dummy resources, each of them taking 60s to stop. No
> constraints at all. After that, issued "stop" to all of them, one by
> one.
>
> A stop operation wasn't attempted for any of the rest until the first
> resource stopped.
>
> When the first resource stopped, all the rest stopped at the same
> moment, 120s after the stop commands were issued.
>
> This confirms that if many resources (VMs) need to be stopped and the
> first one starts some update (and a big stop timeout is set), a stop
> attempt for the rest won't be made at all until the first is done.
>
> Why is this so, and is there a way to avoid it?
It has to do with Pacemaker's concept of a "transition". When an
interesting event happens (like your first stop), Pacemaker calculates
what actions need to be taken and then performs them. A transition may
be interrupted between actions by a new event, but any action already
begun must complete before a new transition can begin.

What happened here is that when you stopped the first resource, a
transition was created with that one stop, and that stop was initiated.
When the later stops came in, they would cause a new transition, but the
first stop had to complete before that transition could begin.

There are a few ways around this:

* Shutdown will stop all resources on its own, so you could skip the
  stopping altogether.

* If you prefer to ensure all the resources stop successfully before
  you start the shutdown, you could batch all the "stop" changes into
  one file and apply that to the config. A stop command sets the
  resource's target-role meta-attribute to Stopped. Normally, this is
  applied directly to the live configuration, so it takes effect
  immediately. However, crm and pcs both offer ways to batch commands
  in a file, then apply it all at once.

* Or, you could set the node(s) to standby mode as a transient
  attribute (using attrd_updater). That would cause all resources to
  move off those nodes (and stop, if there are no nodes remaining).
  Transient node attributes are erased every time a node leaves the
  cluster, so it would only have effect until shutdown; when the node
  rejoined, it would be in regular mode.

>
> On 11/20/18 12:40 PM, Klechomir wrote:
> > Hi list,
> > Bumped onto the following issue lately:
> >
> > When multiple VMs are given shutdown right one-after-another and
> > the shutdown of the first VM takes long, the others aren't being
> > shut down at all before the first one stops.
> >
> > "batch-limit" doesn't seem to affect this.
> > Any suggestions why this could happen?
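The second and third workarounds above could be sketched roughly as
follows with pcs and attrd_updater. This is an unverified sketch, not a
tested procedure: the resource and node names (rsc1, rsc2, node1) and
the file name are hypothetical, and exact option spellings may differ
between pcs/Pacemaker versions, so check the man pages on your system.

```shell
# Batching: work against a saved copy of the CIB, so none of the
# changes take effect until the file is pushed back in one step.
pcs cluster cib > stop-batch.xml              # snapshot the live CIB
pcs -f stop-batch.xml resource disable rsc1   # sets target-role=Stopped
pcs -f stop-batch.xml resource disable rsc2   #   ... in the file only
pcs cluster cib-push stop-batch.xml           # apply all stops at once,
                                              # in a single transition

# Standby: set standby as a transient node attribute, so all resources
# move off (or stop); the attribute is erased when the node leaves the
# cluster, so the node rejoins in regular mode.
attrd_updater --node node1 --name standby --update on
```

crm offers a comparable batch workflow (editing a shadow configuration
and committing it), if you use crmsh rather than pcs.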
> >
> > Best regards,
> > Klecho
-- 
Ken Gaillot <kgail...@redhat.com>
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org