Re: [ClusterLabs] VirtualDomain & parallel shutdown

2019-01-08 Thread Jan Pokorný
On 27/11/18 14:35 +0100, Jan Pokorný wrote:
> On 27/11/18 12:29 +0200, Klecho wrote:
>> Big thanks for the answer, but in your workarounds I don't see a solution
>> for the following simple case:
>> 
>> I have a few VMs (VirtualDomain RA) and just want to stop a few of them,
>> not all.
>> 
>> While the first VM is shutting down (target-role=Stopped), it starts some
>> slow update, which could take hours (because of this possible update case,
>> the stop timeout is set very large).
>> 
>> During these hours of update, no other VM can be stopped at all.
>> 
>> If this isn't avoidable, this could be quite a big flaw, because it blocks
>> basic functionality.
> 
> It looks like having transition "leaves", i.e. particular executive
> manipulations like stop/start operations, last on the order of tens of
> minutes or longer is not what pacemaker's design had in mind,
> as opposed to pushing asynchronicity to the extreme (at the cost
> of the complexity of the "orthogonality/non-interference tests",
> I think).

Note also that extended periods of time spent executing particular
OCF/LSB resource operations can result in relatively serious trouble
under some failure scenarios unless the agents are written in a
self-defensive manner (and carefully tested in practice):

https://lists.clusterlabs.org/pipermail/users/2019-January/016045.html

-- 
Cheers,
Jan (Poki)




Re: [ClusterLabs] VirtualDomain & parallel shutdown

2018-11-27 Thread Jan Pokorný
On 27/11/18 12:29 +0200, Klecho wrote:
> Big thanks for the answer, but in your workarounds I don't see a solution
> for the following simple case:
> 
> I have a few VMs (VirtualDomain RA) and just want to stop a few of them,
> not all.
> 
> While the first VM is shutting down (target-role=Stopped), it starts some
> slow update, which could take hours (because of this possible update case,
> the stop timeout is set very large).
> 
> During these hours of update, no other VM can be stopped at all.
> 
> If this isn't avoidable, this could be quite a big flaw, because it blocks
> basic functionality.

It looks like having transition "leaves", i.e. particular executive
manipulations like stop/start operations, last on the order of tens of
minutes or longer is not what pacemaker's design had in mind,
as opposed to pushing asynchronicity to the extreme (at the cost
of the complexity of the "orthogonality/non-interference tests",
I think).

But the shutdown procedure can be short-circuited, with possible HA
compromises, like this, can't it?
- put all the VMs you want to stop into an unmanaged state
  (is-managed=false)
- trigger the shutdown independently of the cluster management
- when they are indeed off (as also indicated by crm_mon if the
  monitor operation hasn't been disabled), they can be resurrected
  again when suitable (is-managed=true or dropping that property
  to designate an equivalent default)
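
For illustration, a minimal sketch of that workaround using pcs and virsh,
assuming a VirtualDomain resource named vm1 that maps to a libvirt domain
also named vm1 (both names hypothetical):

    # take the resource out of pacemaker's control (sets is-managed=false)
    pcs resource unmanage vm1

    # shut the guest down outside of the cluster; it may take as long as it needs
    virsh shutdown vm1

    # wait until the domain is really off (crm_mon will reflect this as well,
    # as long as the monitor operation hasn't been disabled)
    while [ "$(virsh domstate vm1)" != "shut off" ]; do sleep 10; done

    # hand control back to the cluster when suitable (is-managed=true again)
    pcs resource manage vm1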

There are a couple of questions, though, like how to ensure that
the particular VM won't also be shut down in the standard way, which
would trigger the stated problem again (as mentioned, node
shutdown shall be OK).  Customizing the resource agent to become
an active pacemaker observer/influencer, beside a purpose-built
notification mechanism and customization through attributes
and rules, feels terribly flawed.

-- 
Cheers,
Jan (Poki)




Re: [ClusterLabs] VirtualDomain & parallel shutdown

2018-11-27 Thread Klecho

Hi Ken,

Big thanks for the answer, but in your workarounds I don't see a
solution for the following simple case:


I have a few VMs (VirtualDomain RA) and just want to stop a few of
them, not all.


While the first VM is shutting down (target-role=Stopped), it starts
some slow update, which could take hours (because of this possible update
case, the stop timeout is set very large).


During these hours of update, no other VM can be stopped at all.

If this isn't avoidable, this could be quite a big flaw, because it
blocks basic functionality.


Best regards,

On 11/26/18 10:41 PM, Ken Gaillot wrote:

On Mon, 2018-11-26 at 14:24 +0200, Klecho wrote:

Hi again,

Just made one simple "parallel shutdown" test with a strange result,
confirming the problem I've described.

Created a few dummy resources, each of them taking 60s to stop. No
constraints at all. After that issued "stop" to all of them, one by
one.

Stop operation wasn't attempted for any of the rest until the first
resource stopped.

When the first resource stopped, all the rest stopped at the same
moment, 120s after the stop commands were issued.

This confirms that if many resources (VMs) need to be stopped and the
first one starts some update (and a big stop timeout is set), no stop
attempt for the rest will be made at all until the first one's stop
completes.

Why is this so and is there a way to avoid it?

It has to do with pacemaker's concept of a "transition".

When an interesting event happens (like your first stop), pacemaker
calculates what actions need to be taken and then does them. A
transition may be interrupted between actions by a new event, but any
event already begun must complete before a new transition can begin.

What happened here is that when you stopped the first resource, a
transition was created with that one stop, and that stop was initiated.
When the later stops came in, they would cause a new transition, but
that first stop has to complete before that transition can begin.

There are a few ways around this:

* Shutdown will stop all resources on its own, so you could skip the
stopping altogether.

* If you prefer to ensure all the resources stop successfully before
you start the shutdown, you could batch all the "stop" changes into one
file and apply that to the config. A stop command sets the resource's
target-role meta-attribute to Stopped. Normally, this is applied
directly to the live configuration, so it takes effect immediately.
However crm and pcs both offer ways to batch commands in a file, then
apply it all at once.

* Or, you could set the node(s) to standby mode as a transient
attribute (using attrd_updater). That would cause all resources to move
off those nodes (and stop if there are no nodes remaining). Transient
node attributes are erased every time a node leaves the cluster, so it
would only have effect until shutdown; when the node rejoined, it would
be in regular mode.


On 11/20/18 12:40 PM, Klechomir wrote:

Hi list,
Bumped onto the following issue lately:

When multiple VMs are given shutdown one after another and the
shutdown of the first VM takes long, the others aren't being shut
down at all until the first one stops.

"batch-limit" doesn't seem to affect this.
Any suggestions why this could happen?

Best regards,
Klecho



--
Klecho



Re: [ClusterLabs] VirtualDomain & parallel shutdown

2018-11-27 Thread Tomas Jelinek

On 26. 11. 18 at 21:41, Ken Gaillot wrote:

On Mon, 2018-11-26 at 14:24 +0200, Klecho wrote:

Hi again,

Just made one simple "parallel shutdown" test with a strange result,
confirming the problem I've described.

Created a few dummy resources, each of them taking 60s to stop. No
constraints at all. After that issued "stop" to all of them, one by
one.

Stop operation wasn't attempted for any of the rest until the first
resource stopped.

When the first resource stopped, all the rest stopped at the same
moment, 120s after the stop commands were issued.

This confirms that if many resources (VMs) need to be stopped and the
first one starts some update (and a big stop timeout is set), no stop
attempt for the rest will be made at all until the first one's stop
completes.

Why is this so and is there a way to avoid it?


It has to do with pacemaker's concept of a "transition".

When an interesting event happens (like your first stop), pacemaker
calculates what actions need to be taken and then does them. A
transition may be interrupted between actions by a new event, but any
event already begun must complete before a new transition can begin.

What happened here is that when you stopped the first resource, a
transition was created with that one stop, and that stop was initiated.
When the later stops came in, they would cause a new transition, but
that first stop has to complete before that transition can begin.

There are a few ways around this:

* Shutdown will stop all resources on its own, so you could skip the
stopping altogether.

* If you prefer to ensure all the resources stop successfully before
you start the shutdown, you could batch all the "stop" changes into one
file and apply that to the config. A stop command sets the resource's
target-role meta-attribute to Stopped. Normally, this is applied
directly to the live configuration, so it takes effect immediately.
However crm and pcs both offer ways to batch commands in a file, then
apply it all at once.


With pcs 0.9.157 and newer you can simply specify several resources in 
the "pcs resource disable" command. It has the same effect as batching 
all the stop changes into a file but it is much easier to use.
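
For example, with hypothetical resource names:

    # one CIB update that sets target-role=Stopped on all three resources at once
    pcs resource disable vm1 vm2 vm3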




* Or, you could set the node(s) to standby mode as a transient
attribute (using attrd_updater). That would cause all resources to move
off those nodes (and stop if there are no nodes remaining). Transient
node attributes are erased every time a node leaves the cluster, so it
would only have effect until shutdown; when the node rejoined, it would
be in regular mode.



On 11/20/18 12:40 PM, Klechomir wrote:

Hi list,
Bumped onto the following issue lately:

When multiple VMs are given shutdown one after another and the
shutdown of the first VM takes long, the others aren't being shut
down at all until the first one stops.

"batch-limit" doesn't seem to affect this.
Any suggestions why this could happen?

Best regards,
Klecho


Re: [ClusterLabs] VirtualDomain & parallel shutdown

2018-11-26 Thread Ken Gaillot
On Mon, 2018-11-26 at 14:24 +0200, Klecho wrote:
> Hi again,
> 
> Just made one simple "parallel shutdown" test with a strange result, 
> confirming the problem I've described.
> 
> Created a few dummy resources, each of them taking 60s to stop. No 
> constraints at all. After that issued "stop" to all of them, one by
> one.
> 
> Stop operation wasn't attempted for any of the rest until the first 
> resource stopped.
> 
> When the first resource stopped, all the rest stopped at the same
> moment, 120s after the stop commands were issued.
> 
> This confirms that if many resources (VMs) need to be stopped and the
> first one starts some update (and a big stop timeout is set), no stop
> attempt for the rest will be made at all until the first one's stop
> completes.
> 
> Why is this so and is there a way to avoid it?

It has to do with pacemaker's concept of a "transition".

When an interesting event happens (like your first stop), pacemaker
calculates what actions need to be taken and then does them. A
transition may be interrupted between actions by a new event, but any
event already begun must complete before a new transition can begin.

What happened here is that when you stopped the first resource, a
transition was created with that one stop, and that stop was initiated.
When the later stops came in, they would cause a new transition, but
that first stop has to complete before that transition can begin.

There are a few ways around this:

* Shutdown will stop all resources on its own, so you could skip the
stopping altogether.

* If you prefer to ensure all the resources stop successfully before
you start the shutdown, you could batch all the "stop" changes into one
file and apply that to the config. A stop command sets the resource's
target-role meta-attribute to Stopped. Normally, this is applied
directly to the live configuration, so it takes effect immediately.
However, crm and pcs both offer ways to batch commands in a file, then
apply it all at once (a sketch follows after this list).

* Or, you could set the node(s) to standby mode as a transient
attribute (using attrd_updater; also sketched below). That would cause
all resources to move off those nodes (and stop if there are no nodes
remaining). Transient node attributes are erased every time a node
leaves the cluster, so it would only have effect until shutdown; when
the node rejoined, it would be in regular mode.
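
For illustration, a rough sketch of the batching approach using pcs
(resource and file names hypothetical; the crm shell offers an equivalent
workflow via its configure/commit mode):

    # dump the current CIB into a file
    pcs cluster cib stop-vms.xml

    # queue all the target-role=Stopped changes against the file only
    pcs -f stop-vms.xml resource disable vm1
    pcs -f stop-vms.xml resource disable vm2
    pcs -f stop-vms.xml resource disable vm3

    # push the file back in one go, so all the stops land in a single transition
    pcs cluster cib-push stop-vms.xml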
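
And one way the transient standby attribute could be set (node name
hypothetical; exact attrd_updater options may differ between pacemaker
versions, so check attrd_updater --help first):

    # set a transient (status-section) standby attribute on the node
    attrd_updater -N node1 -n standby -U on

    # ... and later remove it again; it is in any case erased when the
    # node leaves the cluster
    attrd_updater -N node1 -n standby -D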

> 
> On 11/20/18 12:40 PM, Klechomir wrote:
> > Hi list,
> > Bumped onto the following issue lately:
> > 
> > When multiple VMs are given shutdown one after another and the
> > shutdown of the first VM takes long, the others aren't being shut
> > down at all until the first one stops.
> > 
> > "batch-limit" doesn't seem to affect this.
> > Any suggestions why this could happen?
> > 
> > Best regards,
> > Klecho
> 
> 
-- 
Ken Gaillot 


Re: [ClusterLabs] VirtualDomain & parallel shutdown

2018-11-26 Thread Klecho

Hi again,

Just made one simple "parallel shutdown" test with a strange result, 
confirming the problem I've described.


Created a few dummy resources, each of them taking 60s to stop. No 
constraints at all. After that issued "stop" to all of them, one by one.


Stop operation wasn't attempted for any of the rest until the first 
resource stopped.


When the first resource stopped, all the rest stopped at the same moment,
120s after the stop commands were issued.


This confirms that if many resources (VMs) need to be stopped and the first
one starts some update (and a big stop timeout is set), no stop attempt for
the rest will be made at all until the first one's stop completes.


Why is this so and is there a way to avoid it?

On 11/20/18 12:40 PM, Klechomir wrote:

Hi list,
Bumped onto the following issue lately:

When multiple VMs are given shutdown one after another and the shutdown of
the first VM takes long, the others aren't being shut down at all until the
first one stops.

"batch-limit" doesn't seem to affect this.
Any suggestions why this could happen?

Best regards,
Klecho


--
Klecho



[ClusterLabs] VirtualDomain & parallel shutdown

2018-11-20 Thread Klechomir
Hi list,
Bumped onto the following issue lately:

When multiple VMs are given shutdown one after another and the shutdown of
the first VM takes long, the others aren't being shut down at all until the
first one stops.

"batch-limit" doesn't seem to affect this.
Any suggestions why this could happen?

Best regards,
Klecho