Re: [ClusterLabs] crm_resource --wait

2017-10-21 Thread Leon Steffens
Thanks for the update Ken!



From: Ken Gaillot
Sent: Saturday, 21 October 2017 7:06 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] crm_resource --wait

I've narrowed down the cause.

When the "standby" transition completes, vm2 has more remaining
utilization capacity than vm1, so the cluster wants to run sv-fencer
there. That should be taken into account in the same transition, but it
isn't, so a second transition is needed to make it happen.

Still investigating a fix. A workaround is to assign some stickiness or
utilization to sv-fencer.

On Wed, 2017-10-11 at 14:01 +1000, Leon Steffens wrote:
> I've attached two files:
> 314 = after standby step
> 315 = after resource update
> 
> On Wed, Oct 11, 2017 at 12:22 AM, Ken Gaillot <kgail...@redhat.com>
> wrote:
> > On Tue, 2017-10-10 at 15:19 +1000, Leon Steffens wrote:
> > > Hi Ken,
> > >
> > > I managed to reproduce this on a simplified version of the
> > > cluster,
> > > and on Pacemaker 1.1.15, 1.1.16, as well as 1.1.18-rc1
> > 
> > > The steps to create the cluster are:
> > >
> > > pcs property set stonith-enabled=false
> > > pcs property set placement-strategy=balanced
> > >
> > > pcs node utilization vm1 cpu=100
> > > pcs node utilization vm2 cpu=100
> > > pcs node utilization vm3 cpu=100
> > >
> > > pcs property set maintenance-mode=true
> > >
> > > pcs resource create sv-fencer ocf:pacemaker:Dummy
> > >
> > > pcs resource create sv ocf:pacemaker:Dummy clone notify=false
> > > pcs resource create std ocf:pacemaker:Dummy meta resource-
> > > stickiness=100
> > >
> > > pcs resource create partition1 ocf:pacemaker:Dummy meta resource-
> > > stickiness=100
> > > pcs resource create partition2 ocf:pacemaker:Dummy meta resource-
> > > stickiness=100
> > > pcs resource create partition3 ocf:pacemaker:Dummy meta resource-
> > > stickiness=100
> > >
> > > pcs resource utilization partition1 cpu=5
> > > pcs resource utilization partition2 cpu=5
> > > pcs resource utilization partition3 cpu=5
> > >
> > > pcs constraint colocation add std with sv-clone INFINITY
> > > pcs constraint colocation add partition1 with sv-clone INFINITY
> > > pcs constraint colocation add partition2 with sv-clone INFINITY
> > > pcs constraint colocation add partition3 with sv-clone INFINITY
> > >
> > > pcs property set maintenance-mode=false
> > >  
> > >
> > > I can then reproduce the issues in the following way:
> > >
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 vm3 ]
> > >  std    (ocf::pacemaker:Dummy): Started vm2
> > >  partition1     (ocf::pacemaker:Dummy): Started vm3
> > >  partition2     (ocf::pacemaker:Dummy): Started vm1
> > >  partition3     (ocf::pacemaker:Dummy): Started vm2
> > >
> > > $ pcs cluster standby vm3
> > >
> > > # Check that all resources have moved off vm3
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 ]
> > >      Stopped: [ vm3 ]
> > >  std    (ocf::pacemaker:Dummy): Started vm2
> > >  partition1     (ocf::pacemaker:Dummy): Started vm1
> > >  partition2     (ocf::pacemaker:Dummy): Started vm1
> > >  partition3     (ocf::pacemaker:Dummy): Started vm2
> > 
> > Thanks for the detailed information, this should help me get to the
> > bottom of it. From this description, it sounds like a new
> > transition
> > isn't being triggered when it should.
> > 
> > Could you please attach the DC's pe-input file that is listed in
> > the
> > logs after the standby step above? That would simplify analysis.
> > 
> > > # Wait for any outstanding actions to complete.
> > > $ crm_resource --wait --timeout 300
> > > Pending actions:
> > >         Action 22: sv-fencer_monitor_1      on vm2
> > >         Action 21: sv-fencer_start_0    on vm2
> > >         Action 20: sv-fencer_stop_0     on vm1
> > > Error performing operation: Timer expired
> > >
> > > # Check the resources again - sv-fencer is still on vm1
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 

Re: [ClusterLabs] crm_resource --wait

2017-10-20 Thread Ken Gaillot
I've narrowed down the cause.

When the "standby" transition completes, vm2 has more remaining
utilization capacity than vm1, so the cluster wants to run sv-fencer
there. That should be taken into account in the same transition, but it
isn't, so a second transition is needed to make it happen.

Still investigating a fix. A workaround is to assign some stickiness or
utilization to sv-fencer.
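
For example, with the resource names from the reproduction steps quoted
below, the workaround could look something like this (the values are
arbitrary, just enough to give placement a reason not to move sv-fencer):

$ pcs resource meta sv-fencer resource-stickiness=1
# or, alternatively, give sv-fencer a small utilization value:
$ pcs resource utilization sv-fencer cpu=1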

On Wed, 2017-10-11 at 14:01 +1000, Leon Steffens wrote:
> I've attached two files:
> 314 = after standby step
> 315 = after resource update
> 
> On Wed, Oct 11, 2017 at 12:22 AM, Ken Gaillot 
> wrote:
> > On Tue, 2017-10-10 at 15:19 +1000, Leon Steffens wrote:
> > > Hi Ken,
> > >
> > > I managed to reproduce this on a simplified version of the
> > > cluster,
> > > and on Pacemaker 1.1.15, 1.1.16, as well as 1.1.18-rc1
> > 
> > > The steps to create the cluster are:
> > >
> > > pcs property set stonith-enabled=false
> > > pcs property set placement-strategy=balanced
> > >
> > > pcs node utilization vm1 cpu=100
> > > pcs node utilization vm2 cpu=100
> > > pcs node utilization vm3 cpu=100
> > >
> > > pcs property set maintenance-mode=true
> > >
> > > pcs resource create sv-fencer ocf:pacemaker:Dummy
> > >
> > > pcs resource create sv ocf:pacemaker:Dummy clone notify=false
> > > pcs resource create std ocf:pacemaker:Dummy meta resource-
> > > stickiness=100
> > >
> > > pcs resource create partition1 ocf:pacemaker:Dummy meta resource-
> > > stickiness=100
> > > pcs resource create partition2 ocf:pacemaker:Dummy meta resource-
> > > stickiness=100
> > > pcs resource create partition3 ocf:pacemaker:Dummy meta resource-
> > > stickiness=100
> > >
> > > pcs resource utilization partition1 cpu=5
> > > pcs resource utilization partition2 cpu=5
> > > pcs resource utilization partition3 cpu=5
> > >
> > > pcs constraint colocation add std with sv-clone INFINITY
> > > pcs constraint colocation add partition1 with sv-clone INFINITY
> > > pcs constraint colocation add partition2 with sv-clone INFINITY
> > > pcs constraint colocation add partition3 with sv-clone INFINITY
> > >
> > > pcs property set maintenance-mode=false
> > >  
> > >
> > > I can then reproduce the issues in the following way:
> > >
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 vm3 ]
> > >  std    (ocf::pacemaker:Dummy): Started vm2
> > >  partition1     (ocf::pacemaker:Dummy): Started vm3
> > >  partition2     (ocf::pacemaker:Dummy): Started vm1
> > >  partition3     (ocf::pacemaker:Dummy): Started vm2
> > >
> > > $ pcs cluster standby vm3
> > >
> > > # Check that all resources have moved off vm3
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 ]
> > >      Stopped: [ vm3 ]
> > >  std    (ocf::pacemaker:Dummy): Started vm2
> > >  partition1     (ocf::pacemaker:Dummy): Started vm1
> > >  partition2     (ocf::pacemaker:Dummy): Started vm1
> > >  partition3     (ocf::pacemaker:Dummy): Started vm2
> > 
> > Thanks for the detailed information, this should help me get to the
> > bottom of it. From this description, it sounds like a new
> > transition
> > isn't being triggered when it should.
> > 
> > Could you please attach the DC's pe-input file that is listed in
> > the
> > logs after the standby step above? That would simplify analysis.
> > 
> > > # Wait for any outstanding actions to complete.
> > > $ crm_resource --wait --timeout 300
> > > Pending actions:
> > >         Action 22: sv-fencer_monitor_1      on vm2
> > >         Action 21: sv-fencer_start_0    on vm2
> > >         Action 20: sv-fencer_stop_0     on vm1
> > > Error performing operation: Timer expired
> > >
> > > # Check the resources again - sv-fencer is still on vm1
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 ]
> > >      Stopped: [ vm3 ]
> > >  std    (ocf::pacemaker:Dummy): Started vm2
> > >  partition1     (ocf::pacemaker:Dummy): Started vm1
> > >  partition2     (ocf::pacemaker:Dummy): Started vm1
> > >  partition3     (ocf::pacemaker:Dummy): Started vm2
> > >
> > > # Perform a random update to the CIB.
> > > $ pcs resource update std op monitor interval=20 timeout=20
> > >
> > > # Check resource status again - sv-fencer has now moved to vm2 (the
> > > action crm_resource was waiting for)
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm2  <<<
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 ]
> > >      Stopped: [ vm3 ]
> > >  std    (ocf::pacemaker:Dummy): Started vm2
> > >  partition1     (ocf::pacemaker:Dummy): Started vm1
> > >  partition2     (ocf::pacemaker:Dummy): Started vm1
> > >  partition3     (ocf::pacemaker:Dummy): Started vm2
> > >
> > > I do not get the problem if I:
> > > 1) remove the "std" resource; or
> > > 2) remove 

Re: [ClusterLabs] crm_resource --wait

2017-10-10 Thread Leon Steffens
I've attached two files:
314 = after standby step
315 = after resource update

On Wed, Oct 11, 2017 at 12:22 AM, Ken Gaillot  wrote:

> On Tue, 2017-10-10 at 15:19 +1000, Leon Steffens wrote:
> > Hi Ken,
> >
> > I managed to reproduce this on a simplified version of the cluster,
> > and on Pacemaker 1.1.15, 1.1.16, as well as 1.1.18-rc1
>
> > The steps to create the cluster are:
> >
> > pcs property set stonith-enabled=false
> > pcs property set placement-strategy=balanced
> >
> > pcs node utilization vm1 cpu=100
> > pcs node utilization vm2 cpu=100
> > pcs node utilization vm3 cpu=100
> >
> > pcs property set maintenance-mode=true
> >
> > pcs resource create sv-fencer ocf:pacemaker:Dummy
> >
> > pcs resource create sv ocf:pacemaker:Dummy clone notify=false
> > pcs resource create std ocf:pacemaker:Dummy meta resource-
> > stickiness=100
> >
> > pcs resource create partition1 ocf:pacemaker:Dummy meta resource-
> > stickiness=100
> > pcs resource create partition2 ocf:pacemaker:Dummy meta resource-
> > stickiness=100
> > pcs resource create partition3 ocf:pacemaker:Dummy meta resource-
> > stickiness=100
> >
> > pcs resource utilization partition1 cpu=5
> > pcs resource utilization partition2 cpu=5
> > pcs resource utilization partition3 cpu=5
> >
> > pcs constraint colocation add std with sv-clone INFINITY
> > pcs constraint colocation add partition1 with sv-clone INFINITY
> > pcs constraint colocation add partition2 with sv-clone INFINITY
> > pcs constraint colocation add partition3 with sv-clone INFINITY
> >
> > pcs property set maintenance-mode=false
> >
> >
> > I can then reproduce the issues in the following way:
> >
> > $ pcs resource
> >  sv-fencer  (ocf::pacemaker:Dummy): Started vm1
> >  Clone Set: sv-clone [sv]
> >  Started: [ vm1 vm2 vm3 ]
> >  std    (ocf::pacemaker:Dummy): Started vm2
> >  partition1 (ocf::pacemaker:Dummy): Started vm3
> >  partition2 (ocf::pacemaker:Dummy): Started vm1
> >  partition3 (ocf::pacemaker:Dummy): Started vm2
> >
> > $ pcs cluster standby vm3
> >
> > # Check that all resources have moved off vm3
> > $ pcs resource
> >  sv-fencer  (ocf::pacemaker:Dummy): Started vm1
> >  Clone Set: sv-clone [sv]
> >  Started: [ vm1 vm2 ]
> >  Stopped: [ vm3 ]
> >  std    (ocf::pacemaker:Dummy): Started vm2
> >  partition1 (ocf::pacemaker:Dummy): Started vm1
> >  partition2 (ocf::pacemaker:Dummy): Started vm1
> >  partition3 (ocf::pacemaker:Dummy): Started vm2
>
> Thanks for the detailed information, this should help me get to the
> bottom of it. From this description, it sounds like a new transition
> isn't being triggered when it should.
>
> Could you please attach the DC's pe-input file that is listed in the
> logs after the standby step above? That would simplify analysis.
>
> > # Wait for any outstanding actions to complete.
> > $ crm_resource --wait --timeout 300
> > Pending actions:
> > Action 22: sv-fencer_monitor_1  on vm2
> > Action 21: sv-fencer_start_0on vm2
> > Action 20: sv-fencer_stop_0 on vm1
> > Error performing operation: Timer expired
> >
> > # Check the resources again - sv-fencer is still on vm1
> > $ pcs resource
> >  sv-fencer  (ocf::pacemaker:Dummy): Started vm1
> >  Clone Set: sv-clone [sv]
> >  Started: [ vm1 vm2 ]
> >  Stopped: [ vm3 ]
> >  std    (ocf::pacemaker:Dummy): Started vm2
> >  partition1 (ocf::pacemaker:Dummy): Started vm1
> >  partition2 (ocf::pacemaker:Dummy): Started vm1
> >  partition3 (ocf::pacemaker:Dummy): Started vm2
> >
> > # Perform a random update to the CIB.
> > $ pcs resource update std op monitor interval=20 timeout=20
> >
> > # Check resource status again - sv-fencer has now moved to vm2 (the
> > action crm_resource was waiting for)
> > $ pcs resource
> >  sv-fencer  (ocf::pacemaker:Dummy): Started vm2  <<<
> >  Clone Set: sv-clone [sv]
> >  Started: [ vm1 vm2 ]
> >  Stopped: [ vm3 ]
> >  std    (ocf::pacemaker:Dummy): Started vm2
> >  partition1 (ocf::pacemaker:Dummy): Started vm1
> >  partition2 (ocf::pacemaker:Dummy): Started vm1
> >  partition3 (ocf::pacemaker:Dummy): Started vm2
> >
> > I do not get the problem if I:
> > 1) remove the "std" resource; or
> > 2) remove the co-location constraints; or
> > 3) remove the utilization attributes for the partition resources.
> >
> > In these cases the sv-fencer resource is happy to stay on vm1, and
> > crm_resource --wait returns immediately.
> >
> > It looks like the pcs cluster standby call creates/registers the
> > actions to move the sv-fencer resource to vm2 but doesn't include them
> > in the cluster transition.  When the CIB is later updated by something
> > else, those actions are included in that transition.
> >
> >
> > Regards,
> > Leon
>

Re: [ClusterLabs] crm_resource --wait

2017-10-10 Thread Ken Gaillot
On Tue, 2017-10-10 at 15:19 +1000, Leon Steffens wrote:
> Hi Ken,
> 
> I managed to reproduce this on a simplified version of the cluster,
> and on Pacemaker 1.1.15, 1.1.16, as well as 1.1.18-rc1

> The steps to create the cluster are:
> 
> pcs property set stonith-enabled=false
> pcs property set placement-strategy=balanced
> 
> pcs node utilization vm1 cpu=100
> pcs node utilization vm2 cpu=100
> pcs node utilization vm3 cpu=100
> 
> pcs property set maintenance-mode=true
> 
> pcs resource create sv-fencer ocf:pacemaker:Dummy
> 
> pcs resource create sv ocf:pacemaker:Dummy clone notify=false
> pcs resource create std ocf:pacemaker:Dummy meta resource-
> stickiness=100
> 
> pcs resource create partition1 ocf:pacemaker:Dummy meta resource-
> stickiness=100
> pcs resource create partition2 ocf:pacemaker:Dummy meta resource-
> stickiness=100
> pcs resource create partition3 ocf:pacemaker:Dummy meta resource-
> stickiness=100
> 
> pcs resource utilization partition1 cpu=5
> pcs resource utilization partition2 cpu=5
> pcs resource utilization partition3 cpu=5
> 
> pcs constraint colocation add std with sv-clone INFINITY
> pcs constraint colocation add partition1 with sv-clone INFINITY
> pcs constraint colocation add partition2 with sv-clone INFINITY
> pcs constraint colocation add partition3 with sv-clone INFINITY
> 
> pcs property set maintenance-mode=false
>  
> 
> I can then reproduce the issues in the following way:
> 
> $ pcs resource
>  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
>  Clone Set: sv-clone [sv]
>      Started: [ vm1 vm2 vm3 ]
>  std    (ocf::pacemaker:Dummy): Started vm2
>  partition1     (ocf::pacemaker:Dummy): Started vm3
>  partition2     (ocf::pacemaker:Dummy): Started vm1
>  partition3     (ocf::pacemaker:Dummy): Started vm2
> 
> $ pcs cluster standby vm3
> 
> # Check that all resources have moved off vm3
> $ pcs resource
>  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
>  Clone Set: sv-clone [sv]
>      Started: [ vm1 vm2 ]
>      Stopped: [ vm3 ]
>  std    (ocf::pacemaker:Dummy): Started vm2
>  partition1     (ocf::pacemaker:Dummy): Started vm1
>  partition2     (ocf::pacemaker:Dummy): Started vm1
>  partition3     (ocf::pacemaker:Dummy): Started vm2

Thanks for the detailed information, this should help me get to the
bottom of it. From this description, it sounds like a new transition
isn't being triggered when it should.

Could you please attach the DC's pe-input file that is listed in the
logs after the standby step above? That would simplify analysis.
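
If it helps to locate it: the policy engine normally saves these under
/var/lib/pacemaker/pengine/ on the DC, and the file name is also logged when
the transition is calculated, so something along these lines should turn it
up (exact paths and log wording can vary by setup):

$ grep pe-input /var/log/cluster/corosync.log | tail -n 5
$ ls -t /var/lib/pacemaker/pengine/pe-input-*.bz2 | head -n 2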

> # Wait for any outstanding actions to complete.
> $ crm_resource --wait --timeout 300
> Pending actions:
>         Action 22: sv-fencer_monitor_1      on vm2
>         Action 21: sv-fencer_start_0    on vm2
>         Action 20: sv-fencer_stop_0     on vm1
> Error performing operation: Timer expired
> 
> # Check the resources again - sv-fencer is still on vm1
> $ pcs resource
>  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
>  Clone Set: sv-clone [sv]
>      Started: [ vm1 vm2 ]
>      Stopped: [ vm3 ]
>  std    (ocf::pacemaker:Dummy): Started vm2
>  partition1     (ocf::pacemaker:Dummy): Started vm1
>  partition2     (ocf::pacemaker:Dummy): Started vm1
>  partition3     (ocf::pacemaker:Dummy): Started vm2
> 
> # Perform a random update to the CIB.
> $ pcs resource update std op monitor interval=20 timeout=20
> 
> # Check resource status again - sv-fencer has now moved to vm2 (the
> action crm_resource was waiting for)
> $ pcs resource
>  sv-fencer      (ocf::pacemaker:Dummy): Started vm2  <<<
>  Clone Set: sv-clone [sv]
>      Started: [ vm1 vm2 ]
>      Stopped: [ vm3 ]
>  std    (ocf::pacemaker:Dummy): Started vm2
>  partition1     (ocf::pacemaker:Dummy): Started vm1
>  partition2     (ocf::pacemaker:Dummy): Started vm1
>  partition3     (ocf::pacemaker:Dummy): Started vm2
> 
> I do not get the problem if I:
> 1) remove the "std" resource; or
> 2) remove the co-location constraints; or
> 3) remove the utilization attributes for the partition resources.
> 
> In these cases the sv-fencer resource is happy to stay on vm1, and
> crm_resource --wait returns immediately.
> 
> It looks like the pcs cluster standby call creates/registers the actions
> to move the sv-fencer resource to vm2 but doesn't include them in the
> cluster transition.  When the CIB is later updated by something else,
> those actions are included in that transition.
> 
> 
> Regards,
> Leon



Re: [ClusterLabs] crm_resource --wait

2017-10-09 Thread Leon Steffens
Hi Ken,

I managed to reproduce this on a simplified version of the cluster, and on
Pacemaker 1.1.15, 1.1.16, as well as 1.1.18-rc1

The steps to create the cluster are:

pcs property set stonith-enabled=false
pcs property set placement-strategy=balanced

pcs node utilization vm1 cpu=100
pcs node utilization vm2 cpu=100
pcs node utilization vm3 cpu=100

pcs property set maintenance-mode=true

pcs resource create sv-fencer ocf:pacemaker:Dummy

pcs resource create sv ocf:pacemaker:Dummy clone notify=false
pcs resource create std ocf:pacemaker:Dummy meta resource-stickiness=100

pcs resource create partition1 ocf:pacemaker:Dummy meta
resource-stickiness=100
pcs resource create partition2 ocf:pacemaker:Dummy meta
resource-stickiness=100
pcs resource create partition3 ocf:pacemaker:Dummy meta
resource-stickiness=100

pcs resource utilization partition1 cpu=5
pcs resource utilization partition2 cpu=5
pcs resource utilization partition3 cpu=5

pcs constraint colocation add std with sv-clone INFINITY
pcs constraint colocation add partition1 with sv-clone INFINITY
pcs constraint colocation add partition2 with sv-clone INFINITY
pcs constraint colocation add partition3 with sv-clone INFINITY

pcs property set maintenance-mode=false


I can then reproduce the issues in the following way:

$ pcs resource
 sv-fencer  (ocf::pacemaker:Dummy): Started vm1
 Clone Set: sv-clone [sv]
 Started: [ vm1 vm2 vm3 ]
 std    (ocf::pacemaker:Dummy): Started vm2
 partition1 (ocf::pacemaker:Dummy): Started vm3
 partition2 (ocf::pacemaker:Dummy): Started vm1
 partition3 (ocf::pacemaker:Dummy): Started vm2

$ pcs cluster standby vm3

# Check that all resources have moved off vm3
$ pcs resource
 sv-fencer  (ocf::pacemaker:Dummy): Started vm1
 Clone Set: sv-clone [sv]
 Started: [ vm1 vm2 ]
 Stopped: [ vm3 ]
 std    (ocf::pacemaker:Dummy): Started vm2
 partition1 (ocf::pacemaker:Dummy): Started vm1
 partition2 (ocf::pacemaker:Dummy): Started vm1
 partition3 (ocf::pacemaker:Dummy): Started vm2

# Wait for any outstanding actions to complete.
$ crm_resource --wait --timeout 300
Pending actions:
Action 22: sv-fencer_monitor_1  on vm2
Action 21: sv-fencer_start_0on vm2
Action 20: sv-fencer_stop_0 on vm1
Error performing operation: Timer expired

# Check the resources again - sv-fencer is still on vm1
$ pcs resource
 sv-fencer  (ocf::pacemaker:Dummy): Started vm1
 Clone Set: sv-clone [sv]
 Started: [ vm1 vm2 ]
 Stopped: [ vm3 ]
 std    (ocf::pacemaker:Dummy): Started vm2
 partition1 (ocf::pacemaker:Dummy): Started vm1
 partition2 (ocf::pacemaker:Dummy): Started vm1
 partition3 (ocf::pacemaker:Dummy): Started vm2

# Perform a random update to the CIB.
$ pcs resource update std op monitor interval=20 timeout=20

# Check resource status again - sv-fencer has now moved to vm2 (the action
crm_resource was waiting for)
$ pcs resource
 sv-fencer  (ocf::pacemaker:Dummy): Started vm2  <<<
 Clone Set: sv-clone [sv]
 Started: [ vm1 vm2 ]
 Stopped: [ vm3 ]
 std    (ocf::pacemaker:Dummy): Started vm2
 partition1 (ocf::pacemaker:Dummy): Started vm1
 partition2 (ocf::pacemaker:Dummy): Started vm1
 partition3 (ocf::pacemaker:Dummy): Started vm2

I do not get the problem if I:
1) remove the "std" resource; or
2) remove the co-location constraints; or
3) remove the utilization attributes for the partition resources.

In these cases the sv-fencer resource is happy to stay on vm1, and
crm_resource --wait returns immediately.

It looks like the pcs cluster standby call creates/registers the actions to
move the sv-fencer resource to vm2 but doesn't include them in the cluster
transition.  When the CIB is later updated by something else, those actions
are included in that transition.
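
One way to see what the cluster still considers pending at this point
(effectively what crm_resource --wait is looking at) is to run crm_simulate
against the live CIB rather than waiting for another CIB update to come
along; a sketch, assuming the -L/--live-check option is available in this
Pacemaker version:

$ crm_simulate -L -S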


Regards,
Leon


Re: [ClusterLabs] crm_resource --wait

2017-10-09 Thread Leon Steffens
>
> > Pending actions:
> > Action 40: sv_fencer_monitor_6 on brilxvm44
> > Action 39: sv_fencer_start_0 on brilxvm44
> > Action 38: sv_fencer_stop_0 on brilxvm43
> > Error performing operation: Timer expired
> >
> > It looks like it's waiting for the sv_fencer fencing agent to start
> > on brilxvm44, even though the current transition did not include that
> > move.
>
> crm_resource --wait doesn't wait for a specific transition to complete;
> it waits until no further actions are needed.
>
> That is one of its limitations, that if something keeps provoking a new
> transition, it will never complete except by timeout.


Thanks Ken,

I understand that crm_resource --wait will wait until no further actions
are needed, but I'm not quite sure of:

1) what is triggering this movement of the sv_fencer resource from vm43 to
vm44?
2) why is the action only triggered by a CIB update (setting of a node
property) after the wait has timed out, and not while crm_resource --wait is
waiting?
3) why is crm_resource --wait waiting for this action if the action is only
triggered by the node property being set after the wait has timed out?
(i.e. if the cluster is aware of the action, why is it not triggered
earlier?)

The sequence of events is:

1) Put node 3 in standby
2) Wait until no further actions are needed
3) Set property on node 1.

I'll see if I can reproduce this in an independent test and then try it
with a later version of Pacemaker.

Regards,
Leon


Re: [ClusterLabs] crm_resource --wait

2017-10-09 Thread Ken Gaillot
On Mon, 2017-10-09 at 16:37 +1000, Leon Steffens wrote:
> Hi all,
> 
> We have a use case where we want to place a node into standby and
> then wait for all the resources to move off the node (and be started
> on other nodes) before continuing.  
> 
> In order to do this we call:
> $ pcs cluster standby brilxvm45
> $ crm_resource --wait --timeout 300
> 
> This works most of the time, but in one of our test environments we
> are hitting a problem:
> 
> When we put the node in standby, the reported cluster transition is:
> 
> $  /usr/sbin/crm_simulate -x pe-input-3595.bz2 -S
> 
> Using the original execution date of: 2017-10-08 16:58:05Z
> ...
> Transition Summary:
>  * Restart sv_fencer    (Started brilxvm43)
>  * Stop    sv.svtest.aa.sv.monitor:1    (brilxvm45)
>  * Move    sv.svtest.aa.26.partition    (Started brilxvm45 ->
> brilxvm43)
>  * Move    sv.svtest.aa.27.partition    (Started brilxvm45 ->
> brilxvm44)
>  * Move    sv.svtest.aa.28.partition    (Started brilxvm45 ->
> brilxvm43)
> 
> We expect crm_resource --wait to return once sv_fencer (a fencing
> device) has been restarted (not sure why it's being restarted), and
> the 3 partition resources have been moved.
> 
> But crm_resource actually times out after 300 seconds with the
> following error:
> 
> Pending actions:
> Action 40: sv_fencer_monitor_6 on brilxvm44
> Action 39: sv_fencer_start_0 on brilxvm44
> Action 38: sv_fencer_stop_0 on brilxvm43
> Error performing operation: Timer expired
> 
> It looks like it's waiting for the sv_fencer fencing agent to start
> on brilxvm44, even though the current transition did not include that
> move.  

crm_resource --wait doesn't wait for a specific transition to complete;
it waits until no further actions are needed.

That is one of its limitations, that if something keeps provoking a new
transition, it will never complete except by timeout.
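
If the goal is just "everything has moved off the standby node", one
alternative is to poll for that specific condition with a bounded loop
instead of waiting for the cluster to become completely idle. A rough shell
sketch (it only catches resources reported as "Started <node>", so clone and
master/slave output would need extra handling):

for i in $(seq 1 60); do
    # stop once crm_mon no longer reports anything running on the standby node
    crm_mon -1 | grep -q "Started brilxvm45" || break
    sleep 5
done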

> 
> After the crm_resource --wait has timed out, we set a property on a
> different node (brilxvm43).  This seems to trigger a new transition
> to move sv_fencer to brilxvm44:
> 
> $  /usr/sbin/crm_simulate -x pe-input-3596.bz2 -S
> Using the original execution date of: 2017-10-08 17:03:27Z
> 
> Transition Summary:
>  * Move    sv_fencer    (Started brilxvm43 -> brilxvm44)
> 
> And from the corosync.log it looks like this transition triggers
> actions 38 - 40 (the ones crm_resource --wait waited for).
> 
> > So it looks like crm_resource --wait knows about the transition to
> > move the sv_fencer resource, but the subsequent setting of the node
> > property is what actually triggers it (which is too late, as it only
> > happens after the wait).
> 
> I have attached the DC's corosync.log for the applicable time period
> (timezone is UTC+10).  (The last few lines in the corosync log - the
> interruption of transition 141 - are because of a subsequent standby
> being done for brilxvm43.)
> 
> A possible workaround I thought of was to make the sv_fencer resource
> slightly sticky (all the other resources are), but I'm not sure if
> this will just hide the problem for this specific scenario.
> 
> We are using Pacemaker 1.1.15 on RedHat 6.9.
> 
> Regards,
> Leon
> 
> 
> 



[ClusterLabs] crm_resource --wait

2017-10-09 Thread Leon Steffens
Hi all,

We have a use case where we want to place a node into standby and then wait
for all the resources to move off the node (and be started on other nodes)
before continuing.

In order to do this we call:
$ pcs cluster standby brilxvm45
$ crm_resource --wait --timeout 300

This works most of the time, but in one of our test environments we are
hitting a problem:

When we put the node in standby, the reported cluster transition is:

$  /usr/sbin/crm_simulate -x pe-input-3595.bz2 -S

Using the original execution date of: 2017-10-08 *16:58:05Z*
...
Transition Summary:
 * Restart sv_fencer    (Started brilxvm43)
 * Stop    sv.svtest.aa.sv.monitor:1    (brilxvm45)
 * Move    sv.svtest.aa.26.partition    (Started brilxvm45 -> brilxvm43)
 * Move    sv.svtest.aa.27.partition    (Started brilxvm45 -> brilxvm44)
 * Move    sv.svtest.aa.28.partition    (Started brilxvm45 -> brilxvm43)

We expect crm_resource --wait to return once sv_fencer (a fencing device)
has been restarted (not sure why it's being restarted), and the 3 partition
resources have been moved.

But crm_resource actually times out after 300 seconds with the following
error:

Pending actions:
Action 40: sv_fencer_monitor_6 on brilxvm44
Action 39: sv_fencer_start_0 on brilxvm44
Action 38: sv_fencer_stop_0 on brilxvm43
Error performing operation: Timer expired

It looks like it's waiting for the sv_fencer fencing agent to start on
brilxvm44, even though the current transition did not include that move.


After the crm_resource --wait has timed out, we set a property on a
different node (brilxvm43).  This seems to trigger a new transition to move
sv_fencer to brilxvm44:

$  /usr/sbin/crm_simulate -x pe-input-3596.bz2 -S
Using the original execution date of: 2017-10-08 *17:03:27Z*

Transition Summary:
 * Move    sv_fencer    (Started brilxvm43 -> brilxvm44)

And from the corosync.log it looks like this transition triggers actions 38
- 40 (the ones crm_resource --wait waited for).

So it looks like crm_resource --wait knows about the transition to move
the sv_fencer resource, but the subsequent setting of the node property is
what actually triggers it (which is too late, as it only happens after the
wait).

I have attached the DC's corosync.log for the applicable time period
(timezone is UTC+10).  (The last few lines in the corosync log - the
interruption of transition 141 - are because of a subsequent standby being
done for brilxvm43.)

A possible workaround I thought of was to make the sv_fencer resource
slightly sticky (all the other resources are), but I'm not sure if this
will just hide the problem for this specific scenario.

We are using Pacemaker 1.1.15 on RedHat 6.9.

Regards,
Leon


wait_corosync.log
Description: Binary data
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org