Re: [ClusterLabs] Singleton resource not being migrated

2016-08-16 Thread Ken Gaillot
On 08/05/2016 05:12 PM, Nikita Koshikov wrote:
> Thanks, Ken,
> 
> On Fri, Aug 5, 2016 at 7:21 AM, Ken Gaillot wrote:
> 
> On 08/05/2016 03:48 AM, Andreas Kurz wrote:
> > Hi,
> >
> > On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov wrote:
> >
> > Hello list,
> >
> > Can you please help me debug one resource that is not being started
> > after a node failover?
> >
> > Here is the configuration that I'm testing:
> > a 3-node (KVM VM) cluster with:
> >
> > node 10: aic-controller-58055.test.domain.local
> > node 6: aic-controller-50186.test.domain.local
> > node 9: aic-controller-12993.test.domain.local
> > primitive cmha cmha \
> > params conffile="/etc/cmha/cmha.conf"
> > daemon="/usr/bin/cmhad" pidfile="/var/run/cmha/cmha.pid"
> user=cmha \
> > meta failure-timeout=30 resource-stickiness=1
> > target-role=Started migration-threshold=3 \
> > op monitor interval=10 on-fail=restart timeout=20 \
> > op start interval=0 on-fail=restart timeout=60 \
> > op stop interval=0 on-fail=block timeout=90
> >
> >
> > What is the output of crm_mon -1frA once a node is down ... any failed
> > actions?
> >
> >
> > primitive sysinfo_aic-controller-12993.test.domain.local
> > ocf:pacemaker:SysInfo \
> > params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> > op monitor interval=15s
> > primitive sysinfo_aic-controller-50186.test.domain.local
> > ocf:pacemaker:SysInfo \
> > params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> > op monitor interval=15s
> > primitive sysinfo_aic-controller-58055.test.domain.local
> > ocf:pacemaker:SysInfo \
> > params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> > op monitor interval=15s
> >
> >
> > You can use a clone for this sysinfo resource and a symmetric cluster
> > for a more compact configuration; then you can skip all these
> > location constraints.
> >
> >
> > location cmha-on-aic-controller-12993.test.domain.local cmha 100:
> > aic-controller-12993.test.domain.local
> > location cmha-on-aic-controller-50186.test.domain.local cmha 100:
> > aic-controller-50186.test.domain.local
> > location cmha-on-aic-controller-58055.test.domain.local cmha 100:
> > aic-controller-58055.test.domain.local
> > location sysinfo-on-aic-controller-12993.test.domain.local
> > sysinfo_aic-controller-12993.test.domain.local inf:
> > aic-controller-12993.test.domain.local
> > location sysinfo-on-aic-controller-50186.test.domain.local
> > sysinfo_aic-controller-50186.test.domain.local inf:
> > aic-controller-50186.test.domain.local
> > location sysinfo-on-aic-controller-58055.test.domain.local
> > sysinfo_aic-controller-58055.test.domain.local inf:
> > aic-controller-58055.test.domain.local
> > property cib-bootstrap-options: \
> > have-watchdog=false \
> > dc-version=1.1.14-70404b0 \
> > cluster-infrastructure=corosync \
> > cluster-recheck-interval=15s \
> >
> >
> > Never tried such a low cluster-recheck-interval ... I wouldn't do that. I
> > have seen setups with low intervals burning a lot of CPU cycles in bigger
> > cluster setups, plus side effects from aborted transitions. If you do this
> > to "clean up" the cluster state because you see resource-agent errors,
> > you should fix the resource agent instead.
> 
> Strongly agree -- your recheck interval is lower than the various action
> timeouts. The only reason recheck interval should ever be set less than
> about 5 minutes is if you have time-based rules that you want to trigger
> with a finer granularity.
> 
> Your issue does not appear to be coming from recheck interval, otherwise
> it would go away after the recheck interval passed.
> 
> 
> As for the small cluster-recheck-interval - this was only for testing.
>  
> 
> > Regards,
> > Andreas
> >
> >
> > no-quorum-policy=stop \
> > stonith-enabled=false \
> > start-failure-is-fatal=false \
> > symmetric-cluster=false \
> > node-health-strategy=migrate-on-red \
> > last-lrm-refresh=1470334410
> >
> > When all 3 nodes were online, everything seemed OK; this is the output of
> > scoreshow.sh:
> > ResourceScore
>

Re: [ClusterLabs] Singleton resource not being migrated

2016-08-05 Thread Nikita Koshikov
Thanks for reply, Andreas


On Fri, Aug 5, 2016 at 1:48 AM, Andreas Kurz  wrote:

> Hi,
>
> On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov 
> wrote:
>
>> Hello list,
>>
>> Can you please help me debug one resource that is not being started after
>> a node failover?
>>
>> Here is the configuration that I'm testing:
>> a 3-node (KVM VM) cluster with:
>>
>> node 10: aic-controller-58055.test.domain.local
>> node 6: aic-controller-50186.test.domain.local
>> node 9: aic-controller-12993.test.domain.local
>> primitive cmha cmha \
>> params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad"
>> pidfile="/var/run/cmha/cmha.pid" user=cmha \
>> meta failure-timeout=30 resource-stickiness=1 target-role=Started
>> migration-threshold=3 \
>> op monitor interval=10 on-fail=restart timeout=20 \
>> op start interval=0 on-fail=restart timeout=60 \
>> op stop interval=0 on-fail=block timeout=90
>>
>
> What is the output of crm_mon -1frA once a node is down ... any failed
> actions?
>

No errors/failed actions. This is a slightly different lab (the names have
changed), but it shows the same effect:

root@aic-controller-57150:~# crm_mon -1frA
Last updated: Fri Aug  5 20:14:05 2016  Last change: Fri Aug  5
19:38:34 2016 by root via crm_attribute on
aic-controller-44151.test.domain.local
Stack: corosync
Current DC: aic-controller-57150.test.domain.local (version 1.1.14-70404b0)
- partition with quorum
3 nodes and 7 resources configured

Online: [ aic-controller-57150.test.domain.local
aic-controller-58381.test.domain.local ]
OFFLINE: [ aic-controller-44151.test.domain.local ]

Full list of resources:

 sysinfo_aic-controller-44151.test.domain.local (ocf::pacemaker:SysInfo):
Stopped
 sysinfo_aic-controller-57150.test.domain.local (ocf::pacemaker:SysInfo):
Started aic-controller-57150.test.domain.local
 sysinfo_aic-controller-58381.test.domain.local (ocf::pacemaker:SysInfo):
Started aic-controller-58381.test.domain.local
 Clone Set: clone_p_heat-engine [p_heat-engine]
 Started: [ aic-controller-57150.test.domain.local
aic-controller-58381.test.domain.local ]
 cmha   (ocf::heartbeat:cmha):  Stopped

Node Attributes:
* Node aic-controller-57150.test.domain.local:
+ arch  : x86_64
+ cpu_cores : 3
+ cpu_info  : Intel(R) Xeon(R) CPU E5-2680 v3 @
2.50GHz
+ cpu_load  : 1.04
+ cpu_speed : 4994.21
+ free_swap : 5150
+ os: Linux-3.13.0-85-generic
+ ram_free  : 750
+ ram_total : 5000
+ root_free : 45932
+ var_log_free  : 431543
* Node aic-controller-58381.test.domain.local:
+ arch  : x86_64
+ cpu_cores : 3
+ cpu_info  : Intel(R) Xeon(R) CPU E5-2680 v3 @
2.50GHz
+ cpu_load  : 1.16
+ cpu_speed : 4994.21
+ free_swap : 5150
+ os: Linux-3.13.0-85-generic
+ ram_free  : 750
+ ram_total : 5000
+ root_free : 45932
+ var_log_free  : 431542

Migration Summary:
* Node aic-controller-57150.test.domain.local:
* Node aic-controller-58381.test.domain.local:


>
>> primitive sysinfo_aic-controller-12993.test.domain.local
>> ocf:pacemaker:SysInfo \
>> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>> op monitor interval=15s
>> primitive sysinfo_aic-controller-50186.test.domain.local
>> ocf:pacemaker:SysInfo \
>> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>> op monitor interval=15s
>> primitive sysinfo_aic-controller-58055.test.domain.local
>> ocf:pacemaker:SysInfo \
>> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>> op monitor interval=15s
>>
>
> You can use a clone for this sysinfo resource and a symmetric cluster for
> a more compact configuration; then you can skip all these location
> constraints.
>
>
>> location cmha-on-aic-controller-12993.test.domain.local cmha 100:
>> aic-controller-12993.test.domain.local
>> location cmha-on-aic-controller-50186.test.domain.local cmha 100:
>> aic-controller-50186.test.domain.local
>> location cmha-on-aic-controller-58055.test.domain.local cmha 100:
>> aic-controller-58055.test.domain.local
>> location sysinfo-on-aic-controller-12993.test.domain.local
>> sysinfo_aic-controller-12993.test.domain.local inf:
>> aic-controller-12993.test.domain.local
>> location sysinfo-on-aic-controller-50186.test.domain.local
>> sysinfo_aic-controller-50186.test.domain.local inf:
>> aic-controller-50186.test.domain.local
>> 

Re: [ClusterLabs] Singleton resource not being migrated

2016-08-05 Thread Ken Gaillot
On 08/05/2016 03:48 AM, Andreas Kurz wrote:
> Hi,
> 
> On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov wrote:
> 
> Hello list,
> 
> Can you please help me debug one resource that is not being started
> after a node failover?
> 
> Here is the configuration that I'm testing:
> a 3-node (KVM VM) cluster with:
> 
> node 10: aic-controller-58055.test.domain.local
> node 6: aic-controller-50186.test.domain.local
> node 9: aic-controller-12993.test.domain.local
> primitive cmha cmha \
> params conffile="/etc/cmha/cmha.conf"
> daemon="/usr/bin/cmhad" pidfile="/var/run/cmha/cmha.pid" user=cmha \
> meta failure-timeout=30 resource-stickiness=1
> target-role=Started migration-threshold=3 \
> op monitor interval=10 on-fail=restart timeout=20 \
> op start interval=0 on-fail=restart timeout=60 \
> op stop interval=0 on-fail=block timeout=90
> 
> 
> What is the output of crm_mon -1frA once a node is down ... any failed
> actions?
>  
> 
> primitive sysinfo_aic-controller-12993.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
> primitive sysinfo_aic-controller-50186.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
> primitive sysinfo_aic-controller-58055.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
> 
> 
> You can use a clone for this sysinfo resource and a symmetric cluster
> for a more compact configuration; then you can skip all these
> location constraints.
> 
> 
> location cmha-on-aic-controller-12993.test.domain.local cmha 100:
> aic-controller-12993.test.domain.local
> location cmha-on-aic-controller-50186.test.domain.local cmha 100:
> aic-controller-50186.test.domain.local
> location cmha-on-aic-controller-58055.test.domain.local cmha 100:
> aic-controller-58055.test.domain.local
> location sysinfo-on-aic-controller-12993.test.domain.local
> sysinfo_aic-controller-12993.test.domain.local inf:
> aic-controller-12993.test.domain.local
> location sysinfo-on-aic-controller-50186.test.domain.local
> sysinfo_aic-controller-50186.test.domain.local inf:
> aic-controller-50186.test.domain.local
> location sysinfo-on-aic-controller-58055.test.domain.local
> sysinfo_aic-controller-58055.test.domain.local inf:
> aic-controller-58055.test.domain.local
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> cluster-recheck-interval=15s \
> 
> 
> Never tried such a low cluster-recheck-interval ... I wouldn't do that. I
> have seen setups with low intervals burning a lot of CPU cycles in bigger
> cluster setups, plus side effects from aborted transitions. If you do this
> to "clean up" the cluster state because you see resource-agent errors,
> you should fix the resource agent instead.

Strongly agree -- your recheck interval is lower than the various action
timeouts. The only reason recheck interval should ever be set less than
about 5 minutes is if you have time-based rules that you want to trigger
with a finer granularity.

Your issue does not appear to be coming from recheck interval, otherwise
it would go away after the recheck interval passed.
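
For reference, putting it back at (or near) the 15-minute default would look
something like this with crmsh (the exact value here is only an illustration):

    crm configure property cluster-recheck-interval=15min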

> Regards,
> Andreas
>  
> 
> no-quorum-policy=stop \
> stonith-enabled=false \
> start-failure-is-fatal=false \
> symmetric-cluster=false \
> node-health-strategy=migrate-on-red \
> last-lrm-refresh=1470334410
> 
> When all 3 nodes were online, everything seemed OK; this is the output of
> scoreshow.sh:
> Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha      -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha      101        aic-controller-50186.test.domain.local   1           0
> cmha      -INFINITY  aic-controller-58055.test.domain.local   1           0

Everything is not OK; cmha has -INFINITY scores on two nodes, meaning it
won't be allowed to run on them. This is why it won't start after the
one allowed node goes down, and why cleanup gets it working again
(cleanup removes bans caused by resource failures).

It's likely the resource previously failed the maximum allowed times
(migration-threshold=3) on those two nodes.
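
A quick way to confirm that before cleaning up is to look at the fail counts;
both commands below already appear elsewhere in this thread, so this is just
the relevant subset:

    # fail counts show up in the "Migration Summary" section
    crm_mon -1f
    # clearing the failures for cmha also clears the resulting -INFINITY bans
    crm resource cleanup cmha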

The next step would be to figure out why the resource is failing. The
pacemaker logs will 

Re: [ClusterLabs] Singleton resource not being migrated

2016-08-05 Thread Andreas Kurz
Hi,

On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov  wrote:

> Hello list,
>
> Can you please help me debug one resource that is not being started after
> a node failover?
>
> Here is the configuration that I'm testing:
> a 3-node (KVM VM) cluster with:
>
> node 10: aic-controller-58055.test.domain.local
> node 6: aic-controller-50186.test.domain.local
> node 9: aic-controller-12993.test.domain.local
> primitive cmha cmha \
> params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad"
> pidfile="/var/run/cmha/cmha.pid" user=cmha \
> meta failure-timeout=30 resource-stickiness=1 target-role=Started
> migration-threshold=3 \
> op monitor interval=10 on-fail=restart timeout=20 \
> op start interval=0 on-fail=restart timeout=60 \
> op stop interval=0 on-fail=block timeout=90
>

What is the output of crm_mon -1frA once a node is down ... any failed
actions?


> primitive sysinfo_aic-controller-12993.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
> primitive sysinfo_aic-controller-50186.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
> primitive sysinfo_aic-controller-58055.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
>

You can use a clone for this sysinfo resource and a symmetric cluster for a
more compact configuration; then you can skip all these location
constraints.
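
Roughly, that could look like the sketch below (the primitive and clone names
are only placeholders, not something taken from your config):

    primitive sysinfo ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
    clone clone_sysinfo sysinfo
    property cib-bootstrap-options: symmetric-cluster=true

With symmetric-cluster=true the sysinfo clone is allowed everywhere by
default, so only the cmha location preferences would remain.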


> location cmha-on-aic-controller-12993.test.domain.local cmha 100:
> aic-controller-12993.test.domain.local
> location cmha-on-aic-controller-50186.test.domain.local cmha 100:
> aic-controller-50186.test.domain.local
> location cmha-on-aic-controller-58055.test.domain.local cmha 100:
> aic-controller-58055.test.domain.local
> location sysinfo-on-aic-controller-12993.test.domain.local
> sysinfo_aic-controller-12993.test.domain.local inf:
> aic-controller-12993.test.domain.local
> location sysinfo-on-aic-controller-50186.test.domain.local
> sysinfo_aic-controller-50186.test.domain.local inf:
> aic-controller-50186.test.domain.local
> location sysinfo-on-aic-controller-58055.test.domain.local
> sysinfo_aic-controller-58055.test.domain.local inf:
> aic-controller-58055.test.domain.local
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> cluster-recheck-interval=15s \
>

Never tried such a low cluster-recheck-interval ... I wouldn't do that. I have
seen setups with low intervals burning a lot of CPU cycles in bigger cluster
setups, plus side effects from aborted transitions. If you do this to
"clean up" the cluster state because you see resource-agent errors, you
should fix the resource agent instead.
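
If the agent itself is suspect, exercising it outside the cluster can help;
for example with ocf-tester (the agent path below is an assumption, and the
parameters are copied from the cmha primitive above):

    ocf-tester -n cmha \
        -o conffile=/etc/cmha/cmha.conf \
        -o daemon=/usr/bin/cmhad \
        -o pidfile=/var/run/cmha/cmha.pid \
        -o user=cmha \
        /usr/lib/ocf/resource.d/heartbeat/cmha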

Regards,
Andreas


> no-quorum-policy=stop \
> stonith-enabled=false \
> start-failure-is-fatal=false \
> symmetric-cluster=false \
> node-health-strategy=migrate-on-red \
> last-lrm-refresh=1470334410
>
> When all 3 nodes were online, everything seemed OK; this is the output of scoreshow.sh:
> Resource                                         Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha                                             -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha                                             101        aic-controller-50186.test.domain.local   1           0
> cmha                                             -INFINITY  aic-controller-58055.test.domain.local   1           0
> sysinfo_aic-controller-12993.test.domain.local   INFINITY   aic-controller-12993.test.domain.local   0           0
> sysinfo_aic-controller-50186.test.domain.local   -INFINITY  aic-controller-50186.test.domain.local   0           0
> sysinfo_aic-controller-58055.test.domain.local   INFINITY   aic-controller-58055.test.domain.local   0           0
>
> The problem starts when one node goes offline (aic-controller-50186). The
> cmha resource is stuck in the Stopped state.
> Here is the showscores output:
> Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha      -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha      -INFINITY  aic-controller-50186.test.domain.local   1           0
> cmha      -INFINITY  aic-controller-58055.test.domain.local   1           0
>
> Even though it has target-role=Started, pacemaker skips this resource. And
> in the logs I see:
> pengine: info: native_print:  cmha

[ClusterLabs] Singleton resource not being migrated

2016-08-05 Thread Nikita Koshikov
Hello list,

Can you please help me debug one resource that is not being started after
a node failover?

Here is the configuration that I'm testing:
a 3-node (KVM VM) cluster with:

node 10: aic-controller-58055.test.domain.local
node 6: aic-controller-50186.test.domain.local
node 9: aic-controller-12993.test.domain.local
primitive cmha cmha \
params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad"
pidfile="/var/run/cmha/cmha.pid" user=cmha \
meta failure-timeout=30 resource-stickiness=1 target-role=Started
migration-threshold=3 \
op monitor interval=10 on-fail=restart timeout=20 \
op start interval=0 on-fail=restart timeout=60 \
op stop interval=0 on-fail=block timeout=90
primitive sysinfo_aic-controller-12993.test.domain.local
ocf:pacemaker:SysInfo \
params disk_unit=M disks="/ /var/log" min_disk_free=512M \
op monitor interval=15s
primitive sysinfo_aic-controller-50186.test.domain.local
ocf:pacemaker:SysInfo \
params disk_unit=M disks="/ /var/log" min_disk_free=512M \
op monitor interval=15s
primitive sysinfo_aic-controller-58055.test.domain.local
ocf:pacemaker:SysInfo \
params disk_unit=M disks="/ /var/log" min_disk_free=512M \
op monitor interval=15s

location cmha-on-aic-controller-12993.test.domain.local cmha 100:
aic-controller-12993.test.domain.local
location cmha-on-aic-controller-50186.test.domain.local cmha 100:
aic-controller-50186.test.domain.local
location cmha-on-aic-controller-58055.test.domain.local cmha 100:
aic-controller-58055.test.domain.local
location sysinfo-on-aic-controller-12993.test.domain.local
sysinfo_aic-controller-12993.test.domain.local inf:
aic-controller-12993.test.domain.local
location sysinfo-on-aic-controller-50186.test.domain.local
sysinfo_aic-controller-50186.test.domain.local inf:
aic-controller-50186.test.domain.local
location sysinfo-on-aic-controller-58055.test.domain.local
sysinfo_aic-controller-58055.test.domain.local inf:
aic-controller-58055.test.domain.local
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-recheck-interval=15s \
no-quorum-policy=stop \
stonith-enabled=false \
start-failure-is-fatal=false \
symmetric-cluster=false \
node-health-strategy=migrate-on-red \
last-lrm-refresh=1470334410

When all 3 nodes were online, everything seemed OK; this is the output of scoreshow.sh:
Resource                                         Score      Node                                     Stickiness  #Fail  Migration-Threshold
cmha                                             -INFINITY  aic-controller-12993.test.domain.local   1           0
cmha                                             101        aic-controller-50186.test.domain.local   1           0
cmha                                             -INFINITY  aic-controller-58055.test.domain.local   1           0
sysinfo_aic-controller-12993.test.domain.local   INFINITY   aic-controller-12993.test.domain.local   0           0
sysinfo_aic-controller-50186.test.domain.local   -INFINITY  aic-controller-50186.test.domain.local   0           0
sysinfo_aic-controller-58055.test.domain.local   INFINITY   aic-controller-58055.test.domain.local   0           0
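
For reference, roughly the same allocation scores can also be read straight
from the live CIB with crm_simulate (scoreshow.sh is presumably a wrapper
around something like this):

    crm_simulate -sL | grep -E 'cmha|sysinfo'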

The problem starts when one node goes offline (aic-controller-50186). The
cmha resource is stuck in the Stopped state.
Here is the showscores output:
Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
cmha      -INFINITY  aic-controller-12993.test.domain.local   1           0
cmha      -INFINITY  aic-controller-50186.test.domain.local   1           0
cmha      -INFINITY  aic-controller-58055.test.domain.local   1           0

Even though it has target-role=Started, pacemaker skips this resource. And in
the logs I see:
pengine: info: native_print:  cmha(ocf::heartbeat:cmha):
 Stopped
pengine: info: native_color:  Resource cmha cannot run anywhere
pengine: info: LogActions:Leave   cmha(Stopped)

To recover the cmha resource I need to run either:
1) crm resource cleanup cmha
2) crm resource reprobe

After either of the above commands, the resource is picked up by pacemaker
and I see valid scores:
Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
cmha      100        aic-controller-58055.test.domain.local   1           0      3
cmha      101        aic-controller-12993.test.domain.local   1           0      3
cmha      -INFINITY  aic-controller-50186.test.domain.local   1