Hi,

On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <koshi...@gmail.com> wrote:
> Hello list,
>
> Can you please help me debug one resource not being started after
> node failover?
>
> Here is the configuration that I'm testing: a 3-node (kvm VM) cluster,
> which has:
>
> node 10: aic-controller-58055.test.domain.local
> node 6: aic-controller-50186.test.domain.local
> node 9: aic-controller-12993.test.domain.local
> primitive cmha cmha \
>   params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" pidfile="/var/run/cmha/cmha.pid" user=cmha \
>   meta failure-timeout=30 resource-stickiness=1 target-role=Started migration-threshold=3 \
>   op monitor interval=10 on-fail=restart timeout=20 \
>   op start interval=0 on-fail=restart timeout=60 \
>   op stop interval=0 on-fail=block timeout=90

What is the output of crm_mon -1frA once a node is down ... any failed
actions?

> primitive sysinfo_aic-controller-12993.test.domain.local ocf:pacemaker:SysInfo \
>   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>   op monitor interval=15s
> primitive sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \
>   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>   op monitor interval=15s
> primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \
>   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>   op monitor interval=15s

You can use a clone for this sysinfo resource and a symmetric cluster for a
more compact configuration ... then you can skip all these location
constraints. (A rough sketch follows further down.)

> location cmha-on-aic-controller-12993.test.domain.local cmha 100: aic-controller-12993.test.domain.local
> location cmha-on-aic-controller-50186.test.domain.local cmha 100: aic-controller-50186.test.domain.local
> location cmha-on-aic-controller-58055.test.domain.local cmha 100: aic-controller-58055.test.domain.local
> location sysinfo-on-aic-controller-12993.test.domain.local sysinfo_aic-controller-12993.test.domain.local inf: aic-controller-12993.test.domain.local
> location sysinfo-on-aic-controller-50186.test.domain.local sysinfo_aic-controller-50186.test.domain.local inf: aic-controller-50186.test.domain.local
> location sysinfo-on-aic-controller-58055.test.domain.local sysinfo_aic-controller-58055.test.domain.local inf: aic-controller-58055.test.domain.local
> property cib-bootstrap-options: \
>   have-watchdog=false \
>   dc-version=1.1.14-70404b0 \
>   cluster-infrastructure=corosync \
>   cluster-recheck-interval=15s \

I have never tried such a low cluster-recheck-interval ... I wouldn't do that.
I have seen setups with low intervals burning a lot of CPU cycles in bigger
clusters, and side effects from aborted transitions. If you set this to
"clean up" the cluster state because you see resource-agent errors, you
should rather fix the resource agent.
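For comparison, the Pacemaker default for that property is 15 minutes. If you
only lowered it to work around the stuck resource, something much closer to
the default should be fine (the exact value below is just an illustration):

    crm configure property cluster-recheck-interval=15min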
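Regarding the clone suggestion above, a rough (untested) sketch of what the
more compact configuration could look like; the resource and clone names are
only placeholders:

    # one SysInfo primitive, cloned so it runs on every node
    primitive sysinfo ocf:pacemaker:SysInfo \
      params disk_unit=M disks="/ /var/log" min_disk_free=512M \
      op monitor interval=15s
    clone sysinfo-clone sysinfo
    # with a symmetric cluster the per-node sysinfo location
    # constraints are no longer needed
    property cib-bootstrap-options: symmetric-cluster=true

The three cmha location constraints could then stay as plain 100-point
preferences, or be dropped as well if you don't care which node runs it.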
Regards,
Andreas

>   no-quorum-policy=stop \
>   stonith-enabled=false \
>   start-failure-is-fatal=false \
>   symmetric-cluster=false \
>   node-health-strategy=migrate-on-red \
>   last-lrm-refresh=1470334410
>
> When all 3 nodes are online everything seems OK; this is the output of scoreshow.sh:
> Resource                                         Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha                                             -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha                                             101        aic-controller-50186.test.domain.local   1           0
> cmha                                             -INFINITY  aic-controller-58055.test.domain.local   1           0
> sysinfo_aic-controller-12993.test.domain.local   INFINITY   aic-controller-12993.test.domain.local   0           0
> sysinfo_aic-controller-50186.test.domain.local   -INFINITY  aic-controller-50186.test.domain.local   0           0
> sysinfo_aic-controller-58055.test.domain.local   INFINITY   aic-controller-58055.test.domain.local   0           0
>
> The problem starts when 1 node goes offline (aic-controller-50186). The
> resource cmha is stuck in the Stopped state.
> Here are the showscores:
> Resource   Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha       -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha       -INFINITY  aic-controller-50186.test.domain.local   1           0
> cmha       -INFINITY  aic-controller-58055.test.domain.local   1           0
>
> Even though it has target-role=Started, pacemaker is skipping this
> resource. And in the logs I see:
> pengine: info: native_print: cmha (ocf::heartbeat:cmha): Stopped
> pengine: info: native_color: Resource cmha cannot run anywhere
> pengine: info: LogActions: Leave cmha (Stopped)
>
> To recover the cmha resource I need to run either:
> 1) crm resource cleanup cmha
> 2) crm resource reprobe
>
> After either of the above commands, the resource is picked up by
> pacemaker again and I see valid scores:
> Resource   Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha       100        aic-controller-58055.test.domain.local   1           0      3
> cmha       101        aic-controller-12993.test.domain.local   1           0      3
> cmha       -INFINITY  aic-controller-50186.test.domain.local   1           0      3
>
> So the questions here: why doesn't the cluster recheck work, and should it
> do reprobing?
> How can I make migration work, or what did I miss in the configuration
> that prevents migration?
>
> corosync 2.3.4
> pacemaker 1.1.14
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org