Thanks Ken, I have figured out the root cause. There was a race condition between adding the resource and creating the constraint (location).
When the resource was added to the cluster, all nodes tried to 'probe' the resource via its OCF agent, and because the agent did not exist on 2 of the 3 nodes at that time, pacemaker marked them as -INFINITY. Even updating the constraints (adding locations) does not automatically reprobe/clean the resource. It was hard to debug, because the CLI commands showed no errors or failed actions for this resource.
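Roughly, the workaround looks like this. This is only a sketch: the agent path assumes the standard OCF resource.d layout for ocf:heartbeat:cmha and the node loop is just for illustration, so adjust both to your packaging. The cleanup/reprobe commands are the ones already mentioned below in the thread.

    # Make sure the agent exists on every node *before* defining the primitive,
    # otherwise the initial probes fail there and leave -INFINITY bans behind.
    # (Path is an assumption: standard OCF resource.d layout.)
    for n in aic-controller-58055 aic-controller-50186 aic-controller-12993; do
        ssh "$n.test.domain.local" test -x /usr/lib/ocf/resource.d/heartbeat/cmha \
            || echo "cmha agent missing on $n"
    done

    # If the bans are already recorded, force fresh probes once the agent is in place:
    crm resource cleanup cmha
    # or, to reprobe everything cluster-wide:
    crm resource reprobe
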
On Tue, Aug 16, 2016 at 1:29 PM, Ken Gaillot <[email protected]> wrote:
> On 08/05/2016 05:12 PM, Nikita Koshikov wrote:
> > Thanks, Ken,
> >
> > On Fri, Aug 5, 2016 at 7:21 AM, Ken Gaillot <[email protected]> wrote:
> >
> > On 08/05/2016 03:48 AM, Andreas Kurz wrote:
> > > Hi,
> > >
> > > On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <[email protected]> wrote:
> > >
> > > Hello list,
> > >
> > > Can you, please, help me in debugging 1 resource not being started
> > > after node failover ?
> > >
> > > Here is configuration that I'm testing:
> > > 3 nodes (kvm VM) cluster, that have:
> > >
> > > node 10: aic-controller-58055.test.domain.local
> > > node 6: aic-controller-50186.test.domain.local
> > > node 9: aic-controller-12993.test.domain.local
> > > primitive cmha cmha \
> > >   params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" \
> > >          pidfile="/var/run/cmha/cmha.pid" user=cmha \
> > >   meta failure-timeout=30 resource-stickiness=1 target-role=Started migration-threshold=3 \
> > >   op monitor interval=10 on-fail=restart timeout=20 \
> > >   op start interval=0 on-fail=restart timeout=60 \
> > >   op stop interval=0 on-fail=block timeout=90
> > >
> > > What is the output of crm_mon -1frA once a node is down ... any failed
> > > actions?
> > >
> > > primitive sysinfo_aic-controller-12993.test.domain.local ocf:pacemaker:SysInfo \
> > >   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> > >   op monitor interval=15s
> > > primitive sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \
> > >   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> > >   op monitor interval=15s
> > > primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \
> > >   params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> > >   op monitor interval=15s
> > >
> > > You can use a clone for this sysinfo resource and a symmetric cluster
> > > for a more compact configuration .... then you can skip all these
> > > location constraints.
> > >
> > > location cmha-on-aic-controller-12993.test.domain.local cmha 100: aic-controller-12993.test.domain.local
> > > location cmha-on-aic-controller-50186.test.domain.local cmha 100: aic-controller-50186.test.domain.local
> > > location cmha-on-aic-controller-58055.test.domain.local cmha 100: aic-controller-58055.test.domain.local
> > > location sysinfo-on-aic-controller-12993.test.domain.local sysinfo_aic-controller-12993.test.domain.local inf: aic-controller-12993.test.domain.local
> > > location sysinfo-on-aic-controller-50186.test.domain.local sysinfo_aic-controller-50186.test.domain.local inf: aic-controller-50186.test.domain.local
> > > location sysinfo-on-aic-controller-58055.test.domain.local sysinfo_aic-controller-58055.test.domain.local inf: aic-controller-58055.test.domain.local
> > > property cib-bootstrap-options: \
> > >   have-watchdog=false \
> > >   dc-version=1.1.14-70404b0 \
> > >   cluster-infrastructure=corosync \
> > >   cluster-recheck-interval=15s \
> > >
> > > Never tried such a low cluster-recheck-interval ... wouldn't do that. I
> > > saw setups with low intervals burning a lot of cpu cycles in bigger
> > > cluster setups and side-effects from aborted transitions. If you do this
> > > for "cleanup" the cluster state because you see resource-agent errors,
> > > you should better fix the resource agent.
> >
> > Strongly agree -- your recheck interval is lower than the various action
> > timeouts. The only reason recheck interval should ever be set less than
> > about 5 minutes is if you have time-based rules that you want to trigger
> > with a finer granularity.
> >
> > Your issue does not appear to be coming from recheck interval, otherwise
> > it would go away after the recheck interval passed.
> >
> > As of small cluster-recheck-interval - this was only for testing.
> >
> > > Regards,
> > > Andreas
> > >
> > >   no-quorum-policy=stop \
> > >   stonith-enabled=false \
> > >   start-failure-is-fatal=false \
> > >   symmetric-cluster=false \
> > >   node-health-strategy=migrate-on-red \
> > >   last-lrm-refresh=1470334410
> > >
> > > When 3 nodes are online, everything seemed OK; this is the output of scoreshow.sh:
> > > Resource   Score      Node                                     Stickiness  #Fail  Migration-Threshold
> > > cmha       -INFINITY  aic-controller-12993.test.domain.local   1           0
> > > cmha       101        aic-controller-50186.test.domain.local   1           0
> > > cmha       -INFINITY  aic-controller-58055.test.domain.local   1           0
> >
> > Everything is not OK; cmha has -INFINITY scores on two nodes, meaning it
> > won't be allowed to run on them. This is why it won't start after the
> > one allowed node goes down, and why cleanup gets it working again
> > (cleanup removes bans caused by resource failures).
> >
> > It's likely the resource previously failed the maximum allowed times
> > (migration-threshold=3) on those two nodes.
> >
> > The next step would be to figure out why the resource is failing. The
> > pacemaker logs will show any output from the resource agent.
> >
> > The resource was never started on these nodes. Maybe the problem is in the flow? We
> > deploy:
> >
> > 1) 1 node with all 3 IPs in corosync.conf
> > 2) set no-quorum policy = ignore
> > 3) add 2 nodes to corosync cluster
> > 4) create resource + 1 location constraint
> > 5) add 2 additional constraints
> > 6) set no-quorum policy = stop
> >
> > The time between 4-5 is about 1 minute. And it's clear why 2 nodes were
> > -INFINITY in this period.
> > But why, when we add 2 more constraints, are they not updating the scores, and can this be changed?
>
> The resource may have never started on those nodes, but are you sure a
> start wasn't attempted and failed? If the start failed, the -INFINITY
> score would come from the failure, rather than only the cluster being
> asymmetric.
>
> > > sysinfo_aic-controller-12993.test.domain.local  INFINITY   aic-controller-12993.test.domain.local   0   0
> > > sysinfo_aic-controller-50186.test.domain.local  -INFINITY  aic-controller-50186.test.domain.local   0   0
> > > sysinfo_aic-controller-58055.test.domain.local  INFINITY   aic-controller-58055.test.domain.local   0   0
> > >
> > > The problem starts when 1 node goes offline (aic-controller-50186).
> > > The resource cmha is stuck in stopped state.
> > > Here is the showscores:
> > > Resource   Score      Node                                     Stickiness  #Fail  Migration-Threshold
> > > cmha       -INFINITY  aic-controller-12993.test.domain.local   1           0
> > > cmha       -INFINITY  aic-controller-50186.test.domain.local   1           0
> > > cmha       -INFINITY  aic-controller-58055.test.domain.local   1           0
> > >
> > > Even though it has target-role=Started, pacemaker is skipping this resource.
> > > And in the logs I see:
> > > pengine: info: native_print: cmha (ocf::heartbeat:cmha): Stopped
> > > pengine: info: native_color: Resource cmha cannot run anywhere
> > > pengine: info: LogActions: Leave cmha (Stopped)
> > >
> > > To recover the cmha resource I need to run either:
> > > 1) crm resource cleanup cmha
> > > 2) crm resource reprobe
> > >
> > > After any of the above commands, the resource begins to be picked up by
> > > pacemaker and I see valid scores:
> > > Resource   Score      Node                                     Stickiness  #Fail  Migration-Threshold
> > > cmha       100        aic-controller-58055.test.domain.local   1           0      3
> > > cmha       101        aic-controller-12993.test.domain.local   1           0      3
> > > cmha       -INFINITY  aic-controller-50186.test.domain.local   1           0      3
> > >
> > > So the questions here - why doesn't cluster-recheck work, and should
> > > it do reprobing?
> > > How to make migration work, or what did I miss in the configuration that
> > > prevents migration?
> > >
> > > corosync 2.3.4
> > > pacemaker 1.1.14
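As a side note for the archives, Andreas's earlier suggestion of a cloned SysInfo resource in a symmetric cluster would look roughly like this in crmsh (a sketch based on the primitives quoted above, not something applied in this setup):

    primitive sysinfo ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
    clone clone_sysinfo sysinfo
    property symmetric-cluster=true
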
_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
