Hi,

A user noticed that after changing a non-reloadable (unique) parameter of resource A in our cluster, A wasn't restarted as expected. On closer inspection it turned out that the parameter change was coupled with a utilization change as well, which necessitated shuffling resources around. All but a few resources have allow-migrate set to true. Pacemaker decided to migrate resource B to make room for the grown A, then migrate A to B's previous node. And that's it: it didn't restart A, and A kept running with the old parameter value until I manually restarted it later.

It happened with Pacemaker 1.1.16.

A further detail which might play a role in the above: resource parameter modification in this setup is a multi-step process to provide mutual exclusion. First we create a dummy unmanaged "lock" resource, then a shadow CIB, in which the parameter and utilization changes are made and a simulation is run, then we commit the shadow CIB and finally delete the "lock" resource. This means that the first transition triggered by the shadow commit is immediately aborted by the "lock" removal on its heels, but surprisingly, the logs don't separate these cleanly (A=vm-eduid-node5, B=vm-pws-web5, vm-rad* are irrelevant, the 8 cluster nodes are vhbl0[1-8]):

16:25:42 crmd[11822]: notice: State transition S_IDLE -> S_POLICY_ENGINE
16:25:43 pengine[11821]: warning: Processing failed op monitor for vm-rad02-vh-dmz-sulinet-hu-eduroam on vhbl03: not running (7)
16:25:43 pengine[11821]: warning: Processing failed op monitor for vm-rad03-vh-dmz-sulinet-hu-eduroam on vhbl03: not running (7)
16:25:43 pengine[11821]: notice: Migrate vm-eduid-node5#011(Started vhbl07 -> vhbl08)
16:25:43 pengine[11821]: notice: Migrate vm-pws-web5#011(Started vhbl08 -> vhbl04)
16:25:43 pengine[11821]: notice: Calculated transition 4376, saving inputs in /var/lib/pacemaker/pengine/pe-input-1584.bz2
16:25:43 pengine[11821]: warning: Processing failed op monitor for vm-rad02-vh-dmz-sulinet-hu-eduroam on vhbl03: not running (7)
16:25:43 pengine[11821]: warning: Processing failed op monitor for vm-rad03-vh-dmz-sulinet-hu-eduroam on vhbl03: not running (7)
16:25:44 pengine[11821]: notice: Removing CIB_LOCK from vhbl01
16:25:44 pengine[11821]: notice: Removing CIB_LOCK from vhbl02
16:25:44 pengine[11821]: notice: Removing CIB_LOCK from vhbl03
16:25:44 pengine[11821]: notice: Removing CIB_LOCK from vhbl04
16:25:44 pengine[11821]: notice: Removing CIB_LOCK from vhbl06
16:25:44 pengine[11821]: notice: Removing CIB_LOCK from vhbl05
16:25:44 pengine[11821]: notice: Removing CIB_LOCK from vhbl07
16:25:44 pengine[11821]: notice: Removing CIB_LOCK from vhbl08
16:25:44 pengine[11821]: notice: Migrate vm-eduid-node5#011(Started vhbl07 -> vhbl08)
16:25:44 pengine[11821]: notice: Migrate vm-pws-web5#011(Started vhbl08 -> vhbl04)
16:25:44 pengine[11821]: notice: Calculated transition 4377, saving inputs in /var/lib/pacemaker/pengine/pe-input-1585.bz2
16:25:44 crmd[11822]: notice: Initiating delete operation CIB_LOCK_delete_0 locally on vhbl08
16:25:44 crmd[11822]: notice: Initiating delete operation CIB_LOCK_delete_0 on vhbl07
16:25:44 crmd[11822]: notice: Initiating delete operation CIB_LOCK_delete_0 on vhbl05
16:25:44 crmd[11822]: notice: Initiating delete operation CIB_LOCK_delete_0 on vhbl06
16:25:44 crmd[11822]: notice: Initiating delete operation CIB_LOCK_delete_0 on vhbl04
16:25:44 crmd[11822]: notice: Initiating delete operation CIB_LOCK_delete_0 on vhbl03
16:25:44 crmd[11822]: notice: Transition aborted by deletion of lrm_resource[@id='CIB_LOCK']: Resource state removal
16:25:45 crmd[11822]: notice: Transition 4377 (Complete=12, Pending=0, Fired=0, Skipped=3, Incomplete=15, Source=/var/lib/pacemaker/pengine/pe-input-1585.bz2): Stopped
16:25:46 pengine[11821]: warning: Processing failed op monitor for vm-rad02-vh-dmz-sulinet-hu-eduroam on vhbl03: not running (7)
16:25:46 pengine[11821]: warning: Processing failed op monitor for vm-rad03-vh-dmz-sulinet-hu-eduroam on vhbl03: not running (7)
16:25:46 pengine[11821]: notice: Removing CIB_LOCK from vhbl01
16:25:46 pengine[11821]: notice: Removing CIB_LOCK from vhbl02
16:25:46 pengine[11821]: notice: Migrate vm-eduid-node5#011(Started vhbl07 -> vhbl08)
16:25:46 pengine[11821]: notice: Migrate vm-pws-web5#011(Started vhbl08 -> vhbl04)
16:25:46 pengine[11821]: notice: Calculated transition 4378, saving inputs in /var/lib/pacemaker/pengine/pe-input-1586.bz2
16:25:46 crmd[11822]: notice: Initiating delete operation CIB_LOCK_delete_0 on vhbl02
16:25:46 crmd[11822]: notice: Initiating delete operation CIB_LOCK_delete_0 on vhbl01
16:25:46 crmd[11822]: notice: Initiating migrate_to operation vm-pws-web5_migrate_to_0 locally on vhbl08

This looks like a resource management bug to me, but maybe we're doing something wrong (certainly not optimally, please forgive that part). Detailed logs, pe-input and cib files are still around, but I need advice about where to dig, so I'll be grateful for your comments.
-- 
Thanks,
Feri
