Hi! The cluster I'm configuring (SLES15 SP2) fenced a node last night. I'm still unsure what exactly caused the fencing, but looking at the logs I found this "action plan" that led to it:
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos1 ( h18 -> h19 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos2 ( h19 -> h16 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos3 ( h16 -> h18 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos4 ( h18 -> h19 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos1       ( h18 -> h19 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos2       ( h19 -> h16 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos3       ( h16 -> h18 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos4       ( h18 -> h19 )

Those "cron_snap" resources depend on the corresponding xen resources (colocation). Having four resources to distribute equally across three nodes seems to trigger the problem. After fencing, the action plan was:

Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos2 ( h16 -> h19 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos4 ( h19 -> h16 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Start   prm_cron_snap_test-jeos1 ( h18 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Start   prm_cron_snap_test-jeos3 ( h19 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Recover prm_xen_test-jeos1       ( h19 -> h18 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos2       ( h16 -> h19 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos3       ( h18 -> h19 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos4       ( h19 -> h16 )

...some more recovery actions like that... Currently h18 has two VMs, while the other two nodes have one VM each.
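For reference, the dependency I mentioned is expressed per VM roughly like this (a sketch in crmsh syntax; the constraint id and the mandatory score are my assumptions, the resource names are as above):

    # hypothetical crmsh colocation: the snapshot cron job must run
    # on the same node as its Xen guest
    colocation col_cron_snap_test-jeos1 inf: prm_cron_snap_test-jeos1 prm_xen_test-jeos1

There is one such constraint for each of the four VM/cron pairs.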
Before I added those "cron_snap" resources, I did not observe such "rebalancing". The rebalancing was triggered by this ruleset, present in every xen resource:

meta 1: resource-stickiness=0 \
meta 2: rule 0: date spec hours=7-19 weekdays=1-5 resource-stickiness=1000

At the moment the related scores (crm_simulate -LUs) look like this (filtered and re-ordered):

Original: h16 capacity: utl_ram=231712 utl_cpu=440
Original: h18 capacity: utl_ram=231712 utl_cpu=440
Original: h19 capacity: utl_ram=231712 utl_cpu=440
Remaining: h16 capacity: utl_ram=229664 utl_cpu=420
Remaining: h18 capacity: utl_ram=227616 utl_cpu=400
Remaining: h19 capacity: utl_ram=229664 utl_cpu=420
pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h16: 0
pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h18: 1000
pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h19: -INFINITY
native_assign_node: prm_xen_test-jeos1 utilization on h18: utl_ram=2048 utl_cpu=20
pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h16: 0
pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h18: 1000
pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h19: 0
native_assign_node: prm_xen_test-jeos2 utilization on h18: utl_ram=2048 utl_cpu=20
pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h16: 0
pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h18: 0
pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h19: 1000
native_assign_node: prm_xen_test-jeos3 utilization on h19: utl_ram=2048 utl_cpu=20
pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h16: 1000
pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h18: 0
pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h19: 0
native_assign_node: prm_xen_test-jeos4 utilization on h16: utl_ram=2048 utl_cpu=20

Does that ring-shifting of resources look like a bug in Pacemaker?
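In case it helps, the meta ruleset above should correspond to CIB XML along these lines (my reconstruction; all ids are placeholders, only the date_spec fields and the stickiness value are taken from the config): a rule-scoped meta_attributes set that raises resource-stickiness to 1000 during business hours, on top of the unconditional resource-stickiness=0.

    <meta_attributes id="prm_xen_test-jeos1-meta-1">
      <nvpair id="prm_xen_test-jeos1-meta-1-stickiness" name="resource-stickiness" value="0"/>
    </meta_attributes>
    <meta_attributes id="prm_xen_test-jeos1-meta-2">
      <rule id="prm_xen_test-jeos1-meta-2-rule" score="0">
        <date_expression id="prm_xen_test-jeos1-meta-2-expr" operation="date_spec">
          <date_spec id="prm_xen_test-jeos1-meta-2-spec" hours="7-19" weekdays="1-5"/>
        </date_expression>
      </rule>
      <nvpair id="prm_xen_test-jeos1-meta-2-stickiness" name="resource-stickiness" value="1000"/>
    </meta_attributes>

So outside 07:00-19:00 on weekdays the effective stickiness drops to 0, which is presumably when the scheduler feels free to rebalance.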
Regards, Ulrich _______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/