>>> "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> wrote on 15.01.2021 at 09:36 in message
<60015410020000a10003e...@gwsmtp.uni-regensburg.de>:
> Hi!
>
> The cluster I'm configuring (SLES15 SP2) fenced a node last night. I'm still
> unsure what exactly caused the fencing, but looking at the logs I found this
> "action plan" that led to fencing:

I think I found the reason for the fencing: I had renamed a VM, but kept the
UUID:

Jan 14 20:05:26 h19 libvirtd[4361]: operation failed: domain 'test-jeos' is already defined with uuid 9a0f9ea5-a587-4a99-be44-bce079199c12

> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]:  notice:  * Move       prm_cron_snap_test-jeos1     ( h18 -> h19 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]:  notice:  * Move       prm_cron_snap_test-jeos2     ( h19 -> h16 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]:  notice:  * Move       prm_cron_snap_test-jeos3     ( h16 -> h18 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]:  notice:  * Move       prm_cron_snap_test-jeos4     ( h18 -> h19 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]:  notice:  * Migrate    prm_xen_test-jeos1           ( h18 -> h19 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]:  notice:  * Migrate    prm_xen_test-jeos2           ( h19 -> h16 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]:  notice:  * Migrate    prm_xen_test-jeos3           ( h16 -> h18 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]:  notice:  * Migrate    prm_xen_test-jeos4           ( h18 -> h19 )
>
> Those "cron_snap" resources depend on the corresponding xen resources
> (colocation).
> Having four resources to be distributed equally across three nodes seems to
> trigger that problem.
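As a side note on the UUID collision: one way to avoid it when renaming or
cloning a domain is to drop the stale <uuid> element from the dumped XML
before defining the new domain, so libvirt generates a fresh UUID itself.
This is only a sketch of that idea; the file name and XML content below are
illustrative, not taken from the cluster above.

```shell
# Illustrative domain XML, as "virsh dumpxml" might produce it
# (hypothetical file name; trimmed to the relevant elements):
cat > test-jeos-new.xml <<'EOF'
<domain type='xen'>
  <name>test-jeos-new</name>
  <uuid>9a0f9ea5-a587-4a99-be44-bce079199c12</uuid>
  <memory unit='KiB'>2097152</memory>
</domain>
EOF

# Remove the stale <uuid> element; libvirt assigns a new UUID on define:
sed -i '/<uuid>/d' test-jeos-new.xml

# Then (on a real host): virsh define test-jeos-new.xml
```

With the <uuid> line gone, "virsh define" no longer conflicts with the old
definition that still holds that UUID.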
>
> After fencing, the action plan was:
>
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]:  notice:  * Move       prm_cron_snap_test-jeos2     ( h16 -> h19 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]:  notice:  * Move       prm_cron_snap_test-jeos4     ( h19 -> h16 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]:  notice:  * Start      prm_cron_snap_test-jeos1     ( h18 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]:  notice:  * Start      prm_cron_snap_test-jeos3     ( h19 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]:  notice:  * Recover    prm_xen_test-jeos1           ( h19 -> h18 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]:  notice:  * Migrate    prm_xen_test-jeos2           ( h16 -> h19 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]:  notice:  * Migrate    prm_xen_test-jeos3           ( h18 -> h19 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]:  notice:  * Migrate    prm_xen_test-jeos4           ( h19 -> h16 )
>
> ...some more recovery actions like that...
>
> Currently h18 has two VMs, while the other two nodes have one VM each.
>
> Before I added those "cron_snap" resources, I did not see such
> "rebalancing".
>
> The rebalancing was triggered by this rule set, present in every xen
> resource:
>
>         meta 1: resource-stickiness=0 \
>         meta 2: rule 0: date spec hours=7-19 weekdays=1-5 resource-stickiness=1000
>
> At the moment the related scores (crm_simulate -LUs) look like this
> (filtered and re-ordered):
>
> Original: h16 capacity: utl_ram=231712 utl_cpu=440
> Original: h18 capacity: utl_ram=231712 utl_cpu=440
> Original: h19 capacity: utl_ram=231712 utl_cpu=440
>
> Remaining: h16 capacity: utl_ram=229664 utl_cpu=420
> Remaining: h18 capacity: utl_ram=227616 utl_cpu=400
> Remaining: h19 capacity: utl_ram=229664 utl_cpu=420
>
> pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h16: 0
> pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h18: 1000
> pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h19: -INFINITY
> native_assign_node: prm_xen_test-jeos1 utilization on h18: utl_ram=2048 utl_cpu=20
>
> pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h16: 0
> pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h18: 1000
> pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h19: 0
> native_assign_node: prm_xen_test-jeos2 utilization on h18: utl_ram=2048 utl_cpu=20
>
> pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h16: 0
> pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h18: 0
> pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h19: 1000
> native_assign_node: prm_xen_test-jeos3 utilization on h19: utl_ram=2048 utl_cpu=20
>
> pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h16: 1000
> pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h18: 0
> pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h19: 0
> native_assign_node: prm_xen_test-jeos4 utilization on h16: utl_ram=2048 utl_cpu=20
>
> Does that ring-shifting of resources look like a bug in pacemaker?
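One observation on that rule set: with resource-stickiness=0 outside the
07:00-19:00 weekday window, moving a resource costs nothing, so the
scheduler is free to reshuffle everything whenever the scores change
slightly. Keeping a small nonzero stickiness at all times should suppress
such zero-cost "ring shifts". A sketch in crm shell syntax, with purely
illustrative values (not a tested configuration):

```
# Sketch: nonzero base stickiness so off-hours moves are not free;
# the higher value during business hours still pins resources then.
primitive prm_xen_test-jeos1 ... \
        meta resource-stickiness=100 \
        meta rule date spec hours=7-19 weekdays=1-5 resource-stickiness=1000
```

The exact base value is a trade-off: it must be large enough to outweigh
whatever score difference is driving the rebalancing.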
>
> Regards,
> Ulrich

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/