Hi! The cluster I'm configuring (SLES15 SP2) fenced a node last night. I'm still unsure what exactly caused the fencing, but looking at the logs I found this "action plan" that led to it:
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos1 ( h18 -> h19 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos2 ( h19 -> h16 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos3 ( h16 -> h18 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos4 ( h18 -> h19 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos1       ( h18 -> h19 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos2       ( h19 -> h16 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos3       ( h16 -> h18 )
Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos4       ( h18 -> h19 )

Those "cron_snap" resources depend on the corresponding xen resources (colocation). Having four resources to distribute equally across three nodes seems to trigger the problem. After fencing, the action plan was:

Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos2 ( h16 -> h19 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Move    prm_cron_snap_test-jeos4 ( h19 -> h16 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Start   prm_cron_snap_test-jeos1 ( h18 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Start   prm_cron_snap_test-jeos3 ( h19 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Recover prm_xen_test-jeos1       ( h19 -> h18 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos2       ( h16 -> h19 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos3       ( h18 -> h19 )
Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos4       ( h19 -> h16 )

...some more recovery actions like that... Currently h18 has two VMs, while the other two nodes have one VM each.
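For reference, the dependency I mentioned is expressed per VM roughly like this (a sketch in crmsh syntax; the constraint id and the mandatory score are my assumptions, the resource names are as above):

    # hypothetical crmsh colocation: the snapshot cron job must run
    # on the same node as its Xen guest
    colocation col_cron_snap_test-jeos1 inf: prm_cron_snap_test-jeos1 prm_xen_test-jeos1

There is one such constraint for each of the four VM/cron pairs.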
Before I added those "cron_snap" resources, I did not observe such "rebalancing". The rebalancing was triggered by this ruleset, present in every xen resource:

meta 1: resource-stickiness=0 \
meta 2: rule 0: date spec hours=7-19 weekdays=1-5 resource-stickiness=1000

At the moment the related scores (crm_simulate -LUs) look like this (filtered and re-ordered):

Original: h16 capacity: utl_ram=231712 utl_cpu=440
Original: h18 capacity: utl_ram=231712 utl_cpu=440
Original: h19 capacity: utl_ram=231712 utl_cpu=440
Remaining: h16 capacity: utl_ram=229664 utl_cpu=420
Remaining: h18 capacity: utl_ram=227616 utl_cpu=400
Remaining: h19 capacity: utl_ram=229664 utl_cpu=420
pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h16: 0
pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h18: 1000
pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h19: -INFINITY
native_assign_node: prm_xen_test-jeos1 utilization on h18: utl_ram=2048 utl_cpu=20
pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h16: 0
pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h18: 1000
pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h19: 0
native_assign_node: prm_xen_test-jeos2 utilization on h18: utl_ram=2048 utl_cpu=20
pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h16: 0
pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h18: 0
pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h19: 1000
native_assign_node: prm_xen_test-jeos3 utilization on h19: utl_ram=2048 utl_cpu=20
pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h16: 1000
pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h18: 0
pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h19: 0
native_assign_node: prm_xen_test-jeos4 utilization on h16: utl_ram=2048 utl_cpu=20

Does that ring-shifting of resources look like a bug in Pacemaker?
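In case it helps, the meta ruleset above should correspond to CIB XML along these lines (my reconstruction; all ids are placeholders, only the date_spec fields and the stickiness value are taken from the config): a rule-scoped meta_attributes set that raises resource-stickiness to 1000 during business hours, on top of the unconditional resource-stickiness=0.

    <meta_attributes id="prm_xen_test-jeos1-meta-1">
      <nvpair id="prm_xen_test-jeos1-meta-1-stickiness" name="resource-stickiness" value="0"/>
    </meta_attributes>
    <meta_attributes id="prm_xen_test-jeos1-meta-2">
      <rule id="prm_xen_test-jeos1-meta-2-rule" score="0">
        <date_expression id="prm_xen_test-jeos1-meta-2-expr" operation="date_spec">
          <date_spec id="prm_xen_test-jeos1-meta-2-spec" hours="7-19" weekdays="1-5"/>
        </date_expression>
      </rule>
      <nvpair id="prm_xen_test-jeos1-meta-2-stickiness" name="resource-stickiness" value="1000"/>
    </meta_attributes>

So outside 07:00-19:00 on weekdays the effective stickiness drops to 0, which is presumably when the scheduler feels free to rebalance.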
Regards, Ulrich _______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/