[ClusterLabs] Cluster (sometimes) hangs during shutdown - EL9

Larry G. Mills via Users Mon, 12 May 2025 15:12:41 -0700

Hello all,

I have a fairly simple two-node cluster that supports three resources - 
promotable Postgres, fencing, and virtual IP.  This cluster is running on 
AlmaLinux 9.5 (RHEL9 variant).  In recent months, I have noticed that the 
cluster will occasionally hang when shutting down.  I use "pcs" to manage the 
cluster, so the shutdown command used is "pcs cluster stop --all".


During the last hang, I observed that all the resources appeared to be shut 
down except the virtual IP - the VIP remained in the "Started" state, and the 
cluster remained running on the node where the VIP was running.  I was 
eventually able to stop the cluster by issuing 
"pcs cluster stop --all --request-timeout=1".

I have been using this same cluster configuration (across multiple OS releases) 
for years, and have never experienced a shutdown hang before.  Unfortunately, I 
cannot reliably reproduce the scenario, but it has definitely happened on 
multiple occasions.


Some config information:

Linux node1.my.org 5.14.0-503.38.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 18 08:52:10 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
corosync.x86_64     3.1.8-2.el9
pacemaker.x86_64    2.1.8-3.el9
pcs.x86_64          0.11.8-1.el9_5.1.alma.1


Cluster constraints:

Location Constraints:
  resource 'fence_node1' avoids node 'node1.my.org' with score INFINITY
  resource 'fence_node2' avoids node 'node2.my.org' with score INFINITY
Colocation Constraints:
  Started resource 'pgsql-ha-vip' with Promoted resource 'pgsql-clone'
    score=INFINITY
Order Constraints:
  promote resource 'pgsql-clone' then start resource 'pgsql-ha-vip'
    symmetrical=0 kind=Mandatory
  demote resource 'pgsql-clone' then stop resource 'pgsql-ha-vip'
    symmetrical=0 kind=Mandatory


Although I'm not super adept at parsing the pacemaker logs, the following error 
messages looked problematic:

May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (log_list_item) notice: Actions: Stop       pgsql-ha-vip     ( node1.my.org )  due to node availability (blocked)
May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (pcmk__create_graph) crit: Cannot shut down node1.my.org because of pgsql-ha-vip: blocked (pgsql-ha-vip_stop_0)
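
For anyone who wants to pull the same kind of message out of a full 
pacemaker.log, here is a minimal Python sketch.  The function name and the 
regular expression are purely illustrative (they are not part of Pacemaker or 
pcs), and the pattern assumes each log entry is on a single line in the format 
of the "Cannot shut down" message quoted above.

```python
import re

# Pattern modelled only on the schedulerd "crit" message quoted above;
# other Pacemaker versions may word this message differently (assumption).
BLOCKED_SHUTDOWN = re.compile(
    r"Cannot shut down (?P<node>\S+) because of (?P<rsc>[^\s:]+): "
    r"(?P<reason>.+?) \((?P<action>[^\s)]+)\)"
)


def find_blocked_shutdowns(lines):
    """Return (timestamp, node, resource, action) for each matching line."""
    hits = []
    for line in lines:
        match = BLOCKED_SHUTDOWN.search(line)
        if match:
            # The first three whitespace-separated fields form the
            # timestamp, e.g. "May 08 14:59:19.492".
            timestamp = " ".join(line.split()[:3])
            hits.append(
                (timestamp, match["node"], match["rsc"], match["action"])
            )
    return hits


# Sample input: the "crit" line from the log excerpt, joined onto one line.
sample = [
    "May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] "
    "(pcmk__create_graph)        crit: Cannot shut down node1.my.org "
    "because of pgsql-ha-vip: blocked (pgsql-ha-vip_stop_0)",
]

for hit in find_blocked_shutdowns(sample):
    print(hit)
```

To scan a real log, pass an open file object instead of the sample list, for 
example find_blocked_shutdowns(open(path)) with path pointing at the 
uncompressed log.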


A sanitized pacemaker log of the hang event is attached - 5/8/2025 @14:59.

Is this a latent configuration problem that's just now showing up, or a problem 
with the Pacemaker version currently in EL9?

Any thoughts appreciated,

Larry Mills

pacemaker.log.gz

