Hello all, I have a fairly simple two-node cluster that runs three resources: promotable Postgres, fencing, and a virtual IP (VIP). The cluster is running on AlmaLinux 9.5 (a RHEL 9 variant). In recent months, I have noticed that the cluster will occasionally hang when shutting down. I use "pcs" to manage the cluster, so the shutdown command used is "pcs cluster stop --all".
During the last hang, I observed that all the resources appeared to be shut
down except the virtual IP - the VIP remained in the "Started" state, and the
cluster remained running on the node where the VIP was running. I was
eventually able to stop the cluster by issuing "pcs cluster stop --all
--request-timeout=1".
I have been using this same cluster configuration (across multiple OS releases)
for years, and have never experienced a shutdown hang before. Unfortunately, I
cannot reliably reproduce the scenario, but it has definitely happened on
multiple occasions.
Some config information:
Linux node1.my.org 5.14.0-503.38.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 18 08:52:10 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
corosync.x86_64    3.1.8-2.el9
pacemaker.x86_64   2.1.8-3.el9
pcs.x86_64         0.11.8-1.el9_5.1.alma.1
Cluster constraints:
Location Constraints:
  resource 'fence_node1' avoids node 'node1.my.org' with score INFINITY
  resource 'fence_node2' avoids node 'node2.my.org' with score INFINITY
Colocation Constraints:
  Started resource 'pgsql-ha-vip' with Promoted resource 'pgsql-clone'
    score=INFINITY
Order Constraints:
  promote resource 'pgsql-clone' then start resource 'pgsql-ha-vip'
    symmetrical=0 kind=Mandatory
  demote resource 'pgsql-clone' then stop resource 'pgsql-ha-vip'
    symmetrical=0 kind=Mandatory
Although I'm not super adept at parsing the pacemaker logs, the following error
messages looked problematic:
May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (log_list_item) notice: Actions: Stop pgsql-ha-vip ( node1.my.org ) due to node availability (blocked)
May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (pcmk__create_graph) crit: Cannot shut down node1.my.org because of pgsql-ha-vip: blocked (pgsql-ha-vip_stop_0)
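
In case it helps anyone scan a full log for the same condition, here is a minimal Python sketch that pulls out the "Cannot shut down ... blocked" messages. The regular expression is inferred only from the single crit line quoted above, the function name find_blocked_shutdowns is made up for illustration, and the code assumes each message occupies a single line in the actual log file and begins with the same "Mon DD HH:MM:SS.mmm" timestamp.

```python
import re

# Pattern inferred from the one schedulerd crit message quoted above;
# other Pacemaker versions may word this message differently.
BLOCKED_SHUTDOWN = re.compile(
    r"Cannot shut down (?P<node>\S+) because of "
    r"(?P<resource>[^:\s]+): blocked \((?P<action>[^)]+)\)"
)


def find_blocked_shutdowns(lines):
    """Return (timestamp, node, resource, action) for each matching log line."""
    hits = []
    for line in lines:
        match = BLOCKED_SHUTDOWN.search(line)
        if match is None:
            continue
        # Assumption: the first three whitespace-separated fields are the
        # timestamp, e.g. "May 08 14:59:19.492".
        timestamp = " ".join(line.split()[:3])
        hits.append(
            (timestamp, match["node"], match["resource"], match["action"])
        )
    return hits


if __name__ == "__main__":
    sample = [
        "May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] "
        "(pcmk__create_graph) crit: Cannot shut down node1.my.org because of "
        "pgsql-ha-vip: blocked (pgsql-ha-vip_stop_0)",
    ]
    for hit in find_blocked_shutdowns(sample):
        print(hit)
```

To run it against a real log, pass an open file handle instead of the sample list, since the function only needs an iterable of lines.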
A sanitized Pacemaker log of the hang event (5/8/2025 at 14:59) is attached.
Is this a latent configuration problem that's just now showing up, or a problem
with the Pacemaker version currently in EL9?
Any thoughts appreciated,
Larry Mills
pacemaker.log.gz
