Public bug reported:

[Description]
Our remote compute node is fenced unexpectedly right after we run the
crm resource cleanup command.

The logs suggest that when the cleanup command is executed, Pacemaker
opens a new connection to the remote node. When it does this, the old
connection drops with the error "Disconnecting from Pacemaker Remote
node ... due to unexpected client takeover".

We do not know the exact cause. It may be a race condition in how
Pacemaker handles multiple connections during the cleanup probe. Since
the remote node was healthy and up, the STONITH/fencing action was
unexpected.
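
To illustrate the sequence we suspect, here is a toy model in Python. It
is not Pacemaker code: the class names, the expecting_takeover flag, and
the state strings are all made up for this sketch. It only assumes that
the remote daemon keeps one active client and tells the previous client
when a new one connects, which is what the log messages suggest.

```python
# Toy model only, NOT Pacemaker code. All names here are hypothetical.


class ToyRemoteDaemon:
    """Keeps a single active client; a new client replaces the old one."""

    def __init__(self):
        self.active = None

    def accept(self, client):
        old = self.active
        self.active = client
        if old is not None and old is not client:
            # The previous client is told that another client connected.
            old.on_new_client()


class ToyController:
    """Stands in for the cluster-side end of the remote connection."""

    def __init__(self, name):
        self.name = name
        self.expecting_takeover = False  # hypothetical flag for this sketch
        self.state = "connected"

    def on_new_client(self):
        if self.expecting_takeover:
            self.state = "handed-over"  # planned move, not a failure
        else:
            self.state = "lost-unexpected"  # the path we believe we hit


daemon = ToyRemoteDaemon()
first = ToyController("existing-connection")
second = ToyController("cleanup-reconnect")

daemon.accept(first)
daemon.accept(second)  # second connection arrives while the first is live
print(first.state, second.state)
```

In this model the first connection ends up in the "lost-unexpected"
state although nothing is wrong with the node itself, which matches what
we observed before the fencing was scheduled.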

[Steps to reproduce]
We do not have a reliable reproducer. The fencing happened after one run
of crm resource cleanup; other runs of the same command completed as
expected.

[Expected Behavior]
The cleanup command should clear the failcounts and probe the resource without 
destroying the active connection to the remote node. Fencing should not happen 
if the node is healthy.

[Actual Behavior]
The connection is dropped due to "unexpected client takeover", the monitor 
operation fails, the connection is marked as "unrecoverable", and the node gets 
fenced.

Logs:

Cleanup:
Jun 09 12:52:46 pacemaker-unit-3-lxd-17 sudo[1621188]:   ubuntu : TTY=pts/1 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/sbin/crm resource cleanup

Controller:
Jun 09 12:52:51.982 pacemaker-unit-5-lxd-14 pacemaker-controld  [1565] (remote_lrm_op_callback)         error: Disconnecting from Pacemaker Remote node compute-host-18.my.domain due to unexpected client takeover
Jun 09 12:52:51.982 pacemaker-unit-5-lxd-14 pacemaker-controld  [1565] (remote_lrm_op_callback)         error: Lost connection to Pacemaker Remote node compute-host-18.my.domain
Jun 09 12:52:52.454 pacemaker-unit-5-lxd-14 pacemaker-schedulerd[1564] (pe_fence_node)  warning: Remote node compute-host-18.my.domain will be fenced: remote connection is unrecoverable
Jun 09 12:52:52.566 pacemaker-unit-5-lxd-14 pacemaker-schedulerd[1564] (stage6)         warning: Scheduling Node compute-host-18.my.domain for STONITH

Remote node:
Jun 09 12:52:51.189 compute-host-18 pacemaker-remoted   [5422] (pcmk__accept_remote_connection)         info: Accepted new remote client connection from ::ffff:192.168.30.197
Jun 09 12:52:51.929 compute-host-18 pacemaker-remoted   [5422] (remoted__read_handshake_data)   notice: Remote client connection accepted
Jun 09 12:52:51.981 compute-host-18 pacemaker-remoted   [5422] (lrmd_remote_client_msg)

I'm attaching more logs for inspection.
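
For anyone going through the attached logs, the takeover-then-fence
sequence can be searched for with something like the sketch below. It
assumes each log entry sits on a single line and uses the timestamp and
message formats seen in the excerpts above. The function names, the
5-second window, and the placeholder year (the logs carry no year) are
assumptions made for this example.

```python
import re
from datetime import datetime, timedelta

# Timestamp prefix as seen above, e.g. "Jun 09 12:52:51.982 ".
TS_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d+) ")
# Message fragments copied from the controller and scheduler log lines.
TAKEOVER_RE = re.compile(
    r"Pacemaker Remote node (\S+) due to unexpected client takeover"
)
FENCE_RE = re.compile(r"Remote node (\S+) will be fenced")


def parse_ts(line, year=2000):
    """Return the line's timestamp, or None if it has no such prefix.

    The year is a placeholder because the log lines do not include one.
    """
    m = TS_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S.%f")


def takeovers_followed_by_fencing(lines, window_seconds=5.0):
    """List (node, takeover_time, fence_time) for fence decisions that
    follow an unexpected takeover of the same node within the window."""
    window = timedelta(seconds=window_seconds)
    last_takeover = {}  # node name -> time of most recent takeover error
    hits = []
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        m = TAKEOVER_RE.search(line)
        if m:
            last_takeover[m.group(1)] = ts
            continue
        m = FENCE_RE.search(line)
        if m:
            node = m.group(1)
            t0 = last_takeover.get(node)
            if t0 is not None and timedelta(0) <= ts - t0 <= window:
                hits.append((node, t0, ts))
    return hits


# Two of the controller-side lines from this report, one entry per line.
sample = [
    "Jun 09 12:52:51.982 pacemaker-unit-5-lxd-14 pacemaker-controld  [1565] "
    "(remote_lrm_op_callback)         error: Disconnecting from Pacemaker "
    "Remote node compute-host-18.my.domain due to unexpected client takeover",
    "Jun 09 12:52:52.454 pacemaker-unit-5-lxd-14 pacemaker-schedulerd[1564] "
    "(pe_fence_node)  warning: Remote node compute-host-18.my.domain will be "
    "fenced: remote connection is unrecoverable",
]
hits = takeovers_followed_by_fencing(sample)
for node, t0, t1 in hits:
    print(node, (t1 - t0).total_seconds())
```

To scan a whole file, pass an open file object instead of the sample
list, since the function only iterates over lines.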

[Environment]
Ubuntu 22.04.5 LTS
Kernel: 5.15.0-131-generic
pacemaker: 2.1.2-1ubuntu3.1

This is a 3-node cluster. The cluster was healthy during the incident.

** Affects: pacemaker (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2159283

Title:
  Pacemaker remote connection drops with "unexpected client takeover"
  after running crm resource cleanup, causing unexpected fencing
