Pacemaker logs with CIB messages

** Attachment added: "pacemaker-logs.txt"
   https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/2159283/+attachment/5980066/+files/pacemaker-logs.txt

** Description changed:

  [Description]
  We are seeing a problem where our remote compute node gets fenced unexpectedly right after we run the crm resource cleanup command.

  Looking at the logs, it seems that when the cleanup command is executed, Pacemaker tries to open a new connection to the remote node. When it does, the old connection drops with the error "Disconnecting from Pacemaker Remote node... due to unexpected client takeover".

  We do not know exactly what is causing this. It may be a race condition in how Pacemaker handles multiple connections during the cleanup probe. Since the remote node was healthy and up, the STONITH/fencing action was unexpected.

- [Steps to reproduce]
  Other attempts to run crm cleanup executed as expected.

  [Expected Behavior]
  The cleanup command should clear the failcounts and probe the resource without destroying the active connection to the remote node. Fencing should not happen if the node is healthy.

  [Actual Behavior]
  The connection is dropped due to "unexpected client takeover", the monitor operation fails, the connection is marked as "unrecoverable", and the node gets fenced.

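To check the ordering described in [Actual Behavior] against a merged log, a small script can sort entries from all hosts by timestamp and measure how long after a new remote client connection the takeover error appears. This is only a sketch: the line pattern is inferred from the excerpts in this report, the function names (parse, takeover_gaps) are made up for illustration, the year is a placeholder because the log lines carry none, and it assumes the hosts' clocks are synchronized and that month names parse under an English locale.

```python
import re
from datetime import datetime

# Assumed line shape, inferred from the excerpts in this report:
#   "Jun 09 12:52:51.982 host daemon [pid] (function) level: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<host>\S+) (?P<daemon>[\w-]+)\s*\[(?P<pid>\d+)\]\s*"
    r"\((?P<func>\w+)\)\s*(?P<level>\w+): (?P<msg>.*)$"
)


def parse(lines):
    """Return (time, host, function, level, message) tuples sorted by time.

    Lines that do not match the assumed shape (for example the sudo
    entry, which has no milliseconds or function name) are skipped.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        # The logs carry no year; 2000 is an arbitrary placeholder.
        when = datetime.strptime("2000 " + m["ts"], "%Y %b %d %H:%M:%S.%f")
        events.append((when, m["host"], m["func"], m["level"], m["msg"]))
    return sorted(events)


def takeover_gaps(events):
    """Seconds between the most recent accepted connection and each takeover error."""
    gaps = []
    last_accept = None
    for when, _host, _func, _level, msg in events:
        if msg.startswith("Accepted new remote client connection"):
            last_accept = when
        elif "unexpected client takeover" in msg and last_accept is not None:
            gaps.append((when - last_accept).total_seconds())
    return gaps


# Lines copied from the log excerpts in this report.
SAMPLE = [
    "Jun 09 12:52:46 pacemaker-unit-3-lxd-17 sudo[1621188]: ubuntu : TTY=pts/1 ; "
    "PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/sbin/crm resource cleanup",
    "Jun 09 12:52:51.982 pacemaker-unit-5-lxd-14 pacemaker-controld [1565] "
    "(remote_lrm_op_callback) error: Disconnecting from Pacemaker Remote node "
    "compute-host-18.my.domain due to unexpected client takeover",
    "Jun 09 12:52:51.189 compute-host-18 pacemaker-remoted [5422] "
    "(pcmk__accept_remote_connection) info: Accepted new remote client "
    "connection from ::ffff:192.168.30.197",
]

if __name__ == "__main__":
    events = parse(SAMPLE)
    for when, host, func, level, msg in events:
        print(when.strftime("%H:%M:%S.%f")[:-3], host, level, msg)
    print("gap (s):", takeover_gaps(events))
```

On the excerpts quoted in this report, the new connection is accepted on the remote node about 0.8 seconds before the controller logs the takeover error. The full attachment would be needed to see which cluster node opened that second connection.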
  Logs:

  Cleanup:
  Jun 09 12:52:46 pacemaker-unit-3-lxd-17 sudo[1621188]: ubuntu : TTY=pts/1 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/sbin/crm resource cleanup

  Controller:
  Jun 09 12:52:51.982 pacemaker-unit-5-lxd-14 pacemaker-controld [1565] (remote_lrm_op_callback) error: Disconnecting from Pacemaker Remote node compute-host-18.my.domain due to unexpected client takeover
  Jun 09 12:52:51.982 pacemaker-unit-5-lxd-14 pacemaker-controld [1565] (remote_lrm_op_callback) error: Lost connection to Pacemaker Remote node compute-host-18.my.domain
  Jun 09 12:52:52.454 pacemaker-unit-5-lxd-14 pacemaker-schedulerd[1564] (pe_fence_node) warning: Remote node compute-host-18.my.domain will be fenced: remote connection is unrecoverable
  Jun 09 12:52:52.566 pacemaker-unit-5-lxd-14 pacemaker-schedulerd[1564] (stage6) warning: Scheduling Node compute-host-18.my.domain for STONITH

  Remote node:
  Jun 09 12:52:51.189 compute-host-18 pacemaker-remoted [5422] (pcmk__accept_remote_connection) info: Accepted new remote client connection from ::ffff:192.168.30.197
  Jun 09 12:52:51.929 compute-host-18 pacemaker-remoted [5422] (remoted__read_handshake_data) notice: Remote client connection accepted
- Jun 09 12:52:51.981 compute-host-18 pacemaker-remoted [5422] (lrmd_remote_client_msg)
+ Jun 09 12:52:51.981 compute-host-18 pacemaker-remoted [5422] (lrmd_remote_client_msg)

  I'm attaching more logs for inspection.
+
+ [Environment]
+ Ubuntu 22.04.5 LTS
+ Kernel: 5.15.0-131-generic
+ pacemaker: 2.1.2-1ubuntu3.1
+
+ This is a 3-node cluster. The cluster was healthy during the incident.

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2159283

Title:
  Pacemaker remote connection drops with "unexpected client takeover" after running crm resource cleanup, causing unexpected fencing
