Pacemaker logs with CIB messages

** Attachment added: "pacemaker-logs.txt"
   
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/2159283/+attachment/5980066/+files/pacemaker-logs.txt

** Description changed:

  [Description]
  We are seeing a problem where our remote compute node is fenced
  unexpectedly right after we run the crm resource cleanup command.
  
  Looking at the logs, it seems that when the cleanup command is executed,
  Pacemaker tries to open a new connection to the remote node. When it
  does this, the old connection drops with the error "Disconnecting from
  Pacemaker Remote node... due to unexpected client takeover".
  
  We do not know exactly what is causing this. It may be a race
  condition in how Pacemaker handles multiple connections during the
  cleanup probe. Since the remote node was healthy and up, the
  STONITH/fencing action was unexpected.
- 
  
  [Steps to reproduce]
  Not reliably reproducible so far: other runs of crm resource cleanup
  completed as expected.
  
  [Expected Behavior]
  The cleanup command should clear the failcounts and probe the resource
  without destroying the active connection to the remote node. Fencing
  should not happen if the node is healthy.
  
  [Actual Behavior]
  The connection is dropped due to "unexpected client takeover", the
  monitor operation fails, the connection is marked as "unrecoverable",
  and the node gets fenced.
  
  Logs:
  
  Cleanup:
  Jun 09 12:52:46 pacemaker-unit-3-lxd-17 sudo[1621188]:   ubuntu : TTY=pts/1 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/sbin/crm resource cleanup
  
  Controller:
  Jun 09 12:52:51.982 pacemaker-unit-5-lxd-14 pacemaker-controld  [1565] (remote_lrm_op_callback)       error: Disconnecting from Pacemaker Remote node compute-host-18.my.domain due to unexpected client takeover
  Jun 09 12:52:51.982 pacemaker-unit-5-lxd-14 pacemaker-controld  [1565] (remote_lrm_op_callback)       error: Lost connection to Pacemaker Remote node compute-host-18.my.domain
  Jun 09 12:52:52.454 pacemaker-unit-5-lxd-14 pacemaker-schedulerd[1564] (pe_fence_node)        warning: Remote node compute-host-18.my.domain will be fenced: remote connection is unrecoverable
  Jun 09 12:52:52.566 pacemaker-unit-5-lxd-14 pacemaker-schedulerd[1564] (stage6)       warning: Scheduling Node compute-host-18.my.domain for STONITH
  
  Remote node:
  Jun 09 12:52:51.189 compute-host-18 pacemaker-remoted   [5422] (pcmk__accept_remote_connection)       info: Accepted new remote client connection from ::ffff:192.168.30.197
  Jun 09 12:52:51.929 compute-host-18 pacemaker-remoted   [5422] (remoted__read_handshake_data)         notice: Remote client connection accepted
- Jun 09 12:52:51.981 compute-host-18 pacemaker-remoted   [5422] 
(lrmd_remote_client_msg) 
+ Jun 09 12:52:51.981 compute-host-18 pacemaker-remoted   [5422] 
(lrmd_remote_client_msg)
  
  I'm attaching more logs for inspection.
+ 
+ [Environment]
+ Ubuntu 22.04.5 LTS
+ Kernel: 5.15.0-131-generic
+ pacemaker: 2.1.2-1ubuntu3.1
+ 
+ This is a 3-node cluster. The cluster was healthy during the incident.
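
To make the ordering of the logged events easier to follow, here is a rough Python sketch that turns the timestamps quoted in the description into offsets from the moment the cleanup command was run. The event labels are shortened paraphrases of the log messages, and the names (EVENTS, PLACEHOLDER_YEAR, parse_ts, offsets) are made up for illustration. The logs carry no year, so a placeholder is used. The timestamps come from three different hosts, so the offsets are only as accurate as the clock synchronization between them.

```python
from datetime import datetime

# The log lines carry no year; any fixed value works when only deltas matter.
PLACEHOLDER_YEAR = 2000

# Timestamps copied from the log excerpts in the description.
# Labels are shortened paraphrases, not the literal log text.
EVENTS = [
    ("Jun 09 12:52:46", "sudo: crm resource cleanup invoked"),
    ("Jun 09 12:52:51.189", "remoted: accepted new remote client connection"),
    ("Jun 09 12:52:51.929", "remoted: remote client connection accepted"),
    ("Jun 09 12:52:51.982", "controld: disconnecting, unexpected client takeover"),
    ("Jun 09 12:52:52.454", "schedulerd: remote node will be fenced"),
    ("Jun 09 12:52:52.566", "schedulerd: scheduling node for STONITH"),
]


def parse_ts(text):
    """Parse 'Mon DD HH:MM:SS' with optional fractional seconds."""
    fmt = "%Y %b %d %H:%M:%S.%f" if "." in text else "%Y %b %d %H:%M:%S"
    return datetime.strptime(f"{PLACEHOLDER_YEAR} {text}", fmt)


def offsets(events):
    """Return (seconds since the first event, label) for each event."""
    start = parse_ts(events[0][0])
    return [
        (round((parse_ts(ts) - start).total_seconds(), 3), label)
        for ts, label in events
    ]


if __name__ == "__main__":
    for seconds, label in offsets(EVENTS):
        print(f"+{seconds:7.3f}s  {label}")
```

Read this way, the new remote client connection shows up on the remote node roughly five seconds after the cleanup command, and the controller reports the takeover within the same second as the handshake. This only restates the timestamps already in the logs and does not by itself establish the cause.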

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2159283

Title:
  Pacemaker remote connection drops with "unexpected client takeover"
  after running crm resource cleanup, causing unexpected fencing

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/2159283/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
