Dear all,
I'm pretty new to Pacemaker, so I must be missing something, but I cannot find it in the documentation. I'm setting up a Samba file server cluster with DRBD and Pacemaker. Here are the relevant pcs commands for the mount part:

user $ sudo pcs cluster cib fs_cfg
user $ sudo pcs -f fs_cfg resource create VPSFSMount Filesystem device="/dev/drbd1" directory="/srv/vps-fs" fstype="gfs2" "options=acl,noatime"
Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')

It all works fine; here is an extract of the pcs status output:

user $ sudo pcs status
Cluster name: vps-fs
Stack: corosync
Current DC: vps-fs-04 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Jul 14 11:13:55 2020
Last change: Tue Jul 14 10:31:36 2020 by root via cibadmin on vps-fs-04

2 nodes configured
7 resources configured

Online: [ vps-fs-03 vps-fs-04 ]

Full list of resources:

 stonith_vps-fs (stonith:external/ssh): Started vps-fs-04
 Clone Set: dlm-clone [dlm]
     Started: [ vps-fs-03 vps-fs-04 ]
 Master/Slave Set: VPSFSClone [VPSFS]
     Masters: [ vps-fs-03 vps-fs-04 ]
 Clone Set: VPSFSMount-clone [VPSFSMount]
     Started: [ vps-fs-03 vps-fs-04 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

I can start CTDB (the Samba cluster manager) manually and it works fine. However, CTDB shares a lock file between the two nodes, and that lock file lives on the shared mount point.

The problem appears the moment I reboot one of the servers (vps-fs-04), with Pacemaker and Corosync started automatically at boot. (I'm talking about an unexpected reboot, not a maintenance reboot, which I haven't tried yet.) After the reboot, vps-fs-04 comes back online and rejoins the cluster, but the node that wasn't rebooted (vps-fs-03) has an issue with the mount resource:

user $ sudo pcs status
Cluster name: vps-fs
Stack: corosync
Current DC: vps-fs-03 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Jul 14 11:33:44 2020
Last change: Tue Jul 14 10:31:36 2020 by root via cibadmin on vps-fs-04

2 nodes configured
7 resources configured

Node vps-fs-03: UNCLEAN (online)
Online: [ vps-fs-04 ]

Full list of resources:

 stonith_vps-fs (stonith:external/ssh): Started vps-fs-03
 Clone Set: dlm-clone [dlm]
     Started: [ vps-fs-03 vps-fs-04 ]
 Master/Slave Set: VPSFSClone [VPSFS]
     Masters: [ vps-fs-03 ]
     Slaves: [ vps-fs-04 ]
 Clone Set: VPSFSMount-clone [VPSFSMount]
     VPSFSMount (ocf::heartbeat:Filesystem): FAILED vps-fs-03
     Stopped: [ vps-fs-04 ]

Failed Actions:
* VPSFSMount_stop_0 on vps-fs-03 'unknown error' (1): call=65, status=Timed Out, exitreason='Couldn't unmount /srv/vps-fs; trying cleanup with KILL',
    last-rc-change='Tue Jul 14 11:23:46 2020', queued=0ms, exec=60011ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

The failure seems to come from the mount point (/srv/vps-fs) being busy, probably because of the CTDB lock file (see the PS below for how I plan to confirm that). What I don't understand is why the node that was not rebooted (vps-fs-03) needs to stop and remount an already mounted file system when the other node comes back online.

I've checked the 'ocf:heartbeat:Filesystem' documentation, but nothing there seemed to help. The only thing I changed was the following:

user $ sudo pcs resource update VPSFSMount fast_stop="no" op monitor timeout="60"

However, this didn't help (see also the PPS below for an idea I haven't tried yet). Google doesn't give me much either, but maybe I'm not searching for the right thing.

Thank you in advance for any pointers!

Kr,
Gregory
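PS: for what it's worth, this is roughly how I intend to check what is actually holding /srv/vps-fs busy when the stop times out. These are plain Linux commands, nothing Pacemaker-specific, and I have not run them on the failing node yet:

user $ sudo fuser -vm /srv/vps-fs    # list every process with open files on that file system
user $ sudo lsof /srv/vps-fs         # same information, shown per open file

If CTDB shows up there, that would at least confirm that its lock file is what blocks the unmount.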
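PPS: one idea I have not tried yet is to let Pacemaker manage CTDB itself, with constraints so that CTDB is stopped before the Filesystem resource whenever Pacemaker decides to stop the mount. A rough, untested sketch of what I have in mind (the resource name "ctdb" and the lock file path are placeholders, and the agent parameters should be double-checked with 'pcs resource describe ocf:heartbeat:CTDB'):

user $ sudo pcs resource create ctdb ocf:heartbeat:CTDB \
      ctdb_recovery_lock="/srv/vps-fs/ctdb/.ctdb.lock" clone
user $ sudo pcs constraint order start VPSFSMount-clone then ctdb-clone
user $ sudo pcs constraint colocation add ctdb-clone with VPSFSMount-clone INFINITY

If someone can confirm whether that is a sensible way to tie CTDB to the clustered file system, that would already help a lot.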