Dear all,
I'm pretty new to Pacemaker, so I must be missing something, but I cannot find it in the documentation. I'm setting up a Samba file server cluster with DRBD and Pacemaker. Here are the relevant pcs commands for the mount part:

user $ sudo pcs cluster cib fs_cfg
user $ sudo pcs -f fs_cfg resource create VPSFSMount Filesystem device="/dev/drbd1" directory="/srv/vps-fs" fstype="gfs2" "options=acl,noatime"
Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')

It all works fine; here is an extract of the pcs status output:

user $ sudo pcs status
Cluster name: vps-fs
Stack: corosync
Current DC: vps-fs-04 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Jul 14 11:13:55 2020
Last change: Tue Jul 14 10:31:36 2020 by root via cibadmin on vps-fs-04

2 nodes configured
7 resources configured

Online: [ vps-fs-03 vps-fs-04 ]

Full list of resources:

 stonith_vps-fs (stonith:external/ssh): Started vps-fs-04
 Clone Set: dlm-clone [dlm]
     Started: [ vps-fs-03 vps-fs-04 ]
 Master/Slave Set: VPSFSClone [VPSFS]
     Masters: [ vps-fs-03 vps-fs-04 ]
 Clone Set: VPSFSMount-clone [VPSFSMount]
     Started: [ vps-fs-03 vps-fs-04 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

I can start CTDB (the Samba cluster manager) manually and it works fine. However, CTDB shares a lock file between the two nodes, and that lock file lives on the shared mount point.

The problem appears the moment I reboot one of the servers (vps-fs-04), with Pacemaker and Corosync started automatically at boot. (I'm talking about an unexpected reboot, not a maintenance reboot, which I haven't tried yet.) After the reboot, vps-fs-04 comes back online and rejoins the cluster, but the node that wasn't rebooted (vps-fs-03) has an issue with the mount resource:

user $ sudo pcs status
Cluster name: vps-fs
Stack: corosync
Current DC: vps-fs-03 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Jul 14 11:33:44 2020
Last change: Tue Jul 14 10:31:36 2020 by root via cibadmin on vps-fs-04

2 nodes configured
7 resources configured

Node vps-fs-03: UNCLEAN (online)
Online: [ vps-fs-04 ]

Full list of resources:

 stonith_vps-fs (stonith:external/ssh): Started vps-fs-03
 Clone Set: dlm-clone [dlm]
     Started: [ vps-fs-03 vps-fs-04 ]
 Master/Slave Set: VPSFSClone [VPSFS]
     Masters: [ vps-fs-03 ]
     Slaves: [ vps-fs-04 ]
 Clone Set: VPSFSMount-clone [VPSFSMount]
     VPSFSMount (ocf::heartbeat:Filesystem): FAILED vps-fs-03
     Stopped: [ vps-fs-04 ]

Failed Actions:
* VPSFSMount_stop_0 on vps-fs-03 'unknown error' (1): call=65, status=Timed Out, exitreason='Couldn't unmount /srv/vps-fs; trying cleanup with KILL',
    last-rc-change='Tue Jul 14 11:23:46 2020', queued=0ms, exec=60011ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

The failure seems to come from the mount point (/srv/vps-fs) being busy, probably because of the CTDB lock file (see the PS below for how I plan to confirm that). What I don't understand is why the node that was not rebooted (vps-fs-03) needs to stop and remount an already mounted file system when the other node comes back online.

I've checked the 'ocf:heartbeat:Filesystem' documentation, but nothing there seemed to help. The only thing I changed was the following:

user $ sudo pcs resource update VPSFSMount fast_stop="no" op monitor timeout="60"

However, this didn't help (see also the PPS below for an idea I haven't tried yet). Google doesn't give me much either, but maybe I'm not searching for the right thing.

Thank you in advance for any pointers!

Kr,
Gregory
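PS: for what it's worth, this is roughly how I intend to check what is actually holding /srv/vps-fs busy when the stop times out. These are plain Linux commands, nothing Pacemaker-specific, and I have not run them on the failing node yet:

user $ sudo fuser -vm /srv/vps-fs    # list every process with open files on that file system
user $ sudo lsof /srv/vps-fs         # same information, shown per open file

If CTDB shows up there, that would at least confirm that its lock file is what blocks the unmount.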
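PPS: one idea I have not tried yet is to let Pacemaker manage CTDB itself, with constraints so that CTDB is stopped before the Filesystem resource whenever Pacemaker decides to stop the mount. A rough, untested sketch of what I have in mind (the resource name "ctdb" and the lock file path are placeholders, and the agent parameters should be double-checked with 'pcs resource describe ocf:heartbeat:CTDB'):

user $ sudo pcs resource create ctdb ocf:heartbeat:CTDB \
      ctdb_recovery_lock="/srv/vps-fs/ctdb/.ctdb.lock" clone
user $ sudo pcs constraint order start VPSFSMount-clone then ctdb-clone
user $ sudo pcs constraint colocation add ctdb-clone with VPSFSMount-clone INFINITY

If someone can confirm whether that is a sensible way to tie CTDB to the clustered file system, that would already help a lot.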