On 14/03/16 01:47 PM, Todd Hebert wrote:
> Hello,
>
> I'm working on setting up a test system that can handle NFS failover.
>
> The base is CentOS 7.
> I'm using ZVOL block devices out of ZFS to back DRBD-replicated volumes.
>
> I have four DRBD resources (r0, r1, r2, r3, which are /dev/drbd1, drbd2, drbd3 and drbd4 respectively).
>
> These all have XFS filesystems on them that mount properly and serve content etc.
>
> I tried using corosync/pacemaker/drbd/lsb:nfs-kernel-server on Ubuntu, and it would serve content on the primary server without issue. Any attempt to fail over or migrate services... everything would say it migrated fine, and the filesystems would be mounted and readable/writable etc., but the NFS clients' access to them would "pause".
>
> This appears to be an issue with the nfs-kernel-server in Ubuntu, where it simply would not recognise the NFS session information, which was on a replicated volume.
>
> If the primary node is put back online, everything migrates back "perfect", and traffic that had "paused" on failover to the secondary system resumes, even if it's been sitting there for 15-20 minutes not working.
>
> There is no difference in behaviour between offlining the primary node and migrating lsb:nfs-kernel-server to another node (by its primitive name, not as lsb:nfs-kernel-server, obviously).
>
> If I create new connections into NFS while test-sanb is active, they work, only to "freeze" as above with an offline or migrate away from test-sanb, so symptoms are the same in both "directions".
>
> ----
>
> After not being able to get lsb:nfs-kernel-server working properly in Ubuntu, and reading similar stories from other users after a series of googles, I switched over to CentOS 7.
> On CentOS 7, instead of lsb:nfs-kernel-server, I am trying to use systemd:nfs-server, since CentOS 7 uses systemd rather than SysV init for managing services.
>
> Pretty much everything in the configuration except lsb:nfs-kernel-server came right over.
>
> Now, everything will run properly on the primary node (as was the case with Ubuntu), but...
> If I put the "test-sana" node into standby, first NFS stops, then the VIP stops, then the three NFS-shared filesystems get unmounted (perfect so far).
> Then it appears that parts of the NFS service, either idmapd or rpcbind, haven't released their hold on the rpc_pipefs filesystem, so it's still mounted... it's mounted inside /var/lib/nfs, which is on the last DRBD volume.
> Pacemaker, or some other element, detects that rpc_pipefs is still mounted, unmounts it, then unmounts /var/lib/nfs, which should clear the way for everything else to work... but that's not what happens.
>
> At this point, the ms_drbd_r<N> resources should demote to "Secondary" on the primary node, allowing the secondary node to promote to "Primary" and services to start on "test-sanb". Instead, the drbd processes on "test-sana" end up marked as "Stopped", and checking `cat /proc/drbd` shows that the volumes are still Primary/Secondary UpToDate/UpToDate on test-sana (and the opposite on test-sanb).
>
> It takes AGES (several minutes) for things to reach this state.
>
> They stay this way indefinitely. If I manually demote the DRBD resources on test-sana, they end up listed as "Master" in a "crm status" or "pcs status" again, and eventually the status changes to "Primary/Secondary" in /proc/drbd as well.
>
> If I put node test-sana back online ("node online test-sana"), it takes a few seconds for services to start back up and serve content again.
>
> Since I cannot get services to run on test-sanb at all thus far, I don't know if the symptoms would be the same in both directions.
> I can't find any differences between the two nodes that should account for this.
>
> ---
>
> In any case, what I need to arrive at is a working solution for NFS failover across two nodes.
>
> I have several systems where I'm using just heartbeat to fail over the IP address, DRBD and NFS, but those are single instances of drbd/nfs.
>
> I cannot find any working examples for either Ubuntu 14.04 or CentOS 7 for this scenario. (There are some out there for Ubuntu, but they do not appear to actually work with modern pacemaker et al.)
>
> Does anyone have an example of working configurations for this?
>
> My existing pacemaker configuration can be found here:
> http://paste.ie/view/c766a4ff
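For reference, here is a rough sketch of how a stack like this is commonly wired up with pcs on CentOS 7. It is illustrative only, not taken from your paste: the resource names, mount points and address are placeholders, only one of the four DRBD volumes is shown, and the same pattern repeats for the others.

  # DRBD master/slave for r0 (r1-r3 are created the same way)
  pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 op monitor interval=30s
  pcs resource master ms_drbd_r0 drbd_r0 master-max=1 master-node-max=1 \
      clone-max=2 clone-node-max=1 notify=true

  # Filesystems: one export per DRBD device, plus /var/lib/nfs on the last volume
  pcs resource create fs_r0 ocf:heartbeat:Filesystem device=/dev/drbd1 \
      directory=/export/r0 fstype=xfs
  pcs resource create fs_nfs_state ocf:heartbeat:Filesystem device=/dev/drbd4 \
      directory=/var/lib/nfs fstype=xfs

  # NFS server and floating IP
  pcs resource create nfs_server systemd:nfs-server op monitor interval=30s
  pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24

  # Keep each filesystem with its DRBD master, and mount only after promotion
  pcs constraint colocation add fs_r0 with master ms_drbd_r0 INFINITY
  pcs constraint order promote ms_drbd_r0 then start fs_r0
  pcs constraint colocation add fs_nfs_state with master ms_drbd_r3 INFINITY
  pcs constraint order promote ms_drbd_r3 then start fs_nfs_state

  # NFS must not start until its state directory is mounted; the VIP comes last
  pcs constraint colocation add nfs_server with fs_nfs_state INFINITY
  pcs constraint order fs_nfs_state then nfs_server
  pcs constraint colocation add vip with nfs_server INFINITY
  pcs constraint order nfs_server then vip

The colocation/order pairs are what keep the filesystems and the NFS server from being started (or left running) on a node where DRBD has not actually been promoted.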
You *must* configure fencing. Set up stonith in pacemaker, test it, and then, once it is working, hook DRBD into it ("fencing resource-and-stonith;" plus the {un,}fence-peer handlers pointing at "/path/to/crm-{un,}fence-peer.sh"); there is a rough sketch at the bottom of this mail. Debugging failover issues without fencing tested and working is a waste of time.

> As I mentioned, the configurations are nearly identical for both the Ubuntu 14.04 and CentOS 7 setups, and the hardware used is the same in both cases.
>
> I know that I do not have Stonith configured, as well. I do not have access to any fencing devices for a test system and have to rely on software only.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
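As promised, the DRBD-side hook looks roughly like this (DRBD 8.4 syntax). The stonith agent and every parameter in the pcs lines are placeholders only; substitute whatever agent actually matches your hardware.

  # /etc/drbd.d/r0.res (and likewise r1-r3)
  resource r0 {
      disk {
          fencing resource-and-stonith;
      }
      handlers {
          fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
          after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
      # existing device/net/on sections stay as they are
  }

  # Pacemaker side: one stonith device per node, then enable stonith.
  # fence_ipmilan, the addresses and the credentials are examples only.
  pcs stonith create fence_sana fence_ipmilan pcmk_host_list="test-sana" \
      ipaddr=192.0.2.1 login=admin passwd=changeme
  pcs stonith create fence_sanb fence_ipmilan pcmk_host_list="test-sanb" \
      ipaddr=192.0.2.2 login=admin passwd=changeme
  pcs constraint location fence_sana avoids test-sana
  pcs constraint location fence_sanb avoids test-sanb
  pcs property set stonith-enabled=true

Test the stonith devices first (for example "pcs stonith fence test-sanb") before drawing any conclusions from failover tests.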