On 15/04/16 17:14, David Teigland wrote:
>> However, on some occasions, I observe that node A continues in the loop
>> believing that it is successfully writing to the file
>
> node A has the exclusive lock, so it continues writing...
>
>> but, according to
>> node C, the file stops being updated. (Meanwhile, the file written by
>> node B continues to be up-to-date as read by C.) This is concerning --
>> it looks like I/O writes are being completed on node A even though other
>> nodes in the cluster cannot see the results.

> Is node C blocked trying to read the file A is writing?  That's what we'd
> expect until recovery has removed node A.  Or are C's reads completing
> while A continues writing the file?  That would not be correct.

>> However, if A happens to own the DLM lock, it does not need
>> to ask DLM's permission because it owns the lock. Therefore, it goes
>> on writing. Meanwhile, the other node can't get DLM's permission to
>> get the lock back, so it hangs.

> The description sounds like C might not be hanging in read as we'd expect
> while A continues writing.  If that's the case, then it implies that dlm
> recovery has been completed by nodes B and C (removing A), which allows
> the lock to be granted to C for reading.  If dlm recovery on B/C has
> completed, it means that A should have been fenced, so A should not be
> able to write once C is given the lock.
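
To make the quoted scenario concrete for anyone who finds this thread in
the archives: the pattern under discussion is a writer holding a DLM lock
in exclusive (EX) mode while a reader asks for protected-read (PR) mode
on the same resource. Below is a rough sketch using libdlm's synchronous
API as documented in dlm_lock_wait(3); it is not my actual test program,
and the resource name is made up. If the writer node dies while holding
the EX lock, the reader's PR request should stay blocked until recovery
has removed (i.e. fenced) the writer.

  /* Build with: cc -o dlmtest dlmtest.c -ldlm
   * Needs dlm_controld running (and root) on each node.
   * Run "./dlmtest writer" on node A and "./dlmtest" on node C. */
  #include <stdio.h>
  #include <string.h>
  #include <libdlm.h>

  static struct dlm_lksb lksb;

  int main(int argc, char **argv)
  {
      const char *name = "jf_test_lock";  /* made-up resource name */
      int writer = (argc > 1 && strcmp(argv[1], "writer") == 0);
      int mode = writer ? LKM_EXMODE : LKM_PRMODE;
      int rv;

      /* No flags, so this blocks until the requested mode is granted. */
      rv = dlm_lock_wait(mode, &lksb, 0, name, strlen(name),
                         0, NULL, NULL, NULL);
      if (rv != 0 || lksb.sb_status != 0) {
          fprintf(stderr, "lock failed: rv=%d status=%d\n",
                  rv, lksb.sb_status);
          return 1;
      }
      printf("%s: lock granted, lkid 0x%x\n",
             writer ? "writer" : "reader", lksb.sb_lkid);

      /* ... write or read the shared file here ... */

      dlm_unlock_wait(lksb.sb_lkid, 0, &lksb);
      return 0;
  }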

Thanks Bob and Dave for your very helpful insights.

Your line of reasoning led me to realise that I am running dlm with
fencing disabled, which explains everything. Node C was not hanging in
its read while A continued to write; it was repeatedly returning an old
value. I presume that's legitimate: as far as C is concerned, the value
it last saw must still be up to date, because A must have been fenced
and so could not have updated it. (It also explains why I didn't see
anything useful in the logs.)
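
(In case it helps anyone else reproducing this: fencing is controlled by
dlm_controld's enable_fencing option. Assuming configuration is done in
/etc/dlm/dlm.conf rather than on the dlm_controld command line, the two
cases correspond to something like the following.)

  # /etc/dlm/dlm.conf
  # 0 disables fencing (the setup that gave me the stale reads);
  # the default of 1 enables it (reads on C then block until A is fenced).
  enable_fencing=0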

When I run the same test with fencing enabled, then although A continues
writing after the failure, the read on C hangs until A has been fenced,
at which point C is able to read the last value A wrote. That's exactly
what I want.
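
For what it's worth, timestamping each read attempt is an easy way to
tell the two behaviours apart: a read that blocks until fencing
completes shows up as one long gap, whereas a reader being fed stale
data keeps returning quickly with an unchanging value. A trivial
fragment along those lines, where read_shared_value() is a hypothetical
stand-in for "take the shared lock and read the counter" rather than
anything from my real test:

  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  /* Hypothetical stand-in: take the shared lock and read the counter
   * from the shared file.  Replace with the real test code. */
  static long read_shared_value(void)
  {
      return 42;
  }

  int main(void)
  {
      struct timespec t0, t1;
      long value;

      for (;;) {
          clock_gettime(CLOCK_MONOTONIC, &t0);
          value = read_shared_value();
          clock_gettime(CLOCK_MONOTONIC, &t1);

          printf("value=%ld (read took %.3f s)\n", value,
                 (t1.tv_sec - t0.tv_sec) +
                 (t1.tv_nsec - t0.tv_nsec) / 1e9);
          sleep(1);
      }
  }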

Apologies for the noise, and thanks for the explanations.

Jonathan

