>>> Eric Ren <z...@suse.com> wrote on 13.10.2016 at 09:31 in message
<e23ba209-fdc6-987e-db14-5c57b72c6...@suse.com>:
> Hi,
> 
> On 10/10/2016 10:46 PM, Ulrich Windl wrote:
>> Hi!
>>
>> I observed an interesting thing: in a three-node cluster (SLES11 SP4) with
>> cLVM and OCFS2 on top, one node was fenced because the OCFS2 filesystem was
>> somehow busy on unmount. We have (mainly for paranoid reasons) an excessively
>> long fencing timeout for SBD: 180 seconds.
>>
>> While one node was actually reset immediately (the cluster was still waiting
>> for the fencing to "complete" through the timeout), the other nodes seemed to
>> freeze the filesystem. I observed a read delay of more than 140 seconds on one
>> node; the other was also close to 140 seconds.
> 
> OCFS2 and cLVM both depend on DLM. During the fencing process, the DLM daemon
> notifies them to stop service, which means that any cluster locking request
> is blocked.
> 
> So I'm wondering why it takes so long to finish the fencing process?
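Eric's point that DLM blocks every cluster locking request while fencing is in progress can be illustrated with a toy model. This is a minimal sketch, not DLM's actual implementation: the class ToyLockspace and its methods are hypothetical, and the 180-second fencing window is scaled down to 0.2 seconds. A lock request that arrives during fencing waits until the fence is declared complete; if the cluster only declares that when the SBD timeout expires, a filesystem read needing a lock could stall for most of that window, even though the victim node was reset long before.

```python
import threading
import time


class ToyLockspace:
    """Hypothetical toy model: lock requests block while fencing is in progress.
    This does not reproduce DLM's real protocol, only the blocking behavior."""

    def __init__(self) -> None:
        self._running = threading.Event()
        self._running.set()  # normal operation: requests are granted at once

    def stop_for_fencing(self) -> None:
        # Corresponds to DLM telling OCFS2/cLVM to stop service.
        self._running.clear()

    def fencing_complete(self) -> None:
        # Corresponds to the cluster finally considering the node fenced.
        self._running.set()

    def request_lock(self, name: str) -> float:
        """Return the number of seconds this request spent blocked."""
        start = time.monotonic()
        self._running.wait()  # blocks until fencing is declared complete
        return time.monotonic() - start


def demo(fence_duration_s: float = 0.2) -> float:
    ls = ToyLockspace()
    ls.stop_for_fencing()
    # The fence is only "complete" when the (scaled-down) timeout expires.
    timer = threading.Timer(fence_duration_s, ls.fencing_complete)
    timer.start()
    waited = ls.request_lock("inode-lock")
    timer.join()
    return waited


if __name__ == "__main__":
    print(f"lock request blocked for {demo():.2f} s")
```

The request waits for the whole (scaled) fencing window, regardless of how quickly the victim node actually went down.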

As I wrote: with SBD this is paranoia, as fencing doesn't report back a status
like "completed" or "failed". Actually the fencing only needs a few seconds,
but the timeout is 3 minutes. Only after that does the cluster believe that the
node is down (our servers also boot so slowly that they are not back up within
three minutes). Why three minutes? Writing to a SCSI disk may be retried for up
to one minute, and reading may also be retried for up to a minute. So with a bad
SBD disk (or some strange transport problem) it could take two minutes until the
receiving SBD gets the fencing command. If the timeout is too low, resources
could be restarted before the node was actually fenced, causing data corruption.

Ulrich
P.S.: One common case where our SAN disks seem slow is an "online" firmware
update, where a controller may be down for 20 to 30 seconds. Multipathing is
expected to switch to another controller within a few seconds. However, the
commands multipath uses to test the disk are also SCSI commands that may hang
for a while...

> 
> Eric
>>
>> I did not expect this from a cluster filesystem.
>>
>> I wonder: is that expected behavior?
>>
>> Regards,
>> Ulrich
>>
>>
>>
>> _______________________________________________
>> Users mailing list: Users@clusterlabs.org 
>> http://clusterlabs.org/mailman/listinfo/users 
>>
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
>>