If you want to reduce the multipath failover time when one controller
goes down, this thread may be useful:
https://www.redhat.com/archives/dm-devel/2009-April/msg00266.html
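
For reference, the usual knobs live in /etc/multipath.conf. A minimal sketch with illustrative values (these are assumptions, not recommendations; check your array vendor's documented settings before copying anything):

```
defaults {
    polling_interval  5    # seconds between path-checker runs
    checker_timeout   15   # abort a hanging checker SCSI command after 15 s
    fast_io_fail_tmo  5    # fail I/O on a lost remote port after 5 s
    dev_loss_tmo      600  # keep the device around while the port is gone
    no_path_retry     12   # queue I/O for 12 checker intervals, then fail
}
```

Whether these belong in the defaults section or in a per-device section depends on your multipath-tools version; `multipathd show config` displays what is actually in effect.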

2016-10-13 10:27 GMT+02:00 Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>:
>>>> Eric Ren <z...@suse.com> wrote on 13.10.2016 at 09:31 in message
> <e23ba209-fdc6-987e-db14-5c57b72c6...@suse.com>:
>> Hi,
>>
>> On 10/10/2016 10:46 PM, Ulrich Windl wrote:
>>> Hi!
>>>
>>> I observed an interesting thing: in a three-node cluster (SLES11 SP4) with
>>> cLVM and OCFS2 on top, one node was fenced because the OCFS2 filesystem was
>>> somehow busy on unmount. We have (mainly for paranoid reasons) an
>>> excessively long fencing timeout for SBD: 180 seconds.
>>>
>>> While one node was actually reset immediately (the cluster was still waiting
>>> for the fencing to "complete" through the timeout), the other nodes seemed
>>> to freeze the filesystem. Thus I observed a read delay of more than 140
>>> seconds on one node; the other was also close to 140 seconds.
>> ocfs2 and cLVM both depend on DLM. The DLM daemon notifies them to stop
>> service (which means any cluster locking request is blocked) during the
>> fencing process.
>>
>> So I'm wondering why it takes so long to finish the fencing process?
>
> As I wrote: with SBD this is paranoia (as fencing doesn't report back a
> status like "completed" or "failed"). The fencing itself actually needs only
> a few seconds, but the timeout is 3 minutes; only then does the cluster
> believe that the node is down (also, our servers boot so slowly that they are
> not back up within three minutes). Why three minutes? Writing to a SCSI disk
> may be retried for up to one minute, and reading may also be retried for a
> minute. So with a bad SBD disk (or some strange transport problem) it could
> take two minutes until the receiving SBD gets the fencing command. If the
> timeout is too low, resources could be restarted before the node was actually
> fenced, causing data corruption.
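
The timeout arithmetic above can be sketched as follows (a rough model using the retry times cited, not an official formula; the 50% margin and the 20 s stonith-timeout padding are illustrative assumptions):

```python
# Worst-case delivery of the SBD "poison pill": the sender's write may be
# retried for up to a minute, and the receiver's read for another minute
# (illustrative figures from the discussion above, not kernel defaults).
SCSI_WRITE_RETRY = 60  # seconds
SCSI_READ_RETRY = 60   # seconds

worst_case_delivery = SCSI_WRITE_RETRY + SCSI_READ_RETRY  # 120 s

# Add a 50% safety margin before the cluster may assume the node is dead.
msgwait = int(worst_case_delivery * 1.5)  # the 180 s timeout discussed here

# Pacemaker's stonith-timeout must exceed msgwait, or resources could be
# restarted before the node is really fenced, risking data corruption.
stonith_timeout = msgwait + 20

print(worst_case_delivery, msgwait, stonith_timeout)  # 120 180 200
```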
>
> Ulrich
> P.S.: One common case where our SAN disks seem slow is an "online" firmware
> update, during which a controller may be down for 20 to 30 seconds.
> Multipathing is expected to switch to another controller within a few
> seconds. However, the commands multipath uses to test a path are themselves
> SCSI commands that may hang for a while...
>
>>
>> Eric
>>>
>>> This was not what I expected from a cluster filesystem.
>>>
>>> I wonder: is that expected behavior?
>>>
>>> Regards,
>>> Ulrich
>>>
>>>
>>>
>>> _______________________________________________
>>> Users mailing list: Users@clusterlabs.org
>>> http://clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>>
>



-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^
