Re: [ClusterLabs] DLM hanging when corosync is OK causes cluster to hang

Digimer Tue, 19 Jan 2016 17:15:06 -0800

On 19/01/16 08:04 PM, Jan Pokorný wrote:
> On 11/01/16 11:59 -0500, Digimer wrote:
>>   We hit a strange problem where a RAID controller on a node failed,
>> causing DLM (gfs2/clvmd) to hang, but the node was never fenced. I
>> assume this was because corosync was still working.
>>
>>   Is there a way in rhel6/cman/rgmanager to have a node suicide or get
>> fenced in a condition like this?
> 
> something like this in the crontab (beside cron and other components
> are now the SPOF and I/O spike or DoS will finish the apocalypse)?
> 
> */1 * * * * timeout 30s touch <file on respective fs> || fence_node <myself>
> 
> Sophistications at the components you mentioned might be preferred,
> though.


Oh, I didn't know about 'timeout'! I can use that to make a more
intelligent check inside ScanCore before pulling the trigger. My plan
was, if I could detect the fault, to call 'echo c > /proc/sysrq-trigger'
to crash the node and let it get fenced normally.
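
A rough sketch of how the two ideas might combine is below. It is not ScanCore code: the function names, the probe file name, and the DRY_RUN switch are all made up for illustration. It also assumes the blocked 'touch' can still be interrupted by a signal. If the process is stuck in uninterruptible I/O, 'timeout' may not return either, so this would need testing against the real failure.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; every name here is illustrative.

# Return 0 if a small write under the given directory finishes in time.
#   $1: directory on the filesystem to probe (e.g. a gfs2 mount point)
#   $2: time limit in seconds (defaults to 30, as in the crontab example)
fs_responsive() {
    probe="$1/.fs_probe.$$"
    limit="${2:-30}"
    # 'timeout' exits non-zero if touch fails or is killed at the limit.
    timeout "$limit" touch "$probe" || return 1
    # Clean up, also under a time limit in case the hang starts here.
    timeout "$limit" rm -f "$probe"
    return 0
}

# Crash the node so that the surviving members fence it.
# DRY_RUN defaults to 1 so the sketch cannot panic a machine by accident.
crash_node() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: echo c > /proc/sysrq-trigger"
    else
        echo c > /proc/sysrq-trigger
    fi
}

# Probe the filesystem and crash only if the probe fails.
check_and_crash() {
    fs_responsive "$1" "${2:-30}" || crash_node
}
```

Calling 'check_and_crash <mount point>' from the periodic scan would only report what it would do. Setting DRY_RUN=0 arms it, and that needs root plus sysrq enabled on the node.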

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
