On 19/01/16 08:04 PM, Jan Pokorný wrote:
> On 11/01/16 11:59 -0500, Digimer wrote:
>> We hit a strange problem where a RAID controller on a node failed,
>> causing DLM (gfs2/clvmd) to hang, but the node was never fenced. I
>> assume this was because corosync was still working.
>>
>> Is there a way in rhel6/cman/rgmanager to have a node suicide or get
>> fenced in a condition like this?
>
> something like this in the crontab (beside cron and other components
> are now the SPOF and I/O spike or DoS will finish the apocalypse)?
>
>   */1 * * * * timeout 30s touch <file on respective fs> || fence_node <myself>
>
> Sophistications at the components you mentioned might be preferred,
> though.

Oh, I didn't know about 'timeout'! I can use that to make a more
intelligent check inside ScanCore before pulling the trigger. My plan
was, if I could detect the fault, call 'echo c > /proc/sysrq-trigger'
and let the node get fenced normally.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
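
For illustration, the idea discussed above (a time-bounded write probe, followed by a deliberate crash so the peers fence the node) could be sketched roughly as below. This is only a sketch under assumptions, not what ScanCore actually does: the names PROBE_FILE, PROBE_TIMEOUT, DRY_RUN and the three functions are hypothetical, and the default probe path is a placeholder for a file on the filesystem being watched. It assumes GNU coreutils 'timeout', which exits with status 124 when the time limit is reached. One caveat: a process blocked in uninterruptible I/O may not react to the signal 'timeout' sends, so the probe itself could still hang in some failure modes.

```shell
#!/bin/sh
# Hypothetical sketch: probe a filesystem with a time-bounded write and,
# if the probe fails, crash the node so the rest of the cluster fences it.
# All variable and function names here are illustrative assumptions.

PROBE_FILE="${PROBE_FILE:-/tmp/.fs_probe}"   # placeholder: a file on the watched fs
PROBE_TIMEOUT="${PROBE_TIMEOUT:-30}"         # seconds, as in the quoted crontab line
DRY_RUN="${DRY_RUN:-1}"                      # 1 = only report, never crash the node

# Succeeds if the touch completes within the time limit.
# GNU timeout exits with 124 when the limit is hit.
fs_responsive() {
    timeout "${PROBE_TIMEOUT}s" touch "$1"
}

# Crash the kernel via sysrq so the node gets fenced normally.
crash_node() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "DRY RUN: would write 'c' to /proc/sysrq-trigger"
    else
        echo c > /proc/sysrq-trigger
    fi
}

# Run one probe; escalate on any failure, as the crontab one-liner does.
check_fs() {
    if fs_responsive "$1"; then
        echo "ok: $1"
    else
        status=$?
        echo "probe failed: $1 (status $status)"
        crash_node
    fi
}

check_fs "$PROBE_FILE"
```

With the default DRY_RUN=1 the script only reports what it would do; setting DRY_RUN=0 (as root) would really crash the node. Like the crontab line, it escalates on any probe failure; checking for status 124 specifically would separate a hang from, say, a permissions error.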
