Hi! I think your problem can be reduced to a monitor that can detect that both IBs are down. Maybe test-reading (with a timeout) from the filesystem? But take care: umount will not kill root processes that keep the filesystem busy in the normal stop-resource case.
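An untested sketch of that idea, reusing the template from your crm.txt: deepen the existing monitor (OCF_CHECK_LEVEL=20 makes the Filesystem agent do a write+read probe, which should hang once both IPoIB paths to the SAN are gone) and escalate a monitor failure straight to a fence instead of a plain recovery:

    rsc_template lustre-target-template ocf:heartbeat:Filesystem \
        op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=20 on-fail=fence \
        op start interval=0 timeout=300 on-fail=fence \
        op stop interval=0 timeout=300 on-fail=fence

With that, a flapping IB would fence the node on the first monitor timeout rather than bounce the mdt between servers.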
Try that?

Regards,
Ulrich

>>> Marcin Dulak <[email protected]> wrote on 19.08.2015 at 12:31 in message
<CABJoABZfMv0N=xaz7qkp_5rciso4wbzlxj8dyslav1qccvv...@mail.gmail.com>:
> Hi,
>
> I have two questions stated in the email's subject, but let me describe my
> system first.
>
> I have a Lustre over InfiniBand setup consisting of mgs, mds, and two oss;
> each oss has two ost's, but the questions are not specific to Lustre.
> Each server has two IPoIB interfaces which provide multipath redundancy to
> the SAN block devices.
> I'm using the crm configuration generated by the make-lustre-crm-config.py
> script available at https://github.com/gc3-uzh-ch/schroedinger-lustre-ha
> After some changes (hostnames, IPs, and the fact that in my setup I have
> two IPoIB interfaces instead of just one), the script creates the attached
> crm.txt.
>
> I'm familiar with https://ourobengr.com/ha/ , which says:
> "If a stop (umount of the Lustre filesystem in this case) fails,
> the node will be fenced/STONITHd because this is the only safe thing to do".
>
> I have a working STONITH, with corosync communicating over the eth0
> interface. Let's take the example of server-02, which mounts Lustre's mdt.
> server-02 is powered off if I disable the eth0 interface on it,
> and the mdt moves onto server-01 as expected.
> However, if instead both IPoIB interfaces go down on server-02,
> the mdt is moved to server-01, but no STONITH is performed on server-02.
> This is expected, because there is nothing in the configuration that
> triggers STONITH in case of IB connection loss.
> However, if IPoIB is flapping, this setup could lead to the mdt moving
> back and forth between server-01 and server-02.
> Should I have STONITH shut down a node that misses both IPoIB interfaces
> (remember they are passively redundant, only one active at a time)?
> If so, how do I achieve that?
>
> The context for the second question: the configuration contains the
> following Filesystem template:
>
> rsc_template lustre-target-template ocf:heartbeat:Filesystem \
>     op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=10 \
>     op start interval=0 timeout=300 on-fail=fence \
>     op stop interval=0 timeout=300 on-fail=fence
>
> How can I make umount/mount of the Filesystem fail in order to test the
> STONITH action in these cases?
>
> Extra question: where can I find the documentation/source for what
> on-fail=fence does?
> And what does on-fail=stop mean in the ethmonitor template below (what is
> stopped?)?
>
> rsc_template netmonitor-30sec ethmonitor \
>     params repeat_count=3 repeat_interval=10 \
>     op monitor interval=15s timeout=60s \
>     op start interval=0s timeout=60s on-fail=stop \
>
> Marcin
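P.S. Regarding how to test the umount failure: a crude way to make the stop fail is to park a root process inside the mount point before asking the cluster to move the target (whether that actually defeats the umount depends on the agent's force_unmount handling). The names below are assumed; substitute the mdt resource and mount point from your crm.txt:

    # on server-02, as root; /mnt/mdt is an assumed mount point
    cd /mnt/mdt && sleep 3600 &
    # now force the target away; the stop (umount) should fail with EBUSY
    crm resource migrate lustre-mdt server-01
    # watch the stop fail and server-02 get fenced
    crm_mon -1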
