On 24.05.2018 02:57, Jason Gauthier wrote:
> I'm fairly new to clustering under Linux. I basically have one shared
> storage resource right now, using dlm and gfs2.
> I'm using fibre channel, and when both of my nodes are up (2-node cluster)
> dlm and gfs2 seem to be operating perfectly.
> If I reboot node B, node A works fine, and vice versa.
>
> When node B goes offline unexpectedly and becomes unclean, dlm seems to
> block all IO to the shared storage.
>
> dlm knows node B is down:
>
> # dlm_tool status
> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
> daemon now 865695 fence_pid 18186
> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence 0 now 1527119524
> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
>
> On the same server, I see these messages in my daemon.log:
>
> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick (reboot) node 1084772369/(null) : No route to host (-113)
> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid 1084772369
>
> I can recover from the situation by forcing it (or by bringing the other
> node back online):
>
> dlm_tool fence_ack 1084772369
>
> The cluster config is pretty straightforward:
>
> node 1084772368: alpha
> node 1084772369: beta
> primitive p_dlm_controld ocf:pacemaker:controld \
>         op monitor interval=60 timeout=60 \
>         meta target-role=Started \
>         params args="-K -L -s 1"
> primitive p_fs_gfs2 Filesystem \
>         params device="/dev/sdb2" directory="/vms" fstype=gfs2
> primitive stonith_sbd stonith:external/sbd \
>         params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>         meta target-role=Started
What is the status of the stonith resource? Did you configure SBD fencing
properly, and is the sbd daemon up and running with the proper parameters?
What is the output of "sbd -d /dev/sdb1 dump" and "sbd -d /dev/sdb1 list"
on both nodes? Does "sbd -d /dev/sdb1 message <other-node> test" work in
both directions? And does manual fencing using stonith_admin work? (A rough
sketch of these checks is at the end of this mail.)

> group g_gfs2 p_dlm_controld p_fs_gfs2
> clone cl_gfs2 g_gfs2 \
>         meta interleave=true target-role=Started
> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.16-94ff4df \
>         cluster-infrastructure=corosync \
>         cluster-name=zeta \
>         last-lrm-refresh=1525523370 \
>         stonith-enabled=true \
>         stonith-timeout=20s
>
> Any pointers would be appreciated. I feel like this should be working but
> I'm not sure if I've missed something.
>
> Thanks,
>
> Jason
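
To be concrete, the checks above would look something like the following.
This is only a sketch: the device path /dev/sdb1 and the node names alpha
and beta are taken from your config, the exact options and output may differ
on your distribution, and the service name assumes a systemd-based setup.

  # systemctl status sbd                 <- daemon running, with the right -d/-w arguments?
  # sbd -d /dev/sdb1 dump                <- sane header (watchdog/msgwait timeouts)?
  # sbd -d /dev/sdb1 list                <- do both nodes have a slot, any stale messages?
  # sbd -d /dev/sdb1 message beta test   <- run on alpha; sbd on beta should log the test
                                            message (then try the reverse direction)
  # stonith_admin --list-registered      <- does pacemaker see the stonith_sbd device?
  # stonith_admin --reboot beta          <- does a manual fence actually reset the node?

The last two matter most here: dlm keeps I/O to the gfs2 filesystem blocked
until fencing of the failed node succeeds (or you fence_ack it by hand), so
if stonith itself is not working you will see exactly the hang you describe.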