>>> dan <[email protected]> schrieb am 10.09.2015 um 12:54 in >>> Nachricht <[email protected]>: > Hi > > I have now for a few weeks been trying to get a cluster using pacemaker > to work. We are using Ubuntu 14.04.2 LTS with > corosync 2.3.3-1ubuntu1 > pacemaker 1.1.10+git2013
> I tested failures by doing init 0, halt -f, pkill -9 corosync on one
> node and it worked fine. But then I detected that after the cluster had
> been up (both nodes) for 2 days, doing init 0 on one node resulted in
> that node hanging during shutdown and the other node failing to stonith
> it. And after forcing the hanging node to power off and then powering it

Could you find out why? Maybe the cluster node tried a clean stop/migration
of resources, waiting for operations to finish. What's in the logs?

> on, doing pcs status on it reports not being able to talk to the other
> node and all resources are stopped. And on the other node (which has been
> running the whole time) pcs status hangs (crm status works and says that
> all is up) and the gfs2 file system is blocking. Doing init 0 on this
> node never shuts it down, a reboot -f does work and after it is up
> again the entire cluster is ok.

I'm old school and always use "shutdown -h|-r now" ;-)

> So in short, everything works fine after a fresh boot of both nodes,
> but after 2 days a requested shutdown of one node (using init 0) hangs
> and the other node stops working correctly.
>
> Looking at the console on the node I did init 0 on, dlm_controld reports
> that the cluster is down, then that drbd has problems talking to the
> other node, and then that gfs2 is blocked. So that is why that node never
> powers off - gfs2 and drbd were not shut down correctly by pacemaker
> before it stopped (or is trying to stop).
>
> Looking through the logs (syslog and corosync.log) (I have debug mode on
> corosync) I can see that on node 1 (the one I left running the whole
> time) it does:
>
> stonith-ng: info: crm_update_peer_proc: pcmk_cpg_membership: Node
> node2[2] - corosync-cpg is now offline
> crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node node2[2]
> - corosync-cpg is now offline
> crmd: info: peer_update_callback: Client node2/peer now has status
> [offline] (DC=node2)
>
> crmd: notice: peer_update_callback: Our peer on the DC is dead

If the node is actually alive at that time, you have a big configuration or
software problem!

> stonith-ng notice: handle_request: Client stonith-api.10797.41ef3128 wants
> to fence (off) '2' with device '(any)'
> stonith-ng notice: initiate_remote_stonith_op: Initiating remote
> operation off for node2: 20f62cf6-90eb-4c53-8da1-30ab048de495 (0)
> stonith-ng: info: stonith_command: Processed st_fence from
> stonith-api.10797: Operation now in progress (-115)
>
> corosyncdebug [TOTEM ] Resetting old ring state
> corosyncdebug [TOTEM ] recovery to regular 1-0

Ah, the "good old corosync rings". I just guess there are lots of bugs to be
fixed. I can imagine that when you have NFS or cLVM or GFS or any other CFS
on the same net that a corosync ring uses, corosync will go crazy under
network load. (See the corosync.conf sketch after this log excerpt.)

> corosyncdebug [MAIN ] Member left: r(0) ip(10.10.1.2) r(1)
> ip(192.168.12.142)
> corosyncdebug [TOTEM ] waiting_trans_ack changed to 1
> corosyncdebug [TOTEM ] entering OPERATIONAL state.
> corosyncnotice [TOTEM ] A new membership (10.10.1.1:588) was formed.
> Members left: 2
> corosyncdebug [SYNC ] Committing synchronization for corosync
> configuration map access
> corosyncdebug [QB ] Not first sync -> no action
> corosyncdebug [CPG ] comparing: sender r(0) ip(10.10.1.1) r(1)
> ip(192.168.12.140) ; members(old:2 left:1)
> corosyncdebug [CPG ] chosen downlist: sender r(0) ip(10.10.1.1) r(1)
> ip(192.168.12.140) ; members(old:2 left:1)
> corosyncdebug [CPG ] got joinlist message from node 1
> corosyncdebug [SYNC ] Committing synchronization for corosync cluster
> closed process group service v1.01
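Regarding the rings mentioned above: the r(0)/r(1) addresses in the log show
you already run two rings, so it is worth checking whether both are actually
healthy. For reference, a two-ring corosync.conf sketch using the addresses
from your log; cluster_name, transport and rrp_mode are assumptions on my
side, not taken from your configuration:

    totem {
        version: 2
        cluster_name: mycluster      # placeholder
        transport: udpu              # assumption; you may be using multicast
        rrp_mode: passive            # redundant ring mode; 'active' is the alternative
    }

    nodelist {
        node {
            ring0_addr: 10.10.1.1
            ring1_addr: 192.168.12.140
            nodeid: 1
        }
        node {
            ring0_addr: 10.10.1.2
            ring1_addr: 192.168.12.142
            nodeid: 2
        }
    }

"corosync-cfgtool -s" on each node shows the current status of every ring.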
> and a little later most log entries are:
> cib: info: crm_cs_flush: Sent 0 CPG messages (3 remaining,
> last=25): Try again (6)
>
> The "Sent 0 CPG messages" is logged forever until I force a reboot of this
> node.
>
> On node 2 (the one I did init 0 on) I can find:
> stonith-ng[1415]: notice: log_operation: Operation 'monitor' [17088] for
> device 'ipmi-fencing-node1' returned: -201 (Generic Pacemaker error)
> several lines from crmd, attrd, pengine about ipmi-fencing
>
> Hard to know what log entries are important.

Yes, I'm still learning, too.

> But as a summary: after power on my 2 node cluster works fine, reboots
> and other node failure tests all work fine. But after letting the
> cluster run for 2 days, when I do a node failure test, parts of the
> cluster services fail to stop on the node where the failure is simulated
> and both nodes stop working (even though only one node was shut down).

I suspect it's not the running time, but the load at some point in these
two days.

> The version of corosync and pacemaker is somewhat old - it is the
> official version available for our ubuntu version. Is this a known
> problem?

Can't tell, sorry!

> I have seen that there are newer versions available, pacemaker has many
> changes done as I see on github. If this is a known problem, which
> versions of corosync and pacemaker should I try to change to?
>
> Or do you have some other idea what I can test/try to pin this down?
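One generic approach is to capture the cluster state from both nodes the
next time it happens, before forcing a reboot. A few commands that should be
available on your installation (the crm_report time and target path are just
examples):

    # one-shot cluster status, independent of pcsd
    crm_mon -1

    # corosync's own view: ring status and quorum
    corosync-cfgtool -s
    corosync-quorumtool -s

    # DLM lockspaces and pending recovery/fencing, relevant for the blocked gfs2
    dlm_tool ls
    dlm_tool dump

    # collect logs, CIB and corosync data for later analysis
    crm_report -f "2015-09-08 00:00" /tmp/cluster-report

Comparing that output between a fresh boot and the hung state should narrow
down which layer (corosync, fencing, DLM or gfs2) gets stuck first.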
Regards,
Ulrich

> Dan

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org