>>> Ken Gaillot <kgail...@redhat.com> wrote on 28.09.2018 at 15:50 in
>>> message <1538142642.4679.1.ca...@redhat.com>:
> On Fri, 2018-09-28 at 15:26 +0530, Prasad Nagaraj wrote:
>> Hi Ken - Only if I turn off corosync on the node [where I crashed
>> pacemaker] are the other nodes able to detect this and mark the node
>> as OFFLINE.
>> Do you have any other guidance or insights into this?
>
> Yes, corosync is the cluster membership layer -- if corosync is
> successfully running, then the node is a member of the cluster.
> Pacemaker's crmd provides a higher level of membership; typically, with
> corosync but no crmd, the node shows up as "pending" in status. However,
> I am not sure how it worked with the old corosync plugin.
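The two membership layers Ken describes can be inspected separately. A sketch of the commands (a fragment, not runnable outside a live corosync/Pacemaker node, and output formats vary by version):

```shell
# Corosync-level membership: ring status as corosync itself sees it
corosync-cfgtool -s

# Pacemaker-level membership: with corosync up but crmd down, the node
# typically shows as "pending" rather than "Online" in the node list
crm_mon -1 | grep -E 'Online|OFFLINE|pending'
```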
Maybe crmd should "feed a watchdog with tranquilizers" (meaning that if
it stops doing so, the watchdog will wake up and reset the node). ;-)

Regards,
Ulrich

>
>> Thanks
>> Prasad
>>
>> On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj
>> <prasad.nagaraj76@gmail.com> wrote:
>> > Hi Ken - Thanks for the response. Pacemaker is still not running on
>> > that node, so I am still wondering what the issue could be. Are
>> > there any other configurations or logs I should share to understand
>> > this better?
>> >
>> > Thanks!
>> >
>> > On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot <kgail...@redhat.com>
>> > wrote:
>> > > On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
>> > > > Hello - I was trying to understand the behavior of the cluster
>> > > > when pacemaker crashes on one of the nodes, so I hard-killed
>> > > > pacemakerd and its related processes.
>> > > >
>> > > > -----------------------------------------------------------------
>> > > > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root     74022     1  0 07:53 pts/0  00:00:00 pacemakerd
>> > > > 189      74028 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/cib
>> > > > root     74029 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/stonithd
>> > > > root     74030 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/lrmd
>> > > > 189      74031 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/attrd
>> > > > 189      74032 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/pengine
>> > > > 189      74033 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/crmd
>> > > > root     75228 50092  0 07:54 pts/0  00:00:00 grep pacemaker
>> > > > [root@SG-mysqlold-907 azureuser]# kill -9 74022
>> > > >
>> > > > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root     74030     1  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/lrmd
>> > > > 189      74032     1  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/pengine
>> > > > root     75303 50092  0 07:55 pts/0  00:00:00 grep pacemaker
>> > > > [root@SG-mysqlold-907 azureuser]# kill -9 74030
>> > > > [root@SG-mysqlold-907 azureuser]# kill -9 74032
>> > > > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root     75332 50092  0 07:55 pts/0  00:00:00 grep pacemaker
>> > > >
>> > > > [root@SG-mysqlold-907 azureuser]# crm status
>> > > > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
>> > > > Transport endpoint is not connected
>> > > > -----------------------------------------------------------------
>> > > >
>> > > > However, this does not seem to have any effect on the cluster
>> > > > status as seen from the other nodes:
>> > > > -----------------------------------------------------------------
>> > > > [root@SG-mysqlold-909 azureuser]# crm status
>> > > > Last updated: Thu Sep 27 07:56:17 2018
>> > > > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > > > Stack: classic openais (with plugin)
>> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
>> > > > partition with quorum
>> > > > 3 nodes and 3 resources configured, 3 expected votes
>> > > >
>> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>> > >
>> > > It most definitely would make the node offline, and if fencing
>> > > were configured, the rest of the cluster would fence the node to
>> > > make sure it's safely down.
>> > >
>> > > I see you're using the old corosync 1 plugin. I suspect what
>> > > happened in this case is that corosync noticed the plugin died and
>> > > restarted it quickly enough that it had rejoined by the time you
>> > > checked the status elsewhere.
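Ken's respawn theory can be checked by comparing PIDs before and after the kill. Below is a toy, self-contained model of that check; the background loop stands in for corosync respawning the plugin processes, and `sleep 60` stands in for crmd. On a real node you would compare `pidof pacemakerd` (or the crmd PID from `ps`) instead.

```shell
#!/bin/sh
# Toy supervisor: keeps a child alive, the way corosync respawns the
# plugin processes. "sleep 60" stands in for crmd.
( while true; do sleep 60 & wait $!; done ) &
supervisor=$!
sleep 1

old_pid=$(pgrep -P "$supervisor" sleep)   # child PID before the kill
kill -9 "$old_pid"
sleep 1                                   # give the supervisor time to respawn
new_pid=$(pgrep -P "$supervisor" sleep)   # child PID after the kill

# A changed PID means the process died and was respawned -- which is
# why a quick "crm status" elsewhere may never see the node offline.
if [ -n "$new_pid" ] && [ "$new_pid" != "$old_pid" ]; then
    echo "respawned: $old_pid -> $new_pid"
fi

kill "$supervisor" "$new_pid" 2>/dev/null || true  # clean up the demo
```

The same one-second window is the point: if the respawn happens faster than your status check on another node, the membership gap is invisible from the outside.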
>> > > >
>> > > > Full list of resources:
>> > > >
>> > > >  Master/Slave Set: ms_mysql [p_mysql]
>> > > >      Masters: [ SG-mysqlold-909 ]
>> > > >      Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> > > >
>> > > > [root@SG-mysqlold-908 azureuser]# crm status
>> > > > Last updated: Thu Sep 27 07:56:08 2018
>> > > > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > > > Stack: classic openais (with plugin)
>> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
>> > > > partition with quorum
>> > > > 3 nodes and 3 resources configured, 3 expected votes
>> > > >
>> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>> > > >
>> > > > Full list of resources:
>> > > >
>> > > >  Master/Slave Set: ms_mysql [p_mysql]
>> > > >      Masters: [ SG-mysqlold-909 ]
>> > > >      Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> > > > -----------------------------------------------------------------
>> > > >
>> > > > I am a bit surprised that the other nodes are not able to detect
>> > > > that pacemaker is down on one of the nodes - SG-mysqlold-907.
>> > > >
>> > > > Even if I kill pacemaker on the node which is the DC, I observe
>> > > > the same behavior, with the rest of the nodes not detecting that
>> > > > the DC is down.
>> > > >
>> > > > Could someone explain what the expected behavior is in these
>> > > > cases?
>> > > >
>> > > > I am using corosync 1.4.7 and pacemaker 1.1.14.
>> > > >
>> > > > Thanks in advance
>> > > > Prasad
>> > > >
>> > > > _______________________________________________
>> > > > Users mailing list: Users@clusterlabs.org
>> > > > https://lists.clusterlabs.org/mailman/listinfo/users
>> > > >
>> > > > Project Home: http://www.clusterlabs.org
>> > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > > > Bugs: http://bugs.clusterlabs.org
> --
> Ken Gaillot <kgail...@redhat.com>
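Ulrich's watchdog suggestion is, in practice, what the sbd daemon provides with a hardware watchdog: a daemon keeps "feeding" /dev/watchdog, and when the feeding stops the watchdog resets the node. A toy, self-contained shell model of that mechanism (the timeout and function names are made up for the demo; nothing here is Pacemaker or sbd code):

```shell
#!/bin/sh
# Toy watchdog: "fires" if it has not been fed within TIMEOUT seconds.
# A real watchdog (e.g. /dev/watchdog fed by sbd) would hard-reset the
# node instead of letting us print a message.
TIMEOUT=2
last_fed=$(date +%s)

feed() { last_fed=$(date +%s); }   # what a healthy crmd would keep doing
fired() { [ $(( $(date +%s) - last_fed )) -gt "$TIMEOUT" ]; }

# Healthy phase: crmd feeds the watchdog on time
sleep 1
feed
if fired; then echo "unexpected: watchdog fired while healthy"; fi

# Failure phase: crmd is killed and stops feeding
sleep 3
if fired; then echo "watchdog fired: the node would now be reset"; fi
```

The design point is that the reset no longer depends on the failed node's cluster stack being alive to report anything: silence itself triggers the recovery.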