Hi Ken - Only if I turn off corosync on the node [where I crashed pacemaker] are the other nodes able to detect it and mark the node as OFFLINE. Do you have any other guidance or insights into this?
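
For reference, one rough way to double-check whether the pacemaker daemons on SG-mysqlold-907 really stayed down, or were quietly respawned with new PIDs as Ken suggested. This is only a sketch: the ps options are standard procps, but the corosync log path is an assumption for a RHEL 6 / corosync 1.x setup and should be replaced by whatever "logfile:" is set to in /etc/corosync/corosync.conf.

    # Are the pacemaker daemons back, and if so, when were they started?
    ps -o pid,ppid,lstart,args -C pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd

    # Look for start/respawn messages around the time of the kill
    # (log path is an assumption - adjust to your corosync.conf logfile setting)
    grep -iE 'pacemaker|respawn' /var/log/cluster/corosync.log | tail -n 50
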
Thanks
Prasad

On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj <[email protected]> wrote:

> Hi Ken - Thanks for the response. Pacemaker is still not running on that
> node, so I am still wondering what the issue could be. Are there any other
> configurations or logs I should be sharing to understand this better?
>
> Thanks!
>
> On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot <[email protected]> wrote:
>
>> On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
>> > Hello - I was trying to understand the behavior of the cluster when
>> > pacemaker crashes on one of the nodes, so I hard-killed pacemakerd
>> > and its related processes.
>> >
>> > --------------------------------------------------------------------
>> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > root     74022     1  0 07:53 pts/0    00:00:00 pacemakerd
>> > 189      74028 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/cib
>> > root     74029 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/stonithd
>> > root     74030 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/lrmd
>> > 189      74031 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/attrd
>> > 189      74032 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/pengine
>> > 189      74033 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/crmd
>> > root     75228 50092  0 07:54 pts/0    00:00:00 grep pacemaker
>> >
>> > [root@SG-mysqlold-907 azureuser]# kill -9 74022
>> >
>> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > root     74030     1  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/lrmd
>> > 189      74032     1  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/pengine
>> > root     75303 50092  0 07:55 pts/0    00:00:00 grep pacemaker
>> >
>> > [root@SG-mysqlold-907 azureuser]# kill -9 74030
>> > [root@SG-mysqlold-907 azureuser]# kill -9 74032
>> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > root     75332 50092  0 07:55 pts/0    00:00:00 grep pacemaker
>> >
>> > [root@SG-mysqlold-907 azureuser]# crm satus
>> > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
>> > Transport endpoint is not connected
>> > --------------------------------------------------------------------
>> >
>> > However, this does not seem to have any effect on the cluster
>> > status as seen from the other nodes:
>> > --------------------------------------------------------------------
>> > [root@SG-mysqlold-909 azureuser]# crm status
>> > Last updated: Thu Sep 27 07:56:17 2018
>> > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > Stack: classic openais (with plugin)
>> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
>> > 3 nodes and 3 resources configured, 3 expected votes
>> >
>> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>>
>> It most definitely would make the node offline, and if fencing were
>> configured, the rest of the cluster would fence the node to make sure
>> it's safely down.
>>
>> I see you're using the old corosync 1 plugin. I suspect what happened
>> in this case is that corosync noticed the plugin died and restarted it
>> quickly enough that it had rejoined by the time you checked the status
>> elsewhere.
>> > Full list of resources:
>> >
>> > Master/Slave Set: ms_mysql [p_mysql]
>> >     Masters: [ SG-mysqlold-909 ]
>> >     Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> >
>> > [root@SG-mysqlold-908 azureuser]# crm status
>> > Last updated: Thu Sep 27 07:56:08 2018
>> > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > Stack: classic openais (with plugin)
>> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
>> > 3 nodes and 3 resources configured, 3 expected votes
>> >
>> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>> >
>> > Full list of resources:
>> >
>> > Master/Slave Set: ms_mysql [p_mysql]
>> >     Masters: [ SG-mysqlold-909 ]
>> >     Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> > --------------------------------------------------------------------
>> >
>> > I am a bit surprised that the other nodes are not able to detect that
>> > pacemaker is down on one of the nodes - SG-mysqlold-907.
>> >
>> > Even if I kill pacemaker on the node that is the DC, I observe the
>> > same behavior, with the rest of the nodes not detecting that the DC is down.
>> >
>> > Could someone explain what the expected behavior is in these cases?
>> >
>> > I am using corosync 1.4.7 and pacemaker 1.1.14.
>> >
>> > Thanks in advance
>> > Prasad
>> --
>> Ken Gaillot <[email protected]>
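
Since Ken's reply above notes that, with fencing configured, the rest of the cluster would have fenced the crashed node, here is a minimal crmsh sketch of what enabling a test-only STONITH device could look like. This is an illustration, not a recommendation: external/ssh (shipped with cluster-glue) merely reboots nodes over ssh and is unsuitable for production, the resource name fence-test is made up, and a real Azure deployment would need an agent that can actually power the VM off.

    # Test-only fencing sketch (crmsh syntax); "fence-test" is an illustrative name.
    crm configure primitive fence-test stonith:external/ssh \
        params hostlist="SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909" \
        op monitor interval=60s
    crm configure property stonith-enabled=true
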
_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
