Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.
Hi Jan,
Hi Ken,

Thanks for your comment.

I am going to check a little more about the problem of libqb.

Many thanks,
Hideo Yamauchi.

- Original Message -
> From: Ken Gaillot
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Cc:
> Date: 2019/1/3, Thu 01:26
> Subject: Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.
>
> On Wed, 2019-01-02 at 15:43 +0100, Jan Pokorný wrote:
>> On 28/12/18 05:51 +0900, renayama19661...@ybb.ne.jp wrote:
>> > This problem occurred with our users.
>> >
>> > The following problem occurred in a two-node cluster that does not
>> > set STONITH.
>> >
>> > The problem seems to have occurred in the following procedure.
>> >
>> > Step 1) Configure the cluster with 2 nodes. The DC node is the
>> > second node.
>> > Step 2) Several resources are running on the first node.
>> > Step 3) It stops almost at the same time in order of 2nd node and
>> > 1st node.
>>
>> Do I decipher the above correctly that the cluster is scheduled for
>> shutdown (fully independently node by node or through a single trigger
>> with a high level management tool?) and starts proceeding in serial
>> manner, shutting 2nd node ~ original DC first?
>>
>> > Step 4) After the second node stops, the first node tries to
>> > calculate the state transition for the resource stop.
>> >
>> > However, crmd fails to connect with pengine and does not calculate
>> > state transitions.
>> >
>> > -
>> > Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client connection failed, not adding channel to mainloop
>> > -
>>
>> Sadly, it looks like details of why this happened would only be
>> retained when debugging/tracing verbosity of the log messages
>> was enabled, which likely wasn't the case.
>>
>> Anyway, perhaps providing a wider context of the log messages
>> from this first node might shed some light into this.
>
> Agreed, that's probably the only hope.
>
> This would have to be a low-level issue like an out-of-memory error, or
> something at the libqb level.
>
>> > As a result, Pacemaker will stop without stopping the resource.
>>
>> This might have serious consequences in some scenarios, perhaps
>> unless some watchdog-based solution (SBD?) was used as a fencing
>> of choice since it would not get defused just as the resource
>> wasn't stopped, I think...
>
> Yep, this is unavoidable in this situation. If the last node standing
> has an unrecoverable problem, there's no other node remaining to fence
> it and recover.
>
>> > The problem seems to have occurred in the following environment.
>> >
>> > - libqb 1.0
>> > - corosync 2.4.1
>> > - Pacemaker 1.1.15
>> >
>> > I tried to reproduce this problem, but for now it can not be
>> > reproduced.
>> >
>> > Do you know the cause of this problem?
>>
>> No idea at this point.
> --
> Ken Gaillot

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.
On Wed, 2019-01-02 at 15:43 +0100, Jan Pokorný wrote:
> On 28/12/18 05:51 +0900, renayama19661...@ybb.ne.jp wrote:
> > This problem occurred with our users.
> >
> > The following problem occurred in a two-node cluster that does not
> > set STONITH.
> >
> > The problem seems to have occurred in the following procedure.
> >
> > Step 1) Configure the cluster with 2 nodes. The DC node is the
> > second node.
> > Step 2) Several resources are running on the first node.
> > Step 3) It stops almost at the same time in order of 2nd node and
> > 1st node.
>
> Do I decipher the above correctly that the cluster is scheduled for
> shutdown (fully independently node by node or through a single trigger
> with a high level management tool?) and starts proceeding in serial
> manner, shutting 2nd node ~ original DC first?
>
> > Step 4) After the second node stops, the first node tries to
> > calculate the state transition for the resource stop.
> >
> > However, crmd fails to connect with pengine and does not calculate
> > state transitions.
> >
> > -
> > Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client connection failed, not adding channel to mainloop
> > -
>
> Sadly, it looks like details of why this happened would only be
> retained when debugging/tracing verbosity of the log messages
> was enabled, which likely wasn't the case.
>
> Anyway, perhaps providing a wider context of the log messages
> from this first node might shed some light into this.

Agreed, that's probably the only hope.

This would have to be a low-level issue like an out-of-memory error, or
something at the libqb level.

> > As a result, Pacemaker will stop without stopping the resource.
>
> This might have serious consequences in some scenarios, perhaps
> unless some watchdog-based solution (SBD?) was used as a fencing
> of choice since it would not get defused just as the resource
> wasn't stopped, I think...

Yep, this is unavoidable in this situation. If the last node standing
has an unrecoverable problem, there's no other node remaining to fence
it and recover.

> > The problem seems to have occurred in the following environment.
> >
> > - libqb 1.0
> > - corosync 2.4.1
> > - Pacemaker 1.1.15
> >
> > I tried to reproduce this problem, but for now it can not be
> > reproduced.
> >
> > Do you know the cause of this problem?
>
> No idea at this point.
--
Ken Gaillot
Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.
On 28/12/18 05:51 +0900, renayama19661...@ybb.ne.jp wrote:
> This problem occurred with our users.
>
> The following problem occurred in a two-node cluster that does not set
> STONITH.
>
> The problem seems to have occurred in the following procedure.
>
> Step 1) Configure the cluster with 2 nodes. The DC node is the second node.
> Step 2) Several resources are running on the first node.
> Step 3) It stops almost at the same time in order of 2nd node and 1st node.

Do I decipher the above correctly that the cluster is scheduled for
shutdown (fully independently node by node or through a single trigger
with a high level management tool?) and starts proceeding in serial
manner, shutting 2nd node ~ original DC first?

> Step 4) After the second node stops, the first node tries to
> calculate the state transition for the resource stop.
>
> However, crmd fails to connect with pengine and does not calculate state
> transitions.
>
> -
> Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client connection failed, not adding channel to mainloop
> -

Sadly, it looks like details of why this happened would only be
retained when debugging/tracing verbosity of the log messages
was enabled, which likely wasn't the case.

Anyway, perhaps providing a wider context of the log messages
from this first node might shed some light into this.

> As a result, Pacemaker will stop without stopping the resource.

This might have serious consequences in some scenarios, perhaps
unless some watchdog-based solution (SBD?) was used as a fencing
of choice since it would not get defused just as the resource
wasn't stopped, I think...

> The problem seems to have occurred in the following environment.
>
> - libqb 1.0
> - corosync 2.4.1
> - Pacemaker 1.1.15
>
> I tried to reproduce this problem, but for now it can not be reproduced.
>
> Do you know the cause of this problem?

No idea at this point.

--
Nazdar,
Jan (Poki)
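
For anyone gathering the wider log context asked for in this thread, here is a minimal sketch in Python of one way to pull the lines surrounding the crmd warning out of a log file. It is not a Pacemaker tool: the function name `context_around`, the window sizes, and the idea of reading the system log into a list of lines are all assumptions made for illustration. Only the warning text itself comes from the report above.

```python
from typing import List

# Warning text taken from the log line quoted in this thread.
WARNING_TEXT = "Setup of client connection failed"


def context_around(lines: List[str],
                   needle: str = WARNING_TEXT,
                   before: int = 50,
                   after: int = 50) -> List[str]:
    """Return the lines within `before`/`after` lines of every line
    containing `needle`, in original order and without duplicates.

    Hypothetical helper: the name and default window sizes are
    illustrative choices, not part of any Pacemaker utility.
    """
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - before)
            stop = min(len(lines), i + after + 1)
            keep.update(range(start, stop))
    return [lines[i] for i in sorted(keep)]


# Example use (the log path is an assumption; adjust to your system):
#   with open("/var/log/messages", errors="replace") as fh:
#       for entry in context_around(fh.read().splitlines()):
#           print(entry)
```

Overlapping windows are merged, so two warnings close together produce one contiguous excerpt rather than repeated lines.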