Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.

2019-01-05 Thread renayama19661014
Hi Jan,
Hi Ken,

Thanks for your comment.

I am going to look a little further into a possible problem in libqb.


Many thanks,
Hideo Yamauchi.


- Original Message -
> From: Ken Gaillot 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2019/1/3, Thu 01:26
> Subject: Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.
> 
> On Wed, 2019-01-02 at 15:43 +0100, Jan Pokorný wrote:
>>  On 28/12/18 05:51 +0900, renayama19661...@ybb.ne.jp wrote:
>>  > This problem occurred with our users.
>>  > 
>>  > The following problem occurred in a two-node cluster that does not
>>  > have STONITH configured.
>>  > 
>>  > The problem seems to have occurred in the following procedure.
>>  > 
>>  > Step 1) Configure the cluster with 2 nodes. The DC node is the
>>  > second node.
>>  > Step 2) Several resources are running on the first node.
>>  > Step 3) Both nodes are stopped at almost the same time, the 2nd node
>>  > first and then the 1st node.
>> 
>>  Do I decipher the above correctly that the cluster is scheduled for
>>  shutdown (fully independently node by node or through a single
>>  trigger
>>  with a high level management tool?) and starts proceeding in serial
>>  manner, shutting 2nd node ~ original DC first?
>> 
>>  > Step 4) After the second node stops, the first node tries to
>>  >         calculate the state transition for the resource stop.
>>  > 
>>  > However, crmd fails to connect with pengine and does not calculate
>>  > state transitions.
>>  > 
>>  > -
>>  > Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client
>>  > connection failed, not adding channel to mainloop
>>  > -
>> 
>>  Sadly, it looks like details of why this happened would only be
>>  retained when debugging/tracing verbosity of the log messages
>>  was enabled, which likely wasn't the case.
>> 
>>  Anyway, perhaps providing a wider context of the log messages
>>  from this first node might shed some light on this.
> 
> Agreed, that's probably the only hope.
> 
> This would have to be a low-level issue like an out-of-memory error, or
> something at the libqb level.
> 
>>  > As a result, Pacemaker will stop without stopping the resource.
>> 
>>  This might have serious consequences in some scenarios, perhaps
>>  unless some watchdog-based solution (SBD?) was used as a fencing
>>  of choice since it would not get defused just as the resource
>>  wasn't stopped, I think...
> 
> Yep, this is unavoidable in this situation. If the last node standing
> has an unrecoverable problem, there's no other node remaining to fence
> it and recover.
> 
>>  > The problem seems to have occurred in the following environment.
>>  > 
>>  >  - libqb 1.0
>>  >  - corosync 2.4.1
>>  >  - Pacemaker 1.1.15
>>  > 
>>  > I tried to reproduce this problem, but so far I have not been able
>>  > to.
>>  > 
>>  > Do you know the cause of this problem?
>> 
>>  No idea at this point.
> -- 
> Ken Gaillot 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 
> 



Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.

2019-01-02 Thread Ken Gaillot
On Wed, 2019-01-02 at 15:43 +0100, Jan Pokorný wrote:
> On 28/12/18 05:51 +0900, renayama19661...@ybb.ne.jp wrote:
> > This problem occurred with our users.
> > 
> > The following problem occurred in a two-node cluster that does not
> > have STONITH configured.
> > 
> > The problem seems to have occurred in the following procedure.
> > 
> > Step 1) Configure the cluster with 2 nodes. The DC node is the
> > second node.
> > Step 2) Several resources are running on the first node.
> > Step 3) Both nodes are stopped at almost the same time, the 2nd node
> > first and then the 1st node.
> 
> Do I decipher the above correctly that the cluster is scheduled for
> shutdown (fully independently node by node or through a single
> trigger
> with a high level management tool?) and starts proceeding in serial
> manner, shutting 2nd node ~ original DC first?
> 
> > Step 4) After the second node stops, the first node tries to
> > calculate the state transition for the resource stop.
> > 
> > However, crmd fails to connect with pengine and does not calculate
> > state transitions.
> > 
> > -
> > Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client
> > connection failed, not adding channel to mainloop
> > -
> 
> Sadly, it looks like details of why this happened would only be
> retained when debugging/tracing verbosity of the log messages
> was enabled, which likely wasn't the case.
> 
> Anyway, perhaps providing a wider context of the log messages
> from this first node might shed some light on this.

Agreed, that's probably the only hope.

This would have to be a low-level issue like an out-of-memory error, or
something at the libqb level.
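
As a rough aid for checking one such low-level condition: libqb's shared-memory IPC keeps its ring buffers as files under /dev/shm, so a full or nearly full /dev/shm is one thing worth ruling out. Nothing in this thread confirms that it was the cause here. The sketch below is only an illustration; the helper names, the "qb-" file-name prefix check, and the 64 MiB threshold are assumptions, not values taken from Pacemaker or libqb.

```python
import os
import shutil


def shm_headroom(path="/dev/shm", min_free_bytes=64 * 1024 * 1024):
    """Report free space under `path` and compare it with an assumed threshold.

    The 64 MiB default is an arbitrary illustration value, not a libqb limit.
    """
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "total": usage.total,
        "free": usage.free,
        "enough": usage.free >= min_free_bytes,
    }


def qb_files(path="/dev/shm"):
    """List entries whose names start with 'qb-' (assumed libqb naming)."""
    return sorted(name for name in os.listdir(path) if name.startswith("qb-"))
```

Calling shm_headroom() and qb_files() on the affected node shortly after a failure would show whether shared memory was tight and whether stale ring-buffer files were left behind.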

> > As a result, Pacemaker will stop without stopping the resource.
> 
> This might have serious consequences in some scenarios, perhaps
> unless some watchdog-based solution (SBD?) was used as a fencing
> of choice since it would not get defused just as the resource
> wasn't stopped, I think...

Yep, this is unavoidable in this situation. If the last node standing
has an unrecoverable problem, there's no other node remaining to fence
it and recover.

> > The problem seems to have occurred in the following environment.
> > 
> >  - libqb 1.0
> >  - corosync 2.4.1
> >  - Pacemaker 1.1.15
> > 
> > I tried to reproduce this problem, but so far I have not been able
> > to.
> > 
> > Do you know the cause of this problem?
> 
> No idea at this point.
-- 
Ken Gaillot 



Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.

2019-01-02 Thread Jan Pokorný
On 28/12/18 05:51 +0900, renayama19661...@ybb.ne.jp wrote:
> This problem occurred with our users.
> 
> The following problem occurred in a two-node cluster that does not have
> STONITH configured.
> 
> The problem seems to have occurred in the following procedure.
> 
> Step 1) Configure the cluster with 2 nodes. The DC node is the second node.
> Step 2) Several resources are running on the first node.
> Step 3) Both nodes are stopped at almost the same time, the 2nd node first and then the 1st node.

Do I decipher the above correctly that the cluster is scheduled for
shutdown (fully independently node by node or through a single trigger
with a high level management tool?) and starts proceeding in serial
manner, shutting 2nd node ~ original DC first?

> Step 4) After the second node stops, the first node tries to
> calculate the state transition for the resource stop.
> 
> However, crmd fails to connect with pengine and does not calculate state 
> transitions.
> 
> -
> Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client connection 
> failed, not adding channel to mainloop
> -

Sadly, it looks like details of why this happened would only be
retained when debugging/tracing verbosity of the log messages
was enabled, which likely wasn't the case.

Anyway, perhaps providing a wider context of the log messages
from this first node might shed some light on this.
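
To make "wider context" concrete, here is a minimal sketch of pulling the lines surrounding each occurrence of the warning out of a log that has already been read into a list. It assumes a plain syslog layout ("Mon DD HH:MM:SS host daemon[pid]: message"). The function names and the default window sizes are hypothetical, chosen only for illustration.

```python
import re

# Text of the warning quoted in the report.
NEEDLE = "Setup of client connection failed"

# Assumed syslog layout: "Mon DD HH:MM:SS host daemon[pid]: message".
SYSLOG_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+([^\s\[]+)\[(\d+)\]:\s+(.*)$"
)


def parse_line(line):
    """Split one syslog line into fields, or return None if it does not match."""
    match = SYSLOG_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    stamp, host, daemon, pid, message = match.groups()
    return {"stamp": stamp, "host": host, "daemon": daemon,
            "pid": int(pid), "message": message}


def context_windows(lines, needle=NEEDLE, before=200, after=50):
    """Return one list of lines per occurrence of `needle`, with context.

    `before` and `after` are arbitrary defaults; widen them as needed.
    """
    windows = []
    for index, line in enumerate(lines):
        if needle in line:
            start = max(0, index - before)
            windows.append(lines[start:index + after + 1])
    return windows
```

Reading the first node's log into a list of lines and passing it to context_windows() gives the excerpts to post; parse_line() can then be used to keep only entries from crmd and pengine if the excerpt is too noisy.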

> As a result, Pacemaker will stop without stopping the resource.

This might have serious consequences in some scenarios, perhaps
unless some watchdog-based solution (SBD?) was used as a fencing
of choice since it would not get defused just as the resource
wasn't stopped, I think...

> The problem seems to have occurred in the following environment.
> 
>  - libqb 1.0
>  - corosync 2.4.1
>  - Pacemaker 1.1.15
> 
> I tried to reproduce this problem, but so far I have not been able to.
> 
> Do you know the cause of this problem?

No idea at this point.

-- 
Nazdar,
Jan (Poki)

