Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-29 Thread Ken Gaillot
On Wed, 2021-04-28 at 19:19 +0200, Jehan-Guillaume de Rorthais wrote:
> On Wed, 28 Apr 2021 12:00:40 -0500
> Ken Gaillot  wrote:
> 
> > On Wed, 2021-04-28 at 18:14 +0200, Jehan-Guillaume de Rorthais
> > wrote:
> > > Hi all,
> > > 
> > > It seems to me the concern raised by Ulrich hasn't been
> > > discussed:
> > > 
> > > On Wed, 12 Apr 2021 Ulrich Windl wrote:
> > >   
> > > > Personally I think an RA calling crm_mon is inherently broken:
> > > > Will
> > > > it ever
> > > > pass ocf-tester?  
> > 
> > Calling the command-line tools in an agent can be OK in some cases.
> > The
> > main concerns are:
> > 
> > * Time-of-check/time-of-use: cluster status can change immediately,
> > so
> > the agent should behave reasonably if a query result is incorrect
> > at
> > the moment it's used. Ideally there would be no case where the
> > agent
> > could incorrectly report success for an action.
> > 
> > * No commands that *change* the configuration (other than setting
> > node
> > attributes) should ever be used. Otherwise there's a potential for
> > an
> > infinite loop between the agent and scheduler.
> > 
> > * It's best to use tools' XML output when available, because that
> > should be stable across Pacemaker releases, while the text output
> > may
> > not be. Aside from crm_mon, XML output is a recent addition, so
> > some
> > consideration must be given to backward compatibility and/or
> > requiring
> > a minimum Pacemaker version.
> > 
> > * Only the configuration section of the CIB has a guaranteed
> > schema.
> > The status section can theoretically change from release to
> > release,
> > although in practice it has changed very little over the years.
> > 
> > I don't use ocf-tester so I can't speak to that, but I suspect it
> > could
> > work if you exported a CIB_file variable with a sample cluster
> > status
> > beforehand. (CIB_file makes the cluster commands act as if the
> > specified file is the live CIB at the moment.)
> > 
> > > Would it be possible to rely on the following command ?
> > > 
> > >   cibadmin --query --xpath "//status/node_state[@join='member']"
> > > | \
> > > grep -Po 'uname="\K[^"]+'
> > > 
> > > 
> > > Regards,  
> > 
> > Only full cluster nodes will have a "join" attribute, so that query
> > won't catch active remote nodes or guest nodes. Whether that's good
> > or
> > bad depends on what you're looking for.
> 
> That was an example to remove the crm_mon dependency with the
> cibadmin one.
> AFAIU this agent, it uses crm_mon to:
> 
> * look for the node hosting the promoted clone
> * look for a node existence
> * look for a node fully joined
> 
> all of these use seems accessible by parsing the cibadmin status
> section
> output (or --xpath).

I would think remote nodes and guest nodes should be considered, too,
unless the agent specifically doesn't support that.

Remote nodes and guest nodes don't join the controller layer, so they
won't have a join entry, but they can run resources.

> > The plus side is that it's a query and it returns XML.
> 
> indeed.
> 
> > The downsides are that node status can change quickly, so it could
> > theoretically be inaccurate a moment later when you use it, and the
> > status section is not guaranteed to stay in that format (though I
> > expect that particular part will).
> 
> There's already version checks in pgsql RA code for crm_mon anyway,
> relying on
> OCF_RESKEY_crm_feature_set.
> 
> > A minor point: that query will return the entire node_state XML
> > subtree; you can add -n/--no-children to return just the node_state
> > element itself.
> 
> Nice!
> 
> I was playing with xmllint as well, for an expanded support of
> xmllint, but it
> would add a strong dependency.
> 
> Regards,
> 
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-28 Thread Jehan-Guillaume de Rorthais
On Wed, 28 Apr 2021 12:00:40 -0500
Ken Gaillot  wrote:

> On Wed, 2021-04-28 at 18:14 +0200, Jehan-Guillaume de Rorthais wrote:
> > Hi all,
> > 
> > It seems to me the concern raised by Ulrich hasn't been discussed:
> > 
> > On Wed, 12 Apr 2021 Ulrich Windl wrote:
> >   
> > > Personally I think an RA calling crm_mon is inherently broken: Will
> > > it ever
> > > pass ocf-tester?  
> 
> Calling the command-line tools in an agent can be OK in some cases. The
> main concerns are:
> 
> * Time-of-check/time-of-use: cluster status can change immediately, so
> the agent should behave reasonably if a query result is incorrect at
> the moment it's used. Ideally there would be no case where the agent
> could incorrectly report success for an action.
> 
> * No commands that *change* the configuration (other than setting node
> attributes) should ever be used. Otherwise there's a potential for an
> infinite loop between the agent and scheduler.
> 
> * It's best to use tools' XML output when available, because that
> should be stable across Pacemaker releases, while the text output may
> not be. Aside from crm_mon, XML output is a recent addition, so some
> consideration must be given to backward compatibility and/or requiring
> a minimum Pacemaker version.
> 
> * Only the configuration section of the CIB has a guaranteed schema.
> The status section can theoretically change from release to release,
> although in practice it has changed very little over the years.
> 
> I don't use ocf-tester so I can't speak to that, but I suspect it could
> work if you exported a CIB_file variable with a sample cluster status
> beforehand. (CIB_file makes the cluster commands act as if the
> specified file is the live CIB at the moment.)
> 
> > Would it be possible to rely on the following command ?
> > 
> >   cibadmin --query --xpath "//status/node_state[@join='member']" | \
> > grep -Po 'uname="\K[^"]+'
> > 
> > 
> > Regards,  
> 
> Only full cluster nodes will have a "join" attribute, so that query
> won't catch active remote nodes or guest nodes. Whether that's good or
> bad depends on what you're looking for.

That was an example of replacing the crm_mon dependency with a cibadmin one.
AFAIU this agent, it uses crm_mon to:

* look for the node hosting the promoted clone
* check that a node exists
* check that a node has fully joined

All of these uses seem achievable by parsing the cibadmin status section
output (or by using --xpath).
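
A minimal sketch of the node checks, assuming the status section keeps its
current layout. The helper names (node_unames, joined_nodes, node_is_joined)
are hypothetical and not taken from the pgsql RA; only the cibadmin query
itself comes from this thread. Parsing is kept separate from the query so it
can be exercised against canned XML.

```shell
# Hypothetical helper: print the value of every uname="..." attribute
# found on stdin, one per line.
node_unames() {
    grep -o 'uname="[^"]*"' | cut -d'"' -f2
}

# Hypothetical helper: list cluster nodes whose node_state has join="member".
# This does not cover remote or guest nodes, which have no join attribute.
# If cibadmin fails, the list is simply empty.
joined_nodes() {
    cibadmin --query --xpath "//status/node_state[@join='member']" | node_unames
}

# Succeed only if the given node name is in the joined list (exact match).
node_is_joined() {
    joined_nodes | grep -Fqx -- "$1"
}
```

An agent would still have to decide what an empty list means (no members, or
a failed query), since the result can be stale by the time it is used.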

> The plus side is that it's a query and it returns XML.

indeed.

> The downsides are that node status can change quickly, so it could
> theoretically be inaccurate a moment later when you use it, and the
> status section is not guaranteed to stay in that format (though I
> expect that particular part will).

There are already version checks in the pgsql RA code for crm_mon anyway, relying on
OCF_RESKEY_crm_feature_set.
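
As a rough sketch of what such a version gate could look like: the helper
name feature_set_at_least and the "3.1.0" threshold are hypothetical, not
taken from the pgsql RA, and the comparison assumes a sort that supports -V
(as GNU coreutils does).

```shell
# Hypothetical helper: succeed if the cluster's crm_feature_set is at
# least the required version. Assumes a sort with -V (version sort).
feature_set_at_least() {
    required="$1"
    current="${OCF_RESKEY_crm_feature_set:-0}"
    # The smaller of the two versions sorts first; if that is the
    # required one, the current feature set is new enough.
    lowest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}

# Usage: pick a code path depending on the feature set.
# "3.1.0" is only a placeholder threshold.
if feature_set_at_least "3.1.0"; then
    : # parse the newer output format here
else
    : # fall back to the older parsing here
fi
```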

> A minor point: that query will return the entire node_state XML
> subtree; you can add -n/--no-children to return just the node_state
> element itself.

Nice!

I was playing with xmllint as well, for an expanded support of xmllint, but it
would add a strong dependency.

Regards,


Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-28 Thread Ken Gaillot
On Wed, 2021-04-28 at 18:14 +0200, Jehan-Guillaume de Rorthais wrote:
> Hi all,
> 
> It seems to me the concern raised by Ulrich hasn't been discussed:
> 
> On Wed, 12 Apr 2021 Ulrich Windl wrote:
> 
> > Personally I think an RA calling crm_mon is inherently broken: Will
> > it ever
> > pass ocf-tester?

Calling the command-line tools in an agent can be OK in some cases. The
main concerns are:

* Time-of-check/time-of-use: cluster status can change immediately, so
the agent should behave reasonably if a query result is incorrect at
the moment it's used. Ideally there would be no case where the agent
could incorrectly report success for an action.

* No commands that *change* the configuration (other than setting node
attributes) should ever be used. Otherwise there's a potential for an
infinite loop between the agent and scheduler.

* It's best to use tools' XML output when available, because that
should be stable across Pacemaker releases, while the text output may
not be. Aside from crm_mon, XML output is a recent addition, so some
consideration must be given to backward compatibility and/or requiring
a minimum Pacemaker version.

* Only the configuration section of the CIB has a guaranteed schema.
The status section can theoretically change from release to release,
although in practice it has changed very little over the years.
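
One way to approach the first concern is to make sure a failed or empty query
can never be mistaken for a valid answer. The following is only a sketch: the
helper name cluster_query is hypothetical, and the OCF return codes are
defined inline merely to keep it self-contained (an agent would normally get
them from ocf-shellfuncs).

```shell
# OCF return codes (normally provided by ocf-shellfuncs; defined here
# only so the sketch runs on its own).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"

# Hypothetical helper: run a read-only cluster query and fail if the
# tool errors out or prints nothing, instead of treating an empty
# result as a valid answer.
cluster_query() {
    out=$("$@" 2>/dev/null) || return "$OCF_ERR_GENERIC"
    [ -n "$out" ] || return "$OCF_ERR_GENERIC"
    printf '%s\n' "$out"
    return "$OCF_SUCCESS"
}

# Usage sketch inside an agent action:
#   status=$(cluster_query crm_mon -1) || return "$OCF_ERR_GENERIC"
```

This does not remove the time-of-check/time-of-use window; it only ensures
the agent does not report success based on a query that never answered.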

I don't use ocf-tester so I can't speak to that, but I suspect it could
work if you exported a CIB_file variable with a sample cluster status
beforehand. (CIB_file makes the cluster commands act as if the
specified file is the live CIB at the moment.)
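
A sketch of how that could be wired up for offline testing. The helper name
with_sample_cib and the file paths are placeholders, and whether ocf-tester
itself is happy with this remains untested.

```shell
# Hypothetical helper: run a command with CIB_file pointing at a saved CIB,
# so cluster tools started by it read that file instead of the live cluster.
with_sample_cib() {
    sample="$1"
    shift
    CIB_file="$sample" "$@"
}

# Usage sketch (paths are placeholders):
#   cibadmin --query > /tmp/sample-cib.xml      # taken on a running cluster
#   with_sample_cib /tmp/sample-cib.xml crm_mon -1
```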

> Would it be possible to rely on the following command ?
> 
>   cibadmin --query --xpath "//status/node_state[@join='member']" | \
> grep -Po 'uname="\K[^"]+'
> 
> 
> Regards,

Only full cluster nodes will have a "join" attribute, so that query
won't catch active remote nodes or guest nodes. Whether that's good or
bad depends on what you're looking for.

The plus side is that it's a query and it returns XML.

The downsides are that node status can change quickly, so it could
theoretically be inaccurate a moment later when you use it, and the
status section is not guaranteed to stay in that format (though I
expect that particular part will).

A minor point: that query will return the entire node_state XML
subtree; you can add -n/--no-children to return just the node_state
element itself.
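
For illustration, the same query with that option added, wrapped in a helper
whose name (member_node_states) is hypothetical:

```shell
# Hypothetical helper: return only the matching node_state elements,
# without the XML subtree below each of them.
member_node_states() {
    cibadmin --query --no-children \
        --xpath "//status/node_state[@join='member']"
}
```

The smaller output is easier to parse, but the caveats above about remote
nodes and stale status still apply.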
-- 
Ken Gaillot 



Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-28 Thread Jehan-Guillaume de Rorthais
Hi all,

It seems to me the concern raised by Ulrich hasn't been discussed:

On Wed, 12 Apr 2021 Ulrich Windl wrote:

> Personally I think an RA calling crm_mon is inherently broken: Will it ever
> pass ocf-tester?

Would it be possible to rely on the following command?

  cibadmin --query --xpath "//status/node_state[@join='member']" | \
grep -Po 'uname="\K[^"]+'


Regards,


Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-23 Thread renayama19661014
Hi Ken,
Hi Klaus,

Thanks for your comment.

>We did not have time to get it into the RHEL 8.4 GA (general
>availability) release, which means for example it will not be in 8.4
>install images, but we did get a 0-day fix, which means that it will be
>available via "yum update" the same day that 8.4 is released.
>
>Thanks for testing the 8.4 build and finding the issue!


Okay!


Best Regards,
Hideo Yamauchi.




- Original Message -
>From: Ken Gaillot 
>To: renayama19661...@ybb.ne.jp 
>Cc: kwenning 
>Date: 2021/4/24, Sat 01:25
>Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
>fails.
> 
>Hi Hideo,
>
>A private reply to follow up:
>
>The fix will be in the 2.1.0 upstream release.
>
>We did not have time to get it into the RHEL 8.4 GA (general
>availability) release, which means for example it will not be in 8.4
>install images, but we did get a 0-day fix, which means that it will be
>available via "yum update" the same day that 8.4 is released.
>
>Thanks for testing the 8.4 build and finding the issue!
>
>On Thu, 2021-04-15 at 11:45 +0900, renayama19661...@ybb.ne.jp wrote:
>> Hi Klaus,
>> Hi Ken,
>> 
>> We have confirmed that the operation is improved by the test.
>> Thank you for your prompt response.
>> 
>> We look forward to including this fix in the release version of RHEL
>> 8.4.
>> 
>> Best Regards,
>> Hideo Yamauchi.
>> 
>> 
>> 
>> - Original Message -
>> > From: "renayama19661...@ybb.ne.jp" 
>> > To: "kwenn...@redhat.com" ; Cluster Labs - All
>> > topics related to open-source clustering welcomed <
>> > users@clusterlabs.org>; Cluster Labs - All topics related to open-
>> > source clustering welcomed 
>> > Cc: 
>> > Date: 2021/4/13, Tue 07:08
>> > Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource
>> > control fails.
>> > 
>> > Hi Klaus,
>> > Hi Ken,
>> > 
>> > >  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342
>> > > with
>> > >  I guess the simplest possible solution to the immediate issue so
>> > >  that we can discuss it.
>> > 
>> > 
>> > Thank you for the fix.
>> > 
>> > 
>> > I have confirmed that the fixes have been merged.
>> > 
>> > I'll test this fix today just in case.
>> > 
>> > Many thanks,
>> > Hideo Yamauchi.
>> > 
>> > 
>> > - Original Message -
>> > >  From: Klaus Wenninger 
>> > >  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics
>> > > related to 
>> > 
>> > open-source clustering welcomed 
>> > >  Cc: 
>> > >  Date: 2021/4/12, Mon 22:22
>> > >  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql
>> > > resource control 
>> > 
>> > fails.
>> > > 
>> > >  On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>> > > >   On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>> > > > >   On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>> > > > > >   On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>> > > > > > >   On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>> > > > > > > >   Hi Klaus,
>> > > > > > > > 
>> > > > > > > >   Thanks for your comment.
>> > > > > > > > 
>> > > > > > > > >   Hmm ... is that with selinux enabled?
>> > > > > > > > >   Respectively do you see any related avc messages?
>> > > > > > > > 
>> > > > > > > >   Selinux is not enabled.
>> > > > > > > >   Isn't crm_mon caused by not returning a response 
>> > 
>> > when 
>> > >  pacemakerd 
>> > > > > > > >   prepares to stop?
>> > > > > > 
>> > > > > >   yep ... that doesn't look good.
>> > > > > >   While in pcmk_shutdown_worker ipc isn't handled.
>> > > > > 
>> > > > >   Stop ... that should actually work as pcmk_shutdown_worker
>> > > > >   should exit quite quickly and proceed after mainloop
>> > > > >   dispatching when called again.
>> > > > >   Don't see anything atm that might be blocking for longer
>> > > > > ...
>> > > > >  

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-15 Thread renayama19661014
Hi All,

Sorry...
Due to a mistake on my part, the same email was sent multiple times.


Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> ; Cluster Labs - All topics related to open-source 
> clustering welcomed 
> Cc: 
> Date: 2021/4/15, Thu 11:45
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> Hi Klaus,
> Hi Ken,
> 
> We have confirmed that the operation is improved by the test.
> Thank you for your prompt response.
> 
> We look forward to including this fix in the release version of RHEL 8.4.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> - Original Message -
>>  From: "renayama19661...@ybb.ne.jp" 
> 
>>  To: "kwenn...@redhat.com" ; Cluster 
> Labs - All topics related to open-source clustering welcomed 
> ; Cluster Labs - All topics related to open-source 
> clustering welcomed 
>>  Cc: 
>>  Date: 2021/4/13, Tue 07:08
>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
>> 
>>  Hi Klaus,
>>  Hi Ken,
>> 
>>>   I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 
> with
>> 
>>>   I guess the simplest possible solution to the immediate issue so
>>>   that we can discuss it.
>> 
>> 
>>  Thank you for the fix.
>> 
>> 
>>  I have confirmed that the fixes have been merged.
>> 
>>  I'll test this fix today just in case.
>> 
>>  Many thanks,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>>   From: Klaus Wenninger 
>>>   To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
>>  open-source clustering welcomed 
>>>   Cc: 
>>>   Date: 2021/4/12, Mon 22:22
>>>   Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource 
> control 
>>  fails.
>>> 
>>>   On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>>>    On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>>>>    On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>>>>>    On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>>>>>    On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>>>>    Hi Klaus,
>>>>>>>> 
>>>>>>>>    Thanks for your comment.
>>>>>>>> 
>>>>>>>>>    Hmm ... is that with selinux enabled?
>>>>>>>>>    Respectively do you see any related avc 
> messages?
>>>>>>>> 
>>>>>>>>    Selinux is not enabled.
>>>>>>>>    Isn't crm_mon caused by not returning a 
> response 
>>  when 
>>>   pacemakerd 
>>>>>>>>    prepares to stop?
>>>>>>    yep ... that doesn't look good.
>>>>>>    While in pcmk_shutdown_worker ipc isn't handled.
>>>>>    Stop ... that should actually work as pcmk_shutdown_worker
>>>>>    should exit quite quickly and proceed after mainloop
>>>>>    dispatching when called again.
>>>>>    Don't see anything atm that might be blocking for longer 
> ...
>>>>>    but let me dig into it further ...
>>>>    What happens is clear (thanks Ken for the hint ;-) ).
>>>>    When pacemakerd is shutting down - already when it
>>>>    shuts down the resources and not just when it starts to
>>>>    reap the subdaemons - crm_mon reads that state and
>>>>    doesn't try to connect to the cib anymore.
>>>   I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 
> with
>>>   I guess the simplest possible solution to the immediate issue so
>>>   that we can discuss it.
>>>>>>    Question is why that didn't create issue earlier.
>>>>>>    Probably I didn't test with resources that had 
> crm_mon in
>>>>>>    their stop/monitor-actions but sbd should have run into
>>>>>>    issues.
>>>>>> 
>>>>>>    Klaus
>>>>>>>    But when shutting down a node the resources should be
>>>>>>>    shutdown before pacemakerd goes down.
>>>>>>>    But let me have a look if it can happen that 
> pacemakerd
>>>>>>>    doesn't react to the ipc-pings before. That btw. 
> might 
>>  be
>>>>>>>    lethal for sbd-scenarios (

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-15 Thread renayama19661014
Hi Klaus,
Hi Ken,

We have confirmed by testing that the fix improves the behavior.
Thank you for your prompt response.

We look forward to this fix being included in the release version of RHEL 8.4.

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: "kwenn...@redhat.com" ; Cluster Labs - All topics 
> related to open-source clustering welcomed ; Cluster 
> Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2021/4/13, Tue 07:08
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> Hi Klaus,
> Hi Ken,
> 
>>  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> 
>>  I guess the simplest possible solution to the immediate issue so
>>  that we can discuss it.
> 
> 
> Thank you for the fix.
> 
> 
> I have confirmed that the fixes have been merged.
> 
> I'll test this fix today just in case.
> 
> Many thanks,
> Hideo Yamauchi.
> 
> 
> - Original Message -
>>  From: Klaus Wenninger 
>>  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
>>  Cc: 
>>  Date: 2021/4/12, Mon 22:22
>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
>> 
>>  On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>>   On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>>>   On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>>>>   On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>>>>   On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>>>   Hi Klaus,
>>>>>>> 
>>>>>>>   Thanks for your comment.
>>>>>>> 
>>>>>>>>   Hmm ... is that with selinux enabled?
>>>>>>>>   Respectively do you see any related avc messages?
>>>>>>> 
>>>>>>>   Selinux is not enabled.
>>>>>>>   Isn't crm_mon caused by not returning a response 
> when 
>>  pacemakerd 
>>>>>>>   prepares to stop?
>>>>>   yep ... that doesn't look good.
>>>>>   While in pcmk_shutdown_worker ipc isn't handled.
>>>>   Stop ... that should actually work as pcmk_shutdown_worker
>>>>   should exit quite quickly and proceed after mainloop
>>>>   dispatching when called again.
>>>>   Don't see anything atm that might be blocking for longer ...
>>>>   but let me dig into it further ...
>>>   What happens is clear (thanks Ken for the hint ;-) ).
>>>   When pacemakerd is shutting down - already when it
>>>   shuts down the resources and not just when it starts to
>>>   reap the subdaemons - crm_mon reads that state and
>>>   doesn't try to connect to the cib anymore.
>>  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>>  I guess the simplest possible solution to the immediate issue so
>>  that we can discuss it.
>>>>>   Question is why that didn't create issue earlier.
>>>>>   Probably I didn't test with resources that had crm_mon in
>>>>>   their stop/monitor-actions but sbd should have run into
>>>>>   issues.
>>>>> 
>>>>>   Klaus
>>>>>>   But when shutting down a node the resources should be
>>>>>>   shutdown before pacemakerd goes down.
>>>>>>   But let me have a look if it can happen that pacemakerd
>>>>>>   doesn't react to the ipc-pings before. That btw. might 
> be
>>>>>>   lethal for sbd-scenarios (if the phase is too long and it
>>>>>>   might actually not be defined).
>>>>>> 
>>>>>>   My idea with selinux would have been that it might block
>>>>>>   the ipc if crm_mon is issued by execd. But well forget
>>>>>>   about it as it is not enabled ;-)
>>>>>> 
>>>>>> 
>>>>>>   Klaus
>>>>>>> 
>>>>>>>   pgsql needs the result of crm_mon in demote processing 
> and 
>>  stop 
>>>>>>>   processing.
>>>>>>>   crm_mon should return a response even after pacemakerd 
> goes 
>>  into a 
>>>>>>>   stop operation.
>>>>>>> 
>>>>>>>   Best Regards,
>>>>>>>   Hideo Yamauchi.
>>>>>>> 
>>>>>>> 
>>>>>>>   - Original M

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-14 Thread renayama19661014
Hi Klaus,
Hi Ken,

We have confirmed by testing that the fix improves the behavior.
Thank you for your prompt response.

We look forward to this fix being included in the release version of RHEL 8.4.

Best Regards,
Hideo Yamauchi.



- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: "kwenn...@redhat.com" ; Cluster Labs - All topics 
> related to open-source clustering welcomed ; Cluster 
> Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2021/4/13, Tue 07:08
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> Hi Klaus,
> Hi Ken,
> 
>>  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> 
>>  I guess the simplest possible solution to the immediate issue so
>>  that we can discuss it.
> 
> 
> Thank you for the fix.
> 
> 
> I have confirmed that the fixes have been merged.
> 
> I'll test this fix today just in case.
> 
> Many thanks,
> Hideo Yamauchi.
> 
> 
> - Original Message -
>>  From: Klaus Wenninger 
>>  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
>>  Cc: 
>>  Date: 2021/4/12, Mon 22:22
>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
>> 
>>  On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>>   On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>>>   On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>>>>   On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>>>>   On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>>>   Hi Klaus,
>>>>>>> 
>>>>>>>   Thanks for your comment.
>>>>>>> 
>>>>>>>>   Hmm ... is that with selinux enabled?
>>>>>>>>   Respectively do you see any related avc messages?
>>>>>>> 
>>>>>>>   Selinux is not enabled.
>>>>>>>   Isn't crm_mon caused by not returning a response 
> when 
>>  pacemakerd 
>>>>>>>   prepares to stop?
>>>>>   yep ... that doesn't look good.
>>>>>   While in pcmk_shutdown_worker ipc isn't handled.
>>>>   Stop ... that should actually work as pcmk_shutdown_worker
>>>>   should exit quite quickly and proceed after mainloop
>>>>   dispatching when called again.
>>>>   Don't see anything atm that might be blocking for longer ...
>>>>   but let me dig into it further ...
>>>   What happens is clear (thanks Ken for the hint ;-) ).
>>>   When pacemakerd is shutting down - already when it
>>>   shuts down the resources and not just when it starts to
>>>   reap the subdaemons - crm_mon reads that state and
>>>   doesn't try to connect to the cib anymore.
>>  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>>  I guess the simplest possible solution to the immediate issue so
>>  that we can discuss it.
>>>>>   Question is why that didn't create issue earlier.
>>>>>   Probably I didn't test with resources that had crm_mon in
>>>>>   their stop/monitor-actions but sbd should have run into
>>>>>   issues.
>>>>> 
>>>>>   Klaus
>>>>>>   But when shutting down a node the resources should be
>>>>>>   shutdown before pacemakerd goes down.
>>>>>>   But let me have a look if it can happen that pacemakerd
>>>>>>   doesn't react to the ipc-pings before. That btw. might 
> be
>>>>>>   lethal for sbd-scenarios (if the phase is too long and it
>>>>>>   might actually not be defined).
>>>>>> 
>>>>>>   My idea with selinux would have been that it might block
>>>>>>   the ipc if crm_mon is issued by execd. But well forget
>>>>>>   about it as it is not enabled ;-)
>>>>>> 
>>>>>> 
>>>>>>   Klaus
>>>>>>> 
>>>>>>>   pgsql needs the result of crm_mon in demote processing 
> and 
>>  stop 
>>>>>>>   processing.
>>>>>>>   crm_mon should return a response even after pacemakerd 
> goes 
>>  into a 
>>>>>>>   stop operation.
>>>>>>> 
>>>>>>>   Best Regards,
>>>>>>>   Hideo Yamauchi.
>>>>>>> 
>>>>>>> 
>>>>>>>   - Original M

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-12 Thread renayama19661014
Hi Klaus,
Hi Ken,

> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with

> I guess the simplest possible solution to the immediate issue so
> that we can discuss it.


Thank you for the fix.


I have confirmed that the fixes have been merged.

I'll test this fix today just in case.

Many thanks,
Hideo Yamauchi.


- Original Message -
> From: Klaus Wenninger 
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2021/4/12, Mon 22:22
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>  On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>>  On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>>>  On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>>>  On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>>  Hi Klaus,
>>>>>> 
>>>>>>  Thanks for your comment.
>>>>>> 
>>>>>>>  Hmm ... is that with selinux enabled?
>>>>>>>  Respectively do you see any related avc messages?
>>>>>> 
>>>>>>  Selinux is not enabled.
>>>>>>  Isn't crm_mon caused by not returning a response when 
> pacemakerd 
>>>>>>  prepares to stop?
>>>>  yep ... that doesn't look good.
>>>>  While in pcmk_shutdown_worker ipc isn't handled.
>>>  Stop ... that should actually work as pcmk_shutdown_worker
>>>  should exit quite quickly and proceed after mainloop
>>>  dispatching when called again.
>>>  Don't see anything atm that might be blocking for longer ...
>>>  but let me dig into it further ...
>>  What happens is clear (thanks Ken for the hint ;-) ).
>>  When pacemakerd is shutting down - already when it
>>  shuts down the resources and not just when it starts to
>>  reap the subdaemons - crm_mon reads that state and
>>  doesn't try to connect to the cib anymore.
> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> I guess the simplest possible solution to the immediate issue so
> that we can discuss it.
>>>>  Question is why that didn't create issue earlier.
>>>>  Probably I didn't test with resources that had crm_mon in
>>>>  their stop/monitor-actions but sbd should have run into
>>>>  issues.
>>>> 
>>>>  Klaus
>>>>>  But when shutting down a node the resources should be
>>>>>  shutdown before pacemakerd goes down.
>>>>>  But let me have a look if it can happen that pacemakerd
>>>>>  doesn't react to the ipc-pings before. That btw. might be
>>>>>  lethal for sbd-scenarios (if the phase is too long and it
>>>>>  might actually not be defined).
>>>>> 
>>>>>  My idea with selinux would have been that it might block
>>>>>  the ipc if crm_mon is issued by execd. But well forget
>>>>>  about it as it is not enabled ;-)
>>>>> 
>>>>> 
>>>>>  Klaus
>>>>>> 
>>>>>>  pgsql needs the result of crm_mon in demote processing and 
> stop 
>>>>>>  processing.
>>>>>>  crm_mon should return a response even after pacemakerd goes 
> into a 
>>>>>>  stop operation.
>>>>>> 
>>>>>>  Best Regards,
>>>>>>  Hideo Yamauchi.
>>>>>> 
>>>>>> 
>>>>>>  - Original Message -
>>>>>>>  From: Klaus Wenninger 
>>>>>>>  To: renayama19661...@ybb.ne.jp; Cluster Labs - All 
> topics related 
>>>>>>>  to open-source clustering welcomed 
> 
>>>>>>>  Cc:
>>>>>>>  Date: 2021/4/9, Fri 21:12
>>>>>>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, 
> pgsql 
>>>>>>>  resource control fails.
>>>>>>> 
>>>>>>>  On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>>>>    Hi Ken,
>>>>>>>>    Hi All,
>>>>>>>> 
>>>>>>>>    In the pgsql resource, crm_mon is executed in the 
> process of 
>>>>>>>>  demote and
>>>>>>>  stop, and the result is processed.
>>>>>>>>    However, pacemaker included in RHEL8.4beta fails 
> to execute 
>>>>>>>>  this crm_mon.
>>>>>>>>     

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-12 Thread Klaus Wenninger

On 4/9/21 5:13 PM, Klaus Wenninger wrote:

On 4/9/21 4:04 PM, Klaus Wenninger wrote:

On 4/9/21 3:45 PM, Klaus Wenninger wrote:

On 4/9/21 3:36 PM, Klaus Wenninger wrote:

On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:

Hi Klaus,

Thanks for your comment.


Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?


Selinux is not enabled.
Isn't this caused by crm_mon not returning a response when pacemakerd 
prepares to stop?

yep ... that doesn't look good.
While in pcmk_shutdown_worker ipc isn't handled.

Stop ... that should actually work as pcmk_shutdown_worker
should exit quite quickly and proceed after mainloop
dispatching when called again.
Don't see anything atm that might be blocking for longer ...
but let me dig into it further ...

What happens is clear (thanks Ken for the hint ;-) ).
When pacemakerd is shutting down - already when it
shuts down the resources and not just when it starts to
reap the subdaemons - crm_mon reads that state and
doesn't try to connect to the cib anymore.

I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
what I guess is the simplest possible solution to the immediate issue,
so that we can discuss it.

The question is why that didn't cause issues earlier.
Probably I didn't test with resources that had crm_mon in
their stop/monitor-actions, but sbd should have run into
issues.

Klaus

But when shutting down a node the resources should be
shut down before pacemakerd goes down.
But let me have a look if it can happen that pacemakerd
doesn't react to the ipc-pings before. That btw. might be
lethal for sbd-scenarios (if the phase is too long and it
might actually not be defined).

My idea with selinux would have been that it might block
the ipc if crm_mon is issued by execd. But well forget
about it as it is not enabled ;-)


Klaus


pgsql needs the result of crm_mon in demote processing and stop 
processing.
crm_mon should return a response even after pacemakerd goes into a 
stop operation.


Best Regards,
Hideo Yamauchi.



Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread Klaus Wenninger

On 4/9/21 4:04 PM, Klaus Wenninger wrote:

On 4/9/21 3:45 PM, Klaus Wenninger wrote:

On 4/9/21 3:36 PM, Klaus Wenninger wrote:

On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:

Hi Klaus,

Thanks for your comment.


Hmm ... is that with SELinux enabled?
Or rather, do you see any related AVC messages?


SELinux is not enabled.
Isn't the crm_mon failure caused by pacemakerd no longer returning a 
response once it prepares to stop?

yep ... that doesn't look good.
While in pcmk_shutdown_worker ipc isn't handled.

Stop ... that should actually work as pcmk_shutdown_worker
should exit quite quickly and proceed after mainloop
dispatching when called again.
Don't see anything atm that might be blocking for longer ...
but let me dig into it further ...

What happens is clear (thanks Ken for the hint ;-) ).
As soon as pacemakerd starts shutting down (already while it
stops the resources, not only once it starts to reap the
subdaemons), crm_mon reads that state and no longer tries
to connect to the CIB.

The question is why that didn't cause issues earlier.
I probably didn't test with resources that call crm_mon in
their stop/monitor actions, but sbd should have run into
issues.

Klaus

But when shutting down a node, the resources should be
shut down before pacemakerd goes down.
But let me have a look at whether it can happen that pacemakerd
doesn't react to the IPC pings before that. That, by the way, might
be lethal for sbd scenarios (if the phase is too long, and it
might actually not be defined).

My idea with selinux would have been that it might block
the ipc if crm_mon is issued by execd. But well forget
about it as it is not enabled ;-)


Klaus


pgsql needs the result of crm_mon in demote processing and stop 
processing.
crm_mon should return a response even after pacemakerd goes into a 
stop operation.


Best Regards,
Hideo Yamauchi.



Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread Klaus Wenninger

On 4/9/21 3:45 PM, Klaus Wenninger wrote:

On 4/9/21 3:36 PM, Klaus Wenninger wrote:

On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:

Hi Klaus,

Thanks for your comment.


Hmm ... is that with SELinux enabled?
Or rather, do you see any related AVC messages?


SELinux is not enabled.
Isn't the crm_mon failure caused by pacemakerd no longer returning a 
response once it prepares to stop?

yep ... that doesn't look good.
While in pcmk_shutdown_worker ipc isn't handled.

Stop ... that should actually work as pcmk_shutdown_worker
should exit quite quickly and proceed after mainloop
dispatching when called again.
Don't see anything atm that might be blocking for longer ...
but let me dig into it further ...

The question is why that didn't cause issues earlier.
I probably didn't test with resources that call crm_mon in
their stop/monitor actions, but sbd should have run into
issues.

Klaus

But when shutting down a node, the resources should be
shut down before pacemakerd goes down.
But let me have a look at whether it can happen that pacemakerd
doesn't react to the IPC pings before that. That, by the way, might
be lethal for sbd scenarios (if the phase is too long, and it
might actually not be defined).

My idea with selinux would have been that it might block
the ipc if crm_mon is issued by execd. But well forget
about it as it is not enabled ;-)


Klaus


pgsql needs the result of crm_mon in demote processing and stop 
processing.
crm_mon should return a response even after pacemakerd goes into a 
stop operation.


Best Regards,
Hideo Yamauchi.



Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread Klaus Wenninger

On 4/9/21 3:36 PM, Klaus Wenninger wrote:

On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:

Hi Klaus,

Thanks for your comment.


Hmm ... is that with SELinux enabled?
Or rather, do you see any related AVC messages?


SELinux is not enabled.
Isn't the crm_mon failure caused by pacemakerd no longer returning a 
response once it prepares to stop?

yep ... that doesn't look good.
While in pcmk_shutdown_worker ipc isn't handled.
The question is why that didn't cause issues earlier.
I probably didn't test with resources that call crm_mon in
their stop/monitor actions, but sbd should have run into
issues.

Klaus

But when shutting down a node, the resources should be
shut down before pacemakerd goes down.
But let me have a look at whether it can happen that pacemakerd
doesn't react to the IPC pings before that. That, by the way, might
be lethal for sbd scenarios (if the phase is too long, and it
might actually not be defined).

My idea with selinux would have been that it might block
the ipc if crm_mon is issued by execd. But well forget
about it as it is not enabled ;-)


Klaus


pgsql needs the result of crm_mon in demote processing and stop 
processing.
crm_mon should return a response even after pacemakerd goes into a 
stop operation.


Best Regards,
Hideo Yamauchi.



Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread Klaus Wenninger

On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:

Hi Klaus,

Thanks for your comment.


Hmm ... is that with SELinux enabled?
Or rather, do you see any related AVC messages?


SELinux is not enabled.
Isn't the crm_mon failure caused by pacemakerd no longer returning a 
response once it prepares to stop?

But when shutting down a node, the resources should be
shut down before pacemakerd goes down.
But let me have a look at whether it can happen that pacemakerd
doesn't react to the IPC pings before that. That, by the way, might
be lethal for sbd scenarios (if the phase is too long, and it
might actually not be defined).

My idea with selinux would have been that it might block
the ipc if crm_mon is issued by execd. But well forget
about it as it is not enabled ;-)


Klaus


pgsql needs the result of crm_mon in demote processing and stop processing.
crm_mon should return a response even after pacemakerd goes into a stop 
operation.

Best Regards,
Hideo Yamauchi.



--
Klaus Wenninger

Senior Software Engineer, EMEA ENG Base Operating Systems

Red Hat

kwenn...@redhat.com

Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Mue

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread renayama19661014
Hi Klaus,

Thanks for your comment.

> Hmm ... is that with selinux enabled?

> Respectively do you see any related avc messages?


SELinux is not enabled.
Isn't the crm_mon failure caused by pacemakerd no longer returning a 
response once it prepares to stop?

pgsql needs the result of crm_mon in demote processing and stop processing.
crm_mon should return a response even after pacemakerd goes into a stop 
operation.

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Klaus Wenninger 
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2021/4/9, Fri 21:12
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
>>  Hi Ken,
>>  Hi All,
>> 
>>  In the pgsql resource, crm_mon is executed in the process of demote and 
> stop, and the result is processed.
>> 
>>  However, pacemaker included in RHEL8.4beta fails to execute this crm_mon.
>>    - The problem also occurs on github 
> master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>> 
>>  The problem can be easily reproduced in the following ways.
>> 
>>  Step1. Modify to execute crm_mon in the stop process of the Dummy resource.
>>  
>> 
>>  dummy_stop() {
>>       mon=$(crm_mon -1)
>>       ret=$?
>>       ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
>>       dummy_monitor
>>       if [ $? =  $OCF_SUCCESS ]; then
>>           rm ${OCF_RESKEY_state}
>>       fi
>>       return $OCF_SUCCESS
>>  }
>>  
>> 
>>  Step2. Configure a cluster with two nodes.
>>  
>> 
>>  [root@rh84-beta01 ~]# crm_mon -rfA1
>>  Cluster Summary:
>>     * Stack: corosync
>>     * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition 
> with quorum
>>     * Last updated: Thu Apr  8 18:00:52 2021
>>     * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on 
> rh84-beta01
>>     * 2 nodes configured
>>     * 1 resource instance configured
>> 
>>  Node List:
>>     * Online: [ rh84-beta01 rh84-beta02 ]
>> 
>>  Full List of Resources:
>>     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01
>> 
>>  Migration Summary:
>>  
>> 
>>  Step3. Stop the node where the Dummy resource is running. The resource will 
> fail over.
>>  
>>  [root@rh84-beta02 ~]# crm_mon -rfA1
>>  Cluster Summary:
>>     * Stack: corosync
>>     * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition 
> with quorum
>>     * Last updated: Thu Apr  8 18:08:56 2021
>>     * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on 
> rh84-beta01
>>     * 2 nodes configured
>>     * 1 resource instance configured
>> 
>>  Node List:
>>     * Online: [ rh84-beta02 ]
>>     * OFFLINE: [ rh84-beta01 ]
>> 
>>  Full List of Resources:
>>     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
>>  
>> 
>>  However, if you look at the log, you can see that the execution of crm_mon 
> in the stop processing of the Dummy resource has failed.
>> 
>>  
>>  Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI  
> crm_mon[102] : Pacemaker daemons shutting down ...
>>  Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output)  
> notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not 
> available on this node ]
> Hmm ... is that with selinux enabled?
> Respectively do you see any related avc messages?
> 
> Klaus
>>  
>> 
>>  Similarly, pgsql also executes crm_mon with demote or stop, so control 
> fails.
>> 
>>  The problem seems to be related to the next fix.
>>    * Report pacemakerd in state waiting for sbd
>>     - https://github.com/ClusterLabs/pacemaker/pull/2278 
>> 
>>  The problem does not occur with the release version of Pacemaker 2.0.5 or 
> the Pacemaker included with RHEL8.3.
>> 
>>  This issue has a huge impact on the user.
>> 
>>  Perhaps it also affects the control of other resources that utilize 
> crm_mon.
>> 
>>  Please improve the release version of RHEL8.4 so that it includes Pacemaker 
> which does not cause this problem.
>>    * Distributions other than RHEL may also be affected in future releases.
>> 
>>  
>>  This content is the same as the following Bugzilla.
>>    - https://bugs.clusterlabs.org/show_bug.cgi?id=5471 
>>  
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>>  ___
>>  Manage your subscription:
>>  https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>>  ClusterLabs home: https://www.clusterlabs.org/ 
> 



Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread Klaus Wenninger

On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:

Hi Ken,
Hi All,

The pgsql resource agent runs crm_mon during its demote and stop actions and 
processes the result.

However, with the Pacemaker included in RHEL 8.4 beta, this crm_mon call fails.
  - The problem also occurs on GitHub master 
(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).

The problem can be easily reproduced as follows.

Step 1. Modify the Dummy resource agent so that its stop action runs crm_mon.


dummy_stop() {
     mon=$(crm_mon -1)
     ret=$?
     ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
     dummy_monitor
     if [ $? =  $OCF_SUCCESS ]; then
         rm ${OCF_RESKEY_state}
     fi
     return $OCF_SUCCESS
}


Step 2. Configure a cluster with two nodes.


[root@rh84-beta01 ~]# crm_mon -rfA1
Cluster Summary:
   * Stack: corosync
   * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
   * Last updated: Thu Apr  8 18:00:52 2021
   * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on rh84-beta01
   * 2 nodes configured
   * 1 resource instance configured

Node List:
   * Online: [ rh84-beta01 rh84-beta02 ]

Full List of Resources:
   * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01

Migration Summary:


Step 3. Stop the node where the Dummy resource is running. The resource will 
fail over.

[root@rh84-beta02 ~]# crm_mon -rfA1
Cluster Summary:
   * Stack: corosync
   * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
   * Last updated: Thu Apr  8 18:08:56 2021
   * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on rh84-beta01
   * 2 nodes configured
   * 1 resource instance configured

Node List:
   * Online: [ rh84-beta02 ]
   * OFFLINE: [ rh84-beta01 ]

Full List of Resources:
   * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02


However, the log shows that crm_mon failed during the stop action of the 
Dummy resource.


Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI  crm_mon[102] : Pacemaker daemons shutting down ...
Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output)  notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]

Hmm ... is that with SELinux enabled?
Or rather, do you see any related AVC messages?

Klaus



Similarly, pgsql also runs crm_mon during demote and stop, so controlling the 
resource fails.

The problem seems to be related to the following change.
  * Report pacemakerd in state waiting for sbd
   - https://github.com/ClusterLabs/pacemaker/pull/2278

The problem does not occur with the release version of Pacemaker 2.0.5 or the 
Pacemaker included with RHEL 8.3.

This issue has a huge impact on users.

It probably also affects the control of other resources that use crm_mon.

Please make sure that the release version of RHEL 8.4 includes a Pacemaker 
that does not have this problem.
  * Distributions other than RHEL may also be affected in future releases.


This is the same content as the following Bugzilla entry.
  - https://bugs.clusterlabs.org/show_bug.cgi?id=5471


Best Regards,
Hideo Yamauchi.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/




[ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-08 Thread renayama19661014
Hi Ken,
Hi All,

The pgsql resource agent runs crm_mon during its demote and stop actions and 
processes the result.

However, with the Pacemaker included in RHEL 8.4 beta, this crm_mon call fails.
 - The problem also occurs on GitHub master 
(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).

The problem can be easily reproduced as follows.

Step 1. Modify the Dummy resource agent so that its stop action runs crm_mon.


dummy_stop() {
    # Reproducer: capture the output and exit status of crm_mon during stop.
    mon=$(crm_mon -1)
    ret=$?
    ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
    dummy_monitor
    if [ $? -eq "$OCF_SUCCESS" ]; then
        rm "${OCF_RESKEY_state}"
    fi
    return $OCF_SUCCESS
}
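
Because a status query like the one above can fail while the cluster is shutting down, an agent may want to keep the exit status of the query separate from its output, so that an error message is never parsed as cluster status. The following is only a sketch of that idea: query_cluster_status and the CRM_MON_CMD override are hypothetical names, not part of the Dummy or pgsql agents.

```shell
# Hypothetical helper; not part of the Dummy or pgsql agents.
# CRM_MON_CMD is an assumed override so the helper can be tried without a cluster.
CRM_MON_CMD="${CRM_MON_CMD:-crm_mon}"

# Run a one-shot status query and print its output only on success.
# On failure, report the exit status on stderr and return 1, so the
# caller can pick a fallback instead of processing unusable output.
query_cluster_status() {
    qcs_out=$($CRM_MON_CMD -1)
    qcs_rc=$?
    if [ "$qcs_rc" -ne 0 ]; then
        echo "cluster status query failed (exit status ${qcs_rc})" >&2
        return 1
    fi
    printf '%s\n' "$qcs_out"
}
```

A stop or demote action could then call the helper inside an if statement and fall back to a conservative default when it returns non-zero. This does not fix the underlying problem; it only makes the failure explicit to the agent.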


Step2. Configure a cluster with two nodes.


[root@rh84-beta01 ~]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync
  * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
  * Last updated: Thu Apr  8 18:00:52 2021
  * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on rh84-beta01
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ rh84-beta01 rh84-beta02 ]

Full List of Resources:
  * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01

Migration Summary:


Step3. Stop the node where the Dummy resource is running. The resource will 
fail over.

[root@rh84-beta02 ~]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync
  * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
  * Last updated: Thu Apr  8 18:08:56 2021
  * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on rh84-beta01
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ rh84-beta02 ]
  * OFFLINE: [ rh84-beta01 ]

Full List of Resources:
  * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02


However, the log shows that crm_mon failed when it was executed during the
stop action of the Dummy resource.


Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI  crm_mon[102] : Pacemaker daemons shutting down ...
Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output)  notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]


Similarly, pgsql executes crm_mon during demote and stop, so control of the
resource fails.

The problem seems to be related to the following change.
 * Report pacemakerd in state waiting for sbd
  - https://github.com/ClusterLabs/pacemaker/pull/2278

The problem does not occur with the release version of Pacemaker 2.0.5 or with
the Pacemaker included in RHEL8.3.

This issue has a large impact on users.

It may also affect other resource agents that use crm_mon.

Please make sure that the release version of RHEL8.4 includes a Pacemaker
that does not have this problem.
 * Distributions other than RHEL may also be affected in future releases.
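
To get a rough idea of which other installed resource agents might be affected, one could list the agent scripts that mention crm_mon. This is only a sketch: agents_using_crm_mon is a made-up name, the default directory assumes the usual OCF_ROOT of /usr/lib/ocf, and the recursive grep options assume GNU or BSD grep.

```shell
# Hypothetical helper: list resource agent scripts that mention crm_mon.
# The default directory assumes the usual OCF_ROOT of /usr/lib/ocf.
agents_using_crm_mon() {
    agent_dir="${1:-${OCF_ROOT:-/usr/lib/ocf}/resource.d}"
    # -r: recurse into the provider directories
    # -l: print only the names of matching files
    # -w: match crm_mon as a whole word
    grep -rlw -- 'crm_mon' "$agent_dir" 2>/dev/null | sort
}
```

A match only means the script contains the word crm_mon; whether the call happens during stop or demote still has to be checked by reading the agent.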


The same report has been filed in the following Bugzilla entry.
 - https://bugs.clusterlabs.org/show_bug.cgi?id=5471


Best Regards,
Hideo Yamauchi.
