Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
On Wed, 2021-04-28 at 19:19 +0200, Jehan-Guillaume de Rorthais wrote:
> On Wed, 28 Apr 2021 12:00:40 -0500
> Ken Gaillot wrote:
>
> > On Wed, 2021-04-28 at 18:14 +0200, Jehan-Guillaume de Rorthais wrote:
> > > Hi all,
> > >
> > > It seems to me the concern raised by Ulrich hasn't been discussed:
> > >
> > > On Wed, 12 Apr 2021 Ulrich Windl wrote:
> > >
> > > > Personally I think an RA calling crm_mon is inherently broken:
> > > > Will it ever pass ocf-tester?
> >
> > Calling the command-line tools in an agent can be OK in some cases.
> > The main concerns are:
> >
> > * Time-of-check/time-of-use: cluster status can change immediately,
> >   so the agent should behave reasonably if a query result is
> >   incorrect at the moment it's used. Ideally there would be no case
> >   where the agent could incorrectly report success for an action.
> >
> > * No commands that *change* the configuration (other than setting
> >   node attributes) should ever be used. Otherwise there's a potential
> >   for an infinite loop between the agent and scheduler.
> >
> > * It's best to use tools' XML output when available, because that
> >   should be stable across Pacemaker releases, while the text output
> >   may not be. Aside from crm_mon, XML output is a recent addition, so
> >   some consideration must be given to backward compatibility and/or
> >   requiring a minimum Pacemaker version.
> >
> > * Only the configuration section of the CIB has a guaranteed schema.
> >   The status section can theoretically change from release to
> >   release, although in practice it has changed very little over the
> >   years.
> >
> > I don't use ocf-tester so I can't speak to that, but I suspect it
> > could work if you exported a CIB_file variable with a sample cluster
> > status beforehand. (CIB_file makes the cluster commands act as if the
> > specified file is the live CIB at the moment.)
> >
> > > Would it be possible to rely on the following command ?
> > >
> > > cibadmin --query --xpath "//status/node_state[@join='member']" | \
> > >   grep -Po 'uname="\K[^"]+'
> > >
> > >
> > > Regards,
> >
> > Only full cluster nodes will have a "join" attribute, so that query
> > won't catch active remote nodes or guest nodes. Whether that's good
> > or bad depends on what you're looking for.
>
> That was an example to remove the crm_mon dependency with the
> cibadmin one.
> AFAIU this agent, it uses crm_mon to:
>
> * look for the node hosting the promoted clone
> * look for a node existence
> * look for a node fully joined
>
> all of these use seems accessible by parsing the cibadmin status
> section output (or --xpath).

I would think remote nodes and guest nodes should be considered, too,
unless the agent specifically doesn't support that. Remote nodes and
guest nodes don't join the controller layer, so they won't have a join
entry, but they can run resources.

> > The plus side is that it's a query and it returns XML.
>
> indeed.
>
> > The downsides are that node status can change quickly, so it could
> > theoretically be inaccurate a moment later when you use it, and the
> > status section is not guaranteed to stay in that format (though I
> > expect that particular part will).
>
> There's already version checks in pgsql RA code for crm_mon anyway,
> relying on OCF_RESKEY_crm_feature_set.
>
> > A minor point: that query will return the entire node_state XML
> > subtree; you can add -n/--no-children to return just the node_state
> > element itself.
>
> Nice!
>
> I was playing with xmllint as well, for an expanded support of
> xmllint, but it would add a strong dependency.
>
> Regards,
>
--
Ken Gaillot

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
On Wed, 28 Apr 2021 12:00:40 -0500 Ken Gaillot wrote: > On Wed, 2021-04-28 at 18:14 +0200, Jehan-Guillaume de Rorthais wrote: > > Hi all, > > > > It seems to me the concern raised by Ulrich hasn't been discussed: > > > > On Wed, 12 Apr 2021 Ulrich Windl wrote: > > > > > Personally I think an RA calling crm_mon is inherently broken: Will > > > it ever > > > pass ocf-tester? > > Calling the command-line tools in an agent can be OK in some cases. The > main concerns are: > > * Time-of-check/time-of-use: cluster status can change immediately, so > the agent should behave reasonably if a query result is incorrect at > the moment it's used. Ideally there would be no case where the agent > could incorrectly report success for an action. > > * No commands that *change* the configuration (other than setting node > attributes) should ever be used. Otherwise there's a potential for an > infinite loop between the agent and scheduler. > > * It's best to use tools' XML output when available, because that > should be stable across Pacemaker releases, while the text output may > not be. Aside from crm_mon, XML output is a recent addition, so some > consideration must be given to backward compatibility and/or requiring > a minimum Pacemaker version. > > * Only the configuration section of the CIB has a guaranteed schema. > The status section can theoretically change from release to release, > although in practice it has changed very little over the years. > > I don't use ocf-tester so I can't speak to that, but I suspect it could > work if you exported a CIB_file variable with a sample cluster status > beforehand. (CIB_file makes the cluster commands act as if the > specified file is the live CIB at the moment.) > > > Would it be possible to rely on the following command ? 
> > > > cibadmin --query --xpath "//status/node_state[@join='member']" | \ > > grep -Po 'uname="\K[^"]+' > > > > > > Regards, > > Only full cluster nodes will have a "join" attribute, so that query > won't catch active remote nodes or guest nodes. Whether that's good or > bad depends on what you're looking for. That was an example to remove the crm_mon dependency with the cibadmin one. AFAIU this agent, it uses crm_mon to: * look for the node hosting the promoted clone * look for a node existence * look for a node fully joined all of these use seems accessible by parsing the cibadmin status section output (or --xpath). > The plus side is that it's a query and it returns XML. indeed. > The downsides are that node status can change quickly, so it could > theoretically be inaccurate a moment later when you use it, and the > status section is not guaranteed to stay in that format (though I > expect that particular part will). There's already version checks in pgsql RA code for crm_mon anyway, relying on OCF_RESKEY_crm_feature_set. > A minor point: that query will return the entire node_state XML > subtree; you can add -n/--no-children to return just the node_state > element itself. Nice! I was playing with xmllint as well, for an expanded support of xmllint, but it would add a strong dependency. Regards, ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
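A minimal sketch of a feature-set gate of the kind mentioned in this message, assuming a POSIX shell. version_ge is a hypothetical helper written for this example and is not taken from the pgsql agent; OCF_RESKEY_crm_feature_set is the variable named above. The comparison only handles purely numeric dotted versions.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# version_ge A B: succeed if dotted version A is greater than or equal
# to B. Missing fields count as 0, so "3.7" equals "3.7.0".
# Fields must be plain integers.
version_ge() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Take the leading field of each version string.
        af=${a%%.*}
        bf=${b%%.*}
        if [ "${af:-0}" -gt "${bf:-0}" ]; then
            return 0
        fi
        if [ "${af:-0}" -lt "${bf:-0}" ]; then
            return 1
        fi
        # Drop the field just compared; empty the string when none remain.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}
```

An agent could then guard a newer query form with something like `version_ge "${OCF_RESKEY_crm_feature_set:-0}" "3.1.0"`, where 3.1.0 is only a placeholder threshold and not a value from the real agent.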
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
On Wed, 2021-04-28 at 18:14 +0200, Jehan-Guillaume de Rorthais wrote:
> Hi all,
>
> It seems to me the concern raised by Ulrich hasn't been discussed:
>
> On Wed, 12 Apr 2021 Ulrich Windl wrote:
>
> > Personally I think an RA calling crm_mon is inherently broken: Will
> > it ever pass ocf-tester?

Calling the command-line tools in an agent can be OK in some cases. The
main concerns are:

* Time-of-check/time-of-use: cluster status can change immediately, so
  the agent should behave reasonably if a query result is incorrect at
  the moment it's used. Ideally there would be no case where the agent
  could incorrectly report success for an action.

* No commands that *change* the configuration (other than setting node
  attributes) should ever be used. Otherwise there's a potential for an
  infinite loop between the agent and scheduler.

* It's best to use tools' XML output when available, because that
  should be stable across Pacemaker releases, while the text output may
  not be. Aside from crm_mon, XML output is a recent addition, so some
  consideration must be given to backward compatibility and/or requiring
  a minimum Pacemaker version.

* Only the configuration section of the CIB has a guaranteed schema.
  The status section can theoretically change from release to release,
  although in practice it has changed very little over the years.

I don't use ocf-tester so I can't speak to that, but I suspect it could
work if you exported a CIB_file variable with a sample cluster status
beforehand. (CIB_file makes the cluster commands act as if the
specified file is the live CIB at the moment.)

> Would it be possible to rely on the following command ?
>
> cibadmin --query --xpath "//status/node_state[@join='member']" | \
>   grep -Po 'uname="\K[^"]+'
>
>
> Regards,

Only full cluster nodes will have a "join" attribute, so that query
won't catch active remote nodes or guest nodes. Whether that's good or
bad depends on what you're looking for.

The plus side is that it's a query and it returns XML.

The downsides are that node status can change quickly, so it could
theoretically be inaccurate a moment later when you use it, and the
status section is not guaranteed to stay in that format (though I
expect that particular part will).

A minor point: that query will return the entire node_state XML
subtree; you can add -n/--no-children to return just the node_state
element itself.

--
Ken Gaillot
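A minimal sketch of the query discussed in this message, assuming a POSIX shell and that cibadmin accepts --no-children together with --xpath as described above. The names member_unames and query_member_unames are hypothetical and do not come from the pgsql agent. The grep/cut pair stands in for the grep -Po expression from the original proposal so that PCRE support is not required.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Read node_state XML on stdin and print one uname value per line.
member_unames() {
    grep -o 'uname="[^"]*"' | cut -d'"' -f2
}

# Ask the CIB for node_state elements whose join attribute is "member".
# Returns non-zero when cibadmin itself fails, so a caller can tell
# "query failed" apart from "no member found".
query_member_unames() {
    xml=$(cibadmin --query --no-children \
        --xpath "//status/node_state[@join='member']") || return 1
    printf '%s\n' "$xml" | member_unames
}
```

For offline checks, exporting CIB_file with a saved CIB before calling query_member_unames should make cibadmin read that file instead of the live cluster, as described above. Because only full cluster nodes carry a join attribute, this query would not list remote or guest nodes.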
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
Hi all,

It seems to me the concern raised by Ulrich hasn't been discussed:

On Wed, 12 Apr 2021 Ulrich Windl wrote:

> Personally I think an RA calling crm_mon is inherently broken: Will it ever
> pass ocf-tester?

Would it be possible to rely on the following command ?

  cibadmin --query --xpath "//status/node_state[@join='member']" | \
    grep -Po 'uname="\K[^"]+'

Regards,
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
Hi Ken, Hi Klaus, Thanks for your comment. >We did not have time to get it into the RHEL 8.4 GA (general >availability) release, which means for example it will not be in 8.4 >install images, but we did get a 0-day fix, which means that it will be >available via "yum update" the same day that 8.4 is released. > >Thanks for testing the 8.4 build and finding the issue! Okay! Best Regards, Hideo Yamauchi. - Original Message - >From: Ken Gaillot >To: renayama19661...@ybb.ne.jp >Cc: kwenning >Date: 2021/4/24, Sat 01:25 >Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control >fails. > >Hi Hideo, > >A private reply to follow up: > >The fix will be in the 2.1.0 upstream release. > >We did not have time to get it into the RHEL 8.4 GA (general >availability) release, which means for example it will not be in 8.4 >install images, but we did get a 0-day fix, which means that it will be >available via "yum update" the same day that 8.4 is released. > >Thanks for testing the 8.4 build and finding the issue! > >On Thu, 2021-04-15 at 11:45 +0900, renayama19661...@ybb.ne.jp wrote: >> Hi Klaus, >> Hi Ken, >> >> We have confirmed that the operation is improved by the test. >> Thank you for your prompt response. >> >> We look forward to including this fix in the release version of RHEL >> 8.4. >> >> Best Regards, >> Hideo Yamauchi. >> >> >> >> - Original Message - >> > From: "renayama19661...@ybb.ne.jp" >> > To: "kwenn...@redhat.com" ; Cluster Labs - All >> > topics related to open-source clustering welcomed < >> > users@clusterlabs.org>; Cluster Labs - All topics related to open- >> > source clustering welcomed >> > Cc: >> > Date: 2021/4/13, Tue 07:08 >> > Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource >> > control fails. >> > >> > Hi Klaus, >> > Hi Ken, >> > >> > > I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 >> > > with >> > > I guess the simplest possible solution to the immediate issue so >> > > that we can discuss it. 
>> > >> > >> > Thank you for the fix. >> > >> > >> > I have confirmed that the fixes have been merged. >> > >> > I'll test this fix today just in case. >> > >> > Many thanks, >> > Hideo Yamauchi. >> > >> > >> > - Original Message - >> > > From: Klaus Wenninger >> > > To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics >> > > related to >> > >> > open-source clustering welcomed >> > > Cc: >> > > Date: 2021/4/12, Mon 22:22 >> > > Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql >> > > resource control >> > >> > fails. >> > > >> > > On 4/9/21 5:13 PM, Klaus Wenninger wrote: >> > > > On 4/9/21 4:04 PM, Klaus Wenninger wrote: >> > > > > On 4/9/21 3:45 PM, Klaus Wenninger wrote: >> > > > > > On 4/9/21 3:36 PM, Klaus Wenninger wrote: >> > > > > > > On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote: >> > > > > > > > Hi Klaus, >> > > > > > > > >> > > > > > > > Thanks for your comment. >> > > > > > > > >> > > > > > > > > Hmm ... is that with selinux enabled? >> > > > > > > > > Respectively do you see any related avc messages? >> > > > > > > > >> > > > > > > > Selinux is not enabled. >> > > > > > > > Isn't crm_mon caused by not returning a response >> > >> > when >> > > pacemakerd >> > > > > > > > prepares to stop? >> > > > > > >> > > > > > yep ... that doesn't look good. >> > > > > > While in pcmk_shutdown_worker ipc isn't handled. >> > > > > >> > > > > Stop ... that should actually work as pcmk_shutdown_worker >> > > > > should exit quite quickly and proceed after mainloop >> > > > > dispatching when called again. >> > > > > Don't see anything atm that might be blocking for longer >> > > > > ... >> > > > >
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
Hi All, Sorry... Due to my operation mistake, the same email was sent multiple times. Best Regards, Hideo Yamauchi. - Original Message - > From: "renayama19661...@ybb.ne.jp" > To: Cluster Labs - All topics related to open-source clustering welcomed > ; Cluster Labs - All topics related to open-source > clustering welcomed > Cc: > Date: 2021/4/15, Thu 11:45 > Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control > fails. > > Hi Klaus, > Hi Ken, > > We have confirmed that the operation is improved by the test. > Thank you for your prompt response. > > We look forward to including this fix in the release version of RHEL 8.4. > > Best Regards, > Hideo Yamauchi. > > > > - Original Message - >> From: "renayama19661...@ybb.ne.jp" > >> To: "kwenn...@redhat.com" ; Cluster > Labs - All topics related to open-source clustering welcomed > ; Cluster Labs - All topics related to open-source > clustering welcomed >> Cc: >> Date: 2021/4/13, Tue 07:08 >> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control > fails. >> >> Hi Klaus, >> Hi Ken, >> >>> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 > with >> >>> I guess the simplest possible solution to the immediate issue so >>> that we can discuss it. >> >> >> Thank you for the fix. >> >> >> I have confirmed that the fixes have been merged. >> >> I'll test this fix today just in case. >> >> Many thanks, >> Hideo Yamauchi. >> >> >> - Original Message - >>> From: Klaus Wenninger >>> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to >> open-source clustering welcomed >>> Cc: >>> Date: 2021/4/12, Mon 22:22 >>> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource > control >> fails. 
>>> >>> On 4/9/21 5:13 PM, Klaus Wenninger wrote: >>>> On 4/9/21 4:04 PM, Klaus Wenninger wrote: >>>>> On 4/9/21 3:45 PM, Klaus Wenninger wrote: >>>>>> On 4/9/21 3:36 PM, Klaus Wenninger wrote: >>>>>>> On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote: >>>>>>>> Hi Klaus, >>>>>>>> >>>>>>>> Thanks for your comment. >>>>>>>> >>>>>>>>> Hmm ... is that with selinux enabled? >>>>>>>>> Respectively do you see any related avc > messages? >>>>>>>> >>>>>>>> Selinux is not enabled. >>>>>>>> Isn't crm_mon caused by not returning a > response >> when >>> pacemakerd >>>>>>>> prepares to stop? >>>>>> yep ... that doesn't look good. >>>>>> While in pcmk_shutdown_worker ipc isn't handled. >>>>> Stop ... that should actually work as pcmk_shutdown_worker >>>>> should exit quite quickly and proceed after mainloop >>>>> dispatching when called again. >>>>> Don't see anything atm that might be blocking for longer > ... >>>>> but let me dig into it further ... >>>> What happens is clear (thanks Ken for the hint ;-) ). >>>> When pacemakerd is shutting down - already when it >>>> shuts down the resources and not just when it starts to >>>> reap the subdaemons - crm_mon reads that state and >>>> doesn't try to connect to the cib anymore. >>> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 > with >>> I guess the simplest possible solution to the immediate issue so >>> that we can discuss it. >>>>>> Question is why that didn't create issue earlier. >>>>>> Probably I didn't test with resources that had > crm_mon in >>>>>> their stop/monitor-actions but sbd should have run into >>>>>> issues. >>>>>> >>>>>> Klaus >>>>>>> But when shutting down a node the resources should be >>>>>>> shutdown before pacemakerd goes down. >>>>>>> But let me have a look if it can happen that > pacemakerd >>>>>>> doesn't react to the ipc-pings before. That btw. > might >> be >>>>>>> lethal for sbd-scenarios (
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
Hi Klaus, Hi Ken, We have confirmed that the operation is improved by the test. Thank you for your prompt response. We look forward to including this fix in the release version of RHEL 8.4. Best Regards, Hideo Yamauchi. - Original Message - > From: "renayama19661...@ybb.ne.jp" > To: "kwenn...@redhat.com" ; Cluster Labs - All topics > related to open-source clustering welcomed ; Cluster > Labs - All topics related to open-source clustering welcomed > > Cc: > Date: 2021/4/13, Tue 07:08 > Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control > fails. > > Hi Klaus, > Hi Ken, > >> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with > >> I guess the simplest possible solution to the immediate issue so >> that we can discuss it. > > > Thank you for the fix. > > > I have confirmed that the fixes have been merged. > > I'll test this fix today just in case. > > Many thanks, > Hideo Yamauchi. > > > - Original Message - >> From: Klaus Wenninger >> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to > open-source clustering welcomed >> Cc: >> Date: 2021/4/12, Mon 22:22 >> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control > fails. >> >> On 4/9/21 5:13 PM, Klaus Wenninger wrote: >>> On 4/9/21 4:04 PM, Klaus Wenninger wrote: >>>> On 4/9/21 3:45 PM, Klaus Wenninger wrote: >>>>> On 4/9/21 3:36 PM, Klaus Wenninger wrote: >>>>>> On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote: >>>>>>> Hi Klaus, >>>>>>> >>>>>>> Thanks for your comment. >>>>>>> >>>>>>>> Hmm ... is that with selinux enabled? >>>>>>>> Respectively do you see any related avc messages? >>>>>>> >>>>>>> Selinux is not enabled. >>>>>>> Isn't crm_mon caused by not returning a response > when >> pacemakerd >>>>>>> prepares to stop? >>>>> yep ... that doesn't look good. >>>>> While in pcmk_shutdown_worker ipc isn't handled. >>>> Stop ... 
that should actually work as pcmk_shutdown_worker >>>> should exit quite quickly and proceed after mainloop >>>> dispatching when called again. >>>> Don't see anything atm that might be blocking for longer ... >>>> but let me dig into it further ... >>> What happens is clear (thanks Ken for the hint ;-) ). >>> When pacemakerd is shutting down - already when it >>> shuts down the resources and not just when it starts to >>> reap the subdaemons - crm_mon reads that state and >>> doesn't try to connect to the cib anymore. >> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with >> I guess the simplest possible solution to the immediate issue so >> that we can discuss it. >>>>> Question is why that didn't create issue earlier. >>>>> Probably I didn't test with resources that had crm_mon in >>>>> their stop/monitor-actions but sbd should have run into >>>>> issues. >>>>> >>>>> Klaus >>>>>> But when shutting down a node the resources should be >>>>>> shutdown before pacemakerd goes down. >>>>>> But let me have a look if it can happen that pacemakerd >>>>>> doesn't react to the ipc-pings before. That btw. might > be >>>>>> lethal for sbd-scenarios (if the phase is too long and it >>>>>> migh actually not be defined). >>>>>> >>>>>> My idea with selinux would have been that it might block >>>>>> the ipc if crm_mon is issued by execd. But well forget >>>>>> about it as it is not enabled ;-) >>>>>> >>>>>> >>>>>> Klaus >>>>>>> >>>>>>> pgsql needs the result of crm_mon in demote processing > and >> stop >>>>>>> processing. >>>>>>> crm_mon should return a response even after pacemakerd > goes >> into a >>>>>>> stop operation. >>>>>>> >>>>>>> Best Regards, >>>>>>> Hideo Yamauchi. >>>>>>> >>>>>>> >>>>>>> - Original M
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
Hi Klaus, Hi Ken, > I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with > I guess the simplest possible solution to the immediate issue so > that we can discuss it. Thank you for the fix. I have confirmed that the fixes have been merged. I'll test this fix today just in case. Many thanks, Hideo Yamauchi. - Original Message - > From: Klaus Wenninger > To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to > open-source clustering welcomed > Cc: > Date: 2021/4/12, Mon 22:22 > Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control > fails. > > On 4/9/21 5:13 PM, Klaus Wenninger wrote: >> On 4/9/21 4:04 PM, Klaus Wenninger wrote: >>> On 4/9/21 3:45 PM, Klaus Wenninger wrote: >>>> On 4/9/21 3:36 PM, Klaus Wenninger wrote: >>>>> On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote: >>>>>> Hi Klaus, >>>>>> >>>>>> Thanks for your comment. >>>>>> >>>>>>> Hmm ... is that with selinux enabled? >>>>>>> Respectively do you see any related avc messages? >>>>>> >>>>>> Selinux is not enabled. >>>>>> Isn't crm_mon caused by not returning a response when > pacemakerd >>>>>> prepares to stop? >>>> yep ... that doesn't look good. >>>> While in pcmk_shutdown_worker ipc isn't handled. >>> Stop ... that should actually work as pcmk_shutdown_worker >>> should exit quite quickly and proceed after mainloop >>> dispatching when called again. >>> Don't see anything atm that might be blocking for longer ... >>> but let me dig into it further ... >> What happens is clear (thanks Ken for the hint ;-) ). >> When pacemakerd is shutting down - already when it >> shuts down the resources and not just when it starts to >> reap the subdaemons - crm_mon reads that state and >> doesn't try to connect to the cib anymore. > I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with > I guess the simplest possible solution to the immediate issue so > that we can discuss it. >>>> Question is why that didn't create issue earlier. 
>>>> Probably I didn't test with resources that had crm_mon in >>>> their stop/monitor-actions but sbd should have run into >>>> issues. >>>> >>>> Klaus >>>>> But when shutting down a node the resources should be >>>>> shutdown before pacemakerd goes down. >>>>> But let me have a look if it can happen that pacemakerd >>>>> doesn't react to the ipc-pings before. That btw. might be >>>>> lethal for sbd-scenarios (if the phase is too long and it >>>>> migh actually not be defined). >>>>> >>>>> My idea with selinux would have been that it might block >>>>> the ipc if crm_mon is issued by execd. But well forget >>>>> about it as it is not enabled ;-) >>>>> >>>>> >>>>> Klaus >>>>>> >>>>>> pgsql needs the result of crm_mon in demote processing and > stop >>>>>> processing. >>>>>> crm_mon should return a response even after pacemakerd goes > into a >>>>>> stop operation. >>>>>> >>>>>> Best Regards, >>>>>> Hideo Yamauchi. >>>>>> >>>>>> >>>>>> - Original Message - >>>>>>> From: Klaus Wenninger >>>>>>> To: renayama19661...@ybb.ne.jp; Cluster Labs - All > topics related >>>>>>> to open-source clustering welcomed > >>>>>>> Cc: >>>>>>> Date: 2021/4/9, Fri 21:12 >>>>>>> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, > pgsql >>>>>>> resource control fails. >>>>>>> >>>>>>> On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote: >>>>>>>> Hi Ken, >>>>>>>> Hi All, >>>>>>>> >>>>>>>> In the pgsql resource, crm_mon is executed in the > process of >>>>>>>> demote and >>>>>>> stop, and the result is processed. >>>>>>>> However, pacemaker included in RHEL8.4beta fails > to execute >>>>>>>> this crm_mon. >>>>>>>>
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
On 4/9/21 5:13 PM, Klaus Wenninger wrote: On 4/9/21 4:04 PM, Klaus Wenninger wrote: On 4/9/21 3:45 PM, Klaus Wenninger wrote: On 4/9/21 3:36 PM, Klaus Wenninger wrote: On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote: Hi Klaus, Thanks for your comment. Hmm ... is that with selinux enabled? Respectively do you see any related avc messages? Selinux is not enabled. Isn't crm_mon caused by not returning a response when pacemakerd prepares to stop? yep ... that doesn't look good. While in pcmk_shutdown_worker ipc isn't handled. Stop ... that should actually work as pcmk_shutdown_worker should exit quite quickly and proceed after mainloop dispatching when called again. Don't see anything atm that might be blocking for longer ... but let me dig into it further ... What happens is clear (thanks Ken for the hint ;-) ). When pacemakerd is shutting down - already when it shuts down the resources and not just when it starts to reap the subdaemons - crm_mon reads that state and doesn't try to connect to the cib anymore. I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with I guess the simplest possible solution to the immediate issue so that we can discuss it. Question is why that didn't create issue earlier. Probably I didn't test with resources that had crm_mon in their stop/monitor-actions but sbd should have run into issues. Klaus But when shutting down a node the resources should be shutdown before pacemakerd goes down. But let me have a look if it can happen that pacemakerd doesn't react to the ipc-pings before. That btw. might be lethal for sbd-scenarios (if the phase is too long and it migh actually not be defined). My idea with selinux would have been that it might block the ipc if crm_mon is issued by execd. But well forget about it as it is not enabled ;-) Klaus pgsql needs the result of crm_mon in demote processing and stop processing. crm_mon should return a response even after pacemakerd goes into a stop operation. Best Regards, Hideo Yamauchi. 
- Original Message -
From: Klaus Wenninger
To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed
Cc:
Date: 2021/4/9, Fri 21:12
Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:

Hi Ken,
Hi All,

In the pgsql resource, crm_mon is executed in the process of demote and stop, and the result is processed.
However, pacemaker included in RHEL8.4beta fails to execute this crm_mon.
 - The problem also occurs on github master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).

The problem can be easily reproduced in the following ways.

Step1. Modify to execute crm_mon in the stop process of the Dummy resource.

    dummy_stop() {
        mon=$(crm_mon -1)
        ret=$?
        ocf_log info "### YAMAUCHI crm_mon[${ret}] : ${mon}"
        dummy_monitor
        if [ $? = $OCF_SUCCESS ]; then
            rm ${OCF_RESKEY_state}
        fi
        return $OCF_SUCCESS
    }

Step2. Configure a cluster with two nodes.

    [root@rh84-beta01 ~]# crm_mon -rfA1
    Cluster Summary:
      * Stack: corosync
      * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
      * Last updated: Thu Apr 8 18:00:52 2021
      * Last change: Thu Apr 8 18:00:38 2021 by root via cibadmin on rh84-beta01
      * 2 nodes configured
      * 1 resource instance configured

    Node List:
      * Online: [ rh84-beta01 rh84-beta02 ]

    Full List of Resources:
      * dummy-1 (ocf::heartbeat:Dummy): Started rh84-beta01

    Migration Summary:

Step3. Stop the node where the Dummy resource is running. The resource will fail over.

    [root@rh84-beta02 ~]# crm_mon -rfA1
    Cluster Summary:
      * Stack: corosync
      * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
      * Last updated: Thu Apr 8 18:08:56 2021
      * Last change: Thu Apr 8 18:05:08 2021 by root via cibadmin on rh84-beta01
      * 2 nodes configured
      * 1 resource instance configured

    Node List:
      * Online: [ rh84-beta02 ]
      * OFFLINE: [ rh84-beta01 ]

    Full List of Resources:
      * dummy-1 (ocf::heartbeat:Dummy): Started rh84-beta02

However, if you look at the log, you can see that the execution of crm_mon in the stop processing of the Dummy resource has failed.

    Apr 08 18:05:17 Dummy(dummy-1)[2631]: INFO: ### YAMAUCHI crm_mon[102] : Pacemaker daemons shutting down ...
    Apr 08 18:05:17 rh84-beta01 pacemaker-execd [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]

Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?

Klaus

Similarly, pgsql also executes crm_mon with demote or stop, so control fails.
The problem seems to be related to th
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
On 4/9/21 4:04 PM, Klaus Wenninger wrote:
> On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>> On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>> On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>>> Hi Klaus,
>>>>
>>>> Thanks for your comment.
>>>>
>>>>> Hmm ... is that with selinux enabled?
>>>>> Respectively do you see any related avc messages?
>>>>
>>>> Selinux is not enabled.
>>>> Isn't crm_mon caused by not returning a response when pacemakerd prepares to stop?
>> yep ... that doesn't look good.
>> While in pcmk_shutdown_worker ipc isn't handled.
> Stop ... that should actually work as pcmk_shutdown_worker should exit quite quickly and proceed after mainloop dispatching when called again.
> Don't see anything atm that might be blocking for longer ... but let me dig into it further ...

What happens is clear (thanks Ken for the hint ;-) ).
When pacemakerd is shutting down - already when it shuts down the resources and not just when it starts to reap the subdaemons - crm_mon reads that state and doesn't try to connect to the cib anymore.

>> Question is why that didn't create issue earlier.
>> Probably I didn't test with resources that had crm_mon in their stop/monitor-actions but sbd should have run into issues.
>>
>> Klaus
>>> But when shutting down a node the resources should be shut down before pacemakerd goes down.
>>> But let me have a look if it can happen that pacemakerd doesn't react to the ipc-pings before.
>>> That btw. might be lethal for sbd-scenarios (if the phase is too long and it might actually not be defined).
>>>
>>> My idea with selinux would have been that it might block the ipc if crm_mon is issued by execd.
>>> But well forget about it as it is not enabled ;-)
>>>
>>> Klaus
>>>> pgsql needs the result of crm_mon in demote processing and stop processing.
>>>> crm_mon should return a response even after pacemakerd goes into a stop operation.
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
On 4/9/21 3:45 PM, Klaus Wenninger wrote:
> On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>> On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>> Hi Klaus,
>>>
>>> Thanks for your comment.
>>>
>>>> Hmm ... is that with selinux enabled?
>>>> Respectively do you see any related avc messages?
>>>
>>> Selinux is not enabled.
>>> Isn't crm_mon caused by not returning a response when pacemakerd prepares to stop?
> yep ... that doesn't look good.
> While in pcmk_shutdown_worker ipc isn't handled.

Stop ... that should actually work as pcmk_shutdown_worker should exit quite quickly and proceed after mainloop dispatching when called again.
Don't see anything atm that might be blocking for longer ... but let me dig into it further ...

> Question is why that didn't create issue earlier.
> Probably I didn't test with resources that had crm_mon in their stop/monitor-actions but sbd should have run into issues.
>
> Klaus
>> But when shutting down a node the resources should be shut down before pacemakerd goes down.
>> But let me have a look if it can happen that pacemakerd doesn't react to the ipc-pings before.
>> That btw. might be lethal for sbd-scenarios (if the phase is too long and it might actually not be defined).
>>
>> My idea with selinux would have been that it might block the ipc if crm_mon is issued by execd.
>> But well forget about it as it is not enabled ;-)
>>
>> Klaus
>>> pgsql needs the result of crm_mon in demote processing and stop processing.
>>> crm_mon should return a response even after pacemakerd goes into a stop operation.
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
On 4/9/21 3:36 PM, Klaus Wenninger wrote:
> On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>> Hi Klaus,
>>
>> Thanks for your comment.
>>
>>> Hmm ... is that with selinux enabled?
>>> Respectively do you see any related avc messages?
>>
>> Selinux is not enabled.
>> Isn't crm_mon caused by not returning a response when pacemakerd prepares to stop?

yep ... that doesn't look good.
While in pcmk_shutdown_worker ipc isn't handled.
Question is why that didn't create issue earlier.
Probably I didn't test with resources that had crm_mon in their stop/monitor-actions but sbd should have run into issues.

Klaus

> But when shutting down a node the resources should be shut down before pacemakerd goes down.
> But let me have a look if it can happen that pacemakerd doesn't react to the ipc-pings before.
> That btw. might be lethal for sbd-scenarios (if the phase is too long and it might actually not be defined).
>
> My idea with selinux would have been that it might block the ipc if crm_mon is issued by execd.
> But well forget about it as it is not enabled ;-)
>
> Klaus
>> pgsql needs the result of crm_mon in demote processing and stop processing.
>> crm_mon should return a response even after pacemakerd goes into a stop operation.
>>
>> Best Regards,
>> Hideo Yamauchi.
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
> Hi Klaus,
>
> Thanks for your comment.
>
>> Hmm ... is that with selinux enabled?
>> Respectively do you see any related avc messages?
>
> Selinux is not enabled.
> Isn't crm_mon caused by not returning a response when pacemakerd prepares to stop?

But when shutting down a node the resources should be shut down before pacemakerd goes down.
But let me have a look if it can happen that pacemakerd doesn't react to the ipc-pings before.
That btw. might be lethal for sbd-scenarios (if the phase is too long and it might actually not be defined).

My idea with selinux would have been that it might block the ipc if crm_mon is issued by execd.
But well forget about it as it is not enabled ;-)

Klaus

> pgsql needs the result of crm_mon in demote processing and stop processing.
> crm_mon should return a response even after pacemakerd goes into a stop operation.
>
> Best Regards,
> Hideo Yamauchi.

--
Klaus Wenninger
Senior Software Engineer, EMEA ENG Base Operating Systems
Red Hat
kwenn...@redhat.com
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Mue
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
Hi Klaus,

Thanks for your comment.

> Hmm ... is that with selinux enabled?
> Respectively do you see any related avc messages?

Selinux is not enabled.
Isn't crm_mon caused by not returning a response when pacemakerd prepares to stop?

pgsql needs the result of crm_mon in demote processing and stop processing.
crm_mon should return a response even after pacemakerd goes into a stop operation.

Best Regards,
Hideo Yamauchi.

- Original Message -
> From: Klaus Wenninger
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed
> Cc:
> Date: 2021/4/9, Fri 21:12
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>
> On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
>> However, if you look at the log, you can see that the execution of crm_mon in the stop processing of the Dummy resource has failed.
>>
>> Apr 08 18:05:17 Dummy(dummy-1)[2631]: INFO: ### YAMAUCHI crm_mon[102] : Pacemaker daemons shutting down ...
>> Apr 08 18:05:17 rh84-beta01 pacemaker-execd [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]
> Hmm ... is that with selinux enabled?
> Respectively do you see any related avc messages?
>
> Klaus

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
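To check from the logs whether crm_mon answered during a stop action, the exit status in the debug line from the reproduction can be pulled out with a small filter. This is only a sketch: `extract_crm_mon_rc` is a hypothetical helper name, and the pattern assumes the `crm_mon[<rc>] :` format written by the modified Dummy agent in this thread, not any standard Pacemaker log format.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker or resource-agents):
# read log text on stdin and print the exit status recorded in lines
# of the form "... crm_mon[<rc>] : <output>", one status per line.
extract_crm_mon_rc() {
    # -n suppresses default output; the trailing p prints only lines
    # where the substitution matched.
    sed -n 's/.*crm_mon\[\([0-9][0-9]*\)] :.*/\1/p'
}
```

On a real node it would be fed from whatever file or journal the agent logs to; a non-zero value for a stop action would point at the situation discussed here.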
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
> However, if you look at the log, you can see that the execution of crm_mon in the stop processing of the Dummy resource has failed.
>
> Apr 08 18:05:17 Dummy(dummy-1)[2631]: INFO: ### YAMAUCHI crm_mon[102] : Pacemaker daemons shutting down ...
> Apr 08 18:05:17 rh84-beta01 pacemaker-execd [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]

Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?

Klaus
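The SELinux question above can be answered roughly as follows. This is a sketch under assumptions: `count_avc_denials` is a hypothetical helper that only counts lines containing an AVC denial in audit-log text on stdin; it does not interpret them.

```shell
#!/bin/sh
# Hypothetical helper: count SELinux AVC denial records in audit-log
# text read from stdin. Prints the count; always exits 0.
count_avc_denials() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the helper usable under "set -e".
    grep -c 'avc:.*denied' || true
}
```

On a RHEL-like system with the audit tools installed, `getenforce` reports the current SELinux mode, and something like `ausearch -m avc -ts recent | count_avc_denials` would show whether any denials were logged around the time of the failed stop. A result of 0 with SELinux disabled matches what was reported in this thread.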
[ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
Hi Ken,
Hi All,

In the pgsql resource, crm_mon is executed in the process of demote and stop, and the result is processed.

However, pacemaker included in RHEL8.4beta fails to execute this crm_mon.
 - The problem also occurs on github master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).

The problem can be easily reproduced in the following ways.

Step1. Modify to execute crm_mon in the stop process of the Dummy resource.

dummy_stop() {
    mon=$(crm_mon -1)
    ret=$?
    ocf_log info "### YAMAUCHI crm_mon[${ret}] : ${mon}"
    dummy_monitor
    if [ $? = $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
    fi
    return $OCF_SUCCESS
}

Step2. Configure a cluster with two nodes.

[root@rh84-beta01 ~]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync
  * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
  * Last updated: Thu Apr 8 18:00:52 2021
  * Last change: Thu Apr 8 18:00:38 2021 by root via cibadmin on rh84-beta01
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ rh84-beta01 rh84-beta02 ]

Full List of Resources:
  * dummy-1 (ocf::heartbeat:Dummy): Started rh84-beta01

Migration Summary:

Step3. Stop the node where the Dummy resource is running. The resource will fail over.

[root@rh84-beta02 ~]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync
  * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
  * Last updated: Thu Apr 8 18:08:56 2021
  * Last change: Thu Apr 8 18:05:08 2021 by root via cibadmin on rh84-beta01
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ rh84-beta02 ]
  * OFFLINE: [ rh84-beta01 ]

Full List of Resources:
  * dummy-1 (ocf::heartbeat:Dummy): Started rh84-beta02

However, if you look at the log, you can see that the execution of crm_mon in the stop processing of the Dummy resource has failed.

Apr 08 18:05:17 Dummy(dummy-1)[2631]: INFO: ### YAMAUCHI crm_mon[102] : Pacemaker daemons shutting down ...
Apr 08 18:05:17 rh84-beta01 pacemaker-execd [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]

Similarly, pgsql also executes crm_mon with demote or stop, so control fails.

The problem seems to be related to the next fix.
 * Report pacemakerd in state waiting for sbd
  - https://github.com/ClusterLabs/pacemaker/pull/2278

The problem does not occur with the release version of Pacemaker 2.0.5 or the Pacemaker included with RHEL8.3.

This issue has a huge impact on the user.

Perhaps it also affects the control of other resources that utilize crm_mon.

Please improve the release version of RHEL8.4 so that it includes Pacemaker which does not cause this problem.
 * Distributions other than RHEL may also be affected in future releases.

This content is the same as the following Bugzilla.
 - https://bugs.clusterlabs.org/show_bug.cgi?id=5471

Best Regards,
Hideo Yamauchi.
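Independent of any fix in Pacemaker itself, an agent that queries cluster status during stop could be written so that a failed query does not derail the action. The following is a minimal sketch of that idea, not what the pgsql agent actually does: `STATUS_CMD`, `query_status` and `safe_stop` are hypothetical names, and the fallback branch only prints a message where a real agent would do its own local checks.

```shell
#!/bin/sh
# Assumption for illustration: the status command is configurable so it
# can be stubbed; in a real agent this might simply be "crm_mon -1".
: "${STATUS_CMD:=crm_mon -1}"

# Run the status command, keeping its output and exit status in
# status_out / status_rc instead of letting a failure propagate.
query_status() {
    status_out=$($STATUS_CMD 2>/dev/null)
    status_rc=$?
    return "$status_rc"
}

# Sketch of a stop action that tolerates a failed status query.
safe_stop() {
    if query_status; then
        # status_out holds the cluster view and could be parsed here.
        echo "status available: using cluster view"
    else
        # e.g. the exit status 102 seen in the log above
        echo "status unavailable (rc=${status_rc}): falling back to local checks"
    fi
    # The status query itself never decides the result of the stop.
    return 0
}
```

The point of the design is only that the result of the status query is treated as advisory during stop; whether that is acceptable depends on what the agent needs the cluster view for.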