Hi Klaus,
Hi Ken,

> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> I guess the simplest possible solution to the immediate issue so
> that we can discuss it.

Thank you for the fix.

I have confirmed that the fixes have been merged.
I'll test this fix today just in case.

Many thanks,
Hideo Yamauchi.

----- Original Message -----
> From: Klaus Wenninger <kwenn...@redhat.com>
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to
> open-source clustering welcomed <users@clusterlabs.org>
> Cc:
> Date: 2021/4/12, Mon 22:22
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>
> On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>> On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>> On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>>> On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>>> On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>> Hi Klaus,
>>>>>>
>>>>>> Thanks for your comment.
>>>>>>
>>>>>>> Hmm ... is that with selinux enabled?
>>>>>>> Respectively do you see any related avc messages?
>>>>>>
>>>>>> Selinux is not enabled.
>>>>>> Isn't crm_mon caused by not returning a response when pacemakerd
>>>>>> prepares to stop?
>>>> yep ... that doesn't look good.
>>>> While in pcmk_shutdown_worker ipc isn't handled.
>>> Stop ... that should actually work as pcmk_shutdown_worker
>>> should exit quite quickly and proceed after mainloop
>>> dispatching when called again.
>>> Don't see anything atm that might be blocking for longer ...
>>> but let me dig into it further ...
>> What happens is clear (thanks Ken for the hint ;-) ).
>> When pacemakerd is shutting down - already when it
>> shuts down the resources and not just when it starts to
>> reap the subdaemons - crm_mon reads that state and
>> doesn't try to connect to the cib anymore.
> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> I guess the simplest possible solution to the immediate issue so
> that we can discuss it.
>>>> Question is why that didn't create issue earlier.
>>>> Probably I didn't test with resources that had crm_mon in
>>>> their stop/monitor-actions but sbd should have run into
>>>> issues.
>>>>
>>>> Klaus
>>>>> But when shutting down a node the resources should be
>>>>> shut down before pacemakerd goes down.
>>>>> But let me have a look if it can happen that pacemakerd
>>>>> doesn't react to the ipc-pings before. That btw. might be
>>>>> lethal for sbd-scenarios (if the phase is too long and it
>>>>> might actually not be defined).
>>>>>
>>>>> My idea with selinux would have been that it might block
>>>>> the ipc if crm_mon is issued by execd. But well forget
>>>>> about it as it is not enabled ;-)
>>>>>
>>>>>
>>>>> Klaus
>>>>>>
>>>>>> pgsql needs the result of crm_mon in demote processing and stop
>>>>>> processing.
>>>>>> crm_mon should return a response even after pacemakerd goes into a
>>>>>> stop operation.
>>>>>>
>>>>>> Best Regards,
>>>>>> Hideo Yamauchi.
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: Klaus Wenninger <kwenn...@redhat.com>
>>>>>>> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related
>>>>>>> to open-source clustering welcomed <users@clusterlabs.org>
>>>>>>> Cc:
>>>>>>> Date: 2021/4/9, Fri 21:12
>>>>>>> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql
>>>>>>> resource control fails.
>>>>>>>
>>>>>>> On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>>>> Hi Ken,
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> In the pgsql resource, crm_mon is executed in the process of
>>>>>>>> demote and stop, and the result is processed.
>>>>>>>> However, pacemaker included in RHEL8.4beta fails to execute
>>>>>>>> this crm_mon.
>>>>>>>> - The problem also occurs on github master (c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>>>>>>>> The problem can be easily reproduced in the following ways.
>>>>>>>>
>>>>>>>> Step1. Modify to execute crm_mon in the stop process of the
>>>>>>>> Dummy resource.
>>>>>>>> ----
>>>>>>>>
>>>>>>>> dummy_stop() {
>>>>>>>>     mon=$(crm_mon -1)
>>>>>>>>     ret=$?
>>>>>>>>     ocf_log info "### YAMAUCHI #### crm_mon[${ret}] : ${mon}"
>>>>>>>>     dummy_monitor
>>>>>>>>     if [ $? = $OCF_SUCCESS ]; then
>>>>>>>>         rm ${OCF_RESKEY_state}
>>>>>>>>     fi
>>>>>>>>     return $OCF_SUCCESS
>>>>>>>> }
>>>>>>>> ----
>>>>>>>>
>>>>>>>> Step2. Configure a cluster with two nodes.
>>>>>>>> ----
>>>>>>>>
>>>>>>>> [root@rh84-beta01 ~]# crm_mon -rfA1
>>>>>>>> Cluster Summary:
>>>>>>>>   * Stack: corosync
>>>>>>>>   * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>>>   * Last updated: Thu Apr 8 18:00:52 2021
>>>>>>>>   * Last change: Thu Apr 8 18:00:38 2021 by root via cibadmin on rh84-beta01
>>>>>>>>   * 2 nodes configured
>>>>>>>>   * 1 resource instance configured
>>>>>>>>
>>>>>>>> Node List:
>>>>>>>>   * Online: [ rh84-beta01 rh84-beta02 ]
>>>>>>>>
>>>>>>>> Full List of Resources:
>>>>>>>>   * dummy-1 (ocf::heartbeat:Dummy): Started rh84-beta01
>>>>>>>>
>>>>>>>> Migration Summary:
>>>>>>>> ----
>>>>>>>>
>>>>>>>> Step3. Stop the node where the Dummy resource is running. The
>>>>>>>> resource will fail over.
>>>>>>>> ----
>>>>>>>> [root@rh84-beta02 ~]# crm_mon -rfA1
>>>>>>>> Cluster Summary:
>>>>>>>>   * Stack: corosync
>>>>>>>>   * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>>>   * Last updated: Thu Apr 8 18:08:56 2021
>>>>>>>>   * Last change: Thu Apr 8 18:05:08 2021 by root via cibadmin on rh84-beta01
>>>>>>>>   * 2 nodes configured
>>>>>>>>   * 1 resource instance configured
>>>>>>>>
>>>>>>>> Node List:
>>>>>>>>   * Online: [ rh84-beta02 ]
>>>>>>>>   * OFFLINE: [ rh84-beta01 ]
>>>>>>>>
>>>>>>>> Full List of Resources:
>>>>>>>>   * dummy-1 (ocf::heartbeat:Dummy): Started rh84-beta02
>>>>>>>> ----
>>>>>>>>
>>>>>>>> However, if you look at the log, you can see that the
>>>>>>>> execution of crm_mon in the stop processing of the Dummy resource has failed.
>>>>>>>> ----
>>>>>>>> Apr 08 18:05:17 Dummy(dummy-1)[2631]: INFO: ### YAMAUCHI #### crm_mon[102] : Pacemaker daemons shutting down ...
>>>>>>>> Apr 08 18:05:17 rh84-beta01 pacemaker-execd [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]
>>>>>>> Hmm ... is that with selinux enabled?
>>>>>>> Respectively do you see any related avc messages?
>>>>>>>
>>>>>>> Klaus
>>>>>>>> ----
>>>>>>>>
>>>>>>>> Similarly, pgsql also executes crm_mon with demote or stop, so
>>>>>>>> control fails.
>>>>>>>> The problem seems to be related to the next fix.
>>>>>>>>   * Report pacemakerd in state waiting for sbd
>>>>>>>>   - https://github.com/ClusterLabs/pacemaker/pull/2278
>>>>>>>>
>>>>>>>> The problem does not occur with the release version of
>>>>>>>> Pacemaker 2.0.5 or the Pacemaker included with RHEL8.3.
>>>>>>>> This issue has a huge impact on the user.
>>>>>>>>
>>>>>>>> Perhaps it also affects the control of other resources that
>>>>>>>> utilize crm_mon.
>>>>>>>> Please improve the release version of RHEL8.4 so that it
>>>>>>>> includes Pacemaker which does not cause this problem.
>>>>>>>> * Distributions other than RHEL may also be affected in
>>>>>>>> future releases.
>>>>>>>>
>>>>>>>> ----
>>>>>>>> This content is the same as the following Bugzilla.
>>>>>>>> - https://bugs.clusterlabs.org/show_bug.cgi?id=5471
>>>>>>>> ----
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Hideo Yamauchi.
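
For anyone who wants to check another Pacemaker build for the same behaviour, the dummy_stop change quoted above could be turned into a small reusable helper. This is only a sketch: check_status_cmd is a hypothetical name, not something provided by Pacemaker or resource-agents, and it logs with plain echo so that it runs outside an agent (inside an agent, ocf_log would be the natural choice). It does not work around the problem; it only keeps the command's output and exit code apart so that a failure like the rc=102 in the log above is easy to spot.

```shell
#!/bin/sh
# Sketch only: check_status_cmd is a hypothetical helper, not part of
# Pacemaker or resource-agents. It runs a status command ("crm_mon -1"
# by default) and reports its exit code separately from its output.

check_status_cmd() {
    # Use the given command, or fall back to "crm_mon -1".
    [ $# -gt 0 ] || set -- crm_mon -1

    # Capture stdout and stderr together; $? is the command's exit code.
    status_out=$("$@" 2>&1)
    status_rc=$?

    if [ "$status_rc" -ne 0 ]; then
        # Report the failure on stderr and pass the exit code through.
        echo "WARN: '$*' failed with rc=${status_rc}: ${status_out}" >&2
        return "$status_rc"
    fi

    # On success, print the captured output for the caller to parse.
    printf '%s\n' "$status_out"
}
```

In a stop or demote action this could be called as `mon=$(check_status_cmd) || ret=$?`, so that the agent sees a non-zero code instead of silently parsing the "shutting down" text.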