Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

renayama19661014 Mon, 12 Apr 2021 15:07:42 -0700

Hi Klaus,
Hi Ken,

> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> I guess the simplest possible solution to the immediate issue so
> that we can discuss it.


Thank you for the fix.


I have confirmed that the fixes have been merged.

I'll test this fix today just in case.
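
For the test, the check inside the stop action can be sketched roughly as follows. This is only a sketch under assumptions, not the actual pgsql or Dummy agent code: run_crm_mon and CRM_MON_CMD are hypothetical names introduced here so that the command can be replaced by a stub, and 102 is simply the exit status that appears in the log quoted below.

```shell
#!/bin/sh
# Hypothetical test helper (not part of any resource agent):
# run crm_mon once and report how it ended.
# CRM_MON_CMD is an assumption made here so a stub can replace the real command.
CRM_MON_CMD="${CRM_MON_CMD:-crm_mon}"

run_crm_mon() {
    # -1 prints the cluster status once and exits, as in the reproducer.
    mon=$($CRM_MON_CMD -1 2>&1)
    ret=$?
    # Keep the raw output on stderr so it can be compared with the agent log.
    echo "crm_mon[${ret}] : ${mon}" >&2
    case "$ret" in
        0)   echo "OK: crm_mon answered" ;;
        102) echo "FAIL: crm_mon could not reach the cluster (rc=102)" ;;
        *)   echo "FAIL: crm_mon returned rc=${ret}" ;;
    esac
    return "$ret"
}
```

Calling run_crm_mon from the stop action of a test resource while the node is being stopped should print the "OK" line if crm_mon answers, and the rc=102 line if it fails as in the report.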

Many thanks,
Hideo Yamauchi.


----- Original Message -----
> From: Klaus Wenninger <kwenn...@redhat.com>
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed <users@clusterlabs.org>
> Cc: 
> Date: 2021/4/12, Mon 22:22
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>  On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>>  On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>>>  On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>>>  On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>>  Hi Klaus,
>>>>>> 
>>>>>>  Thanks for your comment.
>>>>>> 
>>>>>>>  Hmm ... is that with selinux enabled?
>>>>>>>  Respectively do you see any related avc messages?
>>>>>> 
>>>>>>  Selinux is not enabled.
>>>>>>  Isn't the crm_mon failure caused by pacemakerd not returning a
>>>>>>  response while it prepares to stop?
>>>>  yep ... that doesn't look good.
>>>>  While in pcmk_shutdown_worker ipc isn't handled.
>>>  Stop ... that should actually work as pcmk_shutdown_worker
>>>  should exit quite quickly and proceed after mainloop
>>>  dispatching when called again.
>>>  Don't see anything atm that might be blocking for longer ...
>>>  but let me dig into it further ...
>>  What happens is clear (thanks Ken for the hint ;-) ).
>>  When pacemakerd is shutting down - already when it
>>  shuts down the resources and not just when it starts to
>>  reap the subdaemons - crm_mon reads that state and
>>  doesn't try to connect to the cib anymore.
> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> I guess the simplest possible solution to the immediate issue so
> that we can discuss it.
>>>>  Question is why that didn't create issue earlier.
>>>>  Probably I didn't test with resources that had crm_mon in
>>>>  their stop/monitor-actions but sbd should have run into
>>>>  issues.
>>>> 
>>>>  Klaus
>>>>>  But when shutting down a node the resources should be
>>>>>  shutdown before pacemakerd goes down.
>>>>>  But let me have a look if it can happen that pacemakerd
>>>>>  doesn't react to the ipc-pings before. That btw. might be
>>>>>  lethal for sbd-scenarios (if the phase is too long and it
>>>>>  might actually not be defined).
>>>>> 
>>>>>  My idea with selinux would have been that it might block
>>>>>  the ipc if crm_mon is issued by execd. But well forget
>>>>>  about it as it is not enabled ;-)
>>>>> 
>>>>> 
>>>>>  Klaus
>>>>>> 
>>>>>>  pgsql needs the result of crm_mon in demote processing and stop
>>>>>>  processing.
>>>>>>  crm_mon should return a response even after pacemakerd goes into a
>>>>>>  stop operation.
>>>>>> 
>>>>>>  Best Regards,
>>>>>>  Hideo Yamauchi.
>>>>>> 
>>>>>> 
>>>>>>  ----- Original Message -----
>>>>>>>  From: Klaus Wenninger <kwenn...@redhat.com>
>>>>>>>  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related
>>>>>>>  to open-source clustering welcomed <users@clusterlabs.org>
>>>>>>>  Cc:
>>>>>>>  Date: 2021/4/9, Fri 21:12
>>>>>>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql
>>>>>>>  resource control fails.
>>>>>>> 
>>>>>>>  On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>>>>    Hi Ken,
>>>>>>>>    Hi All,
>>>>>>>> 
>>>>>>>>    In the pgsql resource, crm_mon is executed in the process of demote and
>>>>>>>>    stop, and the result is processed.
>>>>>>>>    However, pacemaker included in RHEL8.4beta fails to execute this crm_mon.
>>>>>>>>      - The problem also occurs on github master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>>>>>>>>    The problem can be easily reproduced in the following ways.
>>>>>>>>
>>>>>>>>    Step1. Modify to execute crm_mon in the stop process of the Dummy resource.
>>>>>>>>    ----
>>>>>>>> 
>>>>>>>>    dummy_stop() {
>>>>>>>>         # Added for the reproducer: run crm_mon once and log its
>>>>>>>>         # exit status and output.
>>>>>>>>         mon=$(crm_mon -1)
>>>>>>>>         ret=$?
>>>>>>>>         ocf_log info "### YAMAUCHI #### crm_mon[${ret}] : ${mon}"
>>>>>>>>         dummy_monitor
>>>>>>>>         if [ $? = $OCF_SUCCESS ]; then
>>>>>>>>             rm "${OCF_RESKEY_state}"
>>>>>>>>         fi
>>>>>>>>         return $OCF_SUCCESS
>>>>>>>>    }
>>>>>>>>    ----
>>>>>>>> 
>>>>>>>>    Step2. Configure a cluster with two nodes.
>>>>>>>>    ----
>>>>>>>> 
>>>>>>>>    [root@rh84-beta01 ~]# crm_mon -rfA1
>>>>>>>>    Cluster Summary:
>>>>>>>>       * Stack: corosync
>>>>>>>>       * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>>>       * Last updated: Thu Apr  8 18:00:52 2021
>>>>>>>>       * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on rh84-beta01
>>>>>>>>       * 2 nodes configured
>>>>>>>>       * 1 resource instance configured
>>>>>>>> 
>>>>>>>>    Node List:
>>>>>>>>       * Online: [ rh84-beta01 rh84-beta02 ]
>>>>>>>> 
>>>>>>>>    Full List of Resources:
>>>>>>>>       * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01
>>>>>>>> 
>>>>>>>>    Migration Summary:
>>>>>>>>    ----
>>>>>>>> 
>>>>>>>>    Step3. Stop the node where the Dummy resource is running. The
>>>>>>>>    resource will fail over.
>>>>>>>>    ----
>>>>>>>>    [root@rh84-beta02 ~]# crm_mon -rfA1
>>>>>>>>    Cluster Summary:
>>>>>>>>       * Stack: corosync
>>>>>>>>       * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>>>       * Last updated: Thu Apr  8 18:08:56 2021
>>>>>>>>       * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on rh84-beta01
>>>>>>>>       * 2 nodes configured
>>>>>>>>       * 1 resource instance configured
>>>>>>>> 
>>>>>>>>    Node List:
>>>>>>>>       * Online: [ rh84-beta02 ]
>>>>>>>>       * OFFLINE: [ rh84-beta01 ]
>>>>>>>> 
>>>>>>>>    Full List of Resources:
>>>>>>>>       * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
>>>>>>>>    ----
>>>>>>>> 
>>>>>>>>    However, if you look at the log, you can see that the execution of crm_mon
>>>>>>>>    in the stop processing of the Dummy resource has failed.
>>>>>>>>    ----
>>>>>>>>    Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI #### crm_mon[102] : Pacemaker daemons shutting down ...
>>>>>>>>    Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]
>>>>>>>  Hmm ... is that with selinux enabled?
>>>>>>>  Respectively do you see any related avc messages?
>>>>>>> 
>>>>>>>  Klaus
>>>>>>>>    ----
>>>>>>>> 
>>>>>>>>    Similarly, pgsql also executes crm_mon during demote or stop, so
>>>>>>>>    resource control fails.
>>>>>>>>    The problem seems to be related to the following fix.
>>>>>>>>      * Report pacemakerd in state waiting for sbd
>>>>>>>>       - https://github.com/ClusterLabs/pacemaker/pull/2278
>>>>>>>> 
>>>>>>>>    The problem does not occur with the release version of Pacemaker 2.0.5 or
>>>>>>>>    the Pacemaker included with RHEL8.3.
>>>>>>>>    This issue has a huge impact on the user.
>>>>>>>>
>>>>>>>>    Perhaps it also affects the control of other resources that utilize crm_mon.
>>>>>>>>    Please improve the release version of RHEL8.4 so that it includes Pacemaker
>>>>>>>>    which does not cause this problem.
>>>>>>>>      * Distributions other than RHEL may also be affected in future releases.
>>>>>>>> 
>>>>>>>>    ----
>>>>>>>>    This content is the same as the following Bugzilla.
>>>>>>>>      - https://bugs.clusterlabs.org/show_bug.cgi?id=5471
>>>>>>>>    ----
>>>>>>>> 
>>>>>>>>    Best Regards,
>>>>>>>>    Hideo Yamauchi.
>>>>>>>> 
>>>>>>>>    _______________________________________________
>> 
> 
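
To confirm the result after the node has stopped, the log can be searched for the error quoted above. A rough sketch, assuming the caller passes in the log file path (the location depends on the installation); count_crm_mon_failures is a hypothetical name, and the pattern is the error text copied from the pacemaker-execd log line in the report.

```shell
#!/bin/sh
# Hypothetical helper: count log lines where a stop action's crm_mon call failed.
# The pattern is the error text from the pacemaker-execd line quoted above.
count_crm_mon_failures() {
    logfile=$1
    # -c prints the number of matching lines; -F matches the text literally.
    # Note: grep exits with status 1 when the count is 0.
    grep -c -F "crm_mon: Error: cluster is not available on this node" "$logfile"
}
```

A count of 0 after a node stop would suggest that crm_mon answered during the stop action; any other value means the failure from the report is still present in that log.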

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
