On 02/09/2017 05:46 AM, Jehan-Guillaume de Rorthais wrote:
> On Thu, 9 Feb 2017 19:24:22 +0900 (JST)
> renayama19661...@ybb.ne.jp wrote:
>
>> Hi Ken,
>>
>>
>>> 1. Return a "hard" error such as OCF_ERR_ARGS or OCF_ERR_PERM. When
>>> Pacemaker gets one of these errors from an agent, it will ban the
>>> resource from that node (until the failure is cleared).
>>
>> The first suggestion does not work well.
>>
>> Even if the RA returns OCF_ERR_ARGS or OCF_ERR_PERM, this happens in the
>> RA's pre_promote (notify) handling. Pacemaker does not record the notify
>> (pre-promote) error in the CIB.
>>
>> * https://github.com/ClusterLabs/pacemaker/blob/master/crmd/lrm.c#L2411
>>
>> Because it is not recorded in the CIB, pengine cannot treat it as a
>> "hard error".
Ah, I didn't think of that.
> Indeed. That's why PAF uses private attributes to pass information between
> actions. We detect the failure during the notify as well, but raise the error
> during the promotion itself. See how I dealt with this in PAF:
>
> https://github.com/ioguix/PAF/commit/6123025ff7cd9929b56c9af2faaefdf392886e68
That's a nice use of private attributes.
> As private attributes do not work on older stacks, you could rely on a local
> temp file in $HA_RSCTMP as well.
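For what it's worth, the temp-file variant could look roughly like the sketch below. It is not taken from PAF or the pgsql RA: the function names and the flag file name are made up for illustration, and HA_RSCTMP falls back to a mktemp directory only so the snippet can run outside a cluster. The idea is the same as above: remember the problem during the pre-promote notify, then fail the promote action itself, whose result is recorded.

```shell
# Hypothetical sketch, not the PAF or pgsql RA implementation.
# HA_RSCTMP and OCF_RESOURCE_INSTANCE are normally provided by the cluster;
# the defaults below are assumptions so this can run standalone.
: "${HA_RSCTMP:=$(mktemp -d)}"
: "${OCF_RESOURCE_INSTANCE:=pgsql}"
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical flag file name, one per resource instance.
flag_file() {
    echo "${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.promote_refused"
}

# Call from the notify action (pre-promote) when a problem is detected.
remember_promote_refusal() {
    echo "$1" > "$(flag_file)"
    return $OCF_SUCCESS    # the notify result is not recorded anyway
}

# Call at the start of the promote action.
check_promote_refusal() {
    if [ -f "$(flag_file)" ]; then
        echo "promotion refused: $(cat "$(flag_file)")" >&2
        rm -f "$(flag_file)"
        return $OCF_ERR_GENERIC    # a hard error code could be returned instead
    fi
    return $OCF_SUCCESS
}
```

The flag is removed once it has been consumed, so a later promotion attempt is not blocked by a stale file.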
>
>>> 2. Use crm_resource --ban instead. This would ban the resource from that
>>> node until the user removes the ban with crm_resource --clear (or by
>>> deleting the ban constraint from the configuration).
>>
>> The second suggestion works well.
>> I intend to adopt the second suggestion.
>>
>> As another method, crm_resource -F also seems to be available; what do you
>> think? With crm_resource -F, I think last-failure would also be handled
>> without problems for a simulated failure.
>>
>> crm_resource -F may work too, but I will adopt crm_resource -B because the
>> RA wants to stop the pgsql resource completely.
>>
>> ``` @pgsql RA
>>
>> pgsql_pre_promote() {
>> (snip)
>>     if [ "$cmp_location" != "$my_master_baseline" ]; then
>>         ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>>         exec_with_retry 0 $CRM_RESOURCE -B -r $OCF_RESOURCE_INSTANCE -N $NODENAME -Q
>>         return $OCF_ERR_GENERIC
>>     fi
>> (snip)
>> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>> CRM_RESOURCE="${HA_SBIN_DIR}/crm_resource"
>> ```
>>
>> I will test the behavior a little more and then send a patch.
>
> I suppose crm_resource -F will just raise the failcount, break the current
> transition and the CRM will recompute another transition paying attention to
> your "failed" resource (will it try to recover it? retry the previous
> transition again?).
>
> I would bet on crm_resource -B.
Correct, crm_resource -F only simulates OCF_ERR_GENERIC, which is a soft
error. It might be a nice extension to be able to specify the error
code, but in this case, I think crm_resource -B (or the private
attribute approach, if you're OK with limiting it to corosync 2 and
pacemaker 1.1.13+) is better.
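To make the -B route concrete, here is a minimal sketch of a ban/unban pair. The helper names ban_here and unban_here are hypothetical, and CRM_RESOURCE is assumed to point at crm_resource, as it does in the RA via ${HA_SBIN_DIR}. The long options are the spelled-out forms of -B, -r, -N, -Q and -U.

```shell
# Sketch only: ban_here/unban_here are hypothetical helper names.
# In the RA, CRM_RESOURCE is set to ${HA_SBIN_DIR}/crm_resource.
: "${CRM_RESOURCE:=crm_resource}"

# Ban the resource from one node. The ban stays in the configuration
# until it is cleared. $1 = resource id, $2 = node name.
ban_here() {
    $CRM_RESOURCE --ban --resource "$1" --node "$2" --quiet
}

# Remove the ban again once the node is fit to run the resource.
unban_here() {
    $CRM_RESOURCE --clear --resource "$1" --node "$2"
}
```

Unlike a failcount, the ban does not expire or get wiped by a cleanup of the failure history, so someone has to run the clear step by hand after fixing the node.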
>> - Original Message -
>>> From: Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>
>>> To: users@clusterlabs.org; kgail...@redhat.com
>>> Cc:
>>> Date: 2017/2/6, Mon 17:44
>>> Subject: [ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.
>>>
>>>>>> Ken Gaillot <kgail...@redhat.com> wrote on 02.02.2017 at
>>> 19:33 in message
>>> <91a83571-9930-94fd-e635-962830671...@redhat.com>:
>>>> On 02/02/2017 12:23 PM, renayama19661...@ybb.ne.jp wrote:
>>>>> Hi All,
>>>>>
>>>>> With the following fix, users can no longer set a value other than zero
>>>>> with crm_failcount.
>>>>>
>>>>> - [Fix: tools: implement crm_failcount command-line options correctly]
>>>>> - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
>>>>>
>>>>> However, pgsql RA sets INFINITY in a script.
>>>>>
>>>>> ```
>>>>> (snip)
>>>>> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>>>>> (snip)
>>>>> ocf_exit_reason "My data is newer than new master's one.