Re: [ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.

2017-02-09 Thread Ferenc Wágner
Jehan-Guillaume de Rorthais  writes:

> PAF uses private attributes to pass information between actions. We
> detect the failure during the notify as well, but raise the error
> during the promotion itself. See how I dealt with this in PAF:
>
> https://github.com/ioguix/PAF/commit/6123025ff7cd9929b56c9af2faaefdf392886e68

This is the first time I've heard about private attributes.  Since they
could come in handy one day, I'd like to understand them better.  After
some reading, they seem to be node attributes, not resource attributes.
This may be irrelevant for PAF, but doesn't it mean that two resources
of the same type on the same node would interfere with each other?
Also, your _set_priv_attr could fall into an infinite loop if another
instance used it at the wrong moment.  Am I missing something here?
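
A minimal sketch of one way to address both concerns, assuming a shell
agent: fold the resource instance name into the attribute name so that two
resources on the same node do not share a private attribute, and bound the
number of attempts when setting it. The helper names priv_attr_name and
set_priv_attr are hypothetical and this is not PAF's implementation (PAF is
written in Perl); the only real command used is attrd_updater with its
-p (private), -n (name) and -U (update) options.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not taken from PAF.
ATTRD_UPDATER="${ATTRD_UPDATER:-attrd_updater}"

# Build a node-attribute name that is unique per resource instance,
# e.g. "cancel_promote-pgsql-ha".
# $1 = key, $2 = resource instance (defaults to OCF_RESOURCE_INSTANCE)
priv_attr_name() {
    printf '%s-%s\n' "$1" "${2:-$OCF_RESOURCE_INSTANCE}"
}

# Set a private attribute, giving up after a fixed number of attempts
# instead of looping forever.
# $1 = key, $2 = value, $3 = maximum attempts (default 5)
set_priv_attr() {
    spa_name=$(priv_attr_name "$1")
    spa_tries="${3:-5}"
    while [ "$spa_tries" -gt 0 ]; do
        if "$ATTRD_UPDATER" -p -n "$spa_name" -U "$2"; then
            return 0
        fi
        spa_tries=$((spa_tries - 1))
        sleep 1
    done
    return 1
}
```

With OCF_RESOURCE_INSTANCE=pgsql-ha, set_priv_attr cancel_promote 1 would
update the attribute cancel_promote-pgsql-ha and return non-zero if every
attempt fails.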
-- 
Thanks,
Feri

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.

2017-02-09 Thread Ken Gaillot
On 02/09/2017 05:46 AM, Jehan-Guillaume de Rorthais wrote:
> On Thu, 9 Feb 2017 19:24:22 +0900 (JST)
> renayama19661...@ybb.ne.jp wrote:
> 
>> Hi Ken,
>>
>>
>>> 1. Return a "hard" error such as OCF_ERR_ARGS or OCF_ERR_PERM. When
>>> Pacemaker gets one of these errors from an agent, it will ban the
>>> resource from that node (until the failure is cleared).  
>>
>> The first suggestion does not work well.
>>
>> Even if the RA returns OCF_ERR_ARGS or OCF_ERR_PERM, this happens in the
>> RA's pre-promote (notify) handling, and Pacemaker does not record a
>> notify (pre-promote) error in the CIB.
>>
>>  * https://github.com/ClusterLabs/pacemaker/blob/master/crmd/lrm.c#L2411
>>
>> Because the error is not recorded in the CIB, pengine cannot treat it as
>> a "hard error".

Ah, I didn't think of that.

> Indeed. That's why PAF uses private attributes to pass information between
> actions. We detect the failure during the notify as well, but raise the error
> during the promotion itself. See how I dealt with this in PAF:
> 
> https://github.com/ioguix/PAF/commit/6123025ff7cd9929b56c9af2faaefdf392886e68

That's a nice use of private attributes.
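
A minimal sketch of the deferral pattern (detect the problem in the notify
action, raise the error in the promote action), written here with a marker
file under $HA_RSCTMP rather than a private attribute so that it has no
dependency on the attribute daemon. The function names and the marker file
name are hypothetical, the /tmp fallback exists only so the snippet runs
outside a cluster, and which OCF error code promote should return is a
design choice left open here.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not taken from PAF or pgsql.
HA_RSCTMP="${HA_RSCTMP:-/tmp}"          # normally provided by the cluster
: "${OCF_SUCCESS:=0}"                   # normally from ocf-shellfuncs
: "${OCF_ERR_GENERIC:=1}"

# One marker file per resource instance, so instances do not collide.
marker_file() {
    printf '%s/%s.cancel_promote\n' "$HA_RSCTMP" "$OCF_RESOURCE_INSTANCE"
}

# Called from the pre-promote notify action: remember the problem,
# because a failed notify is not recorded in the CIB.
note_promote_problem() {
    printf '%s\n' "$1" > "$(marker_file)"
}

# Called at the start of the promote action: fail if notify left a marker.
check_promote_allowed() {
    if [ -e "$(marker_file)" ]; then
        echo "promotion cancelled: $(cat "$(marker_file)")" >&2
        rm -f "$(marker_file)"
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}
```

The marker is removed once it has been reported, so a later promotion
attempt is not blocked by a stale file.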

> As private attributes do not work on older stacks, you could rely on a local
> temp file in $HA_RSCTMP as well.
> 
>>> 2. Use crm_resource --ban instead. This would ban the resource from that
>>> node until the user removes the ban with crm_resource --clear (or by
>>> deleting the ban constraint from the configuration).  
>>
>> The second suggestion works well.
>> I intend to adopt the second suggestion.
>>
>> As another method, crm_resource -F also seems to be available. What do you
>> think? If a pseudo-failure is raised with crm_resource -F, I think
>> last-failure is handled without problems as well.
>>
>> crm_resource -F may be usable, but I will adopt crm_resource -B because
>> the RA needs to stop the pgsql resource completely.
>>
>> ``` @pgsql RA
>>
>> pgsql_pre_promote() {
>> (snip)
>>     if [ "$cmp_location" != "$my_master_baseline" ]; then
>>         ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>>         exec_with_retry 0 $CRM_RESOURCE -B -r $OCF_RESOURCE_INSTANCE -N $NODENAME -Q
>>         return $OCF_ERR_GENERIC
>>     fi
>> (snip)
>> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>> CRM_RESOURCE="${HA_SBIN_DIR}/crm_resource"
>> ```
>>
>> I will test the behavior a little more and then send a patch.
> 
> I suppose crm_resource -F will just raise the failcount, break the current
> transition and the CRM will recompute another transition paying attention to
> your "failed" resource (will it try to recover it? retry the previous
> transition again?).
> 
> I would bet on crm_resource -B.

Correct, crm_resource -F only simulates OCF_ERR_GENERIC, which is a soft
error. It might be a nice extension to be able to specify the error
code, but in this case, I think crm_resource -B (or the private
attribute approach, if you're OK with limiting it to corosync 2 and
pacemaker 1.1.13+) is better.
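
A minimal sketch of the ban approach and of the cleanup the administrator
has to do afterwards. The ban command uses the same options as the pgsql
snippet quoted above; the wrapper function names ban_resource and clear_ban
are hypothetical, and the resource and node names in the usage note are
placeholders.

```shell
#!/bin/sh
# Sketch only: wrapper names are illustrative.
CRM_RESOURCE="${CRM_RESOURCE:-crm_resource}"

# Ban the resource from one node. Unlike a fail-count, the ban stays in
# place until someone removes it.
# $1 = resource name, $2 = node name
ban_resource() {
    "$CRM_RESOURCE" -B -r "$1" -N "$2" -Q
}

# Remove the ban once the underlying problem has been fixed by hand.
# $1 = resource name, $2 = node name
clear_ban() {
    "$CRM_RESOURCE" --clear -r "$1" -N "$2"
}
```

For example, ban_resource pgsql-ha node1 from within the agent, and later
clear_ban pgsql-ha node1 from an administrator's shell.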

>> - Original Message -
>>> From: Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>
>>> To: users@clusterlabs.org; kgail...@redhat.com
>>> Cc: 
>>> Date: 2017/2/6, Mon 17:44
>>> Subject: [ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.
>>>
>>>>>> Ken Gaillot <kgail...@redhat.com> wrote on 02.02.2017 at 19:33 in message
>>>>>> <91a83571-9930-94fd-e635-962830671...@redhat.com>:
>>>> On 02/02/2017 12:23 PM, renayama19661...@ybb.ne.jp wrote:
>>>>> Hi All,
>>>>>
>>>>> Since the following fix, users can no longer set crm_failcount to any
>>>>> value other than zero.
>>>>>
>>>>>  - [Fix: tools: implement crm_failcount command-line options correctly]
>>>>>    - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
>>>>>
>>>>> However, pgsql RA sets INFINITY in a script.
>>>>>
>>>>> ```
>>>>> (snip)
>>>>> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>>>>> (snip)
>>>>> ocf_exit_reason "My data is newer than new master's one.

[ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.

2017-02-06 Thread Ulrich Windl
>>> Ken Gaillot wrote on 02.02.2017 at 19:33 in message
>>> <91a83571-9930-94fd-e635-962830671...@redhat.com>:
> On 02/02/2017 12:23 PM, renayama19661...@ybb.ne.jp wrote:
>> Hi All,
>> 
>> Since the following fix, users can no longer set crm_failcount to any
>> value other than zero.
>> 
>>  - [Fix: tools: implement crm_failcount command-line options correctly]
>>    - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
>> 
>> However, pgsql RA sets INFINITY in a script.
>> 
>> ```
>> (snip)
>> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>> (snip)
>> ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>> exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
>> return $OCF_ERR_GENERIC
>> (snip)
>> ```
>> 
>> As far as I can tell, only the pgsql RA is affected.
>> 
>> Could you change crm_failcount so that it accepts values other than zero
>> again? If that is not possible, we will modify the pgsql RA to use
>> crm_attribute instead.
>> 
>> Best Regards,
>> Hideo Yamauchi.
> 
> Hmm, I didn't realize that was used. I changed it because it's not a
> good idea to set fail-count without also changing last-failure and
> having a failed op in the LRM history. I'll have to think about what the
> best alternative is.

Another question is whether the RA can achieve the same effect some other
way. I thought the CRM sets the failcount, not the RA...
