Re: [ClusterLabs] [Question] About a change of crm_failcount.

2017-02-09 Thread Jehan-Guillaume de Rorthais
On Thu, 09 Feb 2017 18:04:41 +0100
wf...@niif.hu (Ferenc Wágner) wrote:

> Jehan-Guillaume de Rorthais  writes:
> 
> > PAF uses private attributes to pass information between actions. We
> > detect the failure during the notify as well, but raise the error
> > during the promotion itself. See how I dealt with this in PAF:
> >
> > https://github.com/ioguix/PAF/commit/6123025ff7cd9929b56c9af2faaefdf392886e68
> >   
> 
> This is the first time I have heard about private attributes.  Since they
> could come in useful one day, I'd like to understand them better.  After
> some reading, they seem to be node attributes, not resource attributes.
> This may be irrelevant for PAF, but doesn't it mean that two resources
> of the same type on the same node would interfere with each other?
> Also, your _set_priv_attr could fall into an infinite loop if another
> instance used it at an inappropriate moment.  Am I missing something here?

No, you are perfectly right. We are aware of this; it is something we need to
fix in the next release of PAF (I was actually discussing this with a user 2
days ago on IRC :)).
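
To illustrate the collision described above: because private attributes are stored per node, two resources using the same attribute name on one node would share a single value. One possible way around that, sketched here in shell and not necessarily what PAF will do, is to include the resource instance name in the attribute name. The helper names `priv_attr_name` and `priv_attr_set_command` are hypothetical, and the second one only prints the `attrd_updater` call instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: build a per-resource attribute name so that two
# resources on the same node do not share one node-level attribute.
# $1: resource instance name, $2: logical attribute name
priv_attr_name() {
    printf '%s-%s\n' "$2" "$1"
}

# Hypothetical helper: print (rather than run) the attrd_updater call
# that would store a value under the per-resource name.
# $1: resource instance name, $2: logical attribute name, $3: value
priv_attr_set_command() {
    printf 'attrd_updater --private --name %s --update %s\n' \
        "$(priv_attr_name "$1" "$2")" "$3"
}
```

With this naming, `priv_attr_name pgsqld-a recover_master` and `priv_attr_name pgsqld-b recover_master` yield different attribute names, so the two resources no longer overwrite each other.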

Thank you for the report!
-- 
Jehan-Guillaume de Rorthais
Dalibo

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Question] About a change of crm_failcount.

2017-02-03 Thread Jehan-Guillaume de Rorthais
On Fri, 3 Feb 2017 09:45:18 -0600
Ken Gaillot  wrote:

> On 02/02/2017 12:33 PM, Ken Gaillot wrote:
> > On 02/02/2017 12:23 PM, renayama19661...@ybb.ne.jp wrote:  
> >> Hi All,
> >>
> >> Since the following fix, users can no longer set a value other than zero
> >> with crm_failcount.
> >>
> >>  - [Fix: tools: implement crm_failcount command-line options correctly]
> >>    - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
> >>
> >> However, the pgsql RA sets the fail count to INFINITY in its script.
> >>
> >> ```
> >> (snip)
> >> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
> >> (snip)
> >> ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
> >> exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
> >> return $OCF_ERR_GENERIC
> >> (snip)
> >> ```
> >>
> >> As far as I can tell, only the pgsql RA is affected.
> >>
> >> Could you change crm_failcount so that it can again set a value other than
> >> zero? If that is not possible, we will modify the pgsql RA to use
> >> crm_attribute instead.
> >>
> >> Best Regards,
> >> Hideo Yamauchi.  
> > 
> > Hmm, I didn't realize that was used. I changed it because it's not a
> > good idea to set fail-count without also changing last-failure and
> > having a failed op in the LRM history. I'll have to think about what the
> > best alternative is.  
> 
> Having a resource agent modify its own fail count is not a good idea,
> and could lead to unpredictable behavior. I didn't realize the pgsql
> agent did that.
> 
> I don't want to re-enable the functionality, because I don't want to
> encourage more agents doing this.
> 
> There are two alternatives the pgsql agent can choose from:
> 
> 1. Return a "hard" error such as OCF_ERR_ARGS or OCF_ERR_PERM. When
> Pacemaker gets one of these errors from an agent, it will ban the
> resource from that node (until the failure is cleared).
> 
> 2. Use crm_resource --ban instead. This would ban the resource from that
> node until the user removes the ban with crm_resource --clear (or by
> deleting the ban constraint from the configuration).
> 
> I'd recommend #1 since it does not require any pacemaker-specific tools.
> 
> We can make sure resource-agents has a fix for this before we release a
> new version of Pacemaker. We'll have to publicize as much as possible to
> pgsql users that they should upgrade resource-agents before or at the
> same time as pacemaker. I see the alternative PAF agent has the same
> usage, so it will need to be updated, too.

Yes, I was following this conversation.

I'll do the fix on our side.

Thank you!



Re: [ClusterLabs] [Question] About a change of crm_failcount.

2017-02-03 Thread Ken Gaillot
On 02/02/2017 12:33 PM, Ken Gaillot wrote:
> On 02/02/2017 12:23 PM, renayama19661...@ybb.ne.jp wrote:
>> Hi All,
>>
>> Since the following fix, users can no longer set a value other than zero
>> with crm_failcount.
>>
>>  - [Fix: tools: implement crm_failcount command-line options correctly]
>>    - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
>>
>> However, the pgsql RA sets the fail count to INFINITY in its script.
>>
>> ```
>> (snip)
>> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>> (snip)
>> ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>> exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
>> return $OCF_ERR_GENERIC
>> (snip)
>> ```
>>
>> As far as I can tell, only the pgsql RA is affected.
>>
>> Could you change crm_failcount so that it can again set a value other than
>> zero? If that is not possible, we will modify the pgsql RA to use
>> crm_attribute instead.
>>
>> Best Regards,
>> Hideo Yamauchi.
> 
> Hmm, I didn't realize that was used. I changed it because it's not a
> good idea to set fail-count without also changing last-failure and
> having a failed op in the LRM history. I'll have to think about what the
> best alternative is.

Having a resource agent modify its own fail count is not a good idea,
and could lead to unpredictable behavior. I didn't realize the pgsql
agent did that.

I don't want to re-enable the functionality, because I don't want to
encourage more agents doing this.

There are two alternatives the pgsql agent can choose from:

1. Return a "hard" error such as OCF_ERR_ARGS or OCF_ERR_PERM. When
Pacemaker gets one of these errors from an agent, it will ban the
resource from that node (until the failure is cleared).

2. Use crm_resource --ban instead. This would ban the resource from that
node until the user removes the ban with crm_resource --clear (or by
deleting the ban constraint from the configuration).
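
For illustration, alternative 2 might look roughly like this from a shell. This is only a sketch: the helper names `ban_command` and `clear_command` are hypothetical, the resource and node names are placeholders, and the helpers print the `crm_resource` invocations instead of running them so the sketch can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the command that bans a resource from a node.
# $1: resource name, $2: node name
ban_command() {
    printf 'crm_resource --ban -r %s -N %s\n' "$1" "$2"
}

# Hypothetical helper: print the command an administrator would later
# use to lift that ban again.
# $1: resource name, $2: node name
clear_command() {
    printf 'crm_resource --clear -r %s -N %s\n' "$1" "$2"
}

# Placeholder names, for demonstration only.
ban_command pgsql node1
clear_command pgsql node1
```

Unlike a failure, the ban stays in the configuration as a constraint until someone clears it.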

I'd recommend #1 since it does not require any pacemaker-specific tools.
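
A minimal sketch of alternative 1, assuming the check happens inside an agent action. The function name `reject_if_data_newer` and its "yes"/"no" argument are hypothetical; the numeric fallbacks mirror the standard OCF return codes that a real agent would get from ocf-shellfuncs, and a real agent would call `ocf_exit_reason` where this sketch uses `echo`.

```shell
#!/bin/sh
# Fallbacks so the sketch runs outside a cluster; a real agent sources
# ocf-shellfuncs, which defines these variables.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_ERR_ARGS:=2}"

# Hypothetical helper, not taken from the pgsql RA.
# $1 is "yes" when the local data is newer than the new master's.
reject_if_data_newer() {
    if [ "$1" = "yes" ]; then
        # Stand-in for ocf_exit_reason in a real agent.
        echo "My data is newer than new master's one." >&2
        # Return a "hard" error instead of touching the fail count.
        return "$OCF_ERR_ARGS"
    fi
    return "$OCF_SUCCESS"
}
```

The agent no longer calls crm_failcount at all; it only reports the hard error and leaves the banning to Pacemaker.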

We can make sure resource-agents has a fix for this before we release a
new version of Pacemaker. We'll have to publicize as much as possible to
pgsql users that they should upgrade resource-agents before or at the
same time as pacemaker. I see the alternative PAF agent has the same
usage, so it will need to be updated, too.



Re: [ClusterLabs] [Question] About a change of crm_failcount.

2017-02-02 Thread Ken Gaillot
On 02/02/2017 12:23 PM, renayama19661...@ybb.ne.jp wrote:
> Hi All,
> 
> Since the following fix, users can no longer set a value other than zero
> with crm_failcount.
>
>  - [Fix: tools: implement crm_failcount command-line options correctly]
>    - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
>
> However, the pgsql RA sets the fail count to INFINITY in its script.
> 
> ```
> (snip)
> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
> (snip)
> ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
> exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
> return $OCF_ERR_GENERIC
> (snip)
> ```
> 
> As far as I can tell, only the pgsql RA is affected.
>
> Could you change crm_failcount so that it can again set a value other than
> zero? If that is not possible, we will modify the pgsql RA to use
> crm_attribute instead.
> 
> Best Regards,
> Hideo Yamauchi.

Hmm, I didn't realize that was used. I changed it because it's not a
good idea to set fail-count without also changing last-failure and
having a failed op in the LRM history. I'll have to think about what the
best alternative is.
