Re: [ClusterLabs] monitor failed actions not cleared
On Mon, 2017-10-02 at 13:29 +, LE COQUIL Pierre-Yves wrote:
> Hi,
>
> I finally found my mistake:
> I have set up the failure-timeout like the lifetime example in the
> Red Hat documentation with the value PT1M.
> If I set up the failure-timeout with 60, it works like it should.

This is a bug somewhere in Pacemaker. I recently got a bug report related to recurring monitors, so I'm taking a closer look at time interval handling in general. I'll make sure to figure out where this one is.

> Just trying a last question ...:
> Couldn't it be something in the log telling the value isn't in the
> right format?

Definitely, it should ... though in this case, it should parse PT1M correctly to begin with.

> Pierre-Yves
>
> From: LE COQUIL Pierre-Yves
> Sent: Wednesday, September 27, 2017 19:37
> To: 'users@clusterlabs.org'
> Subject: RE: monitor failed actions not cleared
>
> From: LE COQUIL Pierre-Yves
> Sent: Monday, September 25, 2017 16:58
> To: 'users@clusterlabs.org'
> Subject: monitor failed actions not cleared
>
> Hi,
>
> I'm using Pacemaker 1.1.15-11.el7_3.4 / Corosync 2.4.0-4.el7 under
> CentOS 7.3.1611
>
> => Is this configuration too old? (yum indicates these versions are
> up to date)

No, those are recent versions. CentOS 7.4 has slightly newer versions, but there's nothing wrong with staying on those for now.

> => Should I install more recent versions of Pacemaker and Corosync?
>
> My subject is very close to the post "clearing failed actions"
> initiated by Attila Megyeri in May 2017.
> But the issue doesn't fit my case.
>
> What I want to do is:
> - 2 systemd resources running on 1 of the 2 nodes of my cluster,
> - When 1 resource fails (by killing it or by moving the resource),
>   I want it to be restarted on the other node, but I want the other
>   resource still running on the same node.
>
> => Is this possible with Pacemaker?
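[As an aside on the duration format discussed at the top of this message: the two failure-timeout values being compared are an ISO 8601 duration, "PT1M", and a bare number of seconds, "60" — both should denote one minute. The sketch below is not Pacemaker's actual parser, just an illustration of what a correct parse of these documented formats should yield.]

```python
import re

def parse_duration(spec):
    """Parse a bare number of seconds (e.g. "60") or a simple ISO 8601
    duration (e.g. "PT1M") into seconds.

    Illustrative sketch only -- Pacemaker's real parser (written in C)
    handles more forms; this just shows the expected equivalences.
    """
    # A bare integer is already a count of seconds.
    if re.fullmatch(r"\d+", spec):
        return int(spec)

    # Minimal ISO 8601 duration: P[nD][T[nH][nM][nS]]
    m = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?", spec)
    if m is None:
        raise ValueError("unrecognized duration: %r" % spec)

    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

With this, `parse_duration("PT1M")` and `parse_duration("60")` both come out as 60 — which is why the two configurations in the message above should have behaved identically.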
> What I have done in addition to the default parameters:
> - For my resources:
>   o migration-threshold=1
>   o failure-timeout=PT1M
> - For the cluster:
>   o cluster-recheck-interval=120
>
> I have added for my resource monitor operation: on-fail=restart
> (which is the default)
>
> I do not use fencing (stonith-enabled=false)
> => Is fencing compatible with my goal?

Yes, and fencing should be considered a requirement for a stable cluster. Fencing handles node-level failures rather than resource-level failures. If a node becomes unresponsive, the rest of the cluster can't know whether it is inoperative (and thus unable to pose any conflict) or just misbehaving (perhaps the CPU is overloaded, or a network card went out, or ...), in which case it's not safe to recover its resources elsewhere. Fencing makes certain it's safe.

> What happens:
> - When I kill or move 1 resource, it is restarted on the other node => OK
> - The failcount is incremented to 1 for this resource => OK
> - The failcount is never cleared => NOK
>
> => I get a warning in the log:
> "pengine: warning: unpack_rsc_op_failure: Processing failed
> op monitor for ACTIVATION_KX on metro.cas-n1: not running (7)"
> when my resource ACTIVATION_KX has been killed on node metro.cas-n1,
> but pcs status shows ACTIVATION_KX is started on the other node

It's a longstanding to-do to improve this message ... it doesn't (necessarily) mean any new failure has occurred. It just means the policy engine is processing the resource history, which includes a failure (which could be recent, or old). The log message will show up every time the policy engine runs, and the failure will continue to be displayed in the status failure history, until you clean up the resource.

> => Is it a bad monitor operation configuration for my resource? (I
> have added "requires=nothing")

Your configuration is fine, although "requires" has no effect in a monitor operation.
It's only relevant for start and promote operations, and even then, it's deprecated to set it in the operation configuration ... it belongs in the resource configuration now. "requires=nothing" is highly unlikely to be what you want, though; the default is usually sufficient.

> I know that my English and my Pacemaker knowledge are not so high, but
> could you please give me some explanations about this behavior that I
> misunderstand.

Not at all, this was a very clear and well-thought-out post :)

> => If something is wrong with my post, just tell me (this is my first)
>
> Thank you
>
> Pierre-Yves Le Coquil

--
Ken Gaillot

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
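[Summing up the thread: the fixes discussed above can be expressed with pcs roughly as follows. This is a hedged sketch, not commands from the thread itself — ACTIVATION_KX is the resource name the poster used, and exact pcs syntax may vary by version.]

```shell
# Work around the PT1M parsing problem by giving failure-timeout as a
# plain number of seconds:
pcs resource update ACTIVATION_KX meta migration-threshold=1 failure-timeout=60

# Expired failures are only cleaned up when the policy engine re-runs,
# so the failcount can persist up to cluster-recheck-interval after
# failure-timeout has elapsed:
pcs property set cluster-recheck-interval=120

# "requires" belongs on the resource, not the monitor operation, and
# the default is usually what you want; an empty value unsets a meta
# attribute in pcs:
pcs resource update ACTIVATION_KX meta requires=

# Fencing is strongly recommended; re-enable it once a fence device
# (e.g. fence_ipmilan -- device parameters omitted here) is configured:
pcs property set stonith-enabled=true

# Manually clear a recorded failure once it has been investigated:
pcs resource cleanup ACTIVATION_KX
```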