Re: [ClusterLabs] monitor failed actions not cleared
On Mon, 2017-10-02 at 13:29 +, LE COQUIL Pierre-Yves wrote:
> Hi,
>
> I finally found my mistake:
> I have set up the failure-timeout like the lifetime example in the
> Red Hat documentation with the value PT1M.
> If I set up the failure-timeout with 60, it works like it should.

This is a bug somewhere in Pacemaker. I recently got a bug report related to recurring monitors, so I'm taking a closer look at time interval handling in general. I'll make sure to figure out where this one is.

> Just trying a last question ...:
> Couldn't it be something in the log telling the value isn't in the
> right format?

Definitely, it should ... though in this case, it should parse PT1M correctly to begin with.

> Pierre-Yves
>
> From: LE COQUIL Pierre-Yves
> Sent: Wednesday, September 27, 2017 19:37
> To: 'users@clusterlabs.org'
> Subject: RE: monitor failed actions not cleared
>
> From: LE COQUIL Pierre-Yves
> Sent: Monday, September 25, 2017 16:58
> To: 'users@clusterlabs.org'
> Subject: monitor failed actions not cleared
>
> Hi,
>
> I'm using Pacemaker 1.1.15-11.el7_3.4 / Corosync 2.4.0-4.el7 under
> CentOS 7.3.1611
>
> => Is this configuration too old? (yum indicates these versions are
> up to date)

No, those are recent versions. CentOS 7.4 has slightly newer versions, but there's nothing wrong with staying on those for now.

> => Should I install more recent versions of Pacemaker and Corosync?
>
> My subject is very close to the post "clearing failed actions"
> initiated by Attila Megyeri in May 2017.
> But the issue doesn't fit my case.
>
> What I want to do is:
> - 2 systemd resources running on 1 of the 2 nodes of my cluster,
> - When 1 resource fails (by killing it or by moving the resource),
>   I want it to be restarted on the other node, but I want the other
>   resource still running on the same node.
>
> => Is this possible with Pacemaker?
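[As an aside on the duration format discussed at the top of this message: the two failure-timeout values being compared are an ISO 8601 duration, "PT1M", and a bare number of seconds, "60" — both should denote one minute. The sketch below is not Pacemaker's actual parser, just an illustration of what a correct parse of these documented formats should yield.]

```python
import re

def parse_duration(spec):
    """Parse a bare number of seconds (e.g. "60") or a simple ISO 8601
    duration (e.g. "PT1M") into seconds.

    Illustrative sketch only -- Pacemaker's real parser (written in C)
    handles more forms; this just shows the expected equivalences.
    """
    # A bare integer is already a count of seconds.
    if re.fullmatch(r"\d+", spec):
        return int(spec)

    # Minimal ISO 8601 duration: P[nD][T[nH][nM][nS]]
    m = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?", spec)
    if m is None:
        raise ValueError("unrecognized duration: %r" % spec)

    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

With this, `parse_duration("PT1M")` and `parse_duration("60")` both come out as 60 — which is why the two configurations in the message above should have behaved identically.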
> What I have done in addition to the default parameters:
> - For my resources:
>   o migration-threshold=1
>   o failure-timeout=PT1M
> - For the cluster:
>   o cluster-recheck-interval=120
>
> I have added for my resource monitor operation: on-fail=restart
> (which is the default)
>
> I do not use fencing (stonith-enabled=false)
> => Is fencing compatible with my goal?

Yes, and fencing should be considered a requirement for a stable cluster. Fencing handles node-level failures rather than resource-level failures. If a node becomes unresponsive, the rest of the cluster can't know whether it is inoperative (and thus unable to pose any conflict) or just misbehaving (perhaps the CPU is overloaded, or a network card went out, or ...), in which case it's not safe to recover its resources elsewhere. Fencing makes certain it's safe.

> What happens:
> - When I kill or move 1 resource, it is restarted on the other node => OK
> - The failcount is incremented to 1 for this resource => OK
> - The failcount is never cleared => NOK
>
> => I get a warning in the log:
> "pengine: warning: unpack_rsc_op_failure: Processing failed
> op monitor for ACTIVATION_KX on metro.cas-n1: not running (7)"
> when my resource ACTIVATION_KX has been killed on node metro.cas-n1,
> but pcs status shows ACTIVATION_KX is started on the other node

It's a longstanding to-do to improve this message ... it doesn't (necessarily) mean any new failure has occurred. It just means the policy engine is processing the resource history, which includes a failure (which could be recent, or old). The log message will show up every time the policy engine runs, and the failure will continue to be displayed in the status failure history, until you clean up the resource.

> => Is it a bad monitor operation configuration for my resource? (I
> have added "requires=nothing")

Your configuration is fine, although "requires" has no effect in a monitor operation.
It's only relevant for start and promote operations, and even then, it's deprecated to set it in the operation configuration ... it belongs in the resource configuration now. "requires=nothing" is highly unlikely to be what you want, though; the default is usually sufficient.

> I know that my English and my Pacemaker knowledge are not so high, but
> could you please give me some explanations about this behavior that I
> misunderstand.

Not at all, this was a very clear and well-thought-out post :)

> => If something is wrong with my post, just tell me (this is my first)
>
> Thank you
>
> Pierre-Yves Le Coquil

--
Ken Gaillot

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
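[Summing up the thread: the fixes discussed above can be expressed with pcs roughly as follows. This is a hedged sketch, not commands from the thread itself — ACTIVATION_KX is the resource name the poster used, and exact pcs syntax may vary by version.]

```shell
# Work around the PT1M parsing problem by giving failure-timeout as a
# plain number of seconds:
pcs resource update ACTIVATION_KX meta migration-threshold=1 failure-timeout=60

# Expired failures are only cleaned up when the policy engine re-runs,
# so the failcount can persist up to cluster-recheck-interval after
# failure-timeout has elapsed:
pcs property set cluster-recheck-interval=120

# "requires" belongs on the resource, not the monitor operation, and
# the default is usually what you want; an empty value unsets a meta
# attribute in pcs:
pcs resource update ACTIVATION_KX meta requires=

# Fencing is strongly recommended; re-enable it once a fence device
# (e.g. fence_ipmilan -- device parameters omitted here) is configured:
pcs property set stonith-enabled=true

# Manually clear a recorded failure once it has been investigated:
pcs resource cleanup ACTIVATION_KX
```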