Re: [ClusterLabs] Antw: Re: Antw: [EXT] Coming in 2.1.3: node health monitoring improvements

Klaus Wenninger Thu, 14 Apr 2022 08:00:56 -0700

On Thu, Apr 14, 2022 at 7:57 AM Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
>
> Ken,
>
> thanks for the explanations! Maybe it would be best (next time) if you
> present the documentation for a new feature first (as a base for discussion),
> and _then_ implement it.
> I know: People first implement it, and later, if they have time or feel like
> it, they'll document.
> However, as I found out for myself, sometimes documentation is really useful
> when you review your code some time later and wonder: "What should have been
> the purpose of all that?" ;-)
Guess what has proven to work quite well, also efficiency-wise, is an
iterative approach ;-)
That is: start with a rough idea, implement something including initial
documentation, play with it, discuss it, and then improve the
documentation and implementation based on feedback ...


Klaus
>
> Regards,
> Ulrich
>
> >>> Ken Gaillot <kgail...@redhat.com> wrote on 13.04.2022 at 15:59 in
> message
> <3ad20a26a4623d2e7ff11eb0bdf822faae1a5114.ca...@redhat.com>:
> > On Wed, 2022-04-13 at 08:22 +0200, Ulrich Windl wrote:
> >> > > > Ken Gaillot <kgail...@redhat.com> wrote on 12.04.2022 at 17:22 in
> >> message
> >> <33f4147d0f6a3e46581aaa46a4eca81dfa59ce15.ca...@redhat.com>:
> >> > Hi all,
> >> >
> >> > I'm hoping to have the first release candidate for 2.1.3 ready next
> >> > week.
> >> >
> >> > Pacemaker has long had a feature to monitor node health (CPU usage,
> >> > SMART drive errors, etc.) and move resources off degraded nodes:
> >> >
> >> >
> >> > https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/singlehtml/index.html#tracking-node-health
> >>
> >> Great, I wanted to ask a question on it anyway:
> >> Is the node health attribute stored in the CIB, or is it transient
> >> (i.e.:
> >> reset when the node is restarted)?
> >
> > They can be either, although transient makes more sense. As long as the
> > name starts with "#health" it will be treated as a health attribute.
> >
> >>
> >> Some comments on the docs:
> >>
> >> "yellow" state: could also mean node is becoming healthy (coming from
> >> red),
> >> right?
> >
> > True, I'll make a note to update that
> >
> >>
> >> The "Node Health Strategy" could benefit from a better explanation.
> >> E.g.: "Assign the value of ..." Assign to whom/what?
> >
> > The wording could definitely be improved.
> >
> > In this case, the idea is that "red", "yellow", and "green" are just
> > convenient names for particular integer scores. The actual values used
> > depend on the strategy, hence "assign ... to red" and so forth.
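The "names for integer scores" idea can be sketched in a few lines of Python. This is only an illustration, not Pacemaker code: the function name health_scores is hypothetical, and the per-strategy values are taken from my reading of the "Tracking Node Health" section of Pacemaker Explained linked above, so check them against the documentation. The "none" and "custom" strategies are deliberately left out.

```python
import math

NEG_INF = -math.inf  # stand-in for Pacemaker's -INFINITY score


def health_scores(strategy, red=0, yellow=0, green=0):
    """Return the score each colour name stands for under a strategy.

    red/yellow/green are only consulted for "progressive"; they stand in
    for the node-health-red/-yellow/-green cluster options (assumption:
    values as described in Pacemaker Explained).
    """
    if strategy == "migrate-on-red":
        return {"red": NEG_INF, "yellow": 0, "green": 0}
    if strategy == "only-green":
        return {"red": NEG_INF, "yellow": NEG_INF, "green": 0}
    if strategy == "progressive":
        return {"red": red, "yellow": yellow, "green": green}
    raise ValueError(f"strategy not modelled in this sketch: {strategy}")


print(health_scores("progressive", red=-100, yellow=-10))
# prints {'red': -100, 'yellow': -10, 'green': 0}
```

So "assign ... to red" means: pick the number that the name "red" translates to when node scores are calculated.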
> >
> >> It's very hard to find out what "progressive" really does.
> >>
> >> I think a configuration example with a sample scenario (node health
> >> changes)
> >> would be very helpful.
> >
> > Yes, progressive and custom are confusing without examples. I'll add
> > that to the to-do list ...
> >
> > The idea behind progressive is that you might want to give a negative
> > but not infinite preference to yellow and/or red. With the other
> > strategies, any red attribute will cause all resources to move off.
> > With progressive, you could set red to some number (say -100) and that
> > score would be used just as if you had configured a location constraint
> > with that score. If you had stickiness higher than that, that would
> > keep any existing resources running there, but prevent any new
> > resources from being moved to the node.
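The interaction between a finite red score and stickiness described above can be played through with a toy model. This is a rough sketch under stated assumptions, not how the Pacemaker scheduler is implemented: the function choose_node and the node names are hypothetical, stickiness is simply added to the score of the node the resource currently runs on, and a node with a negative total is treated as not eligible. The real scheduler weighs many more factors.

```python
def choose_node(health, current, stickiness):
    """Pick a node for one resource in a toy scoring model.

    health     -- dict mapping node name to its health score
    current    -- node the resource runs on now, or None if it is new
    stickiness -- bonus added to the current node's score
    """
    totals = {
        node: score + (stickiness if node == current else 0)
        for node, score in health.items()
    }
    # Assumption of this sketch: nodes with a negative total are not used.
    eligible = {node: total for node, total in totals.items() if total >= 0}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# node1 is red with a progressive score of -100, node2 is healthy.
health = {"node1": -100, "node2": 0}

print(choose_node(health, current="node1", stickiness=200))  # node1
print(choose_node(health, current=None, stickiness=200))     # node2
print(choose_node(health, current="node1", stickiness=50))   # node2
```

With stickiness 200 the existing resource stays on the red node (200 - 100 is still positive), a new resource goes elsewhere, and with stickiness 50 the existing resource moves away as well.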
> >
> >>
> >> > The 2.1.3 release will add a couple of features to make this more
> >> > useful.
> >> >
> >> > First, you can now exempt particular resources from health-related
> >> > bans, using the new "allow-unhealthy-nodes" resource
> >> > meta-attribute.
> >>
> >> If that's a resource attribute, then the name is poorly chosen
> >> (IMHO).
> >> In times like these I'd almost suggest naming it
> >> "immune-against-node-health=red" or so (OK, just a joke).
> >
> > I always agonize over the names :)
> >
> > What I really wanted was to use the existing "requires" meta-attribute.
> > It currently can be set to nothing, quorum, fencing, or unfencing, to
> > determine what conditions have to be in place for the resource to run
> > (the default of fencing means that the cluster partition must have
> > quorum and any unclean nodes must have been successfully fenced).
> >
> > It would have been nice to have requires="fencing,health" mean that the
> > resource can only run on a healthy node (as defined by the configured
> > strategy). Unfortunately that would not have been backward compatible
> > with existing explicit configurations.
> >
> >>
> >>
> >> > This is particularly helpful for the health monitoring agents
> >> > themselves. Without the new option, health agents get moved off
> >>
> >> Specifically if the health state can improve again.
> >>
> >> > degraded nodes, which means the cluster can't detect if the
> >> > degraded
> >> > condition goes away. Users had to manually clear the health
> >> > attributes
> >> > to allow resources to move back to the node. Now, you can set
> >> > allow-unhealthy-nodes=true on your health agent resources, so they
> >> > can continue detecting changes in the health status.
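The effect of the new meta-attribute can be pictured with a small Python sketch. It is an illustration only, not Pacemaker code: the function health_ban_applies is hypothetical, resource meta-attributes are modelled as a plain dict of strings, and the default of "false" is an assumption inferred from the statement that health agents get moved off without the option.

```python
import math


def health_ban_applies(meta, node_health_score):
    """Return True if a health-related ban keeps the resource off the node.

    meta              -- dict of the resource's meta-attributes
    node_health_score -- combined health score of the node
    """
    # Assumption: the option defaults to "false" when it is not set.
    exempt = meta.get("allow-unhealthy-nodes", "false") == "true"
    return node_health_score < 0 and not exempt


red_node = -math.inf  # e.g. red under the migrate-on-red strategy

health_agent = {"allow-unhealthy-nodes": "true"}
ordinary_resource = {}

print(health_ban_applies(health_agent, red_node))       # False
print(health_ban_applies(ordinary_resource, red_node))  # True
```

The exempt health agent keeps running on the red node, so it can report when the node turns green again, while ordinary resources are still kept away.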
> >> >
> >> > Second, crm_mon will indicate when a node's health is yellow or
> >> > red,
> >> > like:
> >> >
> >> >     * Node List:
> >> >         * Node node1: online (health is RED)
> >>
> >> For compatibility I'd prefer a new option to display those, or at
> >> least a new
> >> item; maybe like:
> >> ----
> >> Node Health:
> >>   * Node: h16: green
> >>   ...
> >> ----
> >>
> >> or
> >>
> >> ---
> >> Node Attributes:
> >>   * Node h16: green
> >> ---
> >
> > You can already list all attributes (including health attributes) with
> > the -A / --show-node-attributes option.
> >
> >>
> >> > Previously, you would see that the node is not running any
> >> > resources,
> >> > but not know why, unless you thought to check every node health
> >> > attribute.
> >>
> >> That's definitely a bad thing for any artificial intelligence not to
> >> be able
> >> to explain itself ;-)
> >>
> >> Regards,
> >> Ulrich
> >
> > --
> > Ken Gaillot <kgail...@redhat.com>
> >
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/

