Re: [ClusterLabs] Antw: Re: Antw: [EXT] Coming in 2.1.3: node health monitoring improvements
On Thu, Apr 14, 2022 at 7:57 AM Ulrich Windl wrote:
>
> Ken,
>
> thanks for the explanations! Maybe it would be best (next time) if you
> present the documentation for a new feature first (as a base for
> discussion), and _then_ implement it.
> I know: People first implement it, and later, if they have time or feel
> like it, they'll document.
> However, as I found out for myself, sometimes documentation is really
> useful when you review your code some time later and wonder: "What
> should have been the purpose of all that?" ;-)

Guess what has proven to work quite well - efficiency-wise as well - is
an iterative approach ;-) Like: have a rough idea - implement something,
including first documentation - play with it - discuss it - improve the
documentation/implementation through feedback ...

Klaus

> Regards,
> Ulrich
>
> >>> Ken Gaillot wrote on 13.04.2022 at 15:59 in message
> <3ad20a26a4623d2e7ff11eb0bdf822faae1a5114.ca...@redhat.com>:
> > On Wed, 2022-04-13 at 08:22 +0200, Ulrich Windl wrote:
> >> >>> Ken Gaillot wrote on 12.04.2022 at 17:22 in message
> >> <33f4147d0f6a3e46581aaa46a4eca81dfa59ce15.ca...@redhat.com>:
> >> > Hi all,
> >> >
> >> > I'm hoping to have the first release candidate for 2.1.3 ready
> >> > next week.
> >> >
> >> > Pacemaker has long had a feature to monitor node health (CPU
> >> > usage, SMART drive errors, etc.) and move resources off degraded
> >> > nodes:
> >> >
> >> > https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/singlehtml/index.html#tracking-node-health
> >>
> >> Great, I wanted to ask a question on it anyway:
> >> Is the node health attribute stored in the CIB, or is it transient
> >> (i.e. reset when the node is restarted)?
> >
> > They can be either, although transient makes more sense. As long as
> > the name starts with "#health", it will be treated as a health
> > attribute.
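[For readers following along, a minimal sketch of setting such a health attribute with Pacemaker's standard tools. The attribute name "#health-smart" and node name "node1" are only examples; the "#health" prefix is what matters.]

```shell
# Set a transient node health attribute (cleared when the node restarts).
# Any attribute whose name starts with "#health" is treated as a health
# attribute by the cluster.
attrd_updater --name "#health-smart" --update red --node node1

# Alternatively, store it permanently in the CIB instead of transiently:
crm_attribute --type nodes --node node1 --name "#health-smart" --update red
```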
> >
> >> Some comments on the docs:
> >>
> >> "yellow" state: could also mean the node is becoming healthy
> >> (coming from red), right?
> >
> > True, I'll make a note to update that.
> >
> >> The "Node Health Strategy" could benefit from a better explanation.
> >> E.g.: "Assign the value of ..." Assign to whom/what?
> >
> > The wording could definitely be improved.
> >
> > In this case, the idea is that "red", "yellow", and "green" are just
> > convenient names for particular integer scores. The actual values
> > used depend on the strategy, hence "assign ... to red" and so forth.
> >
> >> It's very hard to find out what "progressive" really does.
> >>
> >> I think a configuration example with a sample scenario (node health
> >> changes) would be very helpful.
> >
> > Yes, progressive and custom are confusing without examples. I'll add
> > it to the to-do list ...
> >
> > The idea behind progressive is that you might want to give a
> > negative but not infinite preference to yellow and/or red. With the
> > other strategies, any red attribute will cause all resources to move
> > off. With progressive, you could set red to some number (say -100),
> > and that score would be used just as if you had configured a
> > location constraint with that score. If you had stickiness higher
> > than that, that would keep any existing resources running there, but
> > prevent any new resources from being moved to the node.
> >
> >> > The 2.1.3 release will add a couple of features to make this more
> >> > useful.
> >> >
> >> > First, you can now exempt particular resources from
> >> > health-related bans, using the new "allow-unhealthy-nodes"
> >> > resource meta-attribute.
> >>
> >> If that's a resource attribute, then the name is poorly chosen
> >> (IMHO). In times like these I'd almost suggest naming it
> >> "immune-against-node-health=red" or so (OK, just a joke).
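[A hedged sketch of what the progressive setup Ken describes could look like, using crm_attribute to set the cluster options; the score values -100 and 200 are purely illustrative.]

```shell
# Use the progressive strategy: each health attribute contributes its
# mapped score to the node, instead of any "red" banning all resources.
crm_attribute --type crm_config --name node-health-strategy --update progressive
crm_attribute --type crm_config --name node-health-green --update 0
crm_attribute --type crm_config --name node-health-yellow --update 0
crm_attribute --type crm_config --name node-health-red --update -100

# With stickiness higher than 100, resources already running on a "red"
# node stay put, but no new resources are placed there.
crm_attribute --type rsc_defaults --name resource-stickiness --update 200
```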
> >
> > I always agonize over the names :)
> >
> > What I really wanted was to use the existing "requires"
> > meta-attribute. It can currently be set to nothing, quorum, fencing,
> > or unfencing, to determine what conditions have to be in place for
> > the resource to run (the default of fencing means that the cluster
> > partition must have quorum and any unclean nodes must have been
> > successfully fenced).
> >
> > It would have been nice to have requires="fencing,health" mean that
> > the resource can only run on a healthy node (as defined by the
> > configured strategy). Unfortunately, that would not have been
> > backward compatible with existing explicit configurations.
> >
> >> > This is particularly helpful for the health monitoring agents
> >> > themselves. Without the new option, health agents get moved off
> >>
> >> Specifically if the health state can improve again.
> >>
> >> > degraded nodes, which means the cluster can't detect if the
> >> > degraded condition goes away. Users had to manually clear the
> >> > health attributes to allow resources to move back to the node.
> >> > Now, you can set allow-unhealthy-nodes=true on your health agent
> >> > resources, so they can continue detecting changes in the health
> >> > status.
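[A sketch of setting the new meta-attribute on a health-agent resource with crm_resource; "health-smart-clone" is a hypothetical resource name.]

```shell
# Exempt a health-monitoring agent from health-based bans, so it keeps
# running on a degraded node and can report when the node recovers.
crm_resource --resource health-smart-clone --meta \
    --set-parameter allow-unhealthy-nodes --parameter-value true
```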
[ClusterLabs] Antw: Re: Antw: [EXT] Coming in 2.1.3: node health monitoring improvements
Ken,

thanks for the explanations! Maybe it would be best (next time) if you
present the documentation for a new feature first (as a base for
discussion), and _then_ implement it.
I know: People first implement it, and later, if they have time or feel
like it, they'll document.
However, as I found out for myself, sometimes documentation is really
useful when you review your code some time later and wonder: "What
should have been the purpose of all that?" ;-)

Regards,
Ulrich

>>> Ken Gaillot wrote on 13.04.2022 at 15:59 in message
<3ad20a26a4623d2e7ff11eb0bdf822faae1a5114.ca...@redhat.com>:
> On Wed, 2022-04-13 at 08:22 +0200, Ulrich Windl wrote:
>> >>> Ken Gaillot wrote on 12.04.2022 at 17:22 in message
>> <33f4147d0f6a3e46581aaa46a4eca81dfa59ce15.ca...@redhat.com>:
>> > Hi all,
>> >
>> > I'm hoping to have the first release candidate for 2.1.3 ready
>> > next week.
>> >
>> > Pacemaker has long had a feature to monitor node health (CPU
>> > usage, SMART drive errors, etc.) and move resources off degraded
>> > nodes:
>> >
>> > https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/singlehtml/index.html#tracking-node-health
>>
>> Great, I wanted to ask a question on it anyway:
>> Is the node health attribute stored in the CIB, or is it transient
>> (i.e. reset when the node is restarted)?
>
> They can be either, although transient makes more sense. As long as
> the name starts with "#health", it will be treated as a health
> attribute.
>
>> Some comments on the docs:
>>
>> "yellow" state: could also mean the node is becoming healthy (coming
>> from red), right?
>
> True, I'll make a note to update that.
>
>> The "Node Health Strategy" could benefit from a better explanation.
>> E.g.: "Assign the value of ..." Assign to whom/what?
>
> The wording could definitely be improved.
>
> In this case, the idea is that "red", "yellow", and "green" are just
> convenient names for particular integer scores. The actual values
> used depend on the strategy, hence "assign ... to red" and so forth.
>
>> It's very hard to find out what "progressive" really does.
>>
>> I think a configuration example with a sample scenario (node health
>> changes) would be very helpful.
>
> Yes, progressive and custom are confusing without examples. I'll add
> it to the to-do list ...
>
> The idea behind progressive is that you might want to give a negative
> but not infinite preference to yellow and/or red. With the other
> strategies, any red attribute will cause all resources to move off.
> With progressive, you could set red to some number (say -100), and
> that score would be used just as if you had configured a location
> constraint with that score. If you had stickiness higher than that,
> that would keep any existing resources running there, but prevent any
> new resources from being moved to the node.
>
>> > The 2.1.3 release will add a couple of features to make this more
>> > useful.
>> >
>> > First, you can now exempt particular resources from health-related
>> > bans, using the new "allow-unhealthy-nodes" resource
>> > meta-attribute.
>>
>> If that's a resource attribute, then the name is poorly chosen
>> (IMHO). In times like these I'd almost suggest naming it
>> "immune-against-node-health=red" or so (OK, just a joke).
>
> I always agonize over the names :)
>
> What I really wanted was to use the existing "requires"
> meta-attribute. It can currently be set to nothing, quorum, fencing,
> or unfencing, to determine what conditions have to be in place for
> the resource to run (the default of fencing means that the cluster
> partition must have quorum and any unclean nodes must have been
> successfully fenced).
>
> It would have been nice to have requires="fencing,health" mean that
> the resource can only run on a healthy node (as defined by the
> configured strategy). Unfortunately, that would not have been
> backward compatible with existing explicit configurations.
>
>> > This is particularly helpful for the health monitoring agents
>> > themselves. Without the new option, health agents get moved off
>>
>> Specifically if the health state can improve again.
>>
>> > degraded nodes, which means the cluster can't detect if the
>> > degraded condition goes away. Users had to manually clear the
>> > health attributes to allow resources to move back to the node.
>> > Now, you can set allow-unhealthy-nodes=true on your health agent
>> > resources, so they can continue detecting changes in the health
>> > status.
>> >
>> > Second, crm_mon will indicate when a node's health is yellow or
>> > red, like:
>> >
>> > * Node List:
>> >   * Node node1: online (health is RED)
>>
>> For compatibility I'd prefer a new option to display those, or at
>> least a new item; maybe like:
>>
>> Node Health:
>> * Node: h16: green
>> ...
>>
>> or
>>
>> ---
>> Node Attributes:
>> * Node h16: green
>> ---
>
> You can