On Wednesday 07 April 2021 at 10:40:54, Ulrich Windl wrote:

> >>> Ken Gaillot <kgail...@redhat.com> wrote on 06.04.2021 at 15:58
> > On Tue, 2021-04-06 at 09:15 +0200, Ulrich Windl wrote:
> >> Sorry I don't get it: If you have a timestamp for each failure-timeout,
> >> what's so hard to put all the fail counts that are older than
> >> failure-timeout on a list, and then reset that list to zero?
> >
> > That's exactly the issue -- we don't have a timestamp for each failure.
> > Only the most recent failed operation, and the total fail count (per
> > resource and operation), are stored in the CIB status.
> >
> > We could store all failures in the CIB, but that would be a significant
> > project, and we'd need new options to keep the current behavior as the
> > default.
>
> I still don't quite get it: Some failing operation increases the
> fail-count, and the time stamp for the failing operation is recorded
> (crm_mon can display it). So solving this problem (saving the last time
> for each fail count) doesn't look so hard to do.

For the avoidance of doubt, I (who started this thread) have solved my problem by following the advice from Reid Wahl: I was putting the "failure-timeout" parameter into the wrong section of my resource definition. Moving it to the "meta" section has resolved my problem.

The way it works now makes perfect sense to me:

1. A failure happens, and gets corrected.

2. Provided no further failure of that resource occurs within the failure-timeout setting, the failure gets forgotten about.

3. If a further failure of the resource does occur within failure-timeout, the original timestamp is discarded, the failure count is incremented, and the timestamp of the new failure is used to check whether there's another failure within failure-timeout of *that*.

4. If no further failure occurs within failure-timeout of the most recent failure timestamp, all previous failures are forgotten.

5. If enough failures occur within failure-timeout *of each other* then the failure count gets incremented to the point where the resource gets moved to another node.

Regards,

Antony.
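The five steps above can be sketched as a small toy model. This is only an illustration of the behaviour as described in this thread, not Pacemaker code: the class and method names are made up, times are plain seconds, and "threshold" is a hypothetical stand-in for whatever fail count causes the resource to move (presumably the migration-threshold setting).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureTracker:
    """Toy model of steps 1-5 above (hypothetical, not Pacemaker code)."""
    failure_timeout: float                 # seconds; stands in for failure-timeout
    threshold: int                         # hypothetical: count at which the resource moves
    fail_count: int = 0
    last_failure: Optional[float] = None   # only the most recent timestamp is kept

    def expire(self, now: float) -> None:
        # Steps 2 and 4: nothing failed within failure-timeout of the most
        # recent failure, so all previous failures are forgotten at once.
        if (self.last_failure is not None
                and now - self.last_failure >= self.failure_timeout):
            self.fail_count = 0
            self.last_failure = None

    def record_failure(self, now: float) -> bool:
        """Record a failure at time `now`; return True if the resource should move."""
        self.expire(now)
        # Step 3: the old timestamp is discarded, the count is incremented.
        self.fail_count += 1
        self.last_failure = now
        # Step 5: enough failures within failure-timeout of each other.
        return self.fail_count >= self.threshold


tracker = FailureTracker(failure_timeout=60, threshold=3)
for t in (0, 50, 100, 200):
    print(t, tracker.record_failure(t), tracker.fail_count)
```

Note that the failure at t=100 is more than 60 seconds after the first one, yet the count still reaches 3, because each failure is only compared with the one immediately before it. The failure at t=200 arrives after the timeout has passed, so the count starts again from 1.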
-- 
"It wouldn't be a good idea to talk about him behind his back in front of him."

 - murble

Please reply to the list; please *don't* CC me.