>>> Ken Gaillot <kgail...@redhat.com> wrote on 06.04.2021 at 15:58 in message
<dd0e25837f82746a4363c216685f3ec5a01ca8a0.ca...@redhat.com>:
> On Tue, 2021-04-06 at 09:15 +0200, Ulrich Windl wrote:
>> > > > Ken Gaillot <kgail...@redhat.com> wrote on 31.03.2021 at 15:48 in
>> > > > message <7dfc7c46442db17d9645854081f1269261518f84.ca...@redhat.com>:
>> > On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
>> > > Hi.
>> > >
>> > > I'm trying to understand what looks to me like incorrect behaviour
>> > > between cluster-recheck-interval and failure-timeout, under
>> > > pacemaker 2.0.1
>> > >
>> > > I have three machines in a corosync (3.0.1 if it matters) cluster,
>> > > managing 12 resources in a single group.
>> > >
>> > > I'm following documentation from:
>> > >
>> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
>> > >
>> > > and
>> > >
>> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html
>> > >
>> > > I have set a cluster property:
>> > >
>> > > cluster-recheck-interval=60s
>> > >
>> > > I have set a resource property:
>> > >
>> > > failure-timeout=180
>> > >
>> > > The docs say failure-timeout is "How many seconds to wait before
>> > > acting as if the failure had not occurred, and potentially allowing
>> > > the resource back to the node on which it failed."
>> > >
>> > > I think this should mean that if the resource fails and gets
>> > > restarted, the fact that it failed will be "forgotten" after 180
>> > > seconds (or maybe a little longer, depending on exactly when the
>> > > next cluster recheck is done).
>> > >
>> > > However what I'm seeing is that if the resource fails and gets
>> > > restarted, and this then happens an hour later, it's still counted
>> > > as two failures. If it
>> >
>> > That is exactly correct.
>> >
>> > > fails and gets restarted another hour after that, it's recorded as
>> > > three failures and (because I have "migration-threshold=3") it gets
>> > > moved to another node (and therefore all the other resources in
>> > > the group are moved as well).
>> > >
>> > > So, what am I misunderstanding about "failure-timeout", and what
>> > > configuration setting do I need to use to tell pacemaker that
>> > > "provided the resource hasn't failed within the past X seconds,
>> > > forget the fact that it failed more than X seconds ago"?
>> >
>> > Unfortunately, there is no way. failure-timeout expires *all* failures
>> > once the *most recent* is that old. It's a bit counter-intuitive but
>> > currently, Pacemaker only remembers a resource's most recent failure
>> > and the total count of failures, and changing that would be a big
>> > project.
>>
>> Hi!
>>
>> Sorry I don't get it: If you have a timestamp for each failure-timeout,
>> what's so hard to put all the fail counts that are older than
>> failure-timeout on a list, and then reset that list to zero?
>
> That's exactly the issue -- we don't have a timestamp for each failure.
> Only the most recent failed operation, and the total fail count (per
> resource and operation), are stored in the CIB status.
>
> We could store all failures in the CIB, but that would be a significant
> project, and we'd need new options to keep the current behavior as the
> default.
Hi!

I still don't quite get it: a failing operation increases the fail count,
and the timestamp of that failed operation is recorded (crm_mon can display
it). So solving this problem (keeping the failure time for each fail-count
entry and expiring the entries individually) doesn't look that hard to do.
A rough sketch of the two expiry policies is appended at the end of this
mail.

Regards,
Ulrich

>
>> I mean: That would be what everyone expects.
>> What is implemented instead is like FIFO scheduling: As long as there
>> is a new entry at the head of the queue, the jobs at the tail will
>> never be executed.
>>
>> Regards,
>> Ulrich
>>
>> > > Thanks,
>> > >
>> > > Antony.
>> >
>> > --
>> > Ken Gaillot <kgail...@redhat.com>
>>
> --
> Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
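P.S. To make the difference between the two behaviours concrete, here is a
rough sketch (plain Python, not Pacemaker code; the function names and
numbers are invented for illustration), using failure-timeout=180 and three
failures spaced one hour apart, as in Antony's scenario:

FAILURE_TIMEOUT = 180  # seconds, as in Antony's configuration


def failcount_current(failure_times, now):
    # Current behaviour as Ken describes it: only the most recent failure
    # time and the total count are kept, so everything expires at once,
    # and only when the *last* failure is older than failure-timeout.
    if not failure_times or now - max(failure_times) >= FAILURE_TIMEOUT:
        return 0
    return len(failure_times)


def failcount_per_failure(failure_times, now):
    # Behaviour I would expect: each failure keeps its own timestamp and
    # is forgotten individually once it is older than failure-timeout.
    return sum(1 for t in failure_times if now - t < FAILURE_TIMEOUT)


# Three failures, one hour apart; "now" is 10 seconds after the third one.
failures = [0, 3600, 7200]
now = 7210

print(failcount_current(failures, now))      # 3 -> migration-threshold=3 is reached
print(failcount_per_failure(failures, now))  # 1 -> only the recent failure still counts

With per-failure expiry the hourly failures would never add up to
migration-threshold=3; with the current policy they do, because each new
failure keeps the whole count alive.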