Did the "active on too many nodes" message happen right after a probe? If so, then it does sound like the probe returned code 0.
If a probe returned 0 and it **shouldn't** have done so, then either the monitor operation needs to be redesigned, or resource-discovery=never (or resource-discovery=exclusive) can be used to prevent the probe from happening where it should not.

If a probe returned 0 and it **should** have done so, but the stop operation on the other node wasn't reflected in the CIB (so that the resource still appeared to be active there), then that's odd. A bug is certainly possible, though we can't say without more detail :)

(A rough, untested pcs sketch of the attribute-resource workaround from my earlier mail is at the bottom of this message.)

On Sun, Mar 7, 2021 at 11:10 PM Ulrich Windl <[email protected]> wrote:

> >>> Reid Wahl <[email protected]> wrote on 05.03.2021 at 21:22 in
> message
> <capiuu991o08dnavkm9bc8n9bk-+nh9e0_f25o6ddis5wzwg...@mail.gmail.com>:
> > On Fri, Mar 5, 2021 at 10:13 AM Ken Gaillot <[email protected]> wrote:
> >
> >> On Fri, 2021-03-05 at 11:39 +0100, Ulrich Windl wrote:
> >> > Hi!
> >> >
> >> > I'm unsure what actually causes a problem I see (a resource was
> >> > "detected running" when it actually was not), but I'm sure some probe
> >> > started on cluster node start cannot provide a useful result until
> >> > some other resource has been started. AFAIK there is no way to make a
> >> > probe obey ordering or colocation constraints, so the only work-around
> >> > seems to be a delay. However I'm unsure whether probes can actually
> >> > be delayed.
> >> >
> >> > Ideas?
> >>
> >> Ordered probes are a thorny problem that we've never been able to come
> >> up with a general solution for. We do order certain probes where we
> >> have enough information to know it's safe. The problem is that it is
> >> very easy to introduce ordering loops.
> >>
> >> I don't remember if there are any workarounds.
> >
> > Maybe as a workaround:
> > - Add an ocf:pacemaker:attribute resource after-and-with rsc1
> > - Then configure a location rule for rsc2 with resource-discovery=never
> >   and score=-INFINITY with expression (in pseudocode) "attribute is not
> >   set to active value"
> >
> > I haven't tested but that might cause rsc2's probe to wait until rsc1 is
> > active.
> >
> > And of course, use the usual constraints/rules to ensure rsc2's probe
> > only runs on rsc1's node.
> >
> >> > Despite that, I wonder whether some probe/monitor return code like
> >> > OCF_NOT_READY would make sense if the operation detects that it
> >> > cannot return a current status (so both "running" and "stopped" would
> >> > be as inadequate as "starting" and "stopping" would be, despite the
> >> > fact that the latter two do not exist).
> >
> > This seems logically reasonable, independent of any implementation
> > complexity and considerations of what we would do with that return code.
>
> Thanks for the proposal!
> The actual problem I was facing was that the cluster claimed some resource
> was running on two nodes at the same time, when actually one node had
> been stopped properly (with all its resources). The bad state in the CIB
> was most likely due to a software bug in pacemaker, but probes on
> restarting the node seemed not to prevent pacemaker from doing a really
> wrong "recovery action".
> My hope was that probes might update the CIB before some stupid action is
> done. Maybe it's just another software bug...
>
> Regards,
> Ulrich
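For what it's worth, here is a rough, untested sketch of the workaround described in my earlier mail (quoted above), in pcs syntax. The names rsc1, rsc2, and rsc1-active-flag are placeholders, and the node attribute name assumes ocf:pacemaker:attribute's default of "opa-<resource id>" with active_value=1. Please double-check the agent parameters and the pcs rule syntax on your version before relying on it:

    # Flag resource that sets a node attribute only after rsc1 has
    # started, and only on rsc1's node
    pcs resource create rsc1-active-flag ocf:pacemaker:attribute
    pcs constraint order start rsc1 then start rsc1-active-flag
    pcs constraint colocation add rsc1-active-flag with rsc1 INFINITY

    # Ban rsc2 (and, via resource-discovery=never, its probe) from any
    # node where the flag attribute is unset or not at its active value
    pcs constraint location rsc2 rule resource-discovery=never \
        score=-INFINITY not_defined opa-rsc1-active-flag \
        or opa-rsc1-active-flag ne 1

The idea is that rsc2's probe is simply skipped until rsc1 (and the flag resource ordered after it) is active on the node, at which point normal discovery and ordering take over.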
-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
