On Sat, 2021-10-30 at 21:17 +0300, Andrei Borzenkov wrote:
> On 29.10.2021 18:37, Ken Gaillot wrote:
> ...
> > > > > To address the original question, this is the log sequence I
> > > > > find most relevant:
> > > > >
> > > > > > Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure) warning: Unexpected result (error) was recorded for monitor of jangcluster-srv-4 on jangcluster-srv-2 at Oct 22 12:21:09 2021 | rc=1 id=jangcluster-srv-4_last_failure_0
> > > > > > Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (unpack_rsc_op_failure) notice: jangcluster-srv-4 will not be started under current conditions
> > > > > > Oct 22 12:21:09.389 jangcluster-srv-2 pacemaker-schedulerd[776553] (pe_fence_node) warning: Remote node jangcluster-srv-4 will be fenced: remote connection is unrecoverable
> > > > >
> > > > > The "will not be started" is why the node had to be fenced.
> > > > > There was
> > > >
> > > > OK so it implies that remote resource should fail over if
> > > > connection to remote node fails. Thank you, that was not exactly
> > > > clear from documentation.
> > > >
> > > > > nowhere to recover the connection. I'd need to see the CIB from
> > > > > that time to know why; it's possible you had an old constraint
> > > > > banning the connection from the other node (e.g. from a ban or
> > > > > move command), or something like that.
> > > > >
> > > >
> > > > Hmm ... looking in (current) sources it seems this message is
> > > > emitted only in case of on-fail=stop operation property ...
> > > >
> > >
> > > Well ...
> > >
> > >     /* For remote nodes, ensure that any failure that results in dropping an
> > >      * active connection to the node results in fencing of the node.
> > >      *
> > >      * There are only two action failures that don't result in fencing.
> > >      * 1. probes - probe failures are expected.
> > >      * 2. start - a start failure indicates that an active connection does not already
> > >      * exist. The user can set op on-fail=fence if they really want to fence start
> > >      * failures. */
> > >
> > > pacemaker will forcibly set on-fail=stop for remote resource.
> >
> > The default isn't any different, it's on-fail=restart.
> >
> > At that point in the code, on-fail is not what the user set (or
> > default), but how the result should be handled, taking into account
> > what the user set. E.g. if the result is success, then on-fail is set
> > to ignore because nothing needs to be done, regardless of what the
> > configured on-fail is.
>
> There are two issues discussed in this thread.
>
> 1. Remote node is fenced when connection with this node is lost. For all
> I can tell this is intended and expected behavior. That was the original
> question.

It's expected only because the connection can't be recovered elsewhere.
If another node can run the connection, pacemaker will try to reconnect
from there and re-probe everything to determine the current state.

> 2. Remote resource appears to not fail over. I cannot reproduce it, but
> then we also do not have the complete CIB, so something may affect it.
> OTOH logs shown stop before fencing has possibly succeeded, so may be
> remote resource *did* fail over.
>
> What I see is - connection to remote node is lost, pacemaker fences
> remote node and attempts to restart remote resource, if this is
> unsuccessful (meaning - connection still could not be established)
> remote resource fails over to another node.
>
> I do not know if it is possible to avoid fencing of remote node under
> described conditions.
>
> What is somewhat interesting (and looks like a bug) - in my testing
> pacemaker ignored failed fencing attempt and proceeded with restarting
> of remote resource. Is it expected behavior?

I don't see a failed fencing attempt (or any result of the fencing
attempt) in the logs in the original message, only failures of the
connection monitor.

--
Ken Gaillot
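
To make the handling rules discussed in this thread easier to follow, here is a toy model in Python. It is only a sketch of what the quoted source comment and the replies describe, not Pacemaker's actual code: the function name, the string return values, and the `can_recover_elsewhere` flag are all hypothetical, and treating a failed probe as "ignore" is a simplifying assumption.

```python
def effective_handling(action, succeeded, configured_on_fail="restart",
                       can_recover_elsewhere=False):
    """Toy model of how a remote connection result might be handled.

    Hypothetical illustration only; names and return values are not
    Pacemaker identifiers.
    """
    # A successful result needs no recovery, whatever on-fail was configured.
    if succeeded:
        return "ignore"

    # Probe failures are expected and do not lead to fencing.
    # (Returning "ignore" here is a simplification made for this sketch.)
    if action == "probe":
        return "ignore"

    # A start failure means no active connection existed yet, so it is not
    # fenced unless the user explicitly configured on-fail=fence.
    if action == "start":
        return configured_on_fail

    # Any other failure (e.g. monitor) drops an active connection.
    # If another node can run the connection, recover there and re-probe;
    # otherwise the remote node has to be fenced.
    if can_recover_elsewhere:
        return "recover-elsewhere"
    return "fence"


if __name__ == "__main__":
    # Monitor failure with nowhere else to run the connection.
    print(effective_handling("monitor", succeeded=False))
    # Same failure, but another node is allowed to host the connection.
    print(effective_handling("monitor", succeeded=False,
                             can_recover_elsewhere=True))
```

In this model the case from the original logs corresponds to a failed monitor with `can_recover_elsewhere=False`, which is why a leftover ban or move constraint on the connection resource would be enough to turn a recoverable failure into fencing.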
