On 2022/3/31 9:03, Ulrich Windl wrote:
Hi!
I just wanted to point out one thing that hit us with SLES15 SP3:
A failed live VM migration that caused node fencing resulted in a fencing loop,
for two reasons:
1) Pacemaker thinks that even _after_ fencing there is some migration to "clean
up". Pacemaker treats the situation as if the VM were running on both nodes, thus
(50% chance?) trying to stop the VM on the node that just booted after fencing.
That's stupid, but shouldn't be fatal IF there weren't...
2) The stop operation of the VM (which actually isn't running) fails;
AFAICT it could not connect to the hypervisor. The logic in the RA
is kind of arguable: the probe (monitor) of the VM returned "not
running", but the stop right after that returned failure, causing a
node fence. So the loop is complete.
OTOH, the point about pacemaker is that the stop of the resource on the
fenced and rejoined node is not really necessary. There have been
discussions about this here, and we are trying to figure out a solution
for it:
https://github.com/ClusterLabs/pacemaker/pull/2146#discussion_r828204919
For now it requires the administrator's intervention if the situation happens:
1) Fix the access to the hypervisor before the fenced node rejoins.
2) Manually clean up the resource, which tells pacemaker it can safely
forget the historical migrate_to failure (example commands below).
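For example (just a sketch using the resource and node names from the logs
below; the xen:///system connection URI is an assumption for this libxl
setup, so adjust everything to your environment), that could look like:

    # verify that libvirt can reach the hypervisor again on the rejoined node
    virsh --connect xen:///system list --all
    # tell pacemaker to forget the recorded migrate_to/stop failures
    crm_resource --cleanup --resource prm_xen_v15 --node h18
    # or equivalently with crmsh on SLES
    crm resource cleanup prm_xen_v15 h18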
Regards,
Yan
Some details (many unrelated messages left out):
Mar 30 16:06:14 h16 libvirtd[13637]: internal error: libxenlight failed to
restore domain 'v15'
Mar 30 16:06:15 h19 pacemaker-schedulerd[7350]: warning: Unexpected result
(error: v15: live migration to h16 failed: 1) was recorded for migrate_to of
prm_xen_v15 on h18 at Mar 30 16:06:13 2022
Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Unexpected result
(OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 16:13:36
2022
Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Cluster node h18 will
be fenced: prm_libvirtd:0 failed there
Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: warning: Unexpected result
(error: v15: live migration to h18 failed: 1) was recorded for migrate_to of
prm_xen_v15 on h16 at Mar 29 23:58:40 2022
Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: error: Resource prm_xen_v15 is
active on 2 nodes (attempting recovery)
Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: notice: * Restart
prm_xen_v15 ( h18 )
Mar 30 16:19:04 h18 VirtualDomain(prm_xen_v15)[8768]: INFO: Virtual domain v15
currently has no state, retrying.
Mar 30 16:19:05 h18 VirtualDomain(prm_xen_v15)[8787]: INFO: Virtual domain v15
currently has no state, retrying.
Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8822]: ERROR: Virtual domain v15
has no state during stop operation, bailing out.
Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8836]: INFO: Issuing forced
shutdown (destroy) request for domain v15.
Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8860]: ERROR: forced stop failed
Mar 30 16:19:07 h19 pacemaker-controld[7351]: notice: Transition 124 action
115 (prm_xen_v15_stop_0 on h18): expected 'ok' but got 'error'
Note: Our cluster nodes start pacemaker during boot. Yesterday I was there when
the problem happened. But as we had another boot loop some time ago, I wrote a
systemd service that counts boots, and if too many happen within a short time,
pacemaker will be disabled on that node. As it is set now, the counter is reset
if the node is up for at least 15 minutes; if it fails more than 4 times to do
so, pacemaker will be disabled (a rough sketch of the idea is below). If someone
wants to try that or give feedback, drop me a line, so I can provide the RPM
(boot-loop-handler-0.0.5-0.0.noarch)...
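The core logic could be sketched roughly like this (illustrative only, not the
actual boot-loop-handler package; the path, the service wiring and the exact
thresholds are assumptions here), run once per boot from a simple systemd
service:

    #!/bin/sh
    # Count boots; if the node reboots too often without staying up
    # long enough in between, disable pacemaker so the loop stops.
    COUNT_FILE=/var/lib/boot-loop-handler/count   # hypothetical path
    MAX_FAILS=4        # give up after more than 4 quick reboots
    RESET_AFTER=900    # 15 minutes of uptime counts as a good boot

    mkdir -p "$(dirname "$COUNT_FILE")"
    count=$(( $(cat "$COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
    echo "$count" > "$COUNT_FILE"

    if [ "$count" -gt "$MAX_FAILS" ]; then
        # too many reboots in a row without a long-enough uptime
        systemctl disable --now pacemaker
        exit 0
    fi

    # the node stayed up long enough: reset the counter
    sleep "$RESET_AFTER"
    echo 0 > "$COUNT_FILE"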
Regards,
Ulrich
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/