On 2022/3/31 9:03, Ulrich Windl wrote:
Hi!
I just wanted to point out one thing that hit us with SLES15 SP3:
A failed live VM migration that caused node fencing resulted in a fencing loop,
for two reasons:
1) Pacemaker thinks that even _after_ fencing there is some migration to "clean
up". Pacemaker treats the situation as if the VM were running on both nodes, thus
(50% chance?) trying to stop the VM on the node that just booted after fencing.
That's stupid, but shouldn't be fatal IF there weren't...
2) The stop operation of the VM (which actually isn't running) fails;
AFAICT it could not connect to the hypervisor. The logic in the RA
is kind of arguable: the probe (monitor) of the VM returned "not
running", but the stop right after that returned failure, causing a
node fence. So the loop is complete.
OTOH, the point about pacemaker is that the stop of the resource on the
fenced and rejoined node is not really necessary. There have been
discussions about this here, and we are trying to figure out a solution
for it:
https://github.com/ClusterLabs/pacemaker/pull/2146#discussion_r828204919
For now it requires the administrator's intervention if the situation happens:
1) Fix the access to the hypervisor before the fenced node rejoins.
2) Manually clean up the resource, which tells pacemaker it can safely
forget the historical migrate_to failure (example commands below).
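For example (just a sketch using the resource and node names from the logs
below; the xen:///system connection URI is an assumption for this libxl
setup, so adjust everything to your environment), that could look like:

    # verify that libvirt can reach the hypervisor again on the rejoined node
    virsh --connect xen:///system list --all
    # tell pacemaker to forget the recorded migrate_to/stop failures
    crm_resource --cleanup --resource prm_xen_v15 --node h18
    # or equivalently with crmsh on SLES
    crm resource cleanup prm_xen_v15 h18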
Regards,
Yan
Some details (many unrelated messages left out):
Mar 30 16:06:14 h16 libvirtd[13637]: internal error: libxenlight failed to
restore domain 'v15'
Mar 30 16:06:15 h19 pacemaker-schedulerd[7350]: warning: Unexpected result
(error: v15: live migration to h16 failed: 1) was recorded for migrate_to of
prm_xen_v15 on h18 at Mar 30 16:06:13 2022
Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Unexpected result
(OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 16:13:36
2022
Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Cluster node h18 will
be fenced: prm_libvirtd:0 failed there
Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: warning: Unexpected result
(error: v15: live migration to h18 failed: 1) was recorded for migrate_to of
prm_xen_v15 on h16 at Mar 29 23:58:40 2022
Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: error: Resource prm_xen_v15 is
active on 2 nodes (attempting recovery)
Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: notice: * Restart
prm_xen_v15 ( h18 )
Mar 30 16:19:04 h18 VirtualDomain(prm_xen_v15)[8768]: INFO: Virtual domain v15
currently has no state, retrying.
Mar 30 16:19:05 h18 VirtualDomain(prm_xen_v15)[8787]: INFO: Virtual domain v15
currently has no state, retrying.
Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8822]: ERROR: Virtual domain v15
has no state during stop operation, bailing out.
Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8836]: INFO: Issuing forced
shutdown (destroy) request for domain v15.
Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8860]: ERROR: forced stop failed
Mar 30 16:19:07 h19 pacemaker-controld[7351]: notice: Transition 124 action
115 (prm_xen_v15_stop_0 on h18): expected 'ok' but got 'error'
Note: Our cluster nodes start pacemaker during boot. Yesterday I was there when
the problem happened. But as we had another boot loop some time ago, I wrote a
systemd service that counts boots, and if too many happen within a short time,
pacemaker will be disabled on that node. As it is set now, the counter is reset
if the node is up for at least 15 minutes; if it fails more than 4 times to do
so, pacemaker will be disabled (a rough sketch of the idea is below). If someone
wants to try that or give feedback, drop me a line, so I can provide the RPM
(boot-loop-handler-0.0.5-0.0.noarch)...
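The core logic could be sketched roughly like this (illustrative only, not the
actual boot-loop-handler package; the path, the service wiring and the exact
thresholds are assumptions here), run once per boot from a simple systemd
service:

    #!/bin/sh
    # Count boots; if the node reboots too often without staying up
    # long enough in between, disable pacemaker so the loop stops.
    COUNT_FILE=/var/lib/boot-loop-handler/count   # hypothetical path
    MAX_FAILS=4        # give up after more than 4 quick reboots
    RESET_AFTER=900    # 15 minutes of uptime counts as a good boot

    mkdir -p "$(dirname "$COUNT_FILE")"
    count=$(( $(cat "$COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
    echo "$count" > "$COUNT_FILE"

    if [ "$count" -gt "$MAX_FAILS" ]; then
        # too many reboots in a row without a long-enough uptime
        systemctl disable --now pacemaker
        exit 0
    fi

    # the node stayed up long enough: reset the counter
    sleep "$RESET_AFTER"
    echo 0 > "$COUNT_FILE"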
Regards,
Ulrich
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/