Hi!

(I changed the subject of the thread.) VirtualDomain seems to be broken, as it does not handle a failed live migration correctly:
With my test VM running on node h16, this happened when I tried to move it away (for testing):

Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]: notice: * Migrate prm_xen_test-jeos ( h16 -> h19 )
Dec 16 09:28:46 h19 pacemaker-controld[4428]: notice: Initiating migrate_to operation prm_xen_test-jeos_migrate_to_0 on h16
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 aborted by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event failed
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 action 115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
### (note the message above is duplicated!)
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: error: Resource prm_xen_test-jeos is active on 2 nodes (attempting recovery)
### This is nonsense after a failed live migration!
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: notice: * Recover prm_xen_test-jeos ( h19 )

So the cluster is doing exactly the wrong thing: the VM is still active on h16, while a "recovery" on h19 will start it there! So _after_ the recovery the VM is running on both nodes.

Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Initiating stop operation prm_xen_test-jeos_stop_0 locally on h19
Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain test-jeos already stopped.
Dec 16 09:28:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos stop (call 372, PID 20620) exited with status 0 (execution time 283ms, queue time 0ms)
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Result of stop operation for prm_xen_test-jeos on h19: ok
Dec 16 09:31:45 h19 pacemaker-controld[4428]: notice: Initiating start operation prm_xen_test-jeos_start_0 locally on h19
Dec 16 09:31:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos start (call 373, PID 21005) exited with status 0 (execution time 2715ms, queue time 0ms)
Dec 16 09:31:47 h19 pacemaker-controld[4428]: notice: Result of start operation for prm_xen_test-jeos on h19: ok
Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020

Amazingly, manual migration using virsh worked:

virsh migrate --live test-jeos xen+tls://h18...

(Two rough command sketches, one for checking and cleaning up the duplicate-VM state and one for replaying the saved scheduler input, follow at the end of this mail, below the quoted messages.)

Regards,
Ulrich Windl

>>> Ulrich Windl wrote on 14.12.2020 at 15:21 in message <5FD774CF.8DE : 161 : 60728>:
> Hi!
>
> I think I found the problem why a VM is started on two nodes:
>
> Live migration had failed (e.g. away from h16), so the cluster uses stop and start (stop on h16, start on h19, for example).
> When rebooting h16, I see these messages (h19 is DC):
>
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h16 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Resource prm_xen_test-jeos is active on 2 nodes (attempting recovery)
>
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: notice: * Restart prm_xen_test-jeos ( h16 )
>
> THIS IS WRONG: h16 was just booted, so no VM is running on h16 (unless there was some autostart from libvirt; "virsh list --autostart" does not list any).
>
> Dec 14 15:09:27 h16 VirtualDomain(prm_xen_test-jeos)[4850]: INFO: Domain test-jeos already stopped.
>
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Calculated transition 669 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-4.bz2
>
> What's going on here?
>
> Regards,
> Ulrich
>
> >>> Ulrich Windl wrote on 14.12.2020 at 08:15 in message <5FD7110D.D09 : 161 : 60728>:
> > Hi!
> >
> > Another word of warning regarding VirtualDomain: While configuring a 3-node cluster with SLES15 SP2 for Xen PVM (using libvirt and the VirtualDomain RA), I had created a test VM using BtrFS.
> > At some point during testing the cluster ended up with the test VM running on more than one node (for reasons still to be examined). Only after a "crm resource refresh" (reprobe) did the cluster try to fix the problem.
> > Well, at some point the VM would not start any more, because the BtrFS used for everything (the SLES default) was corrupted in a way that seems unrecoverable, independently of how many subvolumes and snapshots of those may exist.
> >
> > Initially I would guess the libvirt stack and VirtualDomain are less reliable than the old Xen method and RA.
> >
> > Regards,
> > Ulrich
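P.S.: For reference, a rough sketch of how such a double-active state can be checked and cleaned up by hand. This assumes crmsh plus the stock Pacemaker command-line tools; the resource, domain and node names are the ones from the logs above:

# On each node, see which domains libvirt itself reports as running:
virsh list --all

# Ask the cluster (i.e. the recorded CIB status) where it believes the resource is active:
crm_resource --locate --resource prm_xen_test-jeos

# Re-probe the resource and clear its operation history, so the scheduler
# works from the actual state instead of the stale failed migrate_to record:
crm resource refresh prm_xen_test-jeos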
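And a rough sketch of how the "active on 2 nodes" decision can be examined offline from the saved scheduler input mentioned in the quoted message, assuming the stock crm_simulate tool and the file name from the log:

# Replay the saved scheduler input and show the transition it calculates:
crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-error-4.bz2

# The status section of that file contains the recorded operation history
# (including the failed migrate_to) that the scheduler based its decision on.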
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/