On 12/16/20 5:06 PM, Ulrich Windl wrote:
Hi!
(I changed the subject of the thread)
VirtualDomain seems to be broken, as it does not handle a failed live migration
correctly:
With my test-VM running on node h16, this happened when I tried to move it away
(for testing):
Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]: notice: * Migrate
prm_xen_test-jeos ( h16 -> h19 )
Dec 16 09:28:46 h19 pacemaker-controld[4428]: notice: Initiating migrate_to
operation prm_xen_test-jeos_migrate_to_0 on h16
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 aborted
by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event failed
The RA's migrate_to failed quickly. Maybe the configuration is not quite right?
How about enabling tracing and collecting more RA logs to see exactly which
virsh command was used, and then checking whether it works when run manually:
`crm resource trace prm_xen_test-jeos`
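For reference, a rough sketch of where to look afterwards (the trace_ra
directory below is the usual SLES default and is an assumption; adjust the
path for your installation):

  # after enabling tracing, retry the move, then inspect the newest trace file
  ls -lt /var/lib/heartbeat/trace_ra/VirtualDomain/
  less /var/lib/heartbeat/trace_ra/VirtualDomain/prm_xen_test-jeos.migrate_to.*
  # the file holds the agent's set -x output, including the expanded virsh call
  # turn tracing off again when done:
  crm resource untrace prm_xen_test-jeos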
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 action
115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
### (note that the message above is a duplicate!)
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: error: Resource
prm_xen_test-jeos is active on 2 nodes (attempting recovery)
### This is nonsense after a failed live migration!
Indeed, that sounds like a valid improvement for pacemaker-schedulerd? Or at
least the behavior when migrate_to fails should be spelled out; I couldn't find
it defined in any documentation yet.
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: notice: * Recover
prm_xen_test-jeos ( h19 )
So the cluster is doing exactly the wrong thing: the VM is still active on h16, while a
"recovery" on h19 will start it there! So _after_ the recovery the VM is
running twice.
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Initiating stop
operation prm_xen_test-jeos_stop_0 locally on h19
Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain
test-jeos already stopped.
Dec 16 09:28:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos stop
(call 372, PID 20620) exited with status 0 (execution time 283ms, queue time
0ms)
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Result of stop operation
for prm_xen_test-jeos on h19: ok
Dec 16 09:31:45 h19 pacemaker-controld[4428]: notice: Initiating start
operation prm_xen_test-jeos_start_0 locally on h19
Dec 16 09:31:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos start
(call 373, PID 21005) exited with status 0 (execution time 2715ms, queue time
0ms)
Dec 16 09:31:47 h19 pacemaker-controld[4428]: notice: Result of start
operation for prm_xen_test-jeos on h19: ok
Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]: warning: Unexpected result
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
Yeah, the schedulerd is trying really hard to report the migrate_to failure here!
Amazingly, manual migration using virsh worked:
virsh migrate --live test-jeos xen+tls://h18...
What about s/h18/h19/?
Or, reproduce it manually exactly as the RA code does:
`virsh ${VIRSH_OPTIONS} migrate --live $migrate_opts $DOMAIN_NAME $remoteuri
$migrateuri`
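A sketch of one way to recover the actual values (the trace file location is
the usual SLES default mentioned above and is an assumption):

  # parameters the RA derives $remoteuri / $migrateuri from
  crm configure show prm_xen_test-jeos
  # with tracing enabled, the fully expanded virsh command appears in the
  # trace output and can be re-run verbatim on the source node:
  grep 'virsh .*migrate' /var/lib/heartbeat/trace_ra/VirtualDomain/prm_xen_test-jeos.migrate_to.*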
Good luck!
Roger
Regards,
Ulrich Windl
Ulrich Windl wrote on 14.12.2020 at 15:21 in message <5FD774CF.8DE : 161 :
60728>:
Hi!
I think I found the reason why a VM is started on two nodes:
Live migration had failed (e.g. away from h16), so the cluster used stop and
start instead (stop on h16, start on h19, for example).
When rebooting h16, I see these messages (h19 is DC):
Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: warning: Unexpected result
(error: test-jeos: live migration to h16 failed: 1) was recorded for
migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Resource
prm_xen_test-jeos is active on 2 nodes (attempting recovery)
Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: notice: * Restart
prm_xen_test-jeos ( h16 )
THIS IS WRONG: h16 was just rebooted, so no VM is running on h16 (unless
libvirt autostarted one; "virsh list --autostart" does not list any).
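One quick cross-check (assuming passwordless ssh between the cluster nodes)
would be to compare the cluster's view with what libvirt actually reports:

  # where does the cluster think the resource runs?
  crm_resource --resource prm_xen_test-jeos --locate
  # what do the hypervisors actually have?
  ssh h16 virsh list --all
  ssh h19 virsh list --all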
Dec 14 15:09:27 h16 VirtualDomain(prm_xen_test-jeos)[4850]: INFO: Domain
test-jeos already stopped.
Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Calculated
transition 669 (with errors), saving inputs in
/var/lib/pacemaker/pengine/pe-error-4.bz2
What's going on here?
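The saved input can be replayed offline to see how the scheduler arrived at
the "active on 2 nodes" conclusion; roughly:

  # simulate the transition from the saved policy-engine input
  crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-error-4.bz2
  # or narrow the output down to the resource in question
  crm_simulate -Sx /var/lib/pacemaker/pengine/pe-error-4.bz2 | grep prm_xen_test-jeos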
Regards,
Ulrich
Ulrich Windl wrote on 14.12.2020 at 08:15 in message <5FD7110D.D09 : 161 :
60728>:
Hi!
Another word of warning regarding VirtualDomain: While configuring a 3-node
cluster with SLES15 SP2 for Xen PVM (using libvirt and the VirtualDomain RA),
I had created a test VM using Btrfs.
At some point during testing the cluster ended up with the test VM running on
more than one node (for reasons still to be examined). Only after a "crm
resource refresh" (reprobe) did the cluster try to fix the problem.
Well, at some point the VM wouldn't start any more, because the Btrfs used
for everything (the SLES default) was corrupted in a way that seems unrecoverable,
independently of how many subvolumes and snapshots of them may exist.
Initially I would guess that the libvirt stack and VirtualDomain are less
reliable than the old Xen method and RA.
Regards,
Ulrich
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/