On 12/16/20 5:06 PM, Ulrich Windl wrote:
Hi!
(I changed the subject of the thread)
VirtualDomain seems to be broken, as it does not handle a failed live migration
correctly:
With my test-VM running on node h16, this happened when I tried to move it away
(for testing):
Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]: notice: * Migrate
prm_xen_test-jeos ( h16 -> h19 )
Dec 16 09:28:46 h19 pacemaker-controld[4428]: notice: Initiating migrate_to
operation prm_xen_test-jeos_migrate_to_0 on h16
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 aborted
by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event failed
The RA's migrate_to failed quickly. Maybe the configuration is not quite right?
How about enabling tracing and collecting more RA logs to see exactly which
virsh command was used, and then checking whether it works when run manually:
`crm resource trace prm_xen_test-jeos`
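For reference, a rough sketch of where to look afterwards (the trace_ra
directory below is the usual SLES default and is an assumption; adjust the
path for your installation):

  # after enabling tracing, retry the move, then inspect the newest trace file
  ls -lt /var/lib/heartbeat/trace_ra/VirtualDomain/
  less /var/lib/heartbeat/trace_ra/VirtualDomain/prm_xen_test-jeos.migrate_to.*
  # the file holds the agent's set -x output, including the expanded virsh call
  # turn tracing off again when done:
  crm resource untrace prm_xen_test-jeos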
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 action
115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
### (note that the message above is a duplicate!)
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: error: Resource
prm_xen_test-jeos is active on 2 nodes (attempting recovery)
### This is nonsense after a failed live migration!
Indeed, that sounds like a valid improvement for pacemaker-schedulerd? Or at
least the behavior when migrate_to fails should be spelled out; I couldn't find
it defined in any documentation yet.
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: notice: * Recover
prm_xen_test-jeos ( h19 )
So the cluster is doing exactly the wrong thing: the VM is still active on h16, while a
"recovery" on h19 will start it there! So _after_ the recovery the VM is
running twice.
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Initiating stop
operation prm_xen_test-jeos_stop_0 locally on h19
Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain
test-jeos already stopped.
Dec 16 09:28:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos stop
(call 372, PID 20620) exited with status 0 (execution time 283ms, queue time
0ms)
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Result of stop operation
for prm_xen_test-jeos on h19: ok
Dec 16 09:31:45 h19 pacemaker-controld[4428]: notice: Initiating start
operation prm_xen_test-jeos_start_0 locally on h19
Dec 16 09:31:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos start
(call 373, PID 21005) exited with status 0 (execution time 2715ms, queue time
0ms)
Dec 16 09:31:47 h19 pacemaker-controld[4428]: notice: Result of start
operation for prm_xen_test-jeos on h19: ok
Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]: warning: Unexpected result
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
Yeah, the schedulerd is trying really hard to report the migrate_to failure here!
Amazingly, manual migration using virsh worked:
virsh migrate --live test-jeos xen+tls://h18...
What about s/h18/h19/?
Or, reproduce it manually exactly as the RA code does:
`virsh ${VIRSH_OPTIONS} migrate --live $migrate_opts $DOMAIN_NAME $remoteuri
$migrateuri`
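A sketch of one way to recover the actual values (the trace file location is
the usual SLES default mentioned above and is an assumption):

  # parameters the RA derives $remoteuri / $migrateuri from
  crm configure show prm_xen_test-jeos
  # with tracing enabled, the fully expanded virsh command appears in the
  # trace output and can be re-run verbatim on the source node:
  grep 'virsh .*migrate' /var/lib/heartbeat/trace_ra/VirtualDomain/prm_xen_test-jeos.migrate_to.*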
Good luck!
Roger
Regards,
Ulrich Windl
Ulrich Windl wrote on 14.12.2020 at 15:21 in message <5FD774CF.8DE : 161 :
60728>:
Hi!
I think I found the reason why a VM is started on two nodes:
Live migration had failed (e.g. away from h16), so the cluster used stop and
start instead (stop on h16, start on h19, for example).
When rebooting h16, I see these messages (h19 is DC):
Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: warning: Unexpected result
(error: test-jeos: live migration to h16 failed: 1) was recorded for
migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Resource
prm_xen_test-jeos is active on 2 nodes (attempting recovery)
Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: notice: * Restart
prm_xen_test-jeos ( h16 )
THIS IS WRONG: h16 was just rebooted, so no VM is running on h16 (unless
libvirt autostarted one; "virsh list --autostart" does not list any).
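One quick cross-check (assuming passwordless ssh between the cluster nodes)
would be to compare the cluster's view with what libvirt actually reports:

  # where does the cluster think the resource runs?
  crm_resource --resource prm_xen_test-jeos --locate
  # what do the hypervisors actually have?
  ssh h16 virsh list --all
  ssh h19 virsh list --all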
Dec 14 15:09:27 h16 VirtualDomain(prm_xen_test-jeos)[4850]: INFO: Domain
test-jeos already stopped.
Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Calculated
transition 669 (with errors), saving inputs in
/var/lib/pacemaker/pengine/pe-error-4.bz2
What's going on here?
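The saved input can be replayed offline to see how the scheduler arrived at
the "active on 2 nodes" conclusion; roughly:

  # simulate the transition from the saved policy-engine input
  crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-error-4.bz2
  # or narrow the output down to the resource in question
  crm_simulate -Sx /var/lib/pacemaker/pengine/pe-error-4.bz2 | grep prm_xen_test-jeos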
Regards,
Ulrich
Ulrich Windl wrote on 14.12.2020 at 08:15 in message <5FD7110D.D09 : 161 :
60728>:
Hi!
Another word of warning regarding VirtualDomain: While configuring a 3-node
cluster with SLES15 SP2 for Xen PVM (using libvirt and the VirtualDomain RA),
I had created a test VM using Btrfs.
At some point during testing the cluster ended up with the test VM running on
more than one node (for reasons still to be examined). Only after a "crm
resource refresh" (reprobe) did the cluster try to fix the problem.
Well, at some point the VM wouldn't start any more, because the Btrfs used
for everything (the SLES default) was corrupted in a way that seems unrecoverable,
independently of how many subvolumes and snapshots of them may exist.
Initially I would guess that the libvirt stack and VirtualDomain are less
reliable than the old Xen method and RA.
Regards,
Ulrich
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/