Hi!

(I changed the subject of the thread.) VirtualDomain seems to be broken, as it does not handle a failed live migration correctly:
With my test VM running on node h16, this happened when I tried to move it away (for testing):

Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]: notice: * Migrate prm_xen_test-jeos ( h16 -> h19 )
Dec 16 09:28:46 h19 pacemaker-controld[4428]: notice: Initiating migrate_to operation prm_xen_test-jeos_migrate_to_0 on h16
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 aborted by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event failed
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 action 115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
### (note the message above is duplicated!)
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: error: Resource prm_xen_test-jeos is active on 2 nodes (attempting recovery)
### This is nonsense after a failed live migration!
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: notice: * Recover prm_xen_test-jeos ( h19 )

So the cluster is doing exactly the wrong thing: the VM is still active on h16, while a "recovery" on h19 will start it there! So _after_ the recovery the VM is running on both nodes.

Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Initiating stop operation prm_xen_test-jeos_stop_0 locally on h19
Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain test-jeos already stopped.
Dec 16 09:28:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos stop (call 372, PID 20620) exited with status 0 (execution time 283ms, queue time 0ms)
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Result of stop operation for prm_xen_test-jeos on h19: ok
Dec 16 09:31:45 h19 pacemaker-controld[4428]: notice: Initiating start operation prm_xen_test-jeos_start_0 locally on h19
Dec 16 09:31:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos start (call 373, PID 21005) exited with status 0 (execution time 2715ms, queue time 0ms)
Dec 16 09:31:47 h19 pacemaker-controld[4428]: notice: Result of start operation for prm_xen_test-jeos on h19: ok
Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020

Amazingly, manual migration using virsh worked:

virsh migrate --live test-jeos xen+tls://h18...

(Two rough command sketches, one for checking and cleaning up the duplicate-VM state and one for replaying the saved scheduler input, follow at the end of this mail, below the quoted messages.)

Regards,
Ulrich Windl

>>> Ulrich Windl wrote on 14.12.2020 at 15:21 in message <5FD774CF.8DE : 161 : 60728>:
> Hi!
>
> I think I found the problem why a VM is started on two nodes:
>
> Live migration had failed (e.g. away from h16), so the cluster uses stop and start (stop on h16, start on h19, for example).
> When rebooting h16, I see these messages (h19 is DC):
>
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h16 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Resource prm_xen_test-jeos is active on 2 nodes (attempting recovery)
>
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: notice: * Restart prm_xen_test-jeos ( h16 )
>
> THIS IS WRONG: h16 was just booted, so no VM is running on h16 (unless there was some autostart from libvirt; "virsh list --autostart" does not list any).
>
> Dec 14 15:09:27 h16 VirtualDomain(prm_xen_test-jeos)[4850]: INFO: Domain test-jeos already stopped.
>
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Calculated transition 669 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-4.bz2
>
> What's going on here?
>
> Regards,
> Ulrich
>
> >>> Ulrich Windl wrote on 14.12.2020 at 08:15 in message <5FD7110D.D09 : 161 : 60728>:
> > Hi!
> >
> > Another word of warning regarding VirtualDomain: While configuring a 3-node cluster with SLES15 SP2 for Xen PVM (using libvirt and the VirtualDomain RA), I had created a test VM using BtrFS.
> > At some point during testing the cluster ended up with the test VM running on more than one node (for reasons still to be examined). Only after a "crm resource refresh" (reprobe) did the cluster try to fix the problem.
> > Well, at some point the VM would not start any more, because the BtrFS used for everything (the SLES default) was corrupted in a way that seems unrecoverable, independently of how many subvolumes and snapshots of those may exist.
> >
> > Initially I would guess the libvirt stack and VirtualDomain are less reliable than the old Xen method and RA.
> >
> > Regards,
> > Ulrich
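P.S.: For reference, a rough sketch of how such a double-active state can be checked and cleaned up by hand. This assumes crmsh plus the stock Pacemaker command-line tools; the resource, domain and node names are the ones from the logs above:

# On each node, see which domains libvirt itself reports as running:
virsh list --all

# Ask the cluster (i.e. the recorded CIB status) where it believes the resource is active:
crm_resource --locate --resource prm_xen_test-jeos

# Re-probe the resource and clear its operation history, so the scheduler
# works from the actual state instead of the stale failed migrate_to record:
crm resource refresh prm_xen_test-jeos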
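And a rough sketch of how the "active on 2 nodes" decision can be examined offline from the saved scheduler input mentioned in the quoted message, assuming the stock crm_simulate tool and the file name from the log:

# Replay the saved scheduler input and show the transition it calculates:
crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-error-4.bz2

# The status section of that file contains the recorded operation history
# (including the failed migrate_to) that the scheduler based its decision on.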
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/