Re: [pve-devel] successfull migration but failed resume
Hi,

I have tested with both nodes running the latest pve-kernel 3.10 (without the specific xsave patch), and good news: no more 63XX-61XX migration hang!

It would be great if you could test with it too :)

----- Original message -----
From: Alexandre DERUMIER aderum...@odiso.com
To: Michael Rasmussen m...@datanom.net
Cc: pve-devel@pve.proxmox.com
Sent: Friday, 29 August 2014 17:23:31
Subject: Re: [pve-devel] successfull migration but failed resume

> > From which CPU generation has AMD introduced the cpu flag xsave?
>
> I see it on Opteron 63XX, but not 61XX.
> BTW, does it work for you with the current 3.10 kernel? (which doesn't have the xsave patch yet)

___
pve-devel mailing list
pve-devel@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
Re: [pve-devel] successfull migration but failed resume
> I might be able to do some tests but I have to take this E5-2640 server out from this production cluster and create a new test cluster. It takes some days until I rearrange things. If that's fine I'm okay.
> Does this mean I have to re-install proxmox 3.1 on both cluster nodes?

If you remove a node from a cluster, yes, it's better to reinstall it before joining a new cluster.
(BTW: it's Proxmox 3.2, right? Not 3.1?)

It would be great to test with the current 3.10 kernel.

----- Original message -----
From: Christian Tari christ...@zaark.com
To: Alexandre DERUMIER aderum...@odiso.com
Sent: Friday, 29 August 2014 15:19:10
Subject: Re: [pve-devel] successfull migration but failed resume

Good. At least we are on track.
I might be able to do some tests but I have to take this E5-2640 server out from this production cluster and create a new test cluster. It takes some days until I rearrange things. If that's fine I'm okay.
Does this mean I have to re-install proxmox 3.1 on both cluster nodes?

//Christian

On 29 Aug 2014, at 15:08, Alexandre DERUMIER aderum...@odiso.com wrote:

> Can it lead to issues if we migrate between two different architectures? BTW the former is an HP DL360 G8, the latter is an HP DL380 G7.

I have the same bug with AMD Opteron 63XX - 61XX, I think because of a KVM bug with the xsave CPU flag, which exists on 63XX and not on 61XX.
https://lkml.org/lkml/2014/2/22/58

It seems to be your case too:
E5-2640 0 @ 2.50GHz: xsave
E5645 @ 2.40GHz: no xsave

Does the migration work in the reverse direction?

I have a kernel 3.10 patch for this xsave bug, but I haven't tested it yet.
I don't know if you could test it?

----- Original message -----
From: Christian Tari christ...@zaark.com
To: Alexandre DERUMIER aderum...@odiso.com
Sent: Friday, 29 August 2014 14:16:59
Subject: Re: [pve-devel] successfull migration but failed resume

Yes, the default, kvm64.
Can it lead to issues if we migrate between two different architectures? BTW the former is an HP DL360 G8, the latter is an HP DL380 G7.
The strange thing is that it doesn't happen every time. In particular, after a failed migration the subsequent migrations always work. It often happens with instances with relatively high memory usage (6-18 GB). Can it be some timeout while the content of the memory is being transferred?

Aug 29 11:37:42 ERROR: migration finished with problems (duration 00:04:23)

//Christian

On 29 Aug 2014, at 14:08, Alexandre DERUMIER aderum...@odiso.com wrote:

and your guest CPU is kvm64?

----- Original message -----
From: Christian Tari christ...@zaark.com
To: Alexandre DERUMIER aderum...@odiso.com
Sent: Friday, 29 August 2014 13:02:15
Subject: Re: [pve-devel] successfull migration but failed resume

Source host:

    processor : 11
    vendor_id : GenuineIntel
    cpu family : 6
    model : 45
    model name : Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
    stepping : 7
    cpu MHz : 2493.793
    cache size : 15360 KB
    physical id : 0
    siblings : 12
    core id : 5
    cpu cores : 6
    apicid : 11
    initial apicid : 11
    fpu : yes
    fpu_exception : yes
    cpuid level : 13
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
    bogomips : 4987.58
    clflush size : 64
    cache_alignment : 64
    address sizes : 46 bits physical, 48 bits virtual
    power management:

    # pveversion
    pve-manager/3.2-1/1933730b (running kernel: 2.6.32-27-pve)

Target host:

    processor : 11
    vendor_id : GenuineIntel
    cpu family : 6
    model : 44
    model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
    stepping : 2
    cpu MHz : 2399.404
    cache size : 12288 KB
    physical id : 1
    siblings : 12
    core id : 9
    cpu cores : 6
    apicid : 50
    initial apicid : 50
    fpu : yes
    fpu_exception : yes
    cpuid level : 11
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
    bogomips : 4798.17
    clflush size : 64
    cache_alignment : 64
    address sizes : 40 bits physical, 48 bits virtual
    power management:

    # pveversion
    pve-manager/3.2-4/e24a91c1 (running kernel: 2.6.32-29-pve)

//Christian

On 29 Aug 2014, at 12:56, Alexandre DERUMIER aderum...@odiso.com wrote:

> Aug 29 11:37:39 ERROR
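The flag comparison made by eye in this thread (xsave present on the E5-2640, absent on the E5645) can be automated. The following is a minimal sketch, not part of Proxmox: the function names `parse_flags` and `missing_on_target` are made up for illustration, and the two inputs are shortened excerpts of the `flags` lines posted above.

```python
def parse_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_target(source_cpuinfo, target_cpuinfo):
    """Flags the source host has and the target host lacks, sorted by name."""
    return sorted(parse_flags(source_cpuinfo) - parse_flags(target_cpuinfo))


# Shortened excerpts of the two hosts' cpuinfo output quoted in the thread.
SOURCE_EXCERPT = (
    "model name : Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz\n"
    "flags : sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx "
    "lahf_lm ida arat epb xsaveopt pln pts dts\n"
)
TARGET_EXCERPT = (
    "model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz\n"
    "flags : sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts\n"
)

if __name__ == "__main__":
    for flag in missing_on_target(SOURCE_EXCERPT, TARGET_EXCERPT):
        print(flag)
```

On real hosts the two inputs would be the full contents of /proc/cpuinfo from each node. Any flag reported here is a candidate for trouble when migrating from source to target, which is why the reverse direction is worth testing separately.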
Re: [pve-devel] successfull migration but failed resume
Note: I have just received some new Opteron servers, so I'll run tests next week :)

----- Original message -----
From: Alexandre DERUMIER aderum...@odiso.com
To: Christian Tari christ...@zaark.com
Cc: pve-devel@pve.proxmox.com
Sent: Friday, 29 August 2014 16:14:09
Subject: Re: [pve-devel] successfull migration but failed resume

> If you remove a node from a cluster, yes, it's better to reinstall it before joining a new cluster.
> (BTW: it's Proxmox 3.2, right? Not 3.1?)
>
> It would be great to test with the current 3.10 kernel.
Re: [pve-devel] successfull migration but failed resume
On Fri, 29 Aug 2014 17:11:08 +0200 (CEST) Alexandre DERUMIER aderum...@odiso.com wrote:

> Note, I just receive some new opteron servers, so I'll do tests next week :)

As mentioned before, I had the same problems migrating from Opteron to Phenom and Athlon II based CPUs.

In which CPU generation did AMD introduce the xsave CPU flag?

--
Hilsen/Regards
Michael Rasmussen

Get my public GnuPG keys:
michael at rasmussen dot cc
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xD3C9A00E
mir at datanom dot net
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xE501F51C
mir at miras dot org
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xE3E80917
--
/usr/games/fortune -es says:
The world is full of people who have never, since childhood, met an open doorway with an open mind.
-- E. B. White
Re: [pve-devel] successfull migration but failed resume
> From which CPU generation has AMD introduced the cpu flag xsave?

I see it on Opteron 63XX, but not on 61XX.

BTW, does it work for you with the current 3.10 kernel? (which doesn't have the xsave patch yet)

----- Original message -----
From: Michael Rasmussen m...@datanom.net
To: pve-devel@pve.proxmox.com
Sent: Friday, 29 August 2014 17:21:26
Subject: Re: [pve-devel] successfull migration but failed resume
Re: [pve-devel] successfull migration but failed resume
On Fri, 29 Aug 2014 17:23:31 +0200 (CEST) Alexandre DERUMIER aderum...@odiso.com wrote:

> > From which CPU generation has AMD introduced the cpu flag xsave?
>
> I see it on Opteron 63XX, but not 61XX.

Just found it here: Family 15h and up.
https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2013-2076

--
Hilsen/Regards
Michael Rasmussen
--
/usr/games/fortune -es says:
Writing free verse is like playing tennis with the net down.
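As a small illustration of the "Family 15h and up" answer: /proc/cpuinfo prints the `cpu family` field in decimal, so family 15h shows up as 21. The sketch below only encodes the statement made in this thread. The constant and function name are invented for the example, and the threshold is taken from the message above rather than checked against AMD documentation.

```python
# Threshold as reported in the thread: xsave on AMD "Family 15h and up".
XSAVE_MIN_AMD_FAMILY = 0x15  # 21 in decimal, as printed by /proc/cpuinfo


def amd_family_has_xsave(cpu_family):
    """cpu_family is the decimal 'cpu family' value read from /proc/cpuinfo."""
    return cpu_family >= XSAVE_MIN_AMD_FAMILY


if __name__ == "__main__":
    for family in (0x10, 0x15):
        print(f"family {family:#x} ({family}): xsave expected = {amd_family_has_xsave(family)}")
```

Checking the actual `flags` line on each node remains the more reliable test, since a family number alone says nothing about what the kernel or hypervisor exposes.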
Re: [pve-devel] successfull migration but failed resume
> Yes, that is what I thought about.

Another possibility is a race condition with the file rename.
If the file has been renamed on node1 but the rename is not yet visible on node2, the qmp resume will fail with:

    unable to find configuration file for VM xxx - no such machine

(I don't know how the pve clusterfs works.)

I have sent a patch to the mailing list to display the qm error output in the migration task log.

> The only safe thing is to stop both sides?

Well, it's already safe, because the target process is in the paused state, and the source process is paused at the end of the migration.
So if qm resume fails, I think the user simply needs to resume it manually.

----- Original message -----
From: Dietmar Maurer diet...@proxmox.com
To: Alexandre DERUMIER aderum...@odiso.com
Cc: pve-devel@pve.proxmox.com
Sent: Sunday, 24 February 2013 08:43:18
Subject: RE: [pve-devel] successfull migration but failed resume

> Now, why the 'cont' fails, I really don't know; I can't reproduce it easily.
> What we need to verify is: can we manually resume the target VM if the 'cont' fails?
> Maybe something bad happened during the migration, and the target VM is in a strange state and qmp fails?

Yes, that is what I thought about. The only safe thing is to stop both sides?
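If the rename race described above is the cause, one generic mitigation is to wait briefly for the config file to become visible on the target before attempting the resume. This is only a sketch of that idea under stated assumptions, not what Proxmox does. `wait_for_file` is a hypothetical helper, and the path passed to it would be wherever the VM config is expected to appear on the target node.

```python
import os
import time


def wait_for_file(path, timeout=5.0, interval=0.1,
                  exists=os.path.exists, sleep=time.sleep, clock=time.monotonic):
    """Poll until `path` exists or `timeout` seconds have passed.

    Returns True if the file showed up in time, False otherwise.
    `exists`, `sleep` and `clock` are injectable so the loop can be tested
    without touching the filesystem or actually sleeping.
    """
    deadline = clock() + timeout
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Hypothetical path, used here only to exercise the timeout branch.
    print(wait_for_file("/nonexistent/hypothetical/129.conf", timeout=0.3))
```

A bounded wait like this only narrows the window. It does not help if the rename never happens, for example after a host crash, so a failed wait still has to be reported as an error.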
Re: [pve-devel] successfull migration but failed resume
> Another possibility is a race condition with the file rename.
> If the file has been renamed on node1 but the rename is not yet visible on node2, the qmp resume will fail with:
> unable to find configuration file for VM xxx - no such machine
> (I don't know how the pve clusterfs works.)
>
> I have sent a patch to the mailing list to display the qm error output in the migration task log.

thanks.

> > The only safe thing is to stop both sides?
>
> Well, it's already safe, because the target process is in the paused state, and the source process is paused at the end of the migration.
> So if qm resume fails, I think the user simply needs to resume it manually.

Ok
Re: [pve-devel] successfull migration but failed resume
> I've seen this sometimes. Is there any way to see how the output of the ssh command was?

Stefan, when you get this error, do you see a resume task in the pve-manager task list?
If not, that means it stops before fork_worker, so it is not related to the qmp cont command.

I have sent a patch to display errors in the migration task log if an error occurs before fork_worker.

The qm resume code is:

    code => sub {
        my ($param) = @_;

        my $rpcenv = PVE::RPCEnvironment::get();
        my $authuser = $rpcenv->get_user();

        my $node = extract_param($param, 'node');
        my $vmid = extract_param($param, 'vmid');
        my $skiplock = extract_param($param, 'skiplock');

        # only root may bypass the lock
        raise_param_exc({ skiplock => "Only root may use this option." })
            if $skiplock && $authuser ne 'root@pam';

        # runs before the worker is forked, so no resume task is created if it dies
        die "VM $vmid not running\n" if !PVE::QemuServer::check_running($vmid);

        my $realcmd = sub {
            my $upid = shift;

            syslog('info', "resume VM $vmid: $upid\n");

            PVE::QemuServer::vm_resume($vmid, $skiplock);

            return;
        };

        return $rpcenv->fork_worker('qmresume', $vmid, $authuser, $realcmd);
    }});

So it's possible that it fails at

    die "VM $vmid not running\n" if !PVE::QemuServer::check_running($vmid);

because the config file is not yet available.

----- Original message -----
From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
To: pve-devel@pve.proxmox.com
Sent: Friday, 22 February 2013 15:01:25
Subject: [pve-devel] successfull migration but failed resume

Hello,

I've seen this sometimes. Is there any way to see how the output of the ssh command was?

Feb 22 14:48:05 migration speed: 819.20 MB/s - downtime 49 ms
Feb 22 14:48:05 migration status: completed
Feb 22 14:48:06 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@10.255.0.20 qm resume 129 --skiplock' failed: exit code 2
Feb 22 14:48:07 ERROR: migration finished with problems (duration 00:00:10)

Greets,
Stefan
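The pattern in this log (migration status "completed", then a failed `qm resume` over ssh) is easy to miss in a long task log. Below is a minimal sketch of a scanner for it. The function name `find_failed_resume` and the regular expression are assumptions made for this example, written against the four log lines quoted above rather than against any documented log format.

```python
import re

# Matches the failed-resume line as it appears in the log quoted in this thread.
RESUME_FAIL = re.compile(
    r"ERROR: command '.*\bqm resume (\d+)\b.*' failed: exit code (\d+)"
)


def find_failed_resume(log_lines):
    """Return (vmid, exit_code) if a completed migration is followed by a
    failed 'qm resume' command, otherwise None."""
    completed = False
    for line in log_lines:
        if "migration status: completed" in line:
            completed = True
            continue
        match = RESUME_FAIL.search(line)
        if completed and match:
            return int(match.group(1)), int(match.group(2))
    return None


SAMPLE_LOG = [
    "Feb 22 14:48:05 migration speed: 819.20 MB/s - downtime 49 ms",
    "Feb 22 14:48:05 migration status: completed",
    "Feb 22 14:48:06 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' "
    "root@10.255.0.20 qm resume 129 --skiplock' failed: exit code 2",
    "Feb 22 14:48:07 ERROR: migration finished with problems (duration 00:00:10)",
]

if __name__ == "__main__":
    print(find_failed_resume(SAMPLE_LOG))
```

A hit means the memory transfer itself finished and only the final resume step failed, which is the case where the target VM may simply be sitting in the paused state.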
Re: [pve-devel] successfull migration but failed resume
Hi Alexandre,

On 24.02.2013 09:34, Alexandre DERUMIER wrote:

> > I've seen this sometimes. Is there any way to see how the output of the ssh command was?
>
> Stefan, when you have this error, do you see a resume task in pve-manager task list ?

Yes, I have a resume task and this task shows status OK. But the migration task says failed.

Stefan
Re: [pve-devel] successfull migration but failed resume
Hi,

what is the problem / disadvantage of doing it this way:

1.) Don't use -S, so the VM starts directly after being migrated (this reduces downtime by maybe 1 s for the ssh resume step).
2.) We move the config file at the beginning of the migration.
3.) If the source host crashes during the migration, the source kvm process is dead anyway, so starting on the new target won't be a problem.
4.) If the target host crashes during the migration, the source host will detect this, abort the migration and move the config back.

Greets,
Stefan

On 24.02.2013 13:48, Stefan Priebe wrote:

> Yes i have a resume task and this task show status OK. But the migration task says failed.
Re: [pve-devel] successfull migration but failed resume
> 4.) if the target host crashes while migrating the source host will detect this and abort the migration + move the config back.

This is technically not possible - how do you detect that?
Re: [pve-devel] successfull migration but failed resume
The config file should always stay with the running kvm process.
Otherwise you'll lose graph stats, for example, during the migration (not everybody has a 10 Gb link, so a migration can take time).

And for 4), if something bad happens and the config is not moved back, you will have a phantom kvm process running on the source.

----- Original message -----
From: Stefan Priebe s.pri...@profihost.ag
To: Alexandre DERUMIER aderum...@odiso.com, Dietmar Maurer diet...@proxmox.com
Cc: pve-devel@pve.proxmox.com
Sent: Sunday, 24 February 2013 14:11:42
Subject: Re: [pve-devel] successfull migration but failed resume
Re: [pve-devel] successfull migration but failed resume
> Yes i have a resume task and this task show status OK. But the migration task says failed.

Damn, this is strange...

And what is the state of the target VM? Paused? Crashed?

----- Original message -----
From: Stefan Priebe s.pri...@profihost.ag
To: Alexandre DERUMIER aderum...@odiso.com
Cc: pve-devel@pve.proxmox.com
Sent: Sunday, 24 February 2013 13:48:05
Subject: Re: [pve-devel] successfull migration but failed resume
Re: [pve-devel] successfull migration but failed resume
On 24.02.2013 14:44, Dietmar Maurer wrote:

> > 4.) if the target host crashes while migrating the source host will detect this and abort the migration + move the config back.
>
> This is technically not possible - how do you detect that?

Mhm, good question... it was just a spontaneous idea. I thought the source host wouldn't acknowledge the end of the migration via qmp.

Stefan
Re: [pve-devel] successfull migration but failed resume
Hi,

On 24.02.2013 14:44, Alexandre DERUMIER wrote:

> The config file should always be with the kvm running.
> Or you'll lost graphs stats for example during the migration. (not everybody have 10gb link, so migration can take time)
>
> and 4) if something bad happen and config is not moving back, you will have a phantom running kvm on source.

Yes, sure. Makes sense.

Stefan
Re: [pve-devel] successfull migration but failed resume
On 24.02.2013 14:51, Alexandre DERUMIER wrote:

> > Yes i have a resume task and this task show status OK. But the migration task says failed.
>
> Damn, this is strange...
>
> and how is the state of the target vm ? paused ? crashed ?

No idea, as Proxmox kills the target kvm process if the migration fails. But it did not crash; I don't see a segfault. Most probably paused.

Stefan
Re: [pve-devel] successfull migration but failed resume
> No idea as proxmox kills the target kvm proces if the migration fails.

Not true for the last phase: the resume is done in phase3_cleanup.
I have tested this by using a die instead of the qmp cont command in the resume task: the migration task finishes with an error, but the target VM is paused. I just need to resume it.

> But not crashed i don't see a segfault. Most probably paused.

Maybe my patch will show more info if it happens again...

----- Original message -----
From: Stefan Priebe s.pri...@profihost.ag
To: Alexandre DERUMIER aderum...@odiso.com
Cc: pve-devel@pve.proxmox.com
Sent: Sunday, 24 February 2013 20:09:31
Subject: Re: [pve-devel] successfull migration but failed resume
Re: [pve-devel] successfull migration but failed resume
But isn't it a simple rename right now? Under which circumstances can this fail?

On 23.02.2013 at 08:05, Alexandre DERUMIER aderum...@odiso.com wrote:

> yes

Well, not really: if the migration fails, the target VM process is always killed, so it's not a problem.

The problem is when the target VM has been correctly migrated, but the VM config file is still on the first node (the time window is very short, maybe 1 s).
In this case you have a phantom kvm process on the target: the user doesn't see the VM on the target node and can start the VM again on the first node, and boom.

It was really a problem last year. If I remember correctly, the VM config file was moved at the beginning of the migration, and we killed the source VM when the migration failed.
But killing the source VM didn't always work, so we had a phantom process on the source, and the user could start the VM again on the target, and boom.

So this is why we start the VM paused.

But the risk currently lies in the short time window at the end of the migration, when we need to move the config file.

Ideas are welcome to improve this ;)

----- Original message -----
From: Dietmar Maurer diet...@proxmox.com
To: Stefan Priebe - Profihost AG s.pri...@profihost.ag
Cc: Alexandre DERUMIER aderum...@odiso.com, pve-devel@pve.proxmox.com
Sent: Friday, 22 February 2013 20:04:37
Subject: RE: [pve-devel] successfull migration but failed resume

> Mhm but in cases like mine we have no running vm on both sides. So are you sure that when migrating there could be a reason to have two vms running?

yes
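The "phantom" situation described above (a kvm process running on a node that does not hold the VM's config file) can be stated as a simple consistency check. The sketch below is purely illustrative. The data structures, node names and the function `find_phantoms` are hypothetical, and how the per-node process and config lists would be collected is left out.

```python
def find_phantoms(running, configs):
    """Return sorted (node, vmid) pairs for kvm processes whose config file
    is not present on the node where the process runs.

    running: dict mapping node name -> set of VM IDs with a kvm process there
    configs: dict mapping node name -> set of VM IDs whose config is on that node
    """
    phantoms = []
    for node, vmids in running.items():
        local_configs = configs.get(node, set())
        for vmid in vmids:
            if vmid not in local_configs:
                phantoms.append((node, vmid))
    return sorted(phantoms)


if __name__ == "__main__":
    # State inside the short window at the end of a migration: the VM already
    # runs on node2, but its config file has not been moved from node1 yet.
    running = {"node1": set(), "node2": {129}}
    configs = {"node1": {129}, "node2": set()}
    print(find_phantoms(running, configs))
```

The danger named in the thread follows directly from this state: the VM appears stopped on node1, where its config lives, so nothing prevents a user from starting a second instance there.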
Re: [pve-devel] successfull migration but failed resume
> But isn't it a simple rename right now? Under which circumstances can this fail?

Yes, it's just a rename, so the chance of failure is very small. A host crash between the end of the migration and the file rename could cause a problem, for example.

----- Original message -----
From: Stefan Priebe - Profihost AG <s.pri...@profihost.ag>
To: Alexandre DERUMIER <aderum...@odiso.com>
Cc: Dietmar Maurer <diet...@proxmox.com>, pve-devel@pve.proxmox.com
Sent: Saturday 23 February 2013 09:29:00
Subject: Re: [pve-devel] successfull migration but failed resume

But isn't it a simple rename right now? Under which circumstances can this fail?

On 23.02.2013 at 08:05, Alexandre DERUMIER <aderum...@odiso.com> wrote:

>> yes
>
> Well, not really: if the migration fails, the target VM process is always killed, so that is not a problem.
>
> The problem is when the target VM has migrated correctly but the VM config file is still kept on the first node (the time window is very short, maybe 1 s). In this case you have a phantom kvm process on the target: the user doesn't see the VM on the target node and can start the VM again on the first node, and boom.
>
> It was really a problem last year. If I remember correctly, the VM config file was moved at the beginning of the migration, and we killed the source VM when the migration failed. But killing the source VM didn't always work, so we had a phantom process on the source, and the user could start the VM again on the target, and boom.
>
> So this is why we start in paused mode. But the risk currently lies in the short time frame at the end of the migration, when we need to move the config file.
>
> Ideas are welcome to improve this ;)
>
> ----- Original message -----
> From: Dietmar Maurer <diet...@proxmox.com>
> To: Stefan Priebe - Profihost AG <s.pri...@profihost.ag>
> Cc: Alexandre DERUMIER <aderum...@odiso.com>, pve-devel@pve.proxmox.com
> Sent: Friday 22 February 2013 20:04:37
> Subject: RE: [pve-devel] successfull migration but failed resume
>
>> Mhm, but in cases like mine we have no VM running on either side. So are you sure that there could be a reason to have two VMs running when migrating?
>
> yes
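To make the window discussed above concrete, here is a minimal Python sketch. It is not the Proxmox implementation: it only assumes that the config move is a single rename within one filesystem. The directory layout and the helper names `move_config` and `owner_of` are made up for illustration.

```python
import os
import tempfile


def move_config(root, vmid, source, target):
    """Move <vmid>.conf from the source node directory to the target one.

    On POSIX, os.rename within one filesystem is a single step: the file
    is visible under the source node or under the target node, never both.
    """
    src = os.path.join(root, source, f"{vmid}.conf")
    dst_dir = os.path.join(root, target)
    os.makedirs(dst_dir, exist_ok=True)
    dst = os.path.join(dst_dir, f"{vmid}.conf")
    os.rename(src, dst)
    return dst


def owner_of(root, vmid, nodes):
    """Return the nodes whose directory currently holds the config."""
    return [n for n in nodes
            if os.path.exists(os.path.join(root, n, f"{vmid}.conf"))]


def demo():
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "node-a"))
        with open(os.path.join(root, "node-a", "129.conf"), "w") as f:
            f.write("memory: 1024\n")
        before = owner_of(root, 129, ["node-a", "node-b"])
        # A crash at this point (migration done, rename not yet run)
        # leaves the config on node-a while the guest lives on node-b.
        move_config(root, 129, "node-a", "node-b")
        after = owner_of(root, 129, ["node-a", "node-b"])
        return before, after
```

The rename itself is not the weak spot; the exposure is the gap between the moment the migration completes and the moment the rename is issued.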
Re: [pve-devel] successfull migration but failed resume
> Maybe we can hack qemu, and do the file rename from qemu, at the end of the migration?

In migration.c:

    static void migrate_fd_completed(MigrationState *s)
    {
        DPRINTF("setting completed state\n");
        if (migrate_fd_cleanup(s) < 0) {
            s->state = MIG_STATE_ERROR;
        } else {
            s->state = MIG_STATE_COMPLETED;
            /* move config file here */
            runstate_set(RUN_STATE_POSTMIGRATE);
        }
        notifier_list_notify(&migration_state_notifiers, s);
    }

----- Original message -----
From: Alexandre DERUMIER <aderum...@odiso.com>
To: Michael Rasmussen <m...@datanom.net>
Cc: pve-devel@pve.proxmox.com
Sent: Saturday 23 February 2013 12:44:50
Subject: Re: [pve-devel] successfull migration but failed resume

I think the main problem is that qemu does the switch itself (when all memory is transferred, the source process is paused and the target process continues).

But we move the config file after checking the migration status in a loop with a qmp command, with a sleep of some ms. So it's possible that the migration finishes some milliseconds before we see it and move the file. The qmp check command could also fail.

Maybe we can hack qemu, and do the file rename from qemu, at the end of the migration?

Dietmar, what do you think about this?

----- Original message -----
From: Michael Rasmussen <m...@datanom.net>
To: pve-devel@pve.proxmox.com
Sent: Saturday 23 February 2013 11:21:35
Subject: Re: [pve-devel] successfull migration but failed resume

On Sat, 23 Feb 2013 08:05:35 +0100 (CET) Alexandre DERUMIER <aderum...@odiso.com> wrote:

> Ideas are welcome to improve this ;)

Since it will always be a deal between two nodes, you could implement something like the TCP 3-way handshake (SYN, SYN-ACK, ACK):

1. Node A sends a Migrate SYNchronize packet to Node B.
2. Node B receives A's SYN.
3. Node B sends a SYNchronize-ACKnowledgement.
4. Node A receives B's SYN-ACK.
5. Node A sends ACKnowledge.
6. Node B receives ACK.

Migration is completed.
--
Hilsen/Regards
Michael Rasmussen
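The handshake proposed above can be sketched as a small state machine. This is only an in-memory illustration in Python, with no network code and no tie to qemu or Proxmox; the class and state names (`Node`, `State`, `handshake`) are hypothetical.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    SYN_SENT = "syn-sent"
    SYN_RECEIVED = "syn-received"
    DONE = "done"


class Node:
    def __init__(self, name):
        self.name = name
        self.state = State.IDLE

    def start(self):
        """Node A opens the handshake."""
        self.state = State.SYN_SENT
        return "SYN"

    def receive(self, message):
        """Handle one message; return the reply, or None if there is none."""
        if message == "SYN" and self.state is State.IDLE:
            self.state = State.SYN_RECEIVED
            return "SYN-ACK"
        if message == "SYN-ACK" and self.state is State.SYN_SENT:
            self.state = State.DONE
            return "ACK"
        if message == "ACK" and self.state is State.SYN_RECEIVED:
            self.state = State.DONE
            return None
        raise ValueError(
            f"{self.name}: unexpected {message!r} in state {self.state.value}")


def handshake(a, b):
    """Run the exchange in memory and return the messages in order."""
    log = []
    message, sender, receiver = a.start(), a, b
    while message is not None:
        log.append((sender.name, message))
        message = receiver.receive(message)
        sender, receiver = receiver, sender
    return log
```

Neither node reaches `DONE` on its own: A needs B's SYN-ACK and B needs A's ACK. A real implementation would also need timeouts for lost messages, which this sketch leaves out.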
Re: [pve-devel] successfull migration but failed resume
> Sounds easy enough. And that solves the whole problem?

It solves 99.99% of it. With this, it's safe to launch the target VM without -S.

The last thing is that just after the end of the migration, the source kvm process is paused (no more access to disk), but we only stop it at the end of phase 3.

So if the source host crashes, it's not a problem. If the proxmox task crashes (after the migration and before the stop), we can have a phantom kvm process on the source node, but it does nothing.

Note that it is already like this now.

----- Original message -----
From: Dietmar Maurer <diet...@proxmox.com>
To: Alexandre DERUMIER <aderum...@odiso.com>, Michael Rasmussen <m...@datanom.net>
Cc: pve-devel@pve.proxmox.com
Sent: Saturday 23 February 2013 14:55:17
Subject: RE: [pve-devel] successfull migration but failed resume

> Maybe we can hack qemu, and do the file rename from qemu, at the end of the migration?
>
> Dietmar, what do you think about this?

Sounds easy enough. And that solves the whole problem?
Re: [pve-devel] successfull migration but failed resume
> I think this last point can be resolved by hacking qemu to kill itself after a timeout of X seconds once the migration is finished. I don't know if it's easy to implement.
>
> So if the migration task hangs between the file move and the sending of the qmp stop, we have a protection.

Why do we want to change anything (I guess I missed some mails)? If the 'cont' command fails, we should try to find out why?
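For reference, the timeout idea quoted above amounts to a deadline check like the following. This is a hedged Python sketch of the logic only, not qemu code; the class `PostMigrationWatchdog` and its method names are invented for illustration, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class PostMigrationWatchdog:
    """Decide when a paused source process should give up and exit."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.deadline = None  # not armed until the migration completes

    def migration_finished(self):
        """Arm the timer at the moment the migration completes."""
        self.deadline = self.clock() + self.timeout_s

    def stop_received(self):
        """The normal stop arrived in time; disarm the timer."""
        self.deadline = None

    def should_exit(self):
        """True once the timeout has elapsed without a stop."""
        return self.deadline is not None and self.clock() >= self.deadline
```

The value of X is left open in the thread. It would have to be longer than the normal gap between the end of the migration and the stop command, otherwise a healthy migration would trip the timer.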
Re: [pve-devel] successfull migration but failed resume
> Why do we want to change anything (I guess I missed some mails)? If the 'cont' command fails, we should try to find out why?

Yes, sure, it was just a proposal to improve things, mainly for the case where the source host crashes at the end of the migration, or qmp migrate-status hangs, ..., before the file is moved or the resume command is sent. And also to reduce the migration time by some ms.

But of course not for proxmox 2.3 ;)

Now, why the 'cont' fails, I really don't know; I can't reproduce it easily. What we need to verify is whether we can resume the target VM manually if the 'cont' fails. Maybe something bad happened during the migration, the target VM is in a strange state, and qmp fails?

----- Original message -----
From: Dietmar Maurer <diet...@proxmox.com>
To: Alexandre DERUMIER <aderum...@odiso.com>
Cc: pve-devel@pve.proxmox.com
Sent: Sunday 24 February 2013 08:12:03
Subject: RE: [pve-devel] successfull migration but failed resume

> I think this last point can be resolved by hacking qemu to kill itself after a timeout of X seconds once the migration is finished. I don't know if it's easy to implement.
>
> So if the migration task hangs between the file move and the sending of the qmp stop, we have a protection.

Why do we want to change anything (I guess I missed some mails)? If the 'cont' command fails, we should try to find out why?
Re: [pve-devel] successfull migration but failed resume
> Now, why the 'cont' fails, I really don't know; I can't reproduce it easily. What we need to verify is whether we can resume the target VM manually if the 'cont' fails. Maybe something bad happened during the migration, the target VM is in a strange state, and qmp fails?

Yes, that is what I thought about. The only safe thing is to stop both sides?
Re: [pve-devel] successfull migration but failed resume
> root@10.255.0.20 qm resume 129 --skiplock' failed: exit code 2
> Feb 22 14:48:07 ERROR: migration finished with problems (duration 00:00:10)

I have also seen this bug sometimes. I don't know how to display the output, but the command sends the cont command to the qmp socket of the migrated VM to resume it. So maybe it fails to connect to the qmp socket (maybe a retry could help?).

We start with -S (to pause it) because we want to be sure to resume it only after moving the config file.

----- Original message -----
From: Stefan Priebe - Profihost AG <s.pri...@profihost.ag>
To: pve-devel@pve.proxmox.com
Sent: Friday 22 February 2013 15:01:25
Subject: [pve-devel] successfull migration but failed resume

Hello,

I've seen this sometimes. Is there any way to see what the output of the ssh command was?

Feb 22 14:48:05 migration speed: 819.20 MB/s - downtime 49 ms
Feb 22 14:48:05 migration status: completed
Feb 22 14:48:06 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@10.255.0.20 qm resume 129 --skiplock' failed: exit code 2
Feb 22 14:48:07 ERROR: migration finished with problems (duration 00:00:10)

Greets,
Stefan
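The retry suggested above could look roughly like this. It is a Python sketch under assumptions, not what `qm resume` actually does: the helper names `retry`, `qmp_cont` and `resume` are hypothetical, the socket path is whatever the caller passes in, and only the QMP commands `qmp_capabilities` and `cont` are taken from the QMP protocol itself.

```python
import json
import socket
import time


def retry(action, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Only OSError is retried, since that is what a refused or missing
    socket raises; any other exception propagates at once.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise
            sleep(delay_s)


def qmp_cont(reader, writer):
    """Negotiate capabilities, then send 'cont', over line-based streams."""
    json.loads(reader.readline())  # greeting banner sent by the server
    for command in ("qmp_capabilities", "cont"):
        writer.write(json.dumps({"execute": command}) + "\n")
        writer.flush()
        reply = json.loads(reader.readline())
        while "return" not in reply and "error" not in reply:
            reply = json.loads(reader.readline())  # skip async events
        if "error" in reply:
            raise RuntimeError(reply["error"])
    return reply


def resume(socket_path, attempts=3):
    """Connect to a QMP Unix socket and resume the guest, with retries."""
    def attempt():
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.connect(socket_path)
            with sock.makefile("r") as reader, sock.makefile("w") as writer:
                return qmp_cont(reader, writer)
    return retry(attempt, attempts=attempts)
```

A retry only helps if the failure is a transient connection problem. If the target VM is in a bad state, as discussed elsewhere in the thread, the error reply from `cont` is the thing to log, and `qmp_cont` raises it instead of retrying.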
Re: [pve-devel] successfull migration but failed resume
Dietmar, why do we pause?

Stefan

On 22.02.2013 at 15:37, Alexandre DERUMIER <aderum...@odiso.com> wrote:

>> root@10.255.0.20 qm resume 129 --skiplock' failed: exit code 2
>> Feb 22 14:48:07 ERROR: migration finished with problems (duration 00:00:10)
>
> I have also seen this bug sometimes. I don't know how to display the output, but the command sends the cont command to the qmp socket of the migrated VM to resume it. So maybe it fails to connect to the qmp socket (maybe a retry could help?).
>
> We start with -S (to pause it) because we want to be sure to resume it only after moving the config file.
>
> ----- Original message -----
> From: Stefan Priebe - Profihost AG <s.pri...@profihost.ag>
> To: pve-devel@pve.proxmox.com
> Sent: Friday 22 February 2013 15:01:25
> Subject: [pve-devel] successfull migration but failed resume
>
> Hello,
>
> I've seen this sometimes. Is there any way to see what the output of the ssh command was?
>
> Feb 22 14:48:05 migration speed: 819.20 MB/s - downtime 49 ms
> Feb 22 14:48:05 migration status: completed
> Feb 22 14:48:06 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@10.255.0.20 qm resume 129 --skiplock' failed: exit code 2
> Feb 22 14:48:07 ERROR: migration finished with problems (duration 00:00:10)
>
> Greets,
> Stefan
Re: [pve-devel] successfull migration but failed resume
Mhm, but in cases like mine we have no VM running on either side. So are you sure that there could be a reason to have two VMs running when migrating?

Stefan

On 22.02.2013 at 18:51, Dietmar Maurer <diet...@proxmox.com> wrote:

>> Dietmar, why do we pause?
>
> For safety reasons. We want to avoid the same VM running two times.
Re: [pve-devel] successfull migration but failed resume
> yes

Well, not really: if the migration fails, the target VM process is always killed, so that is not a problem.

The problem is when the target VM has migrated correctly but the VM config file is still kept on the first node (the time window is very short, maybe 1 s). In this case you have a phantom kvm process on the target: the user doesn't see the VM on the target node and can start the VM again on the first node, and boom.

It was really a problem last year. If I remember correctly, the VM config file was moved at the beginning of the migration, and we killed the source VM when the migration failed. But killing the source VM didn't always work, so we had a phantom process on the source, and the user could start the VM again on the target, and boom.

So this is why we start in paused mode. But the risk currently lies in the short time frame at the end of the migration, when we need to move the config file.

Ideas are welcome to improve this ;)

----- Original message -----
From: Dietmar Maurer <diet...@proxmox.com>
To: Stefan Priebe - Profihost AG <s.pri...@profihost.ag>
Cc: Alexandre DERUMIER <aderum...@odiso.com>, pve-devel@pve.proxmox.com
Sent: Friday 22 February 2013 20:04:37
Subject: RE: [pve-devel] successfull migration but failed resume

> Mhm, but in cases like mine we have no VM running on either side. So are you sure that there could be a reason to have two VMs running when migrating?

yes