[Bug 1848497] Re: virtio-balloon change breaks migration from qemu prior to 4.0

Christian Ehrhardt  Mon, 28 Oct 2019 00:01:56 -0700

Now that Focal is open, I have opened a proper Focal MP replacing the old one,
and an Eoan SRU MP right away.
=> https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+merge/374770
=> https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+merge/374771


** Description changed:

+ [Impact]
+ 
+  * Due to a bug in qemu 4.0 the config size for virtio-balloon changed.
+  * This breaks migration from pre-4.0 qemu because the PCI BAR size is
+    affected.
+ 
+  * Upstream noticed this and fixed it in 4.1; this backports the fix
+    to qemu 4.0 in Ubuntu Eoan.
+ 
+ [Test Case]
+ 
+  * Take a pre-eoan (pre qemu 4.0) guest and check that your setup can
+    migrate it back and forth with an eoan/qemu-4.0 target.
+    Then add a virtio-balloon device to the guest on pre-4.0 and migrate it
+    again.
+    Unfixed, the following error will show up:
+    get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0
+ 
+  * Unfixed -> Fixed qemu 4.0 migrations should work as well. The other
+    way around could work (the size didn't change), but there are no
+    guarantees (no logic in the target).
+ 
+ [Regression Potential]
+ 
+  * Messing with machine types is always dangerous, as in case of a mistake
+    things get even more complex. But in this case things seem rather
+    straightforward. Pre-4.0 code all behaves the same, only 4.0 gets the
+    new attribute set, and later code has logic to handle dynamic sizes.
+    That way I think we are safe from machine-type regressions.
+  * As for the change in behavior, it changes pre-4.0 migrations, which atm
+    are broken if a virtio-balloon device is present. There is nothing more
+    to break in that use case, and if such a device isn't present it
+    shouldn't change anything. Therefore IMHO safe again.
+ 
+ [Other Info]
+  
+  * n/a
+ 
+ 
+ ---
+ 
+ 
  Related but not the same as bug 1838569 which had two error signatures.
  The first being covered there and the second handled here.
  
  --- ---
  Quote from https://bugs.launchpad.net/cloud-archive/+bug/1838569/comments/4
  Daniel 'f0o' Preussker (dpreussker) wrote 1 hour ago: #4
  With recent release of OpenStack Train this issue reappears...
  
  Upgrading from Stein to Train will require all VMs to be hard-rebooted
  to be migrated as a final step because Live Migration fails with:
  
  Oct 17 10:28:43 h2.1.openstack.r0cket.net libvirtd[1545]: Unable to read from monitor: Connection reset by peer
  Oct 17 10:28:43 h2.1.openstack.r0cket.net libvirtd[1545]: internal error: qemu unexpectedly closed the monitor: 2019-10-17T10:28:42.981201Z qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0
  2019-10-17T10:28:42.981250Z qemu-system-x86_64: Failed to load PCIDevice:config
  2019-10-17T10:28:42.981263Z qemu-system-x86_64: Failed to load virtio-balloon:virtio
  2019-10-17T10:28:42.981272Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:05.0/virtio-balloon'
  2019-10-17T10:28:42.981391Z qemu-system-x86_64: warning: TSC frequency mismatch between VM (2532609 kHz) and host (2532608 kHz), and TSC scaling unavailable
  2019-10-17T10:28:42.983157Z qemu-system-x86_64: warning: TSC frequency mismatch between VM (2532609 kHz) and host (2532608 kHz), and TSC scaling unavailable
  2019-10-17T10:28:42.983672Z qemu-system-x86_64: load of migration failed: Invalid argument
  
  --- ---
- 
  
  Identified as:
  Dr. David Alan Gilbert (dgilbert-h) wrote 1 hour ago: #5
  Daniel: That's a different problem; 'Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0'; so should probably be a separate bug.
  
  I'd bet on this being the one fixed by
  2bbadb08ce272d65e1f78621002008b07d1e0f03
  
- 
  --- ---
  
  That fix is only in qemu 4.1, so this is an open bug for
  Ubuntu and Cloud Archive.
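
For readers trying to decode the "Bad config data" line quoted above, here is
a minimal Python sketch that parses its fields and recomputes which bits
disagree between source and target. The comparison formula
((read ^ device) & cmask & ~wmask & ~w1cmask) is an assumption based on how
the check in qemu's hw/pci/pci.c is commonly understood, and the helper names
parse_bad_config and mismatched_bits are hypothetical, not part of qemu. In a
standard PCI type 0 header, offset 0x10 is the first byte of BAR0, which fits
the BAR size explanation in [Impact].

```python
import re

# Error text copied from the report above.
LINE = ("get_pci_config_device: Bad config data: i=0x10 read: a1 "
        "device: 1 cmask: ff wmask: c0 w1cmask:0")

# All values in the message are hexadecimal without a 0x prefix,
# except the offset i, which carries one.
PATTERN = re.compile(
    r"i=0x(?P<i>[0-9a-f]+) read: (?P<read>[0-9a-f]+) "
    r"device: (?P<device>[0-9a-f]+) cmask: (?P<cmask>[0-9a-f]+) "
    r"wmask: (?P<wmask>[0-9a-f]+) w1cmask:\s*(?P<w1cmask>[0-9a-f]+)"
)


def parse_bad_config(line):
    """Return the numeric fields of a 'Bad config data' line as a dict."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a 'Bad config data' line")
    return {name: int(value, 16) for name, value in match.groupdict().items()}


def mismatched_bits(fields):
    """Bits that differ and are checked but not guest-writable (assumed formula)."""
    return ((fields["read"] ^ fields["device"]) & fields["cmask"]
            & ~fields["wmask"] & ~fields["w1cmask"] & 0xFF)


fields = parse_bad_config(LINE)
print(hex(fields["i"]), hex(mismatched_bits(fields)))
```

For the line in this report the sketch prints 0x10 0x20: a single read-only
bit in the low byte of BAR0 differs between the incoming stream and the
target device.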

** Description changed:

  [Impact]
  
   * Due to a bug in qemu 4.0 the config size for virtio-balloon changed.
   * This breaks migration from pre-4.0 qemu because the PCI BAR size is
     affected.
  
   * Upstream noticed this and fixed it in 4.1; this backports the fix
     to qemu 4.0 in Ubuntu Eoan.
  
  [Test Case]
  
   * Take a pre-eoan (pre qemu 4.0) guest and check that your setup can
     migrate it back and forth with an eoan/qemu-4.0 target.
+    Note: (always) use a versioned machine type like pc-i440fx-disco (also
+    the default if you use disco as source).
     Then add a virtio-balloon device to the guest on pre-4.0 and migrate it
     again.
     Unfixed, the following error will show up:
     get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0
  
   * Unfixed -> Fixed qemu 4.0 migrations should work as well. The other
     way around could work (the size didn't change), but there are no
     guarantees (no logic in the target).
  
  [Regression Potential]
  
   * Messing with machine types is always dangerous, as in case of a mistake
     things get even more complex. But in this case things seem rather
     straightforward. Pre-4.0 code all behaves the same, only 4.0 gets the
     new attribute set, and later code has logic to handle dynamic sizes.
     That way I think we are safe from machine-type regressions.
   * As for the change in behavior, it changes pre-4.0 migrations, which atm
     are broken if a virtio-balloon device is present. There is nothing more
     to break in that use case, and if such a device isn't present it
     shouldn't change anything. Therefore IMHO safe again.
  
  [Other Info]
  
   * n/a
  

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1848497

Title:
  virtio-balloon change breaks migration from qemu prior to 4.0

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1848497/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
