** Description changed:

  [Description]
  
  - Configured a machine with 32 static VCPUs, 160GB of RAM using 1G
  hugepages on a NUMA capable machine.
  
  Domain definition (http://pastebin.ubuntu.com/25121106/)
  
  - Once started (virsh start).
  
  Libvirt log.
  
  LC_ALL=C
  PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
  QEMU_AUDIO_DRV=none /usr/bin/kvm-spice -name reproducer2 -S -machine pc-
  i440fx-2.5,accel=kvm,usb=off -cpu host -m 124928 -realtime mlock=off
  -smp 32,sockets=16,cores=1,threads=2 -object memory-backend-file,id=ram-
  node0,prealloc=yes,mem-
  path=/dev/hugepages/libvirt/qemu,share=yes,size=64424509440,host-
  nodes=0,policy=bind -numa node,nodeid=0,cpus=0-15,memdev=ram-node0
  -object memory-backend-file,id=ram-node1,prealloc=yes,mem-
  path=/dev/hugepages/libvirt/qemu,share=yes,size=66571993088,host-
  nodes=1,policy=bind -numa node,nodeid=1,cpus=16-31,memdev=ram-node1
  -uuid d7a4af7f-7549-4b44-8ceb-4a6c951388d4 -no-user-config -nodefaults
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-
  reproducer2/monitor.sock,server,nowait -mon
  chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
  -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
  -drive
  file=/var/lib/uvtool/libvirt/images/test.qcow,format=qcow2,if=none,id
  =drive-virtio-disk0,cache=none -device virtio-blk-
  pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-
  disk0,bootindex=1 -chardev pty,id=charserial0 -device isa-
  serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -device cirrus-
  vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-
  pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
  
  Then the following error is raised.
  
  virsh start reproducer2
  error: Failed to start domain reproducer2
  error: monitor socket did not show up: No such file or directory
  
+ - The fix is done via backports. TL;DR, the change does two things:
+   1. Instead of sleeping for a very short time (1ms) in a loop for the
+      whole wait, start with a small sleep and increase it exponentially
+      for the few cases that need a long wait. That way fast actions
+      still complete fast, but long actions no longer hog the CPU.
+   2. Huge guests get ~1s of extra timeout per 1GB of memory, which
+      allows them to initialize properly.
+ 
  [Impact]
  
-   * Cannot start virtual machines with large pools of memory allocated 
+   * Cannot start virtual machines with large pools of memory allocated
  on NUMA nodes.
  
  [Test Case]
  
-  * Configure a Machine with at least 2 NUMA nodes.
- 
-    root@buneary:/home/ubuntu# virsh freepages 0 1G
-    1048576KiB: 60
- 
-    root@buneary:/home/ubuntu# virsh freepages 1 1G
-    1048576KiB: 62
- 
-  * Create a guest that uses the full amount of available huge pages (on
- this case 122). (full guest definition:
- http://paste.ubuntu.com/25125500/)
+  * This is a tradeoff of memory clearing speed vs guest size.
+    Once clearing the guest memory takes longer than ~30 seconds the
+    issue will trigger.
+  * The guest must be backed by huge pages, as otherwise the kernel will
+    fault memory in on demand instead of needing the initial clear.
+  * One way to "slow down" the init is to configure a machine with
+    multiple NUMA nodes.
+    root@buneary:/home/ubuntu# virsh freepages 0 1G
+    1048576KiB: 60
+    root@buneary:/home/ubuntu# virsh freepages 1 1G
+    1048576KiB: 62
+  * Another way to slow down the init is to just use a really huge guest.
+    In the example a 122G guest was enough. (full guest definition:
+    http://paste.ubuntu.com/25125500/)
  
  <memory unit='GiB'>120</memory>
-   <currentMemory unit='GiB'>120</currentMemory>
+   <currentMemory unit='GiB'>120</currentMemory>
  
-   <memoryBacking>
-     <hugepages>
-       <page size='1' unit='GiB' nodeset='0'/>
-       <page size='1' unit='GiB' nodeset='1'/>
-     </hugepages>
+   <memoryBacking>
+     <hugepages>
+       <page size='1' unit='GiB' nodeset='0'/>
+       <page size='1' unit='GiB' nodeset='1'/>
+     </hugepages>
  
-   </memoryBacking>
+   </memoryBacking>
  
-   <cpu mode='host-passthrough'>
+   <cpu mode='host-passthrough'>
  
-     <topology sockets='16' cores='1' threads='2'/>
-     <numa>
-       <cell id='0' cpus='0-15' memory='60' unit='GiB' memAccess='shared'/>
-       <cell id='1' cpus='16-31' memory='62' unit='GiB' memAccess='shared'/>
-     </numa>
-   </cpu>
+     <topology sockets='16' cores='1' threads='2'/>
+     <numa>
+       <cell id='0' cpus='0-15' memory='60' unit='GiB' memAccess='shared'/>
+       <cell id='1' cpus='16-31' memory='62' unit='GiB' memAccess='shared'/>
+     </numa>
+   </cpu>
  
-  * Define the guest, and try to start it.
+  * Define the guest, and try to start it.
  
-   $ virsh define reproducer.xml
-   $ virsh start reproducer
+   $ virsh define reproducer.xml
+   $ virsh start reproducer
  
  * Verify that the following error is raised:
  
  root@buneary:/home/ubuntu# virsh start reproducer2
  error: Failed to start domain reproducer2
  error: monitor socket did not show up: No such file or directory
  
  [Expected Behavior]
  
- * Machine is started without issues as displayed 
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1705132/comments/7
-    
+ * Machine starts without issues, as shown in
+ https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1705132/comments/7
  
  [Regression Potential]
  
-  * None identified.
+  * The timeout behavior when starting a guest changed. We backported
+    the fix along with a follow-up fix to that new behavior (where guests
+    seemed to wait forever due to the exponential wait).
+    The "allowed" wait time is still increased, though, and users might
+    expect a guest to start instantly, as they are used to from their
+    laptop.
+    Now if one starts a 1TB guest the allowed time is base+1000s.
+    A user might think for a while that it is broken or hanging, but
+    there is no way to avoid that.
+    OTOH before the fix it would have failed to start after 30 seconds,
+    so this is not really a regression IMHO.
+ 
  
  [Other Info]
  
  
https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=85af0b803cd19a03f71bd01ab4e045552410368f;hp=67dcb797ed7f1fbb048aa47006576f424923933b
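
For readers who want the two ideas from the TL;DR in concrete form, here is a rough Python sketch. It is not the libvirt code (that is C, see the commit linked above). The names allowed_timeout and wait_with_backoff, the 30s base, the 1s-per-GiB factor and the 1s sleep cap are assumptions taken from the numbers in this description or made up for illustration.

```python
import time

BASE_TIMEOUT_S = 30.0   # assumption: the "~30 seconds" mentioned in the test case
EXTRA_S_PER_GIB = 1.0   # assumption: the "~1s per 1Gb" mentioned in the TL;DR


def allowed_timeout(guest_mem_gib):
    """Total time the guest is allowed to take before start is declared failed."""
    return BASE_TIMEOUT_S + EXTRA_S_PER_GIB * guest_mem_gib


def wait_with_backoff(ready, timeout_s, first_sleep_s=0.001, max_sleep_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll ready() with exponentially growing sleeps until timeout_s elapses.

    Returns True as soon as ready() is true, False once the timeout is hit.
    """
    deadline = clock() + timeout_s
    delay = first_sleep_s
    while True:
        if ready():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline, so the wait cannot overshoot it.
        sleep(min(delay, remaining))
        # Double the delay, but cap it (hypothetical cap) so a guest that
        # comes up late is still noticed reasonably quickly.
        delay = min(delay * 2, max_sleep_s)
```

Under these assumptions the guest from the test case would be given allowed_timeout(122) seconds, i.e. the base plus 122s, and a fast guest is still detected after the first few millisecond-sized sleeps.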

Title:
  Large memory guests, "error: monitor socket did not show up: No such
  file or directory"
