Steps with test packages on Focal (shutdown-on-init)
---

Environment:
---

On top of the LXD VM from comments #12/#13.

Enable PPA & debug symbols

        sudo add-apt-repository -yn ppa:mfo/lp2059272
        sudo sed '/^deb / s,$, main/debug,' -i /etc/apt/sources.list.d/mfo-ubuntu-lp2059272-focal.list
        sudo apt update

Install packages

        sudo apt install --yes libvirt{0,-daemon{,-driver-qemu}}{,-dbgsym} libvirt-clients gdb qemu-system-x86

        $ dpkg -s libvirt-daemon | grep ^Version:
        Version: 6.0.0-0ubuntu8.18~ppa1

Libvirtd debug logging

        cat <<EOF | sudo tee -a /etc/libvirt/libvirtd.conf
        log_filters="1:qemu 1:libvirt"
        log_outputs="3:syslog:libvirtd 1:file:/var/log/libvirt/libvirtd-debug.log"
        EOF

Follow `Steps to reproduce on Focal (shutdown-on-init)` in comment #13
---

Up to ...

Check the backtrace of the domain status XML save function, coming from
QEMU process reconnect:

        t 20

        (gdb) bt
        #0  virDomainObjSave (obj=0x7fe638012540, xmlopt=0x7fe63800d4e0, statusDir=0x7fe63800cf10 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157
        #1  0x00007fe644190545 in qemuProcessReconnect (opaque=<optimized out>) at ../../../src/qemu/qemu_process.c:8123
        #2  0x00007fe64aebd54a in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:196
        #3  0x00007fe64ab7e609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
        #4  0x00007fe64aaa3353 in clone () from /lib/x86_64-linux-gnu/libc.so.6

        $ sudo kill $(pidof libvirtd)

        Thread 1 "libvirtd" hit Breakpoint 1, qemuStateCleanup () at
../../../src/qemu/qemu_driver.c:1180

        t 20

        (gdb) p xmlopt.privateData.format
        $1 = (virDomainXMLPrivateDataFormatFunc) 0x7fe644152890 <qemuDomainObjPrivateXMLFormat>

Let the cleanup function finish

        t 1
        finish

Notice it took a while (30 seconds).

        (gdb) t 20
        (gdb) p xmlopt.privateData.format
        $3 = (virDomainXMLPrivateDataFormatFunc) 0x0
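
At this point, the status XML save pending in thread 20 will run with a
cleared private-data formatter, so the QEMU-driver specific elements
(notably '<monitor path=.../>') are silently omitted. For illustration
only, a minimal standalone sketch of this kind of use-after-clear
ordering (plain C with pthreads; hypothetical names, not libvirt code):

        /* race.c -- illustrative only; build with: gcc -pthread race.c */
        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        /* Stand-in for the domain XML private-data format callback. */
        typedef int (*format_func)(FILE *out);

        static int format_private(FILE *out)
        {
            /* This is where '<monitor path=.../>' would be written. */
            return fprintf(out, "  <monitor path='...' type='unix'/>\n");
        }

        static format_func formatter = format_private;

        /* Stand-in for the reconnect thread saving the status XML. */
        static void *save_status_xml(void *arg)
        {
            (void)arg;
            sleep(2);                    /* held up, like thread 20 under gdb */
            printf("<domstatus state='running'>\n");
            if (formatter)               /* already cleared: silently skipped */
                formatter(stdout);
            printf("</domstatus>\n");
            return NULL;
        }

        int main(void)
        {
            pthread_t saver;

            pthread_create(&saver, NULL, save_status_xml, NULL);
            sleep(1);
            formatter = NULL;            /* stand-in for driver cleanup/free  */
            pthread_join(saver, NULL);
            return 0;
        }

Running it prints a domstatus block without the monitor element,
mirroring the truncated status XML checked below.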

Let the save function continue, and libvirt finish shutdown:

        (gdb) c &
        (gdb) t 1
        (gdb) c
        (gdb) q

Check the VM status XML *after*:

        ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
        <domstatus state='running' reason='booted' pid='6817'>
          <domain type='qemu' id='1'>
        
And everything happened as in the reproducer;
i.e., the SAME behavior happened BY DEFAULT,
just with a 30-second delay.

Check the libvirtd debug logs to confirm the patch behavior:

        $ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
        2024-03-30 22:49:24.737+0000: 6875: debug : qemuStateCleanupWait:1144 : 
timeout 30, timeout_env '(null)'
        2024-03-30 22:49:24.737+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 0
        2024-03-30 22:49:24.737+0000: 6875: warning : qemuStateCleanupWait:1153 
: Waiting for qemuProcessReconnect() threads (1) to end. Configure with 
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait 
up to N seconds; current = 30)
        2024-03-30 22:49:25.740+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 1
        2024-03-30 22:49:26.740+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 2
        2024-03-30 22:49:27.740+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 3
        2024-03-30 22:49:28.741+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 4
        2024-03-30 22:49:29.741+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 5
        2024-03-30 22:49:30.741+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 6
        2024-03-30 22:49:31.742+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 7
        2024-03-30 22:49:32.742+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 8
        2024-03-30 22:49:33.742+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 9
        2024-03-30 22:49:34.742+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 10
        2024-03-30 22:49:35.743+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 11
        2024-03-30 22:49:36.743+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 12
        2024-03-30 22:49:37.744+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 13
        2024-03-30 22:49:38.744+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 14
        2024-03-30 22:49:39.744+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 15
        2024-03-30 22:49:40.744+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 16
        2024-03-30 22:49:41.745+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 17
        2024-03-30 22:49:42.745+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 18
        2024-03-30 22:49:43.746+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 19
        2024-03-30 22:49:44.746+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 20
        2024-03-30 22:49:45.747+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 21
        2024-03-30 22:49:46.747+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 22
        2024-03-30 22:49:47.748+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 23
        2024-03-30 22:49:48.748+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 24
        2024-03-30 22:49:49.749+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 25
        2024-03-30 22:49:50.749+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 26
        2024-03-30 22:49:51.750+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 27
        2024-03-30 22:49:52.750+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 28
        2024-03-30 22:49:53.750+0000: 6875: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 29
        2024-03-30 22:49:54.751+0000: 6875: warning : qemuStateCleanupWait:1164 
: Leaving qemuProcessReconnect() threads (1) per timeout (30)
        2024-03-30 22:51:00.315+0000: 6906: debug : qemuDomainObjEndJob:9746 : 
Stopping job: modify (async=none vm=0x7fe638012540 name=test-vm)
        2024-03-30 22:51:00.315+0000: 6906: debug : qemuProcessReconnect:8161 : 
Not decrementing qemuProcessReconnect() threads as the QEMU driver is already 
deallocated/freed.

        This would be shown in libvirtd syslog/journalctl (warnings/errors):
        
        $ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p' | grep -e warning -e error
        2024-03-30 22:49:24.737+0000: 6875: warning : qemuStateCleanupWait:1153 
: Waiting for qemuProcessReconnect() threads (1) to end. Configure with 
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait 
up to N seconds; current = 30)
        2024-03-30 22:49:54.751+0000: 6875: warning : qemuStateCleanupWait:1164 
: Leaving qemuProcessReconnect() threads (1) per timeout (30)
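
The 'threads N, seconds M' lines above come from a once-per-second
polling wait. As a rough model of the logged behavior only (the names
and structure are approximations, not the actual qemuStateCleanupWait()
code in the test package):

        /* cleanup-wait.c -- approximation of the logged behavior; gcc -pthread */
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        /* Stand-in counter of still-running qemuProcessReconnect() threads. */
        static volatile int reconnect_threads = 1;

        static void *reconnect(void *arg)
        {
            (void)arg;
            sleep(3);                     /* simulated reconnect work */
            reconnect_threads--;
            return NULL;
        }

        /* -1 = wait; 0 = do not wait; N = wait up to N seconds (default 30). */
        static void cleanup_wait(void)
        {
            const char *env = getenv("LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT");
            int timeout = env ? atoi(env) : 30;
            int seconds = 0;

            while (reconnect_threads > 0 && (timeout < 0 || seconds < timeout)) {
                fprintf(stderr, "threads %d, seconds %d\n",
                        reconnect_threads, seconds);
                sleep(1);
                seconds++;
            }

            if (reconnect_threads > 0)
                fprintf(stderr, "leaving %d thread(s) per timeout (%d)\n",
                        reconnect_threads, timeout);
            else
                fprintf(stderr, "all threads finished\n");
        }

        int main(void)
        {
            pthread_t t;

            pthread_create(&t, NULL, reconnect, NULL);
            cleanup_wait();   /* try with the env var set to -1, 0, or 5 */
            pthread_join(t, NULL);
            return 0;
        }

With the variable unset it polls for up to 30 seconds and then gives up
(the default delay observed above); with 0 it skips the wait entirely;
with -1 it returns only once the counter drops, as exercised in the
scenarios below.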

Stop the VM, and restart it with libvirt.

        sudo kill $(sudo cat /run/libvirt/qemu/test-vm.pid) && sudo rm /run/libvirt/qemu/test-vm.{pid,xml}
        sudo systemctl start libvirtd.service && virsh start test-vm && sudo systemctl stop 'libvirtd*'

Scenario with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=5
---

The same result happens with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=5
(i.e., wait for at most 5 seconds).

Repeat, with `gdb -ex 'set environment
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT 5' -ex 'run'`:

The steps 't 1; finish' take 5 seconds, instead of 30 seconds.

        ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
        <domstatus state='running' reason='booted' pid='7005'>
          <domain type='qemu' id='1'>

        ubuntu@lp2059272-focal:~$ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
        2024-03-30 23:00:11.016+0000: 7017: debug : qemuStateCleanupWait:1144 : 
timeout 5, timeout_env '5'
        2024-03-30 23:00:11.016+0000: 7017: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 0
        2024-03-30 23:00:11.016+0000: 7017: warning : qemuStateCleanupWait:1153 
: Waiting for qemuProcessReconnect() threads (1) to end. Configure with 
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait 
up to N seconds; current = 5)
        2024-03-30 23:00:12.017+0000: 7017: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 1
        2024-03-30 23:00:13.018+0000: 7017: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 2
        2024-03-30 23:00:14.018+0000: 7017: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 3
        2024-03-30 23:00:15.018+0000: 7017: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 4
        2024-03-30 23:00:16.018+0000: 7017: warning : qemuStateCleanupWait:1164 
: Leaving qemuProcessReconnect() threads (1) per timeout (5)
        2024-03-30 23:00:45.694+0000: 7048: debug : qemuDomainObjEndJob:9746 : 
Stopping job: modify (async=none vm=0x7f40d0052de0 name=test-vm)
        2024-03-30 23:00:45.694+0000: 7048: debug : qemuProcessReconnect:8161 : 
Not decrementing qemuProcessReconnect() threads as the QEMU driver is already 
deallocated/freed.


Scenario with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=0
---

The same result happens with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=0
(i.e., do not wait).

Repeat, with `gdb -ex 'set environment
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT 0' -ex 'run'`:

The steps 't 1; finish' take 0 seconds (no wait), instead of 30 or 5
seconds.

        ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
        <domstatus state='running' reason='booted' pid='7113'>
          <domain type='qemu' id='1'>

        ubuntu@lp2059272-focal:~$ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
        2024-03-30 23:03:11.487+0000: 7124: debug : qemuStateCleanupWait:1144 : 
timeout 0, timeout_env '0'
        2024-03-30 23:03:11.488+0000: 7124: warning : qemuStateCleanupWait:1164 
: Leaving qemuProcessReconnect() threads (1) per timeout (0)
        2024-03-30 23:03:15.313+0000: 7155: debug : qemuDomainObjEndJob:9746 : 
Stopping job: modify (async=none vm=0x7ff620052ad0 name=test-vm)
        2024-03-30 23:03:15.313+0000: 7155: debug : qemuProcessReconnect:8161 : 
Not decrementing qemuProcessReconnect() threads as the QEMU driver is already 
deallocated/freed.

Scenario with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=-1
---

A different result happens with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=-1
(i.e., wait forever).

Repeat, with `gdb -ex 'set environment
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT -1' -ex 'run'`:

This time the 't 1; finish' step does not finish; it keeps running,
waiting for the pending thread.

        t 1
        finish
        ... wait, wait, wait ...
        ctrl-c

        (gdb) bt
        #0  0x00007fb29ceed23f in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
        #1  0x00007fb29cef2ec7 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
        #2  0x00007fb29d0bf557 in g_usleep () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
        #3  0x00007fb2906498f5 in qemuStateCleanupWait () at ../../../src/qemu/qemu_driver.c:1159
        #4  qemuStateCleanup () at ../../../src/qemu/qemu_driver.c:1184
        #5  0x00007fb29d4e746f in virStateCleanup () at ../../../src/libvirt.c:669
        #6  0x00005569adc89bc8 in main (argc=<optimized out>, argv=<optimized out>) at ../../../src/remote/remote_daemon.c:1447

Check the formatter/options again; it is *STILL* referenced, not 0x0 as
in the earlier scenarios:

        t 20

        (gdb) p xmlopt.privateData.format
        $1 = (virDomainXMLPrivateDataFormatFunc) 0x7fb2905d8890 <qemuDomainObjPrivateXMLFormat>

Thread 1 is still in qemuStateCleanupWait(), so let it run again,

        (gdb) c &
        
And unblock the other thread.
Now libvirt finishes shutting down.

        (gdb) t 20
        (gdb) c
        ...
        [Inferior 1 (process 7233) exited normally]

The logs show that the thread actually finished before libvirt exited.

        ubuntu@lp2059272-focal:~$ sudo tail -n200 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
        2024-03-30 23:06:00.512+0000: 7233: debug : qemuStateCleanupWait:1144 : 
timeout -1, timeout_env '-1'
        2024-03-30 23:06:00.512+0000: 7233: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 0
        2024-03-30 23:06:00.512+0000: 7233: warning : qemuStateCleanupWait:1153 
: Waiting for qemuProcessReconnect() threads (1) to end
        . Configure with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 
= do not wait; N = wait up to N seconds; current = -1)
        2024-03-30 23:06:01.513+0000: 7233: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 1
        2024-03-30 23:06:02.513+0000: 7233: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 2
        2024-03-30 23:06:03.514+0000: 7233: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 3
        ...
        2024-03-30 23:09:43.994+0000: 7233: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 130
        2024-03-30 23:09:44.994+0000: 7233: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 131
        2024-03-30 23:09:45.994+0000: 7233: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 132
        2024-03-30 23:09:46.075+0000: 7264: debug : qemuDomainObjEndJob:9746 : 
Stopping job: modify (async=none vm=0x7fb28c04c1c0 name=test-vm)
        2024-03-30 23:09:46.075+0000: 7264: debug : qemuProcessReconnect:8158 : 
Decrementing qemuProcessReconnect() threads.
        2024-03-30 23:09:46.995+0000: 7233: debug : qemuStateCleanupWait:1170 : 
All qemuProcessReconnect() threads finished

And the `monitor path` is still in the XML:

        ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
        <domstatus state='running' reason='booted' pid='7222'>
          <monitor path='/var/lib/libvirt/qemu/domain-1-test-vm/monitor.sock' type='unix'/>
          <domain type='qemu' id='1'>

Of course, the above also happens by default
if the thread finishes within the default timeout (30 seconds).

Scenario: (default/real-world) no env var, and the thread finishes quickly 
---

        (Running the steps quickly, so the reconnect thread finishes within the default timeout.)

        Thread 20 "libvirtd" hit Breakpoint 2, virDomainObjSave
(obj=0x55c688ebbe80, xmlopt=0x55c688eb3f40, statusDir=0x55c688e78f60
"/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157

        $ sudo kill $(pidof libvirtd)
        
        Thread 1 "libvirtd" hit Breakpoint 1, qemuStateCleanup () at 
../../../src/qemu/qemu_driver.c:1181
        (gdb) t 1
        (gdb) c &
        (gdb) t 20
        (gdb) c
        ...
        [Inferior 1 (process 32761) exited normally]
        
        ubuntu@lp2059272-focal:~$ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
        2024-03-30 23:12:10.242+0000: 7281: debug : qemuStateCleanupWait:1144 : 
timeout 30, timeout_env '(null)'
        2024-03-30 23:12:10.242+0000: 7281: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 0
        2024-03-30 23:12:10.242+0000: 7281: warning : qemuStateCleanupWait:1153 
: Waiting for qemuProcessReconnect() threads (1) to end. Configure with 
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait 
up to N seconds; current = 30)
        2024-03-30 23:12:11.242+0000: 7281: debug : qemuStateCleanupWait:1150 : 
threads 1, seconds 1
        2024-03-30 23:12:11.484+0000: 7312: debug : qemuDomainObjEndJob:9746 : 
Stopping job: modify (async=none vm=0x7f7b4c04c3a0 name=test-vm)
        2024-03-30 23:12:11.484+0000: 7312: debug : qemuProcessReconnect:8158 : 
Decrementing qemuProcessReconnect() threads.
        2024-03-30 23:12:12.243+0000: 7281: debug : qemuStateCleanupWait:1170 : 
All qemuProcessReconnect() threads finished

        ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
        <domstatus state='running' reason='booted' pid='7222'>
          <monitor path='/var/lib/libvirt/qemu/domain-1-test-vm/monitor.sock' type='unix'/>
          <domain type='qemu' id='1'>

Now, the next time libvirtd starts, it correctly parses that XML:

        $ sudo systemctl start libvirtd.service

        ubuntu@lp2059272-focal:~$ journalctl -b -u libvirtd.service | grep error
        ...
        Mar 30 23:14:27 lp2059272-focal libvirtd[7325]: 7341: error : 
dnsmasqCapsRefreshInternal:714 : Cannot check dnsmasq binary /usr/sbin/dnsmasq: 
No such file or directory


Note the only error is the unrelated missing dnsmasq binary; there is no
'internal error: no monitor path' or 'Failed to load config' error this time.

And libvirt is now aware of the domain, and can manage it:

        $ virsh list
         Id   Name      State
        -------------------------
         1    test-vm   running

        $ virsh destroy test-vm
        Domain test-vm destroyed


** Description changed:

  [ Impact ]
  
   * If a race condition occurs on libvirtd shutdown,
     a QEMU domain status XML (/run/libvirt/qemu/*.xml)
     might lose the QEMU-driver specific information,
     such as '<monitor path=.../>'.
+    (The race condition details are in [Other Info].)
  
   * On the next libvirtd startup, the parsing of that
     QEMU domain's status XML fails as '<monitor path='
     is not found:
  
    $ journalctl -b -u libvirtd.service | tail
    ...
    ... libvirtd[2789]: internal error: no monitor path
    ... libvirtd[2789]: Failed to load config for domain 'test-vm'
  
   * As a result, the domain is not listed in `virsh list`,
     and `virsh` commands to it fail.
  
    $ virsh list
     Id Name State
    --------------------
  
   * The domain is still running, but libvirt considers
     it as shutdown, which might cause conflicts/issues
     with higher-level tools (e.g., openstack nova).
  
    $ virsh list --all
     Id Name State
    --------------------------
     - test-vm shut off
  
    $ pgrep -af qemu-system-x86_64 | cut -d, -f1
    2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,
  
  [ Test Plan ]
  
-  * Synthetic reproducer with GDB in comments #1 and #2.
+  * (Focal/Jammy) shutdown-on-runtime:
+    Synthetic reproducer/verification with GDB in comments #1 and #2 (Jammy) and #12 and #14 (Focal).
  
-    On failure, the XML is saved *without* '<monitor path='
+  * (Focal-only) shutdown-on-init:
+    Synthetic reproducer/verification with GDB in comments #13 and #15.
+ 
+  * On failure, the XML is saved *without* '<monitor path='
     and libvirt fails to parse the domain on startup.
     The domain is *not* listed in `virsh list`.
-    (comment #1)
  
-    On success, the XML is saved *with* '<monitor path='
+  * On success, the XML is saved *with* '<monitor path='
     and libvirt correctly parses the domain on startup.
     The domain is listed in `virsh list`.
-    (comment #2)
  
   * Normal 'restart' testing in comment #5.
  
+  * Test packages built successfully in all architectures
+    with -proposed enabled in Launchpad PPA mfo/lp2059272 [0]
+ 
+ [0] https://launchpad.net/~mfo/+archive/ubuntu/lp2059272
+ 
+ 
  [ Regression Potential ]
  
-  * The patch changes *where* in the libvirt qemu driver's
+  * One patch changes *where* in the libvirt qemu driver's
     shutdown path the worker thread pool is stopped/freed:
     from _after_ releasing other data to _before_ doing so.
+ 
+  * The other patch (Focal-only) introduces a bounded wait
+    (with configurable timeout via an environment variable)
+    in the (same) libvirt qemu driver's shutdown path.
+ 
+    By default, this waits for qemuProcessReconnect threads
+    for up to 30 seconds (expected to finish in less than
+    1 second, in practice), and gives up / continues with
+    shutdown anyway so as not to introduce a behavior change
+    on this path (prevents impact in case of regressions).
  
   * Therefore, the potential for regression is limited to
     the libvirt qemu driver's shutdown path, and would be
     observed when stopping/restarting libvirtd.service.
  
   * The behavior during normal operation is not affected.
  
  [Other Info]
  
-  * The fix commit [1] is included in Mantic and later,
-    and needed in Focal and Jammy.
+  * In Focal, race windows exist if libvirtd shuts down
+    _after_ initialization and _during_ initialization
+    (which is unlikely in practice, but it's possible.)
+ 
+    Say, 'shutdown-on-runtime' and 'shutdown-on-init'.
+ 
+  * In Jammy, only 'shutdown-on-runtime' might happen,
+    due to the introduction of the '.stateShutdownWait'
+    driver callback (not available in Focal), which
+    indirectly prevents the 'shutdown-on-init' race
+    due to additional synchronization with locking.
+ 
+  * For 'shutdown-on-init' (Focal-only), we should use a
+    downstream-only patch (with configurable behavior),
+    since upstream addressed this issue indirectly with
+    the '.stateShutdownWait' callbacks and other changes
+    (which are not SRU material, ~10 patches, redesign [2]).
+ 
+  * For 'shutdown-on-runtime': use upstream commit [1].
+    It's needed in Focal and Jammy (included in Mantic).
  
   $ git describe --contains 152770333449cd3b78b4f5a9f1148fc1f482d842
   v9.3.0-rc1~90
  
   $ rmadison -a source libvirt | sed -n '/focal/,$p'
    libvirt | 6.0.0-0ubuntu8       | focal           | source
    libvirt | 6.0.0-0ubuntu8.16    | focal-security  | source
    libvirt | 6.0.0-0ubuntu8.16    | focal-updates   | source
    libvirt | 6.0.0-0ubuntu8.17    | focal-proposed  | source
    libvirt | 8.0.0-1ubuntu7       | jammy           | source
    libvirt | 8.0.0-1ubuntu7.5     | jammy-security  | source
    libvirt | 8.0.0-1ubuntu7.8     | jammy-updates   | source
    libvirt | 9.6.0-1ubuntu1       | mantic          | source
    libvirt | 10.0.0-2ubuntu1      | noble           | source
    libvirt | 10.0.0-2ubuntu5      | noble-proposed  | source
  
  [1]
  
https://gitlab.com/libvirt/libvirt/-/commit/152770333449cd3b78b4f5a9f1148fc1f482d842
  
-  * Test packages built successfully in all architectures
-    with -proposed enabled in Launchpad PPA mfo/lp2059272 [2]
+ [2] https://listman.redhat.com/archives/libvir-list/2020-July/205291.html
+ [PATCH 00/10] resolve hangs/crashes on libvirtd shutdown
  
- [2] https://launchpad.net/~mfo/+archive/ubuntu/lp2059272
+ commit 94e45d1042e21e03a15ce993f90fbef626f1ae41
+ Author: Nikolay Shirokovskiy <nshirokovs...@virtuozzo.com>
+ Date: Thu Jul 23 09:53:04 2020 +0300
+ 
+ rpc: finish all threads before exiting main loop
+ 
+ $ git describe --contains 94e45d1042e21e03a15ce993f90fbef626f1ae41
+ v6.8.0-rc1~279
+ 
  
  [Original Description]
  
  There's a race condition on libvirtd shutdown
  that might cause the domain status XML file(s)
  to lose the '<monitor path=...'> tag/field.
  
  This causes an error on libvirtd startup, and
  the domain is not listed/managed, despite it
  is still running.
  
   $ virsh list
    Id   Name      State
   -------------------------
    1    test-vm   running
  
   $ sudo systemctl restart libvirtd.service
  
   $ journalctl -b -u libvirtd.service | tail
   ...
   ... libvirtd[2789]: internal error: no monitor path
   ... libvirtd[2789]: Failed to load config for domain 'test-vm'
  
   $ virsh list
    Id   Name   State
   --------------------
  
   $ virsh list --all
    Id   Name      State
   --------------------------
    -    test-vm   shut off
  
   $ pgrep -af qemu-system-x86_64 | cut -d, -f1
   2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,
