Steps with test packages on Focal (shutdown-on-init)
---
Environment:
---
On top of the LXD VM from comments #12/#13.
Enable PPA & debug symbols
sudo add-apt-repository -yn ppa:mfo/lp2059272
sudo sed -i '/^deb / s,$, main/debug,' \
  /etc/apt/sources.list.d/mfo-ubuntu-lp2059272-focal.list
sudo apt update
Install packages
sudo apt install --yes libvirt{0,-daemon{,-driver-qemu}}{,-dbgsym} \
  libvirt-clients gdb qemu-system-x86
$ dpkg -s libvirt-daemon | grep ^Version:
$ dpkg -s libvirt-daemon | grep ^Version:
Version: 6.0.0-0ubuntu8.18~ppa1
Libvirtd debug logging
cat <<EOF | sudo tee -a /etc/libvirt/libvirtd.conf
log_filters="1:qemu 1:libvirt"
log_outputs="3:syslog:libvirtd 1:file:/var/log/libvirt/libvirtd-debug.log"
EOF
Follow `Steps to reproduce on Focal (shutdown-on-init)` in comment #13
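For reference, the gdb invocation presumably looks like the sketch below
(the authoritative steps are in comment #13; the breakpoints match the
ones hit in the transcripts that follow, and /usr/sbin/libvirtd is the
assumed binary path):

sudo systemctl stop 'libvirtd*'
sudo gdb /usr/sbin/libvirtd \
    -ex 'break qemuStateCleanup' \
    -ex 'break virDomainObjSave' \
    -ex 'run'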
---
Up to ...
Check the backtrace of the domain status XML save function, coming from
QEMU process reconnect:
(gdb) t 20
(gdb) bt
#0 virDomainObjSave (obj=0x7fe638012540, xmlopt=0x7fe63800d4e0,
statusDir=0x7fe63800cf10 "/run/libvirt/qemu") at
../../../src/conf/domain_conf.c:29157
#1 0x00007fe644190545 in qemuProcessReconnect (opaque=<optimized out>)
at ../../../src/qemu/qemu_process.c:8123
#2 0x00007fe64aebd54a in virThreadHelper (data=<optimized out>) at
../../../src/util/virthread.c:196
#3 0x00007fe64ab7e609 in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007fe64aaa3353 in clone () from /lib/x86_64-linux-gnu/libc.so.6
$ sudo kill $(pidof libvirtd)
Thread 1 "libvirtd" hit Breakpoint 1, qemuStateCleanup () at
../../../src/qemu/qemu_driver.c:1180
(gdb) t 20
(gdb) p xmlopt.privateData.format
$1 = (virDomainXMLPrivateDataFormatFunc) 0x7fe644152890
<qemuDomainObjPrivateXMLFormat>
Let the cleanup function finish
(gdb) t 1
(gdb) finish
Notice that it took a while (30 seconds, the default timeout).
(gdb) t 20
(gdb) p xmlopt.privateData.format
$3 = (virDomainXMLPrivateDataFormatFunc) 0x0
Let the save function continue, and let libvirt finish shutting down:
(gdb) c &
(gdb) t 1
(gdb) c
(gdb) q
Check the VM status XML *after* libvirtd exits:
ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e
'monitor path' /run/libvirt/qemu/test-vm.xml
<domstatus state='running' reason='booted' pid='6817'>
<domain type='qemu' id='1'>
And everything happened as in the reproducer;
i.e., the SAME behavior happened BY DEFAULT,
just with a 30-second delay.
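The timeout semantics traced in the logs below can be read as the
following shell analogue (illustration only; the actual implementation
is C code in qemu_driver.c, and reconnect_threads_running is a stand-in
for the driver's internal counter of live qemuProcessReconnect()
threads):

timeout="${LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT:-30}"  # default: 30
seconds=0
while reconnect_threads_running \
      && { [ "$timeout" -lt 0 ] || [ "$seconds" -lt "$timeout" ]; }; do
    sleep 1                     # -1 = wait forever; 0 = no wait at all;
    seconds=$((seconds + 1))    # N = give up after N seconds
done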
Checking the libvirtd debug logs to confirm the patch behavior:
$ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n
'/qemuStateCleanupWait/,$p'
2024-03-30 22:49:24.737+0000: 6875: debug : qemuStateCleanupWait:1144 :
timeout 30, timeout_env '(null)'
2024-03-30 22:49:24.737+0000: 6875: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 0
2024-03-30 22:49:24.737+0000: 6875: warning : qemuStateCleanupWait:1153
: Waiting for qemuProcessReconnect() threads (1) to end. Configure with
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait
up to N seconds; current = 30)
2024-03-30 22:49:25.740+0000: 6875: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 1
2024-03-30 22:49:26.740+0000: 6875: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 2
...
2024-03-30 22:49:53.750+0000: 6875: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 29
2024-03-30 22:49:54.751+0000: 6875: warning : qemuStateCleanupWait:1164
: Leaving qemuProcessReconnect() threads (1) per timeout (30)
2024-03-30 22:51:00.315+0000: 6906: debug : qemuDomainObjEndJob:9746 :
Stopping job: modify (async=none vm=0x7fe638012540 name=test-vm)
2024-03-30 22:51:00.315+0000: 6906: debug : qemuProcessReconnect:8161 :
Not decrementing qemuProcessReconnect() threads as the QEMU driver is already
deallocated/freed.
This is what would be shown in libvirtd's syslog/journalctl
(warnings/errors only):
$ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n
'/qemuStateCleanupWait/,$p' | grep -e warning -e error
2024-03-30 22:49:24.737+0000: 6875: warning : qemuStateCleanupWait:1153
: Waiting for qemuProcessReconnect() threads (1) to end. Configure with
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait
up to N seconds; current = 30)
2024-03-30 22:49:54.751+0000: 6875: warning : qemuStateCleanupWait:1164
: Leaving qemuProcessReconnect() threads (1) per timeout (30)
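Since log_outputs above sends warnings to syslog (3:syslog:libvirtd),
an equivalent check should also work against the journal, without the
debug log (suggested command, not part of the original steps):

$ sudo journalctl -b -u libvirtd.service | grep qemuStateCleanupWait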
Stop the VM, and restart it with libvirt.
sudo kill $(sudo cat /run/libvirt/qemu/test-vm.pid) && sudo rm
/run/libvirt/qemu/test-vm.{pid,xml}
sudo systemctl start libvirtd.service && virsh start test-vm && sudo
systemctl stop 'libvirtd*'
Scenario with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=5
---
The same result happens with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=5
(i.e., wait at most 5 seconds).
Repeat, with `gdb -ex 'set environment
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT 5' -ex 'run'`:
The steps 't 1; finish' take 5 seconds, instead of 30 seconds.
ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e
'monitor path' /run/libvirt/qemu/test-vm.xml
<domstatus state='running' reason='booted' pid='7005'>
<domain type='qemu' id='1'>
ubuntu@lp2059272-focal:~$ sudo tail -n50
/var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
2024-03-30 23:00:11.016+0000: 7017: debug : qemuStateCleanupWait:1144 :
timeout 5, timeout_env '5'
2024-03-30 23:00:11.016+0000: 7017: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 0
2024-03-30 23:00:11.016+0000: 7017: warning : qemuStateCleanupWait:1153
: Waiting for qemuProcessReconnect() threads (1) to end. Configure with
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait
up to N seconds; current = 5)
2024-03-30 23:00:12.017+0000: 7017: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 1
2024-03-30 23:00:13.018+0000: 7017: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 2
2024-03-30 23:00:14.018+0000: 7017: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 3
2024-03-30 23:00:15.018+0000: 7017: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 4
2024-03-30 23:00:16.018+0000: 7017: warning : qemuStateCleanupWait:1164
: Leaving qemuProcessReconnect() threads (1) per timeout (5)
2024-03-30 23:00:45.694+0000: 7048: debug : qemuDomainObjEndJob:9746 :
Stopping job: modify (async=none vm=0x7f40d0052de0 name=test-vm)
2024-03-30 23:00:45.694+0000: 7048: debug : qemuProcessReconnect:8161 :
Not decrementing qemuProcessReconnect() threads as the QEMU driver is already
deallocated/freed.
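Outside of gdb, the timeout could presumably be set persistently for
the service with a standard systemd drop-in (the drop-in file name
below is arbitrary):

sudo mkdir -p /etc/systemd/system/libvirtd.service.d
cat <<EOF | sudo tee /etc/systemd/system/libvirtd.service.d/cleanup-wait.conf
[Service]
Environment=LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=5
EOF
sudo systemctl daemon-reload
sudo systemctl restart libvirtd.service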
Scenario with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=0
---
The same result happens with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=0
(i.e., do not wait).
Repeat, with `gdb -ex 'set environment
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT 0' -ex 'run'`:
The steps 't 1; finish' take 0 seconds (no wait), instead of 30 or 5
seconds.
ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e
'monitor path' /run/libvirt/qemu/test-vm.xml
<domstatus state='running' reason='booted' pid='7113'>
<domain type='qemu' id='1'>
ubuntu@lp2059272-focal:~$ sudo tail -n50
/var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
2024-03-30 23:03:11.487+0000: 7124: debug : qemuStateCleanupWait:1144 :
timeout 0, timeout_env '0'
2024-03-30 23:03:11.488+0000: 7124: warning : qemuStateCleanupWait:1164
: Leaving qemuProcessReconnect() threads (1) per timeout (0)
2024-03-30 23:03:15.313+0000: 7155: debug : qemuDomainObjEndJob:9746 :
Stopping job: modify (async=none vm=0x7ff620052ad0 name=test-vm)
2024-03-30 23:03:15.313+0000: 7155: debug : qemuProcessReconnect:8161 :
Not decrementing qemuProcessReconnect() threads as the QEMU driver is already
deallocated/freed.
Scenario with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=-1
---
A different result happens with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT=-1
(i.e., wait forever).
Repeat, with `gdb -ex 'set environment
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT -1' -ex 'run'`:
The steps 't 1; finish' do not finish; they keep
running, waiting for the pending thread.
(gdb) t 1
(gdb) finish
... wait, wait, wait ...
ctrl-c
(gdb) bt
#0 0x00007fb29ceed23f in clock_nanosleep () from
/lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fb29cef2ec7 in nanosleep () from
/lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fb29d0bf557 in g_usleep () from
/lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x00007fb2906498f5 in qemuStateCleanupWait () at
../../../src/qemu/qemu_driver.c:1159
#4 qemuStateCleanup () at ../../../src/qemu/qemu_driver.c:1184
#5 0x00007fb29d4e746f in virStateCleanup () at
../../../src/libvirt.c:669
#6 0x00005569adc89bc8 in main (argc=<optimized out>, argv=<optimized
out>) at ../../../src/remote/remote_daemon.c:1447
Check the formatter callback again; it is *STILL* set,
not 0x0 as it was in the earlier scenarios:
(gdb) t 20
(gdb) p xmlopt.privateData.format
$1 = (virDomainXMLPrivateDataFormatFunc) 0x7fb2905d8890
<qemuDomainObjPrivateXMLFormat>
Thread 1 is still in qemuStateCleanupWait(), so let it run again,
(gdb) c &
And unblock the other thread.
Now libvirt finishes shutting down.
(gdb) t 20
(gdb) c
...
[Inferior 1 (process 7233) exited normally]
The logs show the thread actually finished before libvirt exited.
ubuntu@lp2059272-focal:~$ sudo tail -n200
/var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
2024-03-30 23:06:00.512+0000: 7233: debug : qemuStateCleanupWait:1144 :
timeout -1, timeout_env '-1'
2024-03-30 23:06:00.512+0000: 7233: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 0
2024-03-30 23:06:00.512+0000: 7233: warning : qemuStateCleanupWait:1153
: Waiting for qemuProcessReconnect() threads (1) to end. Configure with
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait
up to N seconds; current = -1)
2024-03-30 23:06:01.513+0000: 7233: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 1
2024-03-30 23:06:02.513+0000: 7233: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 2
2024-03-30 23:06:03.514+0000: 7233: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 3
...
2024-03-30 23:09:43.994+0000: 7233: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 130
2024-03-30 23:09:44.994+0000: 7233: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 131
2024-03-30 23:09:45.994+0000: 7233: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 132
2024-03-30 23:09:46.075+0000: 7264: debug : qemuDomainObjEndJob:9746 :
Stopping job: modify (async=none vm=0x7fb28c04c1c0 name=test-vm)
2024-03-30 23:09:46.075+0000: 7264: debug : qemuProcessReconnect:8158 :
Decrementing qemuProcessReconnect() threads.
2024-03-30 23:09:46.995+0000: 7233: debug : qemuStateCleanupWait:1170 :
All qemuProcessReconnect() threads finished
And the `monitor path` is still in the XML:
ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e
'monitor path' /run/libvirt/qemu/test-vm.xml
<domstatus state='running' reason='booted' pid='7222'>
<monitor path='/var/lib/libvirt/qemu/domain-1-test-vm/monitor.sock'
type='unix'/>
<domain type='qemu' id='1'>
Of course, the above also happens by default
if the thread finishes within the default timeout (30 seconds).
Scenario: (default/real-world) no env var, and the thread finishes quickly
---
(Running the steps quickly.)
Thread 20 "libvirtd" hit Breakpoint 2, virDomainObjSave
(obj=0x55c688ebbe80, xmlopt=0x55c688eb3f40, statusDir=0x55c688e78f60
"/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157
$ sudo kill $(pidof libvirtd)
Thread 1 "libvirtd" hit Breakpoint 1, qemuStateCleanup () at
../../../src/qemu/qemu_driver.c:1181
(gdb) t 1
(gdb) c &
(gdb) t 20
(gdb) c
...
[Inferior 1 (process 32761) exited normally]
ubuntu@lp2059272-focal:~$ sudo tail -n50
/var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
2024-03-30 23:12:10.242+0000: 7281: debug : qemuStateCleanupWait:1144 :
timeout 30, timeout_env '(null)'
2024-03-30 23:12:10.242+0000: 7281: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 0
2024-03-30 23:12:10.242+0000: 7281: warning : qemuStateCleanupWait:1153
: Waiting for qemuProcessReconnect() threads (1) to end. Configure with
LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait
up to N seconds; current = 30)
2024-03-30 23:12:11.242+0000: 7281: debug : qemuStateCleanupWait:1150 :
threads 1, seconds 1
2024-03-30 23:12:11.484+0000: 7312: debug : qemuDomainObjEndJob:9746 :
Stopping job: modify (async=none vm=0x7f7b4c04c3a0 name=test-vm)
2024-03-30 23:12:11.484+0000: 7312: debug : qemuProcessReconnect:8158 :
Decrementing qemuProcessReconnect() threads.
2024-03-30 23:12:12.243+0000: 7281: debug : qemuStateCleanupWait:1170 :
All qemuProcessReconnect() threads finished
ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e
'monitor path' /run/libvirt/qemu/test-vm.xml
<domstatus state='running' reason='booted' pid='7222'>
<monitor path='/var/lib/libvirt/qemu/domain-1-test-vm/monitor.sock'
type='unix'/>
<domain type='qemu' id='1'>
Now, the next time libvirtd starts, it correctly parses that XML:
$ sudo systemctl start libvirtd.service
ubuntu@lp2059272-focal:~$ journalctl -b -u libvirtd.service | grep error
...
Mar 30 23:14:27 lp2059272-focal libvirtd[7325]: 7341: error :
dnsmasqCapsRefreshInternal:714 : Cannot check dnsmasq binary /usr/sbin/dnsmasq:
No such file or directory
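The only error above is an unrelated dnsmasq check. To also confirm
the original failure mode is gone, one can check that the two errors
from the bug report do not appear (expected: no output; suggested
check, not part of the original steps):

$ journalctl -b -u libvirtd.service | grep -e 'no monitor path' -e 'Failed to load config'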
And libvirt is now aware of the domain, and can manage it:
$ virsh list
Id Name State
-------------------------
1 test-vm running
$ virsh destroy test-vm
Domain test-vm destroyed
** Description changed:
[ Impact ]
* If a race condition occurs on libvirtd shutdown,
a QEMU domain status XML (/run/libvirt/qemu/*.xml)
might lose the QEMU-driver specific information,
such as '<monitor path=.../>'.
+ (The race condition details are in [Other Info].)
* On the next libvirtd startup, the parsing of that
QEMU domain's status XML fails as '<monitor path='
is not found:
$ journalctl -b -u libvirtd.service | tail
...
... libvirtd[2789]: internal error: no monitor path
... libvirtd[2789]: Failed to load config for domain 'test-vm'
* As a result, the domain is not listed in `virsh list`,
and `virsh` commands to it fail.
$ virsh list
Id Name State
--------------------
* The domain is still running, but libvirt considers
it as shutdown, which might cause conflicts/issues
with higher-level tools (e.g., openstack nova).
$ virsh list --all
Id Name State
--------------------------
- test-vm shut off
$ pgrep -af qemu-system-x86_64 | cut -d, -f1
2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,
[ Test Plan ]
- * Synthetic reproducer with GDB in comments #1 and #2.
+ * (Focal/Jammy) shutdown-on-runtime:
+ Synthetic reproducer/verification with GDB in comments #1 and #2 (Jammy)
and #12 and #14 (Focal).
- On failure, the XML is saved *without* '<monitor path='
+ * (Focal-only) shutdown-on-init:
+ Synthetic reproducer/verification with GDB in comments #13 and #15.
+
+ * On failure, the XML is saved *without* '<monitor path='
and libvirt fails to parse the domain on startup.
The domain is *not* listed in `virsh list`.
- (comment #1)
- On success, the XML is saved *with* '<monitor path='
+ * On success, the XML is saved *with* '<monitor path='
and libvirt correctly parses the domain on startup.
The domain is listed in `virsh list`.
- (comment #2)
* Normal 'restart' testing in comment #5.
+ * Test packages built successfully in all architectures
+ with -proposed enabled in Launchpad PPA mfo/lp2059272 [0]
+
+ [0] https://launchpad.net/~mfo/+archive/ubuntu/lp2059272
+
+
[ Regression Potential ]
- * The patch changes *where* in the libvirt qemu driver's
+ * One patch changes *where* in the libvirt qemu driver's
shutdown path the worker thread pool is stopped/freed:
from _after_ releasing other data to _before_ doing so.
+
+ * The other patch (Focal-only) introduces a bounded wait
+ (with configurable timeout via an environment variable)
+ in the (same) libvirt qemu driver's shutdown path.
+
+ By default, this waits for qemuProcessReconnect threads
+ for up to 30 seconds (expected to finish in less than
+ 1 second, in practice), and gives up / continues with
+ shutdown anyway so as not to introduce a behavior change
+ on this path (prevents impact in case of regressions).
* Therefore, the potential for regression is limited to
the libvirt qemu driver's shutdown path, and would be
observed when stopping/restarting libvirtd.service.
* The behavior during normal operation is not affected.
[Other Info]
- * The fix commit [1] is included in Mantic and later,
- and needed in Focal and Jammy.
+ * In Focal, race windows exist if libvirtd shuts down
+ _after_ initialization and _during_ initialization
+ (which is unlikely in practice, but it's possible.)
+
Say, 'shutdown-on-runtime' and 'shutdown-on-init'.
+
+ * In Jammy, only 'shutdown-on-runtime' might happen,
+ due to the introduction of the '.stateShutdownWait'
+ driver callback (not available in Focal), which
+ indirectly prevents the 'shutdown-on-init' race
+ due to additional synchronization with locking.
+
+ * For 'shutdown-on-init' (Focal-only), we should use a
+ downstream-only patch (with configurable behavior),
+ since upstream addressed this issue indirectly with
+ the '.stateShutdownWait' callbacks and other changes
+ (which are not SRU material, ~10 patches, redesign [2]).
+
+ * For 'shutdown-on-runtime': use upstream commit [1].
+ It's needed in Focal and Jammy (included in Mantic).
$ git describe --contains 152770333449cd3b78b4f5a9f1148fc1f482d842
v9.3.0-rc1~90
$ rmadison -a source libvirt | sed -n '/focal/,$p'
libvirt | 6.0.0-0ubuntu8 | focal | source
libvirt | 6.0.0-0ubuntu8.16 | focal-security | source
libvirt | 6.0.0-0ubuntu8.16 | focal-updates | source
libvirt | 6.0.0-0ubuntu8.17 | focal-proposed | source
libvirt | 8.0.0-1ubuntu7 | jammy | source
libvirt | 8.0.0-1ubuntu7.5 | jammy-security | source
libvirt | 8.0.0-1ubuntu7.8 | jammy-updates | source
libvirt | 9.6.0-1ubuntu1 | mantic | source
libvirt | 10.0.0-2ubuntu1 | noble | source
libvirt | 10.0.0-2ubuntu5 | noble-proposed | source
[1]
https://gitlab.com/libvirt/libvirt/-/commit/152770333449cd3b78b4f5a9f1148fc1f482d842
- * Test packages built successfully in all architectures
- with -proposed enabled in Launchpad PPA mfo/lp2059272 [2]
+ [2] https://listman.redhat.com/archives/libvir-list/2020-July/205291.html
+ PATCH 00/10] resolve hangs/crashes on libvirtd shutdown
- [2] https://launchpad.net/~mfo/+archive/ubuntu/lp2059272
+ commit 94e45d1042e21e03a15ce993f90fbef626f1ae41
+ Author: Nikolay Shirokovskiy <[email protected]>
+ Date: Thu Jul 23 09:53:04 2020 +0300
+
+ rpc: finish all threads before exiting main loop
+
+ $ git describe --contains 94e45d1042e21e03a15ce993f90fbef626f1ae41
+ v6.8.0-rc1~279
+
[Original Description]
There's a race condition on libvirtd shutdown
that might cause the domain status XML file(s)
to lose the '<monitor path=.../>' tag/field.
This causes an error on libvirtd startup, and
the domain is not listed/managed, even though it
is still running.
$ virsh list
Id Name State
-------------------------
1 test-vm running
$ sudo systemctl restart libvirtd.service
$ journalctl -b -u libvirtd.service | tail
...
... libvirtd[2789]: internal error: no monitor path
... libvirtd[2789]: Failed to load config for domain 'test-vm'
$ virsh list
Id Name State
--------------------
$ virsh list --all
Id Name State
--------------------------
- test-vm shut off
$ pgrep -af qemu-system-x86_64 | cut -d, -f1
2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,
** Description changed:
[ Impact ]
* If a race condition occurs on libvirtd shutdown,
a QEMU domain status XML (/run/libvirt/qemu/*.xml)
might lose the QEMU-driver specific information,
such as '<monitor path=.../>'.
- (The race condition details are in [Other Info].)
+ (The race condition details are in [Other Info].)
* On the next libvirtd startup, the parsing of that
QEMU domain's status XML fails as '<monitor path='
is not found:
$ journalctl -b -u libvirtd.service | tail
...
... libvirtd[2789]: internal error: no monitor path
... libvirtd[2789]: Failed to load config for domain 'test-vm'
* As a result, the domain is not listed in `virsh list`,
and `virsh` commands to it fail.
$ virsh list
Id Name State
--------------------
* The domain is still running, but libvirt considers
it as shutdown, which might cause conflicts/issues
with higher-level tools (e.g., openstack nova).
$ virsh list --all
Id Name State
--------------------------
- test-vm shut off
$ pgrep -af qemu-system-x86_64 | cut -d, -f1
2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,
[ Test Plan ]
* (Focal/Jammy) shutdown-on-runtime:
- Synthetic reproducer/verification with GDB in comments #1 and #2 (Jammy)
and #12 and #14 (Focal).
+ Synthetic reproducer/verification with GDB in comments #1 and #2 (Jammy)
and #12 and #14 (Focal).
- * (Focal-only) shutdown-on-init:
- Synthetic reproducer/verification with GDB in comments #13 and #15.
+ * (Focal-only) shutdown-on-init:
+ Synthetic reproducer/verification with GDB in comments #13 and #15.
- * On failure, the XML is saved *without* '<monitor path='
+ * On failure, the XML is saved *without* '<monitor path='
and libvirt fails to parse the domain on startup.
The domain is *not* listed in `virsh list`.
* On success, the XML is saved *with* '<monitor path='
and libvirt correctly parses the domain on startup.
The domain is listed in `virsh list`.
* Normal 'restart' testing in comment #5.
* Test packages built successfully in all architectures
with -proposed enabled in Launchpad PPA mfo/lp2059272 [0]
[0] https://launchpad.net/~mfo/+archive/ubuntu/lp2059272
-
[ Regression Potential ]
* One patch changes *where* in the libvirt qemu driver's
shutdown path the worker thread pool is stopped/freed:
from _after_ releasing other data to _before_ doing so.
- * The other patch (Focal-only) introduces a bounded wait
- (with configurable timeout via an environment variable)
- in the (same) libvirt qemu driver's shutdown path.
+ * The other patch (Focal-only) introduces a bounded wait
+ (with configurable timeout via an environment variable)
+ in the (same) libvirt qemu driver's shutdown path.
- By default, this waits for qemuProcessReconnect threads
- for up to 30 seconds (expected to finish in less than
- 1 second, in practice), and gives up / continues with
- shutdown anyway so as not to introduce a behavior change
- on this path (prevents impact in case of regressions).
+ By default, this waits for qemuProcessReconnect threads
+ for up to 30 seconds (expected to finish in less than
+ 1 second, in practice), and gives up / continues with
+ shutdown anyway so as not to introduce a behavior change
+ on this path (prevents impact in case of regressions).
* Therefore, the potential for regression is limited to
the libvirt qemu driver's shutdown path, and would be
observed when stopping/restarting libvirtd.service.
* The behavior during normal operation is not affected.
[Other Info]
- * In Focal, race windows exist if libvirtd shuts down
- _after_ initialization and _during_ initialization
- (which is unlikely in practice, but it's possible.)
+ * In Focal, race windows exist if libvirtd shuts down
+ _after_ initialization and _during_ initialization
+ (which is unlikely in practice, but it's possible.)
- Say, 'shutdown-on-runtime' and 'shutdown-on-init'.
+ Say, 'shutdown-on-runtime' and 'shutdown-on-init'.
- * In Jammy, only 'shutdown-on-runtime' might happen,
- due to the introduction of the '.stateShutdownWait'
- driver callback (not available in Focal), which
- indirectly prevents the 'shutdown-on-init' race
- due to additional synchronization with locking.
-
- * For 'shutdown-on-init' (Focal-only), we should use a
- downstream-only patch (with configurable behavior),
- since upstream addressed this issue indirectly with
- the '.stateShutdownWait' callbacks and other changes
- (which are not SRU material, ~10 patches, redesign [2]).
+ * In Jammy, only 'shutdown-on-runtime' might happen,
+ due to the introduction of the '.stateShutdownWait'
+ driver callback (not available in Focal), which
+ indirectly prevents the 'shutdown-on-init' race
+ due to additional synchronization with locking.
* For 'shutdown-on-runtime': use upstream commit [1].
- It's needed in Focal and Jammy (included in Mantic).
+ It's needed in Focal and Jammy (included in Mantic).
+
+ * For 'shutdown-on-init' (Focal-only), we should use a
+ downstream-only patch (with configurable behavior),
+ since upstream addressed this issue indirectly with
+ the '.stateShutdownWait' callbacks and other changes
+ (which are not SRU material, ~10 patches, redesign [2])
+ in 6.8.0.
+
+ [1]
+
https://gitlab.com/libvirt/libvirt/-/commit/152770333449cd3b78b4f5a9f1148fc1f482d842
$ git describe --contains 152770333449cd3b78b4f5a9f1148fc1f482d842
v9.3.0-rc1~90
$ rmadison -a source libvirt | sed -n '/focal/,$p'
libvirt | 6.0.0-0ubuntu8 | focal | source
libvirt | 6.0.0-0ubuntu8.16 | focal-security | source
libvirt | 6.0.0-0ubuntu8.16 | focal-updates | source
libvirt | 6.0.0-0ubuntu8.17 | focal-proposed | source
libvirt | 8.0.0-1ubuntu7 | jammy | source
libvirt | 8.0.0-1ubuntu7.5 | jammy-security | source
libvirt | 8.0.0-1ubuntu7.8 | jammy-updates | source
libvirt | 9.6.0-1ubuntu1 | mantic | source
libvirt | 10.0.0-2ubuntu1 | noble | source
libvirt | 10.0.0-2ubuntu5 | noble-proposed | source
- [1]
-
https://gitlab.com/libvirt/libvirt/-/commit/152770333449cd3b78b4f5a9f1148fc1f482d842
-
[2] https://listman.redhat.com/archives/libvir-list/2020-July/205291.html
- PATCH 00/10] resolve hangs/crashes on libvirtd shutdown
+ [PATCH 00/10] resolve hangs/crashes on libvirtd shutdown
commit 94e45d1042e21e03a15ce993f90fbef626f1ae41
Author: Nikolay Shirokovskiy <[email protected]>
Date: Thu Jul 23 09:53:04 2020 +0300
rpc: finish all threads before exiting main loop
$ git describe --contains 94e45d1042e21e03a15ce993f90fbef626f1ae41
v6.8.0-rc1~279
-
[Original Description]
There's a race condition on libvirtd shutdown
that might cause the domain status XML file(s)
to lose the '<monitor path=.../>' tag/field.
This causes an error on libvirtd startup, and
the domain is not listed/managed, even though it
is still running.
$ virsh list
Id Name State
-------------------------
1 test-vm running
$ sudo systemctl restart libvirtd.service
$ journalctl -b -u libvirtd.service | tail
...
... libvirtd[2789]: internal error: no monitor path
... libvirtd[2789]: Failed to load config for domain 'test-vm'
$ virsh list
Id Name State
--------------------
$ virsh list --all
Id Name State
--------------------------
- test-vm shut off
$ pgrep -af qemu-system-x86_64 | cut -d, -f1
2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,
--
https://bugs.launchpad.net/bugs/2059272
Title:
libvirt domain is not listed/managed after libvirt restart with
messages "internal error: no monitor path" and "Failed to load config
for domain"