** Description changed:

  [ Impact ]
  
  When running `block-stream` and `query-named-block-nodes` concurrently,
  a null-pointer dereference causes QEMU to segfault.
  
+ The original reporter of this issue experienced the bug while performing
+ concurrent libvirt `virDomainBlockPull` calls on the same VM/different
+ disks. The race condition occurs at the end of the `block-stream` QMP;
+ libvirt's handler for a completed `block-stream`
+ (`qemuBlockJobProcessEventCompletedPull` [1]) calls
+ `query-named-block-nodes` (see "libvirt trace" below for a full trace).
+ 
  This occurs in every version of QEMU shipped with Ubuntu, 22.04 thru
- 25.10. I have not yet reproduced the bug using an upstream build.
- 
- I will link the upstream bug report here as soon as I've written it.
- 
- [ Reproducer ]
+ 25.10.
+ 
+ [1] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n870
+ 
+ [ Test Plan ]
+ 
+ ```
+ sudo apt install libvirt-daemon-system virtinst
+ ```
  
  In `query-named-block-nodes.sh`:
  ```sh
  #!/bin/bash
  
  while true; do
      virsh qemu-monitor-command "$1" query-named-block-nodes > /dev/null
  done
  ```
  
  In `blockrebase-crash.sh`:
  ```sh
  #!/bin/bash
  
  set -ex
  
  domain="$1"
  
  if [ -z "${domain}" ]; then
      echo "Missing domain name"
      exit 1
  fi
  
- ./query_named_block_nodes.sh "${domain}" &
+ ./query-named-block-nodes.sh "${domain}" &
  query_pid=$!
  
  while [ -n "$(virsh list --uuid)" ]; do
      snap="snap0-$(uuidgen)"
  
      virsh snapshot-create-as "${domain}" \
          --name "${snap}" \
          --disk-only file= \
          --diskspec vda,snapshot=no \
        --diskspec "vdb,stype=file,file=/var/lib/libvirt/images/n0-blk0_${snap}.qcow2" \
          --atomic \
          --no-metadata
  
      virsh blockpull "${domain}" vdb
  
      while bjr=$(virsh blockjob "$domain" vdb); do
          if [[ "$bjr" == *"No current block job for"* ]] ; then
              break;
          fi;
      done;
  done
  
  kill "${query_pid}"
  ```
  
- Provision (`Ctrl + ]` after boot):
- ```sh
- wget 
https://cloud-images.ubuntu.com/daily/server/noble/current/noble-server-cloudimg-amd64.img
+ `provision.sh` (`Ctrl + ]` after boot):
+ ```sh
+ #!/bin/bash
+ 
+ set -ex
+ 
+ wget https://cloud-images.ubuntu.com/daily/server/noble/current/noble-server-cloudimg-amd64.img
  
  sudo cp noble-server-cloudimg-amd64.img /var/lib/libvirt/images/n0-root.qcow2
  sudo qemu-img create -f qcow2 /var/lib/libvirt/images/n0-blk0.qcow2 10G
  
  touch network-config
  touch meta-data
  touch user-data
  
  virt-install \
    -n n0 \
    --description "Test noble minimal" \
    --os-variant=ubuntu24.04 \
    --ram=1024 --vcpus=2 \
    --import \
    --disk path=/var/lib/libvirt/images/n0-root.qcow2,bus=virtio,cache=writethrough,size=10 \
    --disk path=/var/lib/libvirt/images/n0-blk0.qcow2,bus=virtio,cache=writethrough,size=10 \
    --graphics none \
    --network network=default \
    --cloud-init user-data="user-data,meta-data=meta-data,network-config=network-config"
  ```
  
  And run the scripts to cause the crash (you may need to manually kill
  query-named-block-nodes.sh):
  ```sh
+ chmod 755 provision.sh blockrebase-crash.sh query-named-block-nodes.sh
+ ./provision.sh
  ./blockrebase-crash.sh n0
  ```
  
- [ Details ]
+ Expected behavior: `blockrebase-crash.sh` runs until "No space left on
+ device"
+ 
+ Actual behavior: QEMU crashes after a few iterations:
+ ```
+ Block Pull: [81.05 %]+ bjr=
+ + [[ '' == *\N\o\ \c\u\r\r\e\n\t\ \b\l\o\c\k\ \j\o\b\ \f\o\r* ]]
+ ++ virsh blockjob n0 vdb
+ Block Pull: [97.87 %]+ bjr=
+ + [[ '' == *\N\o\ \c\u\r\r\e\n\t\ \b\l\o\c\k\ \j\o\b\ \f\o\r* ]]
+ ++ virsh blockjob n0 vdb
+ error: Unable to read from monitor: Connection reset by peer
+ error: Unable to read from monitor: Connection reset by peer
+ + bjr=
+ ++ virsh list --uuid
+ + '[' -n 4eed8ba4-300b-4488-a520-510e5b544f57 ']'
+ ++ uuidgen
+ + snap=snap0-88be23e5-696c-445d-870a-abe5f7df56c0
+ + virsh snapshot-create-as n0 --name snap0-88be23e5-696c-445d-870a-abe5f7df56c0 --disk-only file= --diskspec vda,snapshot=no --diskspec vdb,stype=file,file=/var/lib/libvirt/images/n0-blk0_snap0-88be23e5-696c-445d-870a-abe5f7df56c0.qcow2 --atomic --no-metadata
+ error: Requested operation is not valid: domain is not running
+ Domain snapshot snap0-88be23e5-696c-445d-870a-abe5f7df56c0 created
+ + virsh blockpull n0 vdb
+ error: Requested operation is not valid: domain is not running
+ error: Requested operation is not valid: domain is not running
+ 
+ wesley@nv0:~$ error: Requested operation is not valid: domain is not running
+ ```
+ 
+ [ Where problems could occur ]
+ 
+ The only codepaths affected by this change are `block-stream` and
+ `blockdev-backup` [1][2]. If the code is somehow broken, we would expect
+ to see failures when executing these QMP commands (or the libvirt APIs
+ that use them, `virDomainBlockPull` and `virDomainBackupBegin` [3][4]).
+ 
+ As noted in the upstream commit message, the change does cause an
+ additional flush to occur during `blockdev-backup` QMPs.
+ 
+ The patch that was ultimately merged upstream was a revert of most of
+ [5]. _That_ patch was a workaround for a blockdev permissions issue that
+ was later resolved in [6] (see the end of [7] and replies for upstream
+ discussion). Both [5] and [6] are present in QEMU 6.2.0, so the
+ assumptions that led us to the upstream solution hold for Jammy.
+ 
+ [1] https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#command-QMP-block-core.block-stream
+ [2] https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#command-QMP-block-core.blockdev-backup
+ [3] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockPull
+ [4] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin
+ [5] https://gitlab.com/qemu-project/qemu/-/commit/3108a15cf09
+ [6] https://gitlab.com/qemu-project/qemu/-/commit/3860c0201924d
+ [7] https://lists.gnu.org/archive/html/qemu-devel/2025-10/msg06800.html
+ 
+ [ Other info ]
  
  Backtrace from the coredump (source at [1]):
  ```
  #0  bdrv_refresh_filename (bs=0x5efed72f8350) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:8082
  #1  0x00005efea73cf9dc in bdrv_block_device_info (blk=0x0, bs=0x5efed72f8350, flat=true, errp=0x7ffeb829ebd8) at block/qapi.c:62
  #2  0x00005efea7391ed3 in bdrv_named_nodes_list (flat=<optimized out>, errp=0x7ffeb829ebd8) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:6275
  #3  0x00005efea7471993 in qmp_query_named_block_nodes (has_flat=<optimized out>, flat=<optimized out>, errp=0x7ffeb829ebd8) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/blockdev.c:2834
  #4  qmp_marshal_query_named_block_nodes (args=<optimized out>, ret=0x7f2b753beec0, errp=0x7f2b753beec8) at qapi/qapi-commands-block-core.c:553
  #5  0x00005efea74f03a5 in do_qmp_dispatch_bh (opaque=0x7f2b753beed0) at qapi/qmp-dispatch.c:128
  #6  0x00005efea75108e6 in aio_bh_poll (ctx=0x5efed6f3f430) at util/async.c:219
  #7  0x00005efea74ffdb2 in aio_dispatch (ctx=0x5efed6f3f430) at util/aio-posix.c:436
  #8  0x00005efea7512846 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:361
  #9  0x00007f2b77809bfb in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #10 0x00007f2b77809e70 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #11 0x00005efea7517228 in glib_pollfds_poll () at util/main-loop.c:287
  #12 os_host_main_loop_wait (timeout=0) at util/main-loop.c:310
  #13 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:589
  #14 0x00005efea7140482 in qemu_main_loop () at system/runstate.c:905
  #15 0x00005efea744e4e8 in qemu_default_main (opaque=opaque@entry=0x0) at system/main.c:50
  #16 0x00005efea6e76319 in main (argc=<optimized out>, argv=<optimized out>) at system/main.c:93
  ```
  
- The libvirt logs suggest that the crash occurs right at the end of the
- blockjob, since it reaches "concluded" state before crashing. I assume
- that this is one of:
+ The libvirt logs suggest that the crash occurs right at the end of the
+ blockjob, since it reaches "concluded" state before crashing. I assumed
+ that this was one of:
  - `stream_clean` is freeing/modifying the `cor_filter_bs` without holding a 
lock that it needs to [2][3]
  - `bdrv_refresh_filename` needs to handle the possibility that the QLIST of 
children for a filter bs could be NULL [1]
+ 
+ Ultimately the fix was neither of these [4]; `bdrv_refresh_filename`
+ should not be able to observe a NULL list of children.
+ 
+ `query-named-block-nodes` iterates the global list of block nodes
+ `graph_bdrv_states` [5]. The offending block node (the `cor_filter_bs`,
+ added during a `block-stream`) was removed from the list of block nodes
+ _for the disk_ when the operation finished, but not removed from the
+ global list of block nodes until later (this is the window for the
+ race). The patch keeps the block node in the disk's list until it is
+ dropped at the end of the blockjob.
  
  [1] https://git.launchpad.net/ubuntu/+source/qemu/tree/block.c?h=ubuntu/questing-devel#n8071
  [2] https://git.launchpad.net/ubuntu/+source/qemu/tree/block/stream.c?h=ubuntu/questing-devel#n131
  [3] https://git.launchpad.net/ubuntu/+source/qemu/tree/block/stream.c?h=ubuntu/questing-devel#n340
+ [4] https://gitlab.com/qemu-project/qemu/-/commit/9dbfd4e28dd11a83f54c371fade8d49a63d6dc1e
+ [5] https://gitlab.com/qemu-project/qemu/-/blob/v10.1.0/block.c?ref_type=tags#L72
+ 
+ [ libvirt trace ]
+ `qemuBlockJobProcessEventCompletedPull` [1]
+ `qemuBlockJobProcessEventCompletedPullBitmaps` [2]
+ `qemuBlockGetNamedNodeData` [3]
+ `qemuMonitorBlockGetNamedNodeData` [4]
+ `qemuMonitorJSONBlockGetNamedNodeData` [5]
+ `qemuMonitorJSONQueryNamedBlockNodes` [6]
+ 
+ [1] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n870
+ [2] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n807
+ [3] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_block.c?h=applied/ubuntu/questing-devel#n2925
+ [4] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor.c?h=applied/ubuntu/questing-devel#n2039
+ [5] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor_json.c?h=applied/ubuntu/questing-devel#n2816
+ [6] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor_json.c?h=applied/ubuntu/questing-devel#n2159

** Changed in: qemu (Ubuntu Questing)
       Status: Confirmed => In Progress

** Changed in: qemu (Ubuntu Noble)
       Status: Confirmed => In Progress

** Changed in: qemu (Ubuntu Jammy)
       Status: Confirmed => In Progress

** Changed in: qemu (Ubuntu Resolute)
     Assignee: Wesley Hershberger (whershberger) => (unassigned)

** Changed in: qemu (Ubuntu Resolute)
       Status: Confirmed => In Progress

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2126951

Title:
  `block-stream` segfault with concurrent `query-named-block-nodes`

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/2126951/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
