Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-03-21 Thread Jonah Palmer




On 3/21/24 3:48 PM, Dongli Zhang wrote:

Hi Jonah,

Would you mind helping explain how does VIRTIO_F_IN_ORDER improve the 
performance?

https://lore.kernel.org/all/20240321155717.1392787-1-jonah.pal...@oracle.com/#t

I tried to look for it from prior discussions but could not find why.

https://lore.kernel.org/all/byapr18mb2791df7e6c0f61e2d8698e8fa0...@byapr18mb2791.namprd18.prod.outlook.com/

Thank you very much!

Dongli Zhang



Hey Dongli,

So VIRTIO_F_IN_ORDER can theoretically improve performance under certain 
conditions. Whether it can improve performance today, I'm not sure.


But, if we can guarantee that all buffers are used by the device in the 
same order in which they're made available by the driver (enforcing a 
strict in-order processing and completion of requests), then we can 
leverage this to our advantage.


For example, we could simplify device and driver logic such as not 
needing complex mechanisms to track the completion of out-of-order 
requests (reduce request management overhead). Though the need of 
complex mechanisms to force this data to be in-order kind of defeats 
this benefit.


It could also improve cache utilization since sequential access patterns 
are more cache-friendly compared to random access patterns.


Also, in-order processing is more predictable, making it easier to 
optimize device and driver performance. E.g. it can allow us to 
fine-tune things without having to account for the variability of 
out-of-order completions.


But again, the actual performance impact will vary depending on the use 
case and workload. Scenarios that require high levels of parallelism or 
where out-of-order completions are efficiently managed, the flexibility 
of out-of-order processing can still be preferable.


Jonah


On 3/21/24 08:57, Jonah Palmer wrote:

The goal of these patches is to add support to a variety of virtio and
vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
indicates that all buffers are used by the device in the same order in
which they were made available by the driver.

These patches attempt to implement a generalized, non-device-specific
solution to support this feature.

The core feature behind this solution is a buffer mechanism in the form
of GLib's GHashTable. The decision behind using a hash table was to
leverage their ability for quick lookup, insertion, and removal
operations. Given that our keys are simply numbers of an ordered
sequence, a hash table seemed like the best choice for a buffer
mechanism.

-

The strategy behind this implementation is as follows:

We know that buffers that are popped from the available ring and enqueued
for further processing will always done in the same order in which they
were made available by the driver. Given this, we can note their order
by assigning the resulting VirtQueueElement a key. This key is a number
in a sequence that represents the order in which they were popped from
the available ring, relative to the other VirtQueueElements.

For example, given 3 "elements" that were popped from the available
ring, we assign a key value to them which represents their order (elem0
is popped first, then elem1, then lastly elem2):

  elem2   --  elem1   --  elem0   ---> Enqueue for processing
 (key: 2)(key: 1)(key: 0)

Then these elements are enqueued for further processing by the host.

While most devices will return these completed elements in the same
order in which they were enqueued, some devices may not (e.g.
virtio-blk). To guarantee that these elements are put on the used ring
in the same order in which they were enqueued, we can use a buffering
mechanism that keeps track of the next expected sequence number of an
element.

In other words, if the completed element does not have a key value that
matches the next expected sequence number, then we know this element is
not in-order and we must stash it away in a hash table until an order
can be made. The element's key value is used as the key for placing it
in the hash table.

If the completed element has a key value that matches the next expected
sequence number, then we know this element is in-order and we can push
it on the used ring. Then we increment the next expected sequence number
and check if the hash table contains an element at this key location.

If so, we retrieve this element, push it to the used ring, delete the
key-value pair from the hash table, increment the next expected sequence
number, and check the hash table again for an element at this new key
location. This process is repeated until we're unable to find an element
in the hash table to continue the order.

So, for example, say the 3 elements we enqueued were completed in the
following order: elem1, elem2, elem0. The next expected sequence number
is 0:

 exp-seq-num = 0:

  elem1   --> elem1.key == exp-seq-num ? --> No, stash it
 (key: 1) |

Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-03-21 Thread Dongli Zhang
Hi Jonah,

Would you mind helping explain how does VIRTIO_F_IN_ORDER improve the 
performance?

https://lore.kernel.org/all/20240321155717.1392787-1-jonah.pal...@oracle.com/#t

I tried to look for it from prior discussions but could not find why.

https://lore.kernel.org/all/byapr18mb2791df7e6c0f61e2d8698e8fa0...@byapr18mb2791.namprd18.prod.outlook.com/

Thank you very much!

Dongli Zhang

On 3/21/24 08:57, Jonah Palmer wrote:
> The goal of these patches is to add support to a variety of virtio and
> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
> indicates that all buffers are used by the device in the same order in
> which they were made available by the driver.
> 
> These patches attempt to implement a generalized, non-device-specific
> solution to support this feature.
> 
> The core feature behind this solution is a buffer mechanism in the form
> of GLib's GHashTable. The decision behind using a hash table was to
> leverage their ability for quick lookup, insertion, and removal
> operations. Given that our keys are simply numbers of an ordered
> sequence, a hash table seemed like the best choice for a buffer
> mechanism.
> 
> -
> 
> The strategy behind this implementation is as follows:
> 
> We know that buffers that are popped from the available ring and enqueued
> for further processing will always done in the same order in which they
> were made available by the driver. Given this, we can note their order
> by assigning the resulting VirtQueueElement a key. This key is a number
> in a sequence that represents the order in which they were popped from
> the available ring, relative to the other VirtQueueElements.
> 
> For example, given 3 "elements" that were popped from the available
> ring, we assign a key value to them which represents their order (elem0
> is popped first, then elem1, then lastly elem2):
> 
>  elem2   --  elem1   --  elem0   ---> Enqueue for processing
> (key: 2)(key: 1)(key: 0)
> 
> Then these elements are enqueued for further processing by the host.
> 
> While most devices will return these completed elements in the same
> order in which they were enqueued, some devices may not (e.g.
> virtio-blk). To guarantee that these elements are put on the used ring
> in the same order in which they were enqueued, we can use a buffering
> mechanism that keeps track of the next expected sequence number of an
> element.
> 
> In other words, if the completed element does not have a key value that
> matches the next expected sequence number, then we know this element is
> not in-order and we must stash it away in a hash table until an order
> can be made. The element's key value is used as the key for placing it
> in the hash table.
> 
> If the completed element has a key value that matches the next expected
> sequence number, then we know this element is in-order and we can push
> it on the used ring. Then we increment the next expected sequence number
> and check if the hash table contains an element at this key location.
> 
> If so, we retrieve this element, push it to the used ring, delete the
> key-value pair from the hash table, increment the next expected sequence
> number, and check the hash table again for an element at this new key
> location. This process is repeated until we're unable to find an element
> in the hash table to continue the order.
> 
> So, for example, say the 3 elements we enqueued were completed in the
> following order: elem1, elem2, elem0. The next expected sequence number
> is 0:
> 
> exp-seq-num = 0:
> 
>  elem1   --> elem1.key == exp-seq-num ? --> No, stash it
> (key: 1) |
>  |
>  v
>
>|key: 1 - elem1|
>
> -
> exp-seq-num = 0:
> 
>  elem2   --> elem2.key == exp-seq-num ? --> No, stash it
> (key: 2) |
>  |
>  v
>
>|key: 1 - elem1|
>|--|
>|key: 2 - elem2|
>
> -
> exp-seq-num = 0:
> 
>  elem0   --> elem0.key == exp-seq-num ? --> Yes, push to used ring
> (key: 0)
> 
> exp-seq-num = 1:
> 
> lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
>  remove elem from table
>  |
> 

[PULL for-9.0 1/1] coroutine: reserve 5,000 mappings

2024-03-21 Thread Stefan Hajnoczi
Daniel P. Berrangé  pointed out that the coroutine
pool size heuristic is very conservative. Instead of halving
max_map_count, he suggested reserving 5,000 mappings for non-coroutine
users based on observations of guests he has access to.

Fixes: 86a637e48104 ("coroutine: cap per-thread local pool size")
Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Daniel P. Berrangé 
Message-id: 20240320181232.1464819-1-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/qemu-coroutine.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 2790959eaf..eb4eebefdf 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -377,12 +377,17 @@ static unsigned int get_global_pool_hard_max_size(void)
 NULL) &&
 qemu_strtoi(contents, NULL, 10, _map_count) == 0) {
 /*
- * This is a conservative upper bound that avoids exceeding
- * max_map_count. Leave half for non-coroutine users like library
- * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
- * halve the amount again.
+ * This is an upper bound that avoids exceeding max_map_count. Leave a
+ * fixed amount for non-coroutine users like library dependencies,
+ * vhost-user, etc. Each coroutine takes up 2 VMAs so halve the
+ * remaining amount.
  */
-return max_map_count / 4;
+if (max_map_count > 5000) {
+return (max_map_count - 5000) / 2;
+} else {
+/* Disable the global pool but threads still have local pools */
+return 0;
+}
 }
 #endif
 
-- 
2.44.0




[PULL for-9.0 0/1] Block patches

2024-03-21 Thread Stefan Hajnoczi
The following changes since commit fea445e8fe9acea4f775a832815ee22bdf2b0222:

  Merge tag 'pull-maintainer-final-for-real-this-time-200324-1' of 
https://gitlab.com/stsquad/qemu into staging (2024-03-21 10:31:56 +)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 9352f80cd926fe2dde7c89b93ee33bb0356ff40e:

  coroutine: reserve 5,000 mappings (2024-03-21 13:14:30 -0400)


Pull request

I was too quick in sending the coroutine pool sizing change for -rc0 and still
needed to address feedback from Daniel Berrangé.



Stefan Hajnoczi (1):
  coroutine: reserve 5,000 mappings

 util/qemu-coroutine.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

-- 
2.44.0




[RFC 4/8] virtio: Implement in-order handling for virtio devices

2024-03-21 Thread Jonah Palmer
Implements in-order handling for most virtio devices using the
VIRTIO_F_IN_ORDER transport feature, specifically those who call
virtqueue_push to push their used elements onto the used ring.

The logic behind this implementation is as follows:

1.) virtqueue_pop always enqueues VirtQueueElements in-order.

virtqueue_pop always retrieves one or more buffer descriptors in-order
from the available ring and converts them into a VirtQueueElement. This
means that the order in which VirtQueueElements are enqueued are
in-order by default.

By virtue, as VirtQueueElements are created, we can assign a sequential
key value to them. This preserves the order of buffers that have been
made available to the device by the driver.

As VirtQueueElements are assigned a key value, the current sequence
number is incremented.

2.) Requests can be completed out-of-order.

While most devices complete requests in the same order that they were
enqueued by default, some devices don't (e.g. virtio-blk). The goal of
this out-of-order handling is to reduce the impact of devices that
process elements in-order by default while also guaranteeing compliance
with the VIRTIO_F_IN_ORDER feature.

Below is the logic behind handling completed requests (which may or may
not be in-order).

3.) Does the incoming used VirtQueueElement preserve the correct order?

In other words, is the sequence number (key) assigned to the
VirtQueueElement the expected number that would preserve the original
order?

3a.)
If it does... immediately push the used element onto the used ring.
Then increment the next expected sequence number and check to see if
any previous out-of-order VirtQueueElements stored on the hash table
has a key that matches this next expected sequence number.

For each VirtQueueElement found on the hash table with a matching key:
push the element on the used ring, remove the key-value pair from the
hash table, and then increment the next expected sequence number. Repeat
this process until we're unable to find an element with a matching key.

Note that if the device uses batching (e.g. virtio-net), then we skip
the virtqueue_flush call and let the device call it themselves.

3b.)
If it does not... stash the VirtQueueElement, along with relevant data,
as a InOrderVQElement on the hash table. The key used is the order_key
that was assigned when the VirtQueueElement was created.

Signed-off-by: Jonah Palmer 
---
 hw/virtio/virtio.c | 70 --
 include/hw/virtio/virtio.h |  8 +
 2 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 40124545d6..40e4377f1e 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -992,12 +992,56 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
 }
 }
 
+void virtqueue_order_element(VirtQueue *vq, const VirtQueueElement *elem,
+ unsigned int len, unsigned int idx,
+ unsigned int count)
+{
+InOrderVQElement *in_order_elem;
+
+if (elem->order_key == vq->current_order_idx) {
+/* Element is in-order, push to used ring */
+virtqueue_fill(vq, elem, len, idx);
+
+/* Batching? Don't flush */
+if (count) {
+virtqueue_flush(vq, count);
+}
+
+/* Increment next expected order, search for more in-order elements */
+while ((in_order_elem = g_hash_table_lookup(vq->in_order_ht,
+GUINT_TO_POINTER(++vq->current_order_idx))) != NULL) {
+/* Found in-order element, push to used ring */
+virtqueue_fill(vq, in_order_elem->elem, in_order_elem->len,
+   in_order_elem->idx);
+
+/* Batching? Don't flush */
+if (count) {
+virtqueue_flush(vq, in_order_elem->count);
+}
+
+/* Remove key-value pair from hash table */
+g_hash_table_remove(vq->in_order_ht,
+GUINT_TO_POINTER(vq->current_order_idx));
+}
+} else {
+/* Element is out-of-order, stash in hash table */
+in_order_elem = virtqueue_alloc_in_order_element(elem, len, idx,
+ count);
+g_hash_table_insert(vq->in_order_ht, GUINT_TO_POINTER(elem->order_key),
+in_order_elem);
+}
+}
+
 void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
 unsigned int len)
 {
 RCU_READ_LOCK_GUARD();
-virtqueue_fill(vq, elem, len, 0);
-virtqueue_flush(vq, 1);
+if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_IN_ORDER)) {
+virtqueue_order_element(vq, elem, len, 0, 1);
+} else {
+virtqueue_fill(vq, elem, len, 0);
+virtqueue_flush(vq, 1);
+}
 }
 
 /* Called within rcu_read_lock().  */
@@ -1478,6 +1522,18 @@ void virtqueue_map(VirtIODevice *vdev, VirtQueueElement 
*elem)

[RFC 3/8] virtio: Define order variables

2024-03-21 Thread Jonah Palmer
Define order variables for their use in a VirtQueue's in-order hash
table. Also initialize current_order variables to 0 when creating or
resetting a VirtQueue. These variables are used when the device has
negotiated the VIRTIO_F_IN_ORDER transport feature.

A VirtQueue's current_order_idx represents the next expected index in
the sequence of VirtQueueElements to be processed (put on the used
ring). The next VirtQueueElement to be processed must match this
sequence number before additional elements can be safely added to the
used ring.

A VirtQueue's current_order_key is essentially a counter whose value is
saved as a key in a VirtQueueElement. After the value has been assigned
to the VirtQueueElement, the counter is incremented. All
VirtQueueElements being used by the device are assigned a key value and
the sequence at which they're assigned must be preserved when the device
puts these elements on the used ring.

A VirtQueueElement's order_key is value of a VirtQueue's
current_order_key at the time of the VirtQueueElement's creation. This
value must match with the VirtQueue's current_order_idx before it's able
to be put on the used ring by the device.

Signed-off-by: Jonah Palmer 
---
 hw/virtio/virtio.c | 6 ++
 include/hw/virtio/virtio.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index d2afeeb59a..40124545d6 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -155,6 +155,8 @@ struct VirtQueue
 
 /* In-Order */
 GHashTable *in_order_ht;
+uint16_t current_order_idx;
+uint16_t current_order_key;
 };
 
 const char *virtio_device_names[] = {
@@ -2103,6 +2105,8 @@ static void __virtio_queue_reset(VirtIODevice *vdev, 
uint32_t i)
 if (vdev->vq[i].in_order_ht != NULL) {
 g_hash_table_remove_all(vdev->vq[i].in_order_ht);
 }
+vdev->vq[i].current_order_idx = 0;
+vdev->vq[i].current_order_key = 0;
 virtio_virtqueue_reset_region_cache(>vq[i]);
 }
 
@@ -2357,6 +2361,8 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int 
queue_size,
 g_hash_table_new_full(g_direct_hash, g_direct_equal, NULL,
   free_in_order_vq_element);
 }
+vdev->vq[i].current_order_idx = 0;
+vdev->vq[i].current_order_key = 0;
 
 return >vq[i];
 }
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index c8aa435a5e..f83d7e1fee 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -75,6 +75,7 @@ typedef struct VirtQueueElement
 hwaddr *out_addr;
 struct iovec *in_sg;
 struct iovec *out_sg;
+uint16_t order_key;
 } VirtQueueElement;
 
 typedef struct InOrderVQElement {
-- 
2.39.3




[RFC 8/8] virtio: Add VIRTIO_F_IN_ORDER property definition

2024-03-21 Thread Jonah Palmer
Extend the virtio device property definitions to include the
VIRTIO_F_IN_ORDER feature.

The default state of this feature is disabled, allowing it to be
explicitly enabled where it's supported.

Signed-off-by: Jonah Palmer 
---
 include/hw/virtio/virtio.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index eeeda397a9..ffd78830a3 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -400,7 +400,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
 DEFINE_PROP_BIT64("packed", _state, _field, \
   VIRTIO_F_RING_PACKED, false), \
 DEFINE_PROP_BIT64("queue_reset", _state, _field, \
-  VIRTIO_F_RING_RESET, true)
+  VIRTIO_F_RING_RESET, true), \
+DEFINE_PROP_BIT64("in_order", _state, _field, \
+  VIRTIO_F_IN_ORDER, false)
 
 hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
 bool virtio_queue_enabled_legacy(VirtIODevice *vdev, int n);
-- 
2.39.3




[RFC 5/8] virtio-net: in-order handling

2024-03-21 Thread Jonah Palmer
Implements in-order handling for the virtio-net device.

Since virtio-net utilizes batching for its Rx VirtQueue, the device is
responsible for calling virtqueue_flush once it has completed its
batching operation.

Note:
-
It's unclear if this implementation is really necessary to "guarantee"
that used VirtQueueElements are put on the used ring in-order since, by
design, virtio-net already does this with its Rx VirtQueue.

Signed-off-by: Jonah Palmer 
---
 hw/net/virtio-net.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 9959f1932b..b0375f7e5e 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -2069,7 +2069,11 @@ static ssize_t virtio_net_receive_rcu(NetClientState 
*nc, const uint8_t *buf,
 
 for (j = 0; j < i; j++) {
 /* signal other side */
-virtqueue_fill(q->rx_vq, elems[j], lens[j], j);
+if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
+virtqueue_order_element(q->rx_vq, elems[j], lens[j], j, 0);
+} else {
+virtqueue_fill(q->rx_vq, elems[j], lens[j], j);
+}
 g_free(elems[j]);
 }
 
-- 
2.39.3




[RFC 6/8] vhost-svq: in-order handling

2024-03-21 Thread Jonah Palmer
Implements in-order handling for vhost devices using shadow virtqueues.

Since vhost's shadow virtqueues utilize batching in their
vhost_svq_flush calls, the vhost device is responsible for calling
virtqueue_flush once it has completed its batching operation.

Note:
-
It's unclear if this implementation is really necessary to "guarantee"
in-order handling since, by design, the vhost_svq_flush function puts
used VirtQueueElements in-order already.

Signed-off-by: Jonah Palmer 
---
 hw/virtio/vhost-shadow-virtqueue.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index fc5f408f77..3c42adee87 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -493,11 +493,20 @@ static void vhost_svq_flush(VhostShadowVirtqueue *svq,
 qemu_log_mask(LOG_GUEST_ERROR,
  "More than %u used buffers obtained in a %u size SVQ",
  i, svq->vring.num);
-virtqueue_fill(vq, elem, len, i);
-virtqueue_flush(vq, i);
+if (virtio_vdev_has_feature(svq->vdev, VIRTIO_F_IN_ORDER)) {
+virtqueue_order_element(vq, elem, len, i, i);
+} else {
+virtqueue_fill(vq, elem, len, i);
+virtqueue_flush(vq, i);
+}
 return;
 }
-virtqueue_fill(vq, elem, len, i++);
+
+if (virtio_vdev_has_feature(svq->vdev, VIRTIO_F_IN_ORDER)) {
+virtqueue_order_element(vq, elem, len, i++, 0);
+} else {
+virtqueue_fill(vq, elem, len, i++);
+}
 }
 
 virtqueue_flush(vq, i);
-- 
2.39.3




[RFC 7/8] vhost/vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits

2024-03-21 Thread Jonah Palmer
Add support for the VIRTIO_F_IN_ORDER feature across a variety of vhost
devices.

The inclusion of VIRTIO_F_IN_ORDER in the feature bits arrays for these
devices ensures that the backend is capable of offering and providing
support for this feature, and that it can be disabled if the backend
does not support it.

Signed-off-by: Jonah Palmer 
---
 hw/block/vhost-user-blk.c| 1 +
 hw/net/vhost_net.c   | 2 ++
 hw/scsi/vhost-scsi.c | 1 +
 hw/scsi/vhost-user-scsi.c| 1 +
 hw/virtio/vhost-user-fs.c| 1 +
 hw/virtio/vhost-user-vsock.c | 1 +
 net/vhost-vdpa.c | 1 +
 7 files changed, 8 insertions(+)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 6a856ad51a..d176ed857e 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -51,6 +51,7 @@ static const int user_feature_bits[] = {
 VIRTIO_F_RING_PACKED,
 VIRTIO_F_IOMMU_PLATFORM,
 VIRTIO_F_RING_RESET,
+VIRTIO_F_IN_ORDER,
 VHOST_INVALID_FEATURE_BIT
 };
 
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index e8e1661646..33d1d4b9d3 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -48,6 +48,7 @@ static const int kernel_feature_bits[] = {
 VIRTIO_F_IOMMU_PLATFORM,
 VIRTIO_F_RING_PACKED,
 VIRTIO_F_RING_RESET,
+VIRTIO_F_IN_ORDER,
 VIRTIO_NET_F_HASH_REPORT,
 VHOST_INVALID_FEATURE_BIT
 };
@@ -76,6 +77,7 @@ static const int user_feature_bits[] = {
 VIRTIO_F_IOMMU_PLATFORM,
 VIRTIO_F_RING_PACKED,
 VIRTIO_F_RING_RESET,
+VIRTIO_F_IN_ORDER,
 VIRTIO_NET_F_RSS,
 VIRTIO_NET_F_HASH_REPORT,
 VIRTIO_NET_F_GUEST_USO4,
diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index ae26bc19a4..40e7630191 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -38,6 +38,7 @@ static const int kernel_feature_bits[] = {
 VIRTIO_RING_F_EVENT_IDX,
 VIRTIO_SCSI_F_HOTPLUG,
 VIRTIO_F_RING_RESET,
+VIRTIO_F_IN_ORDER,
 VHOST_INVALID_FEATURE_BIT
 };
 
diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index a63b1f4948..1d59951ab7 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -36,6 +36,7 @@ static const int user_feature_bits[] = {
 VIRTIO_RING_F_EVENT_IDX,
 VIRTIO_SCSI_F_HOTPLUG,
 VIRTIO_F_RING_RESET,
+VIRTIO_F_IN_ORDER,
 VHOST_INVALID_FEATURE_BIT
 };
 
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index cca2cd41be..9243dbb128 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -33,6 +33,7 @@ static const int user_feature_bits[] = {
 VIRTIO_F_RING_PACKED,
 VIRTIO_F_IOMMU_PLATFORM,
 VIRTIO_F_RING_RESET,
+VIRTIO_F_IN_ORDER,
 
 VHOST_INVALID_FEATURE_BIT
 };
diff --git a/hw/virtio/vhost-user-vsock.c b/hw/virtio/vhost-user-vsock.c
index 9431b9792c..cc7e4e47b4 100644
--- a/hw/virtio/vhost-user-vsock.c
+++ b/hw/virtio/vhost-user-vsock.c
@@ -21,6 +21,7 @@ static const int user_feature_bits[] = {
 VIRTIO_RING_F_INDIRECT_DESC,
 VIRTIO_RING_F_EVENT_IDX,
 VIRTIO_F_NOTIFY_ON_EMPTY,
+VIRTIO_F_IN_ORDER,
 VHOST_INVALID_FEATURE_BIT
 };
 
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 85e73dd6a7..ed3185acfa 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -62,6 +62,7 @@ const int vdpa_feature_bits[] = {
 VIRTIO_F_RING_PACKED,
 VIRTIO_F_RING_RESET,
 VIRTIO_F_VERSION_1,
+VIRTIO_F_IN_ORDER,
 VIRTIO_NET_F_CSUM,
 VIRTIO_NET_F_CTRL_GUEST_OFFLOADS,
 VIRTIO_NET_F_CTRL_MAC_ADDR,
-- 
2.39.3




[RFC 2/8] virtio: Create/destroy/reset VirtQueue In-Order hash table

2024-03-21 Thread Jonah Palmer
Define a GLib hash table (GHashTable) member in a device's VirtQueue
and add its creation, destruction, and reset functions appropriately.
Also define a function to handle the deallocation of InOrderVQElement
values whenever they're removed from the hash table or the hash table
is destroyed. This hash table is to be used when the device is using
the VIRTIO_F_IN_ORDER transport feature.

A VirtQueue's in-order hash table will take in a uint16_t key with a
InOrderVQElement value as its key-value pair.

The hash table will be used as a buffer mechanism for completed,
out-of-order VirtQueueElements until they can be used in the same order
in which they were made available to the device.

Signed-off-by: Jonah Palmer 
---
 hw/virtio/virtio.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index fb6b4ccd83..d2afeeb59a 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -152,6 +152,9 @@ struct VirtQueue
 EventNotifier host_notifier;
 bool host_notifier_enabled;
 QLIST_ENTRY(VirtQueue) node;
+
+/* In-Order */
+GHashTable *in_order_ht;
 };
 
 const char *virtio_device_names[] = {
@@ -2070,6 +2073,16 @@ static enum virtio_device_endian 
virtio_current_cpu_endian(void)
 }
 }
 
+/* 
+ * Called when an element is removed from the hash table
+ * or when the hash table is destroyed.
+ */
+static void free_in_order_vq_element(gpointer data)
+{
+InOrderVQElement *elem = (InOrderVQElement *)data;
+g_free(elem);
+}
+
 static void __virtio_queue_reset(VirtIODevice *vdev, uint32_t i)
 {
 vdev->vq[i].vring.desc = 0;
@@ -2087,6 +2100,9 @@ static void __virtio_queue_reset(VirtIODevice *vdev, 
uint32_t i)
 vdev->vq[i].notification = true;
 vdev->vq[i].vring.num = vdev->vq[i].vring.num_default;
 vdev->vq[i].inuse = 0;
+if (vdev->vq[i].in_order_ht != NULL) {
+g_hash_table_remove_all(vdev->vq[i].in_order_ht);
+}
 virtio_virtqueue_reset_region_cache(>vq[i]);
 }
 
@@ -2334,6 +2350,13 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int 
queue_size,
 vdev->vq[i].vring.align = VIRTIO_PCI_VRING_ALIGN;
 vdev->vq[i].handle_output = handle_output;
 vdev->vq[i].used_elems = g_new0(VirtQueueElement, queue_size);
+vdev->vq[i].in_order_ht = NULL;
+
+if (virtio_host_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
+vdev->vq[i].in_order_ht =
+g_hash_table_new_full(g_direct_hash, g_direct_equal, NULL,
+  free_in_order_vq_element);
+}
 
 return >vq[i];
 }
@@ -2345,6 +2368,10 @@ void virtio_delete_queue(VirtQueue *vq)
 vq->handle_output = NULL;
 g_free(vq->used_elems);
 vq->used_elems = NULL;
+if (virtio_host_has_feature(vq->vdev, VIRTIO_F_IN_ORDER)) {
+g_hash_table_destroy(vq->in_order_ht);
+vq->in_order_ht = NULL;
+}
 virtio_virtqueue_reset_region_cache(vq);
 }
 
-- 
2.39.3




[RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-03-21 Thread Jonah Palmer
The goal of these patches is to add support to a variety of virtio and
vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
indicates that all buffers are used by the device in the same order in
which they were made available by the driver.

These patches attempt to implement a generalized, non-device-specific
solution to support this feature.

The core feature behind this solution is a buffer mechanism in the form
of GLib's GHashTable. The decision behind using a hash table was to
leverage their ability for quick lookup, insertion, and removal
operations. Given that our keys are simply numbers of an ordered
sequence, a hash table seemed like the best choice for a buffer
mechanism.

-

The strategy behind this implementation is as follows:

We know that buffers that are popped from the available ring and enqueued
for further processing will always done in the same order in which they
were made available by the driver. Given this, we can note their order
by assigning the resulting VirtQueueElement a key. This key is a number
in a sequence that represents the order in which they were popped from
the available ring, relative to the other VirtQueueElements.

For example, given 3 "elements" that were popped from the available
ring, we assign a key value to them which represents their order (elem0
is popped first, then elem1, then lastly elem2):

 elem2   --  elem1   --  elem0   ---> Enqueue for processing
(key: 2)(key: 1)(key: 0)

Then these elements are enqueued for further processing by the host.

While most devices will return these completed elements in the same
order in which they were enqueued, some devices may not (e.g.
virtio-blk). To guarantee that these elements are put on the used ring
in the same order in which they were enqueued, we can use a buffering
mechanism that keeps track of the next expected sequence number of an
element.

In other words, if the completed element does not have a key value that
matches the next expected sequence number, then we know this element is
not in-order and we must stash it away in a hash table until an order
can be made. The element's key value is used as the key for placing it
in the hash table.

If the completed element has a key value that matches the next expected
sequence number, then we know this element is in-order and we can push
it on the used ring. Then we increment the next expected sequence number
and check if the hash table contains an element at this key location.

If so, we retrieve this element, push it to the used ring, delete the
key-value pair from the hash table, increment the next expected sequence
number, and check the hash table again for an element at this new key
location. This process is repeated until we're unable to find an element
in the hash table to continue the order.

So, for example, say the 3 elements we enqueued were completed in the
following order: elem1, elem2, elem0. The next expected sequence number
is 0:

exp-seq-num = 0:

 elem1   --> elem1.key == exp-seq-num ? --> No, stash it
(key: 1) |
 |
 v
   
   |key: 1 - elem1|
   
-
exp-seq-num = 0:

 elem2   --> elem2.key == exp-seq-num ? --> No, stash it
(key: 2) |
 |
 v
   
   |key: 1 - elem1|
   |--|
   |key: 2 - elem2|
   
-
exp-seq-num = 0:

 elem0   --> elem0.key == exp-seq-num ? --> Yes, push to used ring
(key: 0)

exp-seq-num = 1:

lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
 remove elem from table
 |
 v
   
   |key: 2 - elem2|
   

exp-seq-num = 2:

lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
 remove elem from table
 |
 v
   
  

[RFC 1/8] virtio: Define InOrderVQElement

2024-03-21 Thread Jonah Palmer
Define the InOrderVQElement structure for the VIRTIO_F_IN_ORDER
transport feature implementation.

The InOrderVQElement structure is used to encapsulate out-of-order
VirtQueueElement data that was processed by the host. This data
includes:
 - The processed VirtQueueElement (elem)
 - Length of data (len)
 - VirtQueueElement array index (idx)
 - Number of processed VirtQueueElements (count)

InOrderVQElements will be stored in a buffering mechanism until an
order can be achieved.

Signed-off-by: Jonah Palmer 
---
 include/hw/virtio/virtio.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b3c74a1bca..c8aa435a5e 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -77,6 +77,13 @@ typedef struct VirtQueueElement
 struct iovec *out_sg;
 } VirtQueueElement;
 
+typedef struct InOrderVQElement {
+const VirtQueueElement *elem;
+unsigned int len;
+unsigned int idx;
+unsigned int count;
+} InOrderVQElement;
+
 #define VIRTIO_QUEUE_MAX 1024
 
 #define VIRTIO_NO_VECTOR 0x
-- 
2.39.3




[PATCH v2 0/3] fix two edge cases related to stream block jobs

2024-03-21 Thread Fiona Ebner
Changes in v2:
* Ran into another issue while writing the IO test Stefan wanted
  to have (good call :)), so include a fix for that and add the
  test. I didn't notice during manual testing, because I hadn't
  used a scripted QMP 'quit', so there was no race.

Fiona Ebner (2):
  block-backend: fix edge case in bdrv_next() where BDS associated to BB
changes
  iotests: add test for stream job with an unaligned prefetch read

Stefan Reiter (1):
  block/io: accept NULL qiov in bdrv_pad_request

 block/block-backend.c |  7 +-
 block/io.c| 31 ---
 .../tests/stream-unaligned-prefetch   | 86 +++
 .../tests/stream-unaligned-prefetch.out   |  5 ++
 4 files changed, 113 insertions(+), 16 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

-- 
2.39.2





[PATCH v2 3/3] iotests: add test for stream job with an unaligned prefetch read

2024-03-21 Thread Fiona Ebner
Previously, bdrv_pad_request() could not deal with a NULL qiov when
a read needed to be aligned. During prefetch, a stream job will pass a
NULL qiov. Add a test case to cover this scenario.

By accident, also covers a previous race during shutdown, where block
graph changes during iteration in bdrv_flush_all() could lead to
unreferencing the wrong block driver state and an assertion failure
later.

Signed-off-by: Fiona Ebner 
---

New in v2.

 .../tests/stream-unaligned-prefetch   | 86 +++
 .../tests/stream-unaligned-prefetch.out   |  5 ++
 2 files changed, 91 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch 
b/tests/qemu-iotests/tests/stream-unaligned-prefetch
new file mode 100755
index 00..546db1d369
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when a stream job does an unaligned prefetch read
+# which requires padding while having a NULL qiov.
+#
+# Copyright (C) Proxmox Server Solutions GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import os
+import iotests
+from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase
+
+image_size = 1 * 1024 * 1024
+cluster_size = 64 * 1024
+base = os.path.join(iotests.test_dir, 'base.img')
+top = os.path.join(iotests.test_dir, 'top.img')
+
+class TestStreamUnalignedPrefetch(QMPTestCase):
+def setUp(self) -> None:
+"""
+Create two images:
+- base image {base} with {cluster_size // 2} bytes allocated
+- top image {top} without any data allocated and coarser
+  cluster size
+
+Attach a compress filter for the top image, because that
+requires that the request alignment is the top image's cluster
+size.
+"""
+qemu_img_create('-f', imgfmt,
+'-o', 'cluster_size={}'.format(cluster_size // 2),
+base, str(image_size))
+qemu_io('-c', f'write 0 {cluster_size // 2}', base)
+qemu_img_create('-f', imgfmt,
+'-o', 'cluster_size={}'.format(cluster_size),
+top, str(image_size))
+
+self.vm = iotests.VM()
+self.vm.add_blockdev(self.vm.qmp_to_opts({
+'driver': imgfmt,
+'node-name': 'base',
+'file': {
+'driver': 'file',
+'filename': base
+}
+}))
+self.vm.add_blockdev(self.vm.qmp_to_opts({
+'driver': 'compress',
+'node-name': 'compress-top',
+'file': {
+'driver': imgfmt,
+'node-name': 'top',
+'file': {
+'driver': 'file',
+'filename': top
+},
+'backing': 'base'
+}
+}))
+self.vm.launch()
+
+def tearDown(self) -> None:
+self.vm.shutdown()
+os.remove(top)
+os.remove(base)
+
+def test_stream_unaligned_prefetch(self) -> None:
+self.vm.cmd('block-stream', job_id='stream', device='compress-top')
+
+
+if __name__ == '__main__':
+iotests.main(supported_fmts=['qcow2'], supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out 
b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
new file mode 100644
index 00..ae1213e6f8
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
@@ -0,0 +1,5 @@
+.
+--
+Ran 1 tests
+
+OK
-- 
2.39.2





[PATCH v2 1/3] block/io: accept NULL qiov in bdrv_pad_request

2024-03-21 Thread Fiona Ebner
From: Stefan Reiter 

Some operations, e.g. block-stream, perform reads while discarding the
results (only copy-on-read matters). In this case, they will pass NULL
as the target QEMUIOVector, which will however trip bdrv_pad_request,
since it wants to extend its passed vector. In particular, this is the
case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes
and offset still need to be updated, so the subsequent aligned read
will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top
node is larger than the allocated part of the bottom node, in which
case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev 
> qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> < {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": 
> "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": 
> "node1" } }
> EOF

Originally-by: Stefan Reiter 
Signed-off-by: Thomas Lamprecht 
[FE: do update bytes and offset in any case
 add reproducer to commit message]
Signed-off-by: Fiona Ebner 
---

No changes in v2.

 block/io.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
 return 0;
 }
 
-sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-  _head, _tail,
-  _niov);
+/*
+ * For prefetching in stream_populate(), no qiov is passed along, because
+ * only copy-on-read matters.
+ */
+if (qiov && *qiov) {
+sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+  _head, _tail,
+  _niov);
 
-/* Guaranteed by bdrv_check_request32() */
-assert(*bytes <= SIZE_MAX);
-ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-  sliced_head, *bytes);
-if (ret < 0) {
-bdrv_padding_finalize(pad);
-return ret;
+/* Guaranteed by bdrv_check_request32() */
+assert(*bytes <= SIZE_MAX);
+ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+  sliced_head, *bytes);
+if (ret < 0) {
+bdrv_padding_finalize(pad);
+return ret;
+}
+*qiov = >local_qiov;
+*qiov_offset = 0;
 }
+
 *bytes += pad->head + pad->tail;
 *offset -= pad->head;
-*qiov = >local_qiov;
-*qiov_offset = 0;
 if (padded) {
 *padded = true;
 }
-- 
2.39.2





[PATCH v2 2/3] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-21 Thread Fiona Ebner
The old_bs variable in bdrv_next() is currently determined by looking
at the old block backend. However, if the block graph changes before
the next bdrv_next() call, it might be that the associated BDS is not
the same that was referenced previously. In that case, the wrong BDS
is unreferenced, leading to an assertion failure later:

> bdrv_unref: Assertion `bs->refcnt > 0' failed.

In particular, this can happen in the context of bdrv_flush_all(),
when polling for bdrv_co_flush() in the generated co-wrapper leads to
a graph change (for example with a stream block job [0]).

A racy reproducer:

> #!/bin/bash
> rm -f /tmp/backing.qcow2
> rm -f /tmp/top.qcow2
> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev 
> qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> < {"execute": "qmp_capabilities"}
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": 
> "node0" } }
> {"execute": "quit"}
> EOF

[0]:

> #0  bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> #1  bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., 
> errp=...)
> #2  bdrv_replace_node_common (from=..., to=..., auto_skip=..., 
> detach_subchain=..., errp=...)
> #3  bdrv_drop_filter (bs=..., errp=...)
> #4  bdrv_cor_filter_drop (cor_filter_bs=...)
> #5  stream_prepare (job=...)
> #6  job_prepare_locked (job=...)
> #7  job_txn_apply_locked (fn=..., job=...)
> #8  job_do_finalize_locked (job=...)
> #9  job_exit (opaque=...)
> #10 aio_bh_poll (ctx=...)
> #11 aio_poll (ctx=..., blocking=...)
> #12 bdrv_poll_co (s=...)
> #13 bdrv_flush (bs=...)
> #14 bdrv_flush_all ()
> #15 do_vm_stop (state=..., send_stop=...)
> #16 vm_shutdown ()

Signed-off-by: Fiona Ebner 
---

Not sure if this is the correct fix, or if the call site should rather
be adapted somehow?

New in v2.

 block/block-backend.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9c4de79e6b..28af1eb17a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
 /* Must be called from the main loop */
 assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
+old_bs = it->bs;
+
 /* First, return all root nodes of BlockBackends. In order to avoid
  * returning a BDS twice when multiple BBs refer to it, we only return it
  * if the BB is the first one in the parent list of the BDS. */
 if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
 BlockBackend *old_blk = it->blk;
 
-old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
 do {
 it->blk = blk_all_next(it->blk);
 bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
 if (bs) {
 bdrv_ref(bs);
 bdrv_unref(old_bs);
+it->bs = bs;
 return bs;
 }
 it->phase = BDRV_NEXT_MONITOR_OWNED;
-} else {
-old_bs = it->bs;
 }
 
 /* Then return the monitor-owned BDSes without a BB attached. Ignore all
-- 
2.39.2