Re: [PATCH for-9.0?] usb-storage: Fix BlockConf defaults

2024-04-16 Thread Hanna Czenczek

On 12.04.24 16:42, Kevin Wolf wrote:

Commit 30896374 started to pass the full BlockConf from usb-storage to
scsi-disk, while previously only a few select properties would be
forwarded. This enables the user to set more properties, e.g. the block
size, and have them actually take effect.

However, now the calls to blkconf_apply_backend_options() and
blkconf_blocksizes() in usb_msd_storage_realize() that modify some of
these properties take effect, too, instead of being silently ignored.
This means at least that the block sizes get an unconditional default of
512 bytes before the configuration is passed to scsi-disk.

Before commit 30896374, the property wouldn't be set for scsi-disk and
therefore the device dependent defaults would apply - 512 for scsi-hd,
but 2048 for scsi-cd. The latter default has now become 512, too, which
makes at least Windows 11 installation fail when installing from
usb-storage.

Fix this by simply not calling these functions any more in usb-storage
and passing BlockConf on unmodified (except for the BlockBackend). The
same functions are called by the SCSI code anyway and it sets the right
defaults for the actual media type.

Fixes: 308963746169 ('scsi: Don't ignore most usb-storage properties')
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2260
Reported-by: Jonas Svensson
Signed-off-by: Kevin Wolf 
---
Considering this a candidate for 9.0 even though we already have an
rc4: it's a regression from 8.2 and it breaks installing Windows from USB

  hw/usb/dev-storage-classic.c | 9 -
  1 file changed, 9 deletions(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH 0/2] block: Allow concurrent BB context changes

2024-02-12 Thread Hanna Czenczek

On 10.02.24 09:46, Michael Tokarev wrote:

09.02.2024 19:51, Hanna Czenczek :

On 09.02.24 15:08, Michael Tokarev wrote:

02.02.2024 17:47, Hanna Czenczek :

Hi,

Without the AioContext lock, a BB's context may kind of change at any
time (unless it has a root node, and I/O requests are pending). That
also means that its own context (BlockBackend.ctx) and that of its root
node can differ sometimes (while the context is being changed).


How relevant is this for -stable (8.2 at least), which does not have
the "scsi: eliminate AioContext lock" patchset, and in particular
v8.2.0-124-geaad0fe260 "scsi: only access SCSIDevice->requests from
one thread"?

The issue the first patch "block-backend: Allow concurrent context
changes" fixes (RHEL-19381) seems to date back to 8.1.something, so it
exists in 8.2 too, and this particular fix applies to 8.2.

But with all the other changes around this, I'm a bit lost as to what
should be done on stable.  Not even thinking about 7.2 here :)


Ah, sorry, yes.  Since we do still have the AioContext lock, this 
series won’t be necessary in -stable.  Sorry for the noise!


Hm. Now I'm confused even more.. :)

ad89367202 "block-backend: Allow concurrent context changes" - the first
one in this series - apparently is needed, as it fixes an issue reported
for qemu 8.1 (https://issues.redhat.com/browse/RHEL-19381).  Or is it not
the case?


Ah, yes, I got confused there.  There are two (unfortunately? 
fortunately? Red-Hat-internal) comments, one of which describes the 
crash that’s fixed here, so I thought that bug described this crash.  
But the actual description in the report describes something different 
(more like what’s fixed by 
https://lists.nongnu.org/archive/html/qemu-devel/2024-01/msg03649.html, 
but I’m not entirely sure yet).


So basically I got the bug link wrong.  We now have 
https://issues.redhat.com/browse/RHEL-24593, which has been reported 
only against 8.2.


Hanna


FWIW, truth is born in the noise, not in silence ;)

Thanks,

/mjt






Re: [PATCH v2 3/3] virtio-blk: Use ioeventfd_attach in start_ioeventfd

2024-02-09 Thread Hanna Czenczek

On 09.02.24 15:38, Michael Tokarev wrote:

02.02.2024 18:31, Hanna Czenczek :

Commit d3f6f294aeadd5f88caf0155e4360808c95b3146 ("virtio-blk: always set
ioeventfd during startup") has made virtio_blk_start_ioeventfd() always
kick the virtqueue (set the ioeventfd), regardless of whether the BB is
drained.  That is no longer necessary, because attaching the host
notifier will now set the ioeventfd, too; this happens either
immediately right here in virtio_blk_start_ioeventfd(), or later when
the drain ends, in virtio_blk_ioeventfd_attach().

With event_notifier_set() removed, the code becomes the same as the one
in virtio_blk_ioeventfd_attach(), so we can reuse that function.


The mentioned commit is v8.2.0-812-gd3f6f294ae, i.e. past 8.2.
Is this new change still relevant for stable?


Sorry again. :/  This patch is a clean-up patch that won’t apply to 
8.2.  Now, 8.2 does have basically the same logic as described in the 
patch message (d3f6f294aea restored it after it was broken), so a 
similar patch could be made for it (removing the event_notifier_set() 
from virtio_blk_data_plane_start()), but whether we kick the virtqueues 
once or twice on start-up probably won’t make a difference, certainly 
not in terms of correctness.


Hanna




Re: [PATCH 0/2] block: Allow concurrent BB context changes

2024-02-09 Thread Hanna Czenczek

On 09.02.24 15:08, Michael Tokarev wrote:

02.02.2024 17:47, Hanna Czenczek :

Hi,

Without the AioContext lock, a BB's context may kind of change at any
time (unless it has a root node, and I/O requests are pending). That
also means that its own context (BlockBackend.ctx) and that of its root
node can differ sometimes (while the context is being changed).


How relevant is this for -stable (8.2 at least), which does not have
the "scsi: eliminate AioContext lock" patchset, and in particular
v8.2.0-124-geaad0fe260 "scsi: only access SCSIDevice->requests from
one thread"?

The issue the first patch "block-backend: Allow concurrent context
changes" fixes (RHEL-19381) seems to date back to 8.1.something, so it
exists in 8.2 too, and this particular fix applies to 8.2.

But with all the other changes around this, I'm a bit lost as to what
should be done on stable.  Not even thinking about 7.2 here :)


Ah, sorry, yes.  Since we do still have the AioContext lock, this series 
won’t be necessary in -stable.  Sorry for the noise!


Hanna




Re: [PATCH 0/2] block: Allow concurrent BB context changes

2024-02-07 Thread Hanna Czenczek

On 06.02.24 17:53, Stefan Hajnoczi wrote:

On Fri, Feb 02, 2024 at 03:47:53PM +0100, Hanna Czenczek wrote:

Hi,

Without the AioContext lock, a BB's context may kind of change at any
time (unless it has a root node, and I/O requests are pending).  That
also means that its own context (BlockBackend.ctx) and that of its root
node can differ sometimes (while the context is being changed).

blk_get_aio_context() doesn't know this yet and asserts that both are
always equal (if there is a root node).  Because it's no longer true,
and because callers don't seem to really care about the root node's
context, we can and should remove the assertion and just return the BB's
context.

Beyond that, the question is whether the callers of
blk_get_aio_context() are OK with the context potentially changing
concurrently.  Honestly, it isn't entirely clear to me; most look OK,
except for the virtio-scsi code, which operates under the general
assumption that the BB's context is always equal to that of the
virtio-scsi device.  I doubt that this assumption always holds (it is
definitely not obvious to me that it would), but then again, this series
will not make matters worse in that regard, and that is what counts for
me now.

One clear point of contention is scsi_device_for_each_req_async(), which
is addressed by patch 2.  Right now, it schedules a BH in the BB
context, then the BH double-checks whether the context still fits, and
if not, re-schedules itself.  Because virtio-scsi's context is fixed,
this seems to indicate to me that it wants to be able to deal with a
case where BB and virtio-scsi context differ, which seems to break that
aforementioned general virtio-scsi assumption.

I don't agree with the last sentence: virtio-scsi's context isn't fixed.

The AioContext changes when dataplane is started/stopped. virtio-scsi
switches AioContext between the IOThread's AioContext and the main
loop's qemu_aio_context.

However, virtio-scsi virtqueue processing only happens in the IOThread's
AioContext. Maybe this is what you meant when you said the AioContext is
fixed?


Specifically, I meant VirtIOSCSI.ctx, which is set only once in 
virtio_scsi_dataplane_setup().  That’s at least where the virtqueue 
notifiers are registered, so yes, virtqueue processing should at least 
be fixed to that context.  It seems like it’s always possible some 
things are processed in the main thread (not just setup/teardown, but 
also e.g. TMF_LOGICAL_UNIT_RESET), so to me it seems like virtio-scsi 
kind of runs in two contexts simultaneously. Yes, when virtqueue 
processing is paused, all processing in VirtIOSCSI.ctx is stopped, but I 
wouldn’t say it switches contexts there.  It just stops processing some 
requests.


Either way, virtio-scsi request processing doesn’t stop just because a 
scsi-hd device is hot-plugged or -unplugged.  If the BB changes contexts 
in the hot-unplug path (while vq request processing is continuing in the 
I/O thread), its context will differ from that of virtio-scsi.


So should I just drop the “the context is fixed” wording and say that in this 
specific instance, virtio-scsi vq processing continues in the I/O thread?



The BH function is aware that the current AioContext might not be the
same as the AioContext at the time the BH was scheduled. That doesn't
break assumptions in the code.

(It may be possible to rewrite virtio-blk, virtio-scsi, and core
VirtIODevice ioeventfd code to use the simpler model where the
AioContext really is fixed because things have changed significantly
over the years, but I looked a few weeks ago and it's difficult work.)

I'm just pointing out that I think this description is incomplete. I
*do* agree with what this patch series is doing :).


Well, this description won’t land in any commit log, so from my side, 
I’m not too worried about its correctness. O:)


Hanna


Unfortunately, I definitely have to touch that code, because accepting
concurrent changes of AioContexts breaks the double-check (just because
the BB has the right context in that place does not mean it stays in
that context); instead, we must prevent any concurrent change until the
BH is done.  Because changing contexts generally involves a drained
section, we can prevent it by keeping the BB in-flight counter elevated.

The question is how to justify that.  I’d really rather not include the
need to follow the BB context in my argument, because I find that part a
bit fishy.

Luckily, there’s a second, completely different reason for having
scsi_device_for_each_req_async() increment the in-flight counter:
Specifically, scsi_device_purge_requests() probably wants to await full
completion of scsi_device_for_each_req_async(), and we can do that most
easily in the very same way by incrementing the in-flight counter.  This
way, the blk_drain() in scsi_device_purge_requests() will not only await
all (cancelled) I/O requests, but also the non-I/O requests.

The fact that this prevents the BB AioContext from changing while the BH

Re: [PATCH] virtio-blk: do not use C99 mixed declarations

2024-02-06 Thread Hanna Czenczek

On 06.02.24 15:04, Stefan Hajnoczi wrote:

QEMU's coding style generally forbids C99 mixed declarations.

Signed-off-by: Stefan Hajnoczi 
---
  hw/block/virtio-blk.c | 25 ++---
  1 file changed, 14 insertions(+), 11 deletions(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH 5/5] monitor: use aio_co_reschedule_self()

2024-02-06 Thread Hanna Czenczek

On 05.02.24 18:26, Stefan Hajnoczi wrote:

The aio_co_reschedule_self() API is designed to avoid the race
condition between scheduling the coroutine in another AioContext and
yielding.

The QMP dispatch code uses the open-coded version that appears
susceptible to the race condition at first glance:

   aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
   qemu_coroutine_yield();

The code is actually safe because the iohandler and qemu_aio_context
AioContext run under the Big QEMU Lock. Nevertheless, set a good example
and use aio_co_reschedule_self() so it's obvious that there is no race.

Suggested-by: Hanna Reitz 
Signed-off-by: Stefan Hajnoczi 
---
  qapi/qmp-dispatch.c | 7 ++-
  1 file changed, 2 insertions(+), 5 deletions(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH 4/5] virtio-blk: declare VirtIOBlock::rq with a type

2024-02-06 Thread Hanna Czenczek

On 05.02.24 18:26, Stefan Hajnoczi wrote:

The VirtIOBlock::rq field has had the type void * since its introduction
in commit 869a5c6df19a ("Stop VM on error in virtio-blk. (Gleb
Natapov)").

Perhaps this was done to avoid the forward declaration of
VirtIOBlockReq.

Hanna Czenczek  pointed out the missing type. Specify
the actual type because there is no need to use void * here.

Suggested-by: Hanna Czenczek 
Signed-off-by: Stefan Hajnoczi 
---
  include/hw/virtio/virtio-blk.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH 3/5] virtio-blk: add vq_rq[] bounds check in virtio_blk_dma_restart_cb()

2024-02-06 Thread Hanna Czenczek

On 05.02.24 18:26, Stefan Hajnoczi wrote:

Hanna Czenczek  noted that the array index in
virtio_blk_dma_restart_cb() is not bounds-checked:

   g_autofree VirtIOBlockReq **vq_rq = g_new0(VirtIOBlockReq *, num_queues);
   ...
   while (rq) {
   VirtIOBlockReq *next = rq->next;
   uint16_t idx = virtio_get_queue_index(rq->vq);

   rq->next = vq_rq[idx];
  ^^

The code is correct because both rq->vq and vq_rq[] depend on
num_queues, but this is indirect and not 100% obvious. Add an assertion.

Suggested-by: Hanna Czenczek 
Signed-off-by: Stefan Hajnoczi 
---
  hw/block/virtio-blk.c | 1 +
  1 file changed, 1 insertion(+)


Reviewed-by: Hanna Czenczek 




Re: [PATCH 2/5] virtio-blk: clarify that there is at least 1 virtqueue

2024-02-06 Thread Hanna Czenczek

On 05.02.24 18:26, Stefan Hajnoczi wrote:

It is not possible to instantiate a virtio-blk device with 0 virtqueues.
The following check is located in ->realize():

   if (!conf->num_queues) {
   error_setg(errp, "num-queues property must be larger than 0");
   return;
   }

Later on we access s->vq_aio_context[0] under the assumption that there
is at least one virtqueue. Hanna Czenczek  noted that
it would help to show that the array index is already valid.

Add an assertion to document that s->vq_aio_context[0] is always
safe...and catch future code changes that break this assumption.

Suggested-by: Hanna Czenczek
Signed-off-by: Stefan Hajnoczi
---
  hw/block/virtio-blk.c | 1 +
  1 file changed, 1 insertion(+)


Reviewed-by: Hanna Czenczek 

Re: [PATCH 1/5] virtio-blk: enforce iothread-vq-mapping validation

2024-02-06 Thread Hanna Czenczek

On 05.02.24 18:26, Stefan Hajnoczi wrote:

Hanna Czenczek  noticed that the safety of
`vq_aio_context[vq->value] = ctx;` with user-defined vq->value inputs is
not obvious.

The code is structured in validate() + apply() steps so input validation
is there, but it happens way earlier and there is nothing that
guarantees apply() can only be called with validated inputs.

This patch moves the validate() call inside the apply() function so
validation is guaranteed. I also added the bounds checking assertion
that Hanna suggested.

Signed-off-by: Stefan Hajnoczi 
---
  hw/block/virtio-blk.c | 192 +++---
  1 file changed, 107 insertions(+), 85 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 227d83569f..e8b37fd5f4 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c


[...]


@@ -1660,6 +1681,14 @@ static bool virtio_blk_vq_aio_context_init(VirtIOBlock *s, Error **errp)
  BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
  VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
  
+    if (conf->iothread && conf->iothread_vq_mapping_list) {

+        if (conf->iothread) {


This inner condition should probably be dropped.  With that done:

Reviewed-by: Hanna Czenczek 


+            error_setg(errp, "iothread and iothread-vq-mapping properties "
+                             "cannot be set at the same time");
+            return false;
+        }
+    }
+
  if (conf->iothread || conf->iothread_vq_mapping_list) {
  if (!k->set_guest_notifiers || !k->ioeventfd_assign) {
  error_setg(errp,





[PATCH v2 2/3] virtio: Re-enable notifications after drain

2024-02-02 Thread Hanna Czenczek
During drain, we do not care about virtqueue notifications, which is why
we remove the handlers on it.  When removing those handlers, whether vq
notifications are enabled or not depends on whether we were in polling
mode or not; if not, they are enabled (by default); if so, they have
been disabled by the io_poll_start callback.

Because we do not care about those notifications after removing the
handlers, this is fine.  However, we have to explicitly ensure they are
enabled when re-attaching the handlers, so we will resume receiving
notifications.  We do this in virtio_queue_aio_attach_host_notifier*().
If such a function is called while we are in a polling section,
attaching the notifiers will then invoke the io_poll_start callback,
re-disabling notifications.

Because we will always miss virtqueue updates in the drained section, we
also need to poll the virtqueue once after attaching the notifiers.

Buglink: https://issues.redhat.com/browse/RHEL-3934
Signed-off-by: Hanna Czenczek 
---
 include/block/aio.h |  7 ++-
 hw/virtio/virtio.c  | 42 ++
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 5d0a114988..8378553eb9 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -480,9 +480,14 @@ void aio_set_event_notifier(AioContext *ctx,
 AioPollFn *io_poll,
 EventNotifierHandler *io_poll_ready);
 
-/* Set polling begin/end callbacks for an event notifier that has already been
+/*
+ * Set polling begin/end callbacks for an event notifier that has already been
  * registered with aio_set_event_notifier.  Do nothing if the event notifier is
  * not registered.
+ *
+ * Note that if the io_poll_end() callback (or the entire notifier) is removed
+ * during polling, it will not be called, so an io_poll_begin() is not
+ * necessarily always followed by an io_poll_end().
  */
 void aio_set_event_notifier_poll(AioContext *ctx,
  EventNotifier *notifier,
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 7549094154..d229755eae 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3556,6 +3556,17 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
 
 void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
+/*
+ * virtio_queue_aio_detach_host_notifier() can leave notifications disabled.
+ * Re-enable them.  (And if detach has not been used before, notifications
+ * being enabled is still the default state while a notifier is attached;
+ * see virtio_queue_host_notifier_aio_poll_end(), which will always leave
+ * notifications enabled once the polling section is left.)
+ */
+    if (!virtio_queue_get_notification(vq)) {
+        virtio_queue_set_notification(vq, 1);
+    }
+
     aio_set_event_notifier(ctx, &vq->host_notifier,
                            virtio_queue_host_notifier_read,
                            virtio_queue_host_notifier_aio_poll,
@@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
     aio_set_event_notifier_poll(ctx, &vq->host_notifier,
                                 virtio_queue_host_notifier_aio_poll_begin,
                                 virtio_queue_host_notifier_aio_poll_end);
+
+    /*
+     * We will have ignored notifications about new requests from the guest
+     * while no notifiers were attached, so "kick" the virt queue to process
+     * those requests now.
+     */
+    event_notifier_set(&vq->host_notifier);
 }
 
 /*
@@ -3573,14 +3591,38 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
  */
void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx)
 {
+    /* See virtio_queue_aio_attach_host_notifier() */
+    if (!virtio_queue_get_notification(vq)) {
+        virtio_queue_set_notification(vq, 1);
+    }
+
     aio_set_event_notifier(ctx, &vq->host_notifier,
                            virtio_queue_host_notifier_read,
                            NULL, NULL);
+
+    /*
+     * See virtio_queue_aio_attach_host_notifier().
+     * Note that this may be unnecessary for the type of virtqueues this
+     * function is used for.  Still, it will not hurt to have a quick look into
+     * whether we can/should process any of the virtqueue elements.
+     */
+    event_notifier_set(&vq->host_notifier);
 }
 
 void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
    aio_set_event_notifier(ctx, &vq->host_notifier, NULL, NULL, NULL);
+
+    /*
+     * aio_set_event_notifier_poll() does not guarantee whether io_poll_end()
+     * will run after io_poll_begin(), so by removing the notifier, we do not
+     * know whether virtio_queue_host_notifier_aio_poll_end() has run after a
+     * previous virtio_queue_host_notifier_aio_poll_begin(), i.e. whether
+     * notifications are e

[PATCH v2 0/3] virtio: Re-enable notifications after drain

2024-02-02 Thread Hanna Czenczek
v1:

https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00336.html


Hi,

This is basically the same series as v1: When using
aio_set_event_notifier_poll(), the io_poll_end() callback is only
invoked when polling ends, not when the notifier is being removed while
in a polling section.  This can leave the virtqueue notifier disabled
during drained sections, which however is not a bad thing.  We just need
to ensure they are re-enabled after the drain, and kick the virtqueue
once to pick up all the requests that came in during the drained
section.

Patch 1 is a technically unrelated fix, but addresses a problem that
became visible with patch 2 applied.

Patch 3 is a small (optional) clean-up patch.


v2:
- Changed the title of this series and patch 2 (was: "Keep notifications
  disabled during drain"): Keeping the notifier disabled was something
  the initial RFC did, this version (v1 too) just ensures the notifier
  is enabled after the drain, regardless of its state before.

- Use event_notifier_set() instead of virtio_queue_notify() in patch 2

- Added patch 3


Hanna Czenczek (3):
  virtio-scsi: Attach event vq notifier with no_poll
  virtio: Re-enable notifications after drain
  virtio-blk: Use ioeventfd_attach in start_ioeventfd

 include/block/aio.h   |  7 ++-
 hw/block/virtio-blk.c | 21 ++---
 hw/scsi/virtio-scsi.c |  7 ++-
 hw/virtio/virtio.c| 42 ++
 4 files changed, 64 insertions(+), 13 deletions(-)

-- 
2.43.0




[PATCH v2 1/3] virtio-scsi: Attach event vq notifier with no_poll

2024-02-02 Thread Hanna Czenczek
As of commit 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi:
don't waste CPU polling the event virtqueue"), we only attach an io_read
notifier for the virtio-scsi event virtqueue instead, and no polling
notifiers.  During operation, the event virtqueue is typically
non-empty, but none of the buffers are intended to be used immediately.
Instead, they only get used when certain events occur.  Therefore, it
makes no sense to continuously poll it when non-empty, because it is
supposed to be and stay non-empty.

We do this by using virtio_queue_aio_attach_host_notifier_no_poll()
instead of virtio_queue_aio_attach_host_notifier() for the event
virtqueue.

Commit 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement
BlockDevOps->drained_begin()") however has virtio_scsi_drained_end() use
virtio_queue_aio_attach_host_notifier() for all virtqueues, including
the event virtqueue.  This can lead to it being polled again, undoing
the benefit of commit 38738f7dbbda90fbc161757b7f4be35b52205552.

Fix it by using virtio_queue_aio_attach_host_notifier_no_poll() for the
event virtqueue.

Reported-by: Fiona Ebner 
Fixes: 766aa2de0f29b657148e04599320d771c36fd126
   ("virtio-scsi: implement BlockDevOps->drained_begin()")
Reviewed-by: Stefan Hajnoczi 
Tested-by: Fiona Ebner 
Reviewed-by: Fiona Ebner 
Signed-off-by: Hanna Czenczek 
---
 hw/scsi/virtio-scsi.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 690aceec45..9f02ceea09 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -1149,6 +1149,7 @@ static void virtio_scsi_drained_begin(SCSIBus *bus)
 static void virtio_scsi_drained_end(SCSIBus *bus)
 {
     VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus);
+    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
     uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED +
                             s->parent_obj.conf.num_queues;
@@ -1166,7 +1167,11 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
 
     for (uint32_t i = 0; i < total_queues; i++) {
         VirtQueue *vq = virtio_get_queue(vdev, i);
-        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+        if (vq == vs->event_vq) {
+            virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx);
+        } else {
+            virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+        }
     }
 }
 
-- 
2.43.0




[PATCH v2 3/3] virtio-blk: Use ioeventfd_attach in start_ioeventfd

2024-02-02 Thread Hanna Czenczek
Commit d3f6f294aeadd5f88caf0155e4360808c95b3146 ("virtio-blk: always set
ioeventfd during startup") has made virtio_blk_start_ioeventfd() always
kick the virtqueue (set the ioeventfd), regardless of whether the BB is
drained.  That is no longer necessary, because attaching the host
notifier will now set the ioeventfd, too; this happens either
immediately right here in virtio_blk_start_ioeventfd(), or later when
the drain ends, in virtio_blk_ioeventfd_attach().

With event_notifier_set() removed, the code becomes the same as the one
in virtio_blk_ioeventfd_attach(), so we can reuse that function.

Signed-off-by: Hanna Czenczek 
---
 hw/block/virtio-blk.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 227d83569f..22b8eef69b 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -37,6 +37,8 @@
 #include "hw/virtio/virtio-blk-common.h"
 #include "qemu/coroutine.h"
 
+static void virtio_blk_ioeventfd_attach(VirtIOBlock *s);
+
 static void virtio_blk_init_request(VirtIOBlock *s, VirtQueue *vq,
 VirtIOBlockReq *req)
 {
@@ -1808,17 +1810,14 @@ static int virtio_blk_start_ioeventfd(VirtIODevice *vdev)
     s->ioeventfd_started = true;
     smp_wmb(); /* paired with aio_notify_accept() on the read side */
 
-    /* Get this show started by hooking up our callbacks */
-    for (i = 0; i < nvqs; i++) {
-        VirtQueue *vq = virtio_get_queue(vdev, i);
-        AioContext *ctx = s->vq_aio_context[i];
-
-        /* Kick right away to begin processing requests already in vring */
-        event_notifier_set(virtio_queue_get_host_notifier(vq));
-
-        if (!blk_in_drain(s->conf.conf.blk)) {
-            virtio_queue_aio_attach_host_notifier(vq, ctx);
-        }
+    /*
+     * Get this show started by hooking up our callbacks.  If drained now,
+     * virtio_blk_drained_end() will do this later.
+     * Attaching the notifier also kicks the virtqueues, processing any
+     * requests they may already have.
+     */
+    if (!blk_in_drain(s->conf.conf.blk)) {
+        virtio_blk_ioeventfd_attach(s);
     }
     return 0;
 
-- 
2.43.0




[PATCH 1/2] block-backend: Allow concurrent context changes

2024-02-02 Thread Hanna Czenczek
Since AioContext locks have been removed, a BlockBackend's AioContext
may really change at any time (only exception is that it is often
confined to a drained section, as noted in this patch).  Therefore,
blk_get_aio_context() cannot rely on its root node's context always
matching that of the BlockBackend.

In practice, whether they match does not matter anymore anyway: Requests
can be sent to BDSs from any context, so anyone who requests the BB's
context should have no reason to require the root node to have the same
context.  Therefore, we can and should remove the assertion to that
effect.

In addition, because the context can be set and queried from different
threads concurrently, it has to be accessed with atomic operations.

Buglink: https://issues.redhat.com/browse/RHEL-19381
Suggested-by: Kevin Wolf 
Signed-off-by: Hanna Czenczek 
---
 block/block-backend.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 209eb07528..9c4de79e6b 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -44,7 +44,7 @@ struct BlockBackend {
 char *name;
 int refcnt;
 BdrvChild *root;
-AioContext *ctx;
+AioContext *ctx; /* access with atomic operations only */
 DriveInfo *legacy_dinfo;/* null unless created by drive_new() */
 QTAILQ_ENTRY(BlockBackend) link; /* for block_backends */
 QTAILQ_ENTRY(BlockBackend) monitor_link; /* for monitor_block_backends */
@@ -2414,22 +2414,22 @@ void blk_op_unblock_all(BlockBackend *blk, Error *reason)
 }
 }
 
+/**
+ * Return BB's current AioContext.  Note that this context may change
+ * concurrently at any time, with one exception: If the BB has a root node
+ * attached, its context will only change through bdrv_try_change_aio_context(),
+ * which creates a drained section.  Therefore, incrementing such a BB's
+ * in-flight counter will prevent its context from changing.
+ */
 AioContext *blk_get_aio_context(BlockBackend *blk)
 {
-    BlockDriverState *bs;
     IO_CODE();
 
     if (!blk) {
         return qemu_get_aio_context();
     }
 
-    bs = blk_bs(blk);
-    if (bs) {
-        AioContext *ctx = bdrv_get_aio_context(blk_bs(blk));
-        assert(ctx == blk->ctx);
-    }
-
-    return blk->ctx;
+    return qatomic_read(&blk->ctx);
 }
 
 int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
@@ -2442,7 +2442,7 @@ int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
     GLOBAL_STATE_CODE();
 
     if (!bs) {
-        blk->ctx = new_context;
+        qatomic_set(&blk->ctx, new_context);
         return 0;
     }
 
@@ -2471,7 +2471,7 @@ static void blk_root_set_aio_ctx_commit(void *opaque)
     AioContext *new_context = s->new_ctx;
     ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
 
-    blk->ctx = new_context;
+    qatomic_set(&blk->ctx, new_context);
     if (tgm->throttle_state) {
         throttle_group_detach_aio_context(tgm);
         throttle_group_attach_aio_context(tgm, new_context);
-- 
2.43.0




[PATCH 2/2] scsi: Await request purging

2024-02-02 Thread Hanna Czenczek
scsi_device_for_each_req_async() currently does not provide any way to
be awaited.  One of its callers is scsi_device_purge_requests(), which
therefore currently does not guarantee that all requests are fully
settled when it returns.

We want all requests to be settled, because scsi_device_purge_requests()
is called through the unrealize path, including the one invoked by
virtio_scsi_hotunplug() through qdev_simple_device_unplug_cb(), which
most likely assumes that all SCSI requests are done then.

In fact, scsi_device_purge_requests() already contains a blk_drain(),
but this will not fully await scsi_device_for_each_req_async(), only the
I/O requests it potentially cancels (not the non-I/O requests).
However, we can have scsi_device_for_each_req_async() increment the BB
in-flight counter, and have scsi_device_for_each_req_async_bh()
decrement it when it is done.  This way, the blk_drain() will fully
await all SCSI requests to be purged.

This also removes the need for scsi_device_for_each_req_async_bh() to
double-check the current context and potentially re-schedule itself,
should it now differ from the BB's context: Changing a BB's AioContext
with a root node is done through bdrv_try_change_aio_context(), which
creates a drained section.  With this patch, we keep the BB in-flight
counter elevated throughout, so we know the BB's context cannot change.

Signed-off-by: Hanna Czenczek 
---
 hw/scsi/scsi-bus.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 0a2eb11c56..230313022c 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -120,17 +120,13 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
 SCSIRequest *next;
 
 /*
- * If the AioContext changed before this BH was called then reschedule into
- * the new AioContext before accessing ->requests. This can happen when
- * scsi_device_for_each_req_async() is called and then the AioContext is
- * changed before BHs are run.
+ * The BB cannot have changed contexts between this BH being scheduled and
+ * now: BBs' AioContexts, when they have a node attached, can only be
+ * changed via bdrv_try_change_aio_context(), in a drained section.  While
+ * we have the in-flight counter incremented, that drain must block.
  */
 ctx = blk_get_aio_context(s->conf.blk);
-if (ctx != qemu_get_current_aio_context()) {
-aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
-g_steal_pointer(&data));
-return;
-}
+assert(ctx == qemu_get_current_aio_context());
 
 QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
 data->fn(req, data->fn_opaque);
@@ -138,11 +134,16 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
 
 /* Drop the reference taken by scsi_device_for_each_req_async() */
 object_unref(OBJECT(s));
+
+/* Paired with blk_inc_in_flight() in scsi_device_for_each_req_async() */
+blk_dec_in_flight(s->conf.blk);
 }
 
 /*
  * Schedule @fn() to be invoked for each enqueued request in device @s. @fn()
  * runs in the AioContext that is executing the request.
+ * Keeps the BlockBackend's in-flight counter incremented until everything is
+ * done, so draining it will settle all scheduled @fn() calls.
  */
 static void scsi_device_for_each_req_async(SCSIDevice *s,
void (*fn)(SCSIRequest *, void *),
@@ -163,6 +164,8 @@ static void scsi_device_for_each_req_async(SCSIDevice *s,
  */
 object_ref(OBJECT(s));
 
+/* Paired with blk_dec_in_flight() in scsi_device_for_each_req_async_bh() */
+blk_inc_in_flight(s->conf.blk);
 aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
 scsi_device_for_each_req_async_bh,
 data);
@@ -1728,11 +1731,20 @@ static void scsi_device_purge_one_req(SCSIRequest *req, void *opaque)
 scsi_req_cancel_async(req, NULL);
 }
 
+/**
+ * Cancel all requests, and block until they are deleted.
+ */
 void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense)
 {
 scsi_device_for_each_req_async(sdev, scsi_device_purge_one_req, NULL);
 
+/*
+ * Await all the scsi_device_purge_one_req() calls scheduled by
+ * scsi_device_for_each_req_async(), and all I/O requests that were
+ * cancelled this way, but may still take a bit of time to settle.
+ */
 blk_drain(sdev->conf.blk);
+
 scsi_device_set_ua(sdev, sense);
 }
 
-- 
2.43.0




[PATCH 0/2] block: Allow concurrent BB context changes

2024-02-02 Thread Hanna Czenczek
Hi,

Without the AioContext lock, a BB's context may kind of change at any
time (unless it has a root node, and I/O requests are pending).  That
also means that its own context (BlockBackend.ctx) and that of its root
node can differ sometimes (while the context is being changed).

blk_get_aio_context() doesn't know this yet and asserts that both are
always equal (if there is a root node).  Because it's no longer true,
and because callers don't seem to really care about the root node's
context, we can and should remove the assertion and just return the BB's
context.

Beyond that, the question is whether the callers of
blk_get_aio_context() are OK with the context potentially changing
concurrently.  Honestly, it isn't entirely clear to me; most look OK,
except for the virtio-scsi code, which operates under the general
assumption that the BB's context is always equal to that of the
virtio-scsi device.  I doubt that this assumption always holds (it is
definitely not obvious to me that it would), but then again, this series
will not make matters worse in that regard, and that is what counts for
me now.

One clear point of contention is scsi_device_for_each_req_async(), which
is addressed by patch 2.  Right now, it schedules a BH in the BB
context, then the BH double-checks whether the context still fits, and
if not, re-schedules itself.  Because virtio-scsi's context is fixed,
this seems to indicate to me that it wants to be able to deal with a
case where BB and virtio-scsi context differ, which seems to break that
aforementioned general virtio-scsi assumption.

Unfortunately, I definitely have to touch that code, because accepting
concurrent changes of AioContexts breaks the double-check (just because
the BB has the right context in that place does not mean it stays in
that context); instead, we must prevent any concurrent change until the
BH is done.  Because changing contexts generally involves a drained
section, we can prevent it by keeping the BB in-flight counter elevated.

Question is, how to reason for that.  I’d really rather not include the
need to follow the BB context in my argument, because I find that part a
bit fishy.

Luckily, there’s a second, completely different reason for having
scsi_device_for_each_req_async() increment the in-flight counter:
Specifically, scsi_device_purge_requests() probably wants to await full
completion of scsi_device_for_each_req_async(), and we can do that most
easily in the very same way by incrementing the in-flight counter.  This
way, the blk_drain() in scsi_device_purge_requests() will not only await
all (cancelled) I/O requests, but also the non-I/O requests.

The fact that this prevents the BB AioContext from changing while the BH
is scheduled/running then is just a nice side effect.


Hanna Czenczek (2):
  block-backend: Allow concurrent context changes
  scsi: Await request purging

 block/block-backend.c | 22 +++---
 hw/scsi/scsi-bus.c| 30 +-
 2 files changed, 32 insertions(+), 20 deletions(-)

-- 
2.43.0




Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-02-02 Thread Hanna Czenczek

On 01.02.24 16:25, Hanna Czenczek wrote:

On 01.02.24 15:28, Stefan Hajnoczi wrote:


[...]


Did you find a scenario where the virtio-scsi AioContext is different
from the scsi-hd BB's AioContext?


Technically, that’s the reason for this thread, specifically that 
virtio_scsi_hotunplug() switches the BB back to the main context while 
scsi_device_for_each_req_async_bh() is running.  Yes, we can fix that 
specific case via the in-flight counter, but I’m wondering whether 
there’s really any merit in requiring the BB to always be in 
virtio-scsi’s context, or whether it would make more sense to schedule 
everything in virtio-scsi’s context.  Now that BBs/BDSs can receive 
requests from any context, that is.


Now that I know that wouldn’t be easy, let me turn this around: As far 
as I understand, scsi_device_for_each_req_async_bh() should still run in 
virtio-scsi’s context, but that’s hard, so we take the BB’s context, 
which we therefore require to be the same one. Further, (again AFAIU,) 
virtio-scsi’s context cannot change (only set in 
virtio_scsi_dataplane_setup(), which is run in 
virtio_scsi_device_realize()).  Therefore, why does the 
scsi_device_for_each_req_async() code accommodate for BB context changes?


Hanna




Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-02-01 Thread Hanna Czenczek

On 01.02.24 16:25, Hanna Czenczek wrote:

[...]


It just seems simpler to me to not rely on the BB's context at all.


Hm, I now see the problem is that the processing (and scheduling) is 
largely done in generic SCSI code, which doesn’t have access to 
virtio-scsi’s context, only to that of the BB.  That makes my idea quite 
impossible. :/





Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-02-01 Thread Hanna Czenczek

On 01.02.24 15:28, Stefan Hajnoczi wrote:

On Thu, Feb 01, 2024 at 03:10:12PM +0100, Hanna Czenczek wrote:

On 31.01.24 21:35, Stefan Hajnoczi wrote:

On Fri, Jan 26, 2024 at 04:24:49PM +0100, Hanna Czenczek wrote:

On 26.01.24 14:18, Kevin Wolf wrote:

Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben:

On 23.01.24 18:10, Kevin Wolf wrote:

Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:

On 21.12.23 22:23, Kevin Wolf wrote:

From: Stefan Hajnoczi

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
   the requests list.
- When the VM is stopped only the main loop may access the requests
   list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi
Reviewed-by: Eric Blake
Message-ID:<20231204164259.1515217-2-stefa...@redhat.com>
Signed-off-by: Kevin Wolf
---
  include/hw/scsi/scsi.h |   7 +-
  hw/scsi/scsi-bus.c | 181 -
  2 files changed, 131 insertions(+), 57 deletions(-)

My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
often because of this commit than because of the original bug, i.e. when
repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
this tends to happen when unplugging the scsi-hd:

Note: We (on issues.redhat.com) have a separate report that seems to be
concerning this very problem: https://issues.redhat.com/browse/RHEL-19381


{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
Assertion `ctx == blk->ctx' failed.

[...]


I don’t know anything about the problem yet, but as usual, I like
speculation and discovering how wrong I was later on, so one thing I came
across that’s funny about virtio-scsi is that requests can happen even while
a disk is being attached or detached.  That is, Linux seems to probe all
LUNs when a new virtio-scsi device is being attached, and it won’t stop just
because a disk is being attached or removed.  So maybe that’s part of the
problem, that we get a request while the BB is being detached, and
temporarily in an inconsistent state (BDS context differs from BB context).

I don't know anything about the problem either, but since you already
speculated about the cause, let me speculate about the solution:
Can we hold the graph writer lock for the tran_commit() call in
bdrv_try_change_aio_context()? And of course take the reader lock for
blk_get_aio_context(), but that should be completely unproblematic.

Actually, now that completely unproblematic part is giving me trouble.  I
wanted to just put a graph lock into blk_get_aio_context() (making it a
coroutine with a wrapper)

Which is the first thing I neglected and already not great. We have
calls of blk_get_aio_context() in the SCSI I/O path, and creating a
coroutine and doing at least two context switches simply for this call
is a lot of overhead...


but callers of blk_get_aio_context() generally assume the context is
going to stay the BB’s context for as long as their AioContext *
variable is in scope.

I'm not so sure about that. And taking another step back, I'm actually
also not sure how much it still matters now that they can submit I/O
from any thread.

That’s my impression, too, but “not sure” doesn’t feel great. :)
scsi_device_for_each_req_async_bh() specifically double-checks whether it’s
still in the right context before invoking the specified function, so it
seems there was some intention to continue to run in the context associated
with the BB.

(Not judging whether that intent makes sense or not, yet.)


Maybe the correct solution is to remove the assertion from
blk_get_aio_context() and just always return blk->ctx. If it's in the
middle of a change, you'll either get the old one or the new one. Either
one is fine to submit I/O from, and if you care about changes for other
reasons (like SCSI does), then you need explicit code to protect it
anyway (which SCSI apparently has, but it doesn't work).

I think most callers do just assume the BB stays in the context they got
(without any proof, admittedly), but I agree that under re-evaluation, it
probably doesn’t actually matter to them, really. And yes, basically, if the
caller doesn’t need to take a lock because it doesn’t really matter whether
blk->ctx changes while it's still using the old value, blk_get_aio_context()
in turn doesn’t need to double-check blk->ctx against the root node’s
context either, and nobody needs a lock.

So I agree, it’s on the caller to protect against a potentially changing
context,

Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-02-01 Thread Hanna Czenczek

On 31.01.24 21:35, Stefan Hajnoczi wrote:

On Fri, Jan 26, 2024 at 04:24:49PM +0100, Hanna Czenczek wrote:

On 26.01.24 14:18, Kevin Wolf wrote:

Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben:

On 23.01.24 18:10, Kevin Wolf wrote:

Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:

On 21.12.23 22:23, Kevin Wolf wrote:

From: Stefan Hajnoczi

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
  the requests list.
- When the VM is stopped only the main loop may access the requests
  list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi
Reviewed-by: Eric Blake
Message-ID:<20231204164259.1515217-2-stefa...@redhat.com>
Signed-off-by: Kevin Wolf
---
 include/hw/scsi/scsi.h |   7 +-
 hw/scsi/scsi-bus.c | 181 -
 2 files changed, 131 insertions(+), 57 deletions(-)

My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
often because of this commit than because of the original bug, i.e. when
repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
this tends to happen when unplugging the scsi-hd:

Note: We (on issues.redhat.com) have a separate report that seems to be
concerning this very problem: https://issues.redhat.com/browse/RHEL-19381


{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
Assertion `ctx == blk->ctx' failed.

[...]


I don’t know anything about the problem yet, but as usual, I like
speculation and discovering how wrong I was later on, so one thing I came
across that’s funny about virtio-scsi is that requests can happen even while
a disk is being attached or detached.  That is, Linux seems to probe all
LUNs when a new virtio-scsi device is being attached, and it won’t stop just
because a disk is being attached or removed.  So maybe that’s part of the
problem, that we get a request while the BB is being detached, and
temporarily in an inconsistent state (BDS context differs from BB context).

I don't know anything about the problem either, but since you already
speculated about the cause, let me speculate about the solution:
Can we hold the graph writer lock for the tran_commit() call in
bdrv_try_change_aio_context()? And of course take the reader lock for
blk_get_aio_context(), but that should be completely unproblematic.

Actually, now that completely unproblematic part is giving me trouble.  I
wanted to just put a graph lock into blk_get_aio_context() (making it a
coroutine with a wrapper)

Which is the first thing I neglected and already not great. We have
calls of blk_get_aio_context() in the SCSI I/O path, and creating a
coroutine and doing at least two context switches simply for this call
is a lot of overhead...


but callers of blk_get_aio_context() generally assume the context is
going to stay the BB’s context for as long as their AioContext *
variable is in scope.

I'm not so sure about that. And taking another step back, I'm actually
also not sure how much it still matters now that they can submit I/O
from any thread.

That’s my impression, too, but “not sure” doesn’t feel great. :)
scsi_device_for_each_req_async_bh() specifically double-checks whether it’s
still in the right context before invoking the specified function, so it
seems there was some intention to continue to run in the context associated
with the BB.

(Not judging whether that intent makes sense or not, yet.)


Maybe the correct solution is to remove the assertion from
blk_get_aio_context() and just always return blk->ctx. If it's in the
middle of a change, you'll either get the old one or the new one. Either
one is fine to submit I/O from, and if you care about changes for other
reasons (like SCSI does), then you need explicit code to protect it
anyway (which SCSI apparently has, but it doesn't work).

I think most callers do just assume the BB stays in the context they got
(without any proof, admittedly), but I agree that under re-evaluation, it
probably doesn’t actually matter to them, really. And yes, basically, if the
caller doesn’t need to take a lock because it doesn’t really matter whether
blk->ctx changes while it's still using the old value, blk_get_aio_context()
in turn doesn’t need to double-check blk->ctx against the root node’s
context either, and nobody needs a lock.

So I agree, it’s on the caller to protect against a potentially changing
context, blk_get_aio_context() should just return blk->ctx and not check
against the root node.

(On a tangent: blk_drain(

Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-02-01 Thread Hanna Czenczek

On 01.02.24 11:21, Kevin Wolf wrote:

Am 01.02.2024 um 10:43 hat Hanna Czenczek geschrieben:

On 31.01.24 11:17, Kevin Wolf wrote:

Am 29.01.2024 um 17:30 hat Hanna Czenczek geschrieben:

I don’t like using drain as a form of lock specifically against AioContext
changes, but maybe Stefan is right, and we should use it in this specific
case to get just the single problem fixed.  (Though it’s not quite trivial
either.  We’d probably still want to remove the assertion from
blk_get_aio_context(), so we don’t have to require all of its callers to
hold a count in the in-flight counter.)

Okay, fair, maybe fixing the specific problem is more important than
solving the more generic blk_get_aio_context() race.

In this case, wouldn't it be enough to increase the in-flight counter so
that the drain before switching AioContexts would run the BH before
anything bad can happen? Does the following work?

Yes, that’s what I had in mind (Stefan, too, I think), and in testing,
it looks good.

Oh, sorry, I completely misunderstood then. I thought you were talking
about adding a new drained section somewhere and that sounded a bit more
complicated. :-)

If it works, let's do this. Would you like to pick this up and send it
as a formal patch (possibly in a more polished form), or should I do
that?


Sure, I can do it.

Hanna




Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-02-01 Thread Hanna Czenczek

On 31.01.24 11:17, Kevin Wolf wrote:

Am 29.01.2024 um 17:30 hat Hanna Czenczek geschrieben:

I don’t like using drain as a form of lock specifically against AioContext
changes, but maybe Stefan is right, and we should use it in this specific
case to get just the single problem fixed.  (Though it’s not quite trivial
either.  We’d probably still want to remove the assertion from
blk_get_aio_context(), so we don’t have to require all of its callers to
hold a count in the in-flight counter.)

Okay, fair, maybe fixing the specific problem is more important than
solving the more generic blk_get_aio_context() race.

In this case, wouldn't it be enough to increase the in-flight counter so
that the drain before switching AioContexts would run the BH before
anything bad can happen? Does the following work?


Yes, that’s what I had in mind (Stefan, too, I think), and in testing, 
it looks good.


Hanna



Kevin

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 0a2eb11c56..dc09eb8024 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -120,17 +120,11 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
  SCSIRequest *next;
  
  /*

- * If the AioContext changed before this BH was called then reschedule into
- * the new AioContext before accessing ->requests. This can happen when
- * scsi_device_for_each_req_async() is called and then the AioContext is
- * changed before BHs are run.
+ * The AioContext can't have changed because we increased the in-flight
+ * counter for s->conf.blk.
   */
  ctx = blk_get_aio_context(s->conf.blk);
-if (ctx != qemu_get_current_aio_context()) {
-aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
-g_steal_pointer(&data));
-return;
-}
+assert(ctx == qemu_get_current_aio_context());
  
   QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {

  data->fn(req, data->fn_opaque);
@@ -138,6 +132,7 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
  
  /* Drop the reference taken by scsi_device_for_each_req_async() */

  object_unref(OBJECT(s));
+blk_dec_in_flight(s->conf.blk);
  }
  
  /*

@@ -163,6 +158,7 @@ static void scsi_device_for_each_req_async(SCSIDevice *s,
   */
  object_ref(OBJECT(s));
  
+blk_inc_in_flight(s->conf.blk);

  aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
  scsi_device_for_each_req_async_bh,
  data);






Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-01-29 Thread Hanna Czenczek

On 23.01.24 18:10, Kevin Wolf wrote:

Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:

On 21.12.23 22:23, Kevin Wolf wrote:

From: Stefan Hajnoczi

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
the requests list.
- When the VM is stopped only the main loop may access the requests
list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi
Reviewed-by: Eric Blake
Message-ID:<20231204164259.1515217-2-stefa...@redhat.com>
Signed-off-by: Kevin Wolf
---
   include/hw/scsi/scsi.h |   7 +-
   hw/scsi/scsi-bus.c | 181 -
   2 files changed, 131 insertions(+), 57 deletions(-)

My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
often because of this commit than because of the original bug, i.e. when
repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
this tends to happen when unplugging the scsi-hd:

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
Assertion `ctx == blk->ctx' failed.

(gdb) bt
#0  0x7f32a668d83c in  () at /usr/lib/libc.so.6
#1  0x7f32a663d668 in raise () at /usr/lib/libc.so.6
#2  0x7f32a66254b8 in abort () at /usr/lib/libc.so.6
#3  0x7f32a66253dc in  () at /usr/lib/libc.so.6
#4  0x7f32a6635d26 in  () at /usr/lib/libc.so.6
#5  0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at
../block/block-backend.c:2429
#6  blk_get_aio_context (blk=0x556e6e89ccf0) at
../block/block-backend.c:2417
#7  0x556e6b112d87 in scsi_device_for_each_req_async_bh
(opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
#8  0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at
../util/async.c:218
#9  0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290,
blocking=blocking@entry=true) at ../util/aio-posix.c:722
#10 0x556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920)
at ../iothread.c:63
#11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at
../util/qemu-thread-posix.c:541
#12 0x7f32a668b9eb in  () at /usr/lib/libc.so.6
#13 0x7f32a670f7cc in  () at /usr/lib/libc.so.6

I don’t know anything about the problem yet, but as usual, I like
speculation and discovering how wrong I was later on, so one thing I came
across that’s funny about virtio-scsi is that requests can happen even while
a disk is being attached or detached.  That is, Linux seems to probe all
LUNs when a new virtio-scsi device is being attached, and it won’t stop just
because a disk is being attached or removed.  So maybe that’s part of the
problem, that we get a request while the BB is being detached, and
temporarily in an inconsistent state (BDS context differs from BB context).

I don't know anything about the problem either, but since you already
speculated about the cause, let me speculate about the solution:
Can we hold the graph writer lock for the tran_commit() call in
bdrv_try_change_aio_context()?


I tried removing the drain so that all of bdrv_try_change_aio_context() could 
require GRAPH_WRLOCK, but that broke tests/unit/test-block-iothread, 
because without draining, block jobs would need to switch AioContexts 
while running, and job_set_aio_context() doesn’t like that.  Similarly 
to blk_get_aio_context(), I assume we can in theory just drop the 
assertion there and change the context while the job is running, because 
then the job can just change AioContexts on the next pause point (and in 
the meantime send requests from the old context, which is fine), but 
this does get quite murky.  (One rather virtual (but present) problem is 
that test-block-iothread itself contains some assertions in the job that 
its AioContext is actually the one it's running in, but this assertion 
would no longer necessarily hold true.)


I don’t like using drain as a form of lock specifically against 
AioContext changes, but maybe Stefan is right, and we should use it in 
this specific case to get just the single problem fixed.  (Though it’s 
not quite trivial either.  We’d probably still want to remove the 
assertion from blk_get_aio_context(), so we don’t have to require all of 
its callers to hold a count in the in-flight counter.)


Hanna




Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-01-26 Thread Hanna Czenczek

On 26.01.24 14:18, Kevin Wolf wrote:

Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben:

On 23.01.24 18:10, Kevin Wolf wrote:

Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:

On 21.12.23 22:23, Kevin Wolf wrote:

From: Stefan Hajnoczi

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
 the requests list.
- When the VM is stopped only the main loop may access the requests
 list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi
Reviewed-by: Eric Blake
Message-ID:<20231204164259.1515217-2-stefa...@redhat.com>
Signed-off-by: Kevin Wolf
---
include/hw/scsi/scsi.h |   7 +-
hw/scsi/scsi-bus.c | 181 -
2 files changed, 131 insertions(+), 57 deletions(-)

My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
often because of this commit than because of the original bug, i.e. when
repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
this tends to happen when unplugging the scsi-hd:


Note: We (on issues.redhat.com) have a separate report that seems to be 
concerning this very problem: https://issues.redhat.com/browse/RHEL-19381



{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
Assertion `ctx == blk->ctx' failed.

[...]


I don’t know anything about the problem yet, but as usual, I like
speculation and discovering how wrong I was later on, so one thing I came
across that’s funny about virtio-scsi is that requests can happen even while
a disk is being attached or detached.  That is, Linux seems to probe all
LUNs when a new virtio-scsi device is being attached, and it won’t stop just
because a disk is being attached or removed.  So maybe that’s part of the
problem, that we get a request while the BB is being detached, and
temporarily in an inconsistent state (BDS context differs from BB context).

I don't know anything about the problem either, but since you already
speculated about the cause, let me speculate about the solution:
Can we hold the graph writer lock for the tran_commit() call in
bdrv_try_change_aio_context()? And of course take the reader lock for
blk_get_aio_context(), but that should be completely unproblematic.

Actually, now that completely unproblematic part is giving me trouble.  I
wanted to just put a graph lock into blk_get_aio_context() (making it a
coroutine with a wrapper)

Which is the first thing I neglected and already not great. We have
calls of blk_get_aio_context() in the SCSI I/O path, and creating a
coroutine and doing at least two context switches simply for this call
is a lot of overhead...


but callers of blk_get_aio_context() generally assume the context is
going to stay the BB’s context for as long as their AioContext *
variable is in scope.

I'm not so sure about that. And taking another step back, I'm actually
also not sure how much it still matters now that they can submit I/O
from any thread.


That’s my impression, too, but “not sure” doesn’t feel great. :) 
scsi_device_for_each_req_async_bh() specifically double-checks whether 
it’s still in the right context before invoking the specified function, 
so it seems there was some intention to continue to run in the context 
associated with the BB.


(Not judging whether that intent makes sense or not, yet.)


Maybe the correct solution is to remove the assertion from
blk_get_aio_context() and just always return blk->ctx. If it's in the
middle of a change, you'll either get the old one or the new one. Either
one is fine to submit I/O from, and if you care about changes for other
reasons (like SCSI does), then you need explicit code to protect it
anyway (which SCSI apparently has, but it doesn't work).


I think most callers do just assume the BB stays in the context they got 
(without any proof, admittedly), but I agree that under re-evaluation, 
it probably doesn’t actually matter to them, really. And yes, basically, 
if the caller doesn’t need to take a lock because it doesn’t really 
matter whether blk->ctx changes while it's still using the old value, 
blk_get_aio_context() in turn doesn’t need to double-check blk->ctx 
against the root node’s context either, and nobody needs a lock.


So I agree, it’s on the caller to protect against a potentially changing 
context, blk_get_aio_context() should just return blk->ctx and not check 
against the root node.


(On a tangent: blk_drain() is a caller of blk_get_aio_context(), and it 
polls that context.  Why does it need to

Re: [PATCH 2/2] virtio: Keep notifications disabled during drain

2024-01-25 Thread Hanna Czenczek

On 25.01.24 19:18, Hanna Czenczek wrote:

On 25.01.24 19:03, Stefan Hajnoczi wrote:

On Wed, Jan 24, 2024 at 06:38:30PM +0100, Hanna Czenczek wrote:


[...]

@@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)

  aio_set_event_notifier_poll(ctx, &vq->host_notifier,
virtio_queue_host_notifier_aio_poll_begin,
virtio_queue_host_notifier_aio_poll_end);
+
+    /*
+     * We will have ignored notifications about new requests from the guest
+     * during the drain, so "kick" the virt queue to process those requests
+     * now.
+     */
+    virtio_queue_notify(vq->vdev, vq->queue_index);

event_notifier_set(&vq->host_notifier) is easier to understand because
it doesn't contain a non-host_notifier code path that we must not take.

Is there a reason why you used virtio_queue_notify() instead?


Not a good one anyway!

virtio_queue_notify() is just what seemed obvious to me (i.e. to 
notify the virtqueue).  Before removal of the AioContext lock, calling 
handle_output seemed safe.  But, yes, there was the discussion on the 
RFC that it really isn’t.  I didn’t consider that means we must rely 
on virtio_queue_notify() calling event_notifier_set(), so we may as 
well call it explicitly here.


I’ll fix it, thanks for pointing it out!


(I think together with this change, I’ll also remove the 
event_notifier_set() call from virtio_blk_data_plane_start().  It’d 
obviously be a duplicate, and removing it shows why 
virtio_queue_aio_attach_host_notifier() should always kick the queue.)





Re: [PATCH 2/2] virtio: Keep notifications disabled during drain

2024-01-25 Thread Hanna Czenczek

On 25.01.24 19:03, Stefan Hajnoczi wrote:

On Wed, Jan 24, 2024 at 06:38:30PM +0100, Hanna Czenczek wrote:

During drain, we do not care about virtqueue notifications, which is why
we remove the handlers on it.  When removing those handlers, whether vq
notifications are enabled or not depends on whether we were in polling
mode or not; if not, they are enabled (by default); if so, they have
been disabled by the io_poll_start callback.

Because we do not care about those notifications after removing the
handlers, this is fine.  However, we have to explicitly ensure they are
enabled when re-attaching the handlers, so we will resume receiving
notifications.  We do this in virtio_queue_aio_attach_host_notifier*().
If such a function is called while we are in a polling section,
attaching the notifiers will then invoke the io_poll_start callback,
re-disabling notifications.

Because we will always miss virtqueue updates in the drained section, we
also need to poll the virtqueue once after attaching the notifiers.

Buglink: https://issues.redhat.com/browse/RHEL-3934
Signed-off-by: Hanna Czenczek 
---
  include/block/aio.h |  7 ++-
  hw/virtio/virtio.c  | 42 ++
  2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 5d0a114988..8378553eb9 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -480,9 +480,14 @@ void aio_set_event_notifier(AioContext *ctx,
                             AioPollFn *io_poll,
                             EventNotifierHandler *io_poll_ready);
 
-/* Set polling begin/end callbacks for an event notifier that has already been
+/*
+ * Set polling begin/end callbacks for an event notifier that has already been
  * registered with aio_set_event_notifier.  Do nothing if the event notifier is
  * not registered.
+ *
+ * Note that if the io_poll_end() callback (or the entire notifier) is removed
+ * during polling, it will not be called, so an io_poll_begin() is not
+ * necessarily always followed by an io_poll_end().
  */
 void aio_set_event_notifier_poll(AioContext *ctx,
                                  EventNotifier *notifier,
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 7549094154..4166da9e97 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3556,6 +3556,17 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
 
 void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
+    /*
+     * virtio_queue_aio_detach_host_notifier() can leave notifications disabled.
+     * Re-enable them.  (And if detach has not been used before, notifications
+     * being enabled is still the default state while a notifier is attached;
+     * see virtio_queue_host_notifier_aio_poll_end(), which will always leave
+     * notifications enabled once the polling section is left.)
+     */
+    if (!virtio_queue_get_notification(vq)) {
+        virtio_queue_set_notification(vq, 1);
+    }
+
     aio_set_event_notifier(ctx, &vq->host_notifier,
                            virtio_queue_host_notifier_read,
                            virtio_queue_host_notifier_aio_poll,
@@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
     aio_set_event_notifier_poll(ctx, &vq->host_notifier,
                                 virtio_queue_host_notifier_aio_poll_begin,
                                 virtio_queue_host_notifier_aio_poll_end);
+
+    /*
+     * We will have ignored notifications about new requests from the guest
+     * during the drain, so "kick" the virt queue to process those requests
+     * now.
+     */
+    virtio_queue_notify(vq->vdev, vq->queue_index);

event_notifier_set(&vq->host_notifier) is easier to understand because
it doesn't contain a non-host_notifier code path that we must not take.

Is there a reason why you used virtio_queue_notify() instead?


Not a good one anyway!

virtio_queue_notify() is just what seemed obvious to me (i.e. to notify 
the virtqueue).  Before removal of the AioContext lock, calling 
handle_output seemed safe.  But, yes, there was the discussion on the 
RFC that it really isn’t.  I didn’t consider that means we must rely on 
virtio_queue_notify() calling event_notifier_set(), so we may as well 
call it explicitly here.


I’ll fix it, thanks for pointing it out!


Otherwise:
Reviewed-by: Stefan Hajnoczi 


Thanks!

Hanna




Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-01-25 Thread Hanna Czenczek

On 23.01.24 18:10, Kevin Wolf wrote:

Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:

On 21.12.23 22:23, Kevin Wolf wrote:

From: Stefan Hajnoczi

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
the requests list.
- When the VM is stopped only the main loop may access the requests
list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi
Reviewed-by: Eric Blake
Message-ID:<20231204164259.1515217-2-stefa...@redhat.com>
Signed-off-by: Kevin Wolf
---
   include/hw/scsi/scsi.h |   7 +-
   hw/scsi/scsi-bus.c | 181 -
   2 files changed, 131 insertions(+), 57 deletions(-)

My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
often because of this commit than because of the original bug, i.e. when
repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
this tends to happen when unplugging the scsi-hd:

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
Assertion `ctx == blk->ctx' failed.


[...]


I don’t know anything about the problem yet, but as usual, I like
speculation and discovering how wrong I was later on, so one thing I came
across that’s funny about virtio-scsi is that requests can happen even while
a disk is being attached or detached.  That is, Linux seems to probe all
LUNs when a new virtio-scsi device is being attached, and it won’t stop just
because a disk is being attached or removed.  So maybe that’s part of the
problem, that we get a request while the BB is being detached, and
temporarily in an inconsistent state (BDS context differs from BB context).

I don't know anything about the problem either, but since you already
speculated about the cause, let me speculate about the solution:
Can we hold the graph writer lock for the tran_commit() call in
bdrv_try_change_aio_context()? And of course take the reader lock for
blk_get_aio_context(), but that should be completely unproblematic.


Actually, now that completely unproblematic part is giving me trouble.  
I wanted to just put a graph lock into blk_get_aio_context() (making it 
a coroutine with a wrapper), but callers of blk_get_aio_context() 
generally assume the context is going to stay the BB’s context for as 
long as their AioContext * variable is in scope.  I was tempted to think 
callers know what happens to the BB they pass to blk_get_aio_context(), 
and it won’t change contexts so easily, but then I remembered this is 
exactly what happens in this case; we run 
scsi_device_for_each_req_async_bh() in one thread (which calls 
blk_get_aio_context()), and in the other, we change the BB’s context.


It seems like there are very few blk_* functions right now that require 
taking a graph lock around it, so I’m hesitant to go that route.  But if 
we’re protecting a BB’s context via the graph write lock, I can’t think 
of a way around having to take a read lock whenever reading a BB’s 
context, and holding it for as long as we assume that context to remain 
the BB’s context.  It’s also hard to figure out how long that is, case 
by case; for example, dma_blk_read() schedules an AIO function in the BB 
context; but we probably don’t care that this context remains the BB’s 
context until the request is done.  In the case of 
scsi_device_for_each_req_async_bh(), we already take care to re-schedule 
it when it turns out the context is outdated, so it does seem quite 
important here, and we probably want to keep the lock until after the 
QTAILQ_FOREACH_SAFE() loop.


On a tangent, this TOCTTOU problem makes me wary of other blk_* 
functions that query information.  For example, fuse_read() (in 
block/export/fuse.c) truncates requests to the BB length.  But what if 
the BB length changes concurrently between blk_getlength() and 
blk_pread()?  While we can justify using the graph lock for a BB’s 
AioContext, we can’t use it for other metadata like its length.


Hanna




Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-01-25 Thread Hanna Czenczek

On 24.01.24 22:53, Stefan Hajnoczi wrote:

On Wed, Jan 24, 2024 at 01:12:47PM +0100, Hanna Czenczek wrote:

On 23.01.24 18:10, Kevin Wolf wrote:

Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:

On 21.12.23 22:23, Kevin Wolf wrote:

From: Stefan Hajnoczi

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
 the requests list.
- When the VM is stopped only the main loop may access the requests
 list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi
Reviewed-by: Eric Blake
Message-ID:<20231204164259.1515217-2-stefa...@redhat.com>
Signed-off-by: Kevin Wolf
---
include/hw/scsi/scsi.h |   7 +-
hw/scsi/scsi-bus.c | 181 -
2 files changed, 131 insertions(+), 57 deletions(-)

My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
often because of this commit than because of the original bug, i.e. when
repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
this tends to happen when unplugging the scsi-hd:

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
Assertion `ctx == blk->ctx' failed.

[...]


I don't know anything about the problem either, but since you already
speculated about the cause, let me speculate about the solution:
Can we hold the graph writer lock for the tran_commit() call in
bdrv_try_change_aio_context()? And of course take the reader lock for
blk_get_aio_context(), but that should be completely unproblematic.

I tried this, and it’s not easy taking the lock just for tran_commit(),
because some callers of bdrv_try_change_aio_context() already hold the write
lock (specifically bdrv_attach_child_common(),
bdrv_attach_child_common_abort(), and bdrv_root_unref_child()[1]), and
qmp_x_blockdev_set_iothread() holds the read lock.  Other callers don’t hold
any lock[2].

So I’m not sure whether we should mark all of bdrv_try_change_aio_context()
as GRAPH_WRLOCK and then make all callers take the lock, or really only take
it for tran_commit(), and have callers release the lock around
bdrv_try_change_aio_context(). Former sounds better to naïve me.

(In any case, FWIW, having blk_set_aio_context() take the write lock, and
scsi_device_for_each_req_async_bh() take the read lock[3], does fix the
assertion failure.)

I wonder if a simpler solution is blk_inc_in_flight() in
scsi_device_for_each_req_async() and blk_dec_in_flight() in
scsi_device_for_each_req_async_bh() so that drain
waits for the BH.

There is a drain around the AioContext change, so as long as
scsi_device_for_each_req_async() was called before blk_set_aio_context()
and not _during_ aio_poll(), we would prevent inconsistent BB vs BDS
aio_contexts.


Actually, Kevin has suggested on IRC that we drop the whole drain. :)

Dropping the write lock in or outside of bdrv_try_change_aio_context() 
for callers that have already taken it seems unsafe, so the only option 
would be to make the whole function write-lock-able.  The drained 
section can cause problems with that if it ends up wanting to reorganize 
the graph, so AFAIU drain should never be done while under a write 
lock.  This is already a problem now because there are three callers 
that do call bdrv_try_change_aio_context() while under a write lock, so 
it seems like we shouldn’t keep the drain as-is.


So, Kevin suggested just dropping that drain, because I/O requests are 
no longer supposed to care about a BDS's native AioContext anyway, so 
it seems like the need for the drain has gone away with
multiqueue.  Then we could make the whole function GRAPH_WRLOCK.


Hanna




[PATCH 2/2] virtio: Keep notifications disabled during drain

2024-01-24 Thread Hanna Czenczek
During drain, we do not care about virtqueue notifications, which is why
we remove the handlers on it.  When removing those handlers, whether vq
notifications are enabled or not depends on whether we were in polling
mode or not; if not, they are enabled (by default); if so, they have
been disabled by the io_poll_start callback.

Because we do not care about those notifications after removing the
handlers, this is fine.  However, we have to explicitly ensure they are
enabled when re-attaching the handlers, so we will resume receiving
notifications.  We do this in virtio_queue_aio_attach_host_notifier*().
If such a function is called while we are in a polling section,
attaching the notifiers will then invoke the io_poll_start callback,
re-disabling notifications.

Because we will always miss virtqueue updates in the drained section, we
also need to poll the virtqueue once after attaching the notifiers.

Buglink: https://issues.redhat.com/browse/RHEL-3934
Signed-off-by: Hanna Czenczek 
---
 include/block/aio.h |  7 ++-
 hw/virtio/virtio.c  | 42 ++
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 5d0a114988..8378553eb9 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -480,9 +480,14 @@ void aio_set_event_notifier(AioContext *ctx,
 AioPollFn *io_poll,
 EventNotifierHandler *io_poll_ready);
 
-/* Set polling begin/end callbacks for an event notifier that has already been
+/*
+ * Set polling begin/end callbacks for an event notifier that has already been
  * registered with aio_set_event_notifier.  Do nothing if the event notifier is
  * not registered.
+ *
+ * Note that if the io_poll_end() callback (or the entire notifier) is removed
+ * during polling, it will not be called, so an io_poll_begin() is not
+ * necessarily always followed by an io_poll_end().
  */
 void aio_set_event_notifier_poll(AioContext *ctx,
  EventNotifier *notifier,
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 7549094154..4166da9e97 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3556,6 +3556,17 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
 
 void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
+/*
+ * virtio_queue_aio_detach_host_notifier() can leave notifications disabled.
+ * Re-enable them.  (And if detach has not been used before, notifications
+ * being enabled is still the default state while a notifier is attached;
+ * see virtio_queue_host_notifier_aio_poll_end(), which will always leave
+ * notifications enabled once the polling section is left.)
+ */
+if (!virtio_queue_get_notification(vq)) {
+virtio_queue_set_notification(vq, 1);
+}
+
  aio_set_event_notifier(ctx, &vq->host_notifier,
virtio_queue_host_notifier_read,
virtio_queue_host_notifier_aio_poll,
@@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
  aio_set_event_notifier_poll(ctx, &vq->host_notifier,
 virtio_queue_host_notifier_aio_poll_begin,
 virtio_queue_host_notifier_aio_poll_end);
+
+/*
+ * We will have ignored notifications about new requests from the guest
+ * during the drain, so "kick" the virt queue to process those requests
+ * now.
+ */
+virtio_queue_notify(vq->vdev, vq->queue_index);
 }
 
 /*
@@ -3573,14 +3591,38 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
  */
void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx)
 {
+/* See virtio_queue_aio_attach_host_notifier() */
+if (!virtio_queue_get_notification(vq)) {
+virtio_queue_set_notification(vq, 1);
+}
+
  aio_set_event_notifier(ctx, &vq->host_notifier,
virtio_queue_host_notifier_read,
NULL, NULL);
+
+/*
+ * See virtio_queue_aio_attach_host_notifier().
+ * Note that this may be unnecessary for the type of virtqueues this
+ * function is used for.  Still, it will not hurt to have a quick look into
+ * whether we can/should process any of the virtqueue elements.
+ */
+virtio_queue_notify(vq->vdev, vq->queue_index);
 }
 
 void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
  aio_set_event_notifier(ctx, &vq->host_notifier, NULL, NULL, NULL);
+
+/*
+ * aio_set_event_notifier_poll() does not guarantee whether io_poll_end()
+ * will run after io_poll_begin(), so by removing the notifier, we do not
+ * know whether virtio_queue_host_notifier_aio_poll_end() has run after a
+ * previous virtio_queue_host_notifier_aio_poll_begin(), i.e. whether
+ * not

[PATCH 1/2] virtio-scsi: Attach event vq notifier with no_poll

2024-01-24 Thread Hanna Czenczek
As of commit 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi:
don't waste CPU polling the event virtqueue"), we only attach an io_read
notifier for the virtio-scsi event virtqueue instead, and no polling
notifiers.  During operation, the event virtqueue is typically
non-empty, but none of the buffers are intended to be used immediately.
Instead, they only get used when certain events occur.  Therefore, it
makes no sense to continuously poll it when non-empty, because it is
supposed to be and stay non-empty.

We do this by using virtio_queue_aio_attach_host_notifier_no_poll()
instead of virtio_queue_aio_attach_host_notifier() for the event
virtqueue.

Commit 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement
BlockDevOps->drained_begin()") however has virtio_scsi_drained_end() use
virtio_queue_aio_attach_host_notifier() for all virtqueues, including
the event virtqueue.  This can lead to it being polled again, undoing
the benefit of commit 38738f7dbbda90fbc161757b7f4be35b52205552.

Fix it by using virtio_queue_aio_attach_host_notifier_no_poll() for the
event virtqueue.

Reported-by: Fiona Ebner 
Fixes: 766aa2de0f29b657148e04599320d771c36fd126
   ("virtio-scsi: implement BlockDevOps->drained_begin()")
Signed-off-by: Hanna Czenczek 
---
 hw/scsi/virtio-scsi.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 690aceec45..9f02ceea09 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -1149,6 +1149,7 @@ static void virtio_scsi_drained_begin(SCSIBus *bus)
 static void virtio_scsi_drained_end(SCSIBus *bus)
 {
 VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus);
+VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
 VirtIODevice *vdev = VIRTIO_DEVICE(s);
 uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED +
 s->parent_obj.conf.num_queues;
@@ -1166,7 +1167,11 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
 
 for (uint32_t i = 0; i < total_queues; i++) {
 VirtQueue *vq = virtio_get_queue(vdev, i);
-virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+if (vq == vs->event_vq) {
+virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx);
+} else {
+virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+}
 }
 }
 
-- 
2.43.0




[PATCH 0/2] virtio: Keep notifications disabled during drain

2024-01-24 Thread Hanna Czenczek
Hi,

When registering callbacks via aio_set_event_notifier_poll(), the
io_poll_end() callback is only invoked when polling actually ends.  If
the notifiers are removed while in a polling section, it is not called.
Therefore, io_poll_start() is not necessarily followed up by
io_poll_end().

It is not entirely clear whether this is good or bad behavior.  On one
hand, it may be unexpected to callers.  On the other, it may be
counterproductive to call io_poll_end() when the polling section has not
ended yet.

Right now, there is only one user of aio_set_event_notifier(), which is
virtio_queue_aio_attach_host_notifier().  It does not expect this
behavior, which leads to virtqueue notifiers remaining disabled if
virtio_queue_aio_detach_host_notifier() is called while polling.  That
can happen e.g. through virtio_scsi_drained_begin() or
virtio_blk_drained_begin() (through virtio_blk_data_plane_detach()).
In such a case, the virtqueue may not be processed for a while, letting
the guest driver hang.  This can be reproduced by repeatedly
hot-plugging and -unplugging a virtio-scsi device with a scsi-hd disk,
because the guest will try to enumerate the virtio-scsi device while
we’re attaching the scsi-hd disk, which causes a drain, which can cause
the virtio-scsi virtqueue to stall as described.
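For reference, one iteration of the hot-plug/unplug cycle described above roughly corresponds to a QMP sequence like the following. Only the device id "stg0" appears in this thread; the node name, bus name, controller id, and the pre-created iothread object "iot0" are made up for illustration:

```json
{"execute": "blockdev-add", "arguments": {"driver": "null-co", "node-name": "disk0"}}
{"execute": "device_add", "arguments": {"driver": "virtio-scsi-pci", "id": "vscsi0", "iothread": "iot0"}}
{"execute": "device_add", "arguments": {"driver": "scsi-hd", "id": "stg0", "drive": "disk0", "bus": "vscsi0.0"}}
{"execute": "device_del", "arguments": {"id": "stg0"}}
{"execute": "device_del", "arguments": {"id": "vscsi0"}}
```

The stall window is the guest's enumeration of the freshly plugged virtio-scsi controller overlapping with the drain caused by attaching or removing the scsi-hd disk.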

Stefan has suggested ensuring we always follow up io_poll_start() by
io_poll_end():

https://lists.nongnu.org/archive/html/qemu-block/2023-12/msg00163.html

I prefer changing the caller instead, because I don’t think we actually
want the virtqueue notifier to be force-enabled when removing our AIO
notifiers.  So I believe we actually only want to take care to
force-enable it when we re-attach the AIO notifiers, and to kick
virtqueue processing once, in case we missed any events while the AIO
notifiers were not attached.

That is done by patch 2.  We have already discussed a prior version of
it here:

https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg1.html

And compared to that, based on the discussion, there are some changes:
1. Used virtio_queue_notify() instead of virtio_queue_notify_vq(), as
   suggested by Paolo, because it’s thread-safe
2. Moved virtio_queue_notify() into
   virtio_queue_aio_attach_host_notifier*(), because we always want it
3. Dropped virtio_queue_set_notification(vq, 0) from
   virtio_queue_aio_detach_host_notifier(): Paolo wasn’t sure whether
   that was safe to do from any context.  We don’t really need to call
   it anyway, so I just dropped it.
4. Added patch 1:

Patch 1 fixes virtio_scsi_drained_end() so it won’t attach polling
notifiers for the event virtqueue.  That didn’t turn out to be an issue
so far, but with patch 2, Fiona saw the virtqueue processing queue
spinning in a loop as described in
38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: don't waste CPU
polling the event virtqueue").


Note that as of eaad0fe26050c227dc5dad63205835bac4912a51 ("scsi: only
access SCSIDevice->requests from one thread") there’s a different
problem when trying to reproduce the bug via hot-plugging and
-unplugging a virtio-scsi device, specifically, when unplugging, qemu
may crash with an assertion failure[1].  I don’t have a full fix for
that yet, but in case you need a work-around for the specific case of
virtio-scsi hot-plugging and -unplugging, you can use this patch:

https://czenczek.de/0001-DONTMERGE-Fix-crash-on-scsi-unplug.patch


[1] https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00317.html


Hanna Czenczek (2):
  virtio-scsi: Attach event vq notifier with no_poll
  virtio: Keep notifications disabled during drain

 include/block/aio.h   |  7 ++-
 hw/scsi/virtio-scsi.c |  7 ++-
 hw/virtio/virtio.c| 42 ++
 3 files changed, 54 insertions(+), 2 deletions(-)

-- 
2.43.0




Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-01-24 Thread Hanna Czenczek

On 23.01.24 18:10, Kevin Wolf wrote:

Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:

On 21.12.23 22:23, Kevin Wolf wrote:

From: Stefan Hajnoczi

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
the requests list.
- When the VM is stopped only the main loop may access the requests
list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi
Reviewed-by: Eric Blake
Message-ID:<20231204164259.1515217-2-stefa...@redhat.com>
Signed-off-by: Kevin Wolf
---
   include/hw/scsi/scsi.h |   7 +-
   hw/scsi/scsi-bus.c | 181 -
   2 files changed, 131 insertions(+), 57 deletions(-)

My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
often because of this commit than because of the original bug, i.e. when
repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
this tends to happen when unplugging the scsi-hd:

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
Assertion `ctx == blk->ctx' failed.


[...]


I don't know anything about the problem either, but since you already
speculated about the cause, let me speculate about the solution:
Can we hold the graph writer lock for the tran_commit() call in
bdrv_try_change_aio_context()? And of course take the reader lock for
blk_get_aio_context(), but that should be completely unproblematic.


I tried this, and it’s not easy taking the lock just for tran_commit(), 
because some callers of bdrv_try_change_aio_context() already hold the 
write lock (specifically bdrv_attach_child_common(), 
bdrv_attach_child_common_abort(), and bdrv_root_unref_child()[1]), and 
qmp_x_blockdev_set_iothread() holds the read lock.  Other callers don’t 
hold any lock[2].


So I’m not sure whether we should mark all of 
bdrv_try_change_aio_context() as GRAPH_WRLOCK and then make all callers 
take the lock, or really only take it for tran_commit(), and have 
callers release the lock around bdrv_try_change_aio_context(). Former 
sounds better to naïve me.


(In any case, FWIW, having blk_set_aio_context() take the write lock, 
and scsi_device_for_each_req_async_bh() take the read lock[3], does fix 
the assertion failure.)


Hanna

[1] bdrv_root_unref_child() is not marked as GRAPH_WRLOCK, but its 
callers generally seem to ensure that the lock is taken when calling it.


[2] blk_set_aio_context() (evidently), blk_exp_add(), 
external_snapshot_abort(), {blockdev,drive}_backup_action(), 
qmp_{blockdev,drive}_mirror()


[3] I’ve made the _bh a coroutine (for bdrv_graph_co_rdlock()) and 
replaced the aio_bh_schedule_oneshot() by aio_co_enter() – hope that’s 
right.

Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-01-23 Thread Hanna Czenczek

On 23.01.24 18:10, Kevin Wolf wrote:

Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:

On 21.12.23 22:23, Kevin Wolf wrote:

From: Stefan Hajnoczi

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
the requests list.
- When the VM is stopped only the main loop may access the requests
list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi
Reviewed-by: Eric Blake
Message-ID:<20231204164259.1515217-2-stefa...@redhat.com>
Signed-off-by: Kevin Wolf
---
   include/hw/scsi/scsi.h |   7 +-
   hw/scsi/scsi-bus.c | 181 -
   2 files changed, 131 insertions(+), 57 deletions(-)

My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
often because of this commit than because of the original bug, i.e. when
repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
this tends to happen when unplugging the scsi-hd:

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
Assertion `ctx == blk->ctx' failed.

(gdb) bt
#0  0x7f32a668d83c in  () at /usr/lib/libc.so.6
#1  0x7f32a663d668 in raise () at /usr/lib/libc.so.6
#2  0x7f32a66254b8 in abort () at /usr/lib/libc.so.6
#3  0x7f32a66253dc in  () at /usr/lib/libc.so.6
#4  0x7f32a6635d26 in  () at /usr/lib/libc.so.6
#5  0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at
../block/block-backend.c:2429
#6  blk_get_aio_context (blk=0x556e6e89ccf0) at
../block/block-backend.c:2417
#7  0x556e6b112d87 in scsi_device_for_each_req_async_bh
(opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
#8  0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at
../util/async.c:218
#9  0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290,
blocking=blocking@entry=true) at ../util/aio-posix.c:722
#10 0x556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920)
at ../iothread.c:63
#11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at
../util/qemu-thread-posix.c:541
#12 0x7f32a668b9eb in  () at /usr/lib/libc.so.6
#13 0x7f32a670f7cc in  () at /usr/lib/libc.so.6

I don’t know anything about the problem yet, but as usual, I like
speculation and discovering how wrong I was later on, so one thing I came
across that’s funny about virtio-scsi is that requests can happen even while
a disk is being attached or detached.  That is, Linux seems to probe all
LUNs when a new virtio-scsi device is being attached, and it won’t stop just
because a disk is being attached or removed.  So maybe that’s part of the
problem, that we get a request while the BB is being detached, and
temporarily in an inconsistent state (BDS context differs from BB context).

I don't know anything about the problem either, but since you already
speculated about the cause, let me speculate about the solution:
Can we hold the graph writer lock for the tran_commit() call in
bdrv_try_change_aio_context()? And of course take the reader lock for
blk_get_aio_context(), but that should be completely unproblematic.

At the first sight I don't see a reason why this would break something,
but I've learnt not to trust my first impression with the graph locking
work...

Of course, I also didn't check if there are more things inside of the
device emulation that need additional locking in this case, too. But
even if so, blk_get_aio_context() should never see an inconsistent
state.


Ah, sorry, saw your reply only now that I hit “send”.

I forgot that we do have that big lock that I thought rather to avoid 
:)  Sounds good and very reasonable to me.  Changing the contexts in the 
graph sounds like a graph change operation, and reading and comparing 
contexts in the graph sounds like reading the graph.


Hanna

Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-01-23 Thread Hanna Czenczek

On 23.01.24 17:40, Hanna Czenczek wrote:

On 21.12.23 22:23, Kevin Wolf wrote:

From: Stefan Hajnoczi

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
   the requests list.
- When the VM is stopped only the main loop may access the requests
   list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi
Reviewed-by: Eric Blake
Message-ID:<20231204164259.1515217-2-stefa...@redhat.com>
Signed-off-by: Kevin Wolf
---
  include/hw/scsi/scsi.h |   7 +-
  hw/scsi/scsi-bus.c | 181 -
  2 files changed, 131 insertions(+), 57 deletions(-)

[...]

I don’t know anything about the problem yet, but as usual, I like 
speculation and discovering how wrong I was later on, so one thing I 
came across that’s funny about virtio-scsi is that requests can happen 
even while a disk is being attached or detached.  That is, Linux seems 
to probe all LUNs when a new virtio-scsi device is being attached, and 
it won’t stop just because a disk is being attached or removed.  So 
maybe that’s part of the problem, that we get a request while the BB 
is being detached, and temporarily in an inconsistent state (BDS 
context differs from BB context).


I’ll look more into it.


What I think happens is that scsi_device_purge_requests() runs (perhaps 
through virtio_scsi_hotunplug() -> qdev_simple_device_unplug_cb() -> 
scsi_qdev_unrealize()?), which schedules 
scsi_device_for_each_req_async_bh() to run, but doesn’t await it.  We go 
on, begin to move the BB and its BDS back to the main context (via 
blk_set_aio_context() in virtio_scsi_hotunplug()), but 
scsi_device_for_each_req_async_bh() still runs in the I/O thread, it 
calls blk_get_aio_context() while the contexts are inconsistent, and we 
get the crash.


There is a comment above blk_get_aio_context() in 
scsi_device_for_each_req_async_bh() about the BB potentially being moved 
to a different context prior to the BH running, but it doesn’t consider 
the possibility that that move may occur *concurrently*.


I don’t know how to fix this, though.  The whole idea of anything 
happening to a BB while it is being moved to a different context seems 
so wrong to me that I’d want to slap a big lock on it, but I have the 
feeling that that isn’t what we want.


Hanna




Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-01-23 Thread Hanna Czenczek

On 21.12.23 22:23, Kevin Wolf wrote:

From: Stefan Hajnoczi

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
   the requests list.
- When the VM is stopped only the main loop may access the requests
   list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi
Reviewed-by: Eric Blake
Message-ID:<20231204164259.1515217-2-stefa...@redhat.com>
Signed-off-by: Kevin Wolf
---
  include/hw/scsi/scsi.h |   7 +-
  hw/scsi/scsi-bus.c | 181 -
  2 files changed, 131 insertions(+), 57 deletions(-)


My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks 
more often because of this commit than because of the original bug, i.e. 
when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd 
device, this tends to happen when unplugging the scsi-hd:


{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: 
Assertion `ctx == blk->ctx' failed.


(gdb) bt
#0  0x7f32a668d83c in  () at /usr/lib/libc.so.6
#1  0x7f32a663d668 in raise () at /usr/lib/libc.so.6
#2  0x7f32a66254b8 in abort () at /usr/lib/libc.so.6
#3  0x7f32a66253dc in  () at /usr/lib/libc.so.6
#4  0x7f32a6635d26 in  () at /usr/lib/libc.so.6
#5  0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at 
../block/block-backend.c:2429
#6  blk_get_aio_context (blk=0x556e6e89ccf0) at 
../block/block-backend.c:2417
#7  0x556e6b112d87 in scsi_device_for_each_req_async_bh 
(opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
#8  0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at 
../util/async.c:218
#9  0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, 
blocking=blocking@entry=true) at ../util/aio-posix.c:722
#10 0x556e6b4564b6 in iothread_run 
(opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63
#11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at 
../util/qemu-thread-posix.c:541

#12 0x7f32a668b9eb in  () at /usr/lib/libc.so.6
#13 0x7f32a670f7cc in  () at /usr/lib/libc.so.6

I don’t know anything about the problem yet, but as usual, I like 
speculation and discovering how wrong I was later on, so one thing I 
came across that’s funny about virtio-scsi is that requests can happen 
even while a disk is being attached or detached.  That is, Linux seems 
to probe all LUNs when a new virtio-scsi device is being attached, and 
it won’t stop just because a disk is being attached or removed.  So 
maybe that’s part of the problem, that we get a request while the BB is 
being detached, and temporarily in an inconsistent state (BDS context 
differs from BB context).


I’ll look more into it.

Hanna

Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-23 Thread Hanna Czenczek

On 02.01.24 16:24, Hanna Czenczek wrote:

[...]

I’ve attached the preliminary patch that I didn’t get to send (or test 
much) last year.  Not sure if it has the same CPU-usage-spike issue 
Fiona was seeing, the only functional difference is that I notify the 
vq after attaching the notifiers instead of before.


It’ll take me some more time to send a patch mail to that effect, 
because now there’s an assertion failure on hotunplug, which bisects 
back to eaad0fe26050c227dc5dad63205835bac4912a51 (“scsi: only access 
SCSIDevice->requests from one thread”):


{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: 
Assertion `ctx == blk->ctx' failed.


(gdb) bt
#0  0x7f32a668d83c in  () at /usr/lib/libc.so.6
#1  0x7f32a663d668 in raise () at /usr/lib/libc.so.6
#2  0x7f32a66254b8 in abort () at /usr/lib/libc.so.6
#3  0x7f32a66253dc in  () at /usr/lib/libc.so.6
#4  0x7f32a6635d26 in  () at /usr/lib/libc.so.6
#5  0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at 
../block/block-backend.c:2429
#6  blk_get_aio_context (blk=0x556e6e89ccf0) at 
../block/block-backend.c:2417
#7  0x556e6b112d87 in scsi_device_for_each_req_async_bh 
(opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
#8  0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at 
../util/async.c:218
#9  0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, 
blocking=blocking@entry=true) at ../util/aio-posix.c:722
#10 0x556e6b4564b6 in iothread_run 
(opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63
#11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at 
../util/qemu-thread-posix.c:541

#12 0x7f32a668b9eb in  () at /usr/lib/libc.so.6
#13 0x7f32a670f7cc in  () at /usr/lib/libc.so.6




Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-23 Thread Hanna Czenczek

On 23.01.24 12:12, Fiona Ebner wrote:

[...]


I noticed poll_set_started() is not called, because
ctx->fdmon_ops->need_wait(ctx) was true, i.e. ctx->poll_disable_cnt was
positive (I'm using fdmon_poll). I then found this is because of the
notifier for the event vq, being attached with


virtio_queue_aio_attach_host_notifier_no_poll(vs->event_vq, s->ctx);

in virtio_scsi_dataplane_start(). But in virtio_scsi_drained_end() it is
attached with virtio_queue_aio_attach_host_notifier() instead of the
_no_poll() variant. So that might be the actual issue here?

 From a quick test, I cannot see the CPU-usage-spike issue with the
following either:


diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 690aceec45..ba1ab8e410 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -1166,7 +1166,15 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
  
  for (uint32_t i = 0; i < total_queues; i++) {

  VirtQueue *vq = virtio_get_queue(vdev, i);
-virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+if (!virtio_queue_get_notification(vq)) {
+virtio_queue_set_notification(vq, true);
+}
+if (vq == VIRTIO_SCSI_COMMON(s)->event_vq) {
+virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx);
+} else {
+virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+}
+virtio_queue_notify(vdev, i);
  }
  }


Perfect, so we agree on trying it that way. :)

Hanna

Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-23 Thread Hanna Czenczek

On 22.01.24 18:52, Hanna Czenczek wrote:

On 22.01.24 18:41, Hanna Czenczek wrote:

On 05.01.24 15:30, Fiona Ebner wrote:

Am 05.01.24 um 14:43 schrieb Fiona Ebner:

Am 03.01.24 um 14:35 schrieb Paolo Bonzini:

On 1/3/24 12:40, Fiona Ebner wrote:
I'm happy to report that I cannot reproduce the CPU-usage-spike 
issue
with the patch, but I did run into an assertion failure when 
trying to
verify that it fixes my original stuck-guest-IO issue. See below 
for the
backtrace [0]. Hanna wrote in 
https://issues.redhat.com/browse/RHEL-3934



I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, 
because
both virtio-scsi’s and virtio-blk’s .handle_output() 
implementations
acquire the device’s context, so this should be directly 
callable from

any context.

I guess this is not true anymore now that the AioContext locking was
removed?

Good point and, in fact, even before it was much safer to use
virtio_queue_notify() instead.  Not only does it use the event 
notifier

handler, but it also calls it in the right thread/AioContext just by
doing event_notifier_set().


But with virtio_queue_notify() using the event notifier, the
CPU-usage-spike issue is present:

Back to the CPU-usage-spike issue: I experimented around and it 
doesn't
seem to matter whether I notify the virt queue before or after 
attaching

the notifiers. But there's another functional difference. My patch
called virtio_queue_notify() which contains this block:


    if (vq->host_notifier_enabled) {
        event_notifier_set(&vq->host_notifier);
    } else if (vq->handle_output) {
        vq->handle_output(vdev, vq);
In my testing, the first branch was taken, calling 
event_notifier_set().

Hanna's patch uses virtio_queue_notify_vq() and there,
vq->handle_output() will be called. That seems to be the relevant
difference regarding the CPU-usage-spike issue.
I should mention that this is with a VirtIO SCSI disk. I also 
attempted

to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
didn't manage yet.

What I noticed is that in virtio_queue_host_notifier_aio_poll(), 
one of

the queues (but only one) will always show as nonempty. And then,
run_poll_handlers_once() will always detect progress which explains 
the

CPU usage.

The following shows
1. vq address
2. number of times vq was passed to 
virtio_queue_host_notifier_aio_poll()

3. number of times the result of virtio_queue_host_notifier_aio_poll()
was true for the vq


0x555fd93f9c80 17162000 0
0x555fd93f9e48 17162000 6
0x555fd93f9ee0 17162000 0
0x555fd93f9d18 17162000 17162000
0x555fd93f9db0 17162000 0
0x555fd93f9f78 17162000 0

And for the problematic one, the reason it is seen as nonempty is:


0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0

vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and
s->events_dropped is false in my testing, so
virtio_scsi_handle_event_vq() doesn't do anything.


Those values stay like this while the call counts above increase.

So something going wrong with the indices when the event notifier 
is set

from QEMU side (in the main thread)?

The guest is Debian 12 with a 6.1 kernel.


So, trying to figure out a new RFC version:

About the stack trace you, Fiona, posted:  As far as I understand, 
that happens because virtio_blk_drained_end() calling 
virtio_queue_notify_vq() wasn’t safe after all, and instead we need 
to use virtio_queue_notify().  Right?


However, you say using virtio_queue_notify() instead causes busy 
loops of doing nothing in virtio-scsi (what you describe above). I 
mean, better than a crash, but, you know. :)


I don’t have any prior knowledge about the event handling, 
unfortunately.  The fact that 8 buffers are available but we don’t 
use any sounds OK to me; as far as I understand, we only use those 
buffers if we have any events to push into them, so as long as we 
don’t, we won’t.  Question is, should we not have its poll handler 
return false if we don’t have any events (i.e. events_dropped is 
false)?  Would that solve it?


Or actually, maybe we could just skip the virtio_queue_notify() call 
for the event vq?  I.e. have it be `if (vq != 
VIRTIO_SCSI_COMMON(s)->event_vq) { virtio_queue_notify(vdev, i); }`?  
I wouldn’t like that very much, (a) because this would make it 
slightly cumbersome to put that into 
virtio_queue_aio_attach_host_notifier*(), and (b) in case that does 
fix it, I do kind of feel like the real problem is that we use 
virtio_queue_host_notifier_aio_poll() for the event vq, which tells 
the polling code to poll whenever the vq is non-empty, but we (AFAIU) 
expect the event vq to be non-empty all the time.


Turns out there was commit 38738f7dbbda90fbc161757b7f4be35b52205552 
(“virtio-scsi: don't waste CPU polling the event virtqueue”) by Stefan, 
which specifically intended to not use 
virtio_queue_host_notifier_aio_poll() for the event vq.  I think the 
problem is that virtio_scsi_drained

Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-22 Thread Hanna Czenczek

On 22.01.24 18:41, Hanna Czenczek wrote:

On 05.01.24 15:30, Fiona Ebner wrote:

Am 05.01.24 um 14:43 schrieb Fiona Ebner:

Am 03.01.24 um 14:35 schrieb Paolo Bonzini:

On 1/3/24 12:40, Fiona Ebner wrote:

I'm happy to report that I cannot reproduce the CPU-usage-spike issue
with the patch, but I did run into an assertion failure when 
trying to
verify that it fixes my original stuck-guest-IO issue. See below 
for the
backtrace [0]. Hanna wrote in 
https://issues.redhat.com/browse/RHEL-3934



I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, 
because

both virtio-scsi’s and virtio-blk’s .handle_output() implementations
acquire the device’s context, so this should be directly callable 
from

any context.

I guess this is not true anymore now that the AioContext locking was
removed?

Good point and, in fact, even before it was much safer to use
virtio_queue_notify() instead.  Not only does it use the event 
notifier

handler, but it also calls it in the right thread/AioContext just by
doing event_notifier_set().


But with virtio_queue_notify() using the event notifier, the
CPU-usage-spike issue is present:

Back to the CPU-usage-spike issue: I experimented around and it 
doesn't
seem to matter whether I notify the virt queue before or after 
attaching

the notifiers. But there's another functional difference. My patch
called virtio_queue_notify() which contains this block:


    if (vq->host_notifier_enabled) {
        event_notifier_set(&vq->host_notifier);
    } else if (vq->handle_output) {
        vq->handle_output(vdev, vq);
In my testing, the first branch was taken, calling 
event_notifier_set().

Hanna's patch uses virtio_queue_notify_vq() and there,
vq->handle_output() will be called. That seems to be the relevant
difference regarding the CPU-usage-spike issue.

I should mention that this is with a VirtIO SCSI disk. I also attempted
to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
didn't manage yet.

What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of
the queues (but only one) will always show as nonempty. And then,
run_poll_handlers_once() will always detect progress which explains the
CPU usage.

The following shows
1. vq address
2. number of times vq was passed to 
virtio_queue_host_notifier_aio_poll()

3. number of times the result of virtio_queue_host_notifier_aio_poll()
was true for the vq


0x555fd93f9c80 17162000 0
0x555fd93f9e48 17162000 6
0x555fd93f9ee0 17162000 0
0x555fd93f9d18 17162000 17162000
0x555fd93f9db0 17162000 0
0x555fd93f9f78 17162000 0

And for the problematic one, the reason it is seen as nonempty is:


0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0

vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and
s->events_dropped is false in my testing, so
virtio_scsi_handle_event_vq() doesn't do anything.


Those values stay like this while the call counts above increase.

So something going wrong with the indices when the event notifier is 
set

from QEMU side (in the main thread)?

The guest is Debian 12 with a 6.1 kernel.


So, trying to figure out a new RFC version:

About the stack trace you, Fiona, posted:  As far as I understand, 
that happens because virtio_blk_drained_end() calling 
virtio_queue_notify_vq() wasn’t safe after all, and instead we need to 
use virtio_queue_notify().  Right?


However, you say using virtio_queue_notify() instead causes busy loops 
of doing nothing in virtio-scsi (what you describe above). I mean, 
better than a crash, but, you know. :)


I don’t have any prior knowledge about the event handling, 
unfortunately.  The fact that 8 buffers are available but we don’t use 
any sounds OK to me; as far as I understand, we only use those buffers 
if we have any events to push into them, so as long as we don’t, we 
won’t.  Question is, should we not have its poll handler return false 
if we don’t have any events (i.e. events_dropped is false)?  Would 
that solve it?


Or actually, maybe we could just skip the virtio_queue_notify() call for 
the event vq?  I.e. have it be `if (vq != 
VIRTIO_SCSI_COMMON(s)->event_vq) { virtio_queue_notify(vdev, i); }`?  I 
wouldn’t like that very much, (a) because this would make it slightly 
cumbersome to put that into virtio_queue_aio_attach_host_notifier*(), 
and (b) in case that does fix it, I do kind of feel like the real 
problem is that we use virtio_queue_host_notifier_aio_poll() for the 
event vq, which tells the polling code to poll whenever the vq is 
non-empty, but we (AFAIU) expect the event vq to be non-empty all the time.


Hanna




Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-22 Thread Hanna Czenczek

On 05.01.24 15:30, Fiona Ebner wrote:

Am 05.01.24 um 14:43 schrieb Fiona Ebner:

Am 03.01.24 um 14:35 schrieb Paolo Bonzini:

On 1/3/24 12:40, Fiona Ebner wrote:

I'm happy to report that I cannot reproduce the CPU-usage-spike issue
with the patch, but I did run into an assertion failure when trying to
verify that it fixes my original stuck-guest-IO issue. See below for the
backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934


I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because
both virtio-scsi’s and virtio-blk’s .handle_output() implementations
acquire the device’s context, so this should be directly callable from
any context.

I guess this is not true anymore now that the AioContext locking was
removed?

Good point and, in fact, even before it was much safer to use
virtio_queue_notify() instead.  Not only does it use the event notifier
handler, but it also calls it in the right thread/AioContext just by
doing event_notifier_set().


But with virtio_queue_notify() using the event notifier, the
CPU-usage-spike issue is present:


Back to the CPU-usage-spike issue: I experimented around and it doesn't
seem to matter whether I notify the virt queue before or after attaching
the notifiers. But there's another functional difference. My patch
called virtio_queue_notify() which contains this block:


    if (vq->host_notifier_enabled) {
        event_notifier_set(&vq->host_notifier);
    } else if (vq->handle_output) {
        vq->handle_output(vdev, vq);

In my testing, the first branch was taken, calling event_notifier_set().
Hanna's patch uses virtio_queue_notify_vq() and there,
vq->handle_output() will be called. That seems to be the relevant
difference regarding the CPU-usage-spike issue.

I should mention that this is with a VirtIO SCSI disk. I also attempted
to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
didn't manage yet.

What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of
the queues (but only one) will always show as nonempty. And then,
run_poll_handlers_once() will always detect progress which explains the
CPU usage.

The following shows
1. vq address
2. number of times vq was passed to virtio_queue_host_notifier_aio_poll()
3. number of times the result of virtio_queue_host_notifier_aio_poll()
was true for the vq


0x555fd93f9c80 17162000 0
0x555fd93f9e48 17162000 6
0x555fd93f9ee0 17162000 0
0x555fd93f9d18 17162000 17162000
0x555fd93f9db0 17162000 0
0x555fd93f9f78 17162000 0

And for the problematic one, the reason it is seen as nonempty is:


0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0

vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and
s->events_dropped is false in my testing, so
virtio_scsi_handle_event_vq() doesn't do anything.


Those values stay like this while the call counts above increase.

So something going wrong with the indices when the event notifier is set
from QEMU side (in the main thread)?

The guest is Debian 12 with a 6.1 kernel.


So, trying to figure out a new RFC version:

About the stack trace you, Fiona, posted:  As far as I understand, that 
happens because virtio_blk_drained_end() calling 
virtio_queue_notify_vq() wasn’t safe after all, and instead we need to 
use virtio_queue_notify().  Right?


However, you say using virtio_queue_notify() instead causes busy loops 
of doing nothing in virtio-scsi (what you describe above).  I mean, 
better than a crash, but, you know. :)


I don’t have any prior knowledge about the event handling, 
unfortunately.  The fact that 8 buffers are available but we don’t use 
any sounds OK to me; as far as I understand, we only use those buffers 
if we have any events to push into them, so as long as we don’t, we 
won’t.  Question is, should we not have its poll handler return false if 
we don’t have any events (i.e. events_dropped is false)?  Would that 
solve it?


Hanna




Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-02 Thread Hanna Czenczek

On 02.01.24 16:53, Paolo Bonzini wrote:

On Tue, Jan 2, 2024 at 4:24 PM Hanna Czenczek  wrote:

I’ve attached the preliminary patch that I didn’t get to send (or test
much) last year.  Not sure if it has the same CPU-usage-spike issue
Fiona was seeing, the only functional difference is that I notify the vq
after attaching the notifiers instead of before.

I think the patch makes sense and cleaning up the logic of aio_poll
(which is one of those functions that grew and grew without much
clarity into who did what) can be done on top.

Just one small thing, the virtio_queue_notify_vq() call is required
because the virtqueue interrupt and eventfd are edge-triggered rather
than level-triggered; so perhaps it should be placed in the
function(s) that establish the handlers,
virtio_queue_aio_attach_host_notifier() and
virtio_queue_aio_attach_host_notifier_no_poll()? Neither
virtio_blk_drained_end() nor virtio_scsi_drained_end() are
particularly special, and the comment applies just as well:

 /*
  * We will have ignored notifications about new requests from the guest
  * while handlers were not attached, so "kick" the virt queue to process
  * those requests now.
  */


I wasn’t entirely sure whether we want to call notify_vq() if we have 
instated the handlers for the first time (in which case someone ought to 
look for existing unprocessed requests anyway), so I decided to limit it 
to drained_end.


But considering that it must be safe to call notify_vq() right after 
instating the handlers (virtio_queue_host_notifier_read() may then be 
called after all), we might as well do so every time, yes.


Hanna

Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-02 Thread Hanna Czenczek

On 13.12.23 22:15, Stefan Hajnoczi wrote:

Hanna and Fiona encountered a bug in aio_set_fd_handler(): there is no matching
io_poll_end() call upon removing an AioHandler when io_poll_begin() was
previously called. The missing io_poll_end() call leaves virtqueue
notifications disabled and the virtqueue's ioeventfd will never become
readable anymore.

The details of how virtio-scsi devices using IOThreads can hang after
hotplug/unplug are covered here:
https://issues.redhat.com/browse/RHEL-3934

Hanna is currently away over the December holidays. I'm sending these RFC
patches in the meantime. They demonstrate running aio_set_fd_handler() in the
AioContext home thread and adding the missing io_poll_end() call.


I agree with Paolo that if the handlers are removed, the caller probably 
isn’t interested in notifications: In our specific case, we remove the 
handlers because the device is to be drained, so we won’t poll the 
virtqueue anyway.  Therefore, if io_poll_end() is to be called to 
complete the start-end pair, it shouldn’t be done when the handlers are 
removed, but instead when they are reinstated.


I believe that’s quite infeasible to do in generic code: We’d have to 
basically remember that we haven’t called a specific io_poll_end 
callback yet, and so once it is reinstated while we’re not in a polling 
phase, we would need to call it then.  That in itself is already hairy, 
but in addition, the callback consists of a function pointer and an 
opaque pointer, and it seems impossible to reliably establish identity 
between two opaque pointers to know whether a newly instated io_poll_end 
callback is the same one as one that had been removed before (pointer 
identity may work, but also may not).


That’s why I think the responsibility should fall on the caller.  I 
believe both virtio_queue_aio_attach_host_notifier*() functions should 
enable vq notifications before instating the handler – if we’re in 
polling mode, io_poll_start() will then immediately be called and 
disable vq notifications again.  Having them enabled briefly shouldn’t 
hurt anything but performance (which I don’t think is terrible in the 
drain case).  For completeness’ sake, we may even have 
virtio_queue_aio_detach_host_notifier() disable vq notifications, 
because otherwise it’s unknown whether notifications are enabled or 
disabled after removing the notifiers.  That isn’t terrible, but I think 
(A) if any of the two, we want them to be disabled, because we won’t 
check for requests anyway, and (B) this gives us a clearer state 
machine, where removing the notifiers will always leave vq notifications 
disabled, so when they are reinstated (i.e. when calling 
virtio_queue_aio_attach_host_notifier*()), it’s clear that we must poll 
once to check for new requests.


I’ve attached the preliminary patch that I didn’t get to send (or test 
much) last year.  Not sure if it has the same CPU-usage-spike issue 
Fiona was seeing, the only functional difference is that I notify the vq 
after attaching the notifiers instead of before.


Hanna

From 451aae74fc19a6ea5cd6381247cd9202571651e8 Mon Sep 17 00:00:00 2001
From: Hanna Czenczek 
Date: Wed, 6 Dec 2023 18:24:55 +0100
Subject: [PATCH] Keep notifications disabled during drain

Preliminary patch with a preliminary description: During drain, we do
not care about virtqueue notifications, which is why we remove the
handlers on it.  When removing those handlers, whether vq notifications
are enabled or not depends on whether we were in polling mode or not; if
not, they are enabled (by default); if so, they have been disabled by
the io_poll_start callback.

Ideally, we would rather have the vq notifications be disabled, because
we will not process requests during a drained section anyway.
Therefore, have virtio_queue_aio_detach_host_notifier() explicitly
disable them, and virtio_queue_aio_attach_host_notifier*() re-enable
them (if we are in a polling section, attaching them will invoke the
io_poll_start callback, which will re-disable them).

Because we will miss virtqueue updates in the drained section, we also
need to poll the virtqueue once in the drained_end functions after
attaching the notifiers.
---
 include/block/aio.h|  7 ++-
 include/hw/virtio/virtio.h |  1 +
 hw/block/virtio-blk.c  | 10 ++
 hw/scsi/virtio-scsi.c  | 10 ++
 hw/virtio/virtio.c | 30 +-
 5 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index f08b358077..795a375ff2 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -497,9 +497,14 @@ void aio_set_event_notifier(AioContext *ctx,
 AioPollFn *io_poll,
 EventNotifierHandler *io_poll_ready);
 
-/* Set polling begin/end callbacks for an event notifier that has already been
+/*
+ * Set polling begin/end callbacks for an event notifier that has already been
  * registered

[PULL 2/3] block/file-posix: fix update_zones_wp() caller

2023-11-06 Thread Hanna Czenczek
From: Sam Li 

When a zoned request fails, only the wp of the target zones needs to
be updated, so as not to disrupt the in-flight writes on the other
zones. The wp is updated successfully after the request completes.

Fix the callers to pass the right offset and nr_zones.

Signed-off-by: Sam Li 
Message-Id: <20230825040556.4217-1-faithilike...@gmail.com>
Reviewed-by: Stefan Hajnoczi 
[hreitz: Rebased and fixed comment spelling]
Signed-off-by: Hanna Czenczek 
---
 block/file-posix.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 50e2b20d5c..3c0ce9c258 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2523,7 +2523,10 @@ out:
 }
 }
 } else {
-update_zones_wp(bs, s->fd, 0, 1);
+/*
+ * write and append write are not allowed to cross zone boundaries
+ */
+update_zones_wp(bs, s->fd, offset, 1);
 }
 
 qemu_co_mutex_unlock(&wps->colock);
@@ -3470,7 +3473,7 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState 
*bs, BlockZoneOp op,
 len >> BDRV_SECTOR_BITS);
 ret = raw_thread_pool_submit(handle_aiocb_zone_mgmt, &acb);
 if (ret != 0) {
-update_zones_wp(bs, s->fd, offset, i);
+update_zones_wp(bs, s->fd, offset, nrz);
 error_report("ioctl %s failed %d", op_name, ret);
 return ret;
 }
-- 
2.41.0




[PULL 3/3] file-posix: fix over-writing of returning zone_append offset

2023-11-06 Thread Hanna Czenczek
From: Naohiro Aota 

raw_co_zone_append() sets "s->offset", where "s" is the "BDRVRawState *".
This pointer is later used in raw_co_prw() to save the block address where
the data is written.

When multiple IOs are on-going at the same time, a later IO's
raw_co_zone_append() call over-writes a former IO's offset address before
raw_co_prw() completes. As a result, the former zone append IO returns the
initial value (= the start address of the writing zone), instead of the
proper address.

Fix the issue by passing the offset pointer to raw_co_prw() instead of
passing it through s->offset. Also, remove "offset" from BDRVRawState as
there is no usage anymore.

Fixes: 4751d09adcc3 ("block: introduce zone append write for zoned devices")
Signed-off-by: Naohiro Aota 
Message-Id: <20231030073853.2601162-1-naohiro.a...@wdc.com>
Reviewed-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Hanna Czenczek 
---
 block/file-posix.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 3c0ce9c258..b862406c71 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -160,7 +160,6 @@ typedef struct BDRVRawState {
 bool has_write_zeroes:1;
 bool use_linux_aio:1;
 bool use_linux_io_uring:1;
-int64_t *offset; /* offset of zone append operation */
 int page_cache_inconsistent; /* errno from fdatasync failure */
 bool has_fallocate;
 bool needs_alignment;
@@ -2445,12 +2444,13 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, 
QEMUIOVector *qiov)
 return true;
 }
 
-static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
+static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
 uint64_t bytes, QEMUIOVector *qiov, int type)
 {
 BDRVRawState *s = bs->opaque;
 RawPosixAIOData acb;
 int ret;
+uint64_t offset = *offset_ptr;
 
 if (fd_open(bs) < 0)
 return -EIO;
@@ -2513,8 +2513,8 @@ out:
 uint64_t *wp = &s->wp[offset / bs->bl.zone_size];
 if (!BDRV_ZT_IS_CONV(*wp)) {
 if (type & QEMU_AIO_ZONE_APPEND) {
-*s->offset = *wp;
-trace_zbd_zone_append_complete(bs, *s->offset
+*offset_ptr = *wp;
+trace_zbd_zone_append_complete(bs, *offset_ptr
 >> BDRV_SECTOR_BITS);
 }
 /* Advance the wp if needed */
@@ -2539,14 +2539,14 @@ static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset,
   int64_t bytes, QEMUIOVector *qiov,
   BdrvRequestFlags flags)
 {
-return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_READ);
+return raw_co_prw(bs, &offset, bytes, qiov, QEMU_AIO_READ);
 }
 
 static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
int64_t bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags)
 {
-return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
+return raw_co_prw(bs, &offset, bytes, qiov, QEMU_AIO_WRITE);
 }
 
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
@@ -3509,8 +3509,6 @@ static int coroutine_fn raw_co_zone_append(BlockDriverState *bs,
 int64_t zone_size_mask = bs->bl.zone_size - 1;
 int64_t iov_len = 0;
 int64_t len = 0;
-BDRVRawState *s = bs->opaque;
-s->offset = offset;
 
 if (*offset & zone_size_mask) {
 error_report("sector offset %" PRId64 " is not aligned to zone size "
@@ -3531,7 +3529,7 @@ static int coroutine_fn raw_co_zone_append(BlockDriverState *bs,
 }
 
 trace_zbd_zone_append(bs, *offset >> BDRV_SECTOR_BITS);
-return raw_co_prw(bs, *offset, len, qiov, QEMU_AIO_ZONE_APPEND);
+return raw_co_prw(bs, offset, len, qiov, QEMU_AIO_ZONE_APPEND);
 }
 #endif
 
-- 
2.41.0




[PULL 0/3] Block patches

2023-11-06 Thread Hanna Czenczek
The following changes since commit 3e01f1147a16ca566694b97eafc941d62fa1e8d8:

  Merge tag 'pull-sp-20231105' of https://gitlab.com/rth7680/qemu into staging 
(2023-11-06 09:34:22 +0800)

are available in the Git repository at:

  https://gitlab.com/hreitz/qemu.git tags/pull-block-2023-11-06

for you to fetch changes up to ad4feaca61d76fecad784e6d5e7bae40d0411c46:

  file-posix: fix over-writing of returning zone_append offset (2023-11-06 16:15:07 +0100)


Block patches:
- One patch to make qcow2's discard-no-unref option do better what it is
  supposed to do (i.e. prevent fragmentation)
- Two fixes for zoned requests


Jean-Louis Dupond (1):
  qcow2: keep reference on zeroize with discard-no-unref enabled

Naohiro Aota (1):
  file-posix: fix over-writing of returning zone_append offset

Sam Li (1):
  block/file-posix: fix update_zones_wp() caller

 qapi/block-core.json  | 24 ++--
 block/file-posix.c| 23 ---
 block/qcow2-cluster.c | 22 ++
 qemu-options.hx   | 10 +++---
 4 files changed, 51 insertions(+), 28 deletions(-)

-- 
2.41.0




[PULL 1/3] qcow2: keep reference on zeroize with discard-no-unref enabled

2023-11-06 Thread Hanna Czenczek
From: Jean-Louis Dupond 

When the discard-no-unref flag is enabled, we keep the reference for
normal discard requests.
But when a discard is executed on a snapshot/qcow2 image with backing,
the discards are saved as zero clusters in the snapshot image.

When committing the snapshot to the backing file, zero_in_l2_slice is
called instead of discard_in_l2_slice, and it had no logic to keep the
reference when discard-no-unref is enabled.

Therefore, add logic to the zero_in_l2_slice call to keep the reference
on commit.
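For reference, a hypothetical command-line invocation enabling the option on a qcow2 node (node names and file names below are placeholders) would look like:

```
-blockdev driver=file,node-name=proto0,filename=guest.qcow2 \
-blockdev driver=qcow2,node-name=fmt0,file=proto0,discard-no-unref=on \
-device virtio-blk-pci,drive=fmt0
```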

Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
Signed-off-by: Jean-Louis Dupond 
Message-Id: <20231003125236.216473-2-jean-lo...@dupond.be>
[hreitz: Made the documentation change more verbose, as discussed
 on-list]
Signed-off-by: Hanna Czenczek 
---
 qapi/block-core.json  | 24 ++--
 block/qcow2-cluster.c | 22 ++
 qemu-options.hx   | 10 +++---
 3 files changed, 39 insertions(+), 17 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 99961256f2..ca390c5700 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3528,16 +3528,20 @@
 # @pass-discard-other: whether discard requests for the data source
 # should be issued on other occasions where a cluster gets freed
 #
-# @discard-no-unref: when enabled, discards from the guest will not
-# cause cluster allocations to be relinquished.  This prevents
-# qcow2 fragmentation that would be caused by such discards.
-# Besides potential performance degradation, such fragmentation
-# can lead to increased allocation of clusters past the end of the
-# image file, resulting in image files whose file length can grow
-# much larger than their guest disk size would suggest.  If image
-# file length is of concern (e.g. when storing qcow2 images
-# directly on block devices), you should consider enabling this
-# option.  (since 8.1)
+# @discard-no-unref: when enabled, data clusters will remain
+# preallocated when they are no longer used, e.g. because they are
+# discarded or converted to zero clusters.  As usual, whether the
+# old data is discarded or kept on the protocol level (i.e. in the
+# image file) depends on the setting of the pass-discard-request
+# option.  Keeping the clusters preallocated prevents qcow2
+# fragmentation that would otherwise be caused by freeing and
+# re-allocating them later.  Besides potential performance
+# degradation, such fragmentation can lead to increased allocation
+# of clusters past the end of the image file, resulting in image
+# files whose file length can grow much larger than their guest disk
+# size would suggest.  If image file length is of concern (e.g. when
+# storing qcow2 images directly on block devices), you should
+# consider enabling this option.  (since 8.1)
 #
 # @overlap-check: which overlap checks to perform for writes to the
 # image, defaults to 'cached' (since 2.2)
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 904f00d1b3..5af439bd11 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1983,7 +1983,7 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters,
 /* If we keep the reference, pass on the discard still */
 bdrv_pdiscard(s->data_file, old_l2_entry & L2E_OFFSET_MASK,
   s->cluster_size);
-   }
+}
 }
 
 qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice);
@@ -2061,9 +2061,15 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
 QCow2ClusterType type = qcow2_get_cluster_type(bs, old_l2_entry);
 bool unmap = (type == QCOW2_CLUSTER_COMPRESSED) ||
 ((flags & BDRV_REQ_MAY_UNMAP) && qcow2_cluster_is_allocated(type));
-uint64_t new_l2_entry = unmap ? 0 : old_l2_entry;
+bool keep_reference =
+(s->discard_no_unref && type != QCOW2_CLUSTER_COMPRESSED);
+uint64_t new_l2_entry = old_l2_entry;
 uint64_t new_l2_bitmap = old_l2_bitmap;
 
+if (unmap && !keep_reference) {
+new_l2_entry = 0;
+}
+
 if (has_subclusters(s)) {
 new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
 } else {
@@ -2081,9 +2087,17 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
 set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
 }
 
-/* Then decrease the refcount */
 if (unmap) {
-qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
+if (!keep_reference) {
+/* Then decrease the refcount */
+qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
+} else if (s->discard_passthrough[QCOW2_DISCARD_REQUEST] &&
+   (type == QCOW2_CLUSTER_NORMAL

Re: [PATCH] file-posix: fix over-writing of returning zone_append offset

2023-11-06 Thread Hanna Czenczek

On 30.10.23 08:38, Naohiro Aota wrote:

raw_co_zone_append() sets "s->offset", where "s" is the "BDRVRawState *". This pointer
is used later at raw_co_prw() to save the block address where the data is
written.

When multiple IOs are on-going at the same time, a later IO's
raw_co_zone_append() call over-writes a former IO's offset address before
raw_co_prw() completes. As a result, the former zone append IO returns the
initial value (= the start address of the writing zone), instead of the
proper address.

Fix the issue by passing the offset pointer to raw_co_prw() instead of
passing it through s->offset. Also, remove "offset" from BDRVRawState as
there is no usage anymore.

Fixes: 4751d09adcc3 ("block: introduce zone append write for zoned devices")
Signed-off-by: Naohiro Aota 
---
  block/file-posix.c | 16 +++-
  1 file changed, 7 insertions(+), 9 deletions(-)


Thanks, applied to my block branch:

https://gitlab.com/hreitz/qemu/-/commits/block

Hanna




Re: [PATCH v2 09/10] block: Convert qmp_query_block() to coroutine_fn

2023-11-06 Thread Hanna Czenczek

On 09.06.23 22:19, Fabiano Rosas wrote:

This is another caller of bdrv_get_allocated_file_size() that needs to
be converted to a coroutine because that function will be made
asynchronous when called (indirectly) from the QMP dispatcher.

This QMP command is a candidate because it calls bdrv_do_query_node_info(),
which in turn calls bdrv_get_allocated_file_size().

We've determined bdrv_do_query_node_info() to be coroutine-safe (see
previous commits), so we can just put this QMP command in a coroutine.

Since qmp_query_block() now expects to run in a coroutine, its callers
need to be converted as well. Convert hmp_info_block(), which calls
only coroutine-safe code, including qmp_query_named_block_nodes()
which has been converted to coroutine in the previous patches.

Now that all callers of bdrv_[co_]block_device_info() are using the
coroutine version, a few things happen:

  - we can return to using bdrv_block_device_info() without a wrapper;

  - bdrv_get_allocated_file_size() can stop being mixed;

  - bdrv_co_get_allocated_file_size() needs to be put under the graph
lock because it is being called without the wrapper;

  - bdrv_do_query_node_info() doesn't need to acquire the AioContext
because it doesn't call aio_poll anymore;

Signed-off-by: Fabiano Rosas 
---
  block.c|  2 +-
  block/monitor/block-hmp-cmds.c |  2 +-
  block/qapi.c   | 18 +-
  hmp-commands-info.hx   |  1 +
  include/block/block-hmp-cmds.h |  2 +-
  include/block/block-io.h   |  2 +-
  include/block/qapi.h   | 12 
  qapi/block-core.json   |  2 +-
  8 files changed, 19 insertions(+), 22 deletions(-)


After this series has been sent, we got some usages of 
GRAPH_RDLOCK_GUARD_MAINLOOP() that no longer fit with this patch – I’ve 
also mentioned one case on patch 7, not yet realizing that this was a 
new thing.  Those must now be fixed (e.g. in qmp_query_block(), or in 
bdrv_snapshot_list()), or they’ll crash.


Hanna




Re: [PATCH v2 10/10] block: Add a thread-pool version of fstat

2023-11-06 Thread Hanna Czenczek

On 09.06.23 22:19, Fabiano Rosas wrote:

From: João Silva 

The fstat call can take a long time to finish when running over
NFS. Add a version of it that runs in the thread pool.

Adapt one of its users, raw_co_get_allocated_file size to use the new
version. That function is called via QMP under the qemu_global_mutex
so it has a large chance of blocking VCPU threads in case it takes too
long to finish.

Reviewed-by: Claudio Fontana 
Signed-off-by: João Silva 
Signed-off-by: Fabiano Rosas 
---
  block/file-posix.c  | 40 +---
  include/block/raw-aio.h |  4 +++-
  2 files changed, 40 insertions(+), 4 deletions(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH v2 05/10] block: Convert bdrv_query_block_graph_info to coroutine

2023-11-06 Thread Hanna Czenczek

On 09.06.23 22:19, Fabiano Rosas wrote:

We're converting callers of bdrv_get_allocated_file_size() to run in
coroutines because that function will be made asynchronous when called
(indirectly) from the QMP dispatcher.

This function is a candidate because it calls bdrv_do_query_node_info(),
which in turn calls bdrv_get_allocated_file_size().

All the functions called from bdrv_do_query_node_info() onwards are
coroutine-safe, either have a coroutine version themselves[1] or are
mostly simple code/string manipulation[2].

1) bdrv_getlength(), bdrv_get_allocated_file_size(), bdrv_get_info(),
bdrv_get_specific_info();

2) bdrv_refresh_filename(), bdrv_get_format_name(),
bdrv_get_full_backing_filename(), bdrv_query_snapshot_info_list();

Signed-off-by: Fabiano Rosas 
---
  block/qapi.c | 12 +++-
  include/block/qapi.h |  6 +-
  qemu-img.c   |  2 --
  3 files changed, 12 insertions(+), 8 deletions(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH v2 04/10] block: Temporarily mark bdrv_co_get_allocated_file_size as mixed

2023-11-06 Thread Hanna Czenczek

On 09.06.23 22:19, Fabiano Rosas wrote:

Some callers of this function are about to be converted to run in
coroutines, so allow it to be executed both inside and outside a
coroutine while we convert all the callers.

This will be reverted once all callers of bdrv_do_query_node_info run
in a coroutine.

Signed-off-by: Fabiano Rosas 
Reviewed-by: Eric Blake 
---
  include/block/block-io.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH v2 03/10] block: Allow the wrapper script to see functions declared in qapi.h

2023-11-06 Thread Hanna Czenczek

On 09.06.23 22:19, Fabiano Rosas wrote:

The following patches will add co_wrapper annotations to functions
declared in qapi.h. Add that header to the set of files used by
block-coroutine-wrapper.py.

Signed-off-by: Fabiano Rosas 
---
  block/meson.build  | 1 +
  scripts/block-coroutine-wrapper.py | 1 +
  2 files changed, 2 insertions(+)


Reviewed-by: Hanna Czenczek 




Re: [PATCH v2 09/10] block: Convert qmp_query_block() to coroutine_fn

2023-11-06 Thread Hanna Czenczek

On 09.06.23 22:19, Fabiano Rosas wrote:

This is another caller of bdrv_get_allocated_file_size() that needs to
be converted to a coroutine because that function will be made
asynchronous when called (indirectly) from the QMP dispatcher.

This QMP command is a candidate because it calls bdrv_do_query_node_info(),
which in turn calls bdrv_get_allocated_file_size().

We've determined bdrv_do_query_node_info() to be coroutine-safe (see
previous commits), so we can just put this QMP command in a coroutine.

Since qmp_query_block() now expects to run in a coroutine, its callers
need to be converted as well. Convert hmp_info_block(), which calls
only coroutine-safe code, including qmp_query_named_block_nodes()
which has been converted to coroutine in the previous patches.

Now that all callers of bdrv_[co_]block_device_info() are using the
coroutine version, a few things happen:

  - we can return to using bdrv_block_device_info() without a wrapper;

  - bdrv_get_allocated_file_size() can stop being mixed;

  - bdrv_co_get_allocated_file_size() needs to be put under the graph
lock because it is being called without the wrapper;


But bdrv_do_query_node_info() is marked GRAPH_RDLOCK, so the whole 
function must not be called without holding the lock.  I don’t 
understand why we need to explicitly take it another time.



  - bdrv_do_query_node_info() doesn't need to acquire the AioContext
because it doesn't call aio_poll anymore;


In the past (very likely outdated, but still mentioning it) you’d need 
to take the AioContext just in general when operating on a block device 
that might be processed in a different AioContext than the main one, and 
the current code runs in the main context, i.e. which is the situation 
we have here.


Speaking of contexts, I wonder how the threading is actually supposed to 
work.  I assume QMP coroutines run in the main thread, so now we run 
bdrv_co_get_allocated_file_size() in the main thread – is that correct, 
or do we need to use bdrv_co_enter() like qmp_block_resize() does?  And 
so, if we run it in the main thread, is it OK not to acquire the 
AioContext around it to prevent interference from a potential I/O thread?



Signed-off-by: Fabiano Rosas 
---
  block.c|  2 +-
  block/monitor/block-hmp-cmds.c |  2 +-
  block/qapi.c   | 18 +-
  hmp-commands-info.hx   |  1 +
  include/block/block-hmp-cmds.h |  2 +-
  include/block/block-io.h   |  2 +-
  include/block/qapi.h   | 12 
  qapi/block-core.json   |  2 +-
  8 files changed, 19 insertions(+), 22 deletions(-)


This patch implicitly assumes that quite a lot of functions (at least 
bdrv_query_info(), bdrv_query_image_info(), bdrv_do_query_node_info()) 
are now run in coroutine context.  This assumption must be formalized by 
annotating them all with coroutine_fn, and ideally adding a _co_ into 
their name.


Also, those functions should be checked whether they call coroutine 
wrappers, and made to use the native coroutine version now if so. (At 
least I’d find that nicer, FWIW.)  I’ve seen at least bdrv_getlength() 
in bdrv_do_query_node_info(), which could be a bdrv_co_getlength().



diff --git a/block.c b/block.c
index abed744b60..f94cee8930 100644
--- a/block.c
+++ b/block.c
@@ -6148,7 +6148,7 @@ BlockDeviceInfoList *bdrv_named_nodes_list(bool flat,
  
  list = NULL;

  QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) {
-BlockDeviceInfo *info = bdrv_co_block_device_info(NULL, bs, flat, errp);
+BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp);
  if (!info) {
  qapi_free_BlockDeviceInfoList(list);
  return NULL;
diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 26116fe831..1049f9b006 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -742,7 +742,7 @@ static void print_block_info(Monitor *mon, BlockInfo *info,
  }
  }
  
-void hmp_info_block(Monitor *mon, const QDict *qdict)

+void coroutine_fn hmp_info_block(Monitor *mon, const QDict *qdict)
  {
  BlockInfoList *block_list, *info;
  BlockDeviceInfoList *blockdev_list, *blockdev;
diff --git a/block/qapi.c b/block/qapi.c
index 20660e15d6..3b4bc0b782 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -41,10 +41,10 @@
  #include "qemu/qemu-print.h"
  #include "sysemu/block-backend.h"
  
-BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk,

-BlockDriverState *bs,
-bool flat,
-Error **errp)
+BlockDeviceInfo *coroutine_fn bdrv_block_device_info(BlockBackend *blk,
+ BlockDriverState *bs,
+ bool flat,
+

Re: [PATCH v2 08/10] block: Don't query all block devices at hmp_nbd_server_start

2023-11-06 Thread Hanna Czenczek

On 09.06.23 22:19, Fabiano Rosas wrote:

We're currently doing a full query-block just to enumerate the devices
for qmp_nbd_server_add and then discarding the BlockInfoList
afterwards. Alter hmp_nbd_server_start to instead iterate explicitly
over the block_backends list.

This allows the removal of the dependency on qmp_query_block from
hmp_nbd_server_start. This is desirable because we're about to move
qmp_query_block into a coroutine and don't need to change the NBD code
at the same time.

Signed-off-by: Fabiano Rosas 
---
  block/monitor/block-hmp-cmds.c | 20 
  1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index ca2599de44..26116fe831 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -394,7 +394,7 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict)
  bool writable = qdict_get_try_bool(qdict, "writable", false);
  bool all = qdict_get_try_bool(qdict, "all", false);
  Error *local_err = NULL;
-BlockInfoList *block_list, *info;
+BlockBackend *blk;
  SocketAddress *addr;
  NbdServerAddOptions export;
  
@@ -419,18 +419,24 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict)

  return;
  }
  
-/* Then try adding all block devices.  If one fails, close all and

+/*
+ * Then try adding all block devices.  If one fails, close all and
   * exit.
   */
-block_list = qmp_query_block(NULL);
+for (blk = blk_all_next(NULL); blk; blk = blk_all_next(blk)) {
+BlockDriverState *bs = blk_bs(blk);
  
-for (info = block_list; info; info = info->next) {

-if (!info->value->inserted) {
+if (!*blk_name(blk) && !blk_get_attached_dev(blk)) {


I’d like a comment here that historically, we’ve used qmp_query_block() 
here, and this is the same condition that it uses.  (Otherwise, it’s 
hard to see why it matters whether a device is attached or not.)



+continue;
+}
+
+bs = bdrv_skip_implicit_filters(bs);
+if (!bs || !bs->drv) {


Same here.  Just checking blk_is_inserted() would make more sense in 
this place, but if you want to absolutely keep behavior unchanged, then 
there should be a comment here why this check is done (because 
bdrv_query_info() does it to determine whether info->inserted should be 
set, which was historically used to determine whether this BlockBackend 
can be exported).



  continue;
  }
  
  export = (NbdServerAddOptions) {

-.device = info->value->device,
+.device = g_strdup(blk_name(blk)),


Do we need to g_strdup() here?  We didn’t before, so I think this will 
leak export.device.


I know bdrv_query_info() uses g_strdup(), but that was released by the 
qapi_free_BlockInfoList() below, which is now removed without replacement.


(On that note, it also looks like qmp_nbd_server_add() can leak 
arg->name (i.e. device.name) if it is not set by the caller.  It also 
uses g_strdup() there, but never frees it.  It does free the export_opts 
it creates, and `arg` is put into it, but as a deep copy, so anything in 
`arg` is leaked.)


Hanna


  .has_writable   = true,
  .writable   = writable,
  };
@@ -443,8 +449,6 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict)
  }
  }
  
-qapi_free_BlockInfoList(block_list);

-
  exit:
  hmp_handle_error(mon, local_err);
  }





Re: [PATCH v2 06/10] block: Convert bdrv_block_device_info into co_wrapper

2023-11-06 Thread Hanna Czenczek

On 09.06.23 22:19, Fabiano Rosas wrote:

We're converting callers of bdrv_get_allocated_file_size() to run in
coroutines because that function will be made asynchronous when called
(indirectly) from the QMP dispatcher.

This function is a candidate because it calls bdrv_query_image_info()
-> bdrv_do_query_node_info() -> bdrv_get_allocated_file_size().

It is safe to turn this is a coroutine because the code it calls is
made up of either simple accessors and string manipulation functions
[1] or it has already been determined to be safe [2].

1) bdrv_refresh_filename(), bdrv_is_read_only(),
blk_enable_write_cache(), bdrv_cow_bs(), blk_get_public(),
throttle_group_get_name(), bdrv_write_threshold_get(),
bdrv_query_dirty_bitmaps(), throttle_group_get_config(),
bdrv_filter_or_cow_bs(), bdrv_skip_implicit_filters()

2) bdrv_do_query_node_info() (see previous commit);

Signed-off-by: Fabiano Rosas 
---
  block/qapi.c |  8 
  include/block/qapi.h | 12 
  2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index a2e71edaff..20660e15d6 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -41,10 +41,10 @@
  #include "qemu/qemu-print.h"
  #include "sysemu/block-backend.h"
  
-BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,

-BlockDriverState *bs,
-bool flat,
-Error **errp)
+BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk,
+BlockDriverState *bs,
+bool flat,
+Error **errp)
  {
  ImageInfo **p_image_info;
  ImageInfo *backing_info;
diff --git a/include/block/qapi.h b/include/block/qapi.h
index 7035bcd1ae..5cb0202791 100644
--- a/include/block/qapi.h
+++ b/include/block/qapi.h
@@ -30,10 +30,14 @@
  #include "block/snapshot.h"
  #include "qapi/qapi-types-block-core.h"
  
-BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,

-BlockDriverState *bs,
-bool flat,
-Error **errp);
+BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk,
+BlockDriverState *bs,
+bool flat,
+Error **errp);
+BlockDeviceInfo *co_wrapper bdrv_block_device_info(BlockBackend *blk,
+   BlockDriverState *bs,
+   bool flat,
+   Error **errp);


bdrv_co_block_device_info() is now marked as GRAPH_RDLOCK, so should 
this use a co_wrapper_bdrv_rdlock instead?


Hanna


  int bdrv_query_snapshot_info_list(BlockDriverState *bs,
SnapshotInfoList **p_list,
Error **errp);





Re: [PATCH v2 07/10] block: Convert qmp_query_named_block_nodes to coroutine

2023-11-06 Thread Hanna Czenczek

On 09.06.23 22:19, Fabiano Rosas wrote:

From: Lin Ma 

We're converting callers of bdrv_get_allocated_file_size() to run in
coroutines because that function will be made asynchronous when called
(indirectly) from the QMP dispatcher.

This QMP command is a candidate because it indirectly calls
bdrv_get_allocated_file_size() through bdrv_block_device_info() ->
bdrv_query_image_info() -> bdrv_query_image_info().

The previous patches have determined that bdrv_query_image_info() and
bdrv_do_query_node_info() are coroutine-safe so we can just make the
QMP command run in a coroutine.

Signed-off-by: Lin Ma 
Signed-off-by: Fabiano Rosas 
---
  block.c  | 2 +-
  blockdev.c   | 6 +++---
  qapi/block-core.json | 3 ++-
  3 files changed, 6 insertions(+), 5 deletions(-)


(I see patch 9 does something with HMP code, but) hmp_info_block() calls 
qmp_query_named_block_nodes(), and I don’t think it may call such a 
coroutine_fn directly.



diff --git a/block.c b/block.c
index f94cee8930..abed744b60 100644
--- a/block.c
+++ b/block.c
@@ -6148,7 +6148,7 @@ BlockDeviceInfoList *bdrv_named_nodes_list(bool flat,
  
  list = NULL;

  QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) {
-BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp);
+BlockDeviceInfo *info = bdrv_co_block_device_info(NULL, bs, flat, errp);


As far as I understand, only functions marked as coroutine_fn may call 
other coroutine_fn.  Regardless of whether bdrv_named_nodes_list() is 
only called by another coroutine_fn, we still have to mark it as 
coroutine_fn, too (and probably rename it to bdrv_co_named_nodes_list()).


Also, this function (bdrv_named_nodes_list()) uses 
GRAPH_RDLOCK_GUARD_MAINLOOP().  Is that the correct thing to use in a 
coroutine context?


Hanna


  if (!info) {
  qapi_free_BlockDeviceInfoList(list);
  return NULL;
diff --git a/blockdev.c b/blockdev.c
index e6eba61484..8b5f7d06c8 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2818,9 +2818,9 @@ void qmp_drive_backup(DriveBackup *backup, Error **errp)
  blockdev_do_action(&action, errp);
  }
  
-BlockDeviceInfoList *qmp_query_named_block_nodes(bool has_flat,

- bool flat,
- Error **errp)
+BlockDeviceInfoList *coroutine_fn qmp_query_named_block_nodes(bool has_flat,
+  bool flat,
+  Error **errp)
  {
  bool return_flat = has_flat && flat;
  
diff --git a/qapi/block-core.json b/qapi/block-core.json

index 5dd5f7e4b0..9d4c92f2c9 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1972,7 +1972,8 @@
  { 'command': 'query-named-block-nodes',
'returns': [ 'BlockDeviceInfo' ],
'data': { '*flat': 'bool' },
-  'allow-preconfig': true }
+  'allow-preconfig': true,
+  'coroutine': true}
  
  ##

  # @XDbgBlockGraphNodeType:





Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap

2023-11-03 Thread Hanna Czenczek

On 03.11.23 16:51, Hanna Czenczek wrote:
On 20.10.23 23:56, Andrey Drobyshev wrote: 


[...]


@@ -528,6 +543,14 @@ for use_backing_file in yes no; do
  else
  _make_test_img -o extended_l2=on 1M
  fi
+    # Write cluster #0 and discard its subclusters #0-#3
+    $QEMU_IO -c 'write -q 0 64k' "$TEST_IMG"
+    before=$(disk_usage "$TEST_IMG")
+    $QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG"
+    after=$(disk_usage "$TEST_IMG")
+    _verify_du_delta $before $after 8192
+    alloc="$(seq 4 31)"; zero="$(seq 0 3)"
+    _verify_l2_bitmap 0
  # Write clusters #0-#2 and then discard them
  $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG"
  $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG"


Similarly to above, I think it would be good if we combined this 
following case with the one you added, i.e. to write 128k from the 
beginning, drop the write here, and change the discard to be “discard 
-q 8k 120k”, i.e. skip the subclusters we have already discarded, to 
see that this is still combined to discard the whole first cluster.


...Ah, see, and when I try this, the following assertion fails:

qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion 
`c->entries[i].ref == 0' failed.
./common.rc: line 220: 128894 Aborted (core dumped) ( 
VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec 
"${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" )


Looks like an L2 table is leaked somewhere.  That’s why SCRI should be 
a g_auto()-able type.


Forgot to add: This single test case here is the only place where we 
test the added functionality.  I think there should be more cases. It 
doesn’t really make sense now that 271 has so many cases for writing 
zeroes, but so few for discarding, now that discarding works on 
subclusters.  Most of them should at least be considered whether we can 
run them for discard as well.


I didn’t want to push for such an extensive set of tests, but, well, now 
it turned out I overlooked a bug in patch 4, and only found it because I 
thought “this place might also make a nice test case for this series”.


Hanna




Re: [PATCH 4/7] qcow2: make subclusters discardable

2023-11-03 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

This commit makes the discard operation work on the subcluster level
rather than cluster level.  It introduces discard_l2_subclusters()
function and makes use of it in qcow2 discard implementation, much like
it's done with zero_in_l2_slice() / zero_l2_subclusters().  It also
changes the qcow2 driver pdiscard_alignment to subcluster_size.  That
way subcluster-aligned discards lead to actual fallocate(PUNCH_HOLE)
operation and free host disk space.

This feature will let us gain additional disk space on guest
TRIM/discard requests, especially when using large enough clusters
(1M, 2M) with subclusters enabled.

Signed-off-by: Andrey Drobyshev 
---
  block/qcow2-cluster.c | 100 --
  block/qcow2.c |   8 ++--
  2 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 7c6fa5524c..cf40f2dc12 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters,
  return nb_clusters;
  }
  
+static int coroutine_fn GRAPH_RDLOCK

+discard_l2_subclusters(BlockDriverState *bs, uint64_t offset,
+   uint64_t nb_subclusters,
+   enum qcow2_discard_type type,
+   bool full_discard,
+   SubClusterRangeInfo *pscri)
+{
+BDRVQcow2State *s = bs->opaque;
+uint64_t new_l2_bitmap, l2_bitmap_mask;
+int ret, sc = offset_to_sc_index(s, offset);
+SubClusterRangeInfo scri = { 0 };
+
+if (!pscri) {
+ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
+if (ret < 0) {
+goto out;
+}
+} else {
+scri = *pscri;
+}
+
+l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+new_l2_bitmap = scri.l2_bitmap;
+new_l2_bitmap &= ~l2_bitmap_mask;
+
+/*
+ * If there're no allocated subclusters left, we might as well discard
+ * the entire cluster.  That way we'd also update the refcount table.
+ */
+if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) {
+return discard_in_l2_slice(bs,
+   QEMU_ALIGN_DOWN(offset, s->cluster_size),
+   1, type, full_discard);
+}


scri.l2_slice is leaked here.

Hanna




Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap

2023-11-03 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

Add _verify_du_delta() checker which is used to check that real disk
usage delta meets the expectations.  For now we use it for checking that
subcluster-based discard/unmap operations lead to actual disk usage
decrease (i.e. PUNCH_HOLE operation is performed).


I’m not too happy about checking the disk usage because that relies on 
the underlying filesystem actually accepting and executing the unmap.  
Why is it not enough to check the L2 bitmap?


…Coming back later (I had to fix the missing `ret = ` I mentioned in 
patch 2, or this test would hang, so I couldn’t run it at first), I note 
that checking the disk usage in fact doesn’t work on tmpfs.  I usually 
run the iotests in tmpfs, so that’s not great.
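For reference, a rough stand-in for the disk_usage helper used by the test (assuming GNU coreutils; the real iotests helper may differ) reports the bytes actually allocated to a file rather than its apparent size, which is exactly why the result depends on the filesystem honoring the unmap:

```shell
#!/bin/sh
# Rough stand-in for the iotests disk_usage helper (assumes GNU du):
# print the bytes actually allocated to a file, not its apparent size.
disk_usage() {
    du --block-size=1 "$1" | awk '{print $1}'
}

# Written (incompressible) data must be allocated, so du reports it...
f=$(mktemp)
dd if=/dev/urandom of="$f" bs=4096 count=4 status=none
allocated=$(disk_usage "$f")

# ...whereas a truncated (never-written) range is not.
: > "$f"
after_trunc=$(disk_usage "$f")
rm -f "$f"
```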



Also add separate test case for discarding particular subcluster within
one cluster.

Signed-off-by: Andrey Drobyshev 
---
  tests/qemu-iotests/271 | 25 -
  tests/qemu-iotests/271.out |  2 ++
  2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
index c7c2cadda0..5fcb209f5f 100755
--- a/tests/qemu-iotests/271
+++ b/tests/qemu-iotests/271
@@ -81,6 +81,15 @@ _verify_l2_bitmap()
  fi
  }
  
+# Check disk usage delta after a discard/unmap operation
+# _verify_du_delta $before $after $expected_delta
+_verify_du_delta()
+{
+if [ $(($1 - $2)) -ne $3 ]; then
+printf "ERROR: unexpected delta: $1 - $2 = $(($1 - $2)) != $3\n"
+fi
+}
+
  # This should be called as _run_test c=XXX sc=XXX off=XXX len=XXX cmd=XXX
  # c:   cluster number (0 if unset)
  # sc:  subcluster number inside cluster @c (0 if unset)
@@ -198,9 +207,12 @@ for use_backing_file in yes no; do
  alloc="$(seq 0 31)"; zero=""
  _run_test sc=0 len=64k
  
-### Zero and unmap half of cluster #0 (this won't unmap it)
+### Zero and unmap half of cluster #0 (this will unmap it)


I think “it” refers to the cluster, and it is not unmapped.  This test 
case does not use a discard, but write -z instead, so it worked before.  
(The L2 bitmap shown in the output doesn’t change, so functionally, this 
patch series didn’t change this case.)



  alloc="$(seq 16 31)"; zero="$(seq 0 15)"
+before=$(disk_usage "$TEST_IMG")
  _run_test sc=0 len=32k cmd=unmap
+after=$(disk_usage "$TEST_IMG")
+_verify_du_delta $before $after 32768
  
  ### Zero and unmap cluster #0

  alloc=""; zero="$(seq 0 31)"


For this following case shown truncated here, why don’t we try 
“_run_test sc=16 len=32k cmd=unmap” instead of “sc=0 len=64k”?  I.e. 
unmap only the second half, which, thanks to patch 3, should still unmap 
the whole cluster, because the first half is already unmapped.



@@ -447,7 +459,10 @@ for use_backing_file in yes no; do
  
  # Subcluster-aligned request from clusters #12 to #14

  alloc="$(seq 0 15)"; zero="$(seq 16 31)"
+before=$(disk_usage "$TEST_IMG")
  _run_test c=12 sc=16 len=128k cmd=unmap
+after=$(disk_usage "$TEST_IMG")
+_verify_du_delta $before $after $((128 * 1024))
  alloc=""; zero="$(seq 0 31)"
  _verify_l2_bitmap 13
  alloc="$(seq 16 31)"; zero="$(seq 0 15)"
@@ -528,6 +543,14 @@ for use_backing_file in yes no; do
  else
  _make_test_img -o extended_l2=on 1M
  fi
+# Write cluster #0 and discard its subclusters #0-#3
+$QEMU_IO -c 'write -q 0 64k' "$TEST_IMG"
+before=$(disk_usage "$TEST_IMG")
+$QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG"
+after=$(disk_usage "$TEST_IMG")
+_verify_du_delta $before $after 8192
+alloc="$(seq 4 31)"; zero="$(seq 0 3)"
+_verify_l2_bitmap 0
  # Write clusters #0-#2 and then discard them
  $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG"
  $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG"


Similarly to above, I think it would be good if we combined this 
following case with the one you added, i.e. to write 128k from the 
beginning, drop the write here, and change the discard to be “discard -q 
8k 120k”, i.e. skip the subclusters we have already discarded, to see 
that this is still combined to discard the whole first cluster.


...Ah, see, and when I try this, the following assertion fails:

qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion 
`c->entries[i].ref == 0' failed.
./common.rc: line 220: 128894 Aborted (core dumped) ( 
VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec 
"${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" )


Looks like an L2 table is leaked somewhere.  That’s why SCRI should be a 
g_auto()-able type.


Hanna


diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out
index 5be780de76..0da8d72cde 100644
--- a/tests/qemu-iotests/271.out
+++ b/tests/qemu-iotests/271.out
@@ -426,6 +426,7 @@ L2 entry #29: 0x 
  ### Discarding clusters with non-zero bitmaps (backing file: yes) ###
  
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 

Re: [PATCH 6/7] iotests/common.rc: add disk_usage function

2023-11-03 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

Move the definition from iotests/250 to common.rc.  This is used to
detect real disk usage of sparse files.  In particular, we want to use
it for checking subclusters-based discards.

Signed-off-by: Andrey Drobyshev 
---
  tests/qemu-iotests/250   | 5 -
  tests/qemu-iotests/common.rc | 6 ++
  2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/250 b/tests/qemu-iotests/250
index af48f83aba..c0a0dbc0ff 100755
--- a/tests/qemu-iotests/250
+++ b/tests/qemu-iotests/250
@@ -52,11 +52,6 @@ _unsupported_imgopts data_file
  # bdrv_co_truncate(bs->file) call in qcow2_co_truncate(), which might succeed
  # anyway.
  
-disk_usage()

-{
-du --block-size=1 $1 | awk '{print $1}'
-}
-
  size=2100M
  
  _make_test_img -o "cluster_size=1M,preallocation=metadata" $size

diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index 95c12577dd..5d2ea26c7f 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -140,6 +140,12 @@ _optstr_add()
  fi
  }
  
+# report real disk usage for sparse files
+disk_usage()
+{
+du --block-size=1 $1 | awk '{print $1}'


Pre-existing, but since you’re touching this now: Can you please change 
the $1 to "$1"?


Hanna


+}
+
  # Set the variables to the empty string to turn Valgrind off
  # for specific processes, e.g.
  # $ VALGRIND_QEMU_IO= ./check -qcow2 -valgrind 015





Re: [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested

2023-11-03 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

When zeroizing subclusters within a single cluster, detect usage of the
BDRV_REQ_MAY_UNMAP flag and fall through to the subcluster-based discard
operation, much like it's done with the cluster-based discards.  That
way subcluster-aligned operations "qemu-io -c 'write -z -u ...'" will
lead to actual unmap.


Ever since the introduction of discard-no-unref, I wonder whether that 
is actually an advantage or not.  I can’t think of a reason why it’d be 
advantageous to drop the allocation.


On the other hand, zero_in_l2_slice() does it indeed, so consistency is 
a reasonable argument.



Signed-off-by: Andrey Drobyshev 
---
  block/qcow2-cluster.c | 17 ++---
  1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index cf40f2dc12..040251f2c3 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2242,7 +2242,7 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
  unsigned nb_subclusters, int flags)
  {
  BDRVQcow2State *s = bs->opaque;
-uint64_t new_l2_bitmap;
+uint64_t new_l2_bitmap, l2_bitmap_mask;
  int ret, sc = offset_to_sc_index(s, offset);
  SubClusterRangeInfo scri = { 0 };
  
@@ -2251,9 +2251,10 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
  goto out;
  }
  
+l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);


“l2_bitmap_mask” already wasn’t a great name in patch 4 because it 
doesn’t say what mask it is, but in patch 4, there was at least only one 
relevant mask.  Here, we have two (ALLOC_RANGE and ZERO_RANGE), so the 
name should reflect what kind of mask it is.



  new_l2_bitmap = scri.l2_bitmap;
-new_l2_bitmap |=  QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
-new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
+new_l2_bitmap &= ~l2_bitmap_mask;
  
  /*

   * If there're no non-zero subclusters left, we might as well zeroize
@@ -2266,6 +2267,16 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t 
offset,
  1, flags);
  }
  
+/*
+ * If the request allows discarding subclusters and they're actually
+ * allocated, we go down the discard path since after the discard
+ * operation the subclusters are going to be read as zeroes anyway.
+ */
+if ((flags & BDRV_REQ_MAY_UNMAP) && (scri.l2_bitmap & l2_bitmap_mask)) {
+return discard_l2_subclusters(bs, offset, nb_subclusters,
+  QCOW2_DISCARD_REQUEST, false, &scri);
+}


I wonder why it matters whether any subclusters are allocated, i.e. why 
can’t we just immediately use discard_l2_subclusters() whenever 
BDRV_REQ_MAY_UNMAP is set, without even fetching SCRI (that would also 
save us passing SCRI here, which I’ve already said on patch 4, I don’t 
find great).


Doesn’t discard_l2_subclusters() guarantee the clusters read as zero 
when full_discard=false?  There is this case where it won’t mark the 
subclusters as zero if there is no backing file, and none of the 
subclusters is allocated.  But still, (A) those subclusters will read as 
zero, and (B) if there is a problem there, why don’t we just always mark 
those subclusters as zero?  This optimization only has effect when the 
guest discards/zeroes subclusters (not whole clusters) that were already 
discarded, so sounds miniscule.


Hanna


+
  if (new_l2_bitmap != scri.l2_bitmap) {
  set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
  qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);





Re: [PATCH v5 0/7] vhost-user: Back-end state migration

2023-11-02 Thread Hanna Czenczek

On 16.10.23 15:42, Hanna Czenczek wrote:

Based-on: <20231004014532.1228637-1-stefa...@redhat.com>
   ([PATCH v2 0/3] vhost: clean up device reset)

Based-on: <20231016083201.23736-1-hre...@redhat.com>
   ([PATCH] vhost-user: Fix protocol feature bit conflict)


Hi,

v5 is basically the same as v4, only that I’ve dropped the patch
deprecating F_STATUS (which doesn’t affect the rest of the series), that
I’ve amended the documentation in patch 1 as suggested by Stefan and
with help from Michael, and that I’ve rebased everything on top of the
F_SHARED_OBJECT changes that have been merged upstream.


Ping – both of the dependencies are merged.

Hanna




Re: [PATCH] block-jobs: add final flush

2023-11-02 Thread Hanna Czenczek

On 01.11.23 20:53, Vladimir Sementsov-Ogievskiy wrote:

On 31.10.23 17:05, Hanna Czenczek wrote:

On 04.10.23 15:56, Vladimir Sementsov-Ogievskiy wrote:

From: Vladimir Sementsov-Ogievskiy 

Actually block job is not completed without the final flush. It's
rather unexpected to have broken target when job was successfully
completed long ago and now we fail to flush or process just
crashed/killed.

Mirror job already has mirror_flush() for this. So, it's OK.

Add similar things for other jobs: backup, stream, commit.

Note that stream has a (documented) different treatment of the IGNORE
action: it doesn't retry the operation, but continues execution and
reports the error at the end. We keep that for the final flush too.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

Was: [PATCH v4] block-jobs: flush target at the end of .run()
   But now rewritten.
Supersedes: <20230725174008.1147467-1-vsement...@yandex-team.ru>

  block/backup.c |  2 +-
  block/block-copy.c |  7 +++
  block/commit.c | 16 
  block/stream.c | 21 +
  include/block/block-copy.h |  1 +
  5 files changed, 38 insertions(+), 9 deletions(-)


[...]


diff --git a/block/commit.c b/block/commit.c
index aa45beb0f0..5205c77ec9 100644
--- a/block/commit.c
+++ b/block/commit.c


[...]

@@ -187,7 +187,15 @@ static int coroutine_fn commit_run(Job *job, 
Error **errp)

  }
  }
-    return 0;
+    do {
+    ret = blk_co_flush(s->base);
+    if (ret < 0) {
+    action = block_job_error_action(&s->common, s->on_error,
+    false, -ret);
+    }
+    } while (ret < 0 && action != BLOCK_ERROR_ACTION_REPORT);


Do we need to yield in this loop somewhere so that 
BLOCK_ERROR_ACTION_STOP can pause the job?




block_job_error_action calls job_pause_locked() itself in this case


But that doesn’t really pause the job, does it?  As far as I understand, 
it increases job->pause_count, then enters the job, and the job is then 
supposed to yield at some point so job_pause_point_locked() is called, 
which sees the increased job->pause_count and will actually pause the job.


Hanna




Re: [PATCH 4/7] qcow2: make subclusters discardable

2023-10-31 Thread Hanna Czenczek

(Sorry, opened another reply window, forgot I already had one open...)

On 20.10.23 23:56, Andrey Drobyshev wrote:

This commit makes the discard operation work on the subcluster level
rather than cluster level.  It introduces discard_l2_subclusters()
function and makes use of it in qcow2 discard implementation, much like
it's done with zero_in_l2_slice() / zero_l2_subclusters().  It also
changes the qcow2 driver pdiscard_alignment to subcluster_size.  That
way subcluster-aligned discards lead to actual fallocate(PUNCH_HOLE)
operation and free host disk space.

This feature will let us gain additional disk space on guest
TRIM/discard requests, especially when using large enough clusters
(1M, 2M) with subclusters enabled.

Signed-off-by: Andrey Drobyshev 
---
  block/qcow2-cluster.c | 100 --
  block/qcow2.c |   8 ++--
  2 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 7c6fa5524c..cf40f2dc12 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t 
offset, uint64_t nb_clusters,
  return nb_clusters;
  }
  
+static int coroutine_fn GRAPH_RDLOCK
+discard_l2_subclusters(BlockDriverState *bs, uint64_t offset,
+   uint64_t nb_subclusters,
+   enum qcow2_discard_type type,
+   bool full_discard,
+   SubClusterRangeInfo *pscri)
+{
+BDRVQcow2State *s = bs->opaque;
+uint64_t new_l2_bitmap, l2_bitmap_mask;
+int ret, sc = offset_to_sc_index(s, offset);
+SubClusterRangeInfo scri = { 0 };
+
+if (!pscri) {
+ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
+if (ret < 0) {
+goto out;
+}
+} else {
+scri = *pscri;


Allowing the caller to pass this in sounds dangerous, considering we 
need to track who takes care of freeing scri.l2_slice.



+}
+
+l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+new_l2_bitmap = scri.l2_bitmap;
+new_l2_bitmap &= ~l2_bitmap_mask;
+
+/*
+ * If there're no allocated subclusters left, we might as well discard
+ * the entire cluster.  That way we'd also update the refcount table.
+ */
+if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) {


What if there are subclusters in the cluster that are marked as zero, 
outside of the discarded range?  It sounds wrong to apply a discard with 
either full_discard set or cleared to them.



+return discard_in_l2_slice(bs,
+   QEMU_ALIGN_DOWN(offset, s->cluster_size),
+   1, type, full_discard);
+}
+
+/*
+ * Full discard means we fall through to the backing file, thus we only
+ * need to mark the subclusters as deallocated.


I think it also means we need to clear the zero bits.

Hanna


+ *
+ * Non-full discard means subclusters should be explicitly marked as
+ * zeroes.  In this case QCOW2 specification requires the corresponding
+ * allocation status bits to be unset as well.  If the subclusters are
+ * deallocated in the first place and there's no backing, the operation
+ * can be skipped.
+ */
+if (!full_discard &&
+(bs->backing || scri.l2_bitmap & l2_bitmap_mask)) {
+new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
+}





Re: [PATCH 4/7] qcow2: make subclusters discardable

2023-10-31 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

This commit makes the discard operation work on the subcluster level
rather than cluster level.  It introduces discard_l2_subclusters()
function and makes use of it in qcow2 discard implementation, much like
it's done with zero_in_l2_slice() / zero_l2_subclusters().  It also
changes the qcow2 driver pdiscard_alignment to subcluster_size.  That
way subcluster-aligned discards lead to actual fallocate(PUNCH_HOLE)
operation and free host disk space.

This feature will let us gain additional disk space on guest
TRIM/discard requests, especially when using large enough clusters
(1M, 2M) with subclusters enabled.

Signed-off-by: Andrey Drobyshev 
---
  block/qcow2-cluster.c | 100 --
  block/qcow2.c |   8 ++--
  2 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 7c6fa5524c..cf40f2dc12 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c


[...]


+if (scri.l2_bitmap != new_l2_bitmap) {
+set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
+}
+
+if (s->discard_passthrough[type]) {
+qcow2_queue_discard(bs, (scri.l2_entry & L2E_OFFSET_MASK) +
+offset_into_cluster(s, offset),
+nb_subclusters * s->subcluster_size);


Are we sure that the cluster is allocated, i.e. that scri.l2_entry & 
L2E_OFFSET_MASK != 0?


As a side note, I guess discard_in_l2_slice() should also use 
qcow2_queue_discard() instead of bdrv_pdiscard() then.



+}
+
+ret = 0;
+out:
+qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice);
+
+return ret;
+}
+
  int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
uint64_t bytes, enum qcow2_discard_type type,
bool full_discard)
@@ -2049,19 +2117,36 @@ int qcow2_cluster_discard(BlockDriverState *bs, 
uint64_t offset,
  BDRVQcow2State *s = bs->opaque;
  uint64_t end_offset = offset + bytes;
  uint64_t nb_clusters;
+unsigned head, tail;
  int64_t cleared;
  int ret;
  
  /* Caller must pass aligned values, except at image end */

-assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
-assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
+assert(QEMU_IS_ALIGNED(offset, s->subcluster_size));
+assert(QEMU_IS_ALIGNED(end_offset, s->subcluster_size) ||
 end_offset == bs->total_sectors << BDRV_SECTOR_BITS);
  
-nb_clusters = size_to_clusters(s, bytes);
+head = MIN(end_offset, ROUND_UP(offset, s->cluster_size)) - offset;
+offset += head;
+
+tail = (end_offset >= bs->total_sectors << BDRV_SECTOR_BITS) ? 0 :
+   end_offset - MAX(offset, start_of_cluster(s, end_offset));
+end_offset -= tail;
  
  s->cache_discards = true;
  
+if (head) {
+ret = discard_l2_subclusters(bs, offset - head,
+ size_to_subclusters(s, head), type,
+ full_discard, NULL);
+if (ret < 0) {
+goto fail;
+}
+}
+
  /* Each L2 slice is handled by its own loop iteration */
+nb_clusters = size_to_clusters(s, end_offset - offset);
+
  while (nb_clusters > 0) {


I think the comment should stay attached to the `while`.

Hanna


  cleared = discard_in_l2_slice(bs, offset, nb_clusters, type,
full_discard);





Re: [PATCH 3/7] qcow2: zeroize the entire cluster when there're no non-zero subclusters

2023-10-31 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

When zeroizing the last non-zero subclusters within a single cluster, it
makes sense to zeroize the entire cluster and go down the
zero_in_l2_slice() path right away.  That way we'd also update the
corresponding refcount table.

Signed-off-by: Andrey Drobyshev 
---
  block/qcow2-cluster.c | 18 +++---
  1 file changed, 15 insertions(+), 3 deletions(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH 2/7] qcow2: add get_sc_range_info() helper for working with subcluster ranges

2023-10-31 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

This helper simply obtains the l2 table parameters of the cluster which
contains the given subclusters range.  Right now this info is being
obtained and used by zero_l2_subclusters().  As we're about to introduce
the subclusters discard operation, this helper would let us avoid code
duplication.

Also introduce struct SubClusterRangeInfo, which would contain all the
needed params.

Signed-off-by: Andrey Drobyshev 
---
  block/qcow2-cluster.c | 90 +--
  1 file changed, 62 insertions(+), 28 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 904f00d1b3..8801856b93 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -32,6 +32,13 @@
  #include "qemu/memalign.h"
  #include "trace.h"
  
+typedef struct SubClusterRangeInfo {

+uint64_t *l2_slice;


We should document that this is a strong reference that must be returned 
via qcow2_cache_put().  Maybe you could also define a clean-up function 
using G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC() that does this, allowing this 
type to be used with g_auto().



+int l2_index;
+uint64_t l2_entry;
+uint64_t l2_bitmap;
+} SubClusterRangeInfo;
+
  int coroutine_fn qcow2_shrink_l1_table(BlockDriverState *bs,
 uint64_t exact_size)
  {
@@ -1892,6 +1899,50 @@ again:
  return 0;
  }
  
+static int get_sc_range_info(BlockDriverState *bs, uint64_t offset,
+ unsigned nb_subclusters,
+ SubClusterRangeInfo *scri)


It would be good to have documentation for this function, for example 
that it only works on a single cluster, i.e. that the range denoted by 
@offset and @nb_subclusters must not cross a cluster boundary, and that 
@offset must be aligned to subclusters.


In general, it is unclear to me at this point what this function does.  
OK, it gets the SCRI for all subclusters in the cluster at @offset (this 
is what its name implies), but then it also has this loop that checks 
whether there are compressed clusters among the @nb_subclusters.  It has 
a comment about being unable to zero/discard subclusters in compressed 
clusters, but the function name says nothing about this scope of 
zeroing/discarding.



+{
+BDRVQcow2State *s = bs->opaque;
+int ret, sc_cleared = 0, sc_index = offset_to_sc_index(s, offset);
+QCow2SubclusterType sctype;
+
+/* Here we only work with the subclusters within single cluster. */
+assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster);
+assert(sc_index + nb_subclusters <= s->subclusters_per_cluster);
+assert(offset_into_subcluster(s, offset) == 0);
+
+ret = get_cluster_table(bs, offset, &scri->l2_slice, &scri->l2_index);
+if (ret < 0) {
+return ret;
+}
+
+scri->l2_entry = get_l2_entry(s, scri->l2_slice, scri->l2_index);
+scri->l2_bitmap = get_l2_bitmap(s, scri->l2_slice, scri->l2_index);
+
+do {
+qcow2_get_subcluster_range_type(bs, scri->l2_entry, scri->l2_bitmap,
+sc_index, &sctype);


I think there’s a `ret = ` missing here.


+if (ret < 0) {
+return ret;
+}
+
+switch (sctype) {
+case QCOW2_SUBCLUSTER_COMPRESSED:
+/* We cannot partially zeroize/discard compressed clusters. */
+return -ENOTSUP;
+case QCOW2_SUBCLUSTER_INVALID:
+return -EINVAL;
+default:
+break;
+}
+
+sc_cleared += ret;
+} while (sc_cleared < nb_subclusters);
+
+return 0;
+}
+
  /*
   * This discards as many clusters of nb_clusters as possible at once (i.e.
   * all clusters in the same L2 slice) and returns the number of discarded
@@ -2097,44 +2148,27 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t 
offset,
  unsigned nb_subclusters)
  {
  BDRVQcow2State *s = bs->opaque;
-uint64_t *l2_slice;
-uint64_t old_l2_bitmap, l2_bitmap;
-int l2_index, ret, sc = offset_to_sc_index(s, offset);
+uint64_t new_l2_bitmap;
+int ret, sc = offset_to_sc_index(s, offset);
+SubClusterRangeInfo scri = { 0 };
  
-/* For full clusters use zero_in_l2_slice() instead */
-assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster);
-assert(sc + nb_subclusters <= s->subclusters_per_cluster);
-assert(offset_into_subcluster(s, offset) == 0);
-
-ret = get_cluster_table(bs, offset, &l2_slice, &l2_index);
+ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
  if (ret < 0) {
-return ret;
-}
-
-switch (qcow2_get_cluster_type(bs, get_l2_entry(s, l2_slice, l2_index))) {
-case QCOW2_CLUSTER_COMPRESSED:
-ret = -ENOTSUP; /* We cannot partially zeroize compressed clusters */
  goto out;
-case QCOW2_CLUSTER_NORMAL:
-case QCOW2_CLUSTER_UNALLOCATED:
-break;
-default:
-g_assert_not_reached();
  }
  
-

Re: [PATCH 1/7] qcow2: make function update_refcount_discard() global

2023-10-31 Thread Hanna Czenczek

On 20.10.23 23:56, Andrey Drobyshev wrote:

We are going to need it for discarding separate subclusters.  The
function itself doesn't do anything with the refcount tables,


I think the idea behind the naming was that updating refcounts can lead 
to clusters being discarded, i.e. update_refcount_discard() did the 
“discard” part of “update_refcount”.



it simply
adds a discard request to the queue, so rename it to qcow2_queue_discard().


But that’s fine, too.


Signed-off-by: Andrey Drobyshev 
---
  block/qcow2-refcount.c | 8 
  block/qcow2.h  | 2 ++
  2 files changed, 6 insertions(+), 4 deletions(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH v2 1/1] qemu-img: do not erase destination file in qemu-img dd command

2023-10-31 Thread Hanna Czenczek

On 01.10.23 22:46, Denis V. Lunev wrote:

Can you please not top-post. This makes the discussion complex. This
approach is followed in this mailing list and in other similar lists
like LKML.

On 10/1/23 19:08, Mike Maslenkin wrote:

I thought about "conv=notrunc", but my main concern is changed virtual
disk metadata.
It depends on how qemu-img is used.
Maybe I followed the wrong pattern, but the pros and cons of adding a
"conv" parameter were not on my mind in the scope of the first patch
version.
I see 4 obvious ways of using `qemu-img dd`:
1. Copy virtual disk data between images of the same format. I think
disk geometry must be preserved in this case.
2. Copy virtual disk data between different formats. Is that a valid
pattern? Maybe `qemu-img convert` should be used instead?
3. Merge snapshots to a specified disk image, i.e. read the current
state and write it to a new disk image.
4. Copy virtual disk data to a raw binary file. Actually, this patch
breaks 'dd' behavior for this case when the source image is smaller (in
terms of logical blocks) than the existing raw binary file.
   Maybe for this case the condition can be improved to something like
   if (strcmp(fmt, "raw") || !g_file_test(out.filename,
G_FILE_TEST_EXISTS)). And a "conv=notrunc" parameter may be implemented
additionally for this case.

My personal opinion is that qemu-img dd is for when you need to
extract some data from the original image and process it further.
Thus I use it to copy some data into a raw binary file. My next
goal here would be to add the ability to write data to stdout,
which would be beneficial. Though this is out of the equation at
the moment.

Speaking about the approach, though, I would say that the patch
changes current behavior, which is not totally buggy; it is rather
a matter of taste. It should be noted that we are in the Linux
world here, not in the Mac world, where we would be in a position
to avoid options and selections.

Thus my opinion is that the original behavior should be preserved,
as somebody is relying on it. The option you are proposing also
seems valuable to me, and thus the switch should be added. The
switch is well-defined in the original 'dd' world, so either conv=
option would be good, nocreat or notrunc. For me, 'nocreat' seems
more natural.

Anyway, the last word here belongs to either Hanna or Kevin ;)


Personally, and honestly, I see no actual use for qemu-img dd at all, 
because we’re trying to mimic a subset of an interface of a rather 
complex program that has been designed to do what it does. We can only 
fail at that.  Personally, whenever I need dd functionality, I use 
qemu-storage-daemon’s fuse export, and then use the actual dd program on 
top.  Alternatively, qemu-img convert is our native interface; 
unfortunately, its feature set is lacking when compared to qemu-img dd, 
but I think it would be better to improve that rather than working on 
qemu-img dd.


I tend to agree with you, Denis, though, that this patch changes 
behavior, and users may be relying on the current behavior.


Comparing to “what would dd do” is difficult because dd just has no 
concept of file formats.  (That’s another reason why I think it’s bad to 
provide a dd-like interface, and users should rather use dd itself, or 
qemu-img convert.)  But in any case, adding conv=notrunc to keep the 
existing target file would seem fair to me.  I understand conv=nocreat 
would advise dd to never create the target file, which is something 
slightly different (i.e. conv=notrunc will still create a new target 
file if it doesn’t exist yet, but won’t create/truncate one if it 
already exists; while conv=nocreat would disable creating a new target 
file if it doesn’t exist yet, but still truncate it if it does exist; 
using both together would ensure that the target file is never 
created/truncated).


Summary: If we do this under a new conv=notrunc, fine with me.  I just 
don’t think qemu-img dd is something that should be used at all.


Hanna




Re: [PATCH] block-jobs: add final flush

2023-10-31 Thread Hanna Czenczek

On 04.10.23 15:56, Vladimir Sementsov-Ogievskiy wrote:

From: Vladimir Sementsov-Ogievskiy 

Actually block job is not completed without the final flush. It's
rather unexpected to have broken target when job was successfully
completed long ago and now we fail to flush or process just
crashed/killed.

Mirror job already has mirror_flush() for this. So, it's OK.

Add similar things for other jobs: backup, stream, commit.

Note that stream has a (documented) different treatment of the IGNORE
action: it doesn't retry the operation, but continues execution and
reports the error at the end. We keep that for the final flush too.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

Was: [PATCH v4] block-jobs: flush target at the end of .run()
   But now rewritten.
Supersedes: <20230725174008.1147467-1-vsement...@yandex-team.ru>

  block/backup.c |  2 +-
  block/block-copy.c |  7 +++
  block/commit.c | 16 
  block/stream.c | 21 +
  include/block/block-copy.h |  1 +
  5 files changed, 38 insertions(+), 9 deletions(-)


[...]


diff --git a/block/commit.c b/block/commit.c
index aa45beb0f0..5205c77ec9 100644
--- a/block/commit.c
+++ b/block/commit.c


[...]


@@ -187,7 +187,15 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
  }
  }
  
-return 0;
+do {
+ret = blk_co_flush(s->base);
+if (ret < 0) {
+action = block_job_error_action(&s->common, s->on_error,
+false, -ret);
+}
+} while (ret < 0 && action != BLOCK_ERROR_ACTION_REPORT);


Do we need to yield in this loop somewhere so that 
BLOCK_ERROR_ACTION_STOP can pause the job?


Hanna




Re: [PATCH 2/2] iotests: Test media change with iothreads

2023-10-31 Thread Hanna Czenczek

On 13.10.23 17:33, Kevin Wolf wrote:

iotests case 118 already tests all relevant operations for media change
with multiple devices, however never with iothreads. This changes the
test so that the virtio-scsi tests run with an iothread.

Signed-off-by: Kevin Wolf 
---
  tests/qemu-iotests/118 | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH 1/2] block: Fix locking in media change monitor commands

2023-10-31 Thread Hanna Czenczek

On 13.10.23 17:33, Kevin Wolf wrote:

blk_insert_bs() requires that the caller holds the AioContext lock for
the node to be inserted. Since commit c066e808e11, neglecting to do so
causes a crash when the child has to be moved to a different AioContext
to attach it to the BlockBackend.

This fixes qmp_blockdev_insert_anon_medium(), which is called for the
QMP commands 'blockdev-insert-medium' and 'blockdev-change-medium', to
correctly take the lock.

Cc: qemu-sta...@nongnu.org
Fixes: https://issues.redhat.com/browse/RHEL-3922
Fixes: c066e808e11a5c181b625537b6c78e0de27a4801
Signed-off-by: Kevin Wolf 
---
  block/qapi-sysemu.c | 5 +
  1 file changed, 5 insertions(+)


Do we need to take the lock for the dev_ops tray callbacks, too?  I 
suppose not, and it also wouldn’t really matter in light of the lock 
being supposed to go away anyway, but still thought I should ask.


In any case, this change here is necessary, so:

Reviewed-by: Hanna Czenczek 




Re: [PATCH v2] block/file-posix: fix update_zones_wp() caller

2023-10-31 Thread Hanna Czenczek

On 25.08.23 06:05, Sam Li wrote:

When a zoned request fails, only the wp of the target zones needs to
be updated, so as not to disrupt the in-flight writes on the other
zones. The wp is updated successfully after the request completes.

Fix the callers to use the right offset and nr_zones.

Signed-off-by: Sam Li
---
  block/file-posix.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)


Thanks, applied to my block branch:

https://gitlab.com/hreitz/qemu/-/commits/block

(Rebased on master, and I’ve also fixed the comment to read “boundaries” 
instead of “bounaries”.  Hope that’s OK!)


Hanna

Re: [PATCH v3] qcow2: keep reference on zeroize with discard-no-unref enabled

2023-10-30 Thread Hanna Czenczek

On 03.10.23 14:52, Jean-Louis Dupond wrote:

When the discard-no-unref flag is enabled, we keep the reference for
normal discard requests.
But when a discard is executed on a snapshot/qcow2 image with backing,
the discards are saved as zero clusters in the snapshot image.

When committing the snapshot to the backing file, zero_in_l2_slice is
called instead of discard_in_l2_slice, which did not have any logic to
keep the reference when discard-no-unref is enabled.

Therefore, add logic to the zero_in_l2_slice call to keep the reference
on commit.

Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
Signed-off-by: Jean-Louis Dupond 
---
  block/qcow2-cluster.c | 22 ++
  qapi/block-core.json  |  7 ---
  qemu-options.hx   |  3 ++-
  3 files changed, 24 insertions(+), 8 deletions(-)


Thanks, applied to my block branch:

https://gitlab.com/hreitz/qemu/-/commits/block

Hanna




Re: [PATCH v3] qcow2: keep reference on zeroize with discard-no-unref enabled

2023-10-27 Thread Hanna Czenczek

On 03.10.23 14:52, Jean-Louis Dupond wrote:

When the discard-no-unref flag is enabled, we keep the reference for
normal discard requests.
But when a discard is executed on a snapshot/qcow2 image with backing,
the discards are saved as zero clusters in the snapshot image.

When committing the snapshot to the backing file, zero_in_l2_slice is
called instead of discard_in_l2_slice, which did not have any logic to
keep the reference when discard-no-unref is enabled.

Therefore, add logic to the zero_in_l2_slice call to keep the reference
on commit.

Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
Signed-off-by: Jean-Louis Dupond 
---
  block/qcow2-cluster.c | 22 ++
  qapi/block-core.json  |  7 ---
  qemu-options.hx   |  3 ++-
  3 files changed, 24 insertions(+), 8 deletions(-)


[...]


diff --git a/qapi/block-core.json b/qapi/block-core.json
index 89751d81f2..9836195850 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3476,15 +3476,16 @@
  # should be issued on other occasions where a cluster gets freed
  #
  # @discard-no-unref: when enabled, discards from the guest will not
-# cause cluster allocations to be relinquished.  This prevents
+# cause cluster allocations to be relinquished. The same will
+# happen for discards triggered by zeroize. This prevents


I don’t think “zeroize” has any meaning outside of qemu’s qcow2 code.  
I’d write “when enabled, data clusters will remain preallocated when 
they are no longer used, e.g. because they are discarded or converted to 
zero clusters.  As usual, whether the old data is discarded or kept on 
the protocol level (i.e. in the image file) depends on the setting of 
the pass-discard-request option. Keeping the clusters preallocated 
prevents qcow2 fragmentation that would otherwise be caused by freeing 
and re-allocating them later. Besides potential performance degradation, 
[...]”


If you’re OK with that, I can change that (here and in qemu-options.hx) 
when taking the patch.



  # qcow2 fragmentation that would be caused by such discards.
  # Besides potential performance degradation, such fragmentation
  # can lead to increased allocation of clusters past the end of the
  # image file, resulting in image files whose file length can grow
-# much larger than their guest disk size would suggest.  If image
+# much larger than their guest disk size would suggest. If image
  # file length is of concern (e.g. when storing qcow2 images
  # directly on block devices), you should consider enabling this
-# option.  (since 8.1)
+# option. (since 8.1)


These two changes don’t seem related, I’d remove them, too. 
(Double-space after '.' is fairly common in block-core.json, and in my 
emails, too. :))


Hanna




Re: [PATCH v8 1/5] qemu-iotests: Filter warnings about block migration being deprecated

2023-10-24 Thread Hanna Czenczek

On 18.10.23 13:55, Juan Quintela wrote:

Create a new filter that removes the two warnings for test 183.

Signed-off-by: Juan Quintela 
---
  tests/qemu-iotests/183   | 2 +-
  tests/qemu-iotests/common.filter | 7 +++
  2 files changed, 8 insertions(+), 1 deletion(-)


Reviewed-by: Hanna Czenczek 




Re: [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings

2023-10-18 Thread Hanna Czenczek

On 18.10.23 14:14, Michael S. Tsirkin wrote:

On Wed, Oct 04, 2023 at 02:58:59PM +0200, Hanna Czenczek wrote:

Currently, the vhost-user documentation says that rings are to be
initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is
negotiated.  However, by the time of feature negotiation, all rings have
already been initialized, so it is not entirely clear what this means.

At least the vhost-user-backend Rust crate's implementation interpreted
it to mean that whenever this feature is negotiated, all rings are to be
put into a disabled state, which means that every SET_FEATURES call
would disable all rings, effectively halting the device.  This is
problematic because the VHOST_F_LOG_ALL feature is also set or cleared
this way, which happens during migration.  Doing so should not halt the
device.

Other implementations have interpreted this to mean that the device is
to be initialized with all rings disabled, and a subsequent SET_FEATURES
call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of
them.  Here, SET_FEATURES will never disable any ring.

This interpretation does not suffer the problem of unintentionally
halting the device whenever features are set or cleared, so it seems
better and more reasonable.

We can clarify this in the documentation by making it explicit that the
enabled/disabled state is tracked even while the vring is stopped.
Every vring is initialized in a disabled state, and SET_FEATURES without
VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all
vrings.

Signed-off-by: Hanna Czenczek 


OK so I am expecting v5. My advice is to move patch 1 to end of patchset
so we can defer it if we want to.


Already sent – I’ve just dropped patch 1, since it doesn’t add anything 
to the objective of the patch series itself:


https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg04727.html

Hanna




Re: [PATCH] vhost-user: Fix protocol feature bit conflict

2023-10-17 Thread Hanna Czenczek

On 17.10.23 09:53, Viresh Kumar wrote:

On 17-10-23, 09:51, Hanna Czenczek wrote:

Not that I’m really opposed to that, but I don’t see the problem with just
doing that in the same work that makes qemu actually use this flag, exactly
because it’s just a -1/+1 change.

I can send a v2, but should I do the same for libvhost-user and define the
flag there?  Do I have to add a patch to do the same for F_STATUS, which so
far only got a placeholder comment?

Sure, that's fine too.


I would rather not, though, and don’t see a tangible benefit in doing so.




Re: [Virtio-fs] (no subject)

2023-10-17 Thread Hanna Czenczek

On 17.10.23 09:49, Viresh Kumar wrote:

On 13-10-23, 20:02, Hanna Czenczek wrote:

On 10.10.23 16:35, Alex Bennée wrote:

I was going to say there is also the rust-vmm vhost-user-master crates
which we've imported:

https://github.com/vireshk/vhost

for the Xen Vhost Frontend:

https://github.com/vireshk/xen-vhost-frontend

but I can't actually see any handling for GET/SET_STATUS at all which
makes me wonder how we actually work. Viresh?

As far as I know the only back-end implementation of F_STATUS is in DPDK.
As I said, if anyone else implemented it right now, that would be dangerous,
because qemu doesn’t adhere to the virtio protocol when it comes to the
status byte.

Yeah, none of the Rust based Virtio backends enable `STATUS` in
`VhostUserProtocolFeatures` and so these messages are never exchanged.

The generic Rust code for the backends doesn't even implement them.
Not sure if they should or not.


It absolutely should not, for evidence see this whole thread.  qemu 
sends a SET_STATUS 0, which amounts to a reset, when the VM is merely 
paused[1], and when it sets status bytes, it does not set them according 
to virtio specification.  Implementing it right now means relying on and 
working around qemu’s implementation-defined spec-breaking behavior.  
Also, note that qemu ignores feature negotiation response through 
FEATURES_OK, and DEVICE_NEEDS_RESET, so unless it’s worth working around 
the problems just to get some form of DRIVER_OK information (note this 
information does not come from the driver, but qemu makes it up), I 
absolutely would not implement it.


[1] Notably, it does restore the virtio state to the best of its 
abilities when the VM is resumed, but this is all still wrong (there is 
no point in doing so much on a pause/resume, it needlessly costs time) 
and any implementation that does a reset then will rely on the 
implementation-defined behavior that qemu is actually able to restore 
all the state that the back-end would lose during a reset. Notably, 
reset is not even well-defined in the vhost-user specification.  It was 
argued, in this thread, that DPDK works just fine with this, precisely 
because it ignores SET_STATUS 0.  Finally, if virtiofsd in particular, 
as a user of the Rust crates, is reset, it would lose its internal 
state, which qemu cannot restore short of using the upcoming migration 
facilities.





Re: [PATCH] vhost-user: Fix protocol feature bit conflict

2023-10-17 Thread Hanna Czenczek

On 17.10.23 07:36, Viresh Kumar wrote:

On 16-10-23, 12:40, Alex Bennée wrote:

Viresh Kumar  writes:


On 16-10-23, 11:45, Manos Pitsidianakis wrote:

On Mon, 16 Oct 2023 11:32, Hanna Czenczek  wrote:

diff --git a/include/hw/virtio/vhost-user.h
b/include/hw/virtio/vhost-user.h
index 9f9ddf878d..1d4121431b 100644
--- a/include/hw/virtio/vhost-user.h
+++ b/include/hw/virtio/vhost-user.h
@@ -29,7 +29,8 @@ enum VhostUserProtocolFeature {
 VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
 VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
 VHOST_USER_PROTOCOL_F_STATUS = 16,
-VHOST_USER_PROTOCOL_F_SHARED_OBJECT = 17,
+/* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP. */
+VHOST_USER_PROTOCOL_F_SHARED_OBJECT = 18,
 VHOST_USER_PROTOCOL_F_MAX
};

May I ask, why not define VHOST_USER_PROTOCOL_F_XEN_MMAP as well instead of
a comment mention?

Perhaps because we will never use it from Qemu code ?

Vikram's work on enabling xenpvh support will mean enabling grant
support and while I suspect most VirtIO backends will be within QEMU
itself if it ever want to off-load something to a vhost-user backend it
will need to ensure this flag is set.

Hanna,

It would be good to define it then in the current commit itself.


Not that I’m really opposed to that, but I don’t see the problem with 
just doing that in the same work that makes qemu actually use this flag, 
exactly because it’s just a -1/+1 change.


I can send a v2, but should I do the same for libvhost-user and define 
the flag there?  Do I have to add a patch to do the same for F_STATUS, 
which so far only got a placeholder comment?


Hanna




[PATCH v5 5/7] vhost-user: Interface for migration state transfer

2023-10-16 Thread Hanna Czenczek
Add the interface for transferring the back-end's state during migration
as defined previously in vhost-user.rst.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Hanna Czenczek 
---
 include/hw/virtio/vhost-backend.h |  24 +
 include/hw/virtio/vhost-user.h|   1 +
 include/hw/virtio/vhost.h |  78 
 hw/virtio/vhost-user.c| 146 ++
 hw/virtio/vhost.c |  37 
 5 files changed, 286 insertions(+)

diff --git a/include/hw/virtio/vhost-backend.h 
b/include/hw/virtio/vhost-backend.h
index 1860b541d8..7d399ec418 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
 VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
 } VhostSetConfigType;
 
+typedef enum VhostDeviceStateDirection {
+/* Transfer state from back-end (device) to front-end */
+VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
+/* Transfer state from front-end to back-end (device) */
+VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
+} VhostDeviceStateDirection;
+
+typedef enum VhostDeviceStatePhase {
+/* The device (and all its vrings) is stopped */
+VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
+} VhostDeviceStatePhase;
+
 struct vhost_inflight;
 struct vhost_dev;
 struct vhost_log;
@@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev 
*dev,
 
 typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
 
+typedef bool (*vhost_supports_device_state_op)(struct vhost_dev *dev);
+typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
+VhostDeviceStateDirection 
direction,
+VhostDeviceStatePhase phase,
+int fd,
+int *reply_fd,
+Error **errp);
+typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error 
**errp);
+
 typedef struct VhostOps {
 VhostBackendType backend_type;
 vhost_backend_init vhost_backend_init;
@@ -181,6 +202,9 @@ typedef struct VhostOps {
 vhost_force_iommu_op vhost_force_iommu;
 vhost_set_config_call_op vhost_set_config_call;
 vhost_reset_status_op vhost_reset_status;
+vhost_supports_device_state_op vhost_supports_device_state;
+vhost_set_device_state_fd_op vhost_set_device_state_fd;
+vhost_check_device_state_op vhost_check_device_state;
 } VhostOps;
 
 int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
diff --git a/include/hw/virtio/vhost-user.h b/include/hw/virtio/vhost-user.h
index 1d4121431b..324cd8663a 100644
--- a/include/hw/virtio/vhost-user.h
+++ b/include/hw/virtio/vhost-user.h
@@ -31,6 +31,7 @@ enum VhostUserProtocolFeature {
 VHOST_USER_PROTOCOL_F_STATUS = 16,
 /* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP. */
 VHOST_USER_PROTOCOL_F_SHARED_OBJECT = 18,
+VHOST_USER_PROTOCOL_F_DEVICE_STATE = 19,
 VHOST_USER_PROTOCOL_F_MAX
 };
 
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 14621f9e79..a0d03c9fdf 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -348,4 +348,82 @@ static inline int vhost_reset_device(struct vhost_dev 
*hdev)
 }
 #endif /* CONFIG_VHOST */
 
+/**
+ * vhost_supports_device_state(): Checks whether the back-end supports
+ * transferring internal device state for the purpose of migration.
+ * Support for this feature is required for vhost_set_device_state_fd()
+ * and vhost_check_device_state().
+ *
+ * @dev: The vhost device
+ *
+ * Returns true if the device supports these commands, and false if it
+ * does not.
+ */
+bool vhost_supports_device_state(struct vhost_dev *dev);
+
+/**
+ * vhost_set_device_state_fd(): Begin transfer of internal state from/to
+ * the back-end for the purpose of migration.  Data is to be transferred
+ * over a pipe according to @direction and @phase.  The sending end must
+ * only write to the pipe, and the receiving end must only read from it.
+ * Once the sending end is done, it closes its FD.  The receiving end
+ * must take this as the end-of-transfer signal and close its FD, too.
+ *
+ * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
+ * read FD for LOAD.  This function transfers ownership of @fd to the
+ * back-end, i.e. closes it in the front-end.
+ *
+ * The back-end may optionally reply with an FD of its own, if this
+ * improves efficiency on its end.  In this case, the returned FD is
+ * stored in *reply_fd.  The back-end will discard the FD sent to it,
+ * and the front-end must use *reply_fd for transferring state to/from
+ * the back-end.
+ *
+ * @dev: The vhost device
+ * @direction: The direction in which the state is to be transferred.
+ * For outgoing migrations, this is SAVE, and data is read
+ * from the back-end and stored by the front-end

[PATCH v5 4/7] vhost-user.rst: Migrating back-end-internal state

2023-10-16 Thread Hanna Czenczek
For vhost-user devices, qemu can migrate the virtio state, but not the
back-end's internal state.  To do so, we need to be able to transfer
this internal state between front-end (qemu) and back-end.

At this point, this new feature is added for the purpose of virtio-fs
migration.  Because virtiofsd's internal state will not be too large, we
believe it is best to transfer it as a single binary blob after the
streaming phase.

These are the additions to the protocol:
- New vhost-user protocol feature VHOST_USER_PROTOCOL_F_DEVICE_STATE
- SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a file
  descriptor over which to transfer the state.
- CHECK_DEVICE_STATE: After the state has been transferred through the
  file descriptor, the front-end invokes this function to verify
  success.  There is no in-band way (through the file descriptor) to
  indicate failure, so we need to check explicitly.

Once the transfer FD has been established via SET_DEVICE_STATE_FD
(which includes establishing the direction of transfer and migration
phase), the sending side writes its data into it, and the reading side
reads it until it sees an EOF.  Then, the front-end will check for
success via CHECK_DEVICE_STATE, which on the destination side includes
checking for integrity (i.e. errors during deserialization).

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Hanna Czenczek 
---
 docs/interop/vhost-user.rst | 172 
 1 file changed, 172 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 035a23ed35..9f1103f85a 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -322,6 +322,32 @@ VhostUserShared
 :UUID: 16 bytes UUID, whose first three components (a 32-bit value, then
   two 16-bit values) are stored in big endian.
 
+Device state transfer parameters
+--------------------------------
+
++--------------------+-----------------+
+| transfer direction | migration phase |
++--------------------+-----------------+
+
+:transfer direction: a 32-bit enum, describing the direction in which
+  the state is transferred:
+
+  - 0: Save: Transfer the state from the back-end to the front-end,
+which happens on the source side of migration
+  - 1: Load: Transfer the state from the front-end to the back-end,
+which happens on the destination side of migration
+
+:migration phase: a 32-bit enum, describing the state in which the VM
+  guest and devices are:
+
+  - 0: Stopped (in the period after the transfer of memory-mapped
+regions before switch-over to the destination): The VM guest is
+stopped, and the vhost-user device is suspended (see
+:ref:`Suspended device state <suspended_device_state>`).
+
+  In the future, additional phases might be added e.g. to allow
+  iterative migration while the device is running.
+
 C structure
 ---
 
@@ -381,6 +407,7 @@ in the ancillary data:
 * ``VHOST_USER_SET_VRING_ERR``
 * ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name 
``VHOST_USER_SET_SLAVE_REQ_FD``)
 * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
+* ``VHOST_USER_SET_DEVICE_STATE_FD``
 
 If *front-end* is unable to send the full message or receives a wrong
 reply it will close the connection. An optional reconnection mechanism
@@ -555,6 +582,80 @@ it performs WAKE ioctl's on the userfaultfd to wake the 
stalled
 back-end.  The front-end indicates support for this via the
 ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.
 
+.. _migrating_backend_state:
+
+Migrating back-end state
+
+
+Migrating device state involves transferring the state from one
+back-end, called the source, to another back-end, called the
+destination.  After migration, the destination transparently resumes
+operation without requiring the driver to re-initialize the device at
+the VIRTIO level.  If the migration fails, then the source can
+transparently resume operation until another migration attempt is made.
+
+Generally, the front-end is connected to a virtual machine guest (which
+contains the driver), which has its own state to transfer between source
+and destination, and therefore will have an implementation-specific
+mechanism to do so.  The ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature
+provides functionality to have the front-end include the back-end's
+state in this transfer operation so the back-end does not need to
+implement its own mechanism, and so the virtual machine may have its
+complete state, including vhost-user devices' states, contained within a
+single stream of data.
+
+To do this, the back-end state is transferred from back-end to front-end
+on the source side, and vice versa on the destination side.  This
+transfer happens over a channel that is negotiated using the
+``VHOST_USER_SET_DEVICE_STATE_FD`` message.  This message has two
+parameters:
+
+* Direction of transfer: On the source, the data is saved, transferring
+  it from the back-end to the front-end.  On the destination, the data

[PATCH v5 2/7] vhost-user.rst: Clarify enabling/disabling vrings

2023-10-16 Thread Hanna Czenczek
Currently, the vhost-user documentation says that rings are to be
initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is
negotiated.  However, by the time of feature negotiation, all rings have
already been initialized, so it is not entirely clear what this means.

At least the vhost-user-backend Rust crate's implementation interpreted
it to mean that whenever this feature is negotiated, all rings are to be
put into a disabled state, which means that every SET_FEATURES call
would disable all rings, effectively halting the device.  This is
problematic because the VHOST_F_LOG_ALL feature is also set or cleared
this way, which happens during migration.  Doing so should not halt the
device.

Other implementations have interpreted this to mean that the device is
to be initialized with all rings disabled, and a subsequent SET_FEATURES
call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of
them.  Here, SET_FEATURES will never disable any ring.

This interpretation does not suffer the problem of unintentionally
halting the device whenever features are set or cleared, so it seems
better and more reasonable.

We can clarify this in the documentation by making it explicit that the
enabled/disabled state is tracked even while the vring is stopped.
Every vring is initialized in a disabled state, and SET_FEATURES without
VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all
vrings.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Hanna Czenczek 
---
 docs/interop/vhost-user.rst | 32 +---
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 9202b167dd..e5a04c04ed 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -411,31 +411,33 @@ negotiation.
 Ring states
 ---
 
-Rings can be in one of three states:
+Rings have two independent states: started/stopped, and enabled/disabled.
 
-* stopped: the back-end must not process the ring at all.
+* While a ring is stopped, the back-end must not process the ring at
+  all, regardless of whether it is enabled or disabled.  The
+  enabled/disabled state should still be tracked, though, so it can come
+  into effect once the ring is started.
 
-* started but disabled: the back-end must process the ring without
+* started and disabled: The back-end must process the ring without
   causing any side effects.  For example, for a networking device,
   in the disabled state the back-end must not supply any new RX packets,
   but must process and discard any TX packets.
 
-* started and enabled.
+* started and enabled: The back-end must process the ring normally, i.e.
+  process all requests and execute them.
 
-Each ring is initialized in a stopped state.  The back-end must start
-ring upon receiving a kick (that is, detecting that file descriptor is
-readable) on the descriptor specified by ``VHOST_USER_SET_VRING_KICK``
-or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated,
-and stop ring upon receiving ``VHOST_USER_GET_VRING_BASE``.
+Each ring is initialized in a stopped and disabled state.  The back-end
+must start a ring upon receiving a kick (that is, detecting that file
+descriptor is readable) on the descriptor specified by
+``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
+``VHOST_USER_VRING_KICK`` if negotiated, and stop a ring upon receiving
+``VHOST_USER_GET_VRING_BASE``.
 
 Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
 
-If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
-ring starts directly in the enabled state.
-
-If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
-initialized in a disabled state and is enabled by
-``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.
+In addition, upon receiving a ``VHOST_USER_SET_FEATURES`` message from
+the front-end without ``VHOST_USER_F_PROTOCOL_FEATURES`` set, the
+back-end must enable all rings immediately.
 
 While processing the rings (whether they are enabled or not), the back-end
 must support changing some configuration aspects on the fly.
-- 
2.41.0




[PATCH v5 3/7] vhost-user.rst: Introduce suspended state

2023-10-16 Thread Hanna Czenczek
In vDPA, GET_VRING_BASE does not stop the queried vring, which is why
SUSPEND was introduced so that the returned index would be stable.  In
vhost-user, it does stop the vring, so under the same reasoning, it can
get away without SUSPEND.

Still, we do want to clarify that if the device is completely stopped,
i.e. all vrings are stopped, the back-end should cease to modify any
state relating to the guest.  Do this by calling it "suspended".

Suggested-by: Stefan Hajnoczi 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Hanna Czenczek 
---
 docs/interop/vhost-user.rst | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index e5a04c04ed..035a23ed35 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -442,6 +442,19 @@ back-end must enable all rings immediately.
 While processing the rings (whether they are enabled or not), the back-end
 must support changing some configuration aspects on the fly.
 
+.. _suspended_device_state:
+
+Suspended device state
+^^^^^^^^^^^^^^^^^^^^^^
+
+While all vrings are stopped, the device is *suspended*.  In addition to
+not processing any vring (because they are stopped), the device must:
+
+* not write to any guest memory regions,
+* not send any notifications to the guest,
+* not send any messages to the front-end,
+* still process and reply to messages from the front-end.
+
 Multiple queue support
 --
 
@@ -529,7 +542,8 @@ ancillary data, it may be used to inform the front-end that 
the log has
 been modified.
 
 Once the source has finished migration, rings will be stopped by the
-source. No further update must be done before rings are restarted.
+source (:ref:`Suspended device state <suspended_device_state>`). No
+further update must be done before rings are restarted.
 
 In postcopy migration the back-end is started before all the memory has
 been received from the source host, and care must be taken to avoid
@@ -1123,6 +1137,10 @@ Front-end message types
   (*a vring descriptor index for split virtqueues* vs. *vring descriptor
   indices for packed virtqueues*).
 
+  When and as long as all of a device’s vrings are stopped, it is
+  *suspended*, see :ref:`Suspended device state
+  <suspended_device_state>`.
+
   The request payload’s *num* field is currently reserved and must be
   set to 0.
 
-- 
2.41.0




[PATCH v5 6/7] vhost: Add high-level state save/load functions

2023-10-16 Thread Hanna Czenczek
vhost_save_backend_state() and vhost_load_backend_state() can be used by
vhost front-ends to easily save and load the back-end's state to/from
the migration stream.

Because we do not know the full state size ahead of time,
vhost_save_backend_state() simply reads the data in 1 MB chunks, and
writes each chunk consecutively into the migration stream, prefixed by
its length.  EOF is indicated by a 0-length chunk.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Hanna Czenczek 
---
 include/hw/virtio/vhost.h |  35 +++
 hw/virtio/vhost.c | 204 ++
 2 files changed, 239 insertions(+)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index a0d03c9fdf..100fcc874d 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -426,4 +426,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev,
  */
 int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
 
+/**
+ * vhost_save_backend_state(): High-level function to receive a vhost
+ * back-end's state, and save it in @f.  Uses
+ * `vhost_set_device_state_fd()` to get the data from the back-end, and
+ * stores it in consecutive chunks that are each prefixed by their
+ * respective length (be32).  The end is marked by a 0-length chunk.
+ *
+ * Must only be called while the device and all its vrings are stopped
+ * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
+ *
+ * @dev: The vhost device from which to save the state
+ * @f: Migration stream in which to save the state
+ * @errp: Potential error message
+ *
+ * Returns 0 on success, and -errno otherwise.
+ */
+int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
+
+/**
+ * vhost_load_backend_state(): High-level function to load a vhost
+ * back-end's state from @f, and send it over to the back-end.  Reads
+ * the data from @f in the format used by `vhost_save_backend_state()`, and uses
+ * `vhost_set_device_state_fd()` to transfer it to the back-end.
+ *
+ * Must only be called while the device and all its vrings are stopped
+ * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
+ *
+ * @dev: The vhost device to which to send the state
+ * @f: Migration stream from which to load the state
+ * @errp: Potential error message
+ *
+ * Returns 0 on success, and -errno otherwise.
+ */
+int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
+
 #endif
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index ca4bdd9d66..643c1af5a0 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -2133,3 +2133,207 @@ int vhost_check_device_state(struct vhost_dev *dev, 
Error **errp)
"vhost transport does not support migration state transfer");
 return -ENOSYS;
 }
+
+int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
+{
+/* Maximum chunk size in which to transfer the state */
+const size_t chunk_size = 1 * 1024 * 1024;
+g_autofree void *transfer_buf = NULL;
+g_autoptr(GError) g_err = NULL;
+int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
+int ret;
+
+/* [0] for reading (our end), [1] for writing (back-end's end) */
+if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
+error_setg(errp, "Failed to set up state transfer pipe: %s",
+   g_err->message);
+ret = -EINVAL;
+goto fail;
+}
+
+read_fd = pipe_fds[0];
+write_fd = pipe_fds[1];
+
+/*
+ * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped.
+ * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist for
+ * vhost-user, so just check that it is stopped at all.
+ */
+assert(!dev->started);
+
+/* Transfer ownership of write_fd to the back-end */
+ret = vhost_set_device_state_fd(dev,
+VHOST_TRANSFER_STATE_DIRECTION_SAVE,
+VHOST_TRANSFER_STATE_PHASE_STOPPED,
+write_fd,
+&reply_fd,
+errp);
+if (ret < 0) {
+error_prepend(errp, "Failed to initiate state transfer: ");
+goto fail;
+}
+
+/* If the back-end wishes to use a different pipe, switch over */
+if (reply_fd >= 0) {
+close(read_fd);
+read_fd = reply_fd;
+}
+
+transfer_buf = g_malloc(chunk_size);
+
+while (true) {
+ssize_t read_ret;
+
+read_ret = RETRY_ON_EINTR(read(read_fd, transfer_buf, chunk_size));
+if (read_ret < 0) {
+ret = -errno;
+error_setg_errno(errp, -ret, "Failed to receive state");
+goto fail;
+}
+
+assert(read_ret <= chunk_size);
+qemu_put_be32(f, read_ret);
+
+if (read_ret == 0) {
+/* EOF */
+break;
+}
+
+qemu_put_buffer(f, transfer_buf, read_ret);
+}
+
+/*
+ * Back-end will no

[PATCH v5 0/7] vhost-user: Back-end state migration

2023-10-16 Thread Hanna Czenczek
v2:
https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02604.html

v3:
https://lists.nongnu.org/archive/html/qemu-devel/2023-09/msg03750.html

v4:
https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg01046.html


Based-on: <20231004014532.1228637-1-stefa...@redhat.com>
  ([PATCH v2 0/3] vhost: clean up device reset)

Based-on: <20231016083201.23736-1-hre...@redhat.com>
  ([PATCH] vhost-user: Fix protocol feature bit conflict)


Hi,

v5 is basically the same as v4, only that I’ve dropped the patch
deprecating F_STATUS (which doesn’t affect the rest of the series), that
I’ve amended the documentation in patch 1 as suggested by Stefan and
with help from Michael, and that I’ve rebased everything on top of the
F_SHARED_OBJECT changes that have been merged upstream.


git-backport-diff against v4:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/7:[0029] [FC] 'vhost-user.rst: Improve [GS]ET_VRING_BASE doc'
002/7:[----] [--] 'vhost-user.rst: Clarify enabling/disabling vrings'
003/7:[----] [--] 'vhost-user.rst: Introduce suspended state'
004/7:[0006] [FC] 'vhost-user.rst: Migrating back-end-internal state'
005/7:[0007] [FC] 'vhost-user: Interface for migration state transfer'
006/7:[----] [--] 'vhost: Add high-level state save/load functions'
007/7:[----] [--] 'vhost-user-fs: Implement internal migration'


Changes patch by patch:
- Patch 1: Amended documentation
- Patches 4 and 5: Bumped feature bit and command values as necessary so
  as not to conflict with F_SHARED_OBJECT


Hanna Czenczek (7):
  vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  vhost-user.rst: Clarify enabling/disabling vrings
  vhost-user.rst: Introduce suspended state
  vhost-user.rst: Migrating back-end-internal state
  vhost-user: Interface for migration state transfer
  vhost: Add high-level state save/load functions
  vhost-user-fs: Implement internal migration

 docs/interop/vhost-user.rst   | 301 --
 include/hw/virtio/vhost-backend.h |  24 +++
 include/hw/virtio/vhost-user.h|   1 +
 include/hw/virtio/vhost.h | 113 +++
 hw/virtio/vhost-user-fs.c | 101 +-
 hw/virtio/vhost-user.c| 146 +++
 hw/virtio/vhost.c | 241 
 7 files changed, 906 insertions(+), 21 deletions(-)

-- 
2.41.0




[PATCH v5 1/7] vhost-user.rst: Improve [GS]ET_VRING_BASE doc

2023-10-16 Thread Hanna Czenczek
GET_VRING_BASE does not mention that it stops the respective ring.  Fix
that.

Furthermore, it is not fully clear what the "base offset" these
commands' documentation refers to is; an offset could be many things.
Be more precise and verbose about it, especially given that these
commands use different payload structures depending on whether the vring
is split or packed.

Signed-off-by: Hanna Czenczek 
---
 docs/interop/vhost-user.rst | 77 +++--
 1 file changed, 73 insertions(+), 4 deletions(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 768fb5c28c..9202b167dd 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -108,6 +108,43 @@ A vring state description
 
 :num: a 32-bit number
 
+A vring descriptor index for split virtqueues
+^
+
++-+-+
+| vring index | index in avail ring |
++-+-+
+
+:vring index: 32-bit index of the respective virtqueue
+
+:index in avail ring: 32-bit value, of which currently only the lower 16
+  bits are used:
+
+  - Bits 0–15: Index of the next *Available Ring* descriptor that the
+back-end will process.  This is a free-running index that is not
+wrapped by the ring size.
+  - Bits 16–31: Reserved (set to zero)
+
+Vring descriptor indices for packed virtqueues
+^^
+
++-++
+| vring index | descriptor indices |
++-++
+
+:vring index: 32-bit index of the respective virtqueue
+
+:descriptor indices: 32-bit value:
+
+  - Bits 0–14: Index of the next *Available Ring* descriptor that the
+back-end will process.  This is a free-running index that is not
+wrapped by the ring size.
+  - Bit 15: Driver (Available) Ring Wrap Counter
+  - Bits 16–30: Index of the entry in the *Used Ring* where the back-end
+will place the next descriptor.  This is a free-running index that
+is not wrapped by the ring size.
+  - Bit 31: Device (Used) Ring Wrap Counter
+
 A vring address description
 ^^^
 
@@ -1042,18 +1079,50 @@ Front-end message types
 ``VHOST_USER_SET_VRING_BASE``
   :id: 10
   :equivalent ioctl: ``VHOST_SET_VRING_BASE``
-  :request payload: vring state description
+  :request payload: vring descriptor index/indices
   :reply payload: N/A
 
-  Sets the base offset in the available vring.
+  Sets the next index to use for descriptors in this vring:
+
+  * For a split virtqueue, sets only the next descriptor index to
+process in the *Available Ring*.  The device is supposed to read the
+next index in the *Used Ring* from the respective vring structure in
+guest memory.
+
+  * For a packed virtqueue, both indices are supplied, as they are not
+explicitly available in memory.
+
+  Consequently, the payload type is specific to the type of virt queue
+  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
+  indices for packed virtqueues*).
 
 ``VHOST_USER_GET_VRING_BASE``
   :id: 11
   :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE``
   :request payload: vring state description
-  :reply payload: vring state description
+  :reply payload: vring descriptor index/indices
+
+  Stops the vring and returns the current descriptor index or indices:
+
+* For a split virtqueue, returns only the 16-bit next descriptor
+  index to process in the *Available Ring*.  Note that this may
+  differ from the available ring index in the vring structure in
+  memory, which points to where the driver will put new available
+  descriptors.  For the *Used Ring*, the device only needs the next
+  descriptor index at which to put new descriptors, which is the
+  value in the vring structure in memory, so this value is not
+  covered by this message.
+
+* For a packed virtqueue, neither index is explicitly available to
+  read from memory, so both indices (as maintained by the device) are
+  returned.
+
+  Consequently, the payload type is specific to the type of virt queue
+  (*a vring descriptor index for split virtqueues* vs. *vring descriptor
+  indices for packed virtqueues*).
 
-  Get the available vring base offset.
+  The request payload’s *num* field is currently reserved and must be
+  set to 0.
 
 ``VHOST_USER_SET_VRING_KICK``
   :id: 12
-- 
2.41.0




[PATCH v5 7/7] vhost-user-fs: Implement internal migration

2023-10-16 Thread Hanna Czenczek
A virtio-fs device's VM state consists of:
- the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
- the back-end's (virtiofsd's) internal state

We get/set the latter via the new vhost operations to transfer migratory
state.  It is its own dedicated subsection, so that for external
migration, it can be disabled.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Hanna Czenczek 
---
 hw/virtio/vhost-user-fs.c | 101 +-
 1 file changed, 100 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index 49d699ffc2..eb91723855 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -298,9 +298,108 @@ static struct vhost_dev *vuf_get_vhost(VirtIODevice *vdev)
 return &fs->vhost_dev;
 }
 
+/**
+ * Fetch the internal state from virtiofsd and save it to `f`.
+ */
+static int vuf_save_state(QEMUFile *f, void *pv, size_t size,
+  const VMStateField *field, JSONWriter *vmdesc)
+{
+VirtIODevice *vdev = pv;
+VHostUserFS *fs = VHOST_USER_FS(vdev);
+Error *local_error = NULL;
+int ret;
+
+ret = vhost_save_backend_state(&fs->vhost_dev, f, &local_error);
+if (ret < 0) {
+error_reportf_err(local_error,
+  "Error saving back-end state of %s device %s "
+  "(tag: \"%s\"): ",
+  vdev->name, vdev->parent_obj.canonical_path,
+  fs->conf.tag ?: "");
+return ret;
+}
+
+return 0;
+}
+
+/**
+ * Load virtiofsd's internal state from `f` and send it over to virtiofsd.
+ */
+static int vuf_load_state(QEMUFile *f, void *pv, size_t size,
+  const VMStateField *field)
+{
+VirtIODevice *vdev = pv;
+VHostUserFS *fs = VHOST_USER_FS(vdev);
+Error *local_error = NULL;
+int ret;
+
+ret = vhost_load_backend_state(&fs->vhost_dev, f, &local_error);
+if (ret < 0) {
+error_reportf_err(local_error,
+  "Error loading back-end state of %s device %s "
+  "(tag: \"%s\"): ",
+  vdev->name, vdev->parent_obj.canonical_path,
+  fs->conf.tag ?: "");
+return ret;
+}
+
+return 0;
+}
+
+static bool vuf_is_internal_migration(void *opaque)
+{
+/* TODO: Return false when an external migration is requested */
+return true;
+}
+
+static int vuf_check_migration_support(void *opaque)
+{
+VirtIODevice *vdev = opaque;
+VHostUserFS *fs = VHOST_USER_FS(vdev);
+
+if (!vhost_supports_device_state(&fs->vhost_dev)) {
+error_report("Back-end of %s device %s (tag: \"%s\") does not support "
+ "migration through qemu",
+ vdev->name, vdev->parent_obj.canonical_path,
+ fs->conf.tag ?: "");
+return -ENOTSUP;
+}
+
+return 0;
+}
+
+static const VMStateDescription vuf_backend_vmstate;
+
 static const VMStateDescription vuf_vmstate = {
 .name = "vhost-user-fs",
-.unmigratable = 1,
+.version_id = 0,
+.fields = (VMStateField[]) {
+VMSTATE_VIRTIO_DEVICE,
+VMSTATE_END_OF_LIST()
+},
+.subsections = (const VMStateDescription * []) {
+&vuf_backend_vmstate,
+NULL,
+}
+};
+
+static const VMStateDescription vuf_backend_vmstate = {
+.name = "vhost-user-fs-backend",
+.version_id = 0,
+.needed = vuf_is_internal_migration,
+.pre_load = vuf_check_migration_support,
+.pre_save = vuf_check_migration_support,
+.fields = (VMStateField[]) {
+{
+.name = "back-end",
+.info = &(const VMStateInfo) {
+.name = "virtio-fs back-end state",
+.get = vuf_load_state,
+.put = vuf_save_state,
+},
+},
+VMSTATE_END_OF_LIST()
+},
 };
 
 static Property vuf_properties[] = {
-- 
2.41.0




Re: [PATCH] vhost-user: Fix protocol feature bit conflict

2023-10-16 Thread Hanna Czenczek

On 16.10.23 10:45, Manos Pitsidianakis wrote:

On Mon, 16 Oct 2023 11:32, Hanna Czenczek  wrote:
diff --git a/include/hw/virtio/vhost-user.h b/include/hw/virtio/vhost-user.h

index 9f9ddf878d..1d4121431b 100644
--- a/include/hw/virtio/vhost-user.h
+++ b/include/hw/virtio/vhost-user.h
@@ -29,7 +29,8 @@ enum VhostUserProtocolFeature {
    VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
    VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
    VHOST_USER_PROTOCOL_F_STATUS = 16,
-    VHOST_USER_PROTOCOL_F_SHARED_OBJECT = 17,
+    /* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP. */
+    VHOST_USER_PROTOCOL_F_SHARED_OBJECT = 18,
    VHOST_USER_PROTOCOL_F_MAX
};


May I ask, why not define VHOST_USER_PROTOCOL_F_XEN_MMAP as well 
instead of a comment mention?


No particular reason, it’s just what I had planned to do for a while in 
other series that would introduce feature bits (e.g. 
https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02452.html). 
I think I took that from libvhost-user, which does this for 
VHOST_USER_PROTOCOL_F_STATUS.



Otherwise:

Reviewed-by: Emmanouil Pitsidianakis 


Thanks!

Hanna




[PATCH] vhost-user: Fix protocol feature bit conflict

2023-10-16 Thread Hanna Czenczek
The VHOST_USER_PROTOCOL_F_XEN_MMAP feature bit was defined in
f21e95ee97d, which has been part of qemu's 8.1.0 release.  However, it
seems it was never added to qemu's code, but it is well possible that it
is already used by different front-ends outside of qemu (i.e., Xen).

VHOST_USER_PROTOCOL_F_SHARED_OBJECT in contrast was added to qemu's code
in 16094766627, but never defined in the vhost-user specification.  As a
consequence, both bits were defined to be 17, which cannot work.

Regardless of whether actual code or the specification should take
precedence, F_XEN_MMAP is already part of a qemu release, while
F_SHARED_OBJECT is not.  Therefore, bump the latter to take number 18
instead of 17, and add this to the specification.

Take the opportunity to add at least a little note on the
VhostUserShared structure to the specification.  This structure is
referenced by the new commands introduced in 16094766627, but was not
defined.

Fixes: 160947666276c5b7f6bca4d746bcac2966635d79
   ("vhost-user: add shared_object msg")
Signed-off-by: Hanna Czenczek 
---
 docs/interop/vhost-user.rst   | 11 +++
 include/hw/virtio/vhost-user.h|  3 ++-
 subprojects/libvhost-user/libvhost-user.h |  3 ++-
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 415bb47a19..768fb5c28c 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -275,6 +275,16 @@ Inflight description
 
 :queue size: a 16-bit size of virtqueues
 
+VhostUserShared
+^^^
+
++--+
+| UUID |
++--+
+
+:UUID: 16 bytes UUID, whose first three components (a 32-bit value, then
+  two 16-bit values) are stored in big endian.
+
 C structure
 ---
 
@@ -885,6 +895,7 @@ Protocol features
   #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
   #define VHOST_USER_PROTOCOL_F_STATUS   16
   #define VHOST_USER_PROTOCOL_F_XEN_MMAP 17
+  #define VHOST_USER_PROTOCOL_F_SHARED_OBJECT18
 
 Front-end message types
 ---
diff --git a/include/hw/virtio/vhost-user.h b/include/hw/virtio/vhost-user.h
index 9f9ddf878d..1d4121431b 100644
--- a/include/hw/virtio/vhost-user.h
+++ b/include/hw/virtio/vhost-user.h
@@ -29,7 +29,8 @@ enum VhostUserProtocolFeature {
 VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
 VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
 VHOST_USER_PROTOCOL_F_STATUS = 16,
-VHOST_USER_PROTOCOL_F_SHARED_OBJECT = 17,
+/* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP. */
+VHOST_USER_PROTOCOL_F_SHARED_OBJECT = 18,
 VHOST_USER_PROTOCOL_F_MAX
 };
 
diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index b36a42a7ca..c2352904f0 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -65,7 +65,8 @@ enum VhostUserProtocolFeature {
 VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
 VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
 /* Feature 16 is reserved for VHOST_USER_PROTOCOL_F_STATUS. */
-VHOST_USER_PROTOCOL_F_SHARED_OBJECT = 17,
+/* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP. */
+VHOST_USER_PROTOCOL_F_SHARED_OBJECT = 18,
 VHOST_USER_PROTOCOL_F_MAX
 };
 
-- 
2.41.0




Re: [Virtio-fs] (no subject)

2023-10-13 Thread Hanna Czenczek

On 10.10.23 16:35, Alex Bennée wrote:

Hanna Czenczek  writes:

(adding Viresh to CC for Xen Vhost questions)


On 10.10.23 12:36, Alex Bennée wrote:

Hanna Czenczek  writes:


On 10.10.23 06:00, Yajun Wu wrote:

On 10/9/2023 5:13 PM, Hanna Czenczek wrote:

External email: Use caution opening links or attachments


On 09.10.23 11:07, Hanna Czenczek wrote:

On 09.10.23 10:21, Hanna Czenczek wrote:

On 07.10.23 04:22, Yajun Wu wrote:

[...]




So as far as I understand, the feature is supposed to rely on
implementation-specific behavior between specifically qemu as a
front-end and dpdk as a back-end, nothing else.  Honestly, that to me
is a very good reason to deprecate it.  That would make it clear that
any implementation that implements it does so because it relies on
implementation-specific behavior from other implementations.

Option 2 is to fix it.  It is not right to use this broadly defined
feature with its clear protocol as given in the virtio specification
just to set and clear a single bit (DRIVER_OK).  The vhost-user
specification points to that virtio protocol.  We must adhere to the
protocol.  And note that we must not reset devices just because the VM
is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so
that Stefan’s series would introduce RESET_DEVICE where we need it,
and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)

Option 3 would be to just be honest in the specification, and limit
the scope of F_STATUS to say the only bit that matters is DRIVER_OK.
I would say this is not really different from deprecating, though it
wouldn’t affect your case.  However, I understand Alex relies on a
full status byte.  I’m still interested to know why that is.

For an F_TRANSPORT backend (or whatever the final name ends up being) we
need the backend to have full control of the status byte because all the
handling of VirtIO is deferred to it. Therefore it has to handle all the
feature negotiation and indicate when the device needs resetting.

(side note: feature negotiation is another slippery area when QEMU gets
involved in gating which feature bits may or may not be exposed to the
backend. The only one it should ever mask is F_UNUSED which is used
(sic) to trigger the vhost protocol negotiation)

That’s the thing, feature negotiation is done with GET_FEATURES and
SET_FEATURES.  Configuring F_REPLY_ACK lets SET_FEATURES return
errors.

OK but then what - QEMU fakes up FEATURES_OK in the Device Status field
on the behalf of the backend?


It does that right now.  When using qemu, vhost-user status byte is not 
exposed to the guest at all.  qemu makes it up completely, and 
effectively ignores the response from GET_STATUS completely.


(Right now, the only use of GET_STATUS is in a function that sets a 
flag in the status byte: it calls GET_STATUS, ORs the flag in, and 
calls SET_STATUS with the result.)



I should point out QEMU doesn't exist in some of these use case. When
using the rust-vmm backends with Xen for example there is no VMM to talk
to so we have a Xen Vhost Frontend which is entirely concerned with
setup and then once connected up leaves the backend to do its thing. I'd
rather leave the frontend as dumb as possible rather than splitting
logic between the two.


Indicating that the device needs reset is a good point, there is no
other feature to do that.  (And something qemu currently ignores, just
like any value the device returns through GET_STATUS, but that’s
besides the point.)


Option 4 is of course not to do anything, and leave everything as-is,
waiting for the next person to stir the hornet’s nest.


Cc-ing Alex on this mail, because to me, this seems like an important
detail when he plans on using the byte in the future. If we need a
virtio status byte, I can’t see how we could use the existing F_STATUS
for it.

What would we use instead of F_STATUS to query the Device Status field?

We would emulate it in the front-end, just like we need to do for
back-ends without F_STATUS.  We can’t emulate the DEVICE_NEEDS_RESET
bit, though, that’s correct.

Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 %
convinced that your use case has a hard dependency on F_STATUS.
However, this still does make a fair point in general that it would be
useful to keep it.

OK/


That still leaves us with the situation that currently, the only
implementations with F_STATUS support are qemu and dpdk, which both
handle it incorrectly.

I was going to say there is also the rust-vmm vhost-user-master crates
which we've imported:

   https://github.com/vireshk/vhost

for the Xen Vhost Frontend:

   https://github.com/vireshk/xen-vhost-frontend

but I can't actually see any handling for GET/SET_STATUS at all which
makes me wonder how we actually work. Viresh?


As far as I know the only back-end implementation of F_STATUS is in 
DPDK.  As I said, if anyone else implemented it right now, that would be 
dangerous, because qemu doesn’t adhere to the virtio

Re: [Virtio-fs] (no subject)

2023-10-10 Thread Hanna Czenczek

On 10.10.23 12:36, Alex Bennée wrote:

Hanna Czenczek  writes:


On 10.10.23 06:00, Yajun Wu wrote:

On 10/9/2023 5:13 PM, Hanna Czenczek wrote:

External email: Use caution opening links or attachments


On 09.10.23 11:07, Hanna Czenczek wrote:

On 09.10.23 10:21, Hanna Czenczek wrote:

On 07.10.23 04:22, Yajun Wu wrote:

[...]




So as far as I understand, the feature is supposed to rely on
implementation-specific behavior between specifically qemu as a
front-end and dpdk as a back-end, nothing else.  Honestly, that to me
is a very good reason to deprecate it.  That would make it clear that
any implementation that implements it does so because it relies on
implementation-specific behavior from other implementations.

Option 2 is to fix it.  It is not right to use this broadly defined
feature with its clear protocol as given in the virtio specification
just to set and clear a single bit (DRIVER_OK).  The vhost-user
specification points to that virtio protocol.  We must adhere to the
protocol.  And note that we must not reset devices just because the VM
is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so
that Stefan’s series would introduce RESET_DEVICE where we need it,
and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)

Option 3 would be to just be honest in the specification, and limit
the scope of F_STATUS to say the only bit that matters is DRIVER_OK.
I would say this is not really different from deprecating, though it
wouldn’t affect your case.  However, I understand Alex relies on a
full status byte.  I’m still interested to know why that is.

For an F_TRANSPORT backend (or whatever the final name ends up being) we
need the backend to have full control of the status byte because all the
handling of VirtIO is deferred to it. Therefore it has to handle all the
feature negotiation and indicate when the device needs resetting.

(side note: feature negotiation is another slippery area when QEMU gets
involved in gating which feature bits may or may not be exposed to the
backend. The only one it should ever mask is F_UNUSED which is used
(sic) to trigger the vhost protocol negotiation)


That’s the thing, feature negotiation is done with GET_FEATURES and 
SET_FEATURES.  Configuring F_REPLY_ACK lets SET_FEATURES return errors.


Indicating that the device needs reset is a good point, there is no 
other feature to do that.  (And something qemu currently ignores, just 
like any value the device returns through GET_STATUS, but that’s besides 
the point.)



Option 4 is of course not to do anything, and leave everything as-is,
waiting for the next person to stir the hornet’s nest.


Cc-ing Alex on this mail, because to me, this seems like an important
detail when he plans on using the byte in the future. If we need a
virtio status byte, I can’t see how we could use the existing F_STATUS
for it.

What would we use instead of F_STATUS to query the Device Status field?


We would emulate it in the front-end, just like we need to do for 
back-ends without F_STATUS.  We can’t emulate the DEVICE_NEEDS_RESET 
bit, though, that’s correct.


Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 % 
convinced that your use case has a hard dependency on F_STATUS. However, 
this still does make a fair point in general that it would be useful to 
keep it.


That still leaves us with the situation that currently, the only 
implementations with F_STATUS support are qemu and dpdk, which both 
handle it incorrectly.  Furthermore, the specification leaves much to be 
desired, specifically in how F_STATUS interacts with other vhost-user 
commands (which is something I cited as a reason for my original patch), 
i.e. whether RESET_DEVICE and SET_STATUS 0 are equivalent, and whether 
failures in feature negotiation must result in both SET_FEATURES 
returning an error (with F_REPLY_ACK), and FEATURES_OK being reset in 
the status byte, or whether either is sufficient.  What happens when 
DEVICE_NEEDS_RESET is set, i.e. do we just need RESET_DEVICE / 
SET_STATUS 0, or do we also need to reset some protocol state?  (This is 
also connected to the fact that what happens on RESET_DEVICE is largely 
undefined, which I said on Stefan’s series.)


In general, because we have our own transport, we should make a note how 
it interacts with the status negotiation phases, i.e. that GET_FEATURES 
must not be called before S_ACKNOWLEDGE | S_DRIVER are set, that 
FEATURES_OK must be set after the SET_FEATURES call, and that DRIVER_OK 
must not be set without FEATURES_OK set / SET_FEATURES having returned 
success.  Here we would also answer the question about the interaction 
of F_REPLY_ACK+SET_FEATURES with F_STATUS, specifically whether an 
implementation with F_REPLY_ACK even needs to read back the status byte 
after setting FEATURES_OK because it could have got the feature 
negotiation result already as a result of the SET_FEATURES call.


After migration, can you just set all flags immediately

Re: [Virtio-fs] (no subject)

2023-10-10 Thread Hanna Czenczek

On 10.10.23 06:00, Yajun Wu wrote:


On 10/9/2023 5:13 PM, Hanna Czenczek wrote:

External email: Use caution opening links or attachments


On 09.10.23 11:07, Hanna Czenczek wrote:

On 09.10.23 10:21, Hanna Czenczek wrote:

On 07.10.23 04:22, Yajun Wu wrote:

[...]


The main motivation of adding VHOST_USER_SET_STATUS is to let
backend DPDK know
when DRIVER_OK bit is valid. It's an indication of all VQ
configuration has sent,
otherwise DPDK has to rely on first queue pair is ready, then
receiving/applying
VQ configuration one by one.

During live migration, configuring VQ one by one is very time
consuming.

One question I have here is why it wasn’t then introduced in the live
migration code, but in the general VM stop/cont code instead. It does
seem time-consuming to do this every time the VM is paused and 
resumed.


Yes, VM stop/cont will call vhost_net_stop/vhost_net_start. Maybe 
because there's no device level stop/cont vhost message?


No, it is because qemu will reset the status in stop/cont*, which it 
should not do.  Aside from guest-initiated resets, the only thing where 
a reset comes into play is when the back-end is changed, e.g. during 
migration.  In that case, the source back-end will see a disconnect on 
the vhost-user socket and can then do whatever uninitialization it needs 
to do, and the destination front-end will need to be reconfigured by 
qemu anyway, because it’s just a case of the destination qemu initiating 
a fresh connection to a new back-end (except that it will need to 
restore the state from the source).


*Yes, technically, dpdk will ignore that reset, but it still stops the 
device on a different message (when it should just pause processing 
vrings), so the outcome is the same.





For VIRTIO
net vDPA, HW needs to know how many VQs are enabled to set
RSS (Receive-Side Scaling).

If you don’t want SET_STATUS message, backend can remove protocol
feature bit
VHOST_USER_PROTOCOL_F_STATUS.

The problem isn’t back-ends that don’t want the message, the problem
is that qemu uses the message wrongly, which prevents well-behaving
back-ends from implementing the message.


DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
close/reset.

So the right thing to do for back-ends is to announce STATUS support
and then not implement it correctly?

GET_VRING_BASE should not close or reset the device, by the
way.  It should stop that one vring, not more.  We have a
RESET_DEVICE command for resetting.
I believe dpdk uses GET_VRING_BASE long before qemu has RESET_DEVICE? 


I don’t think it matters who came first.  What matters is the 
specification, and that dpdk decided to rely on implementation-specific 
behavior without having all involved parties agree by matters of putting 
that in the specification.  And now dpdk clearly deviates from the 
specification as a result of that action, which can result in problems 
if the front-end doesn’t do what qemu always used to do.  (E.g. the 
front-end might just send GET_VRING_BASE for all vrings when suspending 
the guest, and then only send kicks on resume to re-start the vrings.  
dpdk would most likely be left in a state where the whole device is 
stopped, expecting DRIVER_OK.  Same thing in general for front-ends that 
don’t support F_STATUS.)


It's a compatibility issue. For new backend implementations, we can have 
a better solution, right?


The fact that dpdk and qemu deviate from the specification is a problem 
as-is.



I'm not involved in discussion about adding SET_STATUS in Vhost
protocol. This feature
is essential for vDPA (same as vhost-vdpa implements
VHOST_VDPA_SET_STATUS).

So from what I gather from your response is that there is only a
single use for SET_STATUS, which is the DRIVER_OK bit.  If so,
documenting that all other bits are to be ignored by both back-end
and front-end would be fine by me.

I’m not fully serious about that suggestion, but I hear the strong
implication that nothing but DRIVER_OK was of any concern, and this
is really important to note when we talk about the status of the
STATUS feature in vhost today.  It seems to me now that it was not
intended to be the virtio-level status byte, but just a DRIVER_OK
signalling path from front-end to back-end.  That makes it a
vhost-level protocol feature to me.

On second thought, it just is a pure vhost-level protocol feature, and
has nothing to do with the virtio status byte as-is.  The only stated
purpose is for the front-end to send DRIVER_OK after migration, but
migration is transparent to the guest, so the guest would never change
the status byte during migration.  Therefore, if this feature is
essential, we will never be able to have a status byte that is
transparently shared between guest and back-end device, i.e. the
virtio status byte.

On third thought, scratch that.  The guest wouldn’t set it, but
naturally, after migration, the front-end will need to restore the
status byte from the source, so the front-end will always need to set

Re: [Virtio-fs] (no subject)

2023-10-09 Thread Hanna Czenczek

On 09.10.23 11:07, Hanna Czenczek wrote:

On 09.10.23 10:21, Hanna Czenczek wrote:

On 07.10.23 04:22, Yajun Wu wrote:


[...]

The main motivation of adding VHOST_USER_SET_STATUS is to let 
backend DPDK know
when the DRIVER_OK bit is valid. It's an indication that all VQ 
configuration has been sent,
otherwise DPDK has to rely on the first queue pair being ready, then 
receiving/applying

VQ configuration one by one.

During live migration, configuring VQ one by one is very time 
consuming.


One question I have here is why it wasn’t then introduced in the live 
migration code, but in the general VM stop/cont code instead. It does 
seem time-consuming to do this every time the VM is paused and resumed.



For VIRTIO
net vDPA, HW needs to know how many VQs are enabled to set 
RSS (Receive-Side Scaling).


If you don’t want SET_STATUS message, backend can remove protocol 
feature bit

VHOST_USER_PROTOCOL_F_STATUS.


The problem isn’t back-ends that don’t want the message, the problem 
is that qemu uses the message wrongly, which prevents well-behaving 
back-ends from implementing the message.


DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device 
close/reset.


So the right thing to do for back-ends is to announce STATUS support 
and then not implement it correctly?


GET_VRING_BASE should not close or reset the device, by the 
way.  It should stop that one vring, not more.  We have a 
RESET_DEVICE command for resetting.


I'm not involved in discussion about adding SET_STATUS in Vhost 
protocol. This feature
is essential for vDPA (same as vhost-vdpa implements 
VHOST_VDPA_SET_STATUS).


So from what I gather from your response is that there is only a 
single use for SET_STATUS, which is the DRIVER_OK bit.  If so, 
documenting that all other bits are to be ignored by both back-end 
and front-end would be fine by me.


I’m not fully serious about that suggestion, but I hear the strong 
implication that nothing but DRIVER_OK was of any concern, and this 
is really important to note when we talk about the status of the 
STATUS feature in vhost today.  It seems to me now that it was not 
intended to be the virtio-level status byte, but just a DRIVER_OK 
signalling path from front-end to back-end.  That makes it a 
vhost-level protocol feature to me.


On second thought, it just is a pure vhost-level protocol feature, and 
has nothing to do with the virtio status byte as-is.  The only stated 
purpose is for the front-end to send DRIVER_OK after migration, but 
migration is transparent to the guest, so the guest would never change 
the status byte during migration.  Therefore, if this feature is 
essential, we will never be able to have a status byte that is 
transparently shared between guest and back-end device, i.e. the 
virtio status byte.


On third thought, scratch that.  The guest wouldn’t set it, but 
naturally, after migration, the front-end will need to restore the 
status byte from the source, so the front-end will always need to set 
it, even if it were otherwise used controlled only by the guest and the 
back-end device.  So technically, this doesn’t prevent such a use case.  
(In practice, it isn’t controlled by the guest right now, but that could 
be fixed.)


Cc-ing Alex on this mail, because to me, this seems like an important 
detail when he plans on using the byte in the future. If we need a 
virtio status byte, I can’t see how we could use the existing F_STATUS 
for it.


Hanna





Re: [Virtio-fs] (no subject)

2023-10-09 Thread Hanna Czenczek

On 09.10.23 10:21, Hanna Czenczek wrote:

On 07.10.23 04:22, Yajun Wu wrote:


[...]

The main motivation for adding VHOST_USER_SET_STATUS is to let the DPDK
backend know when the DRIVER_OK bit is valid.  It's an indication that
all VQ configuration has been sent; otherwise DPDK has to rely on the
first queue pair being ready, then receive/apply VQ configuration one by
one.

During live migration, configuring VQs one by one is very time-consuming.


One question I have here is why it wasn’t then introduced in the live 
migration code, but in the general VM stop/cont code instead. It does 
seem time-consuming to do this every time the VM is paused and resumed.



For VIRTIO net vDPA, HW needs to know how many VQs are enabled to set
RSS (Receive-Side Scaling).


If you don’t want the SET_STATUS message, the backend can remove the
protocol feature bit VHOST_USER_PROTOCOL_F_STATUS.


The problem isn’t back-ends that don’t want the message; the problem is
that qemu uses the message wrongly, which prevents well-behaving
back-ends from implementing the message.


DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device 
close/reset.


So the right thing to do for back-ends is to announce STATUS support 
and then not implement it correctly?


GET_VRING_BASE should not close or reset the device, by the way.  It
should stop that one vring, not more.  We have a RESET_DEVICE command
for resetting.


I'm not involved in the discussion about adding SET_STATUS to the vhost
protocol.  This feature is essential for vDPA (just as vhost-vdpa
implements VHOST_VDPA_SET_STATUS).


So what I gather from your response is that there is only a single use
for SET_STATUS, which is the DRIVER_OK bit.  If so, documenting that all
other bits are to be ignored by both back-end and front-end would be
fine by me.


I’m not fully serious about that suggestion, but I hear the strong 
implication that nothing but DRIVER_OK was of any concern, and this is 
really important to note when we talk about the status of the STATUS 
feature in vhost today.  It seems to me now that it was not intended 
to be the virtio-level status byte, but just a DRIVER_OK signalling 
path from front-end to back-end.  That makes it a vhost-level protocol 
feature to me.


On second thought, it just is a pure vhost-level protocol feature, and 
has nothing to do with the virtio status byte as-is.  The only stated 
purpose is for the front-end to send DRIVER_OK after migration, but 
migration is transparent to the guest, so the guest would never change 
the status byte during migration.  Therefore, if this feature is 
essential, we will never be able to have a status byte that is 
transparently shared between guest and back-end device, i.e. the virtio 
status byte.


Cc-ing Alex on this mail, because to me, this seems like an important 
detail when he plans on using the byte in the future.  If we need a 
virtio status byte, I can’t see how we could use the existing F_STATUS 
for it.


Hanna




Re: [Virtio-fs] (no subject)

2023-10-09 Thread Hanna Czenczek

On 07.10.23 04:22, Yajun Wu wrote:


On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:

External email: Use caution opening links or attachments


On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:

On 06.10.23 11:26, Michael S. Tsirkin wrote:

On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:

On 06.10.23 10:45, Michael S. Tsirkin wrote:

On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:

On 05.10.23 19:15, Michael S. Tsirkin wrote:

On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:

On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
There is no clearly defined purpose for the virtio status byte in
vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
protocol extension, it is possible for SET_FEATURES to return errors
(SET_PROTOCOL_FEATURES may be called before SET_FEATURES).

As for implementations, SET_STATUS is not widely implemented.  dpdk does
implement it, but only uses it to signal feature negotiation failure.
While it does log reset requests (SET_STATUS 0) as such, it effectively
ignores them, in contrast to RESET_OWNER (which is deprecated, and today
means the same thing as RESET_DEVICE).

While qemu superficially has support for [GS]ET_STATUS, it does not
forward the guest-set status byte, but instead just makes it up
internally, and actually completely ignores what the back-end returns,
only using it as the template for a subsequent SET_STATUS to add single
bits to it.  Notably, after setting FEATURES_OK, it never reads it back
to see whether the flag is still set, which is the only way in which
dpdk uses the status byte.

As-is, no front-end or back-end can rely on the other side handling this
field in a useful manner, and it also provides no practical use over
other mechanisms the vhost-user protocol has, which are more clearly
defined.  Deprecate it.

Suggested-by: Stefan Hajnoczi 
Signed-off-by: Hanna Czenczek 
---
 docs/interop/vhost-user.rst | 28 +---

 1 file changed, 21 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi 
SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
The fact that current backends never check errors does not mean they
never will.  So no, not applying this.
Can this not be done with REPLY_ACK?  I.e., with the following message
order:

1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
   present
2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
4. SET_FEATURES with need_reply

If not, the problem is that qemu has sent SET_STATUS 0 for a while when
the vCPUs are stopped, which generally seems to request a device reset.
If we don’t state at least that SET_STATUS 0 is to be ignored, back-ends
that will implement SET_STATUS later may break with at least these qemu
versions.  But documenting that a particular use of the status byte is
to be ignored would be really strange.

Hanna
Hmm I guess. Though just following virtio spec seems cleaner to me...

vhost-user reconfigures the state fully on start.
Not the internal device state, though.  virtiofsd has internal state,
and other devices like vhost-gpu back-ends would probably, too.

Stefan has recently sent a series
(https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html)
to put the reset (RESET_DEVICE) into virtio_reset() (when we really need
a reset).

I really don’t like our current approach with the status byte.
Following the virtio specification to me would mean that the guest
directly controls this byte, which it does not.  qemu makes up values as
it deems appropriate, and this includes sending a SET_STATUS 0 when the
guest is just paused, i.e. when the guest really doesn’t want a device
reset.

That means that qemu does not treat this as a virtio device field
(because that would mean exposing it to the guest driver), but instead
treats it as part of the vhost(-user) protocol.  It doesn’t feel right
to me that we use a virtio-defined feature for communication on the
vhost level, i.e. between front-end and back-end, and not between guest
driver and device.  I think all vhost-level protocol features should be
fully defined in the vhost-user specification, which REPLY_ACK is.

Hmm that makes sense. Maybe we should have done what stefan's patch
is doing.

Do look at the original commit that introduced it to understand why
it was added.
I don’t understand why this was added to the stop/cont code, though.  If
it is time-consuming to make these changes, why are they done every time
the VM is paused and resumed?  It makes sense that this would be done
for the initial configuration (where a reset also wouldn’t hurt), but
here it seems wrong.


(To be clear, a reset in the stop/cont code is wrong, because it breaks
stateful devices.)

Also, note

Re: [Virtio-fs] (no subject)

2023-10-09 Thread Hanna Czenczek

On 06.10.23 22:49, Alex Bennée wrote:

Hanna Czenczek  writes:


On 06.10.23 17:17, Alex Bennée wrote:

Hanna Czenczek  writes:


On 06.10.23 12:34, Michael S. Tsirkin wrote:

On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:

On 06.10.23 11:26, Michael S. Tsirkin wrote:

On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:

On 06.10.23 10:45, Michael S. Tsirkin wrote:

On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:

On 05.10.23 19:15, Michael S. Tsirkin wrote:

On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:

On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:



What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
devices that would implement them as per virtio spec, and even today it’s
broken for stateful devices.  The mentioned performance issue is likely
real, but we can’t address it by making up SET_STATUS calls that are wrong.

I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
final configuration that would happen upon a DRIVER_OK once the first vring
is started (i.e. receives a kick).  That has the added benefit of being
asynchronous because it doesn’t block any vhost-user messages (which are
synchronous, and thus block downtime).

Hanna

For better or worse kick is per ring. It's out of spec to start rings
that were not kicked but I guess you could do configuration ...
Seems somewhat asymmetrical though.

I meant to take the first ring being started as the signal to do the
global configuration, i.e. not do this once per vring, but once
globally.


Let's wait until next week, hopefully Yajun Wu will answer.

I mean, personally I don’t really care about the whole SET_STATUS
thing.  It’s clear that it’s broken for stateful devices.  The fact
that it took until 6f8be29ec17d to fix it for just any device that
would implement it according to spec to me is a strong indication that
nobody does implement it according to spec, and is currently only used
to signal to some specific back-end that all rings have been set up
and should be configured in a single block.

I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT
extensions where everything is off-loaded to the vhost-user backend.

How do these back-ends work with the fact that qemu uses SET_STATUS
incorrectly when not offloading?  Do you plan on fixing that?

Mainly having a common base implementation which does it right and
having very lightweight derivations for legacy stubs using it. The
aim is to eliminate the need for QEMU stubs entirely by fully specifying
the device from the vhost-user API.


If the current SET_STATUS use is overhauled, too, that would be good.  I 
wonder why you need the status byte, though.



(I.e. that we send SET_STATUS 0 when the VM is paused, potentially
resetting state that is not recoverable, and that we set DRIVER and
DRIVER_OK simultaneously.)

This is QEMU simulating a SET_STATUS rather than the guest triggering
it?


Yes, and the fact that we simulate it when the guest will not have 
triggered it, i.e. we reset the device (SET_STATUS 0) when the VM is 
paused.  Effectively, qemu injects virtio commands that the guest has 
never requested, which generally feels like a bad idea, because qemu 
will need to get the device back to its previous state before the guest 
is resumed, which may or may not work.  Specifically, it won’t work for 
devices that have internal state.


Furthermore, we use SET_STATUS to set ACKNOWLEDGE | DRIVER | DRIVER_OK 
simultaneously, which is wrong.  ACKNOWLEDGE | DRIVER may perhaps be set 
simultaneously, but then comes feature negotiation (setting and checking 
FEATURES_OK), and then DRIVER_OK.


Finally, how the status byte is to be used is not noted in the 
vhost-user specification, which instead points to the virtio 
specification.  I think if we keep SET_STATUS, it must be documented how 
it interacts with other vhost-user commands.  For example, how the 
FEATURES_OK protocol described in the virtio specification interacts 
with GET_FEATURES/SET_FEATURES, or whether SET_STATUS 0 and RESET_DEVICE 
are equivalent.  Currently, the only implementation of SET_STATUS I know 
(DPDK) ignores SET_STATUS 0, i.e. doesn’t do a reset.  To me that 
indicates that the spec must be clear on what these status values mean 
with regards to the vhost-user protocol as a whole.


So every software implementation with STATUS support that I know 
implements SET_STATUS wrongly right now, and that’s a problem, because 
it prevents implementations like virtiofsd from doing so correctly.


Hanna




Re: [Virtio-fs] (no subject)

2023-10-06 Thread Hanna Czenczek

On 06.10.23 17:17, Alex Bennée wrote:

Hanna Czenczek  writes:


On 06.10.23 12:34, Michael S. Tsirkin wrote:

On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:

On 06.10.23 11:26, Michael S. Tsirkin wrote:

On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:

On 06.10.23 10:45, Michael S. Tsirkin wrote:

On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:

On 05.10.23 19:15, Michael S. Tsirkin wrote:

On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:

On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:



What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
devices that would implement them as per virtio spec, and even today it’s
broken for stateful devices.  The mentioned performance issue is likely
real, but we can’t address it by making up SET_STATUS calls that are wrong.

I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
final configuration that would happen upon a DRIVER_OK once the first vring
is started (i.e. receives a kick).  That has the added benefit of being
asynchronous because it doesn’t block any vhost-user messages (which are
synchronous, and thus block downtime).

Hanna

For better or worse kick is per ring. It's out of spec to start rings
that were not kicked but I guess you could do configuration ...
Seems somewhat asymmetrical though.

I meant to take the first ring being started as the signal to do the
global configuration, i.e. not do this once per vring, but once
globally.


Let's wait until next week, hopefully Yajun Wu will answer.

I mean, personally I don’t really care about the whole SET_STATUS
thing.  It’s clear that it’s broken for stateful devices.  The fact
that it took until 6f8be29ec17d to fix it for just any device that
would implement it according to spec to me is a strong indication that
nobody does implement it according to spec, and is currently only used
to signal to some specific back-end that all rings have been set up
and should be configured in a single block.

I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT
extensions where everything is off-loaded to the vhost-user backend.


How do these back-ends work with the fact that qemu uses SET_STATUS 
incorrectly when not offloading?  Do you plan on fixing that?


(I.e. that we send SET_STATUS 0 when the VM is paused, potentially 
resetting state that is not recoverable, and that we set DRIVER and 
DRIVER_OK simultaneously.)


Hanna



