Re: [PATCH v3 0/6] block: add blk_io_plug_call() API

2023-05-31 Thread Stefan Hajnoczi
Hi Kevin,
Do you want to review the thread-local blk_io_plug() patch series or
should I merge it?

Thanks,
Stefan




[PATCH v3 5/6] block/linux-aio: convert to blk_io_plug_call() API

2023-05-30 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Note that a dev_max_batch check is dropped in laio_io_unplug() because
the semantics of unplug_fn() are different from .bdrv_co_unplug():
1. unplug_fn() is only called when the last blk_io_unplug() call occurs,
   not every time blk_io_unplug() is called.
2. unplug_fn() is per-thread, not per-BlockDriverState, so there is no
   way to get per-BlockDriverState fields like dev_max_batch.

Therefore this condition cannot be moved to laio_unplug_fn(). It is not
obvious that this condition affects performance in practice, so I am
removing it instead of trying to come up with a more complex mechanism
to preserve the condition.
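
Concretely, the converted submission path has roughly this shape (a sketch
based on the description above, not the verbatim hunks - the diff below is
truncated in this archive):

    /* Runs once per thread when the last blk_io_unplug() call occurs */
    static void laio_unplug_fn(void *opaque)
    {
        LinuxAioState *s = opaque;

        if (!s->io_q.blocked && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
            ioq_submit(s);
        }
    }

    /* In laio_do_submit(): no plugged counter left to check */
    if (!s->io_q.blocked) {
        if (s->io_q.in_queue >= laio_max_batch(s, dev_max_batch)) {
            ioq_submit(s);                       /* batch is full: submit now */
        } else {
            blk_io_plug_call(laio_unplug_fn, s); /* defer until unplug */
        }
    }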

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
---
 include/block/raw-aio.h |  7 ---
 block/file-posix.c  | 28 
 block/linux-aio.c   | 41 +++--
 3 files changed, 11 insertions(+), 65 deletions(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index da60ca13ef..0f63c2800c 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -62,13 +62,6 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
 
 void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
 void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
-
-/*
- * laio_io_plug/unplug work in the thread's current AioContext, therefore the
- * caller must ensure that they are paired in the same IOThread.
- */
-void laio_io_plug(void);
-void laio_io_unplug(uint64_t dev_max_batch);
 #endif
 /* io_uring.c - Linux io_uring implementation */
 #ifdef CONFIG_LINUX_IO_URING
diff --git a/block/file-posix.c b/block/file-posix.c
index 7baa8491dd..ac1ed54811 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2550,26 +2550,6 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
 return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
 }
 
-static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
-{
-BDRVRawState __attribute__((unused)) *s = bs->opaque;
-#ifdef CONFIG_LINUX_AIO
-if (s->use_linux_aio) {
-laio_io_plug();
-}
-#endif
-}
-
-static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
-{
-BDRVRawState __attribute__((unused)) *s = bs->opaque;
-#ifdef CONFIG_LINUX_AIO
-if (s->use_linux_aio) {
-laio_io_unplug(s->aio_max_batch);
-}
-#endif
-}
-
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
 {
 BDRVRawState *s = bs->opaque;
@@ -3914,8 +3894,6 @@ BlockDriver bdrv_file = {
 .bdrv_co_copy_range_from = raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = raw_co_copy_range_to,
 .bdrv_refresh_limits = raw_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4286,8 +4264,6 @@ static BlockDriver bdrv_host_device = {
 .bdrv_co_copy_range_from = raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = raw_co_copy_range_to,
 .bdrv_refresh_limits = raw_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4424,8 +4400,6 @@ static BlockDriver bdrv_host_cdrom = {
 .bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
 .bdrv_refresh_limits= cdrom_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4552,8 +4526,6 @@ static BlockDriver bdrv_host_cdrom = {
 .bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
 .bdrv_refresh_limits= cdrom_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 442c86209b..5021aed68f 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -15,6 +15,7 @@
 #include "qemu/event_notifier.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "sysemu/block-backend.h"
 
 /* Only used for assertions.  */
 #include "qemu/coroutine_int.h"
@@ -46,7 +47,6 @@ struct qemu_laiocb {
 };
 
 typedef struct {
-int plugged;
 unsigned int in_queue;
 unsigned int in_flight;
 bool blocked;
@@ -236,7 +236,7 @@ static void qemu_laio_process_completions_and_submit(LinuxAioState *s)

[PATCH v3 1/6] block: add blk_io_plug_call() API

2023-05-30 Thread Stefan Hajnoczi
Introduce a new API for thread-local blk_io_plug() that does not
traverse the block graph. The goal is to make blk_io_plug() multi-queue
friendly.

Instead of having block drivers track whether or not we're in a plugged
section, provide an API that allows them to defer a function call until
we're unplugged: blk_io_plug_call(fn, opaque). If blk_io_plug_call() is
called multiple times with the same fn/opaque pair, then fn() is only
called once at the end of the plugged region - resulting in batching.

This patch introduces the API and changes blk_io_plug()/blk_io_unplug().
blk_io_plug()/blk_io_unplug() no longer require a BlockBackend argument
because the plug state is now thread-local.

Later patches convert block drivers to blk_io_plug_call() and then we
can finally remove .bdrv_co_io_plug() once all block drivers have been
converted.
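
As an example, a device emulation thread batches several requests like this
(hypothetical caller code - num_reqs, reqs and write_done_cb are
illustrative, not part of this patch):

    blk_io_plug(); /* thread-local; note: no BlockBackend argument anymore */

    for (int i = 0; i < num_reqs; i++) {
        /* drivers internally call blk_io_plug_call(submit_fn, state)... */
        blk_aio_pwritev(blk, reqs[i].offset, &reqs[i].qiov, 0,
                        write_done_cb, &reqs[i]);
    }

    blk_io_unplug(); /* ...so submit_fn(state) runs once here, as a batch */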

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Reviewed-by: Stefano Garzarella 
---
v2
- "is not be freed" -> "is not freed" [Eric]
---
 MAINTAINERS   |   1 +
 include/sysemu/block-backend-io.h |  13 +--
 block/block-backend.c |  22 -
 block/plug.c  | 159 ++
 hw/block/dataplane/xen-block.c|   8 +-
 hw/block/virtio-blk.c |   4 +-
 hw/scsi/virtio-scsi.c |   6 +-
 block/meson.build |   1 +
 8 files changed, 173 insertions(+), 41 deletions(-)
 create mode 100644 block/plug.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4b025a7b63..89f274f85e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2650,6 +2650,7 @@ F: util/aio-*.c
 F: util/aio-*.h
 F: util/fdmon-*.c
 F: block/io.c
+F: block/plug.c
 F: migration/block*
 F: include/block/aio.h
 F: include/block/aio-wait.h
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index d62a7ee773..be4dcef59d 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -100,16 +100,9 @@ void blk_iostatus_set_err(BlockBackend *blk, int error);
 int blk_get_max_iov(BlockBackend *blk);
 int blk_get_max_hw_iov(BlockBackend *blk);
 
-/*
- * blk_io_plug/unplug are thread-local operations. This means that multiple
- * IOThreads can simultaneously call plug/unplug, but the caller must ensure
- * that each unplug() is called in the same IOThread of the matching plug().
- */
-void coroutine_fn blk_co_io_plug(BlockBackend *blk);
-void co_wrapper blk_io_plug(BlockBackend *blk);
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk);
-void co_wrapper blk_io_unplug(BlockBackend *blk);
+void blk_io_plug(void);
+void blk_io_unplug(void);
+void blk_io_plug_call(void (*fn)(void *), void *opaque);
 
 AioContext *blk_get_aio_context(BlockBackend *blk);
 BlockAcctStats *blk_get_stats(BlockBackend *blk);
diff --git a/block/block-backend.c b/block/block-backend.c
index ca537cd0ad..1f1d226ba6 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2568,28 +2568,6 @@ void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify)
 notifier_list_add(&blk->insert_bs_notifiers, notify);
 }
 
-void coroutine_fn blk_co_io_plug(BlockBackend *blk)
-{
-BlockDriverState *bs = blk_bs(blk);
-IO_CODE();
-GRAPH_RDLOCK_GUARD();
-
-if (bs) {
-bdrv_co_io_plug(bs);
-}
-}
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk)
-{
-BlockDriverState *bs = blk_bs(blk);
-IO_CODE();
-GRAPH_RDLOCK_GUARD();
-
-if (bs) {
-bdrv_co_io_unplug(bs);
-}
-}
-
 BlockAcctStats *blk_get_stats(BlockBackend *blk)
 {
 IO_CODE();
diff --git a/block/plug.c b/block/plug.c
new file mode 100644
index 00..98a155d2f4
--- /dev/null
+++ b/block/plug.c
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Block I/O plugging
+ *
+ * Copyright Red Hat.
+ *
+ * This API defers a function call within a blk_io_plug()/blk_io_unplug()
+ * section, allowing multiple calls to batch up. This is a performance
+ * optimization that is used in the block layer to submit several I/O requests
+ * at once instead of individually:
+ *
+ *   blk_io_plug(); <-- start of plugged region
+ *   ...
+ *   blk_io_plug_call(my_func, my_obj); <-- deferred my_func(my_obj) call
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   ...
+ *   blk_io_unplug(); <-- end of plugged region, my_func(my_obj) is called once
+ *
+ * This code is actually generic and not tied to the block layer. If another
+ * subsystem needs this functionality, it could be renamed.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/coroutine-tls.h"
+#include "qemu/notify.h"
+#include "qemu/thread.h"
+#include "sysemu/block-backend.h"
+
+/* A function call that has been deferred until unplug() */
+typedef struct {
+void (*fn)(void *);
+void *opaque;
+} UnplugFn;
+
+/* Per-thread state */
+typedef struct {
+u
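
The archived message is cut off above. The remainder of block/plug.c
amounts to roughly the following sketch - consistent with the doc comment
and the UnplugFn type above, but not the verbatim patch; get_ptr_plug() is
assumed to be the accessor for the thread-local Plug state:

    typedef struct {
        unsigned count;       /* how many times has plug() been called? */
        GArray *unplug_fns;   /* deferred UnplugFn calls, deduplicated */
    } Plug;

    void blk_io_plug_call(void (*fn)(void *), void *opaque)
    {
        Plug *plug = get_ptr_plug();

        /* Call fn() immediately if we are not inside a plugged section */
        if (plug->count == 0) {
            fn(opaque);
            return;
        }

        /* Skip duplicates so each fn/opaque pair runs only once at unplug */
        GArray *array = plug->unplug_fns;
        for (guint i = 0; i < array->len; i++) {
            UnplugFn *call = &g_array_index(array, UnplugFn, i);
            if (call->fn == fn && call->opaque == opaque) {
                return;
            }
        }

        UnplugFn new_call = { .fn = fn, .opaque = opaque };
        g_array_append_val(array, new_call);
    }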

[PATCH v3 4/6] block/io_uring: convert to blk_io_plug_call() API

2023-05-30 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Reviewed-by: Stefano Garzarella 
---
v2
- Removed whitespace hunk [Eric]
---
 include/block/raw-aio.h |  7 ---
 block/file-posix.c  | 10 --
 block/io_uring.c| 44 -
 block/trace-events  |  5 ++---
 4 files changed, 19 insertions(+), 47 deletions(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 0fe85ade77..da60ca13ef 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -81,13 +81,6 @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
   QEMUIOVector *qiov, int type);
 void luring_detach_aio_context(LuringState *s, AioContext *old_context);
 void luring_attach_aio_context(LuringState *s, AioContext *new_context);
-
-/*
- * luring_io_plug/unplug work in the thread's current AioContext, therefore the
- * caller must ensure that they are paired in the same IOThread.
- */
-void luring_io_plug(void);
-void luring_io_unplug(void);
 #endif
 
 #ifdef _WIN32
diff --git a/block/file-posix.c b/block/file-posix.c
index 0ab158efba..7baa8491dd 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2558,11 +2558,6 @@ static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
 laio_io_plug();
 }
 #endif
-#ifdef CONFIG_LINUX_IO_URING
-if (s->use_linux_io_uring) {
-luring_io_plug();
-}
-#endif
 }
 
 static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
@@ -2573,11 +2568,6 @@ static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
 laio_io_unplug(s->aio_max_batch);
 }
 #endif
-#ifdef CONFIG_LINUX_IO_URING
-if (s->use_linux_io_uring) {
-luring_io_unplug();
-}
-#endif
 }
 
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
diff --git a/block/io_uring.c b/block/io_uring.c
index 82cab6a5bd..4e25b56c9c 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -16,6 +16,7 @@
 #include "block/raw-aio.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "sysemu/block-backend.h"
 #include "trace.h"
 
 /* Only used for assertions.  */
@@ -41,7 +42,6 @@ typedef struct LuringAIOCB {
 } LuringAIOCB;
 
 typedef struct LuringQueue {
-int plugged;
 unsigned int in_queue;
 unsigned int in_flight;
 bool blocked;
@@ -267,7 +267,7 @@ static void luring_process_completions_and_submit(LuringState *s)
 {
 luring_process_completions(s);
 
-if (!s->io_q.plugged && s->io_q.in_queue > 0) {
+if (s->io_q.in_queue > 0) {
 ioq_submit(s);
 }
 }
@@ -301,29 +301,17 @@ static void qemu_luring_poll_ready(void *opaque)
 static void ioq_init(LuringQueue *io_q)
 {
 QSIMPLEQ_INIT(&io_q->submit_queue);
-io_q->plugged = 0;
 io_q->in_queue = 0;
 io_q->in_flight = 0;
 io_q->blocked = false;
 }
 
-void luring_io_plug(void)
+static void luring_unplug_fn(void *opaque)
 {
-AioContext *ctx = qemu_get_current_aio_context();
-LuringState *s = aio_get_linux_io_uring(ctx);
-trace_luring_io_plug(s);
-s->io_q.plugged++;
-}
-
-void luring_io_unplug(void)
-{
-AioContext *ctx = qemu_get_current_aio_context();
-LuringState *s = aio_get_linux_io_uring(ctx);
-assert(s->io_q.plugged);
-trace_luring_io_unplug(s, s->io_q.blocked, s->io_q.plugged,
-   s->io_q.in_queue, s->io_q.in_flight);
-if (--s->io_q.plugged == 0 &&
-!s->io_q.blocked && s->io_q.in_queue > 0) {
+LuringState *s = opaque;
+trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
+   s->io_q.in_flight);
+if (!s->io_q.blocked && s->io_q.in_queue > 0) {
 ioq_submit(s);
 }
 }
@@ -370,14 +358,16 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
 
 QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
 s->io_q.in_queue++;
-trace_luring_do_submit(s, s->io_q.blocked, s->io_q.plugged,
-   s->io_q.in_queue, s->io_q.in_flight);
-if (!s->io_q.blocked &&
-(!s->io_q.plugged ||
- s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES)) {
-ret = ioq_submit(s);
-trace_luring_do_submit_done(s, ret);
-return ret;
+trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
+   s->io_q.in_flight);
+if (!s->io_q.blocked) {
+if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
+ret = ioq_submit(s);
+trace_luring_do_submit_done(s, ret);
+return ret;
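
The archive cuts the diff off here. Based on the surrounding lines, the tail
of luring_do_submit() reduces to roughly this sketch (restating the visible
"+" lines plus the deferred call):

    if (!s->io_q.blocked) {
        if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
            ret = ioq_submit(s);
            trace_luring_do_submit_done(s, ret);
            return ret;
        }

        /* defer: luring_unplug_fn() runs when blk_io_unplug() is called */
        blk_io_plug_call(luring_unplug_fn, s);
    }

    return 0;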

[PATCH v3 6/6] block: remove bdrv_co_io_plug() API

2023-05-30 Thread Stefan Hajnoczi
No block driver implements .bdrv_co_io_plug() anymore. Get rid of the
function pointers.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Reviewed-by: Stefano Garzarella 
---
 include/block/block-io.h |  3 ---
 include/block/block_int-common.h | 11 --
 block/io.c   | 37 
 3 files changed, 51 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index a27e471a87..43af816d75 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -259,9 +259,6 @@ void coroutine_fn bdrv_co_leave(BlockDriverState *bs, AioContext *old_ctx);
 
 AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c);
 
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_plug(BlockDriverState *bs);
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_unplug(BlockDriverState *bs);
-
 bool coroutine_fn GRAPH_RDLOCK
 bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
uint32_t granularity, Error **errp);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 6492a1e538..958962aa3a 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -753,11 +753,6 @@ struct BlockDriver {
 void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_debug_event)(
 BlockDriverState *bs, BlkdebugEvent event);
 
-/* io queue for linux-aio */
-void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_plug)(BlockDriverState *bs);
-void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_unplug)(
-BlockDriverState *bs);
-
 /**
  * bdrv_drain_begin is called if implemented in the beginning of a
  * drain operation to drain and stop any internal sources of requests in
@@ -1227,12 +1222,6 @@ struct BlockDriverState {
 unsigned int in_flight;
 unsigned int serialising_in_flight;
 
-/*
- * counter for nested bdrv_io_plug.
- * Accessed with atomic ops.
- */
-unsigned io_plugged;
-
 /* do we need to tell the quest if we have a volatile write cache? */
 int enable_write_cache;
 
diff --git a/block/io.c b/block/io.c
index 4d54fda593..56b0c1ce6c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3219,43 +3219,6 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size)
 return mem;
 }
 
-void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs)
-{
-BdrvChild *child;
-IO_CODE();
-assert_bdrv_graph_readable();
-
-QLIST_FOREACH(child, &bs->children, next) {
-bdrv_co_io_plug(child->bs);
-}
-
-if (qatomic_fetch_inc(&bs->io_plugged) == 0) {
-BlockDriver *drv = bs->drv;
-if (drv && drv->bdrv_co_io_plug) {
-drv->bdrv_co_io_plug(bs);
-}
-}
-}
-
-void coroutine_fn bdrv_co_io_unplug(BlockDriverState *bs)
-{
-BdrvChild *child;
-IO_CODE();
-assert_bdrv_graph_readable();
-
-assert(bs->io_plugged);
-if (qatomic_fetch_dec(&bs->io_plugged) == 1) {
-BlockDriver *drv = bs->drv;
-if (drv && drv->bdrv_co_io_unplug) {
-drv->bdrv_co_io_unplug(bs);
-}
-}
-
-QLIST_FOREACH(child, &bs->children, next) {
-bdrv_co_io_unplug(child->bs);
-}
-}
-
 /* Helper that undoes bdrv_register_buf() when it fails partway through */
 static void GRAPH_RDLOCK
 bdrv_register_buf_rollback(BlockDriverState *bs, void *host, size_t size,
-- 
2.40.1




[PATCH v3 3/6] block/blkio: convert to blk_io_plug_call() API

2023-05-30 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Reviewed-by: Stefano Garzarella 
---
v2
- Add missing #include and fix blkio_unplug_fn() prototype [Stefano]
---
 block/blkio.c | 43 ---
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 0cdc99a729..93c6d20d39 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -17,6 +17,7 @@
 #include "qemu/error-report.h"
 #include "qapi/qmp/qdict.h"
 #include "qemu/module.h"
+#include "sysemu/block-backend.h"
 #include "exec/memory.h" /* for ram_block_discard_disable() */
 
 #include "block/block-io.h"
@@ -325,16 +326,30 @@ static void blkio_detach_aio_context(BlockDriverState *bs)
false, NULL, NULL, NULL, NULL, NULL);
 }
 
-/* Call with s->blkio_lock held to submit I/O after enqueuing a new request */
-static void blkio_submit_io(BlockDriverState *bs)
+/*
+ * Called by blk_io_unplug() or immediately if not plugged. Called without
+ * blkio_lock.
+ */
+static void blkio_unplug_fn(void *opaque)
 {
-if (qatomic_read(&bs->io_plugged) == 0) {
-BDRVBlkioState *s = bs->opaque;
+BDRVBlkioState *s = opaque;
 
+WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
 }
 }
 
+/*
+ * Schedule I/O submission after enqueuing a new request. Called without
+ * blkio_lock.
+ */
+static void blkio_submit_io(BlockDriverState *bs)
+{
+BDRVBlkioState *s = bs->opaque;
+
+blk_io_plug_call(blkio_unplug_fn, s);
+}
+
 static int coroutine_fn
 blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
@@ -345,9 +360,9 @@ blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_discard(s->blkioq, offset, bytes, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
@@ -378,9 +393,9 @@ blkio_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_readv(s->blkioq, offset, iov, iovcnt, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 
 if (use_bounce_buffer) {
@@ -423,9 +438,9 @@ static int coroutine_fn blkio_co_pwritev(BlockDriverState *bs, int64_t offset,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_writev(s->blkioq, offset, iov, iovcnt, &cod, blkio_flags);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 
 if (use_bounce_buffer) {
@@ -444,9 +459,9 @@ static int coroutine_fn blkio_co_flush(BlockDriverState *bs)
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_flush(s->blkioq, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
@@ -472,22 +487,13 @@ static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_write_zeroes(s->blkioq, offset, bytes, &cod, blkio_flags);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
 
-static void coroutine_fn blkio_co_io_unplug(BlockDriverState *bs)
-{
-BDRVBlkioState *s = bs->opaque;
-
-WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
-blkio_submit_io(bs);
-}
-}
-
 typedef enum {
 BMRR_OK,
 BMRR_SKIP,
@@ -1009,7 +1015,6 @@ static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
 .bdrv_co_pwritev = blkio_co_pwritev, \
 .bdrv_co_flush_to_disk   = blkio_co_flush, \
 .bdrv_co_pwrite_zeroes   = blkio_co_pwrite_zeroes, \
-.bdrv_co_io_unplug   = blkio_co_io_unplug, \
 .bdrv_refresh_limits = blkio_refresh_limits, \
 .bdrv_register_buf   = blkio_register_buf, \
 .bdrv_unregister_buf = blkio_unregister_buf, \
-- 
2.40.1




[PATCH v3 2/6] block/nvme: convert to blk_io_plug_call() API

2023-05-30 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Reviewed-by: Stefano Garzarella 
---
v2
- Remove unused nvme_process_completion_queue_plugged trace event
  [Stefano]
---
 block/nvme.c   | 44 
 block/trace-events |  1 -
 2 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 5b744c2bda..100b38b592 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -25,6 +25,7 @@
 #include "qemu/vfio-helpers.h"
 #include "block/block-io.h"
 #include "block/block_int.h"
+#include "sysemu/block-backend.h"
 #include "sysemu/replay.h"
 #include "trace.h"
 
@@ -119,7 +120,6 @@ struct BDRVNVMeState {
 int blkshift;
 
 uint64_t max_transfer;
-bool plugged;
 
 bool supports_write_zeroes;
 bool supports_discard;
@@ -282,7 +282,7 @@ static void nvme_kick(NVMeQueuePair *q)
 {
 BDRVNVMeState *s = q->s;
 
-if (s->plugged || !q->need_kick) {
+if (!q->need_kick) {
 return;
 }
 trace_nvme_kick(s, q->index);
@@ -387,10 +387,6 @@ static bool nvme_process_completion(NVMeQueuePair *q)
 NvmeCqe *c;
 
 trace_nvme_process_completion(s, q->index, q->inflight);
-if (s->plugged) {
-trace_nvme_process_completion_queue_plugged(s, q->index);
-return false;
-}
 
 /*
  * Support re-entrancy when a request cb() function invokes aio_poll().
@@ -480,6 +476,15 @@ static void nvme_trace_command(const NvmeCmd *cmd)
 }
 }
 
+static void nvme_unplug_fn(void *opaque)
+{
+NVMeQueuePair *q = opaque;
+
+QEMU_LOCK_GUARD(&q->lock);
+nvme_kick(q);
+nvme_process_completion(q);
+}
+
 static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
 NvmeCmd *cmd, BlockCompletionFunc cb,
 void *opaque)
@@ -496,8 +501,7 @@ static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
 q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
 q->need_kick++;
-nvme_kick(q);
-nvme_process_completion(q);
+blk_io_plug_call(nvme_unplug_fn, q);
 qemu_mutex_unlock(&q->lock);
 }
 
@@ -1567,27 +1571,6 @@ static void nvme_attach_aio_context(BlockDriverState *bs,
 }
 }
 
-static void coroutine_fn nvme_co_io_plug(BlockDriverState *bs)
-{
-BDRVNVMeState *s = bs->opaque;
-assert(!s->plugged);
-s->plugged = true;
-}
-
-static void coroutine_fn nvme_co_io_unplug(BlockDriverState *bs)
-{
-BDRVNVMeState *s = bs->opaque;
-assert(s->plugged);
-s->plugged = false;
-for (unsigned i = INDEX_IO(0); i < s->queue_count; i++) {
-NVMeQueuePair *q = s->queues[i];
-qemu_mutex_lock(&q->lock);
-nvme_kick(q);
-nvme_process_completion(q);
-qemu_mutex_unlock(&q->lock);
-}
-}
-
 static bool nvme_register_buf(BlockDriverState *bs, void *host, size_t size,
   Error **errp)
 {
@@ -1664,9 +1647,6 @@ static BlockDriver bdrv_nvme = {
 .bdrv_detach_aio_context  = nvme_detach_aio_context,
 .bdrv_attach_aio_context  = nvme_attach_aio_context,
 
-.bdrv_co_io_plug  = nvme_co_io_plug,
-.bdrv_co_io_unplug= nvme_co_io_unplug,
-
 .bdrv_register_buf= nvme_register_buf,
 .bdrv_unregister_buf  = nvme_unregister_buf,
 };
diff --git a/block/trace-events b/block/trace-events
index 32665158d6..048ad27519 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -141,7 +141,6 @@ nvme_kick(void *s, unsigned q_index) "s %p q #%u"
 nvme_dma_flush_queue_wait(void *s) "s %p"
 nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
 nvme_process_completion(void *s, unsigned q_index, int inflight) "s %p q #%u inflight %d"
-nvme_process_completion_queue_plugged(void *s, unsigned q_index) "s %p q #%u"
 nvme_complete_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d"
 nvme_submit_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d"
 nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x"
-- 
2.40.1




[PATCH v3 0/6] block: add blk_io_plug_call() API

2023-05-30 Thread Stefan Hajnoczi
v3
- Patch 5: Mention why dev_max_batch condition was dropped [Stefano]
v2
- Patch 1: "is not be freed" -> "is not freed" [Eric]
- Patch 2: Remove unused nvme_process_completion_queue_plugged trace event
  [Stefano]
- Patch 3: Add missing #include and fix blkio_unplug_fn() prototype [Stefano]
- Patch 4: Removed whitespace hunk [Eric]

The existing blk_io_plug() API is not block layer multi-queue friendly because
the plug state is per-BlockDriverState.

Change blk_io_plug()'s implementation so it is thread-local. This is done by
introducing the blk_io_plug_call() function that block drivers use to batch
calls while plugged. It is relatively easy to convert block drivers from
.bdrv_co_io_plug() to blk_io_plug_call().
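
The conversion pattern looks like this (a sketch with hypothetical names -
my_unplug_fn, MyDriverState and my_submit_queued_requests() stand in for
each driver's specifics):

    /* before: .bdrv_co_io_plug()/.bdrv_co_io_unplug() callbacks + counter */
    /* after: one deferred call in the request submission path */
    static void my_unplug_fn(void *opaque)
    {
        MyDriverState *s = opaque;

        my_submit_queued_requests(s); /* e.g. ring a doorbell, io_submit() */
    }

    /* when enqueuing a request: */
    blk_io_plug_call(my_unplug_fn, s); /* runs once per plugged section */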

Random read 4KB performance with virtio-blk on a host NVMe block device:

iodepth   iops     change vs today
1          45612   -4%
2          87967   +2%
4         129872   +0%
8         171096   -3%
16        194508   -4%
32        208947   -1%
64        217647   +0%
128       229629   +0%

The results are within the noise for these benchmarks. This is to be expected
because the plugging behavior for a single thread hasn't changed in this patch
series; only the location of the plug state has changed - it is now
thread-local.

The following graph compares several approaches:
https://vmsplice.net/~stefan/blk_io_plug-thread-local.png
- v7.2.0: before most of the multi-queue block layer changes landed.
- with-blk_io_plug: today's post-8.0.0 QEMU.
- blk_io_plug-thread-local: this patch series.
- no-blk_io_plug: what happens when we simply remove plugging?
- call-after-dispatch: what if we integrate plugging into the event loop? I
  decided against this approach in the end because it's more likely to
  introduce performance regressions since I/O submission is deferred until the
  end of the event loop iteration.

Aside from the no-blk_io_plug case, which bottlenecks much earlier than the
others, we see that all plugging approaches are more or less equivalent in this
benchmark. It is also clear that QEMU 8.0.0 has lower performance than 7.2.0.

The Ansible playbook, fio results, and a Jupyter notebook are available here:
https://github.com/stefanha/qemu-perf/tree/remove-blk_io_plug

Stefan Hajnoczi (6):
  block: add blk_io_plug_call() API
  block/nvme: convert to blk_io_plug_call() API
  block/blkio: convert to blk_io_plug_call() API
  block/io_uring: convert to blk_io_plug_call() API
  block/linux-aio: convert to blk_io_plug_call() API
  block: remove bdrv_co_io_plug() API

 MAINTAINERS   |   1 +
 include/block/block-io.h  |   3 -
 include/block/block_int-common.h  |  11 ---
 include/block/raw-aio.h   |  14 ---
 include/sysemu/block-backend-io.h |  13 +--
 block/blkio.c |  43 
 block/block-backend.c |  22 -
 block/file-posix.c|  38 ---
 block/io.c|  37 ---
 block/io_uring.c  |  44 -
 block/linux-aio.c |  41 +++-
 block/nvme.c  |  44 +++--
 block/plug.c  | 159 ++
 hw/block/dataplane/xen-block.c|   8 +-
 hw/block/virtio-blk.c |   4 +-
 hw/scsi/virtio-scsi.c |   6 +-
 block/meson.build |   1 +
 block/trace-events|   6 +-
 18 files changed, 239 insertions(+), 256 deletions(-)
 create mode 100644 block/plug.c

-- 
2.40.1




Re: [PATCH v2 5/6] block/linux-aio: convert to blk_io_plug_call() API

2023-05-30 Thread Stefan Hajnoczi
On Mon, May 29, 2023 at 10:50:34AM +0200, Stefano Garzarella wrote:
> On Wed, May 24, 2023 at 03:36:34PM -0400, Stefan Hajnoczi wrote:
> > On Wed, May 24, 2023 at 10:52:03AM +0200, Stefano Garzarella wrote:
> > > On Tue, May 23, 2023 at 01:12:59PM -0400, Stefan Hajnoczi wrote:
> > > > Stop using the .bdrv_co_io_plug() API because it is not multi-queue
> > > > block layer friendly. Use the new blk_io_plug_call() API to batch I/O
> > > > submission instead.
> > > >
> > > > Signed-off-by: Stefan Hajnoczi 
> > > > Reviewed-by: Eric Blake 
> > > > ---
> > > > include/block/raw-aio.h |  7 ---
> > > > block/file-posix.c  | 28 
> > > > block/linux-aio.c   | 41 +++--
> > > > 3 files changed, 11 insertions(+), 65 deletions(-)
> > > >
> > > > diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> > > > index da60ca13ef..0f63c2800c 100644
> > > > --- a/include/block/raw-aio.h
> > > > +++ b/include/block/raw-aio.h
> > > > @@ -62,13 +62,6 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
> > > >
> > > > void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
> > > > void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
> > > > -
> > > > -/*
> > > > - * laio_io_plug/unplug work in the thread's current AioContext, therefore the
> > > > - * caller must ensure that they are paired in the same IOThread.
> > > > - */
> > > > -void laio_io_plug(void);
> > > > -void laio_io_unplug(uint64_t dev_max_batch);
> > > > #endif
> > > > /* io_uring.c - Linux io_uring implementation */
> > > > #ifdef CONFIG_LINUX_IO_URING
> > > > diff --git a/block/file-posix.c b/block/file-posix.c
> > > > index 7baa8491dd..ac1ed54811 100644
> > > > --- a/block/file-posix.c
> > > > +++ b/block/file-posix.c
> > > > @@ -2550,26 +2550,6 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
> > > > return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
> > > > }
> > > >
> > > > -static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
> > > > -{
> > > > -BDRVRawState __attribute__((unused)) *s = bs->opaque;
> > > > -#ifdef CONFIG_LINUX_AIO
> > > > -if (s->use_linux_aio) {
> > > > -laio_io_plug();
> > > > -}
> > > > -#endif
> > > > -}
> > > > -
> > > > -static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
> > > > -{
> > > > -BDRVRawState __attribute__((unused)) *s = bs->opaque;
> > > > -#ifdef CONFIG_LINUX_AIO
> > > > -if (s->use_linux_aio) {
> > > > -laio_io_unplug(s->aio_max_batch);
> > > > -}
> > > > -#endif
> > > > -}
> > > > -
> > > > static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
> > > > {
> > > > BDRVRawState *s = bs->opaque;
> > > > @@ -3914,8 +3894,6 @@ BlockDriver bdrv_file = {
> > > > .bdrv_co_copy_range_from = raw_co_copy_range_from,
> > > > .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> > > > .bdrv_refresh_limits = raw_refresh_limits,
> > > > -.bdrv_co_io_plug= raw_co_io_plug,
> > > > -.bdrv_co_io_unplug  = raw_co_io_unplug,
> > > > .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> > > >
> > > > .bdrv_co_truncate   = raw_co_truncate,
> > > > @@ -4286,8 +4264,6 @@ static BlockDriver bdrv_host_device = {
> > > > .bdrv_co_copy_range_from = raw_co_copy_range_from,
> > > > .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> > > > .bdrv_refresh_limits = raw_refresh_limits,
> > > > -.bdrv_co_io_plug= raw_co_io_plug,
> > > > -.bdrv_co_io_unplug  = raw_co_io_unplug,
> > > > .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> > > >
> > > > .bdrv_co_truncate   = raw_co_truncate,
> > > > @@ -4424,8 +4400,6 @@ static BlockDriver bdrv_host_cdrom = {
> > > > .bdrv_co_pwritev= raw_co_pwritev,

Re: [PATCH v2 5/6] block/linux-aio: convert to blk_io_plug_call() API

2023-05-25 Thread Stefan Hajnoczi
On Wed, May 24, 2023 at 10:52:03AM +0200, Stefano Garzarella wrote:
> On Tue, May 23, 2023 at 01:12:59PM -0400, Stefan Hajnoczi wrote:
> > Stop using the .bdrv_co_io_plug() API because it is not multi-queue
> > block layer friendly. Use the new blk_io_plug_call() API to batch I/O
> > submission instead.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > Reviewed-by: Eric Blake 
> > ---
> > include/block/raw-aio.h |  7 ---
> > block/file-posix.c  | 28 
> > block/linux-aio.c   | 41 +++--
> > 3 files changed, 11 insertions(+), 65 deletions(-)
> > 
> > diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> > index da60ca13ef..0f63c2800c 100644
> > --- a/include/block/raw-aio.h
> > +++ b/include/block/raw-aio.h
> > @@ -62,13 +62,6 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
> > 
> > void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
> > void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
> > -
> > -/*
> > - * laio_io_plug/unplug work in the thread's current AioContext, therefore the
> > - * caller must ensure that they are paired in the same IOThread.
> > - */
> > -void laio_io_plug(void);
> > -void laio_io_unplug(uint64_t dev_max_batch);
> > #endif
> > /* io_uring.c - Linux io_uring implementation */
> > #ifdef CONFIG_LINUX_IO_URING
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 7baa8491dd..ac1ed54811 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -2550,26 +2550,6 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
> > return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
> > }
> > 
> > -static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
> > -{
> > -BDRVRawState __attribute__((unused)) *s = bs->opaque;
> > -#ifdef CONFIG_LINUX_AIO
> > -if (s->use_linux_aio) {
> > -laio_io_plug();
> > -}
> > -#endif
> > -}
> > -
> > -static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
> > -{
> > -BDRVRawState __attribute__((unused)) *s = bs->opaque;
> > -#ifdef CONFIG_LINUX_AIO
> > -if (s->use_linux_aio) {
> > -laio_io_unplug(s->aio_max_batch);
> > -}
> > -#endif
> > -}
> > -
> > static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
> > {
> > BDRVRawState *s = bs->opaque;
> > @@ -3914,8 +3894,6 @@ BlockDriver bdrv_file = {
> > .bdrv_co_copy_range_from = raw_co_copy_range_from,
> > .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> > .bdrv_refresh_limits = raw_refresh_limits,
> > -.bdrv_co_io_plug= raw_co_io_plug,
> > -.bdrv_co_io_unplug  = raw_co_io_unplug,
> > .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> > 
> > .bdrv_co_truncate   = raw_co_truncate,
> > @@ -4286,8 +4264,6 @@ static BlockDriver bdrv_host_device = {
> > .bdrv_co_copy_range_from = raw_co_copy_range_from,
> > .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> > .bdrv_refresh_limits = raw_refresh_limits,
> > -.bdrv_co_io_plug= raw_co_io_plug,
> > -.bdrv_co_io_unplug  = raw_co_io_unplug,
> > .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> > 
> > .bdrv_co_truncate   = raw_co_truncate,
> > @@ -4424,8 +4400,6 @@ static BlockDriver bdrv_host_cdrom = {
> > .bdrv_co_pwritev= raw_co_pwritev,
> > .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
> > .bdrv_refresh_limits= cdrom_refresh_limits,
> > -.bdrv_co_io_plug= raw_co_io_plug,
> > -.bdrv_co_io_unplug  = raw_co_io_unplug,
> > .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> > 
> > .bdrv_co_truncate   = raw_co_truncate,
> > @@ -4552,8 +4526,6 @@ static BlockDriver bdrv_host_cdrom = {
> > .bdrv_co_pwritev= raw_co_pwritev,
> > .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
> > .bdrv_refresh_limits= cdrom_refresh_limits,
> > -.bdrv_co_io_plug= raw_co_io_plug,
> > -.bdrv_co_io_unplug  = raw_co_io_unplug,
> > .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> > 
> > .bdrv_co_truncate   = raw_co_truncate,
> > diff --git a/block/linux-aio.c b/block/linux-aio.c
> > index 442c86209b..5021aed68f 100644

[PATCH v2 5/6] block/linux-aio: convert to blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
---
 include/block/raw-aio.h |  7 ---
 block/file-posix.c  | 28 
 block/linux-aio.c   | 41 +++--
 3 files changed, 11 insertions(+), 65 deletions(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index da60ca13ef..0f63c2800c 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -62,13 +62,6 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
 
 void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
 void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
-
-/*
- * laio_io_plug/unplug work in the thread's current AioContext, therefore the
- * caller must ensure that they are paired in the same IOThread.
- */
-void laio_io_plug(void);
-void laio_io_unplug(uint64_t dev_max_batch);
 #endif
 /* io_uring.c - Linux io_uring implementation */
 #ifdef CONFIG_LINUX_IO_URING
diff --git a/block/file-posix.c b/block/file-posix.c
index 7baa8491dd..ac1ed54811 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2550,26 +2550,6 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
 return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
 }
 
-static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
-{
-BDRVRawState __attribute__((unused)) *s = bs->opaque;
-#ifdef CONFIG_LINUX_AIO
-if (s->use_linux_aio) {
-laio_io_plug();
-}
-#endif
-}
-
-static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
-{
-BDRVRawState __attribute__((unused)) *s = bs->opaque;
-#ifdef CONFIG_LINUX_AIO
-if (s->use_linux_aio) {
-laio_io_unplug(s->aio_max_batch);
-}
-#endif
-}
-
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
 {
 BDRVRawState *s = bs->opaque;
@@ -3914,8 +3894,6 @@ BlockDriver bdrv_file = {
 .bdrv_co_copy_range_from = raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = raw_co_copy_range_to,
 .bdrv_refresh_limits = raw_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4286,8 +4264,6 @@ static BlockDriver bdrv_host_device = {
 .bdrv_co_copy_range_from = raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = raw_co_copy_range_to,
 .bdrv_refresh_limits = raw_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4424,8 +4400,6 @@ static BlockDriver bdrv_host_cdrom = {
 .bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
 .bdrv_refresh_limits= cdrom_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4552,8 +4526,6 @@ static BlockDriver bdrv_host_cdrom = {
 .bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
 .bdrv_refresh_limits= cdrom_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 442c86209b..5021aed68f 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -15,6 +15,7 @@
 #include "qemu/event_notifier.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "sysemu/block-backend.h"
 
 /* Only used for assertions.  */
 #include "qemu/coroutine_int.h"
@@ -46,7 +47,6 @@ struct qemu_laiocb {
 };
 
 typedef struct {
-int plugged;
 unsigned int in_queue;
 unsigned int in_flight;
 bool blocked;
@@ -236,7 +236,7 @@ static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
 {
 qemu_laio_process_completions(s);
 
-if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
+if (!QSIMPLEQ_EMPTY(&s->io_q.pending)) {
 ioq_submit(s);
 }
 }
@@ -277,7 +277,6 @@ static void qemu_laio_poll_ready(EventNotifier *opaque)
 static void ioq_init(LaioQueue *io_q)
 {
 QSIMPLEQ_INIT(&io_q->pending);
-io_q->plugged = 0;
 io_q->in_queue = 0;
 io_q->in_flight = 0;
 io_q->blocked = false;
@@ -354,31 +353,11 @@ static uint64_t laio_max_batch(LinuxAioState *s, uint64_t dev_max_batch)

[PATCH v2 6/6] block: remove bdrv_co_io_plug() API

2023-05-23 Thread Stefan Hajnoczi
No block driver implements .bdrv_co_io_plug() anymore. Get rid of the
function pointers.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
---
 include/block/block-io.h |  3 ---
 include/block/block_int-common.h | 11 --
 block/io.c   | 37 
 3 files changed, 51 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index a27e471a87..43af816d75 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -259,9 +259,6 @@ void coroutine_fn bdrv_co_leave(BlockDriverState *bs, AioContext *old_ctx);
 
 AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c);
 
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_plug(BlockDriverState *bs);
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_unplug(BlockDriverState *bs);
-
 bool coroutine_fn GRAPH_RDLOCK
 bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
uint32_t granularity, Error **errp);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 6492a1e538..958962aa3a 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -753,11 +753,6 @@ struct BlockDriver {
 void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_debug_event)(
 BlockDriverState *bs, BlkdebugEvent event);
 
-/* io queue for linux-aio */
-void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_plug)(BlockDriverState *bs);
-void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_unplug)(
-BlockDriverState *bs);
-
 /**
  * bdrv_drain_begin is called if implemented in the beginning of a
  * drain operation to drain and stop any internal sources of requests in
@@ -1227,12 +1222,6 @@ struct BlockDriverState {
 unsigned int in_flight;
 unsigned int serialising_in_flight;
 
-/*
- * counter for nested bdrv_io_plug.
- * Accessed with atomic ops.
- */
-unsigned io_plugged;
-
 /* do we need to tell the quest if we have a volatile write cache? */
 int enable_write_cache;
 
diff --git a/block/io.c b/block/io.c
index 4d54fda593..56b0c1ce6c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3219,43 +3219,6 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size)
 return mem;
 }
 
-void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs)
-{
-BdrvChild *child;
-IO_CODE();
-assert_bdrv_graph_readable();
-
-QLIST_FOREACH(child, &bs->children, next) {
-bdrv_co_io_plug(child->bs);
-}
-
-if (qatomic_fetch_inc(&bs->io_plugged) == 0) {
-BlockDriver *drv = bs->drv;
-if (drv && drv->bdrv_co_io_plug) {
-drv->bdrv_co_io_plug(bs);
-}
-}
-}
-
-void coroutine_fn bdrv_co_io_unplug(BlockDriverState *bs)
-{
-BdrvChild *child;
-IO_CODE();
-assert_bdrv_graph_readable();
-
-assert(bs->io_plugged);
-if (qatomic_fetch_dec(&bs->io_plugged) == 1) {
-BlockDriver *drv = bs->drv;
-if (drv && drv->bdrv_co_io_unplug) {
-drv->bdrv_co_io_unplug(bs);
-}
-}
-
-QLIST_FOREACH(child, &bs->children, next) {
-bdrv_co_io_unplug(child->bs);
-}
-}
-
 /* Helper that undoes bdrv_register_buf() when it fails partway through */
 static void GRAPH_RDLOCK
 bdrv_register_buf_rollback(BlockDriverState *bs, void *host, size_t size,
-- 
2.40.1




[PATCH v2 3/6] block/blkio: convert to blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
---
v2
- Add missing #include and fix blkio_unplug_fn() prototype [Stefano]
---
 block/blkio.c | 43 ---
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 0cdc99a729..93c6d20d39 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -17,6 +17,7 @@
 #include "qemu/error-report.h"
 #include "qapi/qmp/qdict.h"
 #include "qemu/module.h"
+#include "sysemu/block-backend.h"
 #include "exec/memory.h" /* for ram_block_discard_disable() */
 
 #include "block/block-io.h"
@@ -325,16 +326,30 @@ static void blkio_detach_aio_context(BlockDriverState *bs)
false, NULL, NULL, NULL, NULL, NULL);
 }
 
-/* Call with s->blkio_lock held to submit I/O after enqueuing a new request */
-static void blkio_submit_io(BlockDriverState *bs)
+/*
+ * Called by blk_io_unplug() or immediately if not plugged. Called without
+ * blkio_lock.
+ */
+static void blkio_unplug_fn(void *opaque)
 {
-if (qatomic_read(&bs->io_plugged) == 0) {
-BDRVBlkioState *s = bs->opaque;
+BDRVBlkioState *s = opaque;
 
+WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
 }
 }
 
+/*
+ * Schedule I/O submission after enqueuing a new request. Called without
+ * blkio_lock.
+ */
+static void blkio_submit_io(BlockDriverState *bs)
+{
+BDRVBlkioState *s = bs->opaque;
+
+blk_io_plug_call(blkio_unplug_fn, s);
+}
+
 static int coroutine_fn
 blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
@@ -345,9 +360,9 @@ blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_discard(s->blkioq, offset, bytes, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
@@ -378,9 +393,9 @@ blkio_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_readv(s->blkioq, offset, iov, iovcnt, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 
 if (use_bounce_buffer) {
@@ -423,9 +438,9 @@ static int coroutine_fn blkio_co_pwritev(BlockDriverState *bs, int64_t offset,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_writev(s->blkioq, offset, iov, iovcnt, &cod, blkio_flags);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 
 if (use_bounce_buffer) {
@@ -444,9 +459,9 @@ static int coroutine_fn blkio_co_flush(BlockDriverState *bs)
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_flush(s->blkioq, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
@@ -472,22 +487,13 @@ static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_write_zeroes(s->blkioq, offset, bytes, &cod, blkio_flags);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
 
-static void coroutine_fn blkio_co_io_unplug(BlockDriverState *bs)
-{
-BDRVBlkioState *s = bs->opaque;
-
-WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
-blkio_submit_io(bs);
-}
-}
-
 typedef enum {
 BMRR_OK,
 BMRR_SKIP,
@@ -1009,7 +1015,6 @@ static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
 .bdrv_co_pwritev = blkio_co_pwritev, \
 .bdrv_co_flush_to_disk   = blkio_co_flush, \
 .bdrv_co_pwrite_zeroes   = blkio_co_pwrite_zeroes, \
-.bdrv_co_io_unplug   = blkio_co_io_unplug, \
 .bdrv_refresh_limits = blkio_refresh_limits, \
 .bdrv_register_buf   = blkio_register_buf, \
 .bdrv_unregister_buf = blkio_unregister_buf, \
-- 
2.40.1




[PATCH v2 1/6] block: add blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
Introduce a new API for thread-local blk_io_plug() that does not
traverse the block graph. The goal is to make blk_io_plug() multi-queue
friendly.

Instead of having block drivers track whether or not we're in a plugged
section, provide an API that allows them to defer a function call until
we're unplugged: blk_io_plug_call(fn, opaque). If blk_io_plug_call() is
called multiple times with the same fn/opaque pair, then fn() is only
called once at the end of the plugged region - resulting in batching.

This patch introduces the API and changes blk_io_plug()/blk_io_unplug().
blk_io_plug()/blk_io_unplug() no longer require a BlockBackend argument
because the plug state is now thread-local.

Later patches convert block drivers to blk_io_plug_call() and then we
can finally remove .bdrv_co_io_plug() once all block drivers have been
converted.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
---
v2
- "is not be freed" -> "is not freed" [Eric]
---
 MAINTAINERS   |   1 +
 include/sysemu/block-backend-io.h |  13 +--
 block/block-backend.c |  22 -
 block/plug.c  | 159 ++
 hw/block/dataplane/xen-block.c|   8 +-
 hw/block/virtio-blk.c |   4 +-
 hw/scsi/virtio-scsi.c |   6 +-
 block/meson.build |   1 +
 8 files changed, 173 insertions(+), 41 deletions(-)
 create mode 100644 block/plug.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1b6466496d..2be6f0c26b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2646,6 +2646,7 @@ F: util/aio-*.c
 F: util/aio-*.h
 F: util/fdmon-*.c
 F: block/io.c
+F: block/plug.c
 F: migration/block*
 F: include/block/aio.h
 F: include/block/aio-wait.h
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index d62a7ee773..be4dcef59d 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -100,16 +100,9 @@ void blk_iostatus_set_err(BlockBackend *blk, int error);
 int blk_get_max_iov(BlockBackend *blk);
 int blk_get_max_hw_iov(BlockBackend *blk);
 
-/*
- * blk_io_plug/unplug are thread-local operations. This means that multiple
- * IOThreads can simultaneously call plug/unplug, but the caller must ensure
- * that each unplug() is called in the same IOThread of the matching plug().
- */
-void coroutine_fn blk_co_io_plug(BlockBackend *blk);
-void co_wrapper blk_io_plug(BlockBackend *blk);
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk);
-void co_wrapper blk_io_unplug(BlockBackend *blk);
+void blk_io_plug(void);
+void blk_io_unplug(void);
+void blk_io_plug_call(void (*fn)(void *), void *opaque);
 
 AioContext *blk_get_aio_context(BlockBackend *blk);
 BlockAcctStats *blk_get_stats(BlockBackend *blk);
diff --git a/block/block-backend.c b/block/block-backend.c
index ca537cd0ad..1f1d226ba6 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2568,28 +2568,6 @@ void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify)
 notifier_list_add(&blk->insert_bs_notifiers, notify);
 }
 
-void coroutine_fn blk_co_io_plug(BlockBackend *blk)
-{
-BlockDriverState *bs = blk_bs(blk);
-IO_CODE();
-GRAPH_RDLOCK_GUARD();
-
-if (bs) {
-bdrv_co_io_plug(bs);
-}
-}
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk)
-{
-BlockDriverState *bs = blk_bs(blk);
-IO_CODE();
-GRAPH_RDLOCK_GUARD();
-
-if (bs) {
-bdrv_co_io_unplug(bs);
-}
-}
-
 BlockAcctStats *blk_get_stats(BlockBackend *blk)
 {
 IO_CODE();
diff --git a/block/plug.c b/block/plug.c
new file mode 100644
index 00..98a155d2f4
--- /dev/null
+++ b/block/plug.c
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Block I/O plugging
+ *
+ * Copyright Red Hat.
+ *
+ * This API defers a function call within a blk_io_plug()/blk_io_unplug()
+ * section, allowing multiple calls to batch up. This is a performance
+ * optimization that is used in the block layer to submit several I/O requests
+ * at once instead of individually:
+ *
+ *   blk_io_plug(); <-- start of plugged region
+ *   ...
+ *   blk_io_plug_call(my_func, my_obj); <-- deferred my_func(my_obj) call
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   ...
+ *   blk_io_unplug(); <-- end of plugged region, my_func(my_obj) is called once
+ *
+ * This code is actually generic and not tied to the block layer. If another
+ * subsystem needs this functionality, it could be renamed.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/coroutine-tls.h"
+#include "qemu/notify.h"
+#include "qemu/thread.h"
+#include "sysemu/block-backend.h"
+
+/* A function call that has been deferred until unplug() */
+typedef struct {
+void (*fn)(void *);
+void *opaque;
+} UnplugFn;
+
+/* Per-thread state */
+typedef struct {
+unsigned count;   /*

[PATCH v2 2/6] block/nvme: convert to blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
---
v2
- Remove unused nvme_process_completion_queue_plugged trace event
  [Stefano]
---
 block/nvme.c   | 44 
 block/trace-events |  1 -
 2 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 5b744c2bda..100b38b592 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -25,6 +25,7 @@
 #include "qemu/vfio-helpers.h"
 #include "block/block-io.h"
 #include "block/block_int.h"
+#include "sysemu/block-backend.h"
 #include "sysemu/replay.h"
 #include "trace.h"
 
@@ -119,7 +120,6 @@ struct BDRVNVMeState {
 int blkshift;
 
 uint64_t max_transfer;
-bool plugged;
 
 bool supports_write_zeroes;
 bool supports_discard;
@@ -282,7 +282,7 @@ static void nvme_kick(NVMeQueuePair *q)
 {
 BDRVNVMeState *s = q->s;
 
-if (s->plugged || !q->need_kick) {
+if (!q->need_kick) {
 return;
 }
 trace_nvme_kick(s, q->index);
@@ -387,10 +387,6 @@ static bool nvme_process_completion(NVMeQueuePair *q)
 NvmeCqe *c;
 
 trace_nvme_process_completion(s, q->index, q->inflight);
-if (s->plugged) {
-trace_nvme_process_completion_queue_plugged(s, q->index);
-return false;
-}
 
 /*
  * Support re-entrancy when a request cb() function invokes aio_poll().
@@ -480,6 +476,15 @@ static void nvme_trace_command(const NvmeCmd *cmd)
 }
 }
 
+static void nvme_unplug_fn(void *opaque)
+{
+NVMeQueuePair *q = opaque;
+
+QEMU_LOCK_GUARD(&q->lock);
+nvme_kick(q);
+nvme_process_completion(q);
+}
+
 static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
 NvmeCmd *cmd, BlockCompletionFunc cb,
 void *opaque)
@@ -496,8 +501,7 @@ static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
 q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
 q->need_kick++;
-nvme_kick(q);
-nvme_process_completion(q);
+blk_io_plug_call(nvme_unplug_fn, q);
 qemu_mutex_unlock(&q->lock);
 }
 
@@ -1567,27 +1571,6 @@ static void nvme_attach_aio_context(BlockDriverState *bs,
 }
 }
 
-static void coroutine_fn nvme_co_io_plug(BlockDriverState *bs)
-{
-BDRVNVMeState *s = bs->opaque;
-assert(!s->plugged);
-s->plugged = true;
-}
-
-static void coroutine_fn nvme_co_io_unplug(BlockDriverState *bs)
-{
-BDRVNVMeState *s = bs->opaque;
-assert(s->plugged);
-s->plugged = false;
-for (unsigned i = INDEX_IO(0); i < s->queue_count; i++) {
-NVMeQueuePair *q = s->queues[i];
-qemu_mutex_lock(&q->lock);
-nvme_kick(q);
-nvme_process_completion(q);
-qemu_mutex_unlock(&q->lock);
-}
-}
-
 static bool nvme_register_buf(BlockDriverState *bs, void *host, size_t size,
   Error **errp)
 {
@@ -1664,9 +1647,6 @@ static BlockDriver bdrv_nvme = {
 .bdrv_detach_aio_context  = nvme_detach_aio_context,
 .bdrv_attach_aio_context  = nvme_attach_aio_context,
 
-.bdrv_co_io_plug  = nvme_co_io_plug,
-.bdrv_co_io_unplug= nvme_co_io_unplug,
-
 .bdrv_register_buf= nvme_register_buf,
 .bdrv_unregister_buf  = nvme_unregister_buf,
 };
diff --git a/block/trace-events b/block/trace-events
index 32665158d6..048ad27519 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -141,7 +141,6 @@ nvme_kick(void *s, unsigned q_index) "s %p q #%u"
 nvme_dma_flush_queue_wait(void *s) "s %p"
 nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
 nvme_process_completion(void *s, unsigned q_index, int inflight) "s %p q #%u inflight %d"
-nvme_process_completion_queue_plugged(void *s, unsigned q_index) "s %p q #%u"
 nvme_complete_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d"
 nvme_submit_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d"
 nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x"
-- 
2.40.1




[PATCH v2 4/6] block/io_uring: convert to blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
---
v2
- Removed whitespace hunk [Eric]
---
 include/block/raw-aio.h |  7 ---
 block/file-posix.c  | 10 --
 block/io_uring.c| 44 -
 block/trace-events  |  5 ++---
 4 files changed, 19 insertions(+), 47 deletions(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 0fe85ade77..da60ca13ef 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -81,13 +81,6 @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
   QEMUIOVector *qiov, int type);
 void luring_detach_aio_context(LuringState *s, AioContext *old_context);
 void luring_attach_aio_context(LuringState *s, AioContext *new_context);
-
-/*
- * luring_io_plug/unplug work in the thread's current AioContext, therefore the
- * caller must ensure that they are paired in the same IOThread.
- */
-void luring_io_plug(void);
-void luring_io_unplug(void);
 #endif
 
 #ifdef _WIN32
diff --git a/block/file-posix.c b/block/file-posix.c
index 0ab158efba..7baa8491dd 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2558,11 +2558,6 @@ static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
 laio_io_plug();
 }
 #endif
-#ifdef CONFIG_LINUX_IO_URING
-if (s->use_linux_io_uring) {
-luring_io_plug();
-}
-#endif
 }
 
 static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
@@ -2573,11 +2568,6 @@ static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
 laio_io_unplug(s->aio_max_batch);
 }
 #endif
-#ifdef CONFIG_LINUX_IO_URING
-if (s->use_linux_io_uring) {
-luring_io_unplug();
-}
-#endif
 }
 
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
diff --git a/block/io_uring.c b/block/io_uring.c
index 82cab6a5bd..4e25b56c9c 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -16,6 +16,7 @@
 #include "block/raw-aio.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "sysemu/block-backend.h"
 #include "trace.h"
 
 /* Only used for assertions.  */
@@ -41,7 +42,6 @@ typedef struct LuringAIOCB {
 } LuringAIOCB;
 
 typedef struct LuringQueue {
-int plugged;
 unsigned int in_queue;
 unsigned int in_flight;
 bool blocked;
@@ -267,7 +267,7 @@ static void luring_process_completions_and_submit(LuringState *s)
 {
 luring_process_completions(s);
 
-if (!s->io_q.plugged && s->io_q.in_queue > 0) {
+if (s->io_q.in_queue > 0) {
 ioq_submit(s);
 }
 }
@@ -301,29 +301,17 @@ static void qemu_luring_poll_ready(void *opaque)
 static void ioq_init(LuringQueue *io_q)
 {
 QSIMPLEQ_INIT(&io_q->submit_queue);
-io_q->plugged = 0;
 io_q->in_queue = 0;
 io_q->in_flight = 0;
 io_q->blocked = false;
 }
 
-void luring_io_plug(void)
+static void luring_unplug_fn(void *opaque)
 {
-AioContext *ctx = qemu_get_current_aio_context();
-LuringState *s = aio_get_linux_io_uring(ctx);
-trace_luring_io_plug(s);
-s->io_q.plugged++;
-}
-
-void luring_io_unplug(void)
-{
-AioContext *ctx = qemu_get_current_aio_context();
-LuringState *s = aio_get_linux_io_uring(ctx);
-assert(s->io_q.plugged);
-trace_luring_io_unplug(s, s->io_q.blocked, s->io_q.plugged,
-   s->io_q.in_queue, s->io_q.in_flight);
-if (--s->io_q.plugged == 0 &&
-!s->io_q.blocked && s->io_q.in_queue > 0) {
+LuringState *s = opaque;
+trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
+   s->io_q.in_flight);
+if (!s->io_q.blocked && s->io_q.in_queue > 0) {
 ioq_submit(s);
 }
 }
@@ -370,14 +358,16 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
 
 QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
 s->io_q.in_queue++;
-trace_luring_do_submit(s, s->io_q.blocked, s->io_q.plugged,
-   s->io_q.in_queue, s->io_q.in_flight);
-if (!s->io_q.blocked &&
-(!s->io_q.plugged ||
- s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES)) {
-ret = ioq_submit(s);
-trace_luring_do_submit_done(s, ret);
-return ret;
+trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
+   s->io_q.in_flight);
+if (!s->io_q.blocked) {
+if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
+ret = ioq_submit(s);
+trace_luring_do_submit_done(s, ret);
+return ret;
+}
+
+blk_io_plug_call(luring_unplug_fn, s);
+}

[PATCH v2 0/6] block: add blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
v2
- Patch 1: "is not be freed" -> "is not freed" [Eric]
- Patch 2: Remove unused nvme_process_completion_queue_plugged trace event
  [Stefano]
- Patch 3: Add missing #include and fix blkio_unplug_fn() prototype [Stefano]
- Patch 4: Removed whitespace hunk [Eric]

The existing blk_io_plug() API is not block layer multi-queue friendly because
the plug state is per-BlockDriverState.

Change blk_io_plug()'s implementation so it is thread-local. This is done by
introducing the blk_io_plug_call() function that block drivers use to batch
calls while plugged. It is relatively easy to convert block drivers from
.bdrv_co_io_plug() to blk_io_plug_call().
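
As a rough sketch of what such a conversion looks like (illustrative only;
MyDriverState, my_submit_batch() and my_enqueue_request() are made-up names,
not code from these patches):

/* Before: the driver keeps its own plug counter via .bdrv_co_io_plug() */
static void my_co_io_unplug(BlockDriverState *bs)
{
    MyDriverState *s = bs->opaque;
    if (--s->plugged == 0) {
        my_submit_batch(s);
    }
}

/* After: the driver defers a callback; blk_io_plug_call() deduplicates it */
static void my_unplug_fn(void *opaque)
{
    MyDriverState *s = opaque;
    my_submit_batch(s); /* runs once when the plugged section ends */
}

static void my_enqueue_request(MyDriverState *s)
{
    /* ...append the request to the driver's submission queue... */
    blk_io_plug_call(my_unplug_fn, s);
}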

Random read 4KB performance with virtio-blk on a host NVMe block device:

iodepth   iops     change vs today
1          45612   -4%
2          87967   +2%
4         129872   +0%
8         171096   -3%
16        194508   -4%
32        208947   -1%
64        217647   +0%
128       229629   +0%

The results are within the noise for these benchmarks. This is to be expected
because the plugging behavior for a single thread hasn't changed in this patch
series, only that the state is thread-local now.

The following graph compares several approaches:
https://vmsplice.net/~stefan/blk_io_plug-thread-local.png
- v7.2.0: before most of the multi-queue block layer changes landed.
- with-blk_io_plug: today's post-8.0.0 QEMU.
- blk_io_plug-thread-local: this patch series.
- no-blk_io_plug: what happens when we simply remove plugging?
- call-after-dispatch: what if we integrate plugging into the event loop? I
  decided against this approach in the end because it's more likely to
  introduce performance regressions since I/O submission is deferred until the
  end of the event loop iteration.

Aside from the no-blk_io_plug case, which bottlenecks much earlier than the
others, we see that all plugging approaches are more or less equivalent in this
benchmark. It is also clear that QEMU 8.0.0 has lower performance than 7.2.0.

The Ansible playbook, fio results, and a Jupyter notebook are available here:
https://github.com/stefanha/qemu-perf/tree/remove-blk_io_plug

Stefan Hajnoczi (6):
  block: add blk_io_plug_call() API
  block/nvme: convert to blk_io_plug_call() API
  block/blkio: convert to blk_io_plug_call() API
  block/io_uring: convert to blk_io_plug_call() API
  block/linux-aio: convert to blk_io_plug_call() API
  block: remove bdrv_co_io_plug() API

 MAINTAINERS   |   1 +
 include/block/block-io.h  |   3 -
 include/block/block_int-common.h  |  11 ---
 include/block/raw-aio.h   |  14 ---
 include/sysemu/block-backend-io.h |  13 +--
 block/blkio.c |  43 
 block/block-backend.c |  22 -
 block/file-posix.c|  38 ---
 block/io.c|  37 ---
 block/io_uring.c  |  44 -
 block/linux-aio.c |  41 +++-
 block/nvme.c  |  44 +++--
 block/plug.c  | 159 ++
 hw/block/dataplane/xen-block.c|   8 +-
 hw/block/virtio-blk.c |   4 +-
 hw/scsi/virtio-scsi.c |   6 +-
 block/meson.build |   1 +
 block/trace-events|   6 +-
 18 files changed, 239 insertions(+), 256 deletions(-)
 create mode 100644 block/plug.c

-- 
2.40.1




Re: [PATCH 3/6] block/blkio: convert to blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
On Fri, May 19, 2023 at 10:47:00AM +0200, Stefano Garzarella wrote:
> On Wed, May 17, 2023 at 06:10:19PM -0400, Stefan Hajnoczi wrote:
> > Stop using the .bdrv_co_io_plug() API because it is not multi-queue
> > block layer friendly. Use the new blk_io_plug_call() API to batch I/O
> > submission instead.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> > block/blkio.c | 40 +---
> > 1 file changed, 21 insertions(+), 19 deletions(-)
> 
> With this patch, the build fails in several places, maybe it's an old
> version:
> 
> ../block/blkio.c:347:5: error: implicit declaration of function
> ‘blk_io_plug_call’ [-Werror=implicit-function-declaration]
>   347 | blk_io_plug_call(blkio_unplug_fn, bs);

Will fix, thanks!

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 4/6] block/io_uring: convert to blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
On Thu, May 18, 2023 at 07:18:42PM -0500, Eric Blake wrote:
> On Wed, May 17, 2023 at 06:10:20PM -0400, Stefan Hajnoczi wrote:
> > Stop using the .bdrv_co_io_plug() API because it is not multi-queue
> > block layer friendly. Use the new blk_io_plug_call() API to batch I/O
> > submission instead.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >  include/block/raw-aio.h |  7 ---
> >  block/file-posix.c  | 10 -
> >  block/io_uring.c| 45 -
> >  block/trace-events  |  5 ++---
> >  4 files changed, 19 insertions(+), 48 deletions(-)
> > 
> 
> > @@ -337,7 +325,6 @@ void luring_io_unplug(void)
> >   * @type: type of request
> >   *
> >   * Fetches sqes from ring, adds to pending queue and preps them
> > - *
> >   */
> >  static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> >  uint64_t offset, int type)
> > @@ -370,14 +357,16 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> 
> Looks a bit like a stray hunk, but you are touching the function, so
> it's okay.

I'm respinning, so I'll drop this.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 1/6] block: add blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
On Thu, May 18, 2023 at 07:04:52PM -0500, Eric Blake wrote:
> On Wed, May 17, 2023 at 06:10:17PM -0400, Stefan Hajnoczi wrote:
> > Introduce a new API for thread-local blk_io_plug() that does not
> > traverse the block graph. The goal is to make blk_io_plug() multi-queue
> > friendly.
> > 
> > Instead of having block drivers track whether or not we're in a plugged
> > section, provide an API that allows them to defer a function call until
> > we're unplugged: blk_io_plug_call(fn, opaque). If blk_io_plug_call() is
> > called multiple times with the same fn/opaque pair, then fn() is only
> > called once at the end of the function - resulting in batching.
> > 
> > This patch introduces the API and changes blk_io_plug()/blk_io_unplug().
> > blk_io_plug()/blk_io_unplug() no longer require a BlockBackend argument
> > because the plug state is now thread-local.
> > 
> > Later patches convert block drivers to blk_io_plug_call() and then we
> > can finally remove .bdrv_co_io_plug() once all block drivers have been
> > converted.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> 
> > +++ b/block/plug.c
> > +
> > +/**
> > + * blk_io_plug_call:
> > + * @fn: a function pointer to be invoked
> > + * @opaque: a user-defined argument to @fn()
> > + *
> > + * Call @fn(@opaque) immediately if not within a blk_io_plug()/blk_io_unplug()
> > + * section.
> > + *
> > + * Otherwise defer the call until the end of the outermost
> > + * blk_io_plug()/blk_io_unplug() section in this thread. If the same
> > + * @fn/@opaque pair has already been deferred, it will only be called once upon
> > + * blk_io_unplug() so that accumulated calls are batched into a single call.
> > + *
> > + * The caller must ensure that @opaque is not be freed before @fn() is invoked.
> 
> s/be //

Will fix, thanks!

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 2/6] block/nvme: convert to blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
On Fri, May 19, 2023 at 10:46:25AM +0200, Stefano Garzarella wrote:
> On Wed, May 17, 2023 at 06:10:18PM -0400, Stefan Hajnoczi wrote:
> > Stop using the .bdrv_co_io_plug() API because it is not multi-queue
> > block layer friendly. Use the new blk_io_plug_call() API to batch I/O
> > submission instead.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> > block/nvme.c | 44 
> > 1 file changed, 12 insertions(+), 32 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 5b744c2bda..100b38b592 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -25,6 +25,7 @@
> > #include "qemu/vfio-helpers.h"
> > #include "block/block-io.h"
> > #include "block/block_int.h"
> > +#include "sysemu/block-backend.h"
> > #include "sysemu/replay.h"
> > #include "trace.h"
> > 
> > @@ -119,7 +120,6 @@ struct BDRVNVMeState {
> > int blkshift;
> > 
> > uint64_t max_transfer;
> > -bool plugged;
> > 
> > bool supports_write_zeroes;
> > bool supports_discard;
> > @@ -282,7 +282,7 @@ static void nvme_kick(NVMeQueuePair *q)
> > {
> > BDRVNVMeState *s = q->s;
> > 
> > -if (s->plugged || !q->need_kick) {
> > +if (!q->need_kick) {
> > return;
> > }
> > trace_nvme_kick(s, q->index);
> > @@ -387,10 +387,6 @@ static bool nvme_process_completion(NVMeQueuePair *q)
> > NvmeCqe *c;
> > 
> > trace_nvme_process_completion(s, q->index, q->inflight);
> > -if (s->plugged) {
> > -trace_nvme_process_completion_queue_plugged(s, q->index);
> 
> Should we remove "nvme_process_completion_queue_plugged(void *s,
> unsigned q_index) "s %p q #%u" from block/trace-events?

Will fix, thanks!

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 1/6] block: add blk_io_plug_call() API

2023-05-23 Thread Stefan Hajnoczi
On Fri, May 19, 2023 at 10:45:57AM +0200, Stefano Garzarella wrote:
> On Wed, May 17, 2023 at 06:10:17PM -0400, Stefan Hajnoczi wrote:
> > Introduce a new API for thread-local blk_io_plug() that does not
> > traverse the block graph. The goal is to make blk_io_plug() multi-queue
> > friendly.
> > 
> > Instead of having block drivers track whether or not we're in a plugged
> > section, provide an API that allows them to defer a function call until
> > we're unplugged: blk_io_plug_call(fn, opaque). If blk_io_plug_call() is
> > called multiple times with the same fn/opaque pair, then fn() is only
> > called once at the end of the function - resulting in batching.
> > 
> > This patch introduces the API and changes blk_io_plug()/blk_io_unplug().
> > blk_io_plug()/blk_io_unplug() no longer require a BlockBackend argument
> > because the plug state is now thread-local.
> > 
> > Later patches convert block drivers to blk_io_plug_call() and then we
> > can finally remove .bdrv_co_io_plug() once all block drivers have been
> > converted.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> > MAINTAINERS   |   1 +
> > include/sysemu/block-backend-io.h |  13 +--
> > block/block-backend.c |  22 -
> > block/plug.c  | 159 ++
> > hw/block/dataplane/xen-block.c|   8 +-
> > hw/block/virtio-blk.c |   4 +-
> > hw/scsi/virtio-scsi.c |   6 +-
> > block/meson.build |   1 +
> > 8 files changed, 173 insertions(+), 41 deletions(-)
> > create mode 100644 block/plug.c
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 50585117a0..574202295c 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -2644,6 +2644,7 @@ F: util/aio-*.c
> > F: util/aio-*.h
> > F: util/fdmon-*.c
> > F: block/io.c
> > +F: block/plug.c
> > F: migration/block*
> > F: include/block/aio.h
> > F: include/block/aio-wait.h
> > diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
> > index d62a7ee773..be4dcef59d 100644
> > --- a/include/sysemu/block-backend-io.h
> > +++ b/include/sysemu/block-backend-io.h
> > @@ -100,16 +100,9 @@ void blk_iostatus_set_err(BlockBackend *blk, int error);
> > int blk_get_max_iov(BlockBackend *blk);
> > int blk_get_max_hw_iov(BlockBackend *blk);
> > 
> > -/*
> > - * blk_io_plug/unplug are thread-local operations. This means that multiple
> > - * IOThreads can simultaneously call plug/unplug, but the caller must ensure
> > - * that each unplug() is called in the same IOThread of the matching plug().
> > - */
> > -void coroutine_fn blk_co_io_plug(BlockBackend *blk);
> > -void co_wrapper blk_io_plug(BlockBackend *blk);
> > -
> > -void coroutine_fn blk_co_io_unplug(BlockBackend *blk);
> > -void co_wrapper blk_io_unplug(BlockBackend *blk);
> > +void blk_io_plug(void);
> > +void blk_io_unplug(void);
> > +void blk_io_plug_call(void (*fn)(void *), void *opaque);
> > 
> > AioContext *blk_get_aio_context(BlockBackend *blk);
> > BlockAcctStats *blk_get_stats(BlockBackend *blk);
> > diff --git a/block/block-backend.c b/block/block-backend.c
> > index ca537cd0ad..1f1d226ba6 100644
> > --- a/block/block-backend.c
> > +++ b/block/block-backend.c
> > @@ -2568,28 +2568,6 @@ void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify)
> > notifier_list_add(&blk->insert_bs_notifiers, notify);
> > }
> > 
> > -void coroutine_fn blk_co_io_plug(BlockBackend *blk)
> > -{
> > -BlockDriverState *bs = blk_bs(blk);
> > -IO_CODE();
> > -GRAPH_RDLOCK_GUARD();
> > -
> > -if (bs) {
> > -bdrv_co_io_plug(bs);
> > -}
> > -}
> > -
> > -void coroutine_fn blk_co_io_unplug(BlockBackend *blk)
> > -{
> > -BlockDriverState *bs = blk_bs(blk);
> > -IO_CODE();
> > -GRAPH_RDLOCK_GUARD();
> > -
> > -if (bs) {
> > -bdrv_co_io_unplug(bs);
> > -}
> > -}
> > -
> > BlockAcctStats *blk_get_stats(BlockBackend *blk)
> > {
> > IO_CODE();
> > diff --git a/block/plug.c b/block/plug.c
> > new file mode 100644
> > index 00..6738a568ba
> > --- /dev/null
> > +++ b/block/plug.c
> > @@ -0,0 +1,159 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > +/*
> > + * Block I/O plugging
> > + *
> > + * Copyright Red Hat.
> > 

[PATCH 0/6] block: add blk_io_plug_call() API

2023-05-17 Thread Stefan Hajnoczi
The existing blk_io_plug() API is not block layer multi-queue friendly because
the plug state is per-BlockDriverState.

Change blk_io_plug()'s implementation so it is thread-local. This is done by
introducing the blk_io_plug_call() function that block drivers use to batch
calls while plugged. It is relatively easy to convert block drivers from
.bdrv_co_io_plug() to blk_io_plug_call().

Random read 4KB performance with virtio-blk on a host NVMe block device:

iodepth   iops     change vs today
1          45612   -4%
2          87967   +2%
4         129872   +0%
8         171096   -3%
16        194508   -4%
32        208947   -1%
64        217647   +0%
128       229629   +0%

The results are within the noise for these benchmarks. This is to be expected
because the plugging behavior for a single thread hasn't changed in this patch
series, only that the state is thread-local now.

The following graph compares several approaches:
https://vmsplice.net/~stefan/blk_io_plug-thread-local.png
- v7.2.0: before most of the multi-queue block layer changes landed.
- with-blk_io_plug: today's post-8.0.0 QEMU.
- blk_io_plug-thread-local: this patch series.
- no-blk_io_plug: what happens when we simply remove plugging?
- call-after-dispatch: what if we integrate plugging into the event loop? I
  decided against this approach in the end because it's more likely to
  introduce performance regressions since I/O submission is deferred until the
  end of the event loop iteration.

Aside from the no-blk_io_plug case, which bottlenecks much earlier than the
others, we see that all plugging approaches are more or less equivalent in this
benchmark. It is also clear that QEMU 8.0.0 has lower performance than 7.2.0.

The Ansible playbook, fio results, and a Jupyter notebook are available here:
https://github.com/stefanha/qemu-perf/tree/remove-blk_io_plug

Stefan Hajnoczi (6):
  block: add blk_io_plug_call() API
  block/nvme: convert to blk_io_plug_call() API
  block/blkio: convert to blk_io_plug_call() API
  block/io_uring: convert to blk_io_plug_call() API
  block/linux-aio: convert to blk_io_plug_call() API
  block: remove bdrv_co_io_plug() API

 MAINTAINERS   |   1 +
 include/block/block-io.h  |   3 -
 include/block/block_int-common.h  |  11 ---
 include/block/raw-aio.h   |  14 ---
 include/sysemu/block-backend-io.h |  13 +--
 block/blkio.c |  40 
 block/block-backend.c |  22 -
 block/file-posix.c|  38 ---
 block/io.c|  37 ---
 block/io_uring.c  |  45 -
 block/linux-aio.c |  41 +++-
 block/nvme.c  |  44 +++--
 block/plug.c  | 159 ++
 hw/block/dataplane/xen-block.c|   8 +-
 hw/block/virtio-blk.c |   4 +-
 hw/scsi/virtio-scsi.c |   6 +-
 block/meson.build |   1 +
 block/trace-events|   5 +-
 18 files changed, 236 insertions(+), 256 deletions(-)
 create mode 100644 block/plug.c

-- 
2.40.1




[PATCH 4/6] block/io_uring: convert to blk_io_plug_call() API

2023-05-17 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
---
 include/block/raw-aio.h |  7 ---
 block/file-posix.c  | 10 -
 block/io_uring.c| 45 -
 block/trace-events  |  5 ++---
 4 files changed, 19 insertions(+), 48 deletions(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 0fe85ade77..da60ca13ef 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -81,13 +81,6 @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
   QEMUIOVector *qiov, int type);
 void luring_detach_aio_context(LuringState *s, AioContext *old_context);
 void luring_attach_aio_context(LuringState *s, AioContext *new_context);
-
-/*
- * luring_io_plug/unplug work in the thread's current AioContext, therefore the
- * caller must ensure that they are paired in the same IOThread.
- */
-void luring_io_plug(void);
-void luring_io_unplug(void);
 #endif
 
 #ifdef _WIN32
diff --git a/block/file-posix.c b/block/file-posix.c
index 0ab158efba..7baa8491dd 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2558,11 +2558,6 @@ static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
 laio_io_plug();
 }
 #endif
-#ifdef CONFIG_LINUX_IO_URING
-if (s->use_linux_io_uring) {
-luring_io_plug();
-}
-#endif
 }
 
 static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
@@ -2573,11 +2568,6 @@ static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
 laio_io_unplug(s->aio_max_batch);
 }
 #endif
-#ifdef CONFIG_LINUX_IO_URING
-if (s->use_linux_io_uring) {
-luring_io_unplug();
-}
-#endif
 }
 
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
diff --git a/block/io_uring.c b/block/io_uring.c
index 82cab6a5bd..9a45e5fb8b 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -16,6 +16,7 @@
 #include "block/raw-aio.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "sysemu/block-backend.h"
 #include "trace.h"
 
 /* Only used for assertions.  */
@@ -41,7 +42,6 @@ typedef struct LuringAIOCB {
 } LuringAIOCB;
 
 typedef struct LuringQueue {
-int plugged;
 unsigned int in_queue;
 unsigned int in_flight;
 bool blocked;
@@ -267,7 +267,7 @@ static void luring_process_completions_and_submit(LuringState *s)
 {
 luring_process_completions(s);
 
-if (!s->io_q.plugged && s->io_q.in_queue > 0) {
+if (s->io_q.in_queue > 0) {
 ioq_submit(s);
 }
 }
@@ -301,29 +301,17 @@ static void qemu_luring_poll_ready(void *opaque)
 static void ioq_init(LuringQueue *io_q)
 {
 QSIMPLEQ_INIT(&io_q->submit_queue);
-io_q->plugged = 0;
 io_q->in_queue = 0;
 io_q->in_flight = 0;
 io_q->blocked = false;
 }
 
-void luring_io_plug(void)
+static void luring_unplug_fn(void *opaque)
 {
-AioContext *ctx = qemu_get_current_aio_context();
-LuringState *s = aio_get_linux_io_uring(ctx);
-trace_luring_io_plug(s);
-s->io_q.plugged++;
-}
-
-void luring_io_unplug(void)
-{
-AioContext *ctx = qemu_get_current_aio_context();
-LuringState *s = aio_get_linux_io_uring(ctx);
-assert(s->io_q.plugged);
-trace_luring_io_unplug(s, s->io_q.blocked, s->io_q.plugged,
-   s->io_q.in_queue, s->io_q.in_flight);
-if (--s->io_q.plugged == 0 &&
-!s->io_q.blocked && s->io_q.in_queue > 0) {
+LuringState *s = opaque;
+trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
+   s->io_q.in_flight);
+if (!s->io_q.blocked && s->io_q.in_queue > 0) {
 ioq_submit(s);
 }
 }
@@ -337,7 +325,6 @@ void luring_io_unplug(void)
  * @type: type of request
  *
  * Fetches sqes from ring, adds to pending queue and preps them
- *
  */
 static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
 uint64_t offset, int type)
@@ -370,14 +357,16 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
 
 QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
 s->io_q.in_queue++;
-trace_luring_do_submit(s, s->io_q.blocked, s->io_q.plugged,
-   s->io_q.in_queue, s->io_q.in_flight);
-if (!s->io_q.blocked &&
-(!s->io_q.plugged ||
- s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES)) {
-ret = ioq_submit(s);
-trace_luring_do_submit_done(s, ret);
-return ret;
+trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
+   s->io_q.in_flight);
+if (!s->io_q.blocked) {
+if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
+ret = ioq_submit(s);
+trace_luring_do_submit_done(s, ret);
+return ret;
+}
+
+blk_io_plug_call(luring_unplug_fn, s);
+}

[PATCH 5/6] block/linux-aio: convert to blk_io_plug_call() API

2023-05-17 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
---
 include/block/raw-aio.h |  7 ---
 block/file-posix.c  | 28 
 block/linux-aio.c   | 41 +++--
 3 files changed, 11 insertions(+), 65 deletions(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index da60ca13ef..0f63c2800c 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -62,13 +62,6 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
 
 void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
 void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
-
-/*
- * laio_io_plug/unplug work in the thread's current AioContext, therefore the
- * caller must ensure that they are paired in the same IOThread.
- */
-void laio_io_plug(void);
-void laio_io_unplug(uint64_t dev_max_batch);
 #endif
 /* io_uring.c - Linux io_uring implementation */
 #ifdef CONFIG_LINUX_IO_URING
diff --git a/block/file-posix.c b/block/file-posix.c
index 7baa8491dd..ac1ed54811 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2550,26 +2550,6 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
 return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
 }
 
-static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
-{
-BDRVRawState __attribute__((unused)) *s = bs->opaque;
-#ifdef CONFIG_LINUX_AIO
-if (s->use_linux_aio) {
-laio_io_plug();
-}
-#endif
-}
-
-static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
-{
-BDRVRawState __attribute__((unused)) *s = bs->opaque;
-#ifdef CONFIG_LINUX_AIO
-if (s->use_linux_aio) {
-laio_io_unplug(s->aio_max_batch);
-}
-#endif
-}
-
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
 {
 BDRVRawState *s = bs->opaque;
@@ -3914,8 +3894,6 @@ BlockDriver bdrv_file = {
 .bdrv_co_copy_range_from = raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = raw_co_copy_range_to,
 .bdrv_refresh_limits = raw_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4286,8 +4264,6 @@ static BlockDriver bdrv_host_device = {
 .bdrv_co_copy_range_from = raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = raw_co_copy_range_to,
 .bdrv_refresh_limits = raw_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4424,8 +4400,6 @@ static BlockDriver bdrv_host_cdrom = {
 .bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
 .bdrv_refresh_limits= cdrom_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4552,8 +4526,6 @@ static BlockDriver bdrv_host_cdrom = {
 .bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
 .bdrv_refresh_limits= cdrom_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 442c86209b..5021aed68f 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -15,6 +15,7 @@
 #include "qemu/event_notifier.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "sysemu/block-backend.h"
 
 /* Only used for assertions.  */
 #include "qemu/coroutine_int.h"
@@ -46,7 +47,6 @@ struct qemu_laiocb {
 };
 
 typedef struct {
-int plugged;
 unsigned int in_queue;
 unsigned int in_flight;
 bool blocked;
@@ -236,7 +236,7 @@ static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
 {
 qemu_laio_process_completions(s);
 
-if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
+if (!QSIMPLEQ_EMPTY(&s->io_q.pending)) {
 ioq_submit(s);
 }
 }
@@ -277,7 +277,6 @@ static void qemu_laio_poll_ready(EventNotifier *opaque)
 static void ioq_init(LaioQueue *io_q)
 {
 QSIMPLEQ_INIT(&io_q->pending);
-io_q->plugged = 0;
 io_q->in_queue = 0;
 io_q->in_flight = 0;
 io_q->blocked = false;
@@ -354,31 +353,11 @@ static uint64_t laio_max_batch(LinuxAioState *s, uint64_t dev_max_batch)

[PATCH 6/6] block: remove bdrv_co_io_plug() API

2023-05-17 Thread Stefan Hajnoczi
No block driver implements .bdrv_co_io_plug() anymore. Get rid of the
function pointers.

Signed-off-by: Stefan Hajnoczi 
---
 include/block/block-io.h |  3 ---
 include/block/block_int-common.h | 11 --
 block/io.c   | 37 
 3 files changed, 51 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index a27e471a87..43af816d75 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -259,9 +259,6 @@ void coroutine_fn bdrv_co_leave(BlockDriverState *bs, AioContext *old_ctx);
 
 AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c);
 
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_plug(BlockDriverState *bs);
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_unplug(BlockDriverState *bs);
-
 bool coroutine_fn GRAPH_RDLOCK
 bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
uint32_t granularity, Error **errp);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index dbec0e3bb4..fa369d83dd 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -753,11 +753,6 @@ struct BlockDriver {
 void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_debug_event)(
 BlockDriverState *bs, BlkdebugEvent event);
 
-/* io queue for linux-aio */
-void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_plug)(BlockDriverState *bs);
-void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_unplug)(
-BlockDriverState *bs);
-
 /**
  * bdrv_drain_begin is called if implemented in the beginning of a
  * drain operation to drain and stop any internal sources of requests in
@@ -1227,12 +1222,6 @@ struct BlockDriverState {
 unsigned int in_flight;
 unsigned int serialising_in_flight;
 
-/*
- * counter for nested bdrv_io_plug.
- * Accessed with atomic ops.
- */
-unsigned io_plugged;
-
 /* do we need to tell the quest if we have a volatile write cache? */
 int enable_write_cache;
 
diff --git a/block/io.c b/block/io.c
index 4d54fda593..56b0c1ce6c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3219,43 +3219,6 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size)
 return mem;
 }
 
-void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs)
-{
-BdrvChild *child;
-IO_CODE();
-assert_bdrv_graph_readable();
-
-QLIST_FOREACH(child, &bs->children, next) {
-bdrv_co_io_plug(child->bs);
-}
-
-if (qatomic_fetch_inc(&bs->io_plugged) == 0) {
-BlockDriver *drv = bs->drv;
-if (drv && drv->bdrv_co_io_plug) {
-drv->bdrv_co_io_plug(bs);
-}
-}
-}
-
-void coroutine_fn bdrv_co_io_unplug(BlockDriverState *bs)
-{
-BdrvChild *child;
-IO_CODE();
-assert_bdrv_graph_readable();
-
-assert(bs->io_plugged);
-if (qatomic_fetch_dec(&bs->io_plugged) == 1) {
-BlockDriver *drv = bs->drv;
-if (drv && drv->bdrv_co_io_unplug) {
-drv->bdrv_co_io_unplug(bs);
-}
-}
-
-QLIST_FOREACH(child, &bs->children, next) {
-bdrv_co_io_unplug(child->bs);
-}
-}
-
 /* Helper that undoes bdrv_register_buf() when it fails partway through */
 static void GRAPH_RDLOCK
 bdrv_register_buf_rollback(BlockDriverState *bs, void *host, size_t size,
-- 
2.40.1




[PATCH 1/6] block: add blk_io_plug_call() API

2023-05-17 Thread Stefan Hajnoczi
Introduce a new API for thread-local blk_io_plug() that does not
traverse the block graph. The goal is to make blk_io_plug() multi-queue
friendly.

Instead of having block drivers track whether or not we're in a plugged
section, provide an API that allows them to defer a function call until
we're unplugged: blk_io_plug_call(fn, opaque). If blk_io_plug_call() is
called multiple times with the same fn/opaque pair, then fn() is only
called once at the end of the function - resulting in batching.

This patch introduces the API and changes blk_io_plug()/blk_io_unplug().
blk_io_plug()/blk_io_unplug() no longer require a BlockBackend argument
because the plug state is now thread-local.

Later patches convert block drivers to blk_io_plug_call() and then we
can finally remove .bdrv_co_io_plug() once all block drivers have been
converted.
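
Concretely, the semantics described above work out as follows (an
illustrative snippet only; my_fn and my_obj are placeholder names):

blk_io_plug();                    /* outermost plugged section begins */
blk_io_plug();                    /* sections may nest */
blk_io_plug_call(my_fn, my_obj);  /* deferred */
blk_io_plug_call(my_fn, my_obj);  /* same fn/opaque pair: deduplicated */
blk_io_unplug();                  /* inner unplug: nothing runs yet */
blk_io_unplug();                  /* outermost unplug: my_fn(my_obj) runs once */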

Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS   |   1 +
 include/sysemu/block-backend-io.h |  13 +--
 block/block-backend.c |  22 -
 block/plug.c  | 159 ++
 hw/block/dataplane/xen-block.c|   8 +-
 hw/block/virtio-blk.c |   4 +-
 hw/scsi/virtio-scsi.c |   6 +-
 block/meson.build |   1 +
 8 files changed, 173 insertions(+), 41 deletions(-)
 create mode 100644 block/plug.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 50585117a0..574202295c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2644,6 +2644,7 @@ F: util/aio-*.c
 F: util/aio-*.h
 F: util/fdmon-*.c
 F: block/io.c
+F: block/plug.c
 F: migration/block*
 F: include/block/aio.h
 F: include/block/aio-wait.h
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index d62a7ee773..be4dcef59d 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -100,16 +100,9 @@ void blk_iostatus_set_err(BlockBackend *blk, int error);
 int blk_get_max_iov(BlockBackend *blk);
 int blk_get_max_hw_iov(BlockBackend *blk);
 
-/*
- * blk_io_plug/unplug are thread-local operations. This means that multiple
- * IOThreads can simultaneously call plug/unplug, but the caller must ensure
- * that each unplug() is called in the same IOThread of the matching plug().
- */
-void coroutine_fn blk_co_io_plug(BlockBackend *blk);
-void co_wrapper blk_io_plug(BlockBackend *blk);
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk);
-void co_wrapper blk_io_unplug(BlockBackend *blk);
+void blk_io_plug(void);
+void blk_io_unplug(void);
+void blk_io_plug_call(void (*fn)(void *), void *opaque);
 
 AioContext *blk_get_aio_context(BlockBackend *blk);
 BlockAcctStats *blk_get_stats(BlockBackend *blk);
diff --git a/block/block-backend.c b/block/block-backend.c
index ca537cd0ad..1f1d226ba6 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2568,28 +2568,6 @@ void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify)
 notifier_list_add(&blk->insert_bs_notifiers, notify);
 }
 
-void coroutine_fn blk_co_io_plug(BlockBackend *blk)
-{
-BlockDriverState *bs = blk_bs(blk);
-IO_CODE();
-GRAPH_RDLOCK_GUARD();
-
-if (bs) {
-bdrv_co_io_plug(bs);
-}
-}
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk)
-{
-BlockDriverState *bs = blk_bs(blk);
-IO_CODE();
-GRAPH_RDLOCK_GUARD();
-
-if (bs) {
-bdrv_co_io_unplug(bs);
-}
-}
-
 BlockAcctStats *blk_get_stats(BlockBackend *blk)
 {
 IO_CODE();
diff --git a/block/plug.c b/block/plug.c
new file mode 100644
index 00..6738a568ba
--- /dev/null
+++ b/block/plug.c
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Block I/O plugging
+ *
+ * Copyright Red Hat.
+ *
+ * This API defers a function call within a blk_io_plug()/blk_io_unplug()
+ * section, allowing multiple calls to batch up. This is a performance
+ * optimization that is used in the block layer to submit several I/O requests
+ * at once instead of individually:
+ *
+ *   blk_io_plug(); <-- start of plugged region
+ *   ...
+ *   blk_io_plug_call(my_func, my_obj); <-- deferred my_func(my_obj) call
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   ...
+ *   blk_io_unplug(); <-- end of plugged region, my_func(my_obj) is called once
+ *
+ * This code is actually generic and not tied to the block layer. If another
+ * subsystem needs this functionality, it could be renamed.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/coroutine-tls.h"
+#include "qemu/notify.h"
+#include "qemu/thread.h"
+#include "sysemu/block-backend.h"
+
+/* A function call that has been deferred until unplug() */
+typedef struct {
+void (*fn)(void *);
+void *opaque;
+} UnplugFn;
+
+/* Per-thread state */
+typedef struct {
+unsigned count;   /* how many times has plug() been called? */
+GArray *unplug_fns;   /* functions to call at unplug time */

[PATCH 3/6] block/blkio: convert to blk_io_plug_call() API

2023-05-17 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
---
 block/blkio.c | 40 +---
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 0cdc99a729..f2a1dc1fb2 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -325,16 +325,28 @@ static void blkio_detach_aio_context(BlockDriverState *bs)
false, NULL, NULL, NULL, NULL, NULL);
 }
 
-/* Call with s->blkio_lock held to submit I/O after enqueuing a new request */
-static void blkio_submit_io(BlockDriverState *bs)
+/*
+ * Called by blk_io_unplug() or immediately if not plugged. Called without
+ * blkio_lock.
+ */
+static void blkio_unplug_fn(BlockDriverState *bs)
 {
-if (qatomic_read(&bs->io_plugged) == 0) {
-BDRVBlkioState *s = bs->opaque;
+BDRVBlkioState *s = bs->opaque;
 
+WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
 }
 }
 
+/*
+ * Schedule I/O submission after enqueuing a new request. Called without
+ * blkio_lock.
+ */
+static void blkio_submit_io(BlockDriverState *bs)
+{
+blk_io_plug_call(blkio_unplug_fn, bs);
+}
+
 static int coroutine_fn
 blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
@@ -345,9 +357,9 @@ blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_discard(s->blkioq, offset, bytes, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
@@ -378,9 +390,9 @@ blkio_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_readv(s->blkioq, offset, iov, iovcnt, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 
 if (use_bounce_buffer) {
@@ -423,9 +435,9 @@ static int coroutine_fn blkio_co_pwritev(BlockDriverState *bs, int64_t offset,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_writev(s->blkioq, offset, iov, iovcnt, &cod, blkio_flags);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 
 if (use_bounce_buffer) {
@@ -444,9 +456,9 @@ static int coroutine_fn blkio_co_flush(BlockDriverState *bs)
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_flush(s->blkioq, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
@@ -472,22 +484,13 @@ static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_write_zeroes(s->blkioq, offset, bytes, &cod, blkio_flags);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
 
-static void coroutine_fn blkio_co_io_unplug(BlockDriverState *bs)
-{
-BDRVBlkioState *s = bs->opaque;
-
-WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
-blkio_submit_io(bs);
-}
-}
-
 typedef enum {
 BMRR_OK,
 BMRR_SKIP,
@@ -1009,7 +1012,6 @@ static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
 .bdrv_co_pwritev = blkio_co_pwritev, \
 .bdrv_co_flush_to_disk   = blkio_co_flush, \
 .bdrv_co_pwrite_zeroes   = blkio_co_pwrite_zeroes, \
-.bdrv_co_io_unplug   = blkio_co_io_unplug, \
 .bdrv_refresh_limits = blkio_refresh_limits, \
 .bdrv_register_buf   = blkio_register_buf, \
 .bdrv_unregister_buf = blkio_unregister_buf, \
-- 
2.40.1




[PATCH 2/6] block/nvme: convert to blk_io_plug_call() API

2023-05-17 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
---
 block/nvme.c | 44 
 1 file changed, 12 insertions(+), 32 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 5b744c2bda..100b38b592 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -25,6 +25,7 @@
 #include "qemu/vfio-helpers.h"
 #include "block/block-io.h"
 #include "block/block_int.h"
+#include "sysemu/block-backend.h"
 #include "sysemu/replay.h"
 #include "trace.h"
 
@@ -119,7 +120,6 @@ struct BDRVNVMeState {
 int blkshift;
 
 uint64_t max_transfer;
-bool plugged;
 
 bool supports_write_zeroes;
 bool supports_discard;
@@ -282,7 +282,7 @@ static void nvme_kick(NVMeQueuePair *q)
 {
 BDRVNVMeState *s = q->s;
 
-if (s->plugged || !q->need_kick) {
+if (!q->need_kick) {
 return;
 }
 trace_nvme_kick(s, q->index);
@@ -387,10 +387,6 @@ static bool nvme_process_completion(NVMeQueuePair *q)
 NvmeCqe *c;
 
 trace_nvme_process_completion(s, q->index, q->inflight);
-if (s->plugged) {
-trace_nvme_process_completion_queue_plugged(s, q->index);
-return false;
-}
 
 /*
  * Support re-entrancy when a request cb() function invokes aio_poll().
@@ -480,6 +476,15 @@ static void nvme_trace_command(const NvmeCmd *cmd)
 }
 }
 
+static void nvme_unplug_fn(void *opaque)
+{
+NVMeQueuePair *q = opaque;
+
+QEMU_LOCK_GUARD(&q->lock);
+nvme_kick(q);
+nvme_process_completion(q);
+}
+
 static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
 NvmeCmd *cmd, BlockCompletionFunc cb,
 void *opaque)
@@ -496,8 +501,7 @@ static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
 q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
 q->need_kick++;
-nvme_kick(q);
-nvme_process_completion(q);
+blk_io_plug_call(nvme_unplug_fn, q);
 qemu_mutex_unlock(&q->lock);
 }
 
@@ -1567,27 +1571,6 @@ static void nvme_attach_aio_context(BlockDriverState *bs,
 }
 }
 
-static void coroutine_fn nvme_co_io_plug(BlockDriverState *bs)
-{
-BDRVNVMeState *s = bs->opaque;
-assert(!s->plugged);
-s->plugged = true;
-}
-
-static void coroutine_fn nvme_co_io_unplug(BlockDriverState *bs)
-{
-BDRVNVMeState *s = bs->opaque;
-assert(s->plugged);
-s->plugged = false;
-for (unsigned i = INDEX_IO(0); i < s->queue_count; i++) {
-NVMeQueuePair *q = s->queues[i];
-qemu_mutex_lock(&q->lock);
-nvme_kick(q);
-nvme_process_completion(q);
-qemu_mutex_unlock(&q->lock);
-}
-}
-
 static bool nvme_register_buf(BlockDriverState *bs, void *host, size_t size,
   Error **errp)
 {
@@ -1664,9 +1647,6 @@ static BlockDriver bdrv_nvme = {
 .bdrv_detach_aio_context  = nvme_detach_aio_context,
 .bdrv_attach_aio_context  = nvme_attach_aio_context,
 
-.bdrv_co_io_plug  = nvme_co_io_plug,
-.bdrv_co_io_unplug= nvme_co_io_unplug,
-
 .bdrv_register_buf= nvme_register_buf,
 .bdrv_unregister_buf  = nvme_unregister_buf,
 };
-- 
2.40.1




[PATCH v6 17/20] virtio-blk: implement BlockDevOps->drained_begin()

2023-05-16 Thread Stefan Hajnoczi
Detach ioeventfds during drained sections to stop I/O submission from
the guest. virtio-blk is no longer reliant on aio_disable_external()
after this patch. This will allow us to remove the
aio_disable_external() API once all other code that relies on it is
converted.

Take extra care to avoid attaching/detaching ioeventfds if the data
plane is started/stopped during a drained section. This should be rare,
but maybe the mirror block job can trigger it.
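
The invariant, spelled out (a paraphrase of the logic in the hunks below,
not a literal excerpt):

/*
 * ioeventfds are attached only while BOTH conditions hold:
 *   1. the dataplane is started, and
 *   2. the BlockBackend is not in a drained section.
 *
 * The dataplane start/stop path skips the notifiers while drained:
 */
if (!blk_in_drain(s->conf->conf.blk)) {
    /* attach or detach host notifiers here */
}

/* ...and the drained_begin/end path skips them while stopped: */
if (s->dataplane && s->dataplane_started) {
    /* detach or attach host notifiers here */
}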

Signed-off-by: Stefan Hajnoczi 
---
 hw/block/dataplane/virtio-blk.c | 16 --
 hw/block/virtio-blk.c   | 38 -
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 4f5c7cd55f..b90456c08c 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -246,13 +246,15 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 }
 
 /* Get this show started by hooking up our callbacks */
-aio_context_acquire(s->ctx);
-for (i = 0; i < nvqs; i++) {
-VirtQueue *vq = virtio_get_queue(s->vdev, i);
+if (!blk_in_drain(s->conf->conf.blk)) {
+aio_context_acquire(s->ctx);
+for (i = 0; i < nvqs; i++) {
+VirtQueue *vq = virtio_get_queue(s->vdev, i);
 
-virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+}
+aio_context_release(s->ctx);
 }
-aio_context_release(s->ctx);
 return 0;
 
   fail_aio_context:
@@ -322,7 +324,9 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 s->stopping = true;
 trace_virtio_blk_data_plane_stop(s);
 
-aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
+if (!blk_in_drain(s->conf->conf.blk)) {
+aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
+}
 
 aio_context_acquire(s->ctx);
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 8f65ea4659..4ca66b5860 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1506,8 +1506,44 @@ static void virtio_blk_resize(void *opaque)
 aio_bh_schedule_oneshot(qemu_get_aio_context(), virtio_resize_cb, vdev);
 }
 
+/* Suspend virtqueue ioeventfd processing during drain */
+static void virtio_blk_drained_begin(void *opaque)
+{
+VirtIOBlock *s = opaque;
+VirtIODevice *vdev = VIRTIO_DEVICE(opaque);
+AioContext *ctx = blk_get_aio_context(s->conf.conf.blk);
+
+if (!s->dataplane || !s->dataplane_started) {
+return;
+}
+
+for (uint16_t i = 0; i < s->conf.num_queues; i++) {
+VirtQueue *vq = virtio_get_queue(vdev, i);
+virtio_queue_aio_detach_host_notifier(vq, ctx);
+}
+}
+
+/* Resume virtqueue ioeventfd processing after drain */
+static void virtio_blk_drained_end(void *opaque)
+{
+VirtIOBlock *s = opaque;
+VirtIODevice *vdev = VIRTIO_DEVICE(opaque);
+AioContext *ctx = blk_get_aio_context(s->conf.conf.blk);
+
+if (!s->dataplane || !s->dataplane_started) {
+return;
+}
+
+for (uint16_t i = 0; i < s->conf.num_queues; i++) {
+VirtQueue *vq = virtio_get_queue(vdev, i);
+virtio_queue_aio_attach_host_notifier(vq, ctx);
+}
+}
+
 static const BlockDevOps virtio_block_ops = {
-.resize_cb = virtio_blk_resize,
+.resize_cb = virtio_blk_resize,
+.drained_begin = virtio_blk_drained_begin,
+.drained_end   = virtio_blk_drained_end,
 };
 
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
-- 
2.40.1




[PATCH v6 18/20] virtio-scsi: implement BlockDevOps->drained_begin()

2023-05-16 Thread Stefan Hajnoczi
The virtio-scsi Host Bus Adapter provides access to devices on a SCSI
bus. Those SCSI devices typically have a BlockBackend. When the
BlockBackend enters a drained section, the SCSI device must temporarily
stop submitting new I/O requests.

Implement this behavior by temporarily stopping virtio-scsi virtqueue
processing when one of the SCSI devices enters a drained section. The
new scsi_device_drained_begin() API allows scsi-disk to message the
virtio-scsi HBA.

scsi_device_drained_begin() uses a drain counter so that multiple SCSI
devices can have overlapping drained sections. The HBA only sees one
pair of .drained_begin/end() calls.
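
For example, with two devices on the same bus (a timeline sketch, not code
from the patch):

scsi_device_drained_begin(sdev_a);  /* drain_count 0 -> 1: HBA drained_begin() */
scsi_device_drained_begin(sdev_b);  /* drain_count 1 -> 2: no HBA callback */
scsi_device_drained_end(sdev_a);    /* drain_count 2 -> 1: no HBA callback */
scsi_device_drained_end(sdev_b);    /* drain_count 1 -> 0: HBA drained_end() */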

After this commit, virtio-scsi no longer depends on hw/virtio's
ioeventfd aio_set_event_notifier(is_external=true). This commit is a
step towards removing the aio_disable_external() API.

Signed-off-by: Stefan Hajnoczi 
---
 include/hw/scsi/scsi.h  | 14 
 hw/scsi/scsi-bus.c  | 40 +
 hw/scsi/scsi-disk.c | 27 +-
 hw/scsi/virtio-scsi-dataplane.c | 18 +--
 hw/scsi/virtio-scsi.c   | 38 +++
 hw/scsi/trace-events|  2 ++
 6 files changed, 127 insertions(+), 12 deletions(-)

diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 6f23a7a73e..e2bb1a2fbf 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -133,6 +133,16 @@ struct SCSIBusInfo {
 void (*save_request)(QEMUFile *f, SCSIRequest *req);
 void *(*load_request)(QEMUFile *f, SCSIRequest *req);
 void (*free_request)(SCSIBus *bus, void *priv);
+
+/*
+ * Temporarily stop submitting new requests between drained_begin() and
+ * drained_end(). Called from the main loop thread with the BQL held.
+ *
+ * Implement these callbacks if request processing is triggered by a file
+ * descriptor like an EventNotifier. Otherwise set them to NULL.
+ */
+void (*drained_begin)(SCSIBus *bus);
+void (*drained_end)(SCSIBus *bus);
 };
 
 #define TYPE_SCSI_BUS "SCSI"
@@ -144,6 +154,8 @@ struct SCSIBus {
 
 SCSISense unit_attention;
 const SCSIBusInfo *info;
+
+int drain_count; /* protected by BQL */
 };
 
 /**
@@ -213,6 +225,8 @@ void scsi_req_cancel_complete(SCSIRequest *req);
 void scsi_req_cancel(SCSIRequest *req);
 void scsi_req_cancel_async(SCSIRequest *req, Notifier *notifier);
 void scsi_req_retry(SCSIRequest *req);
+void scsi_device_drained_begin(SCSIDevice *sdev);
+void scsi_device_drained_end(SCSIDevice *sdev);
 void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense);
 void scsi_device_set_ua(SCSIDevice *sdev, SCSISense sense);
 void scsi_device_report_change(SCSIDevice *dev, SCSISense sense);
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 64013c8a24..f80f4cb4fc 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -1669,6 +1669,46 @@ void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense)
 scsi_device_set_ua(sdev, sense);
 }
 
+void scsi_device_drained_begin(SCSIDevice *sdev)
+{
+SCSIBus *bus = DO_UPCAST(SCSIBus, qbus, sdev->qdev.parent_bus);
+if (!bus) {
+return;
+}
+
+assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+assert(bus->drain_count < INT_MAX);
+
+/*
+ * Multiple BlockBackends can be on a SCSIBus and each may begin/end
+ * draining at any time. Keep a counter so HBAs only see begin/end once.
+ */
+if (bus->drain_count++ == 0) {
+trace_scsi_bus_drained_begin(bus, sdev);
+if (bus->info->drained_begin) {
+bus->info->drained_begin(bus);
+}
+}
+}
+
+void scsi_device_drained_end(SCSIDevice *sdev)
+{
+SCSIBus *bus = DO_UPCAST(SCSIBus, qbus, sdev->qdev.parent_bus);
+if (!bus) {
+return;
+}
+
+assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+assert(bus->drain_count > 0);
+
+if (bus->drain_count-- == 1) {
+trace_scsi_bus_drained_end(bus, sdev);
+if (bus->info->drained_end) {
+bus->info->drained_end(bus);
+}
+}
+}
+
 static char *scsibus_get_dev_path(DeviceState *dev)
 {
 SCSIDevice *d = SCSI_DEVICE(dev);
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 97c9b1c8cd..e0d79c7966 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2360,6 +2360,20 @@ static void scsi_disk_reset(DeviceState *dev)
 s->qdev.scsi_version = s->qdev.default_scsi_version;
 }
 
+static void scsi_disk_drained_begin(void *opaque)
+{
+SCSIDiskState *s = opaque;
+
+scsi_device_drained_begin(&s->qdev);
+}
+
+static void scsi_disk_drained_end(void *opaque)
+{
+SCSIDiskState *s = opaque;
+
+scsi_device_drained_end(&s->qdev);
+}
+
 static void scsi_disk_resize_cb(void *opaque)
 {
 SCSIDiskState *s = opaque;
@@ -2414,16 +2428,19 @@ static bool scsi_cd_is_medium_locked(void *opaque)

[PATCH v6 16/20] virtio: make it possible to detach host notifier from any thread

2023-05-16 Thread Stefan Hajnoczi
virtio_queue_aio_detach_host_notifier() does two things:
1. It removes the fd handler from the event loop.
2. It processes the virtqueue one last time.

The first step can be performed by any thread and without taking the
AioContext lock.

The second step may need the AioContext lock (depending on the device
implementation) and runs in the thread where request processing takes
place. virtio-blk and virtio-scsi therefore call
virtio_queue_aio_detach_host_notifier() from a BH that is scheduled in the
AioContext.

The next patch will introduce a .drained_begin() function that needs to
call virtio_queue_aio_detach_host_notifier(). .drained_begin() functions
cannot call aio_poll() to wait synchronously for the BH. It is possible
for a .drained_poll() callback to asynchronously wait for the BH, but
that is more complex than necessary here.

Move the virtqueue processing out to the callers of
virtio_queue_aio_detach_host_notifier() so that the function can be
called from any thread. This is in preparation for the next patch.
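
The pattern callers now follow is (condensed from the hunks below):

virtio_queue_aio_detach_host_notifier(vq, ctx);

/* Test and clear the notifier in case the poll callback didn't run. */
virtio_queue_host_notifier_read(virtio_queue_get_host_notifier(vq));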

Signed-off-by: Stefan Hajnoczi 
---
 hw/block/dataplane/virtio-blk.c |  7 +++
 hw/scsi/virtio-scsi-dataplane.c | 14 ++
 hw/virtio/virtio.c  |  3 ---
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index af1c24c40c..4f5c7cd55f 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -287,8 +287,15 @@ static void virtio_blk_data_plane_stop_bh(void *opaque)
 
 for (i = 0; i < s->conf->num_queues; i++) {
 VirtQueue *vq = virtio_get_queue(s->vdev, i);
+EventNotifier *host_notifier = virtio_queue_get_host_notifier(vq);
 
 virtio_queue_aio_detach_host_notifier(vq, s->ctx);
+
+/*
+ * Test and clear notifier after disabling event, in case poll callback
+ * didn't have time to run.
+ */
+virtio_queue_host_notifier_read(host_notifier);
 }
 }
 
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index f3214e1c57..b3a1ed21f7 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -71,12 +71,26 @@ static void virtio_scsi_dataplane_stop_bh(void *opaque)
 {
 VirtIOSCSI *s = opaque;
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
+EventNotifier *host_notifier;
 int i;
 
 virtio_queue_aio_detach_host_notifier(vs->ctrl_vq, s->ctx);
+host_notifier = virtio_queue_get_host_notifier(vs->ctrl_vq);
+
+/*
+ * Test and clear notifier after disabling event, in case poll callback
+ * didn't have time to run.
+ */
+virtio_queue_host_notifier_read(host_notifier);
+
 virtio_queue_aio_detach_host_notifier(vs->event_vq, s->ctx);
+host_notifier = virtio_queue_get_host_notifier(vs->event_vq);
+virtio_queue_host_notifier_read(host_notifier);
+
 for (i = 0; i < vs->conf.num_queues; i++) {
 virtio_queue_aio_detach_host_notifier(vs->cmd_vqs[i], s->ctx);
+host_notifier = virtio_queue_get_host_notifier(vs->cmd_vqs[i]);
+virtio_queue_host_notifier_read(host_notifier);
 }
 }
 
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 272d930721..cb09cb6464 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3516,9 +3516,6 @@ void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx)
 void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
 aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL, NULL, NULL);
-/* Test and clear notifier before after disabling event,
- * in case poll callback didn't have time to run. */
-virtio_queue_host_notifier_read(&vq->host_notifier);
 }
 
 void virtio_queue_host_notifier_read(EventNotifier *n)
-- 
2.40.1




[PATCH v6 12/20] hw/xen: do not set is_external=true on evtchn fds

2023-05-16 Thread Stefan Hajnoczi
is_external=true suspends fd handlers between aio_disable_external() and
aio_enable_external(). The block layer's drain operation uses this
mechanism to prevent new I/O from sneaking in between
bdrv_drained_begin() and bdrv_drained_end().

The previous commit converted the xen-block device to use BlockDevOps
.drained_begin/end() callbacks. It no longer relies on is_external=true
so it is safe to pass is_external=false.

This is part of ongoing work to remove the aio_disable_external() API.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 hw/xen/xen-bus.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/xen/xen-bus.c b/hw/xen/xen-bus.c
index b8f408c9ed..bf256d4da2 100644
--- a/hw/xen/xen-bus.c
+++ b/hw/xen/xen-bus.c
@@ -842,14 +842,14 @@ void xen_device_set_event_channel_context(XenDevice *xendev,
 }
 
 if (channel->ctx)
-aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), true,
+aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), false,
NULL, NULL, NULL, NULL, NULL);
 
 channel->ctx = ctx;
 if (ctx) {
 aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh),
-   true, xen_device_event, NULL, xen_device_poll, NULL,
-   channel);
+   false, xen_device_event, NULL, xen_device_poll,
+   NULL, channel);
 }
 }
 
@@ -923,7 +923,7 @@ void xen_device_unbind_event_channel(XenDevice *xendev,
 
 QLIST_REMOVE(channel, list);
 
-aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), true,
+aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), false,
NULL, NULL, NULL, NULL, NULL);
 
 if (qemu_xen_evtchn_unbind(channel->xeh, channel->local_port) < 0) {
-- 
2.40.1




[PATCH v6 11/20] xen-block: implement BlockDevOps->drained_begin()

2023-05-16 Thread Stefan Hajnoczi
Detach event channels during drained sections to stop I/O submission
from the ring. xen-block is no longer reliant on aio_disable_external()
after this patch. This will allow us to remove the
aio_disable_external() API once all other code that relies on it is
converted.

Extend xen_device_set_event_channel_context() to allow ctx=NULL. The
event channel still exists but the event loop does not monitor the file
descriptor. Event channel processing can resume by calling
xen_device_set_event_channel_context() with a non-NULL ctx.

Factor out xen_device_set_event_channel_context() calls in
hw/block/dataplane/xen-block.c into attach/detach helper functions.
Incidentally, these don't require the AioContext lock because
aio_set_fd_handler() is thread-safe.

It's safer to register BlockDevOps after the dataplane instance has been
created. The BlockDevOps .drained_begin/end() callbacks depend on the
dataplane instance, so move the blk_set_dev_ops() call after
xen_block_dataplane_create().
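
A hedged sketch of the BlockDevOps callbacks this adds, built from the
attach/detach helpers visible in the diff below (the exact bodies are an
assumption, not the literal hunk):

  static void xen_block_drained_begin(void *opaque)
  {
      XenBlockDevice *blockdev = opaque;

      /* stop ring processing so no new I/O is submitted while drained */
      xen_block_dataplane_detach(blockdev->dataplane);
  }

  static void xen_block_drained_end(void *opaque)
  {
      XenBlockDevice *blockdev = opaque;

      xen_block_dataplane_attach(blockdev->dataplane);
  }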

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 hw/block/dataplane/xen-block.h |  2 ++
 hw/block/dataplane/xen-block.c | 42 +-
 hw/block/xen-block.c   | 24 ---
 hw/xen/xen-bus.c   |  7 --
 4 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/hw/block/dataplane/xen-block.h b/hw/block/dataplane/xen-block.h
index 76dcd51c3d..7b8e9df09f 100644
--- a/hw/block/dataplane/xen-block.h
+++ b/hw/block/dataplane/xen-block.h
@@ -26,5 +26,7 @@ void xen_block_dataplane_start(XenBlockDataPlane *dataplane,
unsigned int protocol,
Error **errp);
 void xen_block_dataplane_stop(XenBlockDataPlane *dataplane);
+void xen_block_dataplane_attach(XenBlockDataPlane *dataplane);
+void xen_block_dataplane_detach(XenBlockDataPlane *dataplane);
 
 #endif /* HW_BLOCK_DATAPLANE_XEN_BLOCK_H */
diff --git a/hw/block/dataplane/xen-block.c b/hw/block/dataplane/xen-block.c
index d8bc39d359..2597f38805 100644
--- a/hw/block/dataplane/xen-block.c
+++ b/hw/block/dataplane/xen-block.c
@@ -664,6 +664,30 @@ void xen_block_dataplane_destroy(XenBlockDataPlane *dataplane)
 g_free(dataplane);
 }
 
+void xen_block_dataplane_detach(XenBlockDataPlane *dataplane)
+{
+if (!dataplane || !dataplane->event_channel) {
+return;
+}
+
+/* Only reason for failure is a NULL channel */
+xen_device_set_event_channel_context(dataplane->xendev,
+ dataplane->event_channel,
+ NULL, &error_abort);
+}
+
+void xen_block_dataplane_attach(XenBlockDataPlane *dataplane)
+{
+if (!dataplane || !dataplane->event_channel) {
+return;
+}
+
+/* Only reason for failure is a NULL channel */
+xen_device_set_event_channel_context(dataplane->xendev,
+ dataplane->event_channel,
+ dataplane->ctx, &error_abort);
+}
+
 void xen_block_dataplane_stop(XenBlockDataPlane *dataplane)
 {
 XenDevice *xendev;
@@ -674,13 +698,11 @@ void xen_block_dataplane_stop(XenBlockDataPlane *dataplane)
 
 xendev = dataplane->xendev;
 
-aio_context_acquire(dataplane->ctx);
-if (dataplane->event_channel) {
-/* Only reason for failure is a NULL channel */
-xen_device_set_event_channel_context(xendev, dataplane->event_channel,
- qemu_get_aio_context(),
- &error_abort);
+if (!blk_in_drain(dataplane->blk)) {
+xen_block_dataplane_detach(dataplane);
 }
+
+aio_context_acquire(dataplane->ctx);
 /* Xen doesn't have multiple users for nodes, so this can't fail */
 blk_set_aio_context(dataplane->blk, qemu_get_aio_context(), &error_abort);
 aio_context_release(dataplane->ctx);
@@ -819,11 +841,9 @@ void xen_block_dataplane_start(XenBlockDataPlane *dataplane,
 blk_set_aio_context(dataplane->blk, dataplane->ctx, NULL);
 aio_context_release(old_context);
 
-/* Only reason for failure is a NULL channel */
-aio_context_acquire(dataplane->ctx);
-xen_device_set_event_channel_context(xendev, dataplane->event_channel,
- dataplane->ctx, &error_abort);
-aio_context_release(dataplane->ctx);
+if (!blk_in_drain(dataplane->blk)) {
+xen_block_dataplane_attach(dataplane);
+}
 
 return;
 
diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
index f5a744589d..f099914831 100644
--- a/hw/block/xen-block.c
+++ b/hw/block/xen-block.c
@@ -189,8 +189,26 @@ static void xen_block_resize_cb(void *opaque)
 xen_device_backend_printf(xendev, "state", "%u", state);
 }
 
+/* Suspend request handling */
+static void xen_block_drained_begin(void *opaq

[PATCH v6 20/20] aio: remove aio_disable_external() API

2023-05-16 Thread Stefan Hajnoczi
All callers now pass is_external=false to aio_set_fd_handler() and
aio_set_event_notifier(). The aio_disable_external() API that
temporarily disables fd handlers that were registered is_external=true
is therefore dead code.

Remove aio_disable_external(), aio_enable_external(), and the
is_external arguments to aio_set_fd_handler() and
aio_set_event_notifier().

The entire test-fdmon-epoll test is removed because its sole purpose was
testing aio_disable_external().

Parts of this patch were generated using the following coccinelle
(https://coccinelle.lip6.fr/) semantic patch:

  @@
  expression ctx, fd, is_external, io_read, io_write, io_poll, io_poll_ready, opaque;
  @@
  - aio_set_fd_handler(ctx, fd, is_external, io_read, io_write, io_poll, io_poll_ready, opaque)
  + aio_set_fd_handler(ctx, fd, io_read, io_write, io_poll, io_poll_ready, opaque)

  @@
  expression ctx, notifier, is_external, io_read, io_poll, io_poll_ready;
  @@
  - aio_set_event_notifier(ctx, notifier, is_external, io_read, io_poll, io_poll_ready)
  + aio_set_event_notifier(ctx, notifier, io_read, io_poll, io_poll_ready)
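
(For reference, one plausible way to apply such a semantic patch, assuming
it is saved as remove-is-external.cocci, is:

  spatch --sp-file remove-is-external.cocci --in-place --dir .

The actual invocation used is not recorded in this commit message.)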

Reviewed-by: Juan Quintela 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Stefan Hajnoczi 
---
 include/block/aio.h   | 57 ---
 util/aio-posix.h  |  1 -
 block.c   |  7 
 block/blkio.c | 15 +++
 block/curl.c  | 10 ++---
 block/export/fuse.c   |  8 ++--
 block/export/vduse-blk.c  | 10 ++---
 block/io.c|  2 -
 block/io_uring.c  |  4 +-
 block/iscsi.c |  3 +-
 block/linux-aio.c |  4 +-
 block/nfs.c   |  5 +--
 block/nvme.c  |  8 ++--
 block/ssh.c   |  4 +-
 block/win32-aio.c |  6 +--
 hw/i386/kvm/xen_xenstore.c|  2 +-
 hw/virtio/virtio.c|  6 +--
 hw/xen/xen-bus.c  |  8 ++--
 io/channel-command.c  |  6 +--
 io/channel-file.c |  3 +-
 io/channel-socket.c   |  3 +-
 migration/rdma.c  | 16 
 tests/unit/test-aio.c | 27 +
 tests/unit/test-bdrv-drain.c  |  1 -
 tests/unit/test-fdmon-epoll.c | 73 ---
 util/aio-posix.c  | 20 +++---
 util/aio-win32.c  |  8 +---
 util/async.c  |  3 +-
 util/fdmon-epoll.c| 10 -
 util/fdmon-io_uring.c |  8 +---
 util/fdmon-poll.c |  3 +-
 util/main-loop.c  |  7 ++--
 util/qemu-coroutine-io.c  |  7 ++--
 util/vhost-user-server.c  | 11 +++---
 tests/unit/meson.build|  3 --
 35 files changed, 76 insertions(+), 293 deletions(-)
 delete mode 100644 tests/unit/test-fdmon-epoll.c

diff --git a/include/block/aio.h b/include/block/aio.h
index 89bbc536f9..32042e8905 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -225,8 +225,6 @@ struct AioContext {
  */
 QEMUTimerListGroup tlg;
 
-int external_disable_cnt;
-
 /* Number of AioHandlers without .io_poll() */
 int poll_disable_cnt;
 
@@ -481,7 +479,6 @@ bool aio_poll(AioContext *ctx, bool blocking);
  */
 void aio_set_fd_handler(AioContext *ctx,
 int fd,
-bool is_external,
 IOHandler *io_read,
 IOHandler *io_write,
 AioPollFn *io_poll,
@@ -497,7 +494,6 @@ void aio_set_fd_handler(AioContext *ctx,
  */
 void aio_set_event_notifier(AioContext *ctx,
 EventNotifier *notifier,
-bool is_external,
 EventNotifierHandler *io_read,
 AioPollFn *io_poll,
 EventNotifierHandler *io_poll_ready);
@@ -626,59 +622,6 @@ static inline void aio_timer_init(AioContext *ctx,
  */
 int64_t aio_compute_timeout(AioContext *ctx);
 
-/**
- * aio_disable_external:
- * @ctx: the aio context
- *
- * Disable the further processing of external clients.
- */
-static inline void aio_disable_external(AioContext *ctx)
-{
-qatomic_inc(&ctx->external_disable_cnt);
-}
-
-/**
- * aio_enable_external:
- * @ctx: the aio context
- *
- * Enable the processing of external clients.
- */
-static inline void aio_enable_external(AioContext *ctx)
-{
-int old;
-
-old = qatomic_fetch_dec(&ctx->external_disable_cnt);
-assert(old > 0);
-if (old == 1) {
-/* Kick event loop so it re-arms file descriptors */
-aio_notify(ctx);
-}
-}
-
-/**
- * aio_external_disabled:
- * @ctx: the aio context
- *
- * Return true if the external clients are disabled.
- */
-static inline bool aio_external_disabled(AioContext *ctx)
-{
-return qatomic_read(&ctx->external_disable_cnt);
-}
-
-/**
- * aio_node_check:
- * @ctx: the aio context
- * @is_external: Whether or not the checked node is an external event sour

[PATCH v6 19/20] virtio: do not set is_external=true on host notifiers

2023-05-16 Thread Stefan Hajnoczi
Host notifiers can now use is_external=false since virtio-blk and
virtio-scsi no longer rely on is_external=true for drained sections.

Signed-off-by: Stefan Hajnoczi 
---
 hw/virtio/virtio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index cb09cb6464..08011be8dc 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3491,7 +3491,7 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
 
 void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
-aio_set_event_notifier(ctx, &vq->host_notifier, true,
+aio_set_event_notifier(ctx, &vq->host_notifier, false,
virtio_queue_host_notifier_read,
virtio_queue_host_notifier_aio_poll,
virtio_queue_host_notifier_aio_poll_ready);
@@ -3508,14 +3508,14 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
  */
void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx)
 {
-aio_set_event_notifier(ctx, &vq->host_notifier, true,
+aio_set_event_notifier(ctx, &vq->host_notifier, false,
virtio_queue_host_notifier_read,
NULL, NULL);
 }
 
 void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
-aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL, NULL, NULL);
+aio_set_event_notifier(ctx, &vq->host_notifier, false, NULL, NULL, NULL);
 }
 
 void virtio_queue_host_notifier_read(EventNotifier *n)
-- 
2.40.1




[PATCH v6 13/20] block/export: rewrite vduse-blk drain code

2023-05-16 Thread Stefan Hajnoczi
vduse_blk_detach_ctx() waits for in-flight requests using
AIO_WAIT_WHILE(). This is not allowed according to a comment in
bdrv_set_aio_context_commit():

  /*
   * Take the old AioContex when detaching it from bs.
   * At this point, new_context lock is already acquired, and we are now
   * also taking old_context. This is safe as long as bdrv_detach_aio_context
   * does not call AIO_POLL_WHILE().
   */

Use this opportunity to rewrite the drain code in vduse-blk:

- Use the BlockExport refcount so that vduse_blk_exp_delete() is only
  called when there are no more requests in flight.

- Implement .drained_poll() so in-flight request coroutines are stopped
  by the time .bdrv_detach_aio_context() is called.

- Remove AIO_WAIT_WHILE() from vduse_blk_detach_ctx() to solve the
  .bdrv_detach_aio_context() constraint violation. It's no longer
  needed due to the previous changes.

- Always handle the VDUSE file descriptor, even in drained sections. The
  VDUSE file descriptor doesn't submit I/O, so it's safe to handle it in
  drained sections. This ensures that the VDUSE kernel code gets a fast
  response.

- Suspend virtqueue fd handlers in .drained_begin() and resume them in
  .drained_end(). This eliminates the need for the
  aio_set_fd_handler(is_external=true) flag, which is being removed from
  QEMU.

This is a long list but splitting it into individual commits would
probably lead to git bisect failures - the changes are all related.
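
A hedged sketch of the drained hooks this implies, using the queue
enable/disable helpers visible in the diff below (the names and loop are
illustrative, not the literal hunk):

  static void vduse_blk_drained_begin(void *opaque)
  {
      VduseBlkExport *vblk_exp = opaque;

      vblk_exp->vqs_started = false;  /* makes vduse_blk_enable_queue() a no-op */

      for (uint16_t i = 0; i < vblk_exp->num_queues; i++) {
          VduseVirtq *vq = vduse_dev_get_queue(vblk_exp->dev, i);

          vduse_blk_disable_queue(vblk_exp->dev, vq);
      }
  }

  static void vduse_blk_drained_end(void *opaque)
  {
      VduseBlkExport *vblk_exp = opaque;

      vblk_exp->vqs_started = true;

      for (uint16_t i = 0; i < vblk_exp->num_queues; i++) {
          VduseVirtq *vq = vduse_dev_get_queue(vblk_exp->dev, i);

          vduse_blk_enable_queue(vblk_exp->dev, vq);
      }
  }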

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 block/export/vduse-blk.c | 132 +++
 1 file changed, 93 insertions(+), 39 deletions(-)

diff --git a/block/export/vduse-blk.c b/block/export/vduse-blk.c
index b53ef39da0..a25556fe04 100644
--- a/block/export/vduse-blk.c
+++ b/block/export/vduse-blk.c
@@ -31,7 +31,8 @@ typedef struct VduseBlkExport {
 VduseDev *dev;
 uint16_t num_queues;
 char *recon_file;
-unsigned int inflight;
+unsigned int inflight; /* atomic */
+bool vqs_started;
 } VduseBlkExport;
 
 typedef struct VduseBlkReq {
@@ -41,13 +42,24 @@ typedef struct VduseBlkReq {
 
 static void vduse_blk_inflight_inc(VduseBlkExport *vblk_exp)
 {
-vblk_exp->inflight++;
+if (qatomic_fetch_inc(&vblk_exp->inflight) == 0) {
+/* Prevent export from being deleted */
+aio_context_acquire(vblk_exp->export.ctx);
+blk_exp_ref(&vblk_exp->export);
+aio_context_release(vblk_exp->export.ctx);
+}
 }
 
 static void vduse_blk_inflight_dec(VduseBlkExport *vblk_exp)
 {
-if (--vblk_exp->inflight == 0) {
+if (qatomic_fetch_dec(&vblk_exp->inflight) == 1) {
+/* Wake AIO_WAIT_WHILE() */
 aio_wait_kick();
+
+/* Now the export can be deleted */
+aio_context_acquire(vblk_exp->export.ctx);
+blk_exp_unref(&vblk_exp->export);
+aio_context_release(vblk_exp->export.ctx);
 }
 }
 
@@ -124,8 +136,12 @@ static void vduse_blk_enable_queue(VduseDev *dev, VduseVirtq *vq)
 {
 VduseBlkExport *vblk_exp = vduse_dev_get_priv(dev);
 
+if (!vblk_exp->vqs_started) {
+return; /* vduse_blk_drained_end() will start vqs later */
+}
+
 aio_set_fd_handler(vblk_exp->export.ctx, vduse_queue_get_fd(vq),
-   true, on_vduse_vq_kick, NULL, NULL, NULL, vq);
+   false, on_vduse_vq_kick, NULL, NULL, NULL, vq);
 /* Make sure we don't miss any kick afer reconnecting */
 eventfd_write(vduse_queue_get_fd(vq), 1);
 }
@@ -133,9 +149,14 @@ static void vduse_blk_enable_queue(VduseDev *dev, VduseVirtq *vq)
 static void vduse_blk_disable_queue(VduseDev *dev, VduseVirtq *vq)
 {
 VduseBlkExport *vblk_exp = vduse_dev_get_priv(dev);
+int fd = vduse_queue_get_fd(vq);
 
-aio_set_fd_handler(vblk_exp->export.ctx, vduse_queue_get_fd(vq),
-   true, NULL, NULL, NULL, NULL, NULL);
+if (fd < 0) {
+return;
+}
+
+aio_set_fd_handler(vblk_exp->export.ctx, fd, false,
+   NULL, NULL, NULL, NULL, NULL);
 }
 
 static const VduseOps vduse_blk_ops = {
@@ -152,42 +173,19 @@ static void on_vduse_dev_kick(void *opaque)
 
 static void vduse_blk_attach_ctx(VduseBlkExport *vblk_exp, AioContext *ctx)
 {
-int i;
-
 aio_set_fd_handler(vblk_exp->export.ctx, vduse_dev_get_fd(vblk_exp->dev),
-   true, on_vduse_dev_kick, NULL, NULL, NULL,
+   false, on_vduse_dev_kick, NULL, NULL, NULL,
vblk_exp->dev);
 
-for (i = 0; i < vblk_exp->num_queues; i++) {
-VduseVirtq *vq = vduse_dev_get_queue(vblk_exp->dev, i);
-int fd = vduse_queue_get_fd(vq);
-
-if (fd < 0) {
-continue;
-}
-aio_set_fd_handler(vblk_exp->export.ctx, fd, true,
-   on_vduse_vq_kick, NULL, NULL, NULL, vq);
-}
+/* Virtqueues are handled by 

[PATCH v6 15/20] block/fuse: do not set is_external=true on FUSE fd

2023-05-16 Thread Stefan Hajnoczi
This is part of ongoing work to remove the aio_disable_external() API.

Use BlockDevOps .drained_begin/end/poll() instead of
aio_set_fd_handler(is_external=true).

As a side-effect the FUSE export now follows AioContext changes like the
other export types.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 block/export/fuse.c | 56 +++--
 1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 06fa41079e..adf3236b5a 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -50,6 +50,7 @@ typedef struct FuseExport {
 
 struct fuse_session *fuse_session;
 struct fuse_buf fuse_buf;
+unsigned int in_flight; /* atomic */
 bool mounted, fd_handler_set_up;
 
 char *mountpoint;
@@ -78,6 +79,42 @@ static void read_from_fuse_export(void *opaque);
 static bool is_regular_file(const char *path, Error **errp);
 
 
+static void fuse_export_drained_begin(void *opaque)
+{
+FuseExport *exp = opaque;
+
+aio_set_fd_handler(exp->common.ctx,
+   fuse_session_fd(exp->fuse_session), false,
+   NULL, NULL, NULL, NULL, NULL);
+exp->fd_handler_set_up = false;
+}
+
+static void fuse_export_drained_end(void *opaque)
+{
+FuseExport *exp = opaque;
+
+/* Refresh AioContext in case it changed */
+exp->common.ctx = blk_get_aio_context(exp->common.blk);
+
+aio_set_fd_handler(exp->common.ctx,
+   fuse_session_fd(exp->fuse_session), false,
+   read_from_fuse_export, NULL, NULL, NULL, exp);
+exp->fd_handler_set_up = true;
+}
+
+static bool fuse_export_drained_poll(void *opaque)
+{
+FuseExport *exp = opaque;
+
+return qatomic_read(&exp->in_flight) > 0;
+}
+
+static const BlockDevOps fuse_export_blk_dev_ops = {
+.drained_begin = fuse_export_drained_begin,
+.drained_end   = fuse_export_drained_end,
+.drained_poll  = fuse_export_drained_poll,
+};
+
 static int fuse_export_create(BlockExport *blk_exp,
   BlockExportOptions *blk_exp_args,
   Error **errp)
@@ -101,6 +138,15 @@ static int fuse_export_create(BlockExport *blk_exp,
 }
 }
 
+blk_set_dev_ops(exp->common.blk, &fuse_export_blk_dev_ops, exp);
+
+/*
+ * We handle draining ourselves using an in-flight counter and by disabling
+ * the FUSE fd handler. Do not queue BlockBackend requests, they need to
+ * complete so the in-flight counter reaches zero.
+ */
+blk_set_disable_request_queuing(exp->common.blk, true);
+
 init_exports_table();
 
 /*
@@ -224,7 +270,7 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
 g_hash_table_insert(exports, g_strdup(mountpoint), NULL);
 
 aio_set_fd_handler(exp->common.ctx,
-   fuse_session_fd(exp->fuse_session), true,
+   fuse_session_fd(exp->fuse_session), false,
read_from_fuse_export, NULL, NULL, NULL, exp);
 exp->fd_handler_set_up = true;
 
@@ -246,6 +292,8 @@ static void read_from_fuse_export(void *opaque)
 
 blk_exp_ref(&exp->common);
 
+qatomic_inc(&exp->in_flight);
+
 do {
 ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
 } while (ret == -EINTR);
@@ -256,6 +304,10 @@ static void read_from_fuse_export(void *opaque)
 fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
 
 out:
+if (qatomic_fetch_dec(&exp->in_flight) == 1) {
+aio_wait_kick(); /* wake AIO_WAIT_WHILE() */
+}
+
 blk_exp_unref(&exp->common);
 }
 
@@ -268,7 +320,7 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
 
 if (exp->fd_handler_set_up) {
 aio_set_fd_handler(exp->common.ctx,
-   fuse_session_fd(exp->fuse_session), true,
+   fuse_session_fd(exp->fuse_session), false,
NULL, NULL, NULL, NULL, NULL);
 exp->fd_handler_set_up = false;
 }
-- 
2.40.1




[PATCH v6 14/20] block/export: don't require AioContext lock around blk_exp_ref/unref()

2023-05-16 Thread Stefan Hajnoczi
The FUSE export calls blk_exp_ref/unref() without the AioContext lock.
Instead of fixing the FUSE export, adjust blk_exp_ref/unref() so they
work without the AioContext lock. This way it's less error-prone.

Suggested-by: Paolo Bonzini 
Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 include/block/export.h   |  2 ++
 block/export/export.c| 13 ++---
 block/export/vduse-blk.c |  4 
 3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/block/export.h b/include/block/export.h
index 7feb02e10d..f2fe0f8078 100644
--- a/include/block/export.h
+++ b/include/block/export.h
@@ -57,6 +57,8 @@ struct BlockExport {
  * Reference count for this block export. This includes strong references
  * both from the owner (qemu-nbd or the monitor) and clients connected to
  * the export.
+ *
+ * Use atomics to access this field.
  */
 int refcount;
 
diff --git a/block/export/export.c b/block/export/export.c
index 62c7c22d45..ab007e9d31 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -202,11 +202,10 @@ fail:
 return NULL;
 }
 
-/* Callers must hold exp->ctx lock */
 void blk_exp_ref(BlockExport *exp)
 {
-assert(exp->refcount > 0);
-exp->refcount++;
+assert(qatomic_read(&exp->refcount) > 0);
+qatomic_inc(&exp->refcount);
 }
 
 /* Runs in the main thread */
@@ -229,11 +228,10 @@ static void blk_exp_delete_bh(void *opaque)
 aio_context_release(aio_context);
 }
 
-/* Callers must hold exp->ctx lock */
 void blk_exp_unref(BlockExport *exp)
 {
-assert(exp->refcount > 0);
-if (--exp->refcount == 0) {
+assert(qatomic_read(&exp->refcount) > 0);
+if (qatomic_fetch_dec(&exp->refcount) == 1) {
 /* Touch the block_exports list only in the main thread */
 aio_bh_schedule_oneshot(qemu_get_aio_context(), blk_exp_delete_bh,
 exp);
@@ -341,7 +339,8 @@ void qmp_block_export_del(const char *id,
 if (!has_mode) {
 mode = BLOCK_EXPORT_REMOVE_MODE_SAFE;
 }
-if (mode == BLOCK_EXPORT_REMOVE_MODE_SAFE && exp->refcount > 1) {
+if (mode == BLOCK_EXPORT_REMOVE_MODE_SAFE &&
+qatomic_read(&exp->refcount) > 1) {
 error_setg(errp, "export '%s' still in use", exp->id);
 error_append_hint(errp, "Use mode='hard' to force client "
   "disconnect\n");
diff --git a/block/export/vduse-blk.c b/block/export/vduse-blk.c
index a25556fe04..e041f9 100644
--- a/block/export/vduse-blk.c
+++ b/block/export/vduse-blk.c
@@ -44,9 +44,7 @@ static void vduse_blk_inflight_inc(VduseBlkExport *vblk_exp)
 {
 if (qatomic_fetch_inc(&vblk_exp->inflight) == 0) {
 /* Prevent export from being deleted */
-aio_context_acquire(vblk_exp->export.ctx);
 blk_exp_ref(&vblk_exp->export);
-aio_context_release(vblk_exp->export.ctx);
 }
 }
 
@@ -57,9 +55,7 @@ static void vduse_blk_inflight_dec(VduseBlkExport *vblk_exp)
 aio_wait_kick();
 
 /* Now the export can be deleted */
-aio_context_acquire(vblk_exp->export.ctx);
 blk_exp_unref(&vblk_exp->export);
-aio_context_release(vblk_exp->export.ctx);
 }
 }
 
-- 
2.40.1




[PATCH v6 10/20] block: drain from main loop thread in bdrv_co_yield_to_drain()

2023-05-16 Thread Stefan Hajnoczi
For simplicity, always run BlockDevOps .drained_begin/end/poll()
callbacks in the main loop thread. This makes it easier to implement the
callbacks and avoids extra locks.

Move the function pointer declarations from the I/O Code section to the
Global State section for BlockDevOps, BdrvChildClass, and BlockDriver.

Narrow IO_OR_GS_CODE() to GLOBAL_STATE_CODE() where appropriate.

The test-bdrv-drain test case calls bdrv_drain() from an IOThread. This
is now only allowed from coroutine context, so update the test case to
run in a coroutine.
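
A minimal sketch of the coroutine wrapper pattern (illustrative names; the
test's actual hunk is not shown here):

  static void coroutine_fn call_drain_co(void *opaque)
  {
      BlockDriverState *bs = opaque;

      bdrv_drain(bs);  /* now runs in coroutine context */
  }

  static void iothread_side(AioContext *ctx, BlockDriverState *bs)
  {
      Coroutine *co = qemu_coroutine_create(call_drain_co, bs);

      aio_co_enter(ctx, co);
  }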

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 include/block/block_int-common.h  | 90 +--
 include/sysemu/block-backend-common.h | 25 
 block/io.c| 14 +++--
 tests/unit/test-bdrv-drain.c  | 14 +++--
 4 files changed, 76 insertions(+), 67 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index dbec0e3bb4..17eb54edcb 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -363,6 +363,21 @@ struct BlockDriver {
 void (*bdrv_attach_aio_context)(BlockDriverState *bs,
 AioContext *new_context);
 
+/**
+ * bdrv_drain_begin is called if implemented in the beginning of a
+ * drain operation to drain and stop any internal sources of requests in
+ * the driver.
+ * bdrv_drain_end is called if implemented at the end of the drain.
+ *
+ * They should be used by the driver to e.g. manage scheduled I/O
+ * requests, or toggle an internal state. After the end of the drain new
+ * requests will continue normally.
+ *
+ * Implementations of both functions must not call aio_poll().
+ */
+void (*bdrv_drain_begin)(BlockDriverState *bs);
+void (*bdrv_drain_end)(BlockDriverState *bs);
+
 /**
  * Try to get @bs's logical and physical block size.
  * On success, store them in @bsz and return zero.
@@ -758,21 +773,6 @@ struct BlockDriver {
 void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_unplug)(
 BlockDriverState *bs);
 
-/**
- * bdrv_drain_begin is called if implemented in the beginning of a
- * drain operation to drain and stop any internal sources of requests in
- * the driver.
- * bdrv_drain_end is called if implemented at the end of the drain.
- *
- * They should be used by the driver to e.g. manage scheduled I/O
- * requests, or toggle an internal state. After the end of the drain new
- * requests will continue normally.
- *
- * Implementations of both functions must not call aio_poll().
- */
-void (*bdrv_drain_begin)(BlockDriverState *bs);
-void (*bdrv_drain_end)(BlockDriverState *bs);
-
 bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
 
 bool coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_can_store_new_dirty_bitmap)(
@@ -955,36 +955,6 @@ struct BdrvChildClass {
 void GRAPH_WRLOCK_PTR (*attach)(BdrvChild *child);
 void GRAPH_WRLOCK_PTR (*detach)(BdrvChild *child);
 
-/*
- * Notifies the parent that the filename of its child has changed (e.g.
- * because the direct child was removed from the backing chain), so that it
- * can update its reference.
- */
-int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
-   const char *filename, Error **errp);
-
-bool (*change_aio_ctx)(BdrvChild *child, AioContext *ctx,
-   GHashTable *visited, Transaction *tran,
-   Error **errp);
-
-/*
- * I/O API functions. These functions are thread-safe.
- *
- * See include/block/block-io.h for more information about
- * the I/O API.
- */
-
-void (*resize)(BdrvChild *child);
-
-/*
- * Returns a name that is supposedly more useful for human users than the
- * node name for identifying the node in question (in particular, a BB
- * name), or NULL if the parent can't provide a better name.
- */
-const char *(*get_name)(BdrvChild *child);
-
-AioContext *(*get_parent_aio_context)(BdrvChild *child);
-
 /*
  * If this pair of functions is implemented, the parent doesn't issue new
  * requests after returning from .drained_begin() until .drained_end() is
@@ -1005,6 +975,36 @@ struct BdrvChildClass {
  * activity on the child has stopped.
  */
 bool (*drained_poll)(BdrvChild *child);
+
+/*
+ * Notifies the parent that the filename of its child has changed (e.g.
+ * because the direct child was removed from the backing chain), so that it
+ * can update its reference.
+ */
+int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
+   const char *filename, Error **errp);
+
+bool (*change_aio_ctx)(BdrvChild *child, AioContext *ctx,
+   GHashTable *visited, Tr

[PATCH v6 09/20] block: add blk_in_drain() API

2023-05-16 Thread Stefan Hajnoczi
The BlockBackend quiesce_counter is greater than zero during drained
sections. Add an API to check whether the BlockBackend is in a drained
section.

The next patch will use this API.
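
For example, the xen-block patch later in this series uses it like this:

  if (!blk_in_drain(dataplane->blk)) {
      xen_block_dataplane_attach(dataplane);
  }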

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 include/sysemu/block-backend-global-state.h | 1 +
 block/block-backend.c   | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/include/sysemu/block-backend-global-state.h b/include/sysemu/block-backend-global-state.h
index fa83f9389c..184e667ebd 100644
--- a/include/sysemu/block-backend-global-state.h
+++ b/include/sysemu/block-backend-global-state.h
@@ -81,6 +81,7 @@ void blk_activate(BlockBackend *blk, Error **errp);
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags);
 void blk_aio_cancel(BlockAIOCB *acb);
 int blk_commit_all(void);
+bool blk_in_drain(BlockBackend *blk);
 void blk_drain(BlockBackend *blk);
 void blk_drain_all(void);
 void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
diff --git a/block/block-backend.c b/block/block-backend.c
index 68087437ac..3a5949ecce 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1270,6 +1270,13 @@ blk_check_byte_request(BlockBackend *blk, int64_t offset, int64_t bytes)
 return 0;
 }
 
+/* Are we currently in a drained section? */
+bool blk_in_drain(BlockBackend *blk)
+{
+GLOBAL_STATE_CODE(); /* change to IO_OR_GS_CODE(), if necessary */
+return qatomic_read(&blk->quiesce_counter);
+}
+
 /* To be called between exactly one pair of blk_inc/dec_in_flight() */
 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
 {
-- 
2.40.1




[PATCH v6 08/20] hw/xen: do not use aio_set_fd_handler(is_external=true) in xen_xenstore

2023-05-16 Thread Stefan Hajnoczi
There is no need to suspend activity between aio_disable_external() and
aio_enable_external(), which is mainly used for the block layer's drain
operation.

This is part of ongoing work to remove the aio_disable_external() API.

Reviewed-by: David Woodhouse 
Reviewed-by: Paul Durrant 
Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 hw/i386/kvm/xen_xenstore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/kvm/xen_xenstore.c b/hw/i386/kvm/xen_xenstore.c
index 900679af8a..6e81bc8791 100644
--- a/hw/i386/kvm/xen_xenstore.c
+++ b/hw/i386/kvm/xen_xenstore.c
@@ -133,7 +133,7 @@ static void xen_xenstore_realize(DeviceState *dev, Error **errp)
 error_setg(errp, "Xenstore evtchn port init failed");
 return;
 }
-aio_set_fd_handler(qemu_get_aio_context(), xen_be_evtchn_fd(s->eh), true,
+aio_set_fd_handler(qemu_get_aio_context(), xen_be_evtchn_fd(s->eh), false,
xen_xenstore_event, NULL, NULL, NULL, s);
 
 s->impl = xs_impl_create(xen_domid);
-- 
2.40.1




[PATCH v6 06/20] block/export: wait for vhost-user-blk requests when draining

2023-05-16 Thread Stefan Hajnoczi
Each vhost-user-blk request runs in a coroutine. When the BlockBackend
enters a drained section we need to enter a quiescent state. Currently
any in-flight requests race with bdrv_drained_begin() because it is
unaware of vhost-user-blk requests.

When blk_co_preadv/pwritev()/etc returns it wakes the
bdrv_drained_begin() thread but vhost-user-blk request processing has
not yet finished. The request coroutine continues executing while the
main loop thread thinks it is in a drained section.

One example where this is unsafe is for blk_set_aio_context() where
bdrv_drained_begin() is called before .aio_context_detached() and
.aio_context_attach(). If request coroutines are still running after
bdrv_drained_begin(), then the AioContext could change underneath them
and they race with new requests processed in the new AioContext. This
could lead to virtqueue corruption, for example.

(This example is theoretical; I came across this while reading the
code and have not tried to reproduce it.)

It's easy to make bdrv_drained_begin() wait for in-flight requests: add
a .drained_poll() callback that checks the VuServer's in-flight counter.
VuServer just needs an API that returns true when there are requests in
flight. The in-flight counter needs to be atomic.
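
As a standalone illustration of the counter pattern (plain C11 atomics
here; the patch itself uses QEMU's qatomic_* wrappers, as the hunks below
show):

  #include <stdatomic.h>
  #include <stdbool.h>

  typedef struct {
      atomic_uint in_flight;
  } Server;

  static void inc_in_flight(Server *s) { atomic_fetch_add(&s->in_flight, 1); }
  static void dec_in_flight(Server *s) { atomic_fetch_sub(&s->in_flight, 1); }
  static bool has_in_flight(Server *s) { return atomic_load(&s->in_flight) > 0; }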

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
v5:
- Use atomic accesses for in_flight counter in vhost-user-server.c [Kevin]
---
 include/qemu/vhost-user-server.h |  4 +++-
 block/export/vhost-user-blk-server.c | 13 +
 util/vhost-user-server.c | 18 --
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/qemu/vhost-user-server.h b/include/qemu/vhost-user-server.h
index bc0ac9ddb6..b1c1cda886 100644
--- a/include/qemu/vhost-user-server.h
+++ b/include/qemu/vhost-user-server.h
@@ -40,8 +40,9 @@ typedef struct {
 int max_queues;
 const VuDevIface *vu_iface;
 
+unsigned int in_flight; /* atomic */
+
 /* Protected by ctx lock */
-unsigned int in_flight;
 bool wait_idle;
 VuDev vu_dev;
 QIOChannel *ioc; /* The I/O channel with the client */
@@ -62,6 +63,7 @@ void vhost_user_server_stop(VuServer *server);
 
 void vhost_user_server_inc_in_flight(VuServer *server);
 void vhost_user_server_dec_in_flight(VuServer *server);
+bool vhost_user_server_has_in_flight(VuServer *server);
 
 void vhost_user_server_attach_aio_context(VuServer *server, AioContext *ctx);
 void vhost_user_server_detach_aio_context(VuServer *server);
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index 841acb36e3..f51a36a14f 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -272,7 +272,20 @@ static void vu_blk_exp_resize(void *opaque)
 vu_config_change_msg(&vexp->vu_server.vu_dev);
 }
 
+/*
+ * Ensures that bdrv_drained_begin() waits until in-flight requests complete.
+ *
+ * Called with vexp->export.ctx acquired.
+ */
+static bool vu_blk_drained_poll(void *opaque)
+{
+VuBlkExport *vexp = opaque;
+
+return vhost_user_server_has_in_flight(&vexp->vu_server);
+}
+
 static const BlockDevOps vu_blk_dev_ops = {
+.drained_poll  = vu_blk_drained_poll,
 .resize_cb = vu_blk_exp_resize,
 };
 
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 1622f8cfb3..68c3bf162f 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -78,17 +78,23 @@ static void panic_cb(VuDev *vu_dev, const char *buf)
 void vhost_user_server_inc_in_flight(VuServer *server)
 {
 assert(!server->wait_idle);
-server->in_flight++;
+qatomic_inc(&server->in_flight);
 }
 
 void vhost_user_server_dec_in_flight(VuServer *server)
 {
-server->in_flight--;
-if (server->wait_idle && !server->in_flight) {
-aio_co_wake(server->co_trip);
+if (qatomic_fetch_dec(&server->in_flight) == 1) {
+if (server->wait_idle) {
+aio_co_wake(server->co_trip);
+}
 }
 }
 
+bool vhost_user_server_has_in_flight(VuServer *server)
+{
+return qatomic_load_acquire(&server->in_flight) > 0;
+}
+
 static bool coroutine_fn
 vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg *vmsg)
 {
@@ -192,13 +198,13 @@ static coroutine_fn void vu_client_trip(void *opaque)
 /* Keep running */
 }
 
-if (server->in_flight) {
+if (vhost_user_server_has_in_flight(server)) {
 /* Wait for requests to complete before we can unmap the memory */
 server->wait_idle = true;
 qemu_coroutine_yield();
 server->wait_idle = false;
 }
-assert(server->in_flight == 0);
+assert(!vhost_user_server_has_in_flight(server));
 
 vu_deinit(vu_dev);
 
-- 
2.40.1




[PATCH v6 07/20] block/export: stop using is_external in vhost-user-blk server

2023-05-16 Thread Stefan Hajnoczi
vhost-user activity must be suspended during bdrv_drained_begin/end().
This prevents new requests from interfering with whatever is happening
in the drained section.

Previously this was done using aio_set_fd_handler()'s is_external
argument. In a multi-queue block layer world the aio_disable_external()
API cannot be used since multiple AioContext may be processing I/O, not
just one.

Switch to BlockDevOps->drained_begin/end() callbacks.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 block/export/vhost-user-blk-server.c | 28 ++--
 util/vhost-user-server.c | 10 +-
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index f51a36a14f..81b59761e3 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -212,15 +212,21 @@ static void blk_aio_attached(AioContext *ctx, void *opaque)
 {
 VuBlkExport *vexp = opaque;
 
+/*
+ * The actual attach will happen in vu_blk_drained_end() and we just
+ * restore ctx here.
+ */
 vexp->export.ctx = ctx;
-vhost_user_server_attach_aio_context(&vexp->vu_server, ctx);
 }
 
 static void blk_aio_detach(void *opaque)
 {
 VuBlkExport *vexp = opaque;
 
-vhost_user_server_detach_aio_context(&vexp->vu_server);
+/*
+ * The actual detach already happened in vu_blk_drained_begin() but from
+ * this point on we must not access ctx anymore.
+ */
 vexp->export.ctx = NULL;
 }
 
@@ -272,6 +278,22 @@ static void vu_blk_exp_resize(void *opaque)
 vu_config_change_msg(&vexp->vu_server.vu_dev);
 }
 
+/* Called with vexp->export.ctx acquired */
+static void vu_blk_drained_begin(void *opaque)
+{
+VuBlkExport *vexp = opaque;
+
+vhost_user_server_detach_aio_context(&vexp->vu_server);
+}
+
+/* Called with vexp->export.blk AioContext acquired */
+static void vu_blk_drained_end(void *opaque)
+{
+VuBlkExport *vexp = opaque;
+
+vhost_user_server_attach_aio_context(&vexp->vu_server, vexp->export.ctx);
+}
+
 /*
  * Ensures that bdrv_drained_begin() waits until in-flight requests complete.
  *
@@ -285,6 +307,8 @@ static bool vu_blk_drained_poll(void *opaque)
 }
 
 static const BlockDevOps vu_blk_dev_ops = {
+.drained_begin = vu_blk_drained_begin,
+.drained_end   = vu_blk_drained_end,
 .drained_poll  = vu_blk_drained_poll,
 .resize_cb = vu_blk_exp_resize,
 };
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 68c3bf162f..a12b2d1bba 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -278,7 +278,7 @@ set_watch(VuDev *vu_dev, int fd, int vu_evt,
 vu_fd_watch->fd = fd;
 vu_fd_watch->cb = cb;
 qemu_socket_set_nonblock(fd);
-aio_set_fd_handler(server->ioc->ctx, fd, true, kick_handler,
+aio_set_fd_handler(server->ioc->ctx, fd, false, kick_handler,
NULL, NULL, NULL, vu_fd_watch);
 vu_fd_watch->vu_dev = vu_dev;
 vu_fd_watch->pvt = pvt;
@@ -299,7 +299,7 @@ static void remove_watch(VuDev *vu_dev, int fd)
 if (!vu_fd_watch) {
 return;
 }
-aio_set_fd_handler(server->ioc->ctx, fd, true,
+aio_set_fd_handler(server->ioc->ctx, fd, false,
NULL, NULL, NULL, NULL, NULL);
 
 QTAILQ_REMOVE(&server->vu_fd_watches, vu_fd_watch, next);
@@ -362,7 +362,7 @@ void vhost_user_server_stop(VuServer *server)
 VuFdWatch *vu_fd_watch;
 
 QTAILQ_FOREACH(vu_fd_watch, &server->vu_fd_watches, next) {
-aio_set_fd_handler(server->ctx, vu_fd_watch->fd, true,
+aio_set_fd_handler(server->ctx, vu_fd_watch->fd, false,
NULL, NULL, NULL, NULL, vu_fd_watch);
 }
 
@@ -403,7 +403,7 @@ void vhost_user_server_attach_aio_context(VuServer *server, AioContext *ctx)
 qio_channel_attach_aio_context(server->ioc, ctx);
 
 QTAILQ_FOREACH(vu_fd_watch, &server->vu_fd_watches, next) {
-aio_set_fd_handler(ctx, vu_fd_watch->fd, true, kick_handler, NULL,
+aio_set_fd_handler(ctx, vu_fd_watch->fd, false, kick_handler, NULL,
NULL, NULL, vu_fd_watch);
 }
 
@@ -417,7 +417,7 @@ void vhost_user_server_detach_aio_context(VuServer *server)
 VuFdWatch *vu_fd_watch;
 
 QTAILQ_FOREACH(vu_fd_watch, &server->vu_fd_watches, next) {
-aio_set_fd_handler(server->ctx, vu_fd_watch->fd, true,
+aio_set_fd_handler(server->ctx, vu_fd_watch->fd, false,
NULL, NULL, NULL, NULL, vu_fd_watch);
 }
 
-- 
2.40.1




[PATCH v6 05/20] util/vhost-user-server: rename refcount to in_flight counter

2023-05-16 Thread Stefan Hajnoczi
The VuServer object has a refcount field and ref/unref APIs. The name is
confusing because it's actually an in-flight request counter instead of
a refcount.

Normally a refcount destroys the object upon reaching zero. The VuServer
counter is used to wake up the vhost-user coroutine when there are no
more requests.

Avoid confusion by renaming refcount and ref/unref to in_flight and
inc/dec.

Reviewed-by: Paolo Bonzini 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 include/qemu/vhost-user-server.h |  6 +++---
 block/export/vhost-user-blk-server.c | 11 +++
 util/vhost-user-server.c | 14 +++---
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/include/qemu/vhost-user-server.h b/include/qemu/vhost-user-server.h
index 25c72433ca..bc0ac9ddb6 100644
--- a/include/qemu/vhost-user-server.h
+++ b/include/qemu/vhost-user-server.h
@@ -41,7 +41,7 @@ typedef struct {
 const VuDevIface *vu_iface;
 
 /* Protected by ctx lock */
-unsigned int refcount;
+unsigned int in_flight;
 bool wait_idle;
 VuDev vu_dev;
 QIOChannel *ioc; /* The I/O channel with the client */
@@ -60,8 +60,8 @@ bool vhost_user_server_start(VuServer *server,
 
 void vhost_user_server_stop(VuServer *server);
 
-void vhost_user_server_ref(VuServer *server);
-void vhost_user_server_unref(VuServer *server);
+void vhost_user_server_inc_in_flight(VuServer *server);
+void vhost_user_server_dec_in_flight(VuServer *server);
 
 void vhost_user_server_attach_aio_context(VuServer *server, AioContext *ctx);
 void vhost_user_server_detach_aio_context(VuServer *server);
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index e56b92f2e2..841acb36e3 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -50,7 +50,10 @@ static void vu_blk_req_complete(VuBlkReq *req, size_t in_len)
 free(req);
 }
 
-/* Called with server refcount increased, must decrease before returning */
+/*
+ * Called with server in_flight counter increased, must decrease before
+ * returning.
+ */
 static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
 {
 VuBlkReq *req = opaque;
@@ -68,12 +71,12 @@ static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
 in_num, out_num);
 if (in_len < 0) {
 free(req);
-vhost_user_server_unref(server);
+vhost_user_server_dec_in_flight(server);
 return;
 }
 
 vu_blk_req_complete(req, in_len);
-vhost_user_server_unref(server);
+vhost_user_server_dec_in_flight(server);
 }
 
 static void vu_blk_process_vq(VuDev *vu_dev, int idx)
@@ -95,7 +98,7 @@ static void vu_blk_process_vq(VuDev *vu_dev, int idx)
 Coroutine *co =
 qemu_coroutine_create(vu_blk_virtio_process_req, req);
 
-vhost_user_server_ref(server);
+vhost_user_server_inc_in_flight(server);
 qemu_coroutine_enter(co);
 }
 }
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 5b6216069c..1622f8cfb3 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -75,16 +75,16 @@ static void panic_cb(VuDev *vu_dev, const char *buf)
 error_report("vu_panic: %s", buf);
 }
 
-void vhost_user_server_ref(VuServer *server)
+void vhost_user_server_inc_in_flight(VuServer *server)
 {
 assert(!server->wait_idle);
-server->refcount++;
+server->in_flight++;
 }
 
-void vhost_user_server_unref(VuServer *server)
+void vhost_user_server_dec_in_flight(VuServer *server)
 {
-server->refcount--;
-if (server->wait_idle && !server->refcount) {
+server->in_flight--;
+if (server->wait_idle && !server->in_flight) {
 aio_co_wake(server->co_trip);
 }
 }
@@ -192,13 +192,13 @@ static coroutine_fn void vu_client_trip(void *opaque)
 /* Keep running */
 }
 
-if (server->refcount) {
+if (server->in_flight) {
 /* Wait for requests to complete before we can unmap the memory */
 server->wait_idle = true;
 qemu_coroutine_yield();
 server->wait_idle = false;
 }
-assert(server->refcount == 0);
+assert(server->in_flight == 0);
 
 vu_deinit(vu_dev);
 
-- 
2.40.1




[PATCH v6 04/20] virtio-scsi: stop using aio_disable_external() during unplug

2023-05-16 Thread Stefan Hajnoczi
This patch is part of an effort to remove the aio_disable_external()
API because it does not fit in a multi-queue block layer world where
many AioContexts may be submitting requests to the same disk.

The SCSI emulation code is already in good shape to stop using
aio_disable_external(). It was only used by commit 9c5aad84da1c
("virtio-scsi: fixed virtio_scsi_ctx_check failed when detaching scsi
disk") to ensure that virtio_scsi_hotunplug() works while the guest
driver is submitting I/O.

Ensure virtio_scsi_hotunplug() is safe as follows:

1. qdev_simple_device_unplug_cb() -> qdev_unrealize() ->
   device_set_realized() calls qatomic_set(&dev->realized, false) so
   that future scsi_device_get() calls return NULL because they exclude
   SCSIDevices with realized=false.

   That means virtio-scsi will reject new I/O requests to this
   SCSIDevice with VIRTIO_SCSI_S_BAD_TARGET even while
   virtio_scsi_hotunplug() is still executing. We are protected against
   new requests!

2. scsi_qdev_unrealize() already contains a call to
   scsi_device_purge_requests() so that in-flight requests are cancelled
   synchronously. This ensures that no in-flight requests remain once
   qdev_simple_device_unplug_cb() returns.

Thanks to these two conditions we don't need aio_disable_external()
anymore.
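
Point 1 in code terms, as an illustrative sketch (not a hunk from this
patch):

  SCSIDevice *dev = scsi_device_get(bus, channel, id, lun);
  if (!dev) {
      /*
       * Unplug in progress: realized=false devices are excluded, so the
       * guest gets VIRTIO_SCSI_S_BAD_TARGET instead of a stale device.
       */
  }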

Cc: Zhengui Li 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Daniil Tatianin 
Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 hw/scsi/virtio-scsi.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index ae314af3de..c1a7ea9ae2 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -1091,7 +1091,6 @@ static void virtio_scsi_hotunplug(HotplugHandler *hotplug_dev, DeviceState *dev,
 VirtIODevice *vdev = VIRTIO_DEVICE(hotplug_dev);
 VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 SCSIDevice *sd = SCSI_DEVICE(dev);
-AioContext *ctx = s->ctx ?: qemu_get_aio_context();
 VirtIOSCSIEventInfo info = {
 .event   = VIRTIO_SCSI_T_TRANSPORT_RESET,
 .reason  = VIRTIO_SCSI_EVT_RESET_REMOVED,
@@ -1101,9 +1100,7 @@ static void virtio_scsi_hotunplug(HotplugHandler *hotplug_dev, DeviceState *dev,
 },
 };
 
-aio_disable_external(ctx);
 qdev_simple_device_unplug_cb(hotplug_dev, dev, errp);
-aio_enable_external(ctx);
 
 if (s->ctx) {
 virtio_scsi_acquire(s);
-- 
2.40.1




[PATCH v6 03/20] virtio-scsi: avoid race between unplug and transport event

2023-05-16 Thread Stefan Hajnoczi
Only report a transport reset event to the guest after the SCSIDevice
has been unrealized by qdev_simple_device_unplug_cb().

qdev_simple_device_unplug_cb() sets the SCSIDevice's qdev.realized field
to false so that scsi_device_find/get() no longer see it.

scsi_target_emulate_report_luns() also needs to be updated to filter out
SCSIDevices that are unrealized.

Change virtio_scsi_push_event() to take event information as an argument
instead of the SCSIDevice. This allows virtio_scsi_hotunplug() to emit a
VIRTIO_SCSI_T_TRANSPORT_RESET event after the SCSIDevice has already
been unrealized.

These changes ensure that the guest driver does not see the SCSIDevice
that's being unplugged if it responds very quickly to the transport
reset event.
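
The resulting call pattern, as the hotunplug path in the next patch of
this series invokes it (sd is the SCSIDevice being unplugged):

  VirtIOSCSIEventInfo info = {
      .event   = VIRTIO_SCSI_T_TRANSPORT_RESET,
      .reason  = VIRTIO_SCSI_EVT_RESET_REMOVED,
      .address = {
          .id  = sd->id,
          .lun = sd->lun,
      },
  };

  virtio_scsi_push_event(s, &info);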

Reviewed-by: Paolo Bonzini 
Reviewed-by: Michael S. Tsirkin 
Reviewed-by: Daniil Tatianin 
Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
v5:
- Stash SCSIDevice id/lun values for VIRTIO_SCSI_T_TRANSPORT_RESET event
  before unrealizing the SCSIDevice [Kevin]
---
 hw/scsi/scsi-bus.c|  3 +-
 hw/scsi/virtio-scsi.c | 86 ++-
 2 files changed, 63 insertions(+), 26 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 8857ff41f6..64013c8a24 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -487,7 +487,8 @@ static bool scsi_target_emulate_report_luns(SCSITargetReq *r)
 DeviceState *qdev = kid->child;
 SCSIDevice *dev = SCSI_DEVICE(qdev);
 
-if (dev->channel == channel && dev->id == id && dev->lun != 0) {
+if (dev->channel == channel && dev->id == id && dev->lun != 0 &&
+qdev_is_realized(&dev->qdev)) {
 store_lun(tmp, dev->lun);
 g_byte_array_append(buf, tmp, 8);
 len += 8;
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 612c525d9d..ae314af3de 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -933,13 +933,27 @@ static void virtio_scsi_reset(VirtIODevice *vdev)
 s->events_dropped = false;
 }
 
-static void virtio_scsi_push_event(VirtIOSCSI *s, SCSIDevice *dev,
-   uint32_t event, uint32_t reason)
+typedef struct {
+uint32_t event;
+uint32_t reason;
+union {
+/* Used by messages specific to a device */
+struct {
+uint32_t id;
+uint32_t lun;
+} address;
+};
+} VirtIOSCSIEventInfo;
+
+static void virtio_scsi_push_event(VirtIOSCSI *s,
+   const VirtIOSCSIEventInfo *info)
 {
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
 VirtIOSCSIReq *req;
 VirtIOSCSIEvent *evt;
 VirtIODevice *vdev = VIRTIO_DEVICE(s);
+uint32_t event = info->event;
+uint32_t reason = info->reason;
 
 if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
 return;
@@ -965,27 +979,28 @@ static void virtio_scsi_push_event(VirtIOSCSI *s, SCSIDevice *dev,
 memset(evt, 0, sizeof(VirtIOSCSIEvent));
 evt->event = virtio_tswap32(vdev, event);
 evt->reason = virtio_tswap32(vdev, reason);
-if (!dev) {
-assert(event == VIRTIO_SCSI_T_EVENTS_MISSED);
-} else {
+if (event != VIRTIO_SCSI_T_EVENTS_MISSED) {
 evt->lun[0] = 1;
-evt->lun[1] = dev->id;
+evt->lun[1] = info->address.id;
 
 /* Linux wants us to keep the same encoding we use for REPORT LUNS.  */
-if (dev->lun >= 256) {
-evt->lun[2] = (dev->lun >> 8) | 0x40;
+if (info->address.lun >= 256) {
+evt->lun[2] = (info->address.lun >> 8) | 0x40;
 }
-evt->lun[3] = dev->lun & 0xFF;
+evt->lun[3] = info->address.lun & 0xFF;
 }
 trace_virtio_scsi_event(virtio_scsi_get_lun(evt->lun), event, reason);
- 
+
 virtio_scsi_complete_req(req);
 }
 
 static void virtio_scsi_handle_event_vq(VirtIOSCSI *s, VirtQueue *vq)
 {
 if (s->events_dropped) {
-virtio_scsi_push_event(s, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
+VirtIOSCSIEventInfo info = {
+.event = VIRTIO_SCSI_T_NO_EVENT,
+};
+virtio_scsi_push_event(s, &info);
 }
 }
 
@@ -1009,9 +1024,17 @@ static void virtio_scsi_change(SCSIBus *bus, SCSIDevice *dev, SCSISense sense)
 
 if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_CHANGE) &&
 dev->type != TYPE_ROM) {
+VirtIOSCSIEventInfo info = {
+.event   = VIRTIO_SCSI_T_PARAM_CHANGE,
+.reason  = sense.asc | (sense.ascq << 8),
+.address = {
+.id  = dev->id,
+.lun = dev->lun,
+},
+};
+
 virtio_scsi_acquire(s);
-virtio_scsi_push_event(s, dev, VIRTIO_SCSI_T_PARAM_CHANGE,
-  

[PATCH v6 02/20] hw/qdev: introduce qdev_is_realized() helper

2023-05-16 Thread Stefan Hajnoczi
Add a helper function to check whether the device is realized without
requiring the Big QEMU Lock. The next patch adds a second caller. The
goal is to avoid spreading DeviceState field accesses throughout the
code.

Suggested-by: Philippe Mathieu-Daudé 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 include/hw/qdev-core.h | 17 ++---
 hw/scsi/scsi-bus.c |  3 +--
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 7623703943..f1070d6dc7 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -1,6 +1,7 @@
 #ifndef QDEV_CORE_H
 #define QDEV_CORE_H
 
+#include "qemu/atomic.h"
 #include "qemu/queue.h"
 #include "qemu/bitmap.h"
 #include "qemu/rcu.h"
@@ -168,9 +169,6 @@ typedef struct {
 
 /**
  * DeviceState:
- * @realized: Indicates whether the device has been fully constructed.
- *When accessed outside big qemu lock, must be accessed with
- *qatomic_load_acquire()
  * @reset: ResettableState for the device; handled by Resettable interface.
  *
  * This structure should not be accessed directly.  We declare it here
@@ -339,6 +337,19 @@ DeviceState *qdev_new(const char *name);
  */
 DeviceState *qdev_try_new(const char *name);
 
+/**
+ * qdev_is_realized:
+ * @dev: The device to check.
+ *
+ * May be called outside big qemu lock.
+ *
+ * Returns: %true% if the device has been fully constructed, %false% otherwise.
+ */
+static inline bool qdev_is_realized(DeviceState *dev)
+{
+return qatomic_load_acquire(&dev->realized);
+}
+
 /**
  * qdev_realize: Realize @dev.
  * @dev: device to realize
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 3c20b47ad0..8857ff41f6 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -60,8 +60,7 @@ static SCSIDevice *do_scsi_device_find(SCSIBus *bus,
  * the user access the device.
  */
 
-if (retval && !include_unrealized &&
-!qatomic_load_acquire(&retval->qdev.realized)) {
+if (retval && !include_unrealized && !qdev_is_realized(&retval->qdev)) {
 retval = NULL;
 }
 
-- 
2.40.1




[PATCH v6 01/20] block-backend: split blk_do_set_aio_context()

2023-05-16 Thread Stefan Hajnoczi
blk_set_aio_context() is not fully transactional because
blk_do_set_aio_context() updates blk->ctx outside the transaction. Most
of the time this goes unnoticed but a BlockDevOps.drained_end() callback
that invokes blk_get_aio_context() fails assert(ctx == blk->ctx). This
happens because blk->ctx is only assigned after
BlockDevOps.drained_end() is called and we're in an intermediate state
where BlockDriverState nodes already have the new context and the
BlockBackend still has the old context.

Making blk_set_aio_context() fully transactional solves this assertion
failure because the BlockBackend's context is updated as part of the
transaction (before BlockDevOps.drained_end() is called).

Split blk_do_set_aio_context() in order to solve this assertion failure.
This helper function actually serves two different purposes:
1. It drives blk_set_aio_context().
2. It responds to BdrvChildClass->change_aio_ctx().

Get rid of the helper function. Do #1 inside blk_set_aio_context() and
do #2 inside blk_root_set_aio_ctx_commit(). This simplifies the code.

The only drawback of the fully transactional approach is that
blk_set_aio_context() must contend with blk_root_set_aio_ctx_commit()
being invoked as part of the AioContext change propagation. This can be
solved by temporarily setting blk->allow_aio_context_change to true.

Future patches call blk_get_aio_context() from
BlockDevOps->drained_end(), so this patch will become necessary.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
---
 block/block-backend.c | 71 +--
 1 file changed, 28 insertions(+), 43 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index ca537cd0ad..68087437ac 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2411,52 +2411,31 @@ static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb)
 return blk_get_aio_context(blk_acb->blk);
 }
 
-static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context,
-  bool update_root_node, Error **errp)
-{
-BlockDriverState *bs = blk_bs(blk);
-ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
-int ret;
-
-if (bs) {
-bdrv_ref(bs);
-
-if (update_root_node) {
-/*
- * update_root_node MUST be false for blk_root_set_aio_ctx_commit(),
- * as we are already in the commit function of a transaction.
- */
-ret = bdrv_try_change_aio_context(bs, new_context, blk->root, errp);
-if (ret < 0) {
-bdrv_unref(bs);
-return ret;
-}
-}
-/*
- * Make blk->ctx consistent with the root node before we invoke any
- * other operations like drain that might inquire blk->ctx
- */
-blk->ctx = new_context;
-if (tgm->throttle_state) {
-bdrv_drained_begin(bs);
-throttle_group_detach_aio_context(tgm);
-throttle_group_attach_aio_context(tgm, new_context);
-bdrv_drained_end(bs);
-}
-
-bdrv_unref(bs);
-} else {
-blk->ctx = new_context;
-}
-
-return 0;
-}
-
 int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
 Error **errp)
 {
+bool old_allow_change;
+BlockDriverState *bs = blk_bs(blk);
+int ret;
+
 GLOBAL_STATE_CODE();
-return blk_do_set_aio_context(blk, new_context, true, errp);
+
+if (!bs) {
+blk->ctx = new_context;
+return 0;
+}
+
+bdrv_ref(bs);
+
+old_allow_change = blk->allow_aio_context_change;
+blk->allow_aio_context_change = true;
+
+ret = bdrv_try_change_aio_context(bs, new_context, NULL, errp);
+
+blk->allow_aio_context_change = old_allow_change;
+
+bdrv_unref(bs);
+return ret;
 }
 
 typedef struct BdrvStateBlkRootContext {
@@ -2468,8 +2447,14 @@ static void blk_root_set_aio_ctx_commit(void *opaque)
 {
 BdrvStateBlkRootContext *s = opaque;
 BlockBackend *blk = s->blk;
+AioContext *new_context = s->new_ctx;
+ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
 
-blk_do_set_aio_context(blk, s->new_ctx, false, &error_abort);
+blk->ctx = new_context;
+if (tgm->throttle_state) {
+throttle_group_detach_aio_context(tgm);
+throttle_group_attach_aio_context(tgm, new_context);
+}
 }
 
 static TransactionActionDrv set_blk_root_context = {
-- 
2.40.1




[PATCH v6 00/20] block: remove aio_disable_external() API

2023-05-16 Thread Stefan Hajnoczi
v6:
- Fix scsi_device_unrealize() -> scsi_qdev_unrealize() mistake in Patch 4
  commit description [Kevin]
- Explain why we don't schedule a BH in .drained_begin() in Patch 16 [Kevin]
- Copy the comment explaining why the event notifier is tested and cleared in
  Patch 16 [Kevin]
- Fix EPOLL_ENABLE_THRESHOLD mismerge in util/fdmon-epoll.c [Kevin]

v5:
- Use atomic accesses for in_flight counter in vhost-user-server.c [Kevin]
- Stash SCSIDevice id/lun values for VIRTIO_SCSI_T_TRANSPORT_RESET event
  before unrealizing the SCSIDevice [Kevin]
- Keep vhost-user-blk export .detach() callback so ctx is set to NULL [Kevin]
- Narrow BdrvChildClass and BlockDriver drained_{begin/end/poll} callbacks from
  IO_OR_GS_CODE() to GLOBAL_STATE_CODE() [Kevin]
- Include Kevin's "block: Fix use after free in blockdev_mark_auto_del()" to
  fix a latent bug that was exposed by this series

v4:
- Remove external_disable_cnt variable [Philippe]
- Add Patch 1 to fix assertion failure in .drained_end() -> blk_get_aio_context()
v3:
- Resend full patch series. v2 was sent in the middle of a git rebase and was
  missing patches. [Eric]
- Apply Reviewed-by tags.
v2:
- Do not rely on BlockBackend request queuing, implement .drained_begin/end()
  instead in xen-block, virtio-blk, and virtio-scsi [Paolo]
- Add qdev_is_realized() API [Philippe]
- Add patch to avoid AioContext lock around blk_exp_ref/unref() [Paolo]
- Add patch to call .drained_begin/end() from main loop thread to simplify
  callback implementations

The aio_disable_external() API temporarily suspends file descriptor monitoring
in the event loop. The block layer uses this to prevent new I/O requests being
submitted from the guest and elsewhere between bdrv_drained_begin() and
bdrv_drained_end().

While the block layer still needs to prevent new I/O requests in drained
sections, the aio_disable_external() API can be replaced with
.drained_begin/end/poll() callbacks that have been added to BdrvChildClass and
BlockDevOps.

This newer .drained_begin/end/poll() approach is attractive because it works
without specifying a specific AioContext. The block layer is moving towards
multi-queue and that means multiple AioContexts may be processing I/O
simultaneously.

The aio_disable_external() API was always somewhat hacky. It suspends all file
descriptors that were registered with is_external=true, even if they have
nothing to do with the BlockDriverState graph nodes that are being drained.
It's better to solve a block layer problem in the block layer than to have an
odd event loop API solution.

The approach in this patch series is to implement BlockDevOps
.drained_begin/end() callbacks that temporarily stop file descriptor handlers.
This ensures that new I/O requests are not submitted in drained sections.
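
For illustration, the callback pattern looks roughly like this for a
device whose I/O submission is triggered by a file descriptor (MyDevice
and its helpers are made up for this sketch; BlockDevOps and
aio_set_fd_handler() are the real APIs, shown with the post-series
signature):

    /* Hypothetical device, for illustration only */
    typedef struct MyDevice {
        AioContext *ctx;
        int fd;
    } MyDevice;

    static void my_device_read(void *opaque); /* fd handler */

    /* Stop monitoring the fd so no new requests sneak into the drain */
    static void my_device_drained_begin(void *opaque)
    {
        MyDevice *s = opaque;

        aio_set_fd_handler(s->ctx, s->fd, NULL, NULL, NULL, NULL, NULL);
    }

    /* Resume monitoring once the drained section has ended */
    static void my_device_drained_end(void *opaque)
    {
        MyDevice *s = opaque;

        aio_set_fd_handler(s->ctx, s->fd, my_device_read, NULL, NULL, NULL, s);
    }

    static const BlockDevOps my_device_block_ops = {
        .drained_begin = my_device_drained_begin,
        .drained_end   = my_device_drained_end,
    };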

Stefan Hajnoczi (20):
  block-backend: split blk_do_set_aio_context()
  hw/qdev: introduce qdev_is_realized() helper
  virtio-scsi: avoid race between unplug and transport event
  virtio-scsi: stop using aio_disable_external() during unplug
  util/vhost-user-server: rename refcount to in_flight counter
  block/export: wait for vhost-user-blk requests when draining
  block/export: stop using is_external in vhost-user-blk server
  hw/xen: do not use aio_set_fd_handler(is_external=true) in
xen_xenstore
  block: add blk_in_drain() API
  block: drain from main loop thread in bdrv_co_yield_to_drain()
  xen-block: implement BlockDevOps->drained_begin()
  hw/xen: do not set is_external=true on evtchn fds
  block/export: rewrite vduse-blk drain code
  block/export: don't require AioContext lock around blk_exp_ref/unref()
  block/fuse: do not set is_external=true on FUSE fd
  virtio: make it possible to detach host notifier from any thread
  virtio-blk: implement BlockDevOps->drained_begin()
  virtio-scsi: implement BlockDevOps->drained_begin()
  virtio: do not set is_external=true on host notifiers
  aio: remove aio_disable_external() API

 hw/block/dataplane/xen-block.h  |   2 +
 include/block/aio.h |  57 -
 include/block/block_int-common.h|  90 +++---
 include/block/export.h  |   2 +
 include/hw/qdev-core.h  |  17 ++-
 include/hw/scsi/scsi.h  |  14 +++
 include/qemu/vhost-user-server.h|   8 +-
 include/sysemu/block-backend-common.h   |  25 ++--
 include/sysemu/block-backend-global-state.h |   1 +
 util/aio-posix.h|   1 -
 block.c |   7 --
 block/blkio.c   |  15 +--
 block/block-backend.c   |  78 ++--
 block/curl.c|  10 +-
 block/export/export.c   |  13 +-
 block/export/fuse.c |  56 -
 block/export/vduse-blk.c| 128 ++--
 block/export/vhost-user-b

Re: [PATCH v4 20/20] aio: remove aio_disable_external() API

2023-05-11 Thread Stefan Hajnoczi
On Thu, May 04, 2023 at 11:34:17PM +0200, Kevin Wolf wrote:
> Am 25.04.2023 um 19:27 hat Stefan Hajnoczi geschrieben:
> > All callers now pass is_external=false to aio_set_fd_handler() and
> > aio_set_event_notifier(). The aio_disable_external() API that
> > temporarily disables fd handlers that were registered is_external=true
> > is therefore dead code.
> > 
> > Remove aio_disable_external(), aio_enable_external(), and the
> > is_external arguments to aio_set_fd_handler() and
> > aio_set_event_notifier().
> > 
> > The entire test-fdmon-epoll test is removed because its sole purpose was
> > testing aio_disable_external().
> > 
> > Parts of this patch were generated using the following coccinelle
> > (https://coccinelle.lip6.fr/) semantic patch:
> > 
> >   @@
> >   expression ctx, fd, is_external, io_read, io_write, io_poll, 
> > io_poll_ready, opaque;
> >   @@
> >   - aio_set_fd_handler(ctx, fd, is_external, io_read, io_write, io_poll, 
> > io_poll_ready, opaque)
> >   + aio_set_fd_handler(ctx, fd, io_read, io_write, io_poll, io_poll_ready, 
> > opaque)
> > 
> >   @@
> >   expression ctx, notifier, is_external, io_read, io_poll, io_poll_ready;
> >   @@
> >   - aio_set_event_notifier(ctx, notifier, is_external, io_read, io_poll, 
> > io_poll_ready)
> >   + aio_set_event_notifier(ctx, notifier, io_read, io_poll, io_poll_ready)
> > 
> > Reviewed-by: Juan Quintela 
> > Reviewed-by: Philippe Mathieu-Daudé 
> > Signed-off-by: Stefan Hajnoczi 
> 
> > diff --git a/util/fdmon-epoll.c b/util/fdmon-epoll.c
> > index 1683aa1105..6b6a1a91f8 100644
> > --- a/util/fdmon-epoll.c
> > +++ b/util/fdmon-epoll.c
> > @@ -133,13 +128,12 @@ bool fdmon_epoll_try_upgrade(AioContext *ctx, 
> > unsigned npfd)
> >  return false;
> >  }
> >  
> > -/* Do not upgrade while external clients are disabled */
> > -if (qatomic_read(&ctx->external_disable_cnt)) {
> > -return false;
> > -}
> > -
> > -if (npfd < EPOLL_ENABLE_THRESHOLD) {
> > -return false;
> > +if (npfd >= EPOLL_ENABLE_THRESHOLD) {
> > +if (fdmon_epoll_try_enable(ctx)) {
> > +return true;
> > +} else {
> > +fdmon_epoll_disable(ctx);
> > +}
> >  }
> >  
> >  /* The list must not change while we add fds to epoll */
> 
> I don't understand this hunk. Why are you changing more than just
> deleting the external_disable_cnt check?
> 
> Is this a mismerge with your own commit e62da985?

Yes, it's a mismerge. Thanks for catching that!

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v4 17/20] virtio-blk: implement BlockDevOps->drained_begin()

2023-05-11 Thread Stefan Hajnoczi
On Thu, May 04, 2023 at 11:13:42PM +0200, Kevin Wolf wrote:
> Am 25.04.2023 um 19:27 hat Stefan Hajnoczi geschrieben:
> > Detach ioeventfds during drained sections to stop I/O submission from
> > the guest. virtio-blk is no longer reliant on aio_disable_external()
> > after this patch. This will allow us to remove the
> > aio_disable_external() API once all other code that relies on it is
> > converted.
> > 
> > Take extra care to avoid attaching/detaching ioeventfds if the data
> > plane is started/stopped during a drained section. This should be rare,
> > but maybe the mirror block job can trigger it.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >  hw/block/dataplane/virtio-blk.c | 17 +--
> >  hw/block/virtio-blk.c   | 38 -
> >  2 files changed, 48 insertions(+), 7 deletions(-)
> > 
> > diff --git a/hw/block/dataplane/virtio-blk.c 
> > b/hw/block/dataplane/virtio-blk.c
> > index bd7cc6e76b..d77fc6028c 100644
> > --- a/hw/block/dataplane/virtio-blk.c
> > +++ b/hw/block/dataplane/virtio-blk.c
> > @@ -245,13 +245,15 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
> >  }
> >  
> >  /* Get this show started by hooking up our callbacks */
> > -aio_context_acquire(s->ctx);
> > -for (i = 0; i < nvqs; i++) {
> > -VirtQueue *vq = virtio_get_queue(s->vdev, i);
> > +if (!blk_in_drain(s->conf->conf.blk)) {
> > +aio_context_acquire(s->ctx);
> > +for (i = 0; i < nvqs; i++) {
> > +VirtQueue *vq = virtio_get_queue(s->vdev, i);
> >  
> > -virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> > +virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> > +}
> > +aio_context_release(s->ctx);
> >  }
> > -aio_context_release(s->ctx);
> >  return 0;
> >  
> >fail_aio_context:
> > @@ -317,7 +319,10 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
> >  trace_virtio_blk_data_plane_stop(s);
> >  
> >  aio_context_acquire(s->ctx);
> > -aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
> > +
> > +if (!blk_in_drain(s->conf->conf.blk)) {
> > +aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
> > +}
> 
> So here we actually get a semantic change: What you described as the
> second part in the previous patch, processing the virtqueue one last
> time, isn't done any more if the device is drained.
> 
> If it's okay to just skip this during drain, why do we need to do it
> outside of drain?

I forgot to answer why we need to process requests one last time outside
drain.

This approach comes from how vhost uses ioeventfd. When switching from
vhost back to QEMU emulation, there's a chance that a final virtqueue
kick snuck in while ioeventfd was being disabled.

This is not the case with dataplane nowadays (it may have been in the past).
The only dataplane state transitions are device reset and 'stop'/'cont'.
Neither of these require QEMU to process new requests in while stopping
dataplane.

My confidence is not 100%, but still pretty high that the
virtio_queue_host_notifier_read() call could be dropped from dataplane
code. Since I'm not 100% sure, I didn't attempt that.
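
For reference, the pattern in question boils down to the following
two-step sequence (both helpers are existing QEMU APIs; the second step
is the part I suspect is unnecessary for dataplane):

    virtio_queue_aio_detach_host_notifier(vq, ctx); /* stop fd handler */
    /* catch a guest kick that raced with the detach */
    virtio_queue_host_notifier_read(virtio_queue_get_host_notifier(vq));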

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v4 17/20] virtio-blk: implement BlockDevOps->drained_begin()

2023-05-11 Thread Stefan Hajnoczi
On Thu, May 04, 2023 at 11:13:42PM +0200, Kevin Wolf wrote:
> Am 25.04.2023 um 19:27 hat Stefan Hajnoczi geschrieben:
> > Detach ioeventfds during drained sections to stop I/O submission from
> > the guest. virtio-blk is no longer reliant on aio_disable_external()
> > after this patch. This will allow us to remove the
> > aio_disable_external() API once all other code that relies on it is
> > converted.
> > 
> > Take extra care to avoid attaching/detaching ioeventfds if the data
> > plane is started/stopped during a drained section. This should be rare,
> > but maybe the mirror block job can trigger it.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >  hw/block/dataplane/virtio-blk.c | 17 +--
> >  hw/block/virtio-blk.c   | 38 -
> >  2 files changed, 48 insertions(+), 7 deletions(-)
> > 
> > diff --git a/hw/block/dataplane/virtio-blk.c 
> > b/hw/block/dataplane/virtio-blk.c
> > index bd7cc6e76b..d77fc6028c 100644
> > --- a/hw/block/dataplane/virtio-blk.c
> > +++ b/hw/block/dataplane/virtio-blk.c
> > @@ -245,13 +245,15 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
> >  }
> >  
> >  /* Get this show started by hooking up our callbacks */
> > -aio_context_acquire(s->ctx);
> > -for (i = 0; i < nvqs; i++) {
> > -VirtQueue *vq = virtio_get_queue(s->vdev, i);
> > +if (!blk_in_drain(s->conf->conf.blk)) {
> > +aio_context_acquire(s->ctx);
> > +for (i = 0; i < nvqs; i++) {
> > +VirtQueue *vq = virtio_get_queue(s->vdev, i);
> >  
> > -virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> > +virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> > +}
> > +aio_context_release(s->ctx);
> >  }
> > -aio_context_release(s->ctx);
> >  return 0;
> >  
> >fail_aio_context:
> > @@ -317,7 +319,10 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
> >  trace_virtio_blk_data_plane_stop(s);
> >  
> >  aio_context_acquire(s->ctx);
> > -aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
> > +
> > +if (!blk_in_drain(s->conf->conf.blk)) {
> > +aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
> > +}
> 
> So here we actually get a semantic change: What you described as the
> second part in the previous patch, processing the virtqueue one last
> time, isn't done any more if the device is drained.
> 
> If it's okay to just skip this during drain, why do we need to do it
> outside of drain?

Yes, it's safe because virtio_blk_data_plane_stop() has two cases:
1. The device is being reset. It is not necessary to process new
   requests.
2. 'stop'/'cont'. 'cont' will call virtio_blk_data_plane_start() ->
   event_notifier_set() so new requests will be processed when the guest
   resumes execution.

That's why I think this is safe and the right thing to do.
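
For reference, the 'cont' path compensates for any requests that
arrived while the notifiers were detached, roughly (both helpers are
existing QEMU APIs; the snippet is illustrative):

    virtio_queue_aio_attach_host_notifier(vq, s->ctx);
    /* kick once so requests queued while detached get processed */
    event_notifier_set(virtio_queue_get_host_notifier(vq));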

However, your question led me to a pre-existing drain bug when a vCPU
resets the device during a drained section (e.g. when a mirror block job
has started a drained section and the main loop runs until the block job
exits). New requests must not be processed by
virtio_blk_data_plane_stop() because that would violate drain semantics.

It turns out requests are still processed because of
virtio_blk_data_plane_stop() -> virtio_bus_cleanup_host_notifier() ->
virtio_queue_host_notifier_read().

I think that should be handled in a separate patch series. It's not
related to aio_disable_external().

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v4 16/20] virtio: make it possible to detach host notifier from any thread

2023-05-11 Thread Stefan Hajnoczi
On Thu, May 04, 2023 at 11:00:35PM +0200, Kevin Wolf wrote:
> Am 25.04.2023 um 19:27 hat Stefan Hajnoczi geschrieben:
> > virtio_queue_aio_detach_host_notifier() does two things:
> > 1. It removes the fd handler from the event loop.
> > 2. It processes the virtqueue one last time.
> > 
> > The first step can be performed by any thread and without taking the
> > AioContext lock.
> > 
> > The second step may need the AioContext lock (depending on the device
> > implementation) and runs in the thread where request processing takes
> > place. virtio-blk and virtio-scsi therefore call
> > virtio_queue_aio_detach_host_notifier() from a BH that is scheduled in
> > the AioContext.
> > 
> > Scheduling a BH is undesirable for .drained_begin() functions. The next
> > patch will introduce a .drained_begin() function that needs to call
> > virtio_queue_aio_detach_host_notifier().
> 
> Why is it undesirable? In my mental model, .drained_begin() is still
> free to start as many asynchronous things as it likes. The only
> important thing to take care of is that .drained_poll() returns true as
> long as the BH (or other asynchronous operation) is still pending.
> 
> Of course, your way of doing things still seems to result in simpler
> code because you don't have to deal with a BH at all if you only really
> want the first part and not the second.

I have clarified this in the commit description. We can't wait
synchronously, but we could wait asynchronously as you described. It's
simpler to split the function instead of implementing async wait using
.drained_poll().
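
For the record, the asynchronous alternative would have looked roughly
like this (MyDevice and its stop_bh_pending flag are hypothetical):

    /* BH that does the actual detach in the right thread */
    static void my_stop_bh(void *opaque)
    {
        MyDevice *s = opaque;

        virtio_queue_aio_detach_host_notifier(s->vq, s->ctx);
        s->stop_bh_pending = false;
    }

    static void my_drained_begin(void *opaque)
    {
        MyDevice *s = opaque;

        s->stop_bh_pending = true;
        aio_bh_schedule_oneshot(s->ctx, my_stop_bh, s);
    }

    /* drain keeps polling until the BH has run */
    static bool my_drained_poll(void *opaque)
    {
        MyDevice *s = opaque;

        return s->stop_bh_pending;
    }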

> 
> > Move the virtqueue processing out to the callers of
> > virtio_queue_aio_detach_host_notifier() so that the function can be
> > called from any thread. This is in preparation for the next patch.
> 
> Did you forget to remove it in virtio_queue_aio_detach_host_notifier()?
> If it's unchanged, I don't think the AioContext requirement is lifted.

Yes! Thank you :)

> 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >  hw/block/dataplane/virtio-blk.c | 2 ++
> >  hw/scsi/virtio-scsi-dataplane.c | 9 +
> >  2 files changed, 11 insertions(+)
> > diff --git a/hw/block/dataplane/virtio-blk.c 
> > b/hw/block/dataplane/virtio-blk.c
> > index b28d81737e..bd7cc6e76b 100644
> > --- a/hw/block/dataplane/virtio-blk.c
> > +++ b/hw/block/dataplane/virtio-blk.c
> > @@ -286,8 +286,10 @@ static void virtio_blk_data_plane_stop_bh(void *opaque)
> >  
> >  for (i = 0; i < s->conf->num_queues; i++) {
> >  VirtQueue *vq = virtio_get_queue(s->vdev, i);
> > +EventNotifier *host_notifier = virtio_queue_get_host_notifier(vq);
> >  
> >  virtio_queue_aio_detach_host_notifier(vq, s->ctx);
> > +virtio_queue_host_notifier_read(host_notifier);
> >  }
> >  }
> 
> The existing code in virtio_queue_aio_detach_host_notifier() has a
> comment before the read:
> 
> /* Test and clear notifier before after disabling event,
>  * in case poll callback didn't have time to run. */
> 
> Do we want to keep it around in the new places? (And also fix the
> "before after", I suppose, or replace it with a similar, but better
> comment that explains why we're reading here.)

I will add the comment.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v5 05/21] virtio-scsi: stop using aio_disable_external() during unplug

2023-05-09 Thread Stefan Hajnoczi
On Tue, May 09, 2023 at 08:55:14PM +0200, Kevin Wolf wrote:
> Am 04.05.2023 um 21:53 hat Stefan Hajnoczi geschrieben:
> > This patch is part of an effort to remove the aio_disable_external()
> > API because it does not fit in a multi-queue block layer world where
> > many AioContexts may be submitting requests to the same disk.
> > 
> > The SCSI emulation code is already in good shape to stop using
> > aio_disable_external(). It was only used by commit 9c5aad84da1c
> > ("virtio-scsi: fixed virtio_scsi_ctx_check failed when detaching scsi
> > disk") to ensure that virtio_scsi_hotunplug() works while the guest
> > driver is submitting I/O.
> > 
> > Ensure virtio_scsi_hotunplug() is safe as follows:
> > 
> > 1. qdev_simple_device_unplug_cb() -> qdev_unrealize() ->
> >device_set_realized() calls qatomic_set(&dev->realized, false) so
> >that future scsi_device_get() calls return NULL because they exclude
> >SCSIDevices with realized=false.
> > 
> >That means virtio-scsi will reject new I/O requests to this
> >SCSIDevice with VIRTIO_SCSI_S_BAD_TARGET even while
> >virtio_scsi_hotunplug() is still executing. We are protected against
> >new requests!
> > 
> > 2. scsi_device_unrealize() already contains a call to
> 
> I think you mean scsi_qdev_unrealize(). Can be fixed while applying.

Yes, it should be scsi_qdev_unrealize(). I'll review your other comments
and fix this if I need to respin.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v5 00/21] block: remove aio_disable_external() API

2023-05-09 Thread Stefan Hajnoczi
On Thu, May 04, 2023 at 11:44:42PM +0200, Kevin Wolf wrote:
> Am 04.05.2023 um 21:53 hat Stefan Hajnoczi geschrieben:
> > v5:
> > - Use atomic accesses for in_flight counter in vhost-user-server.c [Kevin]
> > - Stash SCSIDevice id/lun values for VIRTIO_SCSI_T_TRANSPORT_RESET event
> >   before unrealizing the SCSIDevice [Kevin]
> > - Keep vhost-user-blk export .detach() callback so ctx is set to NULL 
> > [Kevin]
> > - Narrow BdrvChildClass and BlockDriver drained_{begin/end/poll} callbacks 
> > from
> >   IO_OR_GS_CODE() to GLOBAL_STATE_CODE() [Kevin]
> > - Include Kevin's "block: Fix use after free in blockdev_mark_auto_del()" to
> >   fix a latent bug that was exposed by this series
> 
> I only just finished reviewing v4 when you had already sent v5, but it
> hadn't arrived yet. I had a few more comments on what are now patches
> 17, 18, 19 and 21 in v5. I think they all still apply.

I'm not sure which comments from v4 still apply. In my email client all
your replies were already read when I sent v5.

Maybe you can share the Message-Id of something I still need to address?

Thanks,
Stefan


signature.asc
Description: PGP signature


[PATCH v5 13/21] hw/xen: do not set is_external=true on evtchn fds

2023-05-04 Thread Stefan Hajnoczi
is_external=true suspends fd handlers between aio_disable_external() and
aio_enable_external(). The block layer's drain operation uses this
mechanism to prevent new I/O from sneaking in between
bdrv_drained_begin() and bdrv_drained_end().

The previous commit converted the xen-block device to use BlockDevOps
.drained_begin/end() callbacks. It no longer relies on is_external=true
so it is safe to pass is_external=false.

This is part of ongoing work to remove the aio_disable_external() API.

Signed-off-by: Stefan Hajnoczi 
---
 hw/xen/xen-bus.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/xen/xen-bus.c b/hw/xen/xen-bus.c
index b8f408c9ed..bf256d4da2 100644
--- a/hw/xen/xen-bus.c
+++ b/hw/xen/xen-bus.c
@@ -842,14 +842,14 @@ void xen_device_set_event_channel_context(XenDevice 
*xendev,
 }
 
 if (channel->ctx)
-aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), 
true,
+aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), 
false,
NULL, NULL, NULL, NULL, NULL);
 
 channel->ctx = ctx;
 if (ctx) {
 aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh),
-   true, xen_device_event, NULL, xen_device_poll, NULL,
-   channel);
+   false, xen_device_event, NULL, xen_device_poll,
+   NULL, channel);
 }
 }
 
@@ -923,7 +923,7 @@ void xen_device_unbind_event_channel(XenDevice *xendev,
 
 QLIST_REMOVE(channel, list);
 
-aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), true,
+aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), false,
NULL, NULL, NULL, NULL, NULL);
 
 if (qemu_xen_evtchn_unbind(channel->xeh, channel->local_port) < 0) {
-- 
2.40.1




[PATCH v5 15/21] block/export: don't require AioContext lock around blk_exp_ref/unref()

2023-05-04 Thread Stefan Hajnoczi
The FUSE export calls blk_exp_ref/unref() without the AioContext lock.
Instead of fixing the FUSE export, adjust blk_exp_ref/unref() so they
work without the AioContext lock. This way it's less error-prone.

Suggested-by: Paolo Bonzini 
Signed-off-by: Stefan Hajnoczi 
---
 include/block/export.h   |  2 ++
 block/export/export.c| 13 ++---
 block/export/vduse-blk.c |  4 
 3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/block/export.h b/include/block/export.h
index 7feb02e10d..f2fe0f8078 100644
--- a/include/block/export.h
+++ b/include/block/export.h
@@ -57,6 +57,8 @@ struct BlockExport {
  * Reference count for this block export. This includes strong references
  * both from the owner (qemu-nbd or the monitor) and clients connected to
  * the export.
+ *
+ * Use atomics to access this field.
  */
 int refcount;
 
diff --git a/block/export/export.c b/block/export/export.c
index 62c7c22d45..ab007e9d31 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -202,11 +202,10 @@ fail:
 return NULL;
 }
 
-/* Callers must hold exp->ctx lock */
 void blk_exp_ref(BlockExport *exp)
 {
-assert(exp->refcount > 0);
-exp->refcount++;
+assert(qatomic_read(&exp->refcount) > 0);
+qatomic_inc(&exp->refcount);
 }
 
 /* Runs in the main thread */
@@ -229,11 +228,10 @@ static void blk_exp_delete_bh(void *opaque)
 aio_context_release(aio_context);
 }
 
-/* Callers must hold exp->ctx lock */
 void blk_exp_unref(BlockExport *exp)
 {
-assert(exp->refcount > 0);
-if (--exp->refcount == 0) {
+assert(qatomic_read(&exp->refcount) > 0);
+if (qatomic_fetch_dec(&exp->refcount) == 1) {
 /* Touch the block_exports list only in the main thread */
 aio_bh_schedule_oneshot(qemu_get_aio_context(), blk_exp_delete_bh,
 exp);
@@ -341,7 +339,8 @@ void qmp_block_export_del(const char *id,
 if (!has_mode) {
 mode = BLOCK_EXPORT_REMOVE_MODE_SAFE;
 }
-if (mode == BLOCK_EXPORT_REMOVE_MODE_SAFE && exp->refcount > 1) {
+if (mode == BLOCK_EXPORT_REMOVE_MODE_SAFE &&
+qatomic_read(&exp->refcount) > 1) {
 error_setg(errp, "export '%s' still in use", exp->id);
 error_append_hint(errp, "Use mode='hard' to force client "
   "disconnect\n");
diff --git a/block/export/vduse-blk.c b/block/export/vduse-blk.c
index a25556fe04..e041f9 100644
--- a/block/export/vduse-blk.c
+++ b/block/export/vduse-blk.c
@@ -44,9 +44,7 @@ static void vduse_blk_inflight_inc(VduseBlkExport *vblk_exp)
 {
 if (qatomic_fetch_inc(&vblk_exp->inflight) == 0) {
 /* Prevent export from being deleted */
-aio_context_acquire(vblk_exp->export.ctx);
 blk_exp_ref(&vblk_exp->export);
-aio_context_release(vblk_exp->export.ctx);
 }
 }
 
@@ -57,9 +55,7 @@ static void vduse_blk_inflight_dec(VduseBlkExport *vblk_exp)
 aio_wait_kick();
 
 /* Now the export can be deleted */
-aio_context_acquire(vblk_exp->export.ctx);
 blk_exp_unref(&vblk_exp->export);
-aio_context_release(vblk_exp->export.ctx);
 }
 }
 
-- 
2.40.1




[PATCH v5 21/21] aio: remove aio_disable_external() API

2023-05-04 Thread Stefan Hajnoczi
All callers now pass is_external=false to aio_set_fd_handler() and
aio_set_event_notifier(). The aio_disable_external() API that
temporarily disables fd handlers that were registered is_external=true
is therefore dead code.

Remove aio_disable_external(), aio_enable_external(), and the
is_external arguments to aio_set_fd_handler() and
aio_set_event_notifier().

The entire test-fdmon-epoll test is removed because its sole purpose was
testing aio_disable_external().

Parts of this patch were generated using the following coccinelle
(https://coccinelle.lip6.fr/) semantic patch:

  @@
  expression ctx, fd, is_external, io_read, io_write, io_poll, io_poll_ready, 
opaque;
  @@
  - aio_set_fd_handler(ctx, fd, is_external, io_read, io_write, io_poll, 
io_poll_ready, opaque)
  + aio_set_fd_handler(ctx, fd, io_read, io_write, io_poll, io_poll_ready, 
opaque)

  @@
  expression ctx, notifier, is_external, io_read, io_poll, io_poll_ready;
  @@
  - aio_set_event_notifier(ctx, notifier, is_external, io_read, io_poll, 
io_poll_ready)
  + aio_set_event_notifier(ctx, notifier, io_read, io_poll, io_poll_ready)

Reviewed-by: Juan Quintela 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Stefan Hajnoczi 
---
 include/block/aio.h   | 57 ---
 util/aio-posix.h  |  1 -
 block.c   |  7 
 block/blkio.c | 15 +++
 block/curl.c  | 10 ++---
 block/export/fuse.c   |  8 ++--
 block/export/vduse-blk.c  | 10 ++---
 block/io.c|  2 -
 block/io_uring.c  |  4 +-
 block/iscsi.c |  3 +-
 block/linux-aio.c |  4 +-
 block/nfs.c   |  5 +--
 block/nvme.c  |  8 ++--
 block/ssh.c   |  4 +-
 block/win32-aio.c |  6 +--
 hw/i386/kvm/xen_xenstore.c|  2 +-
 hw/virtio/virtio.c|  6 +--
 hw/xen/xen-bus.c  |  8 ++--
 io/channel-command.c  |  6 +--
 io/channel-file.c |  3 +-
 io/channel-socket.c   |  3 +-
 migration/rdma.c  | 16 
 tests/unit/test-aio.c | 27 +
 tests/unit/test-bdrv-drain.c  |  1 -
 tests/unit/test-fdmon-epoll.c | 73 ---
 util/aio-posix.c  | 20 +++---
 util/aio-win32.c  |  8 +---
 util/async.c  |  3 +-
 util/fdmon-epoll.c| 18 +++--
 util/fdmon-io_uring.c |  8 +---
 util/fdmon-poll.c |  3 +-
 util/main-loop.c  |  7 ++--
 util/qemu-coroutine-io.c  |  7 ++--
 util/vhost-user-server.c  | 11 +++---
 tests/unit/meson.build|  3 --
 35 files changed, 82 insertions(+), 295 deletions(-)
 delete mode 100644 tests/unit/test-fdmon-epoll.c

diff --git a/include/block/aio.h b/include/block/aio.h
index 89bbc536f9..32042e8905 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -225,8 +225,6 @@ struct AioContext {
  */
 QEMUTimerListGroup tlg;
 
-int external_disable_cnt;
-
 /* Number of AioHandlers without .io_poll() */
 int poll_disable_cnt;
 
@@ -481,7 +479,6 @@ bool aio_poll(AioContext *ctx, bool blocking);
  */
 void aio_set_fd_handler(AioContext *ctx,
 int fd,
-bool is_external,
 IOHandler *io_read,
 IOHandler *io_write,
 AioPollFn *io_poll,
@@ -497,7 +494,6 @@ void aio_set_fd_handler(AioContext *ctx,
  */
 void aio_set_event_notifier(AioContext *ctx,
 EventNotifier *notifier,
-bool is_external,
 EventNotifierHandler *io_read,
 AioPollFn *io_poll,
 EventNotifierHandler *io_poll_ready);
@@ -626,59 +622,6 @@ static inline void aio_timer_init(AioContext *ctx,
  */
 int64_t aio_compute_timeout(AioContext *ctx);
 
-/**
- * aio_disable_external:
- * @ctx: the aio context
- *
- * Disable the further processing of external clients.
- */
-static inline void aio_disable_external(AioContext *ctx)
-{
-qatomic_inc(&ctx->external_disable_cnt);
-}
-
-/**
- * aio_enable_external:
- * @ctx: the aio context
- *
- * Enable the processing of external clients.
- */
-static inline void aio_enable_external(AioContext *ctx)
-{
-int old;
-
-old = qatomic_fetch_dec(&ctx->external_disable_cnt);
-assert(old > 0);
-if (old == 1) {
-/* Kick event loop so it re-arms file descriptors */
-aio_notify(ctx);
-}
-}
-
-/**
- * aio_external_disabled:
- * @ctx: the aio context
- *
- * Return true if the external clients are disabled.
- */
-static inline bool aio_external_disabled(AioContext *ctx)
-{
-return qatomic_read(&ctx->external_disable_cnt);
-}
-
-/**
- * aio_node_check:
- * @ctx: the aio context
- * @is_external: Whether or not the checked node is an external event sour

[PATCH v5 17/21] virtio: make it possible to detach host notifier from any thread

2023-05-04 Thread Stefan Hajnoczi
virtio_queue_aio_detach_host_notifier() does two things:
1. It removes the fd handler from the event loop.
2. It processes the virtqueue one last time.

The first step can be performed by any thread and without taking the
AioContext lock.

The second step may need the AioContext lock (depending on the device
implementation) and runs in the thread where request processing takes
place. virtio-blk and virtio-scsi therefore call
virtio_queue_aio_detach_host_notifier() from a BH that is scheduled in
the AioContext.

Scheduling a BH is undesirable for .drained_begin() functions. The next
patch will introduce a .drained_begin() function that needs to call
virtio_queue_aio_detach_host_notifier().

Move the virtqueue processing out to the callers of
virtio_queue_aio_detach_host_notifier() so that the function can be
called from any thread. This is in preparation for the next patch.

Signed-off-by: Stefan Hajnoczi 
---
 hw/block/dataplane/virtio-blk.c | 2 ++
 hw/scsi/virtio-scsi-dataplane.c | 9 +
 2 files changed, 11 insertions(+)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index a6202997ee..27eafa6c92 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -287,8 +287,10 @@ static void virtio_blk_data_plane_stop_bh(void *opaque)
 
 for (i = 0; i < s->conf->num_queues; i++) {
 VirtQueue *vq = virtio_get_queue(s->vdev, i);
+EventNotifier *host_notifier = virtio_queue_get_host_notifier(vq);
 
 virtio_queue_aio_detach_host_notifier(vq, s->ctx);
+virtio_queue_host_notifier_read(host_notifier);
 }
 }
 
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 20bb91766e..81643445ed 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -71,12 +71,21 @@ static void virtio_scsi_dataplane_stop_bh(void *opaque)
 {
 VirtIOSCSI *s = opaque;
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
+EventNotifier *host_notifier;
 int i;
 
 virtio_queue_aio_detach_host_notifier(vs->ctrl_vq, s->ctx);
+host_notifier = virtio_queue_get_host_notifier(vs->ctrl_vq);
+virtio_queue_host_notifier_read(host_notifier);
+
 virtio_queue_aio_detach_host_notifier(vs->event_vq, s->ctx);
+host_notifier = virtio_queue_get_host_notifier(vs->event_vq);
+virtio_queue_host_notifier_read(host_notifier);
+
 for (i = 0; i < vs->conf.num_queues; i++) {
 virtio_queue_aio_detach_host_notifier(vs->cmd_vqs[i], s->ctx);
+host_notifier = virtio_queue_get_host_notifier(vs->cmd_vqs[i]);
+virtio_queue_host_notifier_read(host_notifier);
 }
 }
 
-- 
2.40.1




[PATCH v5 12/21] xen-block: implement BlockDevOps->drained_begin()

2023-05-04 Thread Stefan Hajnoczi
Detach event channels during drained sections to stop I/O submission
from the ring. xen-block is no longer reliant on aio_disable_external()
after this patch. This will allow us to remove the
aio_disable_external() API once all other code that relies on it is
converted.

Extend xen_device_set_event_channel_context() to allow ctx=NULL. The
event channel still exists but the event loop does not monitor the file
descriptor. Event channel processing can resume by calling
xen_device_set_event_channel_context() with a non-NULL ctx.

Factor out xen_device_set_event_channel_context() calls in
hw/block/dataplane/xen-block.c into attach/detach helper functions.
Incidentally, these don't require the AioContext lock because
aio_set_fd_handler() is thread-safe.

It's safer to register BlockDevOps after the dataplane instance has been
created. The BlockDevOps .drained_begin/end() callbacks depend on the
dataplane instance, so move the blk_set_dev_ops() call after
xen_block_dataplane_create().
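
In other words, the realize path ends up ordered like this (a sketch
with abbreviated argument lists; xen_block_dev_ops stands for the
device's BlockDevOps, it is not the literal hunk):

    blockdev->dataplane = xen_block_dataplane_create(xendev, blk, ...);

    /* Only now is it safe to register the drained callbacks, because
     * they dereference the dataplane instance. */
    blk_set_dev_ops(blk, &xen_block_dev_ops, blockdev);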

Signed-off-by: Stefan Hajnoczi 
---
 hw/block/dataplane/xen-block.h |  2 ++
 hw/block/dataplane/xen-block.c | 42 +-
 hw/block/xen-block.c   | 24 ---
 hw/xen/xen-bus.c   |  7 --
 4 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/hw/block/dataplane/xen-block.h b/hw/block/dataplane/xen-block.h
index 76dcd51c3d..7b8e9df09f 100644
--- a/hw/block/dataplane/xen-block.h
+++ b/hw/block/dataplane/xen-block.h
@@ -26,5 +26,7 @@ void xen_block_dataplane_start(XenBlockDataPlane *dataplane,
unsigned int protocol,
Error **errp);
 void xen_block_dataplane_stop(XenBlockDataPlane *dataplane);
+void xen_block_dataplane_attach(XenBlockDataPlane *dataplane);
+void xen_block_dataplane_detach(XenBlockDataPlane *dataplane);
 
 #endif /* HW_BLOCK_DATAPLANE_XEN_BLOCK_H */
diff --git a/hw/block/dataplane/xen-block.c b/hw/block/dataplane/xen-block.c
index d8bc39d359..2597f38805 100644
--- a/hw/block/dataplane/xen-block.c
+++ b/hw/block/dataplane/xen-block.c
@@ -664,6 +664,30 @@ void xen_block_dataplane_destroy(XenBlockDataPlane 
*dataplane)
 g_free(dataplane);
 }
 
+void xen_block_dataplane_detach(XenBlockDataPlane *dataplane)
+{
+if (!dataplane || !dataplane->event_channel) {
+return;
+}
+
+/* Only reason for failure is a NULL channel */
+xen_device_set_event_channel_context(dataplane->xendev,
+ dataplane->event_channel,
+ NULL, &error_abort);
+}
+
+void xen_block_dataplane_attach(XenBlockDataPlane *dataplane)
+{
+if (!dataplane || !dataplane->event_channel) {
+return;
+}
+
+/* Only reason for failure is a NULL channel */
+xen_device_set_event_channel_context(dataplane->xendev,
+ dataplane->event_channel,
+ dataplane->ctx, &error_abort);
+}
+
 void xen_block_dataplane_stop(XenBlockDataPlane *dataplane)
 {
 XenDevice *xendev;
@@ -674,13 +698,11 @@ void xen_block_dataplane_stop(XenBlockDataPlane 
*dataplane)
 
 xendev = dataplane->xendev;
 
-aio_context_acquire(dataplane->ctx);
-if (dataplane->event_channel) {
-/* Only reason for failure is a NULL channel */
-xen_device_set_event_channel_context(xendev, dataplane->event_channel,
- qemu_get_aio_context(),
- &error_abort);
+if (!blk_in_drain(dataplane->blk)) {
+xen_block_dataplane_detach(dataplane);
 }
+
+aio_context_acquire(dataplane->ctx);
 /* Xen doesn't have multiple users for nodes, so this can't fail */
 blk_set_aio_context(dataplane->blk, qemu_get_aio_context(), &error_abort);
 aio_context_release(dataplane->ctx);
@@ -819,11 +841,9 @@ void xen_block_dataplane_start(XenBlockDataPlane 
*dataplane,
 blk_set_aio_context(dataplane->blk, dataplane->ctx, NULL);
 aio_context_release(old_context);
 
-/* Only reason for failure is a NULL channel */
-aio_context_acquire(dataplane->ctx);
-xen_device_set_event_channel_context(xendev, dataplane->event_channel,
- dataplane->ctx, &error_abort);
-aio_context_release(dataplane->ctx);
+if (!blk_in_drain(dataplane->blk)) {
+xen_block_dataplane_attach(dataplane);
+}
 
 return;
 
diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
index f5a744589d..f099914831 100644
--- a/hw/block/xen-block.c
+++ b/hw/block/xen-block.c
@@ -189,8 +189,26 @@ static void xen_block_resize_cb(void *opaque)
 xen_device_backend_printf(xendev, "state", "%u", state);
 }
 
+/* Suspend request handling */
+static void xen_block_drained_begin(void *opaque)
+{
+XenBlockDevic

[PATCH v5 19/21] virtio-scsi: implement BlockDevOps->drained_begin()

2023-05-04 Thread Stefan Hajnoczi
The virtio-scsi Host Bus Adapter provides access to devices on a SCSI
bus. Those SCSI devices typically have a BlockBackend. When the
BlockBackend enters a drained section, the SCSI device must temporarily
stop submitting new I/O requests.

Implement this behavior by temporarily stopping virtio-scsi virtqueue
processing when one of the SCSI devices enters a drained section. The
new scsi_device_drained_begin() API allows scsi-disk to message the
virtio-scsi HBA.

scsi_device_drained_begin() uses a drain counter so that multiple SCSI
devices can have overlapping drained sections. The HBA only sees one
pair of .drained_begin/end() calls.

After this commit, virtio-scsi no longer depends on hw/virtio's
ioeventfd aio_set_event_notifier(is_external=true). This commit is a
step towards removing the aio_disable_external() API.

Signed-off-by: Stefan Hajnoczi 
---
 include/hw/scsi/scsi.h  | 14 
 hw/scsi/scsi-bus.c  | 40 +
 hw/scsi/scsi-disk.c | 27 +-
 hw/scsi/virtio-scsi-dataplane.c | 22 ++
 hw/scsi/virtio-scsi.c   | 38 +++
 hw/scsi/trace-events|  2 ++
 6 files changed, 129 insertions(+), 14 deletions(-)

diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 6f23a7a73e..e2bb1a2fbf 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -133,6 +133,16 @@ struct SCSIBusInfo {
 void (*save_request)(QEMUFile *f, SCSIRequest *req);
 void *(*load_request)(QEMUFile *f, SCSIRequest *req);
 void (*free_request)(SCSIBus *bus, void *priv);
+
+/*
+ * Temporarily stop submitting new requests between drained_begin() and
+ * drained_end(). Called from the main loop thread with the BQL held.
+ *
+ * Implement these callbacks if request processing is triggered by a file
+ * descriptor like an EventNotifier. Otherwise set them to NULL.
+ */
+void (*drained_begin)(SCSIBus *bus);
+void (*drained_end)(SCSIBus *bus);
 };
 
 #define TYPE_SCSI_BUS "SCSI"
@@ -144,6 +154,8 @@ struct SCSIBus {
 
 SCSISense unit_attention;
 const SCSIBusInfo *info;
+
+int drain_count; /* protected by BQL */
 };
 
 /**
@@ -213,6 +225,8 @@ void scsi_req_cancel_complete(SCSIRequest *req);
 void scsi_req_cancel(SCSIRequest *req);
 void scsi_req_cancel_async(SCSIRequest *req, Notifier *notifier);
 void scsi_req_retry(SCSIRequest *req);
+void scsi_device_drained_begin(SCSIDevice *sdev);
+void scsi_device_drained_end(SCSIDevice *sdev);
 void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense);
 void scsi_device_set_ua(SCSIDevice *sdev, SCSISense sense);
 void scsi_device_report_change(SCSIDevice *dev, SCSISense sense);
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 64013c8a24..f80f4cb4fc 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -1669,6 +1669,46 @@ void scsi_device_purge_requests(SCSIDevice *sdev, 
SCSISense sense)
 scsi_device_set_ua(sdev, sense);
 }
 
+void scsi_device_drained_begin(SCSIDevice *sdev)
+{
+SCSIBus *bus = DO_UPCAST(SCSIBus, qbus, sdev->qdev.parent_bus);
+if (!bus) {
+return;
+}
+
+assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+assert(bus->drain_count < INT_MAX);
+
+/*
+ * Multiple BlockBackends can be on a SCSIBus and each may begin/end
+ * draining at any time. Keep a counter so HBAs only see begin/end once.
+ */
+if (bus->drain_count++ == 0) {
+trace_scsi_bus_drained_begin(bus, sdev);
+if (bus->info->drained_begin) {
+bus->info->drained_begin(bus);
+}
+}
+}
+
+void scsi_device_drained_end(SCSIDevice *sdev)
+{
+SCSIBus *bus = DO_UPCAST(SCSIBus, qbus, sdev->qdev.parent_bus);
+if (!bus) {
+return;
+}
+
+assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+assert(bus->drain_count > 0);
+
+if (bus->drain_count-- == 1) {
+trace_scsi_bus_drained_end(bus, sdev);
+if (bus->info->drained_end) {
+bus->info->drained_end(bus);
+}
+}
+}
+
 static char *scsibus_get_dev_path(DeviceState *dev)
 {
 SCSIDevice *d = SCSI_DEVICE(dev);
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 97c9b1c8cd..e0d79c7966 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2360,6 +2360,20 @@ static void scsi_disk_reset(DeviceState *dev)
 s->qdev.scsi_version = s->qdev.default_scsi_version;
 }
 
+static void scsi_disk_drained_begin(void *opaque)
+{
+SCSIDiskState *s = opaque;
+
+scsi_device_drained_begin(&s->qdev);
+}
+
+static void scsi_disk_drained_end(void *opaque)
+{
+SCSIDiskState *s = opaque;
+
+scsi_device_drained_end(&s->qdev);
+}
+
 static void scsi_disk_resize_cb(void *opaque)
 {
 SCSIDiskState *s = opaque;
@@ -2414,16 +2428,19 @@ static bool scsi_cd_is_mediu

[PATCH v5 18/21] virtio-blk: implement BlockDevOps->drained_begin()

2023-05-04 Thread Stefan Hajnoczi
Detach ioeventfds during drained sections to stop I/O submission from
the guest. virtio-blk is no longer reliant on aio_disable_external()
after this patch. This will allow us to remove the
aio_disable_external() API once all other code that relies on it is
converted.

Take extra care to avoid attaching/detaching ioeventfds if the data
plane is started/stopped during a drained section. This should be rare,
but maybe the mirror block job can trigger it.

Signed-off-by: Stefan Hajnoczi 
---
 hw/block/dataplane/virtio-blk.c | 17 +--
 hw/block/virtio-blk.c   | 38 -
 2 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 27eafa6c92..c0d2663abc 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -246,13 +246,15 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 }
 
 /* Get this show started by hooking up our callbacks */
-aio_context_acquire(s->ctx);
-for (i = 0; i < nvqs; i++) {
-VirtQueue *vq = virtio_get_queue(s->vdev, i);
+if (!blk_in_drain(s->conf->conf.blk)) {
+aio_context_acquire(s->ctx);
+for (i = 0; i < nvqs; i++) {
+VirtQueue *vq = virtio_get_queue(s->vdev, i);
 
-virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+}
+aio_context_release(s->ctx);
 }
-aio_context_release(s->ctx);
 return 0;
 
   fail_aio_context:
@@ -318,7 +320,10 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 trace_virtio_blk_data_plane_stop(s);
 
 aio_context_acquire(s->ctx);
-aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
+
+if (!blk_in_drain(s->conf->conf.blk)) {
+aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
+}
 
 /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */
 blk_drain(s->conf->conf.blk);
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index cefca93b31..d8dedc575c 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1109,8 +1109,44 @@ static void virtio_blk_resize(void *opaque)
 aio_bh_schedule_oneshot(qemu_get_aio_context(), virtio_resize_cb, vdev);
 }
 
+/* Suspend virtqueue ioeventfd processing during drain */
+static void virtio_blk_drained_begin(void *opaque)
+{
+VirtIOBlock *s = opaque;
+VirtIODevice *vdev = VIRTIO_DEVICE(opaque);
+AioContext *ctx = blk_get_aio_context(s->conf.conf.blk);
+
+if (!s->dataplane || !s->dataplane_started) {
+return;
+}
+
+for (uint16_t i = 0; i < s->conf.num_queues; i++) {
+VirtQueue *vq = virtio_get_queue(vdev, i);
+virtio_queue_aio_detach_host_notifier(vq, ctx);
+}
+}
+
+/* Resume virtqueue ioeventfd processing after drain */
+static void virtio_blk_drained_end(void *opaque)
+{
+VirtIOBlock *s = opaque;
+VirtIODevice *vdev = VIRTIO_DEVICE(opaque);
+AioContext *ctx = blk_get_aio_context(s->conf.conf.blk);
+
+if (!s->dataplane || !s->dataplane_started) {
+return;
+}
+
+for (uint16_t i = 0; i < s->conf.num_queues; i++) {
+VirtQueue *vq = virtio_get_queue(vdev, i);
+virtio_queue_aio_attach_host_notifier(vq, ctx);
+}
+}
+
 static const BlockDevOps virtio_block_ops = {
-.resize_cb = virtio_blk_resize,
+.resize_cb = virtio_blk_resize,
+.drained_begin = virtio_blk_drained_begin,
+.drained_end   = virtio_blk_drained_end,
 };
 
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
-- 
2.40.1




[PATCH v5 14/21] block/export: rewrite vduse-blk drain code

2023-05-04 Thread Stefan Hajnoczi
vduse_blk_detach_ctx() waits for in-flight requests using
AIO_WAIT_WHILE(). This is not allowed according to a comment in
bdrv_set_aio_context_commit():

  /*
   * Take the old AioContex when detaching it from bs.
   * At this point, new_context lock is already acquired, and we are now
   * also taking old_context. This is safe as long as bdrv_detach_aio_context
   * does not call AIO_POLL_WHILE().
   */

Use this opportunity to rewrite the drain code in vduse-blk:

- Use the BlockExport refcount so that vduse_blk_exp_delete() is only
  called when there are no more requests in flight.

- Implement .drained_poll() so in-flight request coroutines are stopped
  by the time .bdrv_detach_aio_context() is called (see the sketch below).

- Remove AIO_WAIT_WHILE() from vduse_blk_detach_ctx() to solve the
  .bdrv_detach_aio_context() constraint violation. It's no longer
  needed due to the previous changes.

- Always handle the VDUSE file descriptor, even in drained sections. The
  VDUSE file descriptor doesn't submit I/O, so it's safe to handle it in
  drained sections. This ensures that the VDUSE kernel code gets a fast
  response.

- Suspend virtqueue fd handlers in .drained_begin() and resume them in
  .drained_end(). This eliminates the need for the
  aio_set_fd_handler(is_external=true) flag, which is being removed from
  QEMU.

This is a long list but splitting it into individual commits would
probably lead to git bisect failures - the changes are all related.
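
The .drained_poll() mentioned above is cut off in the diff below; it
reduces to checking the in-flight counter, along the lines of the FUSE
export's version later in this series:

    static bool vduse_blk_drained_poll(void *opaque)
    {
        VduseBlkExport *vblk_exp = opaque;

        /* keep draining while requests are still in flight */
        return qatomic_read(&vblk_exp->inflight) > 0;
    }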

Signed-off-by: Stefan Hajnoczi 
---
 block/export/vduse-blk.c | 132 +++
 1 file changed, 93 insertions(+), 39 deletions(-)

diff --git a/block/export/vduse-blk.c b/block/export/vduse-blk.c
index b53ef39da0..a25556fe04 100644
--- a/block/export/vduse-blk.c
+++ b/block/export/vduse-blk.c
@@ -31,7 +31,8 @@ typedef struct VduseBlkExport {
 VduseDev *dev;
 uint16_t num_queues;
 char *recon_file;
-unsigned int inflight;
+unsigned int inflight; /* atomic */
+bool vqs_started;
 } VduseBlkExport;
 
 typedef struct VduseBlkReq {
@@ -41,13 +42,24 @@ typedef struct VduseBlkReq {
 
 static void vduse_blk_inflight_inc(VduseBlkExport *vblk_exp)
 {
-vblk_exp->inflight++;
+if (qatomic_fetch_inc(&vblk_exp->inflight) == 0) {
+/* Prevent export from being deleted */
+aio_context_acquire(vblk_exp->export.ctx);
+blk_exp_ref(&vblk_exp->export);
+aio_context_release(vblk_exp->export.ctx);
+}
 }
 
 static void vduse_blk_inflight_dec(VduseBlkExport *vblk_exp)
 {
-if (--vblk_exp->inflight == 0) {
+if (qatomic_fetch_dec(&vblk_exp->inflight) == 1) {
+/* Wake AIO_WAIT_WHILE() */
 aio_wait_kick();
+
+/* Now the export can be deleted */
+aio_context_acquire(vblk_exp->export.ctx);
+blk_exp_unref(&vblk_exp->export);
+aio_context_release(vblk_exp->export.ctx);
 }
 }
 
@@ -124,8 +136,12 @@ static void vduse_blk_enable_queue(VduseDev *dev, 
VduseVirtq *vq)
 {
 VduseBlkExport *vblk_exp = vduse_dev_get_priv(dev);
 
+if (!vblk_exp->vqs_started) {
+return; /* vduse_blk_drained_end() will start vqs later */
+}
+
 aio_set_fd_handler(vblk_exp->export.ctx, vduse_queue_get_fd(vq),
-   true, on_vduse_vq_kick, NULL, NULL, NULL, vq);
+   false, on_vduse_vq_kick, NULL, NULL, NULL, vq);
 /* Make sure we don't miss any kick after reconnecting */
 eventfd_write(vduse_queue_get_fd(vq), 1);
 }
@@ -133,9 +149,14 @@ static void vduse_blk_enable_queue(VduseDev *dev, 
VduseVirtq *vq)
 static void vduse_blk_disable_queue(VduseDev *dev, VduseVirtq *vq)
 {
 VduseBlkExport *vblk_exp = vduse_dev_get_priv(dev);
+int fd = vduse_queue_get_fd(vq);
 
-aio_set_fd_handler(vblk_exp->export.ctx, vduse_queue_get_fd(vq),
-   true, NULL, NULL, NULL, NULL, NULL);
+if (fd < 0) {
+return;
+}
+
+aio_set_fd_handler(vblk_exp->export.ctx, fd, false,
+   NULL, NULL, NULL, NULL, NULL);
 }
 
 static const VduseOps vduse_blk_ops = {
@@ -152,42 +173,19 @@ static void on_vduse_dev_kick(void *opaque)
 
 static void vduse_blk_attach_ctx(VduseBlkExport *vblk_exp, AioContext *ctx)
 {
-int i;
-
 aio_set_fd_handler(vblk_exp->export.ctx, vduse_dev_get_fd(vblk_exp->dev),
-   true, on_vduse_dev_kick, NULL, NULL, NULL,
+   false, on_vduse_dev_kick, NULL, NULL, NULL,
vblk_exp->dev);
 
-for (i = 0; i < vblk_exp->num_queues; i++) {
-VduseVirtq *vq = vduse_dev_get_queue(vblk_exp->dev, i);
-int fd = vduse_queue_get_fd(vq);
-
-if (fd < 0) {
-continue;
-}
-aio_set_fd_handler(vblk_exp->export.ctx, fd, true,
-   on_vduse_vq_kick, NULL, NULL, NULL, vq);
-}
+/* Virtqueues are handled by vduse_blk_drained_end(

[PATCH v5 20/21] virtio: do not set is_external=true on host notifiers

2023-05-04 Thread Stefan Hajnoczi
Host notifiers can now use is_external=false since virtio-blk and
virtio-scsi no longer rely on is_external=true for drained sections.

Signed-off-by: Stefan Hajnoczi 
---
 hw/virtio/virtio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 272d930721..9cdad7e550 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3491,7 +3491,7 @@ static void 
virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
 
 void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
-aio_set_event_notifier(ctx, &vq->host_notifier, true,
+aio_set_event_notifier(ctx, &vq->host_notifier, false,
virtio_queue_host_notifier_read,
virtio_queue_host_notifier_aio_poll,
virtio_queue_host_notifier_aio_poll_ready);
@@ -3508,14 +3508,14 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue 
*vq, AioContext *ctx)
  */
 void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext 
*ctx)
 {
-aio_set_event_notifier(ctx, &vq->host_notifier, true,
+aio_set_event_notifier(ctx, &vq->host_notifier, false,
virtio_queue_host_notifier_read,
NULL, NULL);
 }
 
 void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
-aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL, NULL, NULL);
+aio_set_event_notifier(ctx, &vq->host_notifier, false, NULL, NULL, NULL);
 /* Test and clear notifier before after disabling event,
  * in case poll callback didn't have time to run. */
 virtio_queue_host_notifier_read(&vq->host_notifier);
-- 
2.40.1




[PATCH v5 16/21] block/fuse: do not set is_external=true on FUSE fd

2023-05-04 Thread Stefan Hajnoczi
This is part of ongoing work to remove the aio_disable_external() API.

Use BlockDevOps .drained_begin/end/poll() instead of
aio_set_fd_handler(is_external=true).

As a side effect, the FUSE export now follows AioContext changes like the
other export types.

Signed-off-by: Stefan Hajnoczi 
---
 block/export/fuse.c | 56 +++--
 1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 06fa41079e..adf3236b5a 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -50,6 +50,7 @@ typedef struct FuseExport {
 
 struct fuse_session *fuse_session;
 struct fuse_buf fuse_buf;
+unsigned int in_flight; /* atomic */
 bool mounted, fd_handler_set_up;
 
 char *mountpoint;
@@ -78,6 +79,42 @@ static void read_from_fuse_export(void *opaque);
 static bool is_regular_file(const char *path, Error **errp);
 
 
+static void fuse_export_drained_begin(void *opaque)
+{
+FuseExport *exp = opaque;
+
+aio_set_fd_handler(exp->common.ctx,
+   fuse_session_fd(exp->fuse_session), false,
+   NULL, NULL, NULL, NULL, NULL);
+exp->fd_handler_set_up = false;
+}
+
+static void fuse_export_drained_end(void *opaque)
+{
+FuseExport *exp = opaque;
+
+/* Refresh AioContext in case it changed */
+exp->common.ctx = blk_get_aio_context(exp->common.blk);
+
+aio_set_fd_handler(exp->common.ctx,
+   fuse_session_fd(exp->fuse_session), false,
+   read_from_fuse_export, NULL, NULL, NULL, exp);
+exp->fd_handler_set_up = true;
+}
+
+static bool fuse_export_drained_poll(void *opaque)
+{
+FuseExport *exp = opaque;
+
+return qatomic_read(&exp->in_flight) > 0;
+}
+
+static const BlockDevOps fuse_export_blk_dev_ops = {
+.drained_begin = fuse_export_drained_begin,
+.drained_end   = fuse_export_drained_end,
+.drained_poll  = fuse_export_drained_poll,
+};
+
 static int fuse_export_create(BlockExport *blk_exp,
   BlockExportOptions *blk_exp_args,
   Error **errp)
@@ -101,6 +138,15 @@ static int fuse_export_create(BlockExport *blk_exp,
 }
 }
 
+blk_set_dev_ops(exp->common.blk, &fuse_export_blk_dev_ops, exp);
+
+/*
+ * We handle draining ourselves using an in-flight counter and by disabling
+ * the FUSE fd handler. Do not queue BlockBackend requests, they need to
+ * complete so the in-flight counter reaches zero.
+ */
+blk_set_disable_request_queuing(exp->common.blk, true);
+
 init_exports_table();
 
 /*
@@ -224,7 +270,7 @@ static int setup_fuse_export(FuseExport *exp, const char 
*mountpoint,
 g_hash_table_insert(exports, g_strdup(mountpoint), NULL);
 
 aio_set_fd_handler(exp->common.ctx,
-   fuse_session_fd(exp->fuse_session), true,
+   fuse_session_fd(exp->fuse_session), false,
read_from_fuse_export, NULL, NULL, NULL, exp);
 exp->fd_handler_set_up = true;
 
@@ -246,6 +292,8 @@ static void read_from_fuse_export(void *opaque)
 
 blk_exp_ref(&exp->common);
 
+qatomic_inc(&exp->in_flight);
+
 do {
 ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
 } while (ret == -EINTR);
@@ -256,6 +304,10 @@ static void read_from_fuse_export(void *opaque)
 fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
 
 out:
+if (qatomic_fetch_dec(&exp->in_flight) == 1) {
+aio_wait_kick(); /* wake AIO_WAIT_WHILE() */
+}
+
 blk_exp_unref(&exp->common);
 }
 
@@ -268,7 +320,7 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
 
 if (exp->fd_handler_set_up) {
 aio_set_fd_handler(exp->common.ctx,
-   fuse_session_fd(exp->fuse_session), true,
+   fuse_session_fd(exp->fuse_session), false,
NULL, NULL, NULL, NULL, NULL);
 exp->fd_handler_set_up = false;
 }
-- 
2.40.1




[PATCH v5 11/21] block: drain from main loop thread in bdrv_co_yield_to_drain()

2023-05-04 Thread Stefan Hajnoczi
For simplicity, always run BlockDevOps .drained_begin/end/poll()
callbacks in the main loop thread. This makes it easier to implement the
callbacks and avoids extra locks.
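
The block/io.c hunk is not reproduced below, but the core of
bdrv_co_yield_to_drain() becomes, in simplified form (bdrv_co_drain_bh_cb
and data stand in for the existing helper and its argument struct):

    if (qemu_get_current_aio_context() != qemu_get_aio_context()) {
        /* run .drained_begin/end() in the main loop thread ... */
        aio_bh_schedule_oneshot(qemu_get_aio_context(),
                                bdrv_co_drain_bh_cb, &data);
        /* ... and yield until the BH re-enters this coroutine */
        qemu_coroutine_yield();
    }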

Move the function pointer declarations from the I/O Code section to the
Global State section for BlockDevOps, BdrvChildClass, and BlockDriver.

Narrow IO_OR_GS_CODE() to GLOBAL_STATE_CODE() where appropriate.

The test-bdrv-drain test case calls bdrv_drain() from an IOThread. This
is now only allowed from coroutine context, so update the test case to
run in a coroutine.
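
The test-bdrv-drain.c hunk is not visible in the hunks below; the shape
of that conversion is to wrap the drain call in a coroutine and pump the
event loop until it finishes, roughly (call_drain_co and DrainCoData are
illustrative names, not the literal test code):

    typedef struct {
        BlockDriverState *bs;
        bool done;
    } DrainCoData;

    static void coroutine_fn call_drain_co(void *opaque)
    {
        DrainCoData *data = opaque;

        bdrv_drain(data->bs); /* now permitted from coroutine context */
        data->done = true;
    }

    /* in the test function: */
    DrainCoData data = { .bs = bs, .done = false };
    Coroutine *co = qemu_coroutine_create(call_drain_co, &data);
    aio_co_enter(ctx, co);
    while (!data.done) {
        aio_poll(ctx, true);
    }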

Signed-off-by: Stefan Hajnoczi 
---
 include/block/block_int-common.h  | 90 +--
 include/sysemu/block-backend-common.h | 25 
 block/io.c| 14 +++--
 tests/unit/test-bdrv-drain.c  | 14 +++--
 4 files changed, 76 insertions(+), 67 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 013d419444..f462a8be55 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -356,6 +356,21 @@ struct BlockDriver {
 void (*bdrv_attach_aio_context)(BlockDriverState *bs,
 AioContext *new_context);
 
+/**
+ * bdrv_drain_begin is called if implemented in the beginning of a
+ * drain operation to drain and stop any internal sources of requests in
+ * the driver.
+ * bdrv_drain_end is called if implemented at the end of the drain.
+ *
+ * They should be used by the driver to e.g. manage scheduled I/O
+ * requests, or toggle an internal state. After the end of the drain new
+ * requests will continue normally.
+ *
+ * Implementations of both functions must not call aio_poll().
+ */
+void (*bdrv_drain_begin)(BlockDriverState *bs);
+void (*bdrv_drain_end)(BlockDriverState *bs);
+
 /**
  * Try to get @bs's logical and physical block size.
  * On success, store them in @bsz and return zero.
@@ -743,21 +758,6 @@ struct BlockDriver {
 void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_unplug)(
 BlockDriverState *bs);
 
-/**
- * bdrv_drain_begin is called if implemented in the beginning of a
- * drain operation to drain and stop any internal sources of requests in
- * the driver.
- * bdrv_drain_end is called if implemented at the end of the drain.
- *
- * They should be used by the driver to e.g. manage scheduled I/O
- * requests, or toggle an internal state. After the end of the drain new
- * requests will continue normally.
- *
- * Implementations of both functions must not call aio_poll().
- */
-void (*bdrv_drain_begin)(BlockDriverState *bs);
-void (*bdrv_drain_end)(BlockDriverState *bs);
-
 bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
 
 bool coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_can_store_new_dirty_bitmap)(
@@ -920,36 +920,6 @@ struct BdrvChildClass {
 void GRAPH_WRLOCK_PTR (*attach)(BdrvChild *child);
 void GRAPH_WRLOCK_PTR (*detach)(BdrvChild *child);
 
-/*
- * Notifies the parent that the filename of its child has changed (e.g.
- * because the direct child was removed from the backing chain), so that it
- * can update its reference.
- */
-int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
-   const char *filename, Error **errp);
-
-bool (*change_aio_ctx)(BdrvChild *child, AioContext *ctx,
-   GHashTable *visited, Transaction *tran,
-   Error **errp);
-
-/*
- * I/O API functions. These functions are thread-safe.
- *
- * See include/block/block-io.h for more information about
- * the I/O API.
- */
-
-void (*resize)(BdrvChild *child);
-
-/*
- * Returns a name that is supposedly more useful for human users than the
- * node name for identifying the node in question (in particular, a BB
- * name), or NULL if the parent can't provide a better name.
- */
-const char *(*get_name)(BdrvChild *child);
-
-AioContext *(*get_parent_aio_context)(BdrvChild *child);
-
 /*
  * If this pair of functions is implemented, the parent doesn't issue new
  * requests after returning from .drained_begin() until .drained_end() is
@@ -970,6 +940,36 @@ struct BdrvChildClass {
  * activity on the child has stopped.
  */
 bool (*drained_poll)(BdrvChild *child);
+
+/*
+ * Notifies the parent that the filename of its child has changed (e.g.
+ * because the direct child was removed from the backing chain), so that it
+ * can update its reference.
+ */
+int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
+   const char *filename, Error **errp);
+
+bool (*change_aio_ctx)(BdrvChild *child, AioContext *ctx,
+   GHashTable *visited, Transaction *tran,
+ 

[PATCH v5 10/21] block: add blk_in_drain() API

2023-05-04 Thread Stefan Hajnoczi
The BlockBackend quiesce_counter is greater than zero during drained
sections. Add an API to check whether the BlockBackend is in a drained
section.

The next patch will use this API.
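
For example, virtio-blk later in this series uses it to decide whether
ioeventfds should be attached, along these lines (simplified from the
virtio-blk patch):

    if (!blk_in_drain(s->conf->conf.blk)) {
        /* not drained: safe to attach ioeventfds and process requests */
        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
    }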

Signed-off-by: Stefan Hajnoczi 
---
 include/sysemu/block-backend-global-state.h | 1 +
 block/block-backend.c   | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/include/sysemu/block-backend-global-state.h 
b/include/sysemu/block-backend-global-state.h
index 2b6d27db7c..ac7cbd6b5e 100644
--- a/include/sysemu/block-backend-global-state.h
+++ b/include/sysemu/block-backend-global-state.h
@@ -78,6 +78,7 @@ void blk_activate(BlockBackend *blk, Error **errp);
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags);
 void blk_aio_cancel(BlockAIOCB *acb);
 int blk_commit_all(void);
+bool blk_in_drain(BlockBackend *blk);
 void blk_drain(BlockBackend *blk);
 void blk_drain_all(void);
 void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
diff --git a/block/block-backend.c b/block/block-backend.c
index 68d38635bc..96f03cae95 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1270,6 +1270,13 @@ blk_check_byte_request(BlockBackend *blk, int64_t 
offset, int64_t bytes)
 return 0;
 }
 
+/* Are we currently in a drained section? */
+bool blk_in_drain(BlockBackend *blk)
+{
+GLOBAL_STATE_CODE(); /* change to IO_OR_GS_CODE(), if necessary */
+return qatomic_read(&blk->quiesce_counter);
+}
+
 /* To be called between exactly one pair of blk_inc/dec_in_flight() */
 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
 {
-- 
2.40.1




[PATCH v5 08/21] block/export: stop using is_external in vhost-user-blk server

2023-05-04 Thread Stefan Hajnoczi
vhost-user activity must be suspended during bdrv_drained_begin/end().
This prevents new requests from interfering with whatever is happening
in the drained section.

Previously this was done using aio_set_fd_handler()'s is_external
argument. In a multi-queue block layer world the aio_disable_external()
API cannot be used since multiple AioContexts may be processing I/O, not
just one.

Switch to BlockDevOps->drained_begin/end() callbacks.
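
For orientation, the sequence this relies on looks roughly like the
following sketch (only the bdrv_drained_*() and vu_blk_*() names come
from this series; the flow is simplified):

    bdrv_drained_begin(bs);  /* invokes BlockDevOps .drained_begin(), i.e.
                              * vu_blk_drained_begin(): detach the server */
    /* ... drained section: AioContext change, graph changes, ... */
    bdrv_drained_end(bs);    /* invokes .drained_end(), i.e.
                              * vu_blk_drained_end(): re-attach the server
                              * to vexp->export.ctx */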

Signed-off-by: Stefan Hajnoczi 
---
 block/export/vhost-user-blk-server.c | 28 ++--
 util/vhost-user-server.c | 10 +-
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/block/export/vhost-user-blk-server.c 
b/block/export/vhost-user-blk-server.c
index f51a36a14f..81b59761e3 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -212,15 +212,21 @@ static void blk_aio_attached(AioContext *ctx, void 
*opaque)
 {
 VuBlkExport *vexp = opaque;
 
+/*
+ * The actual attach will happen in vu_blk_drained_end() and we just
+ * restore ctx here.
+ */
 vexp->export.ctx = ctx;
-vhost_user_server_attach_aio_context(&vexp->vu_server, ctx);
 }
 
 static void blk_aio_detach(void *opaque)
 {
 VuBlkExport *vexp = opaque;
 
-vhost_user_server_detach_aio_context(&vexp->vu_server);
+/*
+ * The actual detach already happened in vu_blk_drained_begin() but from
+ * this point on we must not access ctx anymore.
+ */
 vexp->export.ctx = NULL;
 }
 
@@ -272,6 +278,22 @@ static void vu_blk_exp_resize(void *opaque)
 vu_config_change_msg(&vexp->vu_server.vu_dev);
 }
 
+/* Called with vexp->export.ctx acquired */
+static void vu_blk_drained_begin(void *opaque)
+{
+VuBlkExport *vexp = opaque;
+
+vhost_user_server_detach_aio_context(&vexp->vu_server);
+}
+
+/* Called with vexp->export.blk AioContext acquired */
+static void vu_blk_drained_end(void *opaque)
+{
+VuBlkExport *vexp = opaque;
+
+vhost_user_server_attach_aio_context(&vexp->vu_server, vexp->export.ctx);
+}
+
 /*
  * Ensures that bdrv_drained_begin() waits until in-flight requests complete.
  *
@@ -285,6 +307,8 @@ static bool vu_blk_drained_poll(void *opaque)
 }
 
 static const BlockDevOps vu_blk_dev_ops = {
+.drained_begin = vu_blk_drained_begin,
+.drained_end   = vu_blk_drained_end,
 .drained_poll  = vu_blk_drained_poll,
 .resize_cb = vu_blk_exp_resize,
 };
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 68c3bf162f..a12b2d1bba 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -278,7 +278,7 @@ set_watch(VuDev *vu_dev, int fd, int vu_evt,
 vu_fd_watch->fd = fd;
 vu_fd_watch->cb = cb;
 qemu_socket_set_nonblock(fd);
-aio_set_fd_handler(server->ioc->ctx, fd, true, kick_handler,
+aio_set_fd_handler(server->ioc->ctx, fd, false, kick_handler,
NULL, NULL, NULL, vu_fd_watch);
 vu_fd_watch->vu_dev = vu_dev;
 vu_fd_watch->pvt = pvt;
@@ -299,7 +299,7 @@ static void remove_watch(VuDev *vu_dev, int fd)
 if (!vu_fd_watch) {
 return;
 }
-aio_set_fd_handler(server->ioc->ctx, fd, true,
+aio_set_fd_handler(server->ioc->ctx, fd, false,
NULL, NULL, NULL, NULL, NULL);
 
 QTAILQ_REMOVE(&server->vu_fd_watches, vu_fd_watch, next);
@@ -362,7 +362,7 @@ void vhost_user_server_stop(VuServer *server)
 VuFdWatch *vu_fd_watch;
 
 QTAILQ_FOREACH(vu_fd_watch, &server->vu_fd_watches, next) {
-aio_set_fd_handler(server->ctx, vu_fd_watch->fd, true,
+aio_set_fd_handler(server->ctx, vu_fd_watch->fd, false,
NULL, NULL, NULL, NULL, vu_fd_watch);
 }
 
@@ -403,7 +403,7 @@ void vhost_user_server_attach_aio_context(VuServer *server, 
AioContext *ctx)
 qio_channel_attach_aio_context(server->ioc, ctx);
 
 QTAILQ_FOREACH(vu_fd_watch, &server->vu_fd_watches, next) {
-aio_set_fd_handler(ctx, vu_fd_watch->fd, true, kick_handler, NULL,
+aio_set_fd_handler(ctx, vu_fd_watch->fd, false, kick_handler, NULL,
NULL, NULL, vu_fd_watch);
 }
 
@@ -417,7 +417,7 @@ void vhost_user_server_detach_aio_context(VuServer *server)
 VuFdWatch *vu_fd_watch;
 
 QTAILQ_FOREACH(vu_fd_watch, &server->vu_fd_watches, next) {
-aio_set_fd_handler(server->ctx, vu_fd_watch->fd, true,
+aio_set_fd_handler(server->ctx, vu_fd_watch->fd, false,
NULL, NULL, NULL, NULL, vu_fd_watch);
 }
 
-- 
2.40.1




[PATCH v5 09/21] hw/xen: do not use aio_set_fd_handler(is_external=true) in xen_xenstore

2023-05-04 Thread Stefan Hajnoczi
There is no need to suspend activity between aio_disable_external() and
aio_enable_external(), which is mainly used for the block layer's drain
operation.

This is part of ongoing work to remove the aio_disable_external() API.

Reviewed-by: David Woodhouse 
Reviewed-by: Paul Durrant 
Signed-off-by: Stefan Hajnoczi 
---
 hw/i386/kvm/xen_xenstore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/kvm/xen_xenstore.c b/hw/i386/kvm/xen_xenstore.c
index 900679af8a..6e81bc8791 100644
--- a/hw/i386/kvm/xen_xenstore.c
+++ b/hw/i386/kvm/xen_xenstore.c
@@ -133,7 +133,7 @@ static void xen_xenstore_realize(DeviceState *dev, Error 
**errp)
 error_setg(errp, "Xenstore evtchn port init failed");
 return;
 }
-aio_set_fd_handler(qemu_get_aio_context(), xen_be_evtchn_fd(s->eh), true,
+aio_set_fd_handler(qemu_get_aio_context(), xen_be_evtchn_fd(s->eh), false,
xen_xenstore_event, NULL, NULL, NULL, s);
 
 s->impl = xs_impl_create(xen_domid);
-- 
2.40.1




[PATCH v5 07/21] block/export: wait for vhost-user-blk requests when draining

2023-05-04 Thread Stefan Hajnoczi
Each vhost-user-blk request runs in a coroutine. When the BlockBackend
enters a drained section we need to enter a quiescent state. Currently
any in-flight requests race with bdrv_drained_begin() because it is
unaware of vhost-user-blk requests.

When blk_co_preadv/pwritev()/etc returns it wakes the
bdrv_drained_begin() thread but vhost-user-blk request processing has
not yet finished. The request coroutine continues executing while the
main loop thread thinks it is in a drained section.

One example where this is unsafe is for blk_set_aio_context() where
bdrv_drained_begin() is called before .aio_context_detached() and
.aio_context_attach(). If request coroutines are still running after
bdrv_drained_begin(), then the AioContext could change underneath them
and they race with new requests processed in the new AioContext. This
could lead to virtqueue corruption, for example.

(This example is theoretical; I came across it while reading the
code and have not tried to reproduce it.)

It's easy to make bdrv_drained_begin() wait for in-flight requests: add
a .drained_poll() callback that checks the VuServer's in-flight counter.
VuServer just needs an API that returns true when there are requests in
flight. The in-flight counter needs to be atomic.
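
Conceptually, the drain code keeps polling until .drained_poll() returns
false. A rough sketch of the loop shape (simplified; the real block layer
code goes through BDRV_POLL_WHILE()):

    /* wait until vu_blk_drained_poll() reports no in-flight requests */
    while (dev_ops->drained_poll(opaque)) {
        aio_poll(ctx, true); /* let in-flight request coroutines finish */
    }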

Signed-off-by: Stefan Hajnoczi 
---
v5:
- Use atomic accesses for in_flight counter in vhost-user-server.c [Kevin]
---
 include/qemu/vhost-user-server.h |  4 +++-
 block/export/vhost-user-blk-server.c | 13 +
 util/vhost-user-server.c | 18 --
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/qemu/vhost-user-server.h b/include/qemu/vhost-user-server.h
index bc0ac9ddb6..b1c1cda886 100644
--- a/include/qemu/vhost-user-server.h
+++ b/include/qemu/vhost-user-server.h
@@ -40,8 +40,9 @@ typedef struct {
 int max_queues;
 const VuDevIface *vu_iface;
 
+unsigned int in_flight; /* atomic */
+
 /* Protected by ctx lock */
-unsigned int in_flight;
 bool wait_idle;
 VuDev vu_dev;
 QIOChannel *ioc; /* The I/O channel with the client */
@@ -62,6 +63,7 @@ void vhost_user_server_stop(VuServer *server);
 
 void vhost_user_server_inc_in_flight(VuServer *server);
 void vhost_user_server_dec_in_flight(VuServer *server);
+bool vhost_user_server_has_in_flight(VuServer *server);
 
 void vhost_user_server_attach_aio_context(VuServer *server, AioContext *ctx);
 void vhost_user_server_detach_aio_context(VuServer *server);
diff --git a/block/export/vhost-user-blk-server.c 
b/block/export/vhost-user-blk-server.c
index 841acb36e3..f51a36a14f 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -272,7 +272,20 @@ static void vu_blk_exp_resize(void *opaque)
 vu_config_change_msg(&vexp->vu_server.vu_dev);
 }
 
+/*
+ * Ensures that bdrv_drained_begin() waits until in-flight requests complete.
+ *
+ * Called with vexp->export.ctx acquired.
+ */
+static bool vu_blk_drained_poll(void *opaque)
+{
+VuBlkExport *vexp = opaque;
+
+return vhost_user_server_has_in_flight(&vexp->vu_server);
+}
+
 static const BlockDevOps vu_blk_dev_ops = {
+.drained_poll  = vu_blk_drained_poll,
 .resize_cb = vu_blk_exp_resize,
 };
 
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 1622f8cfb3..68c3bf162f 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -78,17 +78,23 @@ static void panic_cb(VuDev *vu_dev, const char *buf)
 void vhost_user_server_inc_in_flight(VuServer *server)
 {
 assert(!server->wait_idle);
-server->in_flight++;
+qatomic_inc(&server->in_flight);
 }
 
 void vhost_user_server_dec_in_flight(VuServer *server)
 {
-server->in_flight--;
-if (server->wait_idle && !server->in_flight) {
-aio_co_wake(server->co_trip);
+if (qatomic_fetch_dec(&server->in_flight) == 1) {
+if (server->wait_idle) {
+aio_co_wake(server->co_trip);
+}
 }
 }
 
+bool vhost_user_server_has_in_flight(VuServer *server)
+{
+return qatomic_load_acquire(&server->in_flight) > 0;
+}
+
 static bool coroutine_fn
 vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg *vmsg)
 {
@@ -192,13 +198,13 @@ static coroutine_fn void vu_client_trip(void *opaque)
 /* Keep running */
 }
 
-if (server->in_flight) {
+if (vhost_user_server_has_in_flight(server)) {
 /* Wait for requests to complete before we can unmap the memory */
 server->wait_idle = true;
 qemu_coroutine_yield();
 server->wait_idle = false;
 }
-assert(server->in_flight == 0);
+assert(!vhost_user_server_has_in_flight(server));
 
 vu_deinit(vu_dev);
 
-- 
2.40.1




[PATCH v5 06/21] util/vhost-user-server: rename refcount to in_flight counter

2023-05-04 Thread Stefan Hajnoczi
The VuServer object has a refcount field and ref/unref APIs. The name is
confusing because it's actually an in-flight request counter instead of
a refcount.

Normally a refcount destroys the object upon reaching zero. The VuServer
counter is used to wake up the vhost-user coroutine when there are no
more requests.

Avoid confusion by renaming refcount and ref/unref to in_flight and
inc/dec.

Reviewed-by: Paolo Bonzini 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Stefan Hajnoczi 
---
 include/qemu/vhost-user-server.h |  6 +++---
 block/export/vhost-user-blk-server.c | 11 +++
 util/vhost-user-server.c | 14 +++---
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/include/qemu/vhost-user-server.h b/include/qemu/vhost-user-server.h
index 25c72433ca..bc0ac9ddb6 100644
--- a/include/qemu/vhost-user-server.h
+++ b/include/qemu/vhost-user-server.h
@@ -41,7 +41,7 @@ typedef struct {
 const VuDevIface *vu_iface;
 
 /* Protected by ctx lock */
-unsigned int refcount;
+unsigned int in_flight;
 bool wait_idle;
 VuDev vu_dev;
 QIOChannel *ioc; /* The I/O channel with the client */
@@ -60,8 +60,8 @@ bool vhost_user_server_start(VuServer *server,
 
 void vhost_user_server_stop(VuServer *server);
 
-void vhost_user_server_ref(VuServer *server);
-void vhost_user_server_unref(VuServer *server);
+void vhost_user_server_inc_in_flight(VuServer *server);
+void vhost_user_server_dec_in_flight(VuServer *server);
 
 void vhost_user_server_attach_aio_context(VuServer *server, AioContext *ctx);
 void vhost_user_server_detach_aio_context(VuServer *server);
diff --git a/block/export/vhost-user-blk-server.c 
b/block/export/vhost-user-blk-server.c
index e56b92f2e2..841acb36e3 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -50,7 +50,10 @@ static void vu_blk_req_complete(VuBlkReq *req, size_t in_len)
 free(req);
 }
 
-/* Called with server refcount increased, must decrease before returning */
+/*
+ * Called with server in_flight counter increased, must decrease before
+ * returning.
+ */
 static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
 {
 VuBlkReq *req = opaque;
@@ -68,12 +71,12 @@ static void coroutine_fn vu_blk_virtio_process_req(void 
*opaque)
 in_num, out_num);
 if (in_len < 0) {
 free(req);
-vhost_user_server_unref(server);
+vhost_user_server_dec_in_flight(server);
 return;
 }
 
 vu_blk_req_complete(req, in_len);
-vhost_user_server_unref(server);
+vhost_user_server_dec_in_flight(server);
 }
 
 static void vu_blk_process_vq(VuDev *vu_dev, int idx)
@@ -95,7 +98,7 @@ static void vu_blk_process_vq(VuDev *vu_dev, int idx)
 Coroutine *co =
 qemu_coroutine_create(vu_blk_virtio_process_req, req);
 
-vhost_user_server_ref(server);
+vhost_user_server_inc_in_flight(server);
 qemu_coroutine_enter(co);
 }
 }
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 5b6216069c..1622f8cfb3 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -75,16 +75,16 @@ static void panic_cb(VuDev *vu_dev, const char *buf)
 error_report("vu_panic: %s", buf);
 }
 
-void vhost_user_server_ref(VuServer *server)
+void vhost_user_server_inc_in_flight(VuServer *server)
 {
 assert(!server->wait_idle);
-server->refcount++;
+server->in_flight++;
 }
 
-void vhost_user_server_unref(VuServer *server)
+void vhost_user_server_dec_in_flight(VuServer *server)
 {
-server->refcount--;
-if (server->wait_idle && !server->refcount) {
+server->in_flight--;
+if (server->wait_idle && !server->in_flight) {
 aio_co_wake(server->co_trip);
 }
 }
@@ -192,13 +192,13 @@ static coroutine_fn void vu_client_trip(void *opaque)
 /* Keep running */
 }
 
-if (server->refcount) {
+if (server->in_flight) {
 /* Wait for requests to complete before we can unmap the memory */
 server->wait_idle = true;
 qemu_coroutine_yield();
 server->wait_idle = false;
 }
-assert(server->refcount == 0);
+assert(server->in_flight == 0);
 
 vu_deinit(vu_dev);
 
-- 
2.40.1




[PATCH v5 04/21] virtio-scsi: avoid race between unplug and transport event

2023-05-04 Thread Stefan Hajnoczi
Only report a transport reset event to the guest after the SCSIDevice
has been unrealized by qdev_simple_device_unplug_cb().

qdev_simple_device_unplug_cb() sets the SCSIDevice's qdev.realized field
to false so that scsi_device_find/get() no longer see it.

scsi_target_emulate_report_luns() also needs to be updated to filter out
SCSIDevices that are unrealized.

Change virtio_scsi_push_event() to take event information as an argument
instead of the SCSIDevice. This allows virtio_scsi_hotunplug() to emit a
VIRTIO_SCSI_T_TRANSPORT_RESET event after the SCSIDevice has already
been unrealized.

These changes ensure that the guest driver does not see the SCSIDevice
that's being unplugged if it responds very quickly to the transport
reset event.
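
The resulting unplug flow, in outline (names from this series; simplified):

    VirtIOSCSIEventInfo info = {
        .event   = VIRTIO_SCSI_T_TRANSPORT_RESET,
        .reason  = VIRTIO_SCSI_EVT_RESET_REMOVED,
        .address = { .id = sd->id, .lun = sd->lun }, /* stash before unplug */
    };

    qdev_simple_device_unplug_cb(hotplug_dev, dev, errp); /* unrealizes sd */
    virtio_scsi_push_event(s, &info); /* safe: does not dereference sd */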

Reviewed-by: Paolo Bonzini 
Reviewed-by: Michael S. Tsirkin 
Reviewed-by: Daniil Tatianin 
Signed-off-by: Stefan Hajnoczi 
---
v5:
- Stash SCSIDevice id/lun values for VIRTIO_SCSI_T_TRANSPORT_RESET event
  before unrealizing the SCSIDevice [Kevin]
---
 hw/scsi/scsi-bus.c|  3 +-
 hw/scsi/virtio-scsi.c | 86 ++-
 2 files changed, 63 insertions(+), 26 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 8857ff41f6..64013c8a24 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -487,7 +487,8 @@ static bool scsi_target_emulate_report_luns(SCSITargetReq 
*r)
 DeviceState *qdev = kid->child;
 SCSIDevice *dev = SCSI_DEVICE(qdev);
 
-if (dev->channel == channel && dev->id == id && dev->lun != 0) {
+if (dev->channel == channel && dev->id == id && dev->lun != 0 &&
+qdev_is_realized(&dev->qdev)) {
 store_lun(tmp, dev->lun);
 g_byte_array_append(buf, tmp, 8);
 len += 8;
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 612c525d9d..ae314af3de 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -933,13 +933,27 @@ static void virtio_scsi_reset(VirtIODevice *vdev)
 s->events_dropped = false;
 }
 
-static void virtio_scsi_push_event(VirtIOSCSI *s, SCSIDevice *dev,
-   uint32_t event, uint32_t reason)
+typedef struct {
+uint32_t event;
+uint32_t reason;
+union {
+/* Used by messages specific to a device */
+struct {
+uint32_t id;
+uint32_t lun;
+} address;
+};
+} VirtIOSCSIEventInfo;
+
+static void virtio_scsi_push_event(VirtIOSCSI *s,
+   const VirtIOSCSIEventInfo *info)
 {
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
 VirtIOSCSIReq *req;
 VirtIOSCSIEvent *evt;
 VirtIODevice *vdev = VIRTIO_DEVICE(s);
+uint32_t event = info->event;
+uint32_t reason = info->reason;
 
 if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
 return;
@@ -965,27 +979,28 @@ static void virtio_scsi_push_event(VirtIOSCSI *s, 
SCSIDevice *dev,
 memset(evt, 0, sizeof(VirtIOSCSIEvent));
 evt->event = virtio_tswap32(vdev, event);
 evt->reason = virtio_tswap32(vdev, reason);
-if (!dev) {
-assert(event == VIRTIO_SCSI_T_EVENTS_MISSED);
-} else {
+if (event != VIRTIO_SCSI_T_EVENTS_MISSED) {
 evt->lun[0] = 1;
-evt->lun[1] = dev->id;
+evt->lun[1] = info->address.id;
 
 /* Linux wants us to keep the same encoding we use for REPORT LUNS.  */
-if (dev->lun >= 256) {
-evt->lun[2] = (dev->lun >> 8) | 0x40;
+if (info->address.lun >= 256) {
+evt->lun[2] = (info->address.lun >> 8) | 0x40;
 }
-evt->lun[3] = dev->lun & 0xFF;
+evt->lun[3] = info->address.lun & 0xFF;
 }
 trace_virtio_scsi_event(virtio_scsi_get_lun(evt->lun), event, reason);
- 
+
 virtio_scsi_complete_req(req);
 }
 
 static void virtio_scsi_handle_event_vq(VirtIOSCSI *s, VirtQueue *vq)
 {
 if (s->events_dropped) {
-virtio_scsi_push_event(s, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
+VirtIOSCSIEventInfo info = {
+.event = VIRTIO_SCSI_T_NO_EVENT,
+};
+virtio_scsi_push_event(s, &info);
 }
 }
 
@@ -1009,9 +1024,17 @@ static void virtio_scsi_change(SCSIBus *bus, SCSIDevice 
*dev, SCSISense sense)
 
 if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_CHANGE) &&
 dev->type != TYPE_ROM) {
+VirtIOSCSIEventInfo info = {
+.event   = VIRTIO_SCSI_T_PARAM_CHANGE,
+.reason  = sense.asc | (sense.ascq << 8),
+.address = {
+.id  = dev->id,
+.lun = dev->lun,
+},
+};
+
 virtio_scsi_acquire(s);
-virtio_scsi_push_event(s, dev, VIRTIO_SCSI_T_PARAM_CHANGE,
-   sense.asc | (

[PATCH v5 05/21] virtio-scsi: stop using aio_disable_external() during unplug

2023-05-04 Thread Stefan Hajnoczi
This patch is part of an effort to remove the aio_disable_external()
API because it does not fit in a multi-queue block layer world where
many AioContexts may be submitting requests to the same disk.

The SCSI emulation code is already in good shape to stop using
aio_disable_external(). It was only used by commit 9c5aad84da1c
("virtio-scsi: fixed virtio_scsi_ctx_check failed when detaching scsi
disk") to ensure that virtio_scsi_hotunplug() works while the guest
driver is submitting I/O.

Ensure virtio_scsi_hotunplug() is safe as follows:

1. qdev_simple_device_unplug_cb() -> qdev_unrealize() ->
   device_set_realized() calls qatomic_set(&dev->realized, false) so
   that future scsi_device_get() calls return NULL because they exclude
   SCSIDevices with realized=false.

   That means virtio-scsi will reject new I/O requests to this
   SCSIDevice with VIRTIO_SCSI_S_BAD_TARGET even while
   virtio_scsi_hotunplug() is still executing. We are protected against
   new requests!

2. scsi_device_unrealize() already contains a call to
   scsi_device_purge_requests() so that in-flight requests are cancelled
   synchronously. This ensures that no in-flight requests remain once
   qdev_simple_device_unplug_cb() returns.

Thanks to these two conditions we don't need aio_disable_external()
anymore.

Cc: Zhengui Li 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Daniil Tatianin 
Signed-off-by: Stefan Hajnoczi 
---
 hw/scsi/virtio-scsi.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index ae314af3de..c1a7ea9ae2 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -1091,7 +1091,6 @@ static void virtio_scsi_hotunplug(HotplugHandler 
*hotplug_dev, DeviceState *dev,
 VirtIODevice *vdev = VIRTIO_DEVICE(hotplug_dev);
 VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 SCSIDevice *sd = SCSI_DEVICE(dev);
-AioContext *ctx = s->ctx ?: qemu_get_aio_context();
 VirtIOSCSIEventInfo info = {
 .event   = VIRTIO_SCSI_T_TRANSPORT_RESET,
 .reason  = VIRTIO_SCSI_EVT_RESET_REMOVED,
@@ -1101,9 +1100,7 @@ static void virtio_scsi_hotunplug(HotplugHandler 
*hotplug_dev, DeviceState *dev,
 },
 };
 
-aio_disable_external(ctx);
 qdev_simple_device_unplug_cb(hotplug_dev, dev, errp);
-aio_enable_external(ctx);
 
 if (s->ctx) {
 virtio_scsi_acquire(s);
-- 
2.40.1




[PATCH v5 03/21] hw/qdev: introduce qdev_is_realized() helper

2023-05-04 Thread Stefan Hajnoczi
Add a helper function to check whether the device is realized without
requiring the Big QEMU Lock. The next patch adds a second caller. The
goal is to avoid spreading DeviceState field accesses throughout the
code.

Suggested-by: Philippe Mathieu-Daudé 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Stefan Hajnoczi 
---
 include/hw/qdev-core.h | 17 ++---
 hw/scsi/scsi-bus.c |  3 +--
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 7623703943..f1070d6dc7 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -1,6 +1,7 @@
 #ifndef QDEV_CORE_H
 #define QDEV_CORE_H
 
+#include "qemu/atomic.h"
 #include "qemu/queue.h"
 #include "qemu/bitmap.h"
 #include "qemu/rcu.h"
@@ -168,9 +169,6 @@ typedef struct {
 
 /**
  * DeviceState:
- * @realized: Indicates whether the device has been fully constructed.
- *When accessed outside big qemu lock, must be accessed with
- *qatomic_load_acquire()
  * @reset: ResettableState for the device; handled by Resettable interface.
  *
  * This structure should not be accessed directly.  We declare it here
@@ -339,6 +337,19 @@ DeviceState *qdev_new(const char *name);
  */
 DeviceState *qdev_try_new(const char *name);
 
+/**
+ * qdev_is_realized:
+ * @dev: The device to check.
+ *
+ * May be called outside big qemu lock.
+ *
+ * Returns: %true% if the device has been fully constructed, %false% otherwise.
+ */
+static inline bool qdev_is_realized(DeviceState *dev)
+{
+return qatomic_load_acquire(&dev->realized);
+}
+
 /**
  * qdev_realize: Realize @dev.
  * @dev: device to realize
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 3c20b47ad0..8857ff41f6 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -60,8 +60,7 @@ static SCSIDevice *do_scsi_device_find(SCSIBus *bus,
  * the user access the device.
  */
 
-if (retval && !include_unrealized &&
-!qatomic_load_acquire(&retval->qdev.realized)) {
+if (retval && !include_unrealized && !qdev_is_realized(&retval->qdev)) {
 retval = NULL;
 }
 
-- 
2.40.1




[PATCH v5 00/21] block: remove aio_disable_external() API

2023-05-04 Thread Stefan Hajnoczi
v5:
- Use atomic accesses for in_flight counter in vhost-user-server.c [Kevin]
- Stash SCSIDevice id/lun values for VIRTIO_SCSI_T_TRANSPORT_RESET event
  before unrealizing the SCSIDevice [Kevin]
- Keep vhost-user-blk export .detach() callback so ctx is set to NULL [Kevin]
- Narrow BdrvChildClass and BlockDriver drained_{begin/end/poll} callbacks from
  IO_OR_GS_CODE() to GLOBAL_STATE_CODE() [Kevin]
- Include Kevin's "block: Fix use after free in blockdev_mark_auto_del()" to
  fix a latent bug that was exposed by this series

v4:
- Remove external_disable_cnt variable [Philippe]
- Add Patch 1 to fix assertion failure in .drained_end() -> 
blk_get_aio_context()
v3:
- Resend full patch series. v2 was sent in the middle of a git rebase and was
  missing patches. [Eric]
- Apply Reviewed-by tags.
v2:
- Do not rely on BlockBackend request queuing, implement .drained_begin/end()
  instead in xen-block, virtio-blk, and virtio-scsi [Paolo]
- Add qdev_is_realized() API [Philippe]
- Add patch to avoid AioContext lock around blk_exp_ref/unref() [Paolo]
- Add patch to call .drained_begin/end() from main loop thread to simplify
  callback implementations

The aio_disable_external() API temporarily suspends file descriptor monitoring
in the event loop. The block layer uses this to prevent new I/O requests being
submitted from the guest and elsewhere between bdrv_drained_begin() and
bdrv_drained_end().

While the block layer still needs to prevent new I/O requests in drained
sections, the aio_disable_external() API can be replaced with
.drained_begin/end/poll() callbacks that have been added to BdrvChildClass and
BlockDevOps.

This newer .drained_begin/end/poll() approach is attractive because it works
without specifying a specific AioContext. The block layer is moving towards
multi-queue and that means multiple AioContexts may be processing I/O
simultaneously.

The aio_disable_external() API was always somewhat hacky. It suspends all file
descriptors that were registered with is_external=true, even if they have
nothing to do with the BlockDriverState graph nodes that are being drained.
It's better to solve a block layer problem in the block layer than to have an
odd event loop API solution.

The approach in this patch series is to implement BlockDevOps
.drained_begin/end() callbacks that temporarily stop file descriptor handlers.
This ensures that new I/O requests are not submitted in drained sections.
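
In outline, a device or export implements the new callbacks like this
(a sketch; the BlockDevOps fields match this series, the my_*() helpers
are placeholders):

    static void my_drained_begin(void *opaque)
    {
        /* stop fd handlers / stop submitting new requests */
    }

    static void my_drained_end(void *opaque)
    {
        /* resume fd handlers */
    }

    static bool my_drained_poll(void *opaque)
    {
        return my_in_flight_requests(opaque) > 0; /* placeholder counter */
    }

    static const BlockDevOps my_dev_ops = {
        .drained_begin = my_drained_begin,
        .drained_end   = my_drained_end,
        .drained_poll  = my_drained_poll,
    };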

Kevin Wolf (1):
  block: Fix use after free in blockdev_mark_auto_del()

Stefan Hajnoczi (20):
  block-backend: split blk_do_set_aio_context()
  hw/qdev: introduce qdev_is_realized() helper
  virtio-scsi: avoid race between unplug and transport event
  virtio-scsi: stop using aio_disable_external() during unplug
  util/vhost-user-server: rename refcount to in_flight counter
  block/export: wait for vhost-user-blk requests when draining
  block/export: stop using is_external in vhost-user-blk server
  hw/xen: do not use aio_set_fd_handler(is_external=true) in
xen_xenstore
  block: add blk_in_drain() API
  block: drain from main loop thread in bdrv_co_yield_to_drain()
  xen-block: implement BlockDevOps->drained_begin()
  hw/xen: do not set is_external=true on evtchn fds
  block/export: rewrite vduse-blk drain code
  block/export: don't require AioContext lock around blk_exp_ref/unref()
  block/fuse: do not set is_external=true on FUSE fd
  virtio: make it possible to detach host notifier from any thread
  virtio-blk: implement BlockDevOps->drained_begin()
  virtio-scsi: implement BlockDevOps->drained_begin()
  virtio: do not set is_external=true on host notifiers
  aio: remove aio_disable_external() API

 hw/block/dataplane/xen-block.h  |   2 +
 include/block/aio.h |  57 -
 include/block/block_int-common.h|  90 +++---
 include/block/export.h  |   2 +
 include/hw/qdev-core.h  |  17 ++-
 include/hw/scsi/scsi.h  |  14 +++
 include/qemu/vhost-user-server.h|   8 +-
 include/sysemu/block-backend-common.h   |  25 ++--
 include/sysemu/block-backend-global-state.h |   1 +
 util/aio-posix.h|   1 -
 block.c |   7 --
 block/blkio.c   |  15 +--
 block/block-backend.c   |  78 ++--
 block/curl.c|  10 +-
 block/export/export.c   |  13 +-
 block/export/fuse.c |  56 -
 block/export/vduse-blk.c| 128 ++--
 block/export/vhost-user-blk-server.c|  52 +++-
 block/io.c  |  16 ++-
 block/io_uring.c|   4 +-
 block/iscsi.c   |   3 +-
 block/linux-aio.c   |   4 +-
 block/nfs.c  

[PATCH v5 02/21] block-backend: split blk_do_set_aio_context()

2023-05-04 Thread Stefan Hajnoczi
blk_set_aio_context() is not fully transactional because
blk_do_set_aio_context() updates blk->ctx outside the transaction. Most
of the time this goes unnoticed but a BlockDevOps.drained_end() callback
that invokes blk_get_aio_context() fails assert(ctx == blk->ctx). This
happens because blk->ctx is only assigned after
BlockDevOps.drained_end() is called and we're in an intermediate state
where BlockDriverState nodes already have the new context and the
BlockBackend still has the old context.

Making blk_set_aio_context() fully transactional solves this assertion
failure because the BlockBackend's context is updated as part of the
transaction (before BlockDevOps.drained_end() is called).

Split blk_do_set_aio_context() in order to solve this assertion failure.
This helper function actually serves two different purposes:
1. It drives blk_set_aio_context().
2. It responds to BdrvChildClass->change_aio_ctx().

Get rid of the helper function. Do #1 inside blk_set_aio_context() and
do #2 inside blk_root_set_aio_ctx_commit(). This simplifies the code.

The only drawback of the fully transactional approach is that
blk_set_aio_context() must contend with blk_root_set_aio_ctx_commit()
being invoked as part of the AioContext change propagation. This can be
solved by temporarily setting blk->allow_aio_context_change to true.

Future patches call blk_get_aio_context() from
BlockDevOps->drained_end(), so this patch will become necessary.
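
The transactional flow after this patch, in outline (condensed from the
diff below):

    old_allow_change = blk->allow_aio_context_change;
    blk->allow_aio_context_change = true; /* let propagation reach blk */
    ret = bdrv_try_change_aio_context(bs, new_context, NULL, errp);
        /* -> the transaction commit runs blk_root_set_aio_ctx_commit(),
         *    which assigns blk->ctx before BlockDevOps.drained_end() */
    blk->allow_aio_context_change = old_allow_change;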

Signed-off-by: Stefan Hajnoczi 
---
 block/block-backend.c | 71 +--
 1 file changed, 28 insertions(+), 43 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index fc530ded6a..68d38635bc 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2205,52 +2205,31 @@ static AioContext *blk_aiocb_get_aio_context(BlockAIOCB 
*acb)
 return blk_get_aio_context(blk_acb->blk);
 }
 
-static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context,
-  bool update_root_node, Error **errp)
-{
-BlockDriverState *bs = blk_bs(blk);
-ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
-int ret;
-
-if (bs) {
-bdrv_ref(bs);
-
-if (update_root_node) {
-/*
- * update_root_node MUST be false for 
blk_root_set_aio_ctx_commit(),
- * as we are already in the commit function of a transaction.
- */
-ret = bdrv_try_change_aio_context(bs, new_context, blk->root, 
errp);
-if (ret < 0) {
-bdrv_unref(bs);
-return ret;
-}
-}
-/*
- * Make blk->ctx consistent with the root node before we invoke any
- * other operations like drain that might inquire blk->ctx
- */
-blk->ctx = new_context;
-if (tgm->throttle_state) {
-bdrv_drained_begin(bs);
-throttle_group_detach_aio_context(tgm);
-throttle_group_attach_aio_context(tgm, new_context);
-bdrv_drained_end(bs);
-}
-
-bdrv_unref(bs);
-} else {
-blk->ctx = new_context;
-}
-
-return 0;
-}
-
 int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
 Error **errp)
 {
+bool old_allow_change;
+BlockDriverState *bs = blk_bs(blk);
+int ret;
+
 GLOBAL_STATE_CODE();
-return blk_do_set_aio_context(blk, new_context, true, errp);
+
+if (!bs) {
+blk->ctx = new_context;
+return 0;
+}
+
+bdrv_ref(bs);
+
+old_allow_change = blk->allow_aio_context_change;
+blk->allow_aio_context_change = true;
+
+ret = bdrv_try_change_aio_context(bs, new_context, NULL, errp);
+
+blk->allow_aio_context_change = old_allow_change;
+
+bdrv_unref(bs);
+return ret;
 }
 
 typedef struct BdrvStateBlkRootContext {
@@ -2262,8 +2241,14 @@ static void blk_root_set_aio_ctx_commit(void *opaque)
 {
 BdrvStateBlkRootContext *s = opaque;
 BlockBackend *blk = s->blk;
+AioContext *new_context = s->new_ctx;
+ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
 
-blk_do_set_aio_context(blk, s->new_ctx, false, &error_abort);
+blk->ctx = new_context;
+if (tgm->throttle_state) {
+throttle_group_detach_aio_context(tgm);
+throttle_group_attach_aio_context(tgm, new_context);
+}
 }
 
 static TransactionActionDrv set_blk_root_context = {
-- 
2.40.1




[PATCH v5 01/21] block: Fix use after free in blockdev_mark_auto_del()

2023-05-04 Thread Stefan Hajnoczi
From: Kevin Wolf 

job_cancel_locked() drops the job list lock temporarily and it may call
aio_poll(). We must assume that the list has changed after this call.
Also, with unlucky timing, it can end up freeing the job during
job_completed_txn_abort_locked(), making the job pointer invalid, too.

For both reasons, we can't just continue at block_job_next_locked(job).
Instead, start at the head of the list again after job_cancel_locked()
and skip those jobs that we already cancelled (or that are completing
anyway).
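
The general shape of the fix, as a generic sketch (hypothetical names;
the real code is in the diff below):

    do {
        job = first_job_locked();
        /* skip jobs already cancelled or completing anyway */
        while (job && already_handled(job)) {
            job = next_job_locked(job);
        }
        if (job) {
            cancel_locked(job); /* may drop the lock and free list entries */
        }
    } while (job); /* restart from the head of the list each time */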

Signed-off-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20230503140142.474404-1-kw...@redhat.com>
---
 blockdev.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index d7b5c18f0a..2c1752a403 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -153,12 +153,22 @@ void blockdev_mark_auto_del(BlockBackend *blk)
 
 JOB_LOCK_GUARD();
 
-for (job = block_job_next_locked(NULL); job;
- job = block_job_next_locked(job)) {
-if (block_job_has_bdrv(job, blk_bs(blk))) {
+do {
+job = block_job_next_locked(NULL);
+while (job && (job->job.cancelled ||
+   job->job.deferred_to_main_loop ||
+   !block_job_has_bdrv(job, blk_bs(blk
+{
+job = block_job_next_locked(job);
+}
+if (job) {
+/*
+ * This drops the job lock temporarily and polls, so we need to
+ * restart processing the list from the start after this.
+ */
 job_cancel_locked(&job->job, false);
 }
-}
+} while (job);
 
 dinfo->auto_del = 1;
 }
-- 
2.40.1




Re: [PATCH v4 03/20] virtio-scsi: avoid race between unplug and transport event

2023-05-03 Thread Stefan Hajnoczi
On Wed, May 03, 2023 at 10:00:57AM +0200, Kevin Wolf wrote:
> On 02.05.2023 at 20:56, Stefan Hajnoczi wrote:
> > On Tue, May 02, 2023 at 05:19:46PM +0200, Kevin Wolf wrote:
> > > On 25.04.2023 at 19:26, Stefan Hajnoczi wrote:
> > > > Only report a transport reset event to the guest after the SCSIDevice
> > > > has been unrealized by qdev_simple_device_unplug_cb().
> > > > 
> > > > qdev_simple_device_unplug_cb() sets the SCSIDevice's qdev.realized field
> > > > to false so that scsi_device_find/get() no longer see it.
> > > > 
> > > > scsi_target_emulate_report_luns() also needs to be updated to filter out
> > > > SCSIDevices that are unrealized.
> > > > 
> > > > These changes ensure that the guest driver does not see the SCSIDevice
> > > > that's being unplugged if it responds very quickly to the transport
> > > > reset event.
> > > > 
> > > > Reviewed-by: Paolo Bonzini 
> > > > Reviewed-by: Michael S. Tsirkin 
> > > > Reviewed-by: Daniil Tatianin 
> > > > Signed-off-by: Stefan Hajnoczi 
> > > 
> > > > @@ -1082,6 +1073,15 @@ static void virtio_scsi_hotunplug(HotplugHandler 
> > > > *hotplug_dev, DeviceState *dev,
> > > >  blk_set_aio_context(sd->conf.blk, qemu_get_aio_context(), 
> > > > NULL);
> > > >  virtio_scsi_release(s);
> > > >  }
> > > > +
> > > > +if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) {
> > > > +virtio_scsi_acquire(s);
> > > > +virtio_scsi_push_event(s, sd,
> > > > +   VIRTIO_SCSI_T_TRANSPORT_RESET,
> > > > +   VIRTIO_SCSI_EVT_RESET_REMOVED);
> > > > +scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED));
> > > > +virtio_scsi_release(s);
> > > > +}
> > > >  }
> > > 
> > > s, sd and s->bus are all unrealized at this point, whereas before this
> > > patch they were still realized. I couldn't find any practical problem
> > > with it, but it made me nervous enough that I thought I should comment
> > > on it at least.
> > > 
> > > Should we maybe have documentation on these functions that says that
> > > they accept unrealized objects as their parameters?
> > 
> > s is the VirtIOSCSI controller, not the SCSIDevice that is being
> > unplugged. The VirtIOSCSI controller is still realized.
> > 
> > s->bus is the VirtIOSCSI controller's bus, it is still realized.
> 
> You're right, I misread this part.
> 
> > You are right that the SCSIDevice (sd) has been unrealized at this
> > point:
> > - sd->conf.blk is safe because qdev properties stay alive until the
> >   Object is deleted, but I'm not sure we should rely on that.
> 
> This feels relatively safe (and it's preexisting anyway), reading a
> property doesn't do anything unpredictable and we know the pointer is
> still valid.
> 
> > - virtio_scsi_push_event(.., sd, ...) is questionable because the LUN
> >   that's fetched from sd no longer belongs to the unplugged SCSIDevice.
> 
> This call is what made me nervous.
> 
> > How about I change the code to fetch sd->conf.blk and the LUN before
> > unplugging?
> 
> You mean passing sd->id and sd->lun to virtio_scsi_push_event() instead
> of sd itself? That would certainly look cleaner and make sure that we
> don't later add code to it that does something with sd that would
> require it to be realized.

Yes, I'll do that in the next revision.

Stefan




Re: [PATCH v4 07/20] block/export: stop using is_external in vhost-user-blk server

2023-05-03 Thread Stefan Hajnoczi
On Wed, May 03, 2023 at 10:08:46AM +0200, Kevin Wolf wrote:
> On 02.05.2023 at 22:06, Stefan Hajnoczi wrote:
> > On Tue, May 02, 2023 at 06:04:24PM +0200, Kevin Wolf wrote:
> > > On 25.04.2023 at 19:27, Stefan Hajnoczi wrote:
> > > > vhost-user activity must be suspended during bdrv_drained_begin/end().
> > > > This prevents new requests from interfering with whatever is happening
> > > > in the drained section.
> > > > 
> > > > Previously this was done using aio_set_fd_handler()'s is_external
> > > > argument. In a multi-queue block layer world the aio_disable_external()
> > > > API cannot be used since multiple AioContexts may be processing I/O, not
> > > > just one.
> > > > 
> > > > Switch to BlockDevOps->drained_begin/end() callbacks.
> > > > 
> > > > Signed-off-by: Stefan Hajnoczi 
> > > > ---
> > > >  block/export/vhost-user-blk-server.c | 43 ++--
> > > >  util/vhost-user-server.c | 10 +++
> > > >  2 files changed, 26 insertions(+), 27 deletions(-)
> > > > 
> > > > diff --git a/block/export/vhost-user-blk-server.c 
> > > > b/block/export/vhost-user-blk-server.c
> > > > index 092b86aae4..d20f69cd74 100644
> > > > --- a/block/export/vhost-user-blk-server.c
> > > > +++ b/block/export/vhost-user-blk-server.c
> > > > @@ -208,22 +208,6 @@ static const VuDevIface vu_blk_iface = {
> > > >  .process_msg   = vu_blk_process_msg,
> > > >  };
> > > >  
> > > > -static void blk_aio_attached(AioContext *ctx, void *opaque)
> > > > -{
> > > > -VuBlkExport *vexp = opaque;
> > > > -
> > > > -vexp->export.ctx = ctx;
> > > > -vhost_user_server_attach_aio_context(&vexp->vu_server, ctx);
> > > > -}
> > > > -
> > > > -static void blk_aio_detach(void *opaque)
> > > > -{
> > > > -VuBlkExport *vexp = opaque;
> > > > -
> > > > -vhost_user_server_detach_aio_context(&vexp->vu_server);
> > > > -vexp->export.ctx = NULL;
> > > > -}
> > > 
> > > So for changing the AioContext, we now rely on the fact that the node to
> > > be changed is always drained, so the drain callbacks implicitly cover
> > > this case, too?
> > 
> > Yes.
> 
> Ok. This surprised me a bit at first, but I think it's fine.
> 
> We just need to remember it if we ever decide that once we have
> multiqueue, we can actually change the default AioContext without
> draining the node. But maybe at that point, we have to do more
> fundamental changes anyway.
> 
> > > >  static void
> > > >  vu_blk_initialize_config(BlockDriverState *bs,
> > > >   struct virtio_blk_config *config,
> > > > @@ -272,6 +256,25 @@ static void vu_blk_exp_resize(void *opaque)
> > > >  vu_config_change_msg(&vexp->vu_server.vu_dev);
> > > >  }
> > > >  
> > > > +/* Called with vexp->export.ctx acquired */
> > > > +static void vu_blk_drained_begin(void *opaque)
> > > > +{
> > > > +VuBlkExport *vexp = opaque;
> > > > +
> > > > +vhost_user_server_detach_aio_context(&vexp->vu_server);
> > > > +}
> > > 
> > > Compared to the old code, we're losing the vexp->export.ctx = NULL. This
> > > is correct at this point because after drained_begin we still keep
> > > processing requests until we arrive at a quiescent state.
> > > 
> > > However, if we detach the AioContext because we're deleting the
> > > iothread, won't we end up with a dangling pointer in vexp->export.ctx?
> > > Or can we be certain that nothing interesting happens before drained_end
> > > updates it with a new valid pointer again?
> > 
> > If you want I can add the detach() callback back again and set ctx to
> > NULL there?
> 
> I haven't thought enough about it to say if it's a problem. If you have
> and are confident that it's correct the way it is, I'm happy with it.
>
> But bringing the callback back is the minimal change compared to the old
> state. It's just unnecessary code if we don't actually need it.

The reasoning behind my patch is that detach() sets NULL today and we
would see crashes if ctx was accessed between detach() -> attach().
Therefore, I'm assuming there are no ctx accesses in the code today and
removing the ctx = NULL assignment doesn't break anything.

However, my approach is not very defensive. If the code is changed in a
way that accesses ctx when it's not supposed to, then a dangling pointer
will be accessed.

I think leaving the detach() callback there can be justified because it
will make it easier to detect bugs in the future. I'll add it back in
the next revision.

Stefan




Re: [PATCH v4 04/20] virtio-scsi: stop using aio_disable_external() during unplug

2023-05-03 Thread Stefan Hajnoczi
On Wed, May 03, 2023 at 01:40:49PM +0200, Kevin Wolf wrote:
> On 02.05.2023 at 22:02, Stefan Hajnoczi wrote:
> > On Tue, May 02, 2023 at 03:19:52PM +0200, Kevin Wolf wrote:
> > > On 01.05.2023 at 17:09, Stefan Hajnoczi wrote:
> > > > On Fri, Apr 28, 2023 at 04:22:55PM +0200, Kevin Wolf wrote:
> > > > > On 25.04.2023 at 19:27, Stefan Hajnoczi wrote:
> > > > > > This patch is part of an effort to remove the aio_disable_external()
> > > > > > API because it does not fit in a multi-queue block layer world where
> > > > > > many AioContexts may be submitting requests to the same disk.
> > > > > > 
> > > > > > The SCSI emulation code is already in good shape to stop using
> > > > > > aio_disable_external(). It was only used by commit 9c5aad84da1c
> > > > > > ("virtio-scsi: fixed virtio_scsi_ctx_check failed when detaching 
> > > > > > scsi
> > > > > > disk") to ensure that virtio_scsi_hotunplug() works while the guest
> > > > > > driver is submitting I/O.
> > > > > > 
> > > > > > Ensure virtio_scsi_hotunplug() is safe as follows:
> > > > > > 
> > > > > > 1. qdev_simple_device_unplug_cb() -> qdev_unrealize() ->
> > > > > >device_set_realized() calls qatomic_set(&dev->realized, false) so
> > > > > >that future scsi_device_get() calls return NULL because they 
> > > > > > exclude
> > > > > >SCSIDevices with realized=false.
> > > > > > 
> > > > > >That means virtio-scsi will reject new I/O requests to this
> > > > > >SCSIDevice with VIRTIO_SCSI_S_BAD_TARGET even while
> > > > > >virtio_scsi_hotunplug() is still executing. We are protected 
> > > > > > against
> > > > > >new requests!
> > > > > > 
> > > > > > 2. Add a call to scsi_device_purge_requests() from scsi_unrealize() 
> > > > > > so
> > > > > >that in-flight requests are cancelled synchronously. This ensures
> > > > > >that no in-flight requests remain once 
> > > > > > qdev_simple_device_unplug_cb()
> > > > > >returns.
> > > > > > 
> > > > > > Thanks to these two conditions we don't need aio_disable_external()
> > > > > > anymore.
> > > > > > 
> > > > > > Cc: Zhengui Li 
> > > > > > Reviewed-by: Paolo Bonzini 
> > > > > > Reviewed-by: Daniil Tatianin 
> > > > > > Signed-off-by: Stefan Hajnoczi 
> > > > > 
> > > > > qemu-iotests 040 starts failing for me after this patch, with what 
> > > > > looks
> > > > > like a use-after-free error of some kind.
> > > > > 
> > > > > (gdb) bt
> > > > > #0  0x55b6e3e1f31c in job_type (job=0xe3e3e3e3e3e3e3e3) at 
> > > > > ../job.c:238
> > > > > #1  0x55b6e3e1cee5 in is_block_job (job=0xe3e3e3e3e3e3e3e3) at 
> > > > > ../blockjob.c:41
> > > > > #2  0x55b6e3e1ce7d in block_job_next_locked (bjob=0x55b6e72b7570) 
> > > > > at ../blockjob.c:54
> > > > > #3  0x55b6e3df6370 in blockdev_mark_auto_del (blk=0x55b6e74af0a0) 
> > > > > at ../blockdev.c:157
> > > > > #4  0x55b6e393e23b in scsi_qdev_unrealize (qdev=0x55b6e7c04d40) 
> > > > > at ../hw/scsi/scsi-bus.c:303
> > > > > #5  0x55b6e3db0d0e in device_set_realized (obj=0x55b6e7c04d40, 
> > > > > value=false, errp=0x55b6e497c918 ) at 
> > > > > ../hw/core/qdev.c:599
> > > > > #6  0x55b6e3dba36e in property_set_bool (obj=0x55b6e7c04d40, 
> > > > > v=0x55b6e7d7f290, name=0x55b6e41bd6d8 "realized", 
> > > > > opaque=0x55b6e7246d20, errp=0x55b6e497c918 )
> > > > > at ../qom/object.c:2285
> > > > > #7  0x55b6e3db7e65 in object_property_set (obj=0x55b6e7c04d40, 
> > > > > name=0x55b6e41bd6d8 "realized", v=0x55b6e7d7f290, errp=0x55b6e497c918 
> > > > > ) at ../qom/object.c:1420
> > > > > #8  0x55b6e3dbd84a in object_property_set_qobject 
> > > > > (obj=0x55b6e7c04d40, name=0x55b6e41bd6d8 "realized", 
> > > > > value=0x55b6e74c1890, errp=0x55b6e497c918 )

Re: [PATCH v4 10/20] block: drain from main loop thread in bdrv_co_yield_to_drain()

2023-05-02 Thread Stefan Hajnoczi
On Tue, May 02, 2023 at 06:21:20PM +0200, Kevin Wolf wrote:
> On 25.04.2023 at 19:27, Stefan Hajnoczi wrote:
> > For simplicity, always run BlockDevOps .drained_begin/end/poll()
> > callbacks in the main loop thread. This makes it easier to implement the
> > callbacks and avoids extra locks.
> > 
> > Move the function pointer declarations from the I/O Code section to the
> > Global State section in block-backend-common.h.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> 
> If we're updating function pointers, we should probably update them in
> BdrvChildClass and BlockDriver, too.

I'll do that in the next revision.

> This means that a non-coroutine caller can't run in an iothread, not
> even the home iothread of the BlockDriverState. (I'm not sure if it was
> allowed previously. I don't think we're actually doing this, but in
> theory it could have worked.) Maybe put a GLOBAL_STATE_CODE() after
> handling the bdrv_co_yield_to_drain() case? Or would that look too odd?
> 
> IO_OR_GS_CODE();
> 
> if (qemu_in_coroutine()) {
> bdrv_co_yield_to_drain(bs, true, parent, poll);
> return;
> }
> 
> GLOBAL_STATE_CODE();

That looks good to me, it makes explicit that IO_OR_GS_CODE() only
applies until the end of the if statement.

Stefan




Re: [PATCH v4 07/20] block/export: stop using is_external in vhost-user-blk server

2023-05-02 Thread Stefan Hajnoczi
On Tue, May 02, 2023 at 06:04:24PM +0200, Kevin Wolf wrote:
> On 25.04.2023 at 19:27, Stefan Hajnoczi wrote:
> > vhost-user activity must be suspended during bdrv_drained_begin/end().
> > This prevents new requests from interfering with whatever is happening
> > in the drained section.
> > 
> > Previously this was done using aio_set_fd_handler()'s is_external
> > argument. In a multi-queue block layer world the aio_disable_external()
> > API cannot be used since multiple AioContexts may be processing I/O, not
> > just one.
> > 
> > Switch to BlockDevOps->drained_begin/end() callbacks.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >  block/export/vhost-user-blk-server.c | 43 ++--
> >  util/vhost-user-server.c | 10 +++
> >  2 files changed, 26 insertions(+), 27 deletions(-)
> > 
> > diff --git a/block/export/vhost-user-blk-server.c 
> > b/block/export/vhost-user-blk-server.c
> > index 092b86aae4..d20f69cd74 100644
> > --- a/block/export/vhost-user-blk-server.c
> > +++ b/block/export/vhost-user-blk-server.c
> > @@ -208,22 +208,6 @@ static const VuDevIface vu_blk_iface = {
> >  .process_msg   = vu_blk_process_msg,
> >  };
> >  
> > -static void blk_aio_attached(AioContext *ctx, void *opaque)
> > -{
> > -VuBlkExport *vexp = opaque;
> > -
> > -vexp->export.ctx = ctx;
> > -vhost_user_server_attach_aio_context(&vexp->vu_server, ctx);
> > -}
> > -
> > -static void blk_aio_detach(void *opaque)
> > -{
> > -VuBlkExport *vexp = opaque;
> > -
> > -vhost_user_server_detach_aio_context(&vexp->vu_server);
> > -vexp->export.ctx = NULL;
> > -}
> 
> So for changing the AioContext, we now rely on the fact that the node to
> be changed is always drained, so the drain callbacks implicitly cover
> this case, too?

Yes.

> >  static void
> >  vu_blk_initialize_config(BlockDriverState *bs,
> >   struct virtio_blk_config *config,
> > @@ -272,6 +256,25 @@ static void vu_blk_exp_resize(void *opaque)
> >  vu_config_change_msg(&vexp->vu_server.vu_dev);
> >  }
> >  
> > +/* Called with vexp->export.ctx acquired */
> > +static void vu_blk_drained_begin(void *opaque)
> > +{
> > +VuBlkExport *vexp = opaque;
> > +
> > +vhost_user_server_detach_aio_context(&vexp->vu_server);
> > +}
> 
> Compared to the old code, we're losing the vexp->export.ctx = NULL. This
> is correct at this point because after drained_begin we still keep
> processing requests until we arrive at a quiescent state.
> 
> However, if we detach the AioContext because we're deleting the
> iothread, won't we end up with a dangling pointer in vexp->export.ctx?
> Or can we be certain that nothing interesting happens before drained_end
> updates it with a new valid pointer again?

If you want I can add the detach() callback back again and set ctx to
NULL there?

Stefan




Re: [PATCH v4 04/20] virtio-scsi: stop using aio_disable_external() during unplug

2023-05-02 Thread Stefan Hajnoczi
On Tue, May 02, 2023 at 03:19:52PM +0200, Kevin Wolf wrote:
> On 01.05.2023 at 17:09, Stefan Hajnoczi wrote:
> > On Fri, Apr 28, 2023 at 04:22:55PM +0200, Kevin Wolf wrote:
> > > On 25.04.2023 at 19:27, Stefan Hajnoczi wrote:
> > > > This patch is part of an effort to remove the aio_disable_external()
> > > > API because it does not fit in a multi-queue block layer world where
> > > > many AioContexts may be submitting requests to the same disk.
> > > > 
> > > > The SCSI emulation code is already in good shape to stop using
> > > > aio_disable_external(). It was only used by commit 9c5aad84da1c
> > > > ("virtio-scsi: fixed virtio_scsi_ctx_check failed when detaching scsi
> > > > disk") to ensure that virtio_scsi_hotunplug() works while the guest
> > > > driver is submitting I/O.
> > > > 
> > > > Ensure virtio_scsi_hotunplug() is safe as follows:
> > > > 
> > > > 1. qdev_simple_device_unplug_cb() -> qdev_unrealize() ->
> > > >device_set_realized() calls qatomic_set(&dev->realized, false) so
> > > >that future scsi_device_get() calls return NULL because they exclude
> > > >SCSIDevices with realized=false.
> > > > 
> > > >That means virtio-scsi will reject new I/O requests to this
> > > >SCSIDevice with VIRTIO_SCSI_S_BAD_TARGET even while
> > > >virtio_scsi_hotunplug() is still executing. We are protected against
> > > >new requests!
> > > > 
> > > > 2. Add a call to scsi_device_purge_requests() from scsi_unrealize() so
> > > >that in-flight requests are cancelled synchronously. This ensures
> > > >that no in-flight requests remain once qdev_simple_device_unplug_cb()
> > > >returns.
> > > > 
> > > > Thanks to these two conditions we don't need aio_disable_external()
> > > > anymore.
> > > > 
> > > > Cc: Zhengui Li 
> > > > Reviewed-by: Paolo Bonzini 
> > > > Reviewed-by: Daniil Tatianin 
> > > > Signed-off-by: Stefan Hajnoczi 
> > > 
> > > qemu-iotests 040 starts failing for me after this patch, with what looks
> > > like a use-after-free error of some kind.
> > > 
> > > (gdb) bt
> > > #0  0x55b6e3e1f31c in job_type (job=0xe3e3e3e3e3e3e3e3) at 
> > > ../job.c:238
> > > #1  0x55b6e3e1cee5 in is_block_job (job=0xe3e3e3e3e3e3e3e3) at 
> > > ../blockjob.c:41
> > > #2  0x55b6e3e1ce7d in block_job_next_locked (bjob=0x55b6e72b7570) at 
> > > ../blockjob.c:54
> > > #3  0x55b6e3df6370 in blockdev_mark_auto_del (blk=0x55b6e74af0a0) at 
> > > ../blockdev.c:157
> > > #4  0x55b6e393e23b in scsi_qdev_unrealize (qdev=0x55b6e7c04d40) at 
> > > ../hw/scsi/scsi-bus.c:303
> > > #5  0x55b6e3db0d0e in device_set_realized (obj=0x55b6e7c04d40, 
> > > value=false, errp=0x55b6e497c918 ) at ../hw/core/qdev.c:599
> > > #6  0x55b6e3dba36e in property_set_bool (obj=0x55b6e7c04d40, 
> > > v=0x55b6e7d7f290, name=0x55b6e41bd6d8 "realized", opaque=0x55b6e7246d20, 
> > > errp=0x55b6e497c918 )
> > > at ../qom/object.c:2285
> > > #7  0x55b6e3db7e65 in object_property_set (obj=0x55b6e7c04d40, 
> > > name=0x55b6e41bd6d8 "realized", v=0x55b6e7d7f290, errp=0x55b6e497c918 
> > > ) at ../qom/object.c:1420
> > > #8  0x55b6e3dbd84a in object_property_set_qobject 
> > > (obj=0x55b6e7c04d40, name=0x55b6e41bd6d8 "realized", 
> > > value=0x55b6e74c1890, errp=0x55b6e497c918 )
> > > at ../qom/qom-qobject.c:28
> > > #9  0x55b6e3db8570 in object_property_set_bool (obj=0x55b6e7c04d40, 
> > > name=0x55b6e41bd6d8 "realized", value=false, errp=0x55b6e497c918 
> > > ) at ../qom/object.c:1489
> > > #10 0x55b6e3daf2b5 in qdev_unrealize (dev=0x55b6e7c04d40) at 
> > > ../hw/core/qdev.c:306
> > > #11 0x55b6e3db509d in qdev_simple_device_unplug_cb 
> > > (hotplug_dev=0x55b6e81c3630, dev=0x55b6e7c04d40, errp=0x7ffec5519200) at 
> > > ../hw/core/qdev-hotplug.c:72
> > > #12 0x55b6e3c520f9 in virtio_scsi_hotunplug 
> > > (hotplug_dev=0x55b6e81c3630, dev=0x55b6e7c04d40, errp=0x7ffec5519200) at 
> > > ../hw/scsi/virtio-scsi.c:1065
> > > #13 0x55b6e3db4dec in hotplug_handler_unplug 
> > > (plug_handler=0x55b6e81c3630, plugged_dev=0x55b6e7c04d40, 
> > > errp=0x7ffec5519200) at ../hw/core/hotplug.c:

Re: [PATCH v4 06/20] block/export: wait for vhost-user-blk requests when draining

2023-05-02 Thread Stefan Hajnoczi
On Tue, May 02, 2023 at 05:42:51PM +0200, Kevin Wolf wrote:
> On 25.04.2023 at 19:27, Stefan Hajnoczi wrote:
> > Each vhost-user-blk request runs in a coroutine. When the BlockBackend
> > enters a drained section we need to enter a quiescent state. Currently
> > any in-flight requests race with bdrv_drained_begin() because it is
> > unaware of vhost-user-blk requests.
> > 
> > When blk_co_preadv/pwritev()/etc returns it wakes the
> > bdrv_drained_begin() thread but vhost-user-blk request processing has
> > not yet finished. The request coroutine continues executing while the
> > main loop thread thinks it is in a drained section.
> > 
> > One example where this is unsafe is for blk_set_aio_context() where
> > bdrv_drained_begin() is called before .aio_context_detached() and
> > .aio_context_attach(). If request coroutines are still running after
> > bdrv_drained_begin(), then the AioContext could change underneath them
> > and they race with new requests processed in the new AioContext. This
> > could lead to virtqueue corruption, for example.
> > 
> > (This example is theoretical, I came across this while reading the
> > code and have not tried to reproduce it.)
> > 
> > It's easy to make bdrv_drained_begin() wait for in-flight requests: add
> > a .drained_poll() callback that checks the VuServer's in-flight counter.
> > VuServer just needs an API that returns true when there are requests in
> > flight. The in-flight counter needs to be atomic.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >  include/qemu/vhost-user-server.h |  4 +++-
> >  block/export/vhost-user-blk-server.c | 16 
> >  util/vhost-user-server.c | 14 ++
> >  3 files changed, 29 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/qemu/vhost-user-server.h 
> > b/include/qemu/vhost-user-server.h
> > index bc0ac9ddb6..b1c1cda886 100644
> > --- a/include/qemu/vhost-user-server.h
> > +++ b/include/qemu/vhost-user-server.h
> > @@ -40,8 +40,9 @@ typedef struct {
> >  int max_queues;
> >  const VuDevIface *vu_iface;
> >  
> > +unsigned int in_flight; /* atomic */
> > +
> >  /* Protected by ctx lock */
> > -unsigned int in_flight;
> >  bool wait_idle;
> >  VuDev vu_dev;
> >  QIOChannel *ioc; /* The I/O channel with the client */
> > @@ -62,6 +63,7 @@ void vhost_user_server_stop(VuServer *server);
> >  
> >  void vhost_user_server_inc_in_flight(VuServer *server);
> >  void vhost_user_server_dec_in_flight(VuServer *server);
> > +bool vhost_user_server_has_in_flight(VuServer *server);
> >  
> >  void vhost_user_server_attach_aio_context(VuServer *server, AioContext 
> > *ctx);
> >  void vhost_user_server_detach_aio_context(VuServer *server);
> > diff --git a/block/export/vhost-user-blk-server.c 
> > b/block/export/vhost-user-blk-server.c
> > index 841acb36e3..092b86aae4 100644
> > --- a/block/export/vhost-user-blk-server.c
> > +++ b/block/export/vhost-user-blk-server.c
> > @@ -272,7 +272,20 @@ static void vu_blk_exp_resize(void *opaque)
> >  vu_config_change_msg(&vexp->vu_server.vu_dev);
> >  }
> >  
> > +/*
> > + * Ensures that bdrv_drained_begin() waits until in-flight requests 
> > complete.
> > + *
> > + * Called with vexp->export.ctx acquired.
> > + */
> > +static bool vu_blk_drained_poll(void *opaque)
> > +{
> > +VuBlkExport *vexp = opaque;
> > +
> > +return vhost_user_server_has_in_flight(&vexp->vu_server);
> > +}
> > +
> >  static const BlockDevOps vu_blk_dev_ops = {
> > +.drained_poll  = vu_blk_drained_poll,
> >  .resize_cb = vu_blk_exp_resize,
> >  };
> 
> You're adding a new function pointer to an existing BlockDevOps...
> 
> > @@ -314,6 +327,7 @@ static int vu_blk_exp_create(BlockExport *exp, 
> > BlockExportOptions *opts,
> >  vu_blk_initialize_config(blk_bs(exp->blk), &vexp->blkcfg,
> >   logical_block_size, num_queues);
> >  
> > +blk_set_dev_ops(exp->blk, &vu_blk_dev_ops, vexp);
> >  blk_add_aio_context_notifier(exp->blk, blk_aio_attached, 
> > blk_aio_detach,
> >   vexp);
> >  
> >  blk_set_dev_ops(exp->blk, &vu_blk_dev_ops, vexp);
> 
> ..but still add a second blk_set_dev_ops(). Maybe a bad merge conflict
> resolution with commit ca858a5fe94?

Thanks, I probably didn't have ca858a5fe94 in my tree w
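
For reference, the VuServer helper itself amounts to a single atomic read (a
sketch, assuming the atomic in_flight counter from the header diff above; the
actual hunk in util/vhost-user-server.c may differ):

    bool vhost_user_server_has_in_flight(VuServer *server)
    {
        return qatomic_read(&server->in_flight) > 0;
    }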

Re: [PATCH v4 03/20] virtio-scsi: avoid race between unplug and transport event

2023-05-02 Thread Stefan Hajnoczi
On Tue, May 02, 2023 at 05:19:46PM +0200, Kevin Wolf wrote:
> Am 25.04.2023 um 19:26 hat Stefan Hajnoczi geschrieben:
> > Only report a transport reset event to the guest after the SCSIDevice
> > has been unrealized by qdev_simple_device_unplug_cb().
> > 
> > qdev_simple_device_unplug_cb() sets the SCSIDevice's qdev.realized field
> > to false so that scsi_device_find/get() no longer see it.
> > 
> > scsi_target_emulate_report_luns() also needs to be updated to filter out
> > SCSIDevices that are unrealized.
> > 
> > These changes ensure that the guest driver does not see the SCSIDevice
> > that's being unplugged if it responds very quickly to the transport
> > reset event.
> > 
> > Reviewed-by: Paolo Bonzini 
> > Reviewed-by: Michael S. Tsirkin 
> > Reviewed-by: Daniil Tatianin 
> > Signed-off-by: Stefan Hajnoczi 
> 
> > @@ -1082,6 +1073,15 @@ static void virtio_scsi_hotunplug(HotplugHandler 
> > *hotplug_dev, DeviceState *dev,
> >  blk_set_aio_context(sd->conf.blk, qemu_get_aio_context(), NULL);
> >  virtio_scsi_release(s);
> >  }
> > +
> > +if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) {
> > +virtio_scsi_acquire(s);
> > +virtio_scsi_push_event(s, sd,
> > +   VIRTIO_SCSI_T_TRANSPORT_RESET,
> > +   VIRTIO_SCSI_EVT_RESET_REMOVED);
> > +scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED));
> > +virtio_scsi_release(s);
> > +}
> >  }
> 
> s, sd and s->bus are all unrealized at this point, whereas before this
> patch they were still realized. I couldn't find any practical problem
> with it, but it made me nervous enough that I thought I should comment
> on it at least.
> 
> Should we maybe have documentation on these functions that says that
> they accept unrealized objects as their parameters?

s is the VirtIOSCSI controller, not the SCSIDevice that is being
unplugged. The VirtIOSCSI controller is still realized.

s->bus is the VirtIOSCSI controller's bus; it is still realized.

You are right that the SCSIDevice (sd) has been unrealized at this
point:
- sd->conf.blk is safe because qdev properties stay alive until the
  Object is deleted, but I'm not sure we should rely on that.
- virtio_scsi_push_event(.., sd, ...) is questionable because the LUN
  that's fetched from sd no longer belongs to the unplugged SCSIDevice.

How about I change the code to fetch sd->conf.blk and the LUN before
unplugging?
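
Something along these lines inside virtio_scsi_hotunplug() (a hypothetical
sketch, not a tested patch; virtio_scsi_push_event_lun() is an invented
variant that takes the saved LUN instead of the SCSIDevice):

    /* Fetch everything we need while sd is still realized */
    BlockBackend *blk = sd->conf.blk; /* replaces later sd->conf.blk uses */
    uint32_t lun = sd->lun;

    qdev_simple_device_unplug_cb(hotplug_dev, dev, errp);

    if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) {
        virtio_scsi_acquire(s);
        virtio_scsi_push_event_lun(s, lun,
                                   VIRTIO_SCSI_T_TRANSPORT_RESET,
                                   VIRTIO_SCSI_EVT_RESET_REMOVED);
        scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED));
        virtio_scsi_release(s);
    }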

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v4 04/20] virtio-scsi: stop using aio_disable_external() during unplug

2023-05-01 Thread Stefan Hajnoczi
On Fri, Apr 28, 2023 at 04:22:55PM +0200, Kevin Wolf wrote:
> Am 25.04.2023 um 19:27 hat Stefan Hajnoczi geschrieben:
> > This patch is part of an effort to remove the aio_disable_external()
> > API because it does not fit in a multi-queue block layer world where
> > many AioContexts may be submitting requests to the same disk.
> > 
> > The SCSI emulation code is already in good shape to stop using
> > aio_disable_external(). It was only used by commit 9c5aad84da1c
> > ("virtio-scsi: fixed virtio_scsi_ctx_check failed when detaching scsi
> > disk") to ensure that virtio_scsi_hotunplug() works while the guest
> > driver is submitting I/O.
> > 
> > Ensure virtio_scsi_hotunplug() is safe as follows:
> > 
> > 1. qdev_simple_device_unplug_cb() -> qdev_unrealize() ->
> >device_set_realized() calls qatomic_set(&dev->realized, false) so
> >that future scsi_device_get() calls return NULL because they exclude
> >SCSIDevices with realized=false.
> > 
> >That means virtio-scsi will reject new I/O requests to this
> >SCSIDevice with VIRTIO_SCSI_S_BAD_TARGET even while
> >virtio_scsi_hotunplug() is still executing. We are protected against
> >new requests!
> > 
> > 2. Add a call to scsi_device_purge_requests() from scsi_unrealize() so
> >that in-flight requests are cancelled synchronously. This ensures
> >that no in-flight requests remain once qdev_simple_device_unplug_cb()
> >returns.
> > 
> > Thanks to these two conditions we don't need aio_disable_external()
> > anymore.
> > 
> > Cc: Zhengui Li 
> > Reviewed-by: Paolo Bonzini 
> > Reviewed-by: Daniil Tatianin 
> > Signed-off-by: Stefan Hajnoczi 
> 
> qemu-iotests 040 starts failing for me after this patch, with what looks
> like a use-after-free error of some kind.
> 
> (gdb) bt
> #0  0x55b6e3e1f31c in job_type (job=0xe3e3e3e3e3e3e3e3) at ../job.c:238
> #1  0x55b6e3e1cee5 in is_block_job (job=0xe3e3e3e3e3e3e3e3) at 
> ../blockjob.c:41
> #2  0x55b6e3e1ce7d in block_job_next_locked (bjob=0x55b6e72b7570) at 
> ../blockjob.c:54
> #3  0x55b6e3df6370 in blockdev_mark_auto_del (blk=0x55b6e74af0a0) at 
> ../blockdev.c:157
> #4  0x55b6e393e23b in scsi_qdev_unrealize (qdev=0x55b6e7c04d40) at 
> ../hw/scsi/scsi-bus.c:303
> #5  0x55b6e3db0d0e in device_set_realized (obj=0x55b6e7c04d40, 
> value=false, errp=0x55b6e497c918 ) at ../hw/core/qdev.c:599
> #6  0x55b6e3dba36e in property_set_bool (obj=0x55b6e7c04d40, 
> v=0x55b6e7d7f290, name=0x55b6e41bd6d8 "realized", opaque=0x55b6e7246d20, 
> errp=0x55b6e497c918 )
> at ../qom/object.c:2285
> #7  0x55b6e3db7e65 in object_property_set (obj=0x55b6e7c04d40, 
> name=0x55b6e41bd6d8 "realized", v=0x55b6e7d7f290, errp=0x55b6e497c918 
> ) at ../qom/object.c:1420
> #8  0x55b6e3dbd84a in object_property_set_qobject (obj=0x55b6e7c04d40, 
> name=0x55b6e41bd6d8 "realized", value=0x55b6e74c1890, errp=0x55b6e497c918 
> )
> at ../qom/qom-qobject.c:28
> #9  0x55b6e3db8570 in object_property_set_bool (obj=0x55b6e7c04d40, 
> name=0x55b6e41bd6d8 "realized", value=false, errp=0x55b6e497c918 
> ) at ../qom/object.c:1489
> #10 0x55b6e3daf2b5 in qdev_unrealize (dev=0x55b6e7c04d40) at 
> ../hw/core/qdev.c:306
> #11 0x55b6e3db509d in qdev_simple_device_unplug_cb 
> (hotplug_dev=0x55b6e81c3630, dev=0x55b6e7c04d40, errp=0x7ffec5519200) at 
> ../hw/core/qdev-hotplug.c:72
> #12 0x55b6e3c520f9 in virtio_scsi_hotunplug (hotplug_dev=0x55b6e81c3630, 
> dev=0x55b6e7c04d40, errp=0x7ffec5519200) at ../hw/scsi/virtio-scsi.c:1065
> #13 0x55b6e3db4dec in hotplug_handler_unplug 
> (plug_handler=0x55b6e81c3630, plugged_dev=0x55b6e7c04d40, 
> errp=0x7ffec5519200) at ../hw/core/hotplug.c:56
> #14 0x55b6e3a28f84 in qdev_unplug (dev=0x55b6e7c04d40, 
> errp=0x7ffec55192e0) at ../softmmu/qdev-monitor.c:935
> #15 0x55b6e3a290fa in qmp_device_del (id=0x55b6e74c1760 "scsi0", 
> errp=0x7ffec55192e0) at ../softmmu/qdev-monitor.c:955
> #16 0x55b6e3fb0a5f in qmp_marshal_device_del (args=0x7f61cc005eb0, 
> ret=0x7f61d5a8ae38, errp=0x7f61d5a8ae40) at qapi/qapi-commands-qdev.c:114
> #17 0x55b6e3fd52e1 in do_qmp_dispatch_bh (opaque=0x7f61d5a8ae08) at 
> ../qapi/qmp-dispatch.c:128
> #18 0x55b6e4007b9e in aio_bh_call (bh=0x55b6e7dea730) at 
> ../util/async.c:155
> #19 0x55b6e4007d2e in aio_bh_poll (ctx=0x55b6e72447c0) at 
> ../util/async.c:184
> #20 0x55b6e3fe3b45 in aio_dispatch (ctx=0x55b6e72447c0) at 
> ../util/aio-posix.c:421
> #21 0x55b6e4009544 in aio_ctx_dispatch (source=0x55b6e72447c0, 
> callback=0x0, user_d

[PATCH v4 19/20] virtio: do not set is_external=true on host notifiers

2023-04-25 Thread Stefan Hajnoczi
Host notifiers can now use is_external=false since virtio-blk and
virtio-scsi no longer rely on is_external=true for drained sections.

Signed-off-by: Stefan Hajnoczi 
---
 hw/virtio/virtio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 272d930721..9cdad7e550 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3491,7 +3491,7 @@ static void 
virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
 
 void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
-aio_set_event_notifier(ctx, &vq->host_notifier, true,
+aio_set_event_notifier(ctx, &vq->host_notifier, false,
virtio_queue_host_notifier_read,
virtio_queue_host_notifier_aio_poll,
virtio_queue_host_notifier_aio_poll_ready);
@@ -3508,14 +3508,14 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue 
*vq, AioContext *ctx)
  */
 void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext 
*ctx)
 {
-aio_set_event_notifier(ctx, &vq->host_notifier, true,
+aio_set_event_notifier(ctx, &vq->host_notifier, false,
virtio_queue_host_notifier_read,
NULL, NULL);
 }
 
 void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
-aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL, NULL, NULL);
+aio_set_event_notifier(ctx, &vq->host_notifier, false, NULL, NULL, NULL);
 /* Test and clear notifier before after disabling event,
  * in case poll callback didn't have time to run. */
 virtio_queue_host_notifier_read(&vq->host_notifier);
-- 
2.39.2




[PATCH v4 16/20] virtio: make it possible to detach host notifier from any thread

2023-04-25 Thread Stefan Hajnoczi
virtio_queue_aio_detach_host_notifier() does two things:
1. It removes the fd handler from the event loop.
2. It processes the virtqueue one last time.

The first step can be performed by any thread, without taking the
AioContext lock.

The second step may need the AioContext lock (depending on the device
implementation) and runs in the thread where request processing takes
place. virtio-blk and virtio-scsi therefore call
virtio_queue_aio_detach_host_notifier() from a BH that is scheduled in that
AioContext.

Scheduling a BH is undesirable for .drained_begin() functions. The next
patch will introduce a .drained_begin() function that needs to call
virtio_queue_aio_detach_host_notifier().

Move the virtqueue processing out to the callers of
virtio_queue_aio_detach_host_notifier() so that the function can be
called from any thread. This is in preparation for the next patch.

Signed-off-by: Stefan Hajnoczi 
---
 hw/block/dataplane/virtio-blk.c | 2 ++
 hw/scsi/virtio-scsi-dataplane.c | 9 +
 2 files changed, 11 insertions(+)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index b28d81737e..bd7cc6e76b 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -286,8 +286,10 @@ static void virtio_blk_data_plane_stop_bh(void *opaque)
 
 for (i = 0; i < s->conf->num_queues; i++) {
 VirtQueue *vq = virtio_get_queue(s->vdev, i);
+EventNotifier *host_notifier = virtio_queue_get_host_notifier(vq);
 
 virtio_queue_aio_detach_host_notifier(vq, s->ctx);
+virtio_queue_host_notifier_read(host_notifier);
 }
 }
 
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 20bb91766e..81643445ed 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -71,12 +71,21 @@ static void virtio_scsi_dataplane_stop_bh(void *opaque)
 {
 VirtIOSCSI *s = opaque;
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
+EventNotifier *host_notifier;
 int i;
 
 virtio_queue_aio_detach_host_notifier(vs->ctrl_vq, s->ctx);
+host_notifier = virtio_queue_get_host_notifier(vs->ctrl_vq);
+virtio_queue_host_notifier_read(host_notifier);
+
 virtio_queue_aio_detach_host_notifier(vs->event_vq, s->ctx);
+host_notifier = virtio_queue_get_host_notifier(vs->event_vq);
+virtio_queue_host_notifier_read(host_notifier);
+
 for (i = 0; i < vs->conf.num_queues; i++) {
 virtio_queue_aio_detach_host_notifier(vs->cmd_vqs[i], s->ctx);
+host_notifier = virtio_queue_get_host_notifier(vs->cmd_vqs[i]);
+virtio_queue_host_notifier_read(host_notifier);
 }
 }
 
-- 
2.39.2




[PATCH v4 20/20] aio: remove aio_disable_external() API

2023-04-25 Thread Stefan Hajnoczi
All callers now pass is_external=false to aio_set_fd_handler() and
aio_set_event_notifier(). The aio_disable_external() API that
temporarily disables fd handlers that were registered is_external=true
is therefore dead code.

Remove aio_disable_external(), aio_enable_external(), and the
is_external arguments to aio_set_fd_handler() and
aio_set_event_notifier().

The entire test-fdmon-epoll test is removed because its sole purpose was
testing aio_disable_external().

Parts of this patch were generated using the following coccinelle
(https://coccinelle.lip6.fr/) semantic patch:

  @@
  expression ctx, fd, is_external, io_read, io_write, io_poll, io_poll_ready, 
opaque;
  @@
  - aio_set_fd_handler(ctx, fd, is_external, io_read, io_write, io_poll, 
io_poll_ready, opaque)
  + aio_set_fd_handler(ctx, fd, io_read, io_write, io_poll, io_poll_ready, 
opaque)

  @@
  expression ctx, notifier, is_external, io_read, io_poll, io_poll_ready;
  @@
  - aio_set_event_notifier(ctx, notifier, is_external, io_read, io_poll, 
io_poll_ready)
  + aio_set_event_notifier(ctx, notifier, io_read, io_poll, io_poll_ready)

Reviewed-by: Juan Quintela 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Stefan Hajnoczi 
---
 include/block/aio.h   | 57 ---
 util/aio-posix.h  |  1 -
 block.c   |  7 
 block/blkio.c | 15 +++
 block/curl.c  | 10 ++---
 block/export/fuse.c   |  8 ++--
 block/export/vduse-blk.c  | 10 ++---
 block/io.c|  2 -
 block/io_uring.c  |  4 +-
 block/iscsi.c |  3 +-
 block/linux-aio.c |  4 +-
 block/nfs.c   |  5 +--
 block/nvme.c  |  8 ++--
 block/ssh.c   |  4 +-
 block/win32-aio.c |  6 +--
 hw/i386/kvm/xen_xenstore.c|  2 +-
 hw/virtio/virtio.c|  6 +--
 hw/xen/xen-bus.c  |  8 ++--
 io/channel-command.c  |  6 +--
 io/channel-file.c |  3 +-
 io/channel-socket.c   |  3 +-
 migration/rdma.c  | 16 
 tests/unit/test-aio.c | 27 +
 tests/unit/test-bdrv-drain.c  |  1 -
 tests/unit/test-fdmon-epoll.c | 73 ---
 util/aio-posix.c  | 20 +++---
 util/aio-win32.c  |  8 +---
 util/async.c  |  3 +-
 util/fdmon-epoll.c| 18 +++--
 util/fdmon-io_uring.c |  8 +---
 util/fdmon-poll.c |  3 +-
 util/main-loop.c  |  7 ++--
 util/qemu-coroutine-io.c  |  7 ++--
 util/vhost-user-server.c  | 11 +++---
 tests/unit/meson.build|  3 --
 35 files changed, 82 insertions(+), 295 deletions(-)
 delete mode 100644 tests/unit/test-fdmon-epoll.c

diff --git a/include/block/aio.h b/include/block/aio.h
index 543717f294..bb38f0753f 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -231,8 +231,6 @@ struct AioContext {
  */
 QEMUTimerListGroup tlg;
 
-int external_disable_cnt;
-
 /* Number of AioHandlers without .io_poll() */
 int poll_disable_cnt;
 
@@ -475,7 +473,6 @@ bool aio_poll(AioContext *ctx, bool blocking);
  */
 void aio_set_fd_handler(AioContext *ctx,
 int fd,
-bool is_external,
 IOHandler *io_read,
 IOHandler *io_write,
 AioPollFn *io_poll,
@@ -491,7 +488,6 @@ void aio_set_fd_handler(AioContext *ctx,
  */
 void aio_set_event_notifier(AioContext *ctx,
 EventNotifier *notifier,
-bool is_external,
 EventNotifierHandler *io_read,
 AioPollFn *io_poll,
 EventNotifierHandler *io_poll_ready);
@@ -620,59 +616,6 @@ static inline void aio_timer_init(AioContext *ctx,
  */
 int64_t aio_compute_timeout(AioContext *ctx);
 
-/**
- * aio_disable_external:
- * @ctx: the aio context
- *
- * Disable the further processing of external clients.
- */
-static inline void aio_disable_external(AioContext *ctx)
-{
-qatomic_inc(&ctx->external_disable_cnt);
-}
-
-/**
- * aio_enable_external:
- * @ctx: the aio context
- *
- * Enable the processing of external clients.
- */
-static inline void aio_enable_external(AioContext *ctx)
-{
-int old;
-
-old = qatomic_fetch_dec(&ctx->external_disable_cnt);
-assert(old > 0);
-if (old == 1) {
-/* Kick event loop so it re-arms file descriptors */
-aio_notify(ctx);
-}
-}
-
-/**
- * aio_external_disabled:
- * @ctx: the aio context
- *
- * Return true if the external clients are disabled.
- */
-static inline bool aio_external_disabled(AioContext *ctx)
-{
-return qatomic_read(&ctx->external_disable_cnt);
-}
-
-/**
- * aio_node_check:
- * @ctx: the aio context
- * @is_external: Whether or not the checked node is an external event sour

[PATCH v4 14/20] block/export: don't require AioContext lock around blk_exp_ref/unref()

2023-04-25 Thread Stefan Hajnoczi
The FUSE export calls blk_exp_ref/unref() without the AioContext lock.
Instead of fixing the FUSE export, adjust blk_exp_ref/unref() so they
work without the AioContext lock. This way it's less error-prone.

Suggested-by: Paolo Bonzini 
Signed-off-by: Stefan Hajnoczi 
---
 include/block/export.h   |  2 ++
 block/export/export.c| 13 ++---
 block/export/vduse-blk.c |  4 
 3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/block/export.h b/include/block/export.h
index 7feb02e10d..f2fe0f8078 100644
--- a/include/block/export.h
+++ b/include/block/export.h
@@ -57,6 +57,8 @@ struct BlockExport {
  * Reference count for this block export. This includes strong references
  * both from the owner (qemu-nbd or the monitor) and clients connected to
  * the export.
+ *
+ * Use atomics to access this field.
  */
 int refcount;
 
diff --git a/block/export/export.c b/block/export/export.c
index 28a91c9c42..ddaf8036e5 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -201,11 +201,10 @@ fail:
 return NULL;
 }
 
-/* Callers must hold exp->ctx lock */
 void blk_exp_ref(BlockExport *exp)
 {
-assert(exp->refcount > 0);
-exp->refcount++;
+assert(qatomic_read(&exp->refcount) > 0);
+qatomic_inc(&exp->refcount);
 }
 
 /* Runs in the main thread */
@@ -227,11 +226,10 @@ static void blk_exp_delete_bh(void *opaque)
 aio_context_release(aio_context);
 }
 
-/* Callers must hold exp->ctx lock */
 void blk_exp_unref(BlockExport *exp)
 {
-assert(exp->refcount > 0);
-if (--exp->refcount == 0) {
+assert(qatomic_read(&exp->refcount) > 0);
+if (qatomic_fetch_dec(&exp->refcount) == 1) {
 /* Touch the block_exports list only in the main thread */
 aio_bh_schedule_oneshot(qemu_get_aio_context(), blk_exp_delete_bh,
 exp);
@@ -339,7 +337,8 @@ void qmp_block_export_del(const char *id,
 if (!has_mode) {
 mode = BLOCK_EXPORT_REMOVE_MODE_SAFE;
 }
-if (mode == BLOCK_EXPORT_REMOVE_MODE_SAFE && exp->refcount > 1) {
+if (mode == BLOCK_EXPORT_REMOVE_MODE_SAFE &&
+qatomic_read(&exp->refcount) > 1) {
 error_setg(errp, "export '%s' still in use", exp->id);
 error_append_hint(errp, "Use mode='hard' to force client "
   "disconnect\n");
diff --git a/block/export/vduse-blk.c b/block/export/vduse-blk.c
index 35dc8fcf45..611430afda 100644
--- a/block/export/vduse-blk.c
+++ b/block/export/vduse-blk.c
@@ -44,9 +44,7 @@ static void vduse_blk_inflight_inc(VduseBlkExport *vblk_exp)
 {
 if (qatomic_fetch_inc(&vblk_exp->inflight) == 0) {
 /* Prevent export from being deleted */
-aio_context_acquire(vblk_exp->export.ctx);
 blk_exp_ref(&vblk_exp->export);
-aio_context_release(vblk_exp->export.ctx);
 }
 }
 
@@ -57,9 +55,7 @@ static void vduse_blk_inflight_dec(VduseBlkExport *vblk_exp)
 aio_wait_kick();
 
 /* Now the export can be deleted */
-aio_context_acquire(vblk_exp->export.ctx);
 blk_exp_unref(&vblk_exp->export);
-aio_context_release(vblk_exp->export.ctx);
 }
 }
 
-- 
2.39.2




[PATCH v4 17/20] virtio-blk: implement BlockDevOps->drained_begin()

2023-04-25 Thread Stefan Hajnoczi
Detach ioeventfds during drained sections to stop I/O submission from
the guest. virtio-blk is no longer reliant on aio_disable_external()
after this patch. This will allow us to remove the
aio_disable_external() API once all other code that relies on it is
converted.

Take extra care to avoid attaching/detaching ioeventfds if the data
plane is started/stopped during a drained section. This should be rare,
but maybe the mirror block job can trigger it.

Signed-off-by: Stefan Hajnoczi 
---
 hw/block/dataplane/virtio-blk.c | 17 +--
 hw/block/virtio-blk.c   | 38 -
 2 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index bd7cc6e76b..d77fc6028c 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -245,13 +245,15 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 }
 
 /* Get this show started by hooking up our callbacks */
-aio_context_acquire(s->ctx);
-for (i = 0; i < nvqs; i++) {
-VirtQueue *vq = virtio_get_queue(s->vdev, i);
+if (!blk_in_drain(s->conf->conf.blk)) {
+aio_context_acquire(s->ctx);
+for (i = 0; i < nvqs; i++) {
+VirtQueue *vq = virtio_get_queue(s->vdev, i);
 
-virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+}
+aio_context_release(s->ctx);
 }
-aio_context_release(s->ctx);
 return 0;
 
   fail_aio_context:
@@ -317,7 +319,10 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 trace_virtio_blk_data_plane_stop(s);
 
 aio_context_acquire(s->ctx);
-aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
+
+if (!blk_in_drain(s->conf->conf.blk)) {
+aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
+}
 
 /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */
 blk_drain(s->conf->conf.blk);
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index cefca93b31..d8dedc575c 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1109,8 +1109,44 @@ static void virtio_blk_resize(void *opaque)
 aio_bh_schedule_oneshot(qemu_get_aio_context(), virtio_resize_cb, vdev);
 }
 
+/* Suspend virtqueue ioeventfd processing during drain */
+static void virtio_blk_drained_begin(void *opaque)
+{
+VirtIOBlock *s = opaque;
+VirtIODevice *vdev = VIRTIO_DEVICE(opaque);
+AioContext *ctx = blk_get_aio_context(s->conf.conf.blk);
+
+if (!s->dataplane || !s->dataplane_started) {
+return;
+}
+
+for (uint16_t i = 0; i < s->conf.num_queues; i++) {
+VirtQueue *vq = virtio_get_queue(vdev, i);
+virtio_queue_aio_detach_host_notifier(vq, ctx);
+}
+}
+
+/* Resume virtqueue ioeventfd processing after drain */
+static void virtio_blk_drained_end(void *opaque)
+{
+VirtIOBlock *s = opaque;
+VirtIODevice *vdev = VIRTIO_DEVICE(opaque);
+AioContext *ctx = blk_get_aio_context(s->conf.conf.blk);
+
+if (!s->dataplane || !s->dataplane_started) {
+return;
+}
+
+for (uint16_t i = 0; i < s->conf.num_queues; i++) {
+VirtQueue *vq = virtio_get_queue(vdev, i);
+virtio_queue_aio_attach_host_notifier(vq, ctx);
+}
+}
+
 static const BlockDevOps virtio_block_ops = {
-.resize_cb = virtio_blk_resize,
+.resize_cb = virtio_blk_resize,
+.drained_begin = virtio_blk_drained_begin,
+.drained_end   = virtio_blk_drained_end,
 };
 
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
-- 
2.39.2




[PATCH v4 18/20] virtio-scsi: implement BlockDevOps->drained_begin()

2023-04-25 Thread Stefan Hajnoczi
The virtio-scsi Host Bus Adapter provides access to devices on a SCSI
bus. Those SCSI devices typically have a BlockBackend. When the
BlockBackend enters a drained section, the SCSI device must temporarily
stop submitting new I/O requests.

Implement this behavior by temporarily stopping virtio-scsi virtqueue
processing when one of the SCSI devices enters a drained section. The
new scsi_device_drained_begin() API allows scsi-disk to message the
virtio-scsi HBA.

scsi_device_drained_begin() uses a drain counter so that multiple SCSI
devices can have overlapping drained sections. The HBA only sees one
pair of .drained_begin/end() calls.

After this commit, virtio-scsi no longer depends on hw/virtio's
ioeventfd aio_set_event_notifier(is_external=true). This commit is a
step towards removing the aio_disable_external() API.

Signed-off-by: Stefan Hajnoczi 
---
 include/hw/scsi/scsi.h  | 14 
 hw/scsi/scsi-bus.c  | 40 +
 hw/scsi/scsi-disk.c | 27 +-
 hw/scsi/virtio-scsi-dataplane.c | 22 ++
 hw/scsi/virtio-scsi.c   | 38 +++
 hw/scsi/trace-events|  2 ++
 6 files changed, 129 insertions(+), 14 deletions(-)

diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 6f23a7a73e..e2bb1a2fbf 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -133,6 +133,16 @@ struct SCSIBusInfo {
 void (*save_request)(QEMUFile *f, SCSIRequest *req);
 void *(*load_request)(QEMUFile *f, SCSIRequest *req);
 void (*free_request)(SCSIBus *bus, void *priv);
+
+/*
+ * Temporarily stop submitting new requests between drained_begin() and
+ * drained_end(). Called from the main loop thread with the BQL held.
+ *
+ * Implement these callbacks if request processing is triggered by a file
+ * descriptor like an EventNotifier. Otherwise set them to NULL.
+ */
+void (*drained_begin)(SCSIBus *bus);
+void (*drained_end)(SCSIBus *bus);
 };
 
 #define TYPE_SCSI_BUS "SCSI"
@@ -144,6 +154,8 @@ struct SCSIBus {
 
 SCSISense unit_attention;
 const SCSIBusInfo *info;
+
+int drain_count; /* protected by BQL */
 };
 
 /**
@@ -213,6 +225,8 @@ void scsi_req_cancel_complete(SCSIRequest *req);
 void scsi_req_cancel(SCSIRequest *req);
 void scsi_req_cancel_async(SCSIRequest *req, Notifier *notifier);
 void scsi_req_retry(SCSIRequest *req);
+void scsi_device_drained_begin(SCSIDevice *sdev);
+void scsi_device_drained_end(SCSIDevice *sdev);
 void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense);
 void scsi_device_set_ua(SCSIDevice *sdev, SCSISense sense);
 void scsi_device_report_change(SCSIDevice *dev, SCSISense sense);
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 64d7311757..b571fdf895 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -1668,6 +1668,46 @@ void scsi_device_purge_requests(SCSIDevice *sdev, 
SCSISense sense)
 scsi_device_set_ua(sdev, sense);
 }
 
+void scsi_device_drained_begin(SCSIDevice *sdev)
+{
+SCSIBus *bus = DO_UPCAST(SCSIBus, qbus, sdev->qdev.parent_bus);
+if (!bus) {
+return;
+}
+
+assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+assert(bus->drain_count < INT_MAX);
+
+/*
+ * Multiple BlockBackends can be on a SCSIBus and each may begin/end
+ * draining at any time. Keep a counter so HBAs only see begin/end once.
+ */
+if (bus->drain_count++ == 0) {
+trace_scsi_bus_drained_begin(bus, sdev);
+if (bus->info->drained_begin) {
+bus->info->drained_begin(bus);
+}
+}
+}
+
+void scsi_device_drained_end(SCSIDevice *sdev)
+{
+SCSIBus *bus = DO_UPCAST(SCSIBus, qbus, sdev->qdev.parent_bus);
+if (!bus) {
+return;
+}
+
+assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+assert(bus->drain_count > 0);
+
+if (bus->drain_count-- == 1) {
+trace_scsi_bus_drained_end(bus, sdev);
+if (bus->info->drained_end) {
+bus->info->drained_end(bus);
+}
+}
+}
+
 static char *scsibus_get_dev_path(DeviceState *dev)
 {
 SCSIDevice *d = SCSI_DEVICE(dev);
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index e01bd84541..2249087d6a 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2360,6 +2360,20 @@ static void scsi_disk_reset(DeviceState *dev)
 s->qdev.scsi_version = s->qdev.default_scsi_version;
 }
 
+static void scsi_disk_drained_begin(void *opaque)
+{
+SCSIDiskState *s = opaque;
+
+scsi_device_drained_begin(&s->qdev);
+}
+
+static void scsi_disk_drained_end(void *opaque)
+{
+SCSIDiskState *s = opaque;
+
+scsi_device_drained_end(&s->qdev);
+}
+
 static void scsi_disk_resize_cb(void *opaque)
 {
 SCSIDiskState *s = opaque;
@@ -2414,16 +2428,19 @@ static bool scsi_cd_is_mediu
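
The virtio-scsi side mirrors virtio-blk's drained handlers: the new
SCSIBusInfo .drained_begin/end() callbacks detach and re-attach the host
notifiers for every virtqueue. A minimal sketch, assuming the virtqueue
fields used elsewhere in this series (the real patch also accounts for
dataplane state):

    static void virtio_scsi_drained_begin(SCSIBus *bus)
    {
        VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus);
        VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);

        virtio_queue_aio_detach_host_notifier(vs->ctrl_vq, s->ctx);
        virtio_queue_aio_detach_host_notifier(vs->event_vq, s->ctx);
        for (uint32_t i = 0; i < vs->conf.num_queues; i++) {
            virtio_queue_aio_detach_host_notifier(vs->cmd_vqs[i], s->ctx);
        }
    }

    static void virtio_scsi_drained_end(SCSIBus *bus)
    {
        VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus);
        VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);

        virtio_queue_aio_attach_host_notifier(vs->ctrl_vq, s->ctx);
        virtio_queue_aio_attach_host_notifier(vs->event_vq, s->ctx);
        for (uint32_t i = 0; i < vs->conf.num_queues; i++) {
            virtio_queue_aio_attach_host_notifier(vs->cmd_vqs[i], s->ctx);
        }
    }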

[PATCH v4 13/20] block/export: rewrite vduse-blk drain code

2023-04-25 Thread Stefan Hajnoczi
vduse_blk_detach_ctx() waits for in-flight requests using
AIO_WAIT_WHILE(). This is not allowed according to a comment in
bdrv_set_aio_context_commit():

  /*
   * Take the old AioContex when detaching it from bs.
   * At this point, new_context lock is already acquired, and we are now
   * also taking old_context. This is safe as long as bdrv_detach_aio_context
   * does not call AIO_POLL_WHILE().
   */

Use this opportunity to rewrite the drain code in vduse-blk:

- Use the BlockExport refcount so that vduse_blk_exp_delete() is only
  called when there are no more requests in flight.

- Implement .drained_poll() so in-flight request coroutines are stopped
  by the time .bdrv_detach_aio_context() is called.

- Remove AIO_WAIT_WHILE() from vduse_blk_detach_ctx() to solve the
  .bdrv_detach_aio_context() constraint violation. It's no longer
  needed due to the previous changes.

- Always handle the VDUSE file descriptor, even in drained sections. The
  VDUSE file descriptor doesn't submit I/O, so it's safe to handle it in
  drained sections. This ensures that the VDUSE kernel code gets a fast
  response.

- Suspend virtqueue fd handlers in .drained_begin() and resume them in
  .drained_end(). This eliminates the need for the
  aio_set_fd_handler(is_external=true) flag, which is being removed from
  QEMU.

This is a long list but splitting it into individual commits would
probably lead to git bisect failures - the changes are all related.

Signed-off-by: Stefan Hajnoczi 
---
 block/export/vduse-blk.c | 132 +++
 1 file changed, 93 insertions(+), 39 deletions(-)

diff --git a/block/export/vduse-blk.c b/block/export/vduse-blk.c
index f7ae44e3ce..35dc8fcf45 100644
--- a/block/export/vduse-blk.c
+++ b/block/export/vduse-blk.c
@@ -31,7 +31,8 @@ typedef struct VduseBlkExport {
 VduseDev *dev;
 uint16_t num_queues;
 char *recon_file;
-unsigned int inflight;
+unsigned int inflight; /* atomic */
+bool vqs_started;
 } VduseBlkExport;
 
 typedef struct VduseBlkReq {
@@ -41,13 +42,24 @@ typedef struct VduseBlkReq {
 
 static void vduse_blk_inflight_inc(VduseBlkExport *vblk_exp)
 {
-vblk_exp->inflight++;
+if (qatomic_fetch_inc(&vblk_exp->inflight) == 0) {
+/* Prevent export from being deleted */
+aio_context_acquire(vblk_exp->export.ctx);
+blk_exp_ref(&vblk_exp->export);
+aio_context_release(vblk_exp->export.ctx);
+}
 }
 
 static void vduse_blk_inflight_dec(VduseBlkExport *vblk_exp)
 {
-if (--vblk_exp->inflight == 0) {
+if (qatomic_fetch_dec(&vblk_exp->inflight) == 1) {
+/* Wake AIO_WAIT_WHILE() */
 aio_wait_kick();
+
+/* Now the export can be deleted */
+aio_context_acquire(vblk_exp->export.ctx);
+blk_exp_unref(&vblk_exp->export);
+aio_context_release(vblk_exp->export.ctx);
 }
 }
 
@@ -124,8 +136,12 @@ static void vduse_blk_enable_queue(VduseDev *dev, 
VduseVirtq *vq)
 {
 VduseBlkExport *vblk_exp = vduse_dev_get_priv(dev);
 
+if (!vblk_exp->vqs_started) {
+return; /* vduse_blk_drained_end() will start vqs later */
+}
+
 aio_set_fd_handler(vblk_exp->export.ctx, vduse_queue_get_fd(vq),
-   true, on_vduse_vq_kick, NULL, NULL, NULL, vq);
+   false, on_vduse_vq_kick, NULL, NULL, NULL, vq);
 /* Make sure we don't miss any kick afer reconnecting */
 eventfd_write(vduse_queue_get_fd(vq), 1);
 }
@@ -133,9 +149,14 @@ static void vduse_blk_enable_queue(VduseDev *dev, 
VduseVirtq *vq)
 static void vduse_blk_disable_queue(VduseDev *dev, VduseVirtq *vq)
 {
 VduseBlkExport *vblk_exp = vduse_dev_get_priv(dev);
+int fd = vduse_queue_get_fd(vq);
 
-aio_set_fd_handler(vblk_exp->export.ctx, vduse_queue_get_fd(vq),
-   true, NULL, NULL, NULL, NULL, NULL);
+if (fd < 0) {
+return;
+}
+
+aio_set_fd_handler(vblk_exp->export.ctx, fd, false,
+   NULL, NULL, NULL, NULL, NULL);
 }
 
 static const VduseOps vduse_blk_ops = {
@@ -152,42 +173,19 @@ static void on_vduse_dev_kick(void *opaque)
 
 static void vduse_blk_attach_ctx(VduseBlkExport *vblk_exp, AioContext *ctx)
 {
-int i;
-
 aio_set_fd_handler(vblk_exp->export.ctx, vduse_dev_get_fd(vblk_exp->dev),
-   true, on_vduse_dev_kick, NULL, NULL, NULL,
+   false, on_vduse_dev_kick, NULL, NULL, NULL,
vblk_exp->dev);
 
-for (i = 0; i < vblk_exp->num_queues; i++) {
-VduseVirtq *vq = vduse_dev_get_queue(vblk_exp->dev, i);
-int fd = vduse_queue_get_fd(vq);
-
-if (fd < 0) {
-continue;
-}
-aio_set_fd_handler(vblk_exp->export.ctx, fd, true,
-   on_vduse_vq_kick, NULL, NULL, NULL, vq);
-}
+/* Virtqueues are handled by vduse_blk_drained_end(
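
A minimal sketch of the drained callbacks described in the list above,
assuming the vqs_started flag and the atomic inflight counter introduced
earlier in this diff (the opaque pointer type is an assumption):

    static void vduse_blk_drained_begin(void *opaque)
    {
        VduseBlkExport *vblk_exp = opaque;

        /* Stop virtqueue fd handlers; drained_end() restarts them */
        vblk_exp->vqs_started = false;

        for (uint16_t i = 0; i < vblk_exp->num_queues; i++) {
            VduseVirtq *vq = vduse_dev_get_queue(vblk_exp->dev, i);

            vduse_blk_disable_queue(vblk_exp->dev, vq);
        }
    }

    static bool vduse_blk_drained_poll(void *opaque)
    {
        VduseBlkExport *vblk_exp = opaque;

        /* Keep polling until in-flight request coroutines have stopped */
        return qatomic_read(&vblk_exp->inflight) > 0;
    }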

[PATCH v4 15/20] block/fuse: do not set is_external=true on FUSE fd

2023-04-25 Thread Stefan Hajnoczi
This is part of ongoing work to remove the aio_disable_external() API.

Use BlockDevOps .drained_begin/end/poll() instead of
aio_set_fd_handler(is_external=true).

As a side-effect the FUSE export now follows AioContext changes like the
other export types.

Signed-off-by: Stefan Hajnoczi 
---
 block/export/fuse.c | 58 +++--
 1 file changed, 56 insertions(+), 2 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 06fa41079e..65a7f4d723 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -50,6 +50,7 @@ typedef struct FuseExport {
 
 struct fuse_session *fuse_session;
 struct fuse_buf fuse_buf;
+unsigned int in_flight; /* atomic */
 bool mounted, fd_handler_set_up;
 
 char *mountpoint;
@@ -78,6 +79,42 @@ static void read_from_fuse_export(void *opaque);
 static bool is_regular_file(const char *path, Error **errp);
 
 
+static void fuse_export_drained_begin(void *opaque)
+{
+FuseExport *exp = opaque;
+
+aio_set_fd_handler(exp->common.ctx,
+   fuse_session_fd(exp->fuse_session), false,
+   NULL, NULL, NULL, NULL, NULL);
+exp->fd_handler_set_up = false;
+}
+
+static void fuse_export_drained_end(void *opaque)
+{
+FuseExport *exp = opaque;
+
+/* Refresh AioContext in case it changed */
+exp->common.ctx = blk_get_aio_context(exp->common.blk);
+
+aio_set_fd_handler(exp->common.ctx,
+   fuse_session_fd(exp->fuse_session), false,
+   read_from_fuse_export, NULL, NULL, NULL, exp);
+exp->fd_handler_set_up = true;
+}
+
+static bool fuse_export_drained_poll(void *opaque)
+{
+FuseExport *exp = opaque;
+
+return qatomic_read(&exp->in_flight) > 0;
+}
+
+static const BlockDevOps fuse_export_blk_dev_ops = {
+.drained_begin = fuse_export_drained_begin,
+.drained_end   = fuse_export_drained_end,
+.drained_poll  = fuse_export_drained_poll,
+};
+
 static int fuse_export_create(BlockExport *blk_exp,
   BlockExportOptions *blk_exp_args,
   Error **errp)
@@ -101,6 +138,15 @@ static int fuse_export_create(BlockExport *blk_exp,
 }
 }
 
+blk_set_dev_ops(exp->common.blk, &fuse_export_blk_dev_ops, exp);
+
+/*
+ * We handle draining ourselves using an in-flight counter and by disabling
+ * the FUSE fd handler. Do not queue BlockBackend requests, they need to
+ * complete so the in-flight counter reaches zero.
+ */
+blk_set_disable_request_queuing(exp->common.blk, true);
+
 init_exports_table();
 
 /*
@@ -224,7 +270,7 @@ static int setup_fuse_export(FuseExport *exp, const char 
*mountpoint,
 g_hash_table_insert(exports, g_strdup(mountpoint), NULL);
 
 aio_set_fd_handler(exp->common.ctx,
-   fuse_session_fd(exp->fuse_session), true,
+   fuse_session_fd(exp->fuse_session), false,
read_from_fuse_export, NULL, NULL, NULL, exp);
 exp->fd_handler_set_up = true;
 
@@ -246,6 +292,8 @@ static void read_from_fuse_export(void *opaque)
 
 blk_exp_ref(&exp->common);
 
+qatomic_inc(&exp->in_flight);
+
 do {
 ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
 } while (ret == -EINTR);
@@ -256,6 +304,10 @@ static void read_from_fuse_export(void *opaque)
 fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
 
 out:
+if (qatomic_fetch_dec(&exp->in_flight) == 1) {
+aio_wait_kick(); /* wake AIO_WAIT_WHILE() */
+}
+
 blk_exp_unref(&exp->common);
 }
 
@@ -268,7 +320,7 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
 
 if (exp->fd_handler_set_up) {
 aio_set_fd_handler(exp->common.ctx,
-   fuse_session_fd(exp->fuse_session), true,
+   fuse_session_fd(exp->fuse_session), false,
NULL, NULL, NULL, NULL, NULL);
 exp->fd_handler_set_up = false;
 }
@@ -287,6 +339,8 @@ static void fuse_export_delete(BlockExport *blk_exp)
 {
 FuseExport *exp = container_of(blk_exp, FuseExport, common);
 
+blk_set_dev_ops(exp->common.blk, NULL, NULL);
+
 if (exp->fuse_session) {
 if (exp->mounted) {
 fuse_session_unmount(exp->fuse_session);
-- 
2.39.2




[PATCH v4 11/20] xen-block: implement BlockDevOps->drained_begin()

2023-04-25 Thread Stefan Hajnoczi
Detach event channels during drained sections to stop I/O submission
from the ring. xen-block is no longer reliant on aio_disable_external()
after this patch. This will allow us to remove the
aio_disable_external() API once all other code that relies on it is
converted.

Extend xen_device_set_event_channel_context() to allow ctx=NULL. The
event channel still exists but the event loop does not monitor the file
descriptor. Event channel processing can resume by calling
xen_device_set_event_channel_context() with a non-NULL ctx.

Factor out xen_device_set_event_channel_context() calls in
hw/block/dataplane/xen-block.c into attach/detach helper functions.
Incidentally, these don't require the AioContext lock because
aio_set_fd_handler() is thread-safe.

It's safer to register BlockDevOps after the dataplane instance has been
created. The BlockDevOps .drained_begin/end() callbacks depend on the
dataplane instance, so move the blk_set_dev_ops() call after
xen_block_dataplane_create().

Signed-off-by: Stefan Hajnoczi 
---
 hw/block/dataplane/xen-block.h |  2 ++
 hw/block/dataplane/xen-block.c | 42 +-
 hw/block/xen-block.c   | 24 ---
 hw/xen/xen-bus.c   |  7 --
 4 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/hw/block/dataplane/xen-block.h b/hw/block/dataplane/xen-block.h
index 76dcd51c3d..7b8e9df09f 100644
--- a/hw/block/dataplane/xen-block.h
+++ b/hw/block/dataplane/xen-block.h
@@ -26,5 +26,7 @@ void xen_block_dataplane_start(XenBlockDataPlane *dataplane,
unsigned int protocol,
Error **errp);
 void xen_block_dataplane_stop(XenBlockDataPlane *dataplane);
+void xen_block_dataplane_attach(XenBlockDataPlane *dataplane);
+void xen_block_dataplane_detach(XenBlockDataPlane *dataplane);
 
 #endif /* HW_BLOCK_DATAPLANE_XEN_BLOCK_H */
diff --git a/hw/block/dataplane/xen-block.c b/hw/block/dataplane/xen-block.c
index 734da42ea7..02e0fd6115 100644
--- a/hw/block/dataplane/xen-block.c
+++ b/hw/block/dataplane/xen-block.c
@@ -663,6 +663,30 @@ void xen_block_dataplane_destroy(XenBlockDataPlane 
*dataplane)
 g_free(dataplane);
 }
 
+void xen_block_dataplane_detach(XenBlockDataPlane *dataplane)
+{
+if (!dataplane || !dataplane->event_channel) {
+return;
+}
+
+/* Only reason for failure is a NULL channel */
+xen_device_set_event_channel_context(dataplane->xendev,
+ dataplane->event_channel,
+ NULL, &error_abort);
+}
+
+void xen_block_dataplane_attach(XenBlockDataPlane *dataplane)
+{
+if (!dataplane || !dataplane->event_channel) {
+return;
+}
+
+/* Only reason for failure is a NULL channel */
+xen_device_set_event_channel_context(dataplane->xendev,
+ dataplane->event_channel,
+ dataplane->ctx, &error_abort);
+}
+
 void xen_block_dataplane_stop(XenBlockDataPlane *dataplane)
 {
 XenDevice *xendev;
@@ -673,13 +697,11 @@ void xen_block_dataplane_stop(XenBlockDataPlane 
*dataplane)
 
 xendev = dataplane->xendev;
 
-aio_context_acquire(dataplane->ctx);
-if (dataplane->event_channel) {
-/* Only reason for failure is a NULL channel */
-xen_device_set_event_channel_context(xendev, dataplane->event_channel,
- qemu_get_aio_context(),
- &error_abort);
+if (!blk_in_drain(dataplane->blk)) {
+xen_block_dataplane_detach(dataplane);
 }
+
+aio_context_acquire(dataplane->ctx);
 /* Xen doesn't have multiple users for nodes, so this can't fail */
 blk_set_aio_context(dataplane->blk, qemu_get_aio_context(), &error_abort);
 aio_context_release(dataplane->ctx);
@@ -818,11 +840,9 @@ void xen_block_dataplane_start(XenBlockDataPlane 
*dataplane,
 blk_set_aio_context(dataplane->blk, dataplane->ctx, NULL);
 aio_context_release(old_context);
 
-/* Only reason for failure is a NULL channel */
-aio_context_acquire(dataplane->ctx);
-xen_device_set_event_channel_context(xendev, dataplane->event_channel,
- dataplane->ctx, &error_abort);
-aio_context_release(dataplane->ctx);
+if (!blk_in_drain(dataplane->blk)) {
+xen_block_dataplane_attach(dataplane);
+}
 
 return;
 
diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
index f5a744589d..f099914831 100644
--- a/hw/block/xen-block.c
+++ b/hw/block/xen-block.c
@@ -189,8 +189,26 @@ static void xen_block_resize_cb(void *opaque)
 xen_device_backend_printf(xendev, "state", "%u", state);
 }
 
+/* Suspend request handling */
+static void xen_block_drained_begin(void *opaque)
+{
+XenBlockDevic
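
The callbacks themselves reduce to the attach/detach helpers factored out
above (a sketch; the BlockDevOps wiring is an assumption based on the commit
message):

    static void xen_block_drained_begin(void *opaque)
    {
        XenBlockDevice *blockdev = opaque;

        xen_block_dataplane_detach(blockdev->dataplane);
    }

    /* Resume request handling */
    static void xen_block_drained_end(void *opaque)
    {
        XenBlockDevice *blockdev = opaque;

        xen_block_dataplane_attach(blockdev->dataplane);
    }

    static const BlockDevOps xen_block_dev_ops = {
        .resize_cb     = xen_block_resize_cb,
        .drained_begin = xen_block_drained_begin,
        .drained_end   = xen_block_drained_end,
    };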

[PATCH v4 10/20] block: drain from main loop thread in bdrv_co_yield_to_drain()

2023-04-25 Thread Stefan Hajnoczi
For simplicity, always run BlockDevOps .drained_begin/end/poll()
callbacks in the main loop thread. This makes it easier to implement the
callbacks and avoids extra locks.

Move the function pointer declarations from the I/O Code section to the
Global State section in block-backend-common.h.

Signed-off-by: Stefan Hajnoczi 
---
 include/sysemu/block-backend-common.h | 25 +
 block/io.c|  3 ++-
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/sysemu/block-backend-common.h 
b/include/sysemu/block-backend-common.h
index 2391679c56..780cea7305 100644
--- a/include/sysemu/block-backend-common.h
+++ b/include/sysemu/block-backend-common.h
@@ -59,6 +59,19 @@ typedef struct BlockDevOps {
  */
 bool (*is_medium_locked)(void *opaque);
 
+/*
+ * Runs when the backend receives a drain request.
+ */
+void (*drained_begin)(void *opaque);
+/*
+ * Runs when the backend's last drain request ends.
+ */
+void (*drained_end)(void *opaque);
+/*
+ * Is the device still busy?
+ */
+bool (*drained_poll)(void *opaque);
+
 /*
  * I/O API functions. These functions are thread-safe.
  *
@@ -76,18 +89,6 @@ typedef struct BlockDevOps {
  * Runs when the size changed (e.g. monitor command block_resize)
  */
 void (*resize_cb)(void *opaque);
-/*
- * Runs when the backend receives a drain request.
- */
-void (*drained_begin)(void *opaque);
-/*
- * Runs when the backend's last drain request ends.
- */
-void (*drained_end)(void *opaque);
-/*
- * Is the device still busy?
- */
-bool (*drained_poll)(void *opaque);
 } BlockDevOps;
 
 /*
diff --git a/block/io.c b/block/io.c
index 2e267a85ab..4f9fe2f808 100644
--- a/block/io.c
+++ b/block/io.c
@@ -335,7 +335,8 @@ static void coroutine_fn 
bdrv_co_yield_to_drain(BlockDriverState *bs,
 if (ctx != co_ctx) {
 aio_context_release(ctx);
 }
-replay_bh_schedule_oneshot_event(ctx, bdrv_co_drain_bh_cb, &data);
+replay_bh_schedule_oneshot_event(qemu_get_aio_context(),
+ bdrv_co_drain_bh_cb, &data);
 
 qemu_coroutine_yield();
 /* If we are resumed from some other event (such as an aio completion or a
-- 
2.39.2




[PATCH v4 12/20] hw/xen: do not set is_external=true on evtchn fds

2023-04-25 Thread Stefan Hajnoczi
is_external=true suspends fd handlers between aio_disable_external() and
aio_enable_external(). The block layer's drain operation uses this
mechanism to prevent new I/O from sneaking in between
bdrv_drained_begin() and bdrv_drained_end().

The previous commit converted the xen-block device to use BlockDevOps
.drained_begin/end() callbacks. It no longer relies on is_external=true
so it is safe to pass is_external=false.

This is part of ongoing work to remove the aio_disable_external() API.

Signed-off-by: Stefan Hajnoczi 
---
 hw/xen/xen-bus.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/xen/xen-bus.c b/hw/xen/xen-bus.c
index b8f408c9ed..bf256d4da2 100644
--- a/hw/xen/xen-bus.c
+++ b/hw/xen/xen-bus.c
@@ -842,14 +842,14 @@ void xen_device_set_event_channel_context(XenDevice 
*xendev,
 }
 
 if (channel->ctx)
-aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), 
true,
+aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), 
false,
NULL, NULL, NULL, NULL, NULL);
 
 channel->ctx = ctx;
 if (ctx) {
 aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh),
-   true, xen_device_event, NULL, xen_device_poll, NULL,
-   channel);
+   false, xen_device_event, NULL, xen_device_poll,
+   NULL, channel);
 }
 }
 
@@ -923,7 +923,7 @@ void xen_device_unbind_event_channel(XenDevice *xendev,
 
 QLIST_REMOVE(channel, list);
 
-aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), true,
+aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), false,
NULL, NULL, NULL, NULL, NULL);
 
 if (qemu_xen_evtchn_unbind(channel->xeh, channel->local_port) < 0) {
-- 
2.39.2




[PATCH v4 09/20] block: add blk_in_drain() API

2023-04-25 Thread Stefan Hajnoczi
The BlockBackend quiesce_counter is greater than zero during drained
sections. Add an API to check whether the BlockBackend is in a drained
section.

The next patch will use this API.
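
As a preview, the virtio-blk patch in this series (17/20) guards ioeventfd
attach on the drain state like this:

    /* Do not hook up ioeventfd handlers while inside a drained section */
    if (!blk_in_drain(s->conf->conf.blk)) {
        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
    }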

Signed-off-by: Stefan Hajnoczi 
---
 include/sysemu/block-backend-global-state.h | 1 +
 block/block-backend.c   | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/include/sysemu/block-backend-global-state.h 
b/include/sysemu/block-backend-global-state.h
index 2b6d27db7c..ac7cbd6b5e 100644
--- a/include/sysemu/block-backend-global-state.h
+++ b/include/sysemu/block-backend-global-state.h
@@ -78,6 +78,7 @@ void blk_activate(BlockBackend *blk, Error **errp);
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags);
 void blk_aio_cancel(BlockAIOCB *acb);
 int blk_commit_all(void);
+bool blk_in_drain(BlockBackend *blk);
 void blk_drain(BlockBackend *blk);
 void blk_drain_all(void);
 void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
diff --git a/block/block-backend.c b/block/block-backend.c
index ffd1d66f7d..42721a3592 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1266,6 +1266,13 @@ blk_check_byte_request(BlockBackend *blk, int64_t 
offset, int64_t bytes)
 return 0;
 }
 
+/* Are we currently in a drained section? */
+bool blk_in_drain(BlockBackend *blk)
+{
+GLOBAL_STATE_CODE(); /* change to IO_OR_GS_CODE(), if necessary */
+return qatomic_read(&blk->quiesce_counter);
+}
+
 /* To be called between exactly one pair of blk_inc/dec_in_flight() */
 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
 {
-- 
2.39.2




[PATCH v4 08/20] hw/xen: do not use aio_set_fd_handler(is_external=true) in xen_xenstore

2023-04-25 Thread Stefan Hajnoczi
There is no need to suspend activity between aio_disable_external() and
aio_enable_external(), which is mainly used for the block layer's drain
operation.

This is part of ongoing work to remove the aio_disable_external() API.

Reviewed-by: David Woodhouse 
Reviewed-by: Paul Durrant 
Signed-off-by: Stefan Hajnoczi 
---
 hw/i386/kvm/xen_xenstore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/kvm/xen_xenstore.c b/hw/i386/kvm/xen_xenstore.c
index 900679af8a..6e81bc8791 100644
--- a/hw/i386/kvm/xen_xenstore.c
+++ b/hw/i386/kvm/xen_xenstore.c
@@ -133,7 +133,7 @@ static void xen_xenstore_realize(DeviceState *dev, Error 
**errp)
 error_setg(errp, "Xenstore evtchn port init failed");
 return;
 }
-aio_set_fd_handler(qemu_get_aio_context(), xen_be_evtchn_fd(s->eh), true,
+aio_set_fd_handler(qemu_get_aio_context(), xen_be_evtchn_fd(s->eh), false,
xen_xenstore_event, NULL, NULL, NULL, s);
 
 s->impl = xs_impl_create(xen_domid);
-- 
2.39.2



