Re: [PATCH v2 5/5] monitor: use aio_co_reschedule_self()

2024-05-03 Thread Kevin Wolf
Am 06.02.2024 um 20:06 hat Stefan Hajnoczi geschrieben:
> The aio_co_reschedule_self() API is designed to avoid the race
> condition between scheduling the coroutine in another AioContext and
> yielding.
> 
> The QMP dispatch code uses the open-coded version that appears
> susceptible to the race condition at first glance:
> 
>   aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
>   qemu_coroutine_yield();
> 
> The code is actually safe because the iohandler and qemu_aio_context
> AioContext run under the Big QEMU Lock. Nevertheless, set a good example
> and use aio_co_reschedule_self() so it's obvious that there is no race.
> 
> Suggested-by: Hanna Reitz 
> Reviewed-by: Manos Pitsidianakis 
> Reviewed-by: Hanna Czenczek 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  qapi/qmp-dispatch.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
> index 176b549473..f3488afeef 100644
> --- a/qapi/qmp-dispatch.c
> +++ b/qapi/qmp-dispatch.c
> @@ -212,8 +212,7 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList *cmds, QObject *requ
>           * executing the command handler so that it can make progress if it
>           * involves an AIO_WAIT_WHILE().
>           */
> -        aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
> -        qemu_coroutine_yield();
> +        aio_co_reschedule_self(qemu_get_aio_context());

Turns out that this one actually causes a regression. [1] This code is
run in iohandler_ctx; aio_co_reschedule_self() looks at the new context
and compares it with qemu_get_current_aio_context() - and because both
are qemu_aio_context, it decides that it has nothing to do. So the
command handler coroutine actually still runs in iohandler_ctx now,
which is not what we want.
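
The mismatch can be sketched in a toy model (plain Python, not QEMU code; the context names are taken from the discussion above, everything else is illustrative): the early-return check compares the target context against what the thread reports as current, not against the context the coroutine is actually scheduled in.

```python
class Coroutine:
    def __init__(self, ctx):
        self.ctx = ctx  # the AioContext the coroutine currently runs in

def reschedule_self(co, new_ctx, current_thread_ctx):
    """Toy model of aio_co_reschedule_self(): no-op when the target
    context equals what qemu_get_current_aio_context() reports."""
    if new_ctx == current_thread_ctx:
        return  # early return: assumes we already run in new_ctx
    co.ctx = new_ctx  # aio_co_schedule() + yield in the real code

# The QMP dispatch coroutine runs in iohandler_ctx, but iohandler_ctx is
# processed by the main loop thread, whose current AioContext is reported
# as qemu_aio_context -- so the check misfires:
co = Coroutine(ctx="iohandler_ctx")
reschedule_self(co, new_ctx="qemu_aio_context",
                current_thread_ctx="qemu_aio_context")
assert co.ctx == "iohandler_ctx"  # regression: coroutine never moved
```

Reverting the patch corresponds to dropping the early return in this model: the open-coded schedule + yield always moves the coroutine.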

We could just revert this patch because it was only meant as a cleanup
without a semantic difference.

Or aio_co_reschedule_self() could look at qemu_coroutine_self()->ctx
instead of using qemu_get_current_aio_context(). That would be a little
more indirect, though, and I'm not sure if co->ctx is always up to date.

Any opinions on what is the best way to fix this?

Kevin

[1] https://issues.redhat.com/browse/RHEL-34618




Re: [PULL 6/6] iotests: add backup-discard-source

2024-04-30 Thread Kevin Wolf
Am 29.04.2024 um 20:39 hat Vladimir Sementsov-Ogievskiy geschrieben:
> [Add John]
> 
> On 29.04.24 17:18, Richard Henderson wrote:
> > On 4/29/24 04:51, Vladimir Sementsov-Ogievskiy wrote:
> > > Add test for a new backup option: discard-source.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > Reviewed-by: Fiona Ebner 
> > > Tested-by: Fiona Ebner 
> > > Message-Id: <20240313152822.626493-6-vsement...@yandex-team.ru>
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > ---
> > >   .../qemu-iotests/tests/backup-discard-source  | 152 ++
> > >   .../tests/backup-discard-source.out   |   5 +
> > >   2 files changed, 157 insertions(+)
> > >   create mode 100755 tests/qemu-iotests/tests/backup-discard-source
> > >   create mode 100644 tests/qemu-iotests/tests/backup-discard-source.out
> > 
> > This fails check-python-minreqs
> > 
> >    https://gitlab.com/qemu-project/qemu/-/jobs/6739551782
> > 
> > It appears to be a pylint issue.
> > 
> > 
> 
tests/export-incoming-iothread:1:0: R0801: Similar lines in 2 files
==tests.backup-discard-source:[52:61]
==tests.copy-before-write:[54:63]
    'file': {
        'driver': iotests.imgfmt,
        'file': {
            'driver': 'file',
            'filename': source_img,
        }
    },
    'target': {
        'driver': iotests.imgfmt, (duplicate-code)
> 
> 
> 
> Hmm. I don't think anything should be changed in the code. Splitting
> out part of this JSON object into a function? That's a test for a QMP
> command, and it's good that we see the command as-is in one place. I
> could swap some lines or rename variables to trick the linter, but I'd
> prefer not to :)
> 
> 
> To me this looks like a false positive: yes, it's duplication, but
> it's better to keep the duplication than to complicate raw JSON
> objects by reusing common parts.
> 
> 
> What to do? As described in 22305c2a081b8b6 "python: Reduce strictness
> of pylint's duplicate-code check", this check could either be disabled
> completely, or we can increase the min-similarity-lines config value.
> 
> I'd just disable it completely. Any thoughts?

I think it would make sense to treat tests differently from real
production code. In tests/, some duplication is bound to happen and
trying to unify things across test cases (which would mean moving to
iotests.py) often doesn't make sense. On the other hand, in python/ we
would probably want to unify duplicated code.
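
For reference, both knobs discussed above live in pylint's configuration; a sketch of the per-directory treatment Kevin suggests (section and option names as documented by pylint; the values and the idea of a tests-only config file are illustrative):

```ini
; pylintrc used for tests/ only, so python/ keeps the strict settings
[SIMILARITIES]
; raise the threshold so short, intentional duplication in tests passes
min-similarity-lines=10

; or, alternatively, disable the check entirely for tests:
[MESSAGES CONTROL]
disable=duplicate-code
```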

Kevin




Re: [PATCH 0/3] Make it possible to compile the x86 binaries without FDC

2024-04-29 Thread Kevin Wolf
[ Cc: qemu-block ]

Am 25.04.2024 um 20:43 hat Thomas Huth geschrieben:
> For downstream versions of QEMU, we'd like to be able to compile QEMU
> without the FDC code included (since it's not required for modern VMs
> anymore and the FDC code has rather a bad reputation, see the VENOM CVE).
> 
> The q35 machine can already be instantiated without FDC, but for being
> able to link a binary without the FDC code, the Kconfig file needs some
> tweaks and there are two spots in the pc code that directly call functions
> from the FDC code - those need to be disabled via #ifdefs.
> 
> The third patch changes the i440fx and isapc machine types so that
> they can work without the FDC device, too, in case it has not been
> compiled into the binary. It's marked as RFC since I assume that the
> FDC was originally a fixed component of these motherboards, so I'm
> unsure whether we should allow the disablement there. OTOH, it seems
> to work fine, and the FDC is only disabled when it is not available
> in the binary, so I hope this patch is fine, too.
> 
> Thomas Huth (3):
>   hw/i386/pc: Allow to compile without CONFIG_FDC_ISA
>   hw/i386/Kconfig: Allow to compile Q35 without FDC_ISA
>   hw/i386: Add the possibility to use i440fx and isapc without FDC
> 
>  hw/i386/pc.c  | 13 +
>  hw/i386/pc_piix.c |  6 --
>  hw/i386/Kconfig   |  2 +-
>  3 files changed, 14 insertions(+), 7 deletions(-)
> 
> -- 
> 2.44.0
> 
> 




Re: [PATCH v4] linux-aio: add IO_CMD_FDSYNC command support

2024-04-24 Thread Kevin Wolf
Am 14.03.2024 um 12:16 hat Prasad Pandit geschrieben:
> From: Prasad Pandit 
> 
> Libaio defines the IO_CMD_FDSYNC command to sync all outstanding
> asynchronous I/O operations by flushing out file data to disk
> storage.
> 
> Enable linux-aio to submit such aio requests. This helps to reduce
> the latency induced by the pthread_create() calls made by the
> thread pool (aio=threads).
> 
> Signed-off-by: Prasad Pandit 

As we discussed in chat, it would be good to be more detailed about the
scenario that we're really concerned about here. The commit message
above sounds like submitting fdsync takes too long, but the real concern
seems to be about the effect that creating and destroying a thread has
on a vcpu by adding a TLB flush. Describing the mechanisms, the sequence
of operations that happen and the problem this causes in more detail
would make the commit message a lot more useful.

>  block/file-posix.c  |  7 +++
>  block/linux-aio.c   | 21 -
>  include/block/raw-aio.h |  1 +
>  3 files changed, 28 insertions(+), 1 deletion(-)
> 
> v4: New boolean field to indicate if aio_fdsync is available or not.
> It is set at file open time and checked before AIO_FLUSH call.
>   - https://lists.nongnu.org/archive/html/qemu-devel/2024-03/msg03701.html
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 35684f7e21..78a8cea03b 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -159,6 +159,7 @@ typedef struct BDRVRawState {
>      bool has_discard:1;
>      bool has_write_zeroes:1;
>      bool use_linux_aio:1;
> +    bool has_laio_fdsync:1;
>      bool use_linux_io_uring:1;
>      int page_cache_inconsistent; /* errno from fdatasync failure */
>      bool has_fallocate;
> @@ -718,6 +719,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>          ret = -EINVAL;
>          goto fail;
>      }
> +    s->has_laio_fdsync = laio_has_fdsync(s->fd);

I think this should be conditional on s->use_linux_aio. No point in
probing it if we'll never call it anyway.

Kevin




[PATCH for-9.0?] usb-storage: Fix BlockConf defaults

2024-04-12 Thread Kevin Wolf
Commit 30896374 started to pass the full BlockConf from usb-storage to
scsi-disk, while previously only a few select properties would be
forwarded. This enables the user to set more properties, e.g. the block
size, that are actually taking effect.

However, now the calls to blkconf_apply_backend_options() and
blkconf_blocksizes() in usb_msd_storage_realize() that modify some of
these properties take effect, too, instead of being silently ignored.
This means at least that the block sizes get an unconditional default of
512 bytes before the configuration is passed to scsi-disk.

Before commit 30896374, the property wouldn't be set for scsi-disk and
therefore the device dependent defaults would apply - 512 for scsi-hd,
but 2048 for scsi-cd. The latter default has now become 512, too, which
makes at least Windows 11 installation fail when installing from
usb-storage.

Fix this by simply not calling these functions any more in usb-storage
and passing BlockConf on unmodified (except for the BlockBackend). The
same functions are called by the SCSI code anyway and it sets the right
defaults for the actual media type.
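
The default selection described above can be modeled as follows (Python sketch, not QEMU code; names are illustrative):

```python
def effective_blocksize(media, usb_storage_applies_defaults):
    """Model of how the block size default is resolved for the
    scsi-disk device behind usb-storage."""
    blocksize = None  # "unset" in the BlockConf passed along
    if usb_storage_applies_defaults:
        blocksize = 512  # blkconf_blocksizes() fills in 512 up front
    if blocksize is None:
        # scsi-disk applies a media-dependent default instead
        blocksize = 2048 if media == "cdrom" else 512
    return blocksize

# Buggy behavior: usb-storage pre-sets 512, so scsi-cd never sees "unset"
assert effective_blocksize("cdrom", usb_storage_applies_defaults=True) == 512
# Fixed behavior: the unset value reaches scsi-disk, cdrom gets 2048
assert effective_blocksize("cdrom", usb_storage_applies_defaults=False) == 2048
assert effective_blocksize("disk", usb_storage_applies_defaults=False) == 512
```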

Fixes: 308963746169 ('scsi: Don't ignore most usb-storage properties')
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2260
Reported-by: Jonas Svensson
Signed-off-by: Kevin Wolf 
---
Considering this a candidate for 9.0 given that we're already having an
rc4, it's a regression from 8.2 and breaks installing Windows from USB

 hw/usb/dev-storage-classic.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/hw/usb/dev-storage-classic.c b/hw/usb/dev-storage-classic.c
index 50a3ad6285..6147387dc6 100644
--- a/hw/usb/dev-storage-classic.c
+++ b/hw/usb/dev-storage-classic.c
@@ -38,15 +38,6 @@ static void usb_msd_storage_realize(USBDevice *dev, Error **errp)
         return;
     }
 
-    if (!blkconf_blocksizes(&s->conf, errp)) {
-        return;
-    }
-
-    if (!blkconf_apply_backend_options(&s->conf, !blk_supports_write_perm(blk),
-                                       true, errp)) {
-        return;
-    }
-
     /*
      * Hack alert: this pretends to be a block device, but it's really
      * a SCSI bus that can serve only a single device, which it
-- 
2.44.0




Re: [PATCH-for-9.0 v2 0/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Kevin Wolf
Am 09.04.2024 um 15:59 hat Philippe Mathieu-Daudé geschrieben:
> Fix for https://gitlab.com/qemu-project/qemu/-/issues/1446
> 
> Since v1:
> - Addressed Kevin trivial suggestions (unsigned offset)

You already kept the Reviewed-by tags, but looks good to me.

Kevin




Re: [PATCH-for-9.0? 0/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Kevin Wolf
Am 08.04.2024 um 10:36 hat Philippe Mathieu-Daudé geschrieben:
> Fix for https://gitlab.com/qemu-project/qemu/-/issues/1446
> 
> Philippe Mathieu-Daudé (3):
>   hw/block/nand: Factor nand_load_iolen() method out
>   hw/block/nand: Have blk_load() return boolean indicating success
>   hw/block/nand: Fix out-of-bound access in NAND block buffer

As we're short on time for 9.0:

Reviewed-by: Kevin Wolf 

But it feels to me like this device could use some more cleanup to make
the code more robust.

Kevin




Re: [PATCH-for-9.0? 1/3] hw/block/nand: Factor nand_load_iolen() method out

2024-04-09 Thread Kevin Wolf
Am 08.04.2024 um 10:36 hat Philippe Mathieu-Daudé geschrieben:
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/block/nand.c | 32 +++-
>  1 file changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/block/nand.c b/hw/block/nand.c
> index d1435f2207..6fa9038bb5 100644
> --- a/hw/block/nand.c
> +++ b/hw/block/nand.c
> @@ -243,9 +243,25 @@ static inline void nand_pushio_byte(NANDFlashState *s, uint8_t value)
>      }
>  }
>  
> +/*
> + * nand_load_block: Load block containing (s->addr + @offset).
> + * Returns length of data available at @offset in this block.
> + */
> +static int nand_load_block(NANDFlashState *s, int offset)
> +{
> +    int iolen;
> +
> +    s->blk_load(s, s->addr, offset);

Wouldn't it make more sense for @offset to be unsigned, like in
nand_command() before this patch?

I think the values are guaranteed to be small enough to fit in either
signed or unsigned, but we never check for < 0 (maybe that should be
done in your patch to s->blk_load() anyway).

> +    iolen = (1 << s->page_shift) - offset;

This is not new, but I'm confused. Can this legitimately be negative?
offset seems to be limited to (1 << s->addr_shift) + s->offset in
practice, but addr_shift > page_shift for NAND_PAGE_SIZE == 2048.

After patch 3, offset is implicitly limited to NAND_PAGE_SIZE + OOB_SIZE
because we return early if s->blk_load() fails. I wonder if it wouldn't
be better to catch this in the callers already and only assert in
s->blk_load().

Anyway, after patch 3 iolen can only temporarily become negative here...

> +    if (s->gnd) {
> +        iolen += 1 << s->oob_shift;

...before it becomes non-negative again here.

> +    }
> +    return iolen;
> +}

So none of this makes the code technically incorrect after applying the
full series, but let someone modify it who doesn't understand these
intricacies and I think chances are that it will fall apart.
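
Kevin's arithmetic can be checked numerically (Python sketch; the shift values are assumed for illustration, not taken from the patch):

```python
page_shift = 11  # NAND_PAGE_SIZE == 2048
oob_shift = 6    # 64 bytes of OOB data, illustrative value

def iolen_for(offset, gnd):
    """Model of the iolen computation in nand_load_block()."""
    iolen = (1 << page_shift) - offset
    if gnd:
        iolen += 1 << oob_shift
    return iolen

assert iolen_for(0, gnd=False) == 2048
# offset may exceed the page size (up to page + OOB after patch 3):
assert iolen_for(2048 + 32, gnd=False) == -32  # negative, nothing catches it
assert iolen_for(2048 + 32, gnd=True) == 32    # only the gnd branch rescues it
```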

Kevin




Re: [PATCH] block: Remove unnecessary NULL check in bdrv_pad_request()

2024-04-02 Thread Kevin Wolf
Am 02.04.2024 um 12:53 hat Philippe Mathieu-Daudé geschrieben:
> On 27/3/24 20:27, Kevin Wolf wrote:
> > Coverity complains that the check introduced in commit 3f934817 suggests
> > that qiov could be NULL and we dereference it before reaching the check.
> > In fact, all of the callers pass a non-NULL pointer, so just remove the
> > misleading check.
> > 
> > Resolves: Coverity CID 1542668
> > Signed-off-by: Kevin Wolf 
> > ---
> >   block/io.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Since I'm not seeing other block related patch for 9.0 and
> I'm preparing a pull request, I'm queuing this one.

Thanks, Phil. I didn't send a pull request because I didn't have
anything else and silencing a Coverity false positive didn't seem urgent
for 9.0, but it certainly doesn't hurt either.

Kevin




Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup

2024-03-28 Thread Kevin Wolf
Am 27.03.2024 um 23:13 hat Eric Blake geschrieben:
> On Mon, Mar 25, 2024 at 11:50:41AM -0400, Stefan Hajnoczi wrote:
> > On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> > > If g_main_loop_run()/aio_poll() is called in the coroutine context,
> > > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> > > may be disordered.
> > 
> > aio_poll() must not be called from coroutine context:
> > 
> >   bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
> >^^^
> > 
> > Coroutines are not supposed to block. Instead, they should yield.
> > 
> > > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> > > some listened events is completed. Therefore, the completion callback
> > > function is dispatched.
> > > 
> > > If this callback function needs to invoke aio_co_enter(), it will only
> > > wake up the coroutine (because we are already in coroutine context),
> > > which may cause that the data on this listening event_fd/socket_fd
> > > is not read/cleared. When the next poll () exits, it will be woken up 
> > > again
> > > and inserted into the wakeup queue again.
> > > 
> > > For example, if TLS is enabled in NBD, the server will call 
> > > g_main_loop_run()
> > > in the coroutine, and repeatedly wake up the io_read event on a socket.
> > > The call stack is as follows:
> > > 
> > > aio_co_enter()
> > > aio_co_wake()
> > > qio_channel_restart_read()
> > > aio_dispatch_handler()
> > > aio_dispatch_handlers()
> > > aio_dispatch()
> > > aio_ctx_dispatch()
> > > g_main_context_dispatch()
> > > g_main_loop_run()
> > > nbd_negotiate_handle_starttls()
> > 
> > This code does not look like it was designed to run in coroutine
> > context. Two options:
> > 
> > 1. Don't run it in coroutine context (e.g. use a BH instead). This
> >avoids blocking the coroutine but calling g_main_loop_run() is still
> >ugly, in my opinion.
> > 
> > 2. Get rid of data.loop and use coroutine APIs instead:
> > 
> >while (!data.complete) {
> >qemu_coroutine_yield();
> >}
> > 
> >and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
> >of g_main_loop_quit(data->loop).
> > 
> >This requires auditing the code to check whether the event loop might
> >invoke something that interferes with
> >nbd_negotiate_handle_starttls(). Typically this means monitor
> >commands or fd activity that could change the state of this
> >connection while it is yielded. This is where the real work is but
> >hopefully it will not be that hard to figure out.
> 
> I agree that 1) is ugly.  So far, I've added some temporary
> assertions, to see when qio_channel_tls_handshake is reached; it looks
> like nbd/server.c is calling it from within coroutine context, but
> nbd/client.c is calling it from the main loop.  The qio code handles
> either, but now I'm stuck in trying to get client.c into having the
> right coroutine context; the TLS handshake is done before the usual
> BlockDriverState *bs object is available, so I'm not even sure what
> aio context, if any, I should be using.  But on my first try, I'm
> hitting:
> 
> qemu-img: ../util/async.c:707: aio_co_enter: Assertion `self != co' failed.
> 
> so I obviously got something wrong.

Hard to tell without seeing the code, but it looks like you're trying to
wake up the coroutine while you're still executing in the context of the
same coroutine.

It looks like the documentation of qio_channel_tls_handshake() is wrong
and the function can return and call the callback immediately without
dropping out of coroutine context.

A rather heavy-handed, but obvious approach would be moving the
qio_channel_tls_handshake() into a BH, then you know you'll always be
outside of coroutine context in the callback.

But maybe it would be enough to just check if the coroutine isn't
already active:

    if (!qemu_coroutine_entered(co)) {
        aio_co_wake(co);
    }

> It may be possible to use block_gen_c to turn nbd_receive_negotiate
> and nbd_receive_export_list into co_wrapper functions, if I can audit
> that all of their code goes through qio (and is therefore
> coroutine-safe), but that work is still ongoing.

If it's possible, I think that would be nicer in the code and would also
reduce the time the guest is blocked while we're creating a new NBD
connection.

*reads code*

Hm... Actually, one thing I was completely unaware of is that all of
this is running in a separate thread, so maybe the existing synchronous
code already doesn't block the guest. nbd_co_establish_connection()
starts this thread. The thread doesn't have an AioContext, so anything
that involves scheduling something in an AioContext (including BHs and
waking up coroutines) will result in code running in a different thread.

I'm not sure why a thread is used in the first place (does it do
something that coroutines can't do?) and if running parts of the code in
a different thread would be a problem, but we should 

[PATCH] block: Remove unnecessary NULL check in bdrv_pad_request()

2024-03-27 Thread Kevin Wolf
Coverity complains that the check introduced in commit 3f934817 suggests
that qiov could be NULL and we dereference it before reaching the check.
In fact, all of the callers pass a non-NULL pointer, so just remove the
misleading check.

Resolves: Coverity CID 1542668
Signed-off-by: Kevin Wolf 
---
 block/io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 395bea3bac..7217cf811b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1730,7 +1730,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
      * For prefetching in stream_populate(), no qiov is passed along, because
      * only copy-on-read matters.
      */
-    if (qiov && *qiov) {
+    if (*qiov) {
         sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
                                       &sliced_head, &sliced_tail,
                                       &sliced_niov);
-- 
2.44.0




[PULL 6/6] iotests: add test for stream job with an unaligned prefetch read

2024-03-26 Thread Kevin Wolf
From: Fiona Ebner 

Previously, bdrv_pad_request() could not deal with a NULL qiov when
a read needed to be aligned. During prefetch, a stream job will pass a
NULL qiov. Add a test case to cover this scenario.

By accident, also covers a previous race during shutdown, where block
graph changes during iteration in bdrv_flush_all() could lead to
unreferencing the wrong block driver state and an assertion failure
later.

Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-5-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 .../tests/stream-unaligned-prefetch   | 86 +++
 .../tests/stream-unaligned-prefetch.out   |  5 ++
 2 files changed, 91 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch
new file mode 100755
index 00..546db1d369
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when a stream job does an unaligned prefetch read
+# which requires padding while having a NULL qiov.
+#
+# Copyright (C) Proxmox Server Solutions GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase
+
+image_size = 1 * 1024 * 1024
+cluster_size = 64 * 1024
+base = os.path.join(iotests.test_dir, 'base.img')
+top = os.path.join(iotests.test_dir, 'top.img')
+
+class TestStreamUnalignedPrefetch(QMPTestCase):
+    def setUp(self) -> None:
+        """
+        Create two images:
+        - base image {base} with {cluster_size // 2} bytes allocated
+        - top image {top} without any data allocated and coarser
+          cluster size
+
+        Attach a compress filter for the top image, because that
+        requires that the request alignment is the top image's cluster
+        size.
+        """
+        qemu_img_create('-f', imgfmt,
+                        '-o', 'cluster_size={}'.format(cluster_size // 2),
+                        base, str(image_size))
+        qemu_io('-c', f'write 0 {cluster_size // 2}', base)
+        qemu_img_create('-f', imgfmt,
+                        '-o', 'cluster_size={}'.format(cluster_size),
+                        top, str(image_size))
+
+        self.vm = iotests.VM()
+        self.vm.add_blockdev(self.vm.qmp_to_opts({
+            'driver': imgfmt,
+            'node-name': 'base',
+            'file': {
+                'driver': 'file',
+                'filename': base
+            }
+        }))
+        self.vm.add_blockdev(self.vm.qmp_to_opts({
+            'driver': 'compress',
+            'node-name': 'compress-top',
+            'file': {
+                'driver': imgfmt,
+                'node-name': 'top',
+                'file': {
+                    'driver': 'file',
+                    'filename': top
+                },
+                'backing': 'base'
+            }
+        }))
+        self.vm.launch()
+
+    def tearDown(self) -> None:
+        self.vm.shutdown()
+        os.remove(top)
+        os.remove(base)
+
+    def test_stream_unaligned_prefetch(self) -> None:
+        self.vm.cmd('block-stream', job_id='stream', device='compress-top')
+
+
+if __name__ == '__main__':
+    iotests.main(supported_fmts=['qcow2'], supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
new file mode 100644
index 00..ae1213e6f8
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
@@ -0,0 +1,5 @@
+.
+--
+Ran 1 tests
+
+OK
-- 
2.44.0




[PULL 4/6] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-26 Thread Kevin Wolf
From: Fiona Ebner 

The old_bs variable in bdrv_next() is currently determined by looking
at the old block backend. However, if the block graph changes before
the next bdrv_next() call, it might be that the associated BDS is not
the same that was referenced previously. In that case, the wrong BDS
is unreferenced, leading to an assertion failure later:

> bdrv_unref: Assertion `bs->refcnt > 0' failed.

In particular, this can happen in the context of bdrv_flush_all(),
when polling for bdrv_co_flush() in the generated co-wrapper leads to
a graph change (for example with a stream block job [0]).
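
A toy model of the iterator bug (Python, not QEMU code; names are illustrative) shows why recomputing old_bs from the BlockBackend unreferences the wrong node after a graph change:

```python
refcnt = {"top": 1, "filter": 1}

def unref(bs):
    refcnt[bs] -= 1
    assert refcnt[bs] >= 0, f"bdrv_unref: {bs} refcnt went negative"

blk_to_bs = {"blk0": "filter"}   # the BB currently points at a filter node

# The previous bdrv_next() call referenced the filter node:
old_bs_seen_last_time = "filter"
# ...then the graph changes, e.g. stream_prepare() drops the filter:
blk_to_bs["blk0"] = "top"

# Buggy: recompute old_bs from the BB, which now yields "top", so the
# wrong node would be unreferenced and "filter" would leak a reference.
buggy_old_bs = blk_to_bs["blk0"]
assert buggy_old_bs != old_bs_seen_last_time

# Fixed (what the patch does): remember the BDS in it->bs and unref that.
unref(old_bs_seen_last_time)
assert refcnt == {"top": 1, "filter": 0}
```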

A racy reproducer:

> #!/bin/bash
> rm -f /tmp/backing.qcow2
> rm -f /tmp/top.qcow2
> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node0" } }
> {"execute": "quit"}
> EOF

[0]:

> #0  bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> #1  bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., 
> errp=...)
> #2  bdrv_replace_node_common (from=..., to=..., auto_skip=..., 
> detach_subchain=..., errp=...)
> #3  bdrv_drop_filter (bs=..., errp=...)
> #4  bdrv_cor_filter_drop (cor_filter_bs=...)
> #5  stream_prepare (job=...)
> #6  job_prepare_locked (job=...)
> #7  job_txn_apply_locked (fn=..., job=...)
> #8  job_do_finalize_locked (job=...)
> #9  job_exit (opaque=...)
> #10 aio_bh_poll (ctx=...)
> #11 aio_poll (ctx=..., blocking=...)
> #12 bdrv_poll_co (s=...)
> #13 bdrv_flush (bs=...)
> #14 bdrv_flush_all ()
> #15 do_vm_stop (state=..., send_stop=...)
> #16 vm_shutdown ()

Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-3-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/block-backend.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9c4de79e6b..28af1eb17a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
     /* Must be called from the main loop */
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
+    old_bs = it->bs;
+
     /* First, return all root nodes of BlockBackends. In order to avoid
      * returning a BDS twice when multiple BBs refer to it, we only return it
      * if the BB is the first one in the parent list of the BDS. */
     if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
         BlockBackend *old_blk = it->blk;
 
-        old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
         do {
             it->blk = blk_all_next(it->blk);
             bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
         if (bs) {
             bdrv_ref(bs);
             bdrv_unref(old_bs);
+            it->bs = bs;
             return bs;
         }
         it->phase = BDRV_NEXT_MONITOR_OWNED;
-    } else {
-        old_bs = it->bs;
     }
 
 /* Then return the monitor-owned BDSes without a BB attached. Ignore all
-- 
2.44.0




[PULL 3/6] block/io: accept NULL qiov in bdrv_pad_request

2024-03-26 Thread Kevin Wolf
From: Stefan Reiter 

Some operations, e.g. block-stream, perform reads while discarding the
results (only copy-on-read matters). In this case, they will pass NULL
as the target QEMUIOVector, which will however trip bdrv_pad_request,
since it wants to extend its passed vector. In particular, this is the
case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes
and offset still need to be updated, so the subsequent aligned read
will actually be aligned and not run into an assertion failure.
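
The offset/bytes update that must happen even without a qiov is plain alignment arithmetic; a minimal sketch (Python, names illustrative, not the QEMU implementation):

```python
def pad_request(offset, nbytes, align):
    """Grow a request so that both ends fall on an alignment boundary,
    as the head/tail padding in bdrv_pad_request() does."""
    head = offset % align
    tail = (align - (offset + nbytes) % align) % align
    return offset - head, nbytes + head + tail

# A 1-byte read at offset 0 against a 32k-aligned node (the cluster_size
# from the reproducer below) becomes one full aligned cluster:
assert pad_request(0x0, 0x1, 32768) == (0, 32768)
# An unaligned request is widened on both sides:
assert pad_request(5, 10, 8) == (0, 16)
```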

In particular, this can happen when the request alignment of the top
node is larger than the allocated part of the bottom node, in which
case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter 
Signed-off-by: Thomas Lamprecht 
[FE: do update bytes and offset in any case
 add reproducer to commit message]
Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-2-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/io.c | 33 -
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
         return 0;
     }
 
-    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-                                  &sliced_head, &sliced_tail,
-                                  &sliced_niov);
-
-    /* Guaranteed by bdrv_check_request32() */
-    assert(*bytes <= SIZE_MAX);
-    ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-                                  sliced_head, *bytes);
-    if (ret < 0) {
-        bdrv_padding_finalize(pad);
-        return ret;
+    /*
+     * For prefetching in stream_populate(), no qiov is passed along, because
+     * only copy-on-read matters.
+     */
+    if (qiov && *qiov) {
+        sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+                                      &sliced_head, &sliced_tail,
+                                      &sliced_niov);
+
+        /* Guaranteed by bdrv_check_request32() */
+        assert(*bytes <= SIZE_MAX);
+        ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+                                      sliced_head, *bytes);
+        if (ret < 0) {
+            bdrv_padding_finalize(pad);
+            return ret;
+        }
+        *qiov = &pad->local_qiov;
+        *qiov_offset = 0;
     }
+
     *bytes += pad->head + pad->tail;
     *offset -= pad->head;
-    *qiov = &pad->local_qiov;
-    *qiov_offset = 0;
     if (padded) {
         *padded = true;
     }
-- 
2.44.0




[PULL 5/6] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes

2024-03-26 Thread Kevin Wolf
From: Fiona Ebner 

Same rationale as for commit "block-backend: fix edge case in
bdrv_next() where BDS associated to BB changes". The block graph might
change between the bdrv_next() call and the bdrv_next_cleanup() call,
so it could be that the associated BDS is not the same that was
referenced previously anymore. Instead, rely on bdrv_next() to set
it->bs to the BDS it referenced and unreference that one in any case.

Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-4-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/block-backend.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 28af1eb17a..db6f9b92a3 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -663,13 +663,10 @@ void bdrv_next_cleanup(BdrvNextIterator *it)
     /* Must be called from the main loop */
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
-    if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
-        if (it->blk) {
-            bdrv_unref(blk_bs(it->blk));
-            blk_unref(it->blk);
-        }
-    } else {
-        bdrv_unref(it->bs);
+    bdrv_unref(it->bs);
+
+    if (it->phase == BDRV_NEXT_BACKEND_ROOTS && it->blk) {
+        blk_unref(it->blk);
     }
 
     bdrv_next_reset(it);
-- 
2.44.0




[PULL 0/6] Block layer patches

2024-03-26 Thread Kevin Wolf
The following changes since commit 096ae430a7b5a704af4cd94dca7200d6cb069991:

  Merge tag 'pull-qapi-2024-03-26' of https://repo.or.cz/qemu/armbru into 
staging (2024-03-26 09:50:21 +)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 12d7b3bbdcededd3b695501d8d247239d769:

  iotests: add test for stream job with an unaligned prefetch read (2024-03-26 
14:21:26 +0100)


Block layer patches

- Fix crash with unaligned prefetch requests (e.g. in stream jobs)
- vdpa-dev: Fix initialisation order to restore VDUSE compatibility
- iotests fixes


Fiona Ebner (3):
  block-backend: fix edge case in bdrv_next() where BDS associated to BB 
changes
  block-backend: fix edge case in bdrv_next_cleanup() where BDS associated 
to BB changes
  iotests: add test for stream job with an unaligned prefetch read

Kevin Wolf (1):
  vdpa-dev: Fix initialisation order to restore VDUSE compatibility

Stefan Reiter (1):
  block/io: accept NULL qiov in bdrv_pad_request

Thomas Huth (1):
  tests/qemu-iotests: Test 157 and 227 require virtio-blk

 block/block-backend.c  | 18 ++---
 block/io.c | 33 +
 hw/net/vhost_net.c | 10 +++
 hw/virtio/vdpa-dev.c   |  5 +-
 hw/virtio/vhost-vdpa.c | 29 +++-
 hw/virtio/vhost.c  |  8 +-
 hw/virtio/trace-events |  2 +-
 tests/qemu-iotests/157 |  2 +
 tests/qemu-iotests/227 |  2 +
 tests/qemu-iotests/tests/stream-unaligned-prefetch | 86 ++
 .../tests/stream-unaligned-prefetch.out|  5 ++
 11 files changed, 167 insertions(+), 33 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out




[PULL 2/6] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-26 Thread Kevin Wolf
VDUSE requires that virtqueues are first enabled before the DRIVER_OK
status flag is set; with the current API of the kernel module, it is
impossible to enable the opposite order in our block export code because
userspace is not notified when a virtqueue is enabled.

This requirement also matches the normal initialisation order as done by
the generic vhost code in QEMU. However, commit 6c482547 accidentally
changed the order for vdpa-dev and broke access to VDUSE devices with
this.

This changes vdpa-dev to use the normal order again and use the standard
vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
used with vdpa-dev again after this fix.

vhost_net intentionally avoided enabling the vrings for vdpa and does
this manually later while it does enable them for other vhost backends.
Reflect this in the vhost_net code and return early for vdpa, so that
the behaviour doesn't change for this device.

Cc: qemu-sta...@nongnu.org
Fixes: 6c4825476a43 ('vdpa: move vhost_vdpa_set_vring_ready to the caller')
Signed-off-by: Kevin Wolf 
Message-ID: <20240315155949.86066-1-kw...@redhat.com>
Reviewed-by: Eugenio Pérez 
Reviewed-by: Stefano Garzarella 
Signed-off-by: Kevin Wolf 
---
 hw/net/vhost_net.c | 10 ++
 hw/virtio/vdpa-dev.c   |  5 +
 hw/virtio/vhost-vdpa.c | 29 ++---
 hw/virtio/vhost.c  |  8 +++-
 hw/virtio/trace-events |  2 +-
 5 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index e8e1661646..fd1a93701a 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, int enable)
 VHostNetState *net = get_vhost_net(nc);
 const VhostOps *vhost_ops = net->dev.vhost_ops;
 
+/*
+ * vhost-vdpa network devices need to enable dataplane virtqueues after
+ * DRIVER_OK, so they can recover device state before starting dataplane.
+ * Because of that, we don't enable virtqueues here and leave it to
+ * net/vhost-vdpa.c.
+ */
+if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
+return 0;
+}
+
 nc->vring_enable = enable;
 
 if (vhost_ops && vhost_ops->vhost_set_vring_enable) {
diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
index eb9ecea83b..13e87f06f6 100644
--- a/hw/virtio/vdpa-dev.c
+++ b/hw/virtio/vdpa-dev.c
@@ -253,14 +253,11 @@ static int vhost_vdpa_device_start(VirtIODevice *vdev, Error **errp)
 
 s->dev.acked_features = vdev->guest_features;
 
-ret = vhost_dev_start(>dev, vdev, false);
+ret = vhost_dev_start(>dev, vdev, true);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Error starting vhost");
 goto err_guest_notifiers;
 }
-for (i = 0; i < s->dev.nvqs; ++i) {
-vhost_vdpa_set_vring_ready(>vdpa, i);
-}
 s->started = true;
 
 /*
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3bcd05cc22..e827b9175f 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -896,19 +896,41 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
 return idx;
 }
 
-int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
+static int vhost_vdpa_set_vring_enable_one(struct vhost_vdpa *v, unsigned idx,
+   int enable)
 {
 struct vhost_dev *dev = v->dev;
 struct vhost_vring_state state = {
 .index = idx,
-.num = 1,
+.num = enable,
 };
 int r = vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, );
 
-trace_vhost_vdpa_set_vring_ready(dev, idx, r);
+trace_vhost_vdpa_set_vring_enable_one(dev, idx, enable, r);
 return r;
 }
 
+static int vhost_vdpa_set_vring_enable(struct vhost_dev *dev, int enable)
+{
+struct vhost_vdpa *v = dev->opaque;
+unsigned int i;
+int ret;
+
+for (i = 0; i < dev->nvqs; ++i) {
+ret = vhost_vdpa_set_vring_enable_one(v, i, enable);
+if (ret < 0) {
+return ret;
+}
+}
+
+return 0;
+}
+
+int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
+{
+return vhost_vdpa_set_vring_enable_one(v, idx, 1);
+}
+
 static int vhost_vdpa_set_config_call(struct vhost_dev *dev,
int fd)
 {
@@ -1536,6 +1558,7 @@ const VhostOps vdpa_ops = {
 .vhost_set_features = vhost_vdpa_set_features,
 .vhost_reset_device = vhost_vdpa_reset_device,
 .vhost_get_vq_index = vhost_vdpa_get_vq_index,
+.vhost_set_vring_enable = vhost_vdpa_set_vring_enable,
 .vhost_get_config  = vhost_vdpa_get_config,
 .vhost_set_config = vhost_vdpa_set_config,
 .vhost_requires_shm_log = NULL,
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2e4e040db8..f50180e60e 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1984,7 +1984,13 @@ static int vhost_dev_

[PULL 1/6] tests/qemu-iotests: Test 157 and 227 require virtio-blk

2024-03-26 Thread Kevin Wolf
From: Thomas Huth 

Tests 157 and 227 use the virtio-blk device, so we have to mark these
tests accordingly to be skipped if this device is not available (e.g.
when running the tests with qemu-system-avr only).

Signed-off-by: Thomas Huth 
Message-ID: <20240325154737.1305063-1-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/157 | 2 ++
 tests/qemu-iotests/227 | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/tests/qemu-iotests/157 b/tests/qemu-iotests/157
index 0dc9ba68d2..aa2ebbfb4b 100755
--- a/tests/qemu-iotests/157
+++ b/tests/qemu-iotests/157
@@ -40,6 +40,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt generic
 _supported_proto file
 
+_require_devices virtio-blk
+
 do_run_qemu()
 {
 (
diff --git a/tests/qemu-iotests/227 b/tests/qemu-iotests/227
index 7e45a47ac6..eddaad679e 100755
--- a/tests/qemu-iotests/227
+++ b/tests/qemu-iotests/227
@@ -40,6 +40,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt generic
 _supported_proto file
 
+_require_devices virtio-blk
+
 do_run_qemu()
 {
 echo Testing: "$@"
-- 
2.44.0




Re: [PATCH v3 0/4] fix two edge cases related to stream block jobs

2024-03-26 Thread Kevin Wolf
Am 25.03.2024 um 21:11 hat Stefan Hajnoczi geschrieben:
> On Fri, Mar 22, 2024 at 10:50:05AM +0100, Fiona Ebner wrote:
> > Changes in v3:
> > * Also deal with edge case in bdrv_next_cleanup(). Haven't run
> >   into an actual issue there, but at least the caller in
> >   migration/block.c uses bdrv_nb_sectors() which, while not a
> >   coroutine wrapper itself (it's written manually), may call
> >   bdrv_refresh_total_sectors(), which is a generated coroutine
> >   wrapper, so AFAIU, the block graph can change during that call.
> >   And even without that, it's just better to be more consistent
> >   with bdrv_next().
> > 
> > Changes in v2:
> > * Ran into another issue while writing the IO test Stefan wanted
> >   to have (good call :)), so include a fix for that and add the
> >   test. I didn't notice during manual testing, because I hadn't
> >   used a scripted QMP 'quit', so there was no race.
> > 
> > Fiona Ebner (3):
> >   block-backend: fix edge case in bdrv_next() where BDS associated to BB
> > changes
> >   block-backend: fix edge case in bdrv_next_cleanup() where BDS
> > associated to BB changes
> >   iotests: add test for stream job with an unaligned prefetch read
> > 
> > Stefan Reiter (1):
> >   block/io: accept NULL qiov in bdrv_pad_request
> > 
> >  block/block-backend.c | 18 ++--
> >  block/io.c| 31 ---
> >  .../tests/stream-unaligned-prefetch   | 86 +++
> >  .../tests/stream-unaligned-prefetch.out   |  5 ++
> >  4 files changed, 117 insertions(+), 23 deletions(-)
> >  create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
> >  create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out
> 
> Looks good to me. I will wait until Thursday before merging in case
> Hanna, Vladimir, or Kevin have comments. Thanks!

Let's not delay it to -rc2. If something turns out to be wrong with it,
we can still revert it, but I think getting fixes in earlier is better
during freeze.

Thanks, applied to the block branch.

Kevin




Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-26 Thread Kevin Wolf
Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
> The old_bs variable in bdrv_next() is currently determined by looking
> at the old block backend. However, if the block graph changes before
> the next bdrv_next() call, it might be that the associated BDS is not
> the same that was referenced previously. In that case, the wrong BDS
> is unreferenced, leading to an assertion failure later:
> 
> > bdrv_unref: Assertion `bs->refcnt > 0' failed.

Your change makes sense, but in theory it shouldn't make a difference.
The real bug is in the caller, you can't allow graph modifications while
iterating the list of nodes. Even if it doesn't crash (like after your
patch), you don't have any guarantee that you will have seen every node
that exists at the end - and maybe not even that you don't get the
same node twice.

> In particular, this can happen in the context of bdrv_flush_all(),
> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> a graph change (for example with a stream block job [0]).

The whole locking around this case is a bit tricky and would deserve
some cleanup.

The basic rule for bdrv_next() callers is that they need to hold the
graph reader lock as long as they are iterating the graph, otherwise
it's not safe. This implies that the ref/unref pairs in it should never
make a difference either - which is important, because at least
releasing the last reference is forbidden while holding the graph lock.
I intended to remove the ref/unref for bdrv_next(), but I didn't because
I realised that the callers need to be audited first that they really
obey the rules. You found one that would be problematic.

The thing that bdrv_flush_all() gets wrong is that it promises to follow
the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
something that polls. The compiler can't catch this because bdrv_flush()
is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:

- If called outside of coroutine context, they are GRAPH_UNLOCKED
- If called in coroutine context, they are GRAPH_RDLOCK

We should probably try harder to get rid of the mixed functions, because
a synchronous co_wrapper_bdrv_rdlock could actually be marked
GRAPH_UNLOCKED in the code and then the compiler could catch this case.

The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
with a coroutine wrapper so that the graph lock is held for the whole
function. Then calling bdrv_co_flush() while iterating the list is safe
and doesn't allow concurrent graph modifications.

Kevin




Re: [PATCH] tests/qemu-iotests: Test 157 and 227 require virtio-blk

2024-03-26 Thread Kevin Wolf
Am 25.03.2024 um 16:47 hat Thomas Huth geschrieben:
> Tests 157 and 227 use the virtio-blk device, so we have to mark these
> tests accordingly to be skipped if this device is not available (e.g.
> when running the tests with qemu-system-avr only).
> 
> Signed-off-by: Thomas Huth 

Thanks, applied to the block branch.

Kevin




Re: [PATCH for-9.0 v3] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-19 Thread Kevin Wolf
Am 18.03.2024 um 20:27 hat Eugenio Perez Martin geschrieben:
> On Mon, Mar 18, 2024 at 10:02 AM Michael S. Tsirkin  wrote:
> >
> > On Mon, Mar 18, 2024 at 12:31:26PM +0800, Jason Wang wrote:
> > > On Fri, Mar 15, 2024 at 11:59 PM Kevin Wolf  wrote:
> > > >
> > > > VDUSE requires that virtqueues are first enabled before the DRIVER_OK
> > > > status flag is set; with the current API of the kernel module, it is
> > > > impossible to enable the opposite order in our block export code because
> > > > userspace is not notified when a virtqueue is enabled.
> > > >
> > > > This requirement also matches the normal initialisation order as done by
> > > > the generic vhost code in QEMU. However, commit 6c482547 accidentally
> > > > changed the order for vdpa-dev and broke access to VDUSE devices with
> > > > this.
> > > >
> > > > This changes vdpa-dev to use the normal order again and use the standard
> > > > vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
> > > > used with vdpa-dev again after this fix.
> > > >
> > > > vhost_net intentionally avoided enabling the vrings for vdpa and does
> > > > this manually later while it does enable them for other vhost backends.
> > > > Reflect this in the vhost_net code and return early for vdpa, so that
> > > > the behaviour doesn't change for this device.
> > > >
> > > > Cc: qemu-sta...@nongnu.org
> > > > Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
> > > > Signed-off-by: Kevin Wolf 
> > > > ---
> > > > v2:
> > > > - Actually make use of the @enable parameter
> > > > - Change vhost_net to preserve the current behaviour
> > > >
> > > > v3:
> > > > - Updated trace point [Stefano]
> > > > - Fixed typo in comment [Stefano]
> > > >
> > > >  hw/net/vhost_net.c | 10 ++
> > > >  hw/virtio/vdpa-dev.c   |  5 +
> > > >  hw/virtio/vhost-vdpa.c | 29 ++---
> > > >  hw/virtio/vhost.c  |  8 +++-
> > > >  hw/virtio/trace-events |  2 +-
> > > >  5 files changed, 45 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > > > index e8e1661646..fd1a93701a 100644
> > > > --- a/hw/net/vhost_net.c
> > > > +++ b/hw/net/vhost_net.c
> > > > @@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, int enable)
> > > >  VHostNetState *net = get_vhost_net(nc);
> > > >  const VhostOps *vhost_ops = net->dev.vhost_ops;
> > > >
> > > > +/*
> > > > + * vhost-vdpa network devices need to enable dataplane virtqueues after
> > > > + * DRIVER_OK, so they can recover device state before starting dataplane.
> > > > + * Because of that, we don't enable virtqueues here and leave it to
> > > > + * net/vhost-vdpa.c.
> > > > + */
> > > > +if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
> > > > +return 0;
> > > > +}
> > >
> > > I think we need some inputs from Eugenio, this is only needed for
> > > shadow virtqueue during live migration but not other cases.
> > >
> > > Thanks
> >
> >
> > Yes I think we had a backend flag for this, right? Eugenio can you
> > comment please?
> >
> 
> We have the VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend flag,
> right. If the backend does not offer it, it is better to enable all
> the queues here and add a migration blocker in net/vhost-vdpa.c.
> 
> So the check should be:
> nc->info->type == VHOST_VDPA && (backend_features &
> VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK).
> 
> I can manage to add the migration blocker on top of this patch.

Note that my patch preserves the current behaviour for vhost_net. The
callback wasn't implemented for vdpa so far, so we never called anything
even if the flag wasn't set. This patch adds an implementation for the
callback, so we have to skip it here to have everything in vhost_net
work as before - which is what the condition as written does.

If we add a check for the flag now (I don't know if that's correct or
not), that would be a second, unrelated change of behaviour in the same
patch. So if it's necessary, that's a preexisting problem and I'd argue
it doesn't belong in this patch, but should be done separately.

Kevin




[PULL 09/15] tests/qemu-iotests: Restrict test 130 to the 'file' protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

Using "-drive ...,backing.file.filename=..." only works with the
file protocol, but not with URIs, so mark this test accordingly.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-5-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/130 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/130 b/tests/qemu-iotests/130
index 7257f09677..7af85d20a8 100755
--- a/tests/qemu-iotests/130
+++ b/tests/qemu-iotests/130
@@ -42,7 +42,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.qemu
 
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 _supported_os Linux
 # We are going to use lazy-refcounts
 _unsupported_imgopts 'compat=0.10'
-- 
2.44.0




[PULL 06/15] tests/qemu-iotests: Fix test 033 for running with non-file protocols

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

When running iotest 033 with the ssh protocol, it fails with:

 033   fail   [14:48:31] [14:48:41]   10.2s   output mismatch
 --- /.../tests/qemu-iotests/033.out
 +++ /.../tests/qemu-iotests/scratch/qcow2-ssh-033/033.out.bad
 @@ -174,6 +174,7 @@
  512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 512/512 bytes at offset 2097152
  512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 +qemu-io: warning: Failed to truncate the tail of the image: ssh driver does not support shrinking files
  read 512/512 bytes at offset 0
  512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)

We already check for the qcow2 format here, so let's simply also
add a check for the protocol here, too, to only test the truncation
with the file protocol.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-2-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/033 | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/033 b/tests/qemu-iotests/033
index da9133c44b..4bc7a071bd 100755
--- a/tests/qemu-iotests/033
+++ b/tests/qemu-iotests/033
@@ -123,9 +123,9 @@ do_test 512 "write -P 1 0 0x200" "$TEST_IMG" | _filter_qemu_io
 # next L2 table
 do_test 512 "write -P 1 $L2_COVERAGE 0x200" "$TEST_IMG" | _filter_qemu_io
 
-# only interested in qcow2 here; also other formats might respond with
-#  "not supported" error message
-if [ $IMGFMT = "qcow2" ]; then
+# only interested in qcow2 with file protocol here; also other formats
+# might respond with "not supported" error message
+if [ $IMGFMT = "qcow2" ] && [ $IMGPROTO = "file" ]; then
 do_test 512 "truncate $L2_COVERAGE" "$TEST_IMG" | _filter_qemu_io
 fi
 
-- 
2.44.0




[PULL 02/15] nbd/server: Fix race in draining the export

2024-03-18 Thread Kevin Wolf
When draining an NBD export, nbd_drained_begin() first sets
client->quiescing so that nbd_client_receive_next_request() won't start
any new request coroutines. Then nbd_drained_poll() tries to makes sure
that we wait for any existing request coroutines by checking that
client->nb_requests has become 0.

However, there is a small window between creating a new request
coroutine and increasing client->nb_requests. If a coroutine is in this
state, it won't be waited for and drain returns too early.

In the context of switching to a different AioContext, this means that
blk_aio_attached() will see client->recv_coroutine != NULL and fail its
assertion.

Fix this by increasing client->nb_requests immediately when starting the
coroutine. Doing this after the checks if we should create a new
coroutine is okay because client->lock is held.

Cc: qemu-sta...@nongnu.org
Fixes: fd6afc501a01 ("nbd/server: Use drained block ops to quiesce the server")
Signed-off-by: Kevin Wolf 
Message-ID: <20240314165825.40261-2-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 nbd/server.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 941832f178..c3484cc1eb 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -3007,8 +3007,8 @@ static coroutine_fn int nbd_handle_request(NBDClient *client,
 /* Owns a reference to the NBDClient passed as opaque.  */
 static coroutine_fn void nbd_trip(void *opaque)
 {
-NBDClient *client = opaque;
-NBDRequestData *req = NULL;
+NBDRequestData *req = opaque;
+NBDClient *client = req->client;
 NBDRequest request = { 0 };    /* GCC thinks it can be used uninitialized */
 int ret;
 Error *local_err = NULL;
@@ -3037,8 +3037,6 @@ static coroutine_fn void nbd_trip(void *opaque)
 goto done;
 }
 
-req = nbd_request_get(client);
-
 /*
  * nbd_co_receive_request() returns -EAGAIN when nbd_drained_begin() has
  * set client->quiescing but by the time we get back nbd_drained_end() may
@@ -3112,9 +3110,7 @@ static coroutine_fn void nbd_trip(void *opaque)
 }
 
 done:
-if (req) {
-nbd_request_put(req);
-}
+nbd_request_put(req);
 
 qemu_mutex_unlock(>lock);
 
@@ -3143,10 +3139,13 @@ disconnect:
  */
 static void nbd_client_receive_next_request(NBDClient *client)
 {
+NBDRequestData *req;
+
 if (!client->recv_coroutine && client->nb_requests < MAX_NBD_REQUESTS &&
 !client->quiescing) {
 nbd_client_get(client);
-client->recv_coroutine = qemu_coroutine_create(nbd_trip, client);
+req = nbd_request_get(client);
+client->recv_coroutine = qemu_coroutine_create(nbd_trip, req);
 aio_co_schedule(client->exp->common.ctx, client->recv_coroutine);
 }
 }
-- 
2.44.0




[PULL 12/15] tests/qemu-iotests: Restrict tests that use --image-opts to the 'file' protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

These tests 188, 189 and 198 use qemu-io with --image-opts with additional
hard-coded parameters for the file protocol, so they cannot work for other
protocols. Thus we have to limit these tests to the file protocol only.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-8-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/188 | 2 +-
 tests/qemu-iotests/189 | 2 +-
 tests/qemu-iotests/198 | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/188 b/tests/qemu-iotests/188
index ce087d1873..2950b1dc31 100755
--- a/tests/qemu-iotests/188
+++ b/tests/qemu-iotests/188
@@ -38,7 +38,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.filter
 
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 _supported_os Linux
 _require_working_luks
 
diff --git a/tests/qemu-iotests/189 b/tests/qemu-iotests/189
index 801494c6b9..008f73b07d 100755
--- a/tests/qemu-iotests/189
+++ b/tests/qemu-iotests/189
@@ -38,7 +38,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.filter
 
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 _supported_os Linux
 _require_working_luks
 
diff --git a/tests/qemu-iotests/198 b/tests/qemu-iotests/198
index 1c93dea1f7..6ddeffddd2 100755
--- a/tests/qemu-iotests/198
+++ b/tests/qemu-iotests/198
@@ -38,7 +38,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.filter
 
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 _supported_os Linux
 _require_working_luks
 
-- 
2.44.0




[PULL 10/15] tests/qemu-iotests: Restrict test 134 and 158 to the 'file' protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

Commit b25b387fa592 updated the iotests 134 and 158 to use the --image-opts
parameter for qemu-io with file protocol related options, but forgot to
update the _supported_proto line accordingly. So let's do that now.

Fixes: b25b387fa5 ("qcow2: convert QCow2 to use QCryptoBlock for encryption")
Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-6-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/134 | 2 +-
 tests/qemu-iotests/158 | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/134 b/tests/qemu-iotests/134
index ded153c0b9..b2c3c03f08 100755
--- a/tests/qemu-iotests/134
+++ b/tests/qemu-iotests/134
@@ -38,7 +38,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.filter
 
 _supported_fmt qcow qcow2
-_supported_proto generic
+_supported_proto file
 
 
 size=128M
diff --git a/tests/qemu-iotests/158 b/tests/qemu-iotests/158
index a95878e4ce..3a9ad7eed0 100755
--- a/tests/qemu-iotests/158
+++ b/tests/qemu-iotests/158
@@ -38,7 +38,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.filter
 
 _supported_fmt qcow qcow2
-_supported_proto generic
+_supported_proto file
 
 
 size=128M
-- 
2.44.0




[PULL 11/15] tests/qemu-iotests: Restrict test 156 to the 'file' protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

The test fails completely when you try to use it with a different
protocol, e.g. with "./check -ssh -qcow2 156".
The test uses some hand-crafted JSON statements which cannot work with other
protocols, thus let's change this test to only support the 'file' protocol.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-7-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/156 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/156 b/tests/qemu-iotests/156
index a9540bd80d..97c2d86ce5 100755
--- a/tests/qemu-iotests/156
+++ b/tests/qemu-iotests/156
@@ -50,7 +50,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.qemu
 
 _supported_fmt qcow2 qed
-_supported_proto generic
+_supported_proto file
 # Copying files around with cp does not work with external data files
 _unsupported_imgopts data_file
 
-- 
2.44.0




[PULL 13/15] tests/qemu-iotests: Fix some tests that use --image-opts for other protocols

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

Tests 263, 284 and detect-zeroes-registered-buf use qemu-io
with --image-opts so we have to enforce IMGOPTSSYNTAX=true here
to get $TEST_IMG in shape for other protocols than "file".

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-9-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/263| 6 --
 tests/qemu-iotests/284| 7 +++
 tests/qemu-iotests/tests/detect-zeroes-registered-buf | 4 +++-
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/tests/qemu-iotests/263 b/tests/qemu-iotests/263
index ec09b41405..44fdada0d6 100755
--- a/tests/qemu-iotests/263
+++ b/tests/qemu-iotests/263
@@ -34,6 +34,8 @@ _cleanup()
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
+IMGOPTSSYNTAX=true
+
 # get standard environment, filters and checks
 . ./common.rc
 . ./common.filter
@@ -73,7 +75,7 @@ echo "testing LUKS qcow2 encryption"
 echo
 
 _make_test_img --object $SECRET -o "encrypt.format=luks,encrypt.key-secret=sec0,encrypt.iter-time=10,cluster_size=64K" $size
-_run_test "driver=$IMGFMT,encrypt.key-secret=sec0,file.filename=$TEST_IMG"
+_run_test "$TEST_IMG,encrypt.key-secret=sec0"
 _cleanup_test_img
 
 echo
@@ -82,7 +84,7 @@ echo
 
 
 _make_test_img --object $SECRET -o "encrypt.format=aes,encrypt.key-secret=sec0,cluster_size=64K" $size
-_run_test "driver=$IMGFMT,encrypt.key-secret=sec0,file.filename=$TEST_IMG"
+_run_test "$TEST_IMG,encrypt.key-secret=sec0"
 _cleanup_test_img
 
 
diff --git a/tests/qemu-iotests/284 b/tests/qemu-iotests/284
index 5a82639e7f..722267486d 100755
--- a/tests/qemu-iotests/284
+++ b/tests/qemu-iotests/284
@@ -33,6 +33,8 @@ _cleanup()
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
+IMGOPTSSYNTAX=true
+
 # get standard environment, filters and checks
 . ./common.rc
 . ./common.filter
@@ -47,14 +49,12 @@ size=1M
 
 SECRET="secret,id=sec0,data=astrochicken"
 
-IMGSPEC="driver=$IMGFMT,file.filename=$TEST_IMG,encrypt.key-secret=sec0"
 QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
 
 _run_test()
 {
-IMGOPTSSYNTAX=true
 OLD_TEST_IMG="$TEST_IMG"
-TEST_IMG="driver=$IMGFMT,file.filename=$TEST_IMG,encrypt.key-secret=sec0"
+TEST_IMG="$TEST_IMG,encrypt.key-secret=sec0"
 QEMU_IMG_EXTRA_ARGS="--image-opts --object $SECRET"
 
 echo
@@ -78,7 +78,6 @@ _run_test()
 
 TEST_IMG="$OLD_TEST_IMG"
 QEMU_IMG_EXTRA_ARGS=
-IMGOPTSSYNTAX=
 }
 
 
diff --git a/tests/qemu-iotests/tests/detect-zeroes-registered-buf b/tests/qemu-iotests/tests/detect-zeroes-registered-buf
index edb5f2cee5..5eaf34e5a6 100755
--- a/tests/qemu-iotests/tests/detect-zeroes-registered-buf
+++ b/tests/qemu-iotests/tests/detect-zeroes-registered-buf
@@ -36,6 +36,8 @@ _cleanup()
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
+IMGOPTSSYNTAX=true
+
 # get standard environment, filters and checks
 cd ..
 . ./common.rc
@@ -46,7 +48,7 @@ _supported_proto generic
 
 size=128M
 _make_test_img $size
-IMGSPEC="driver=$IMGFMT,file.filename=$TEST_IMG,discard=unmap,detect-zeroes=unmap"
+IMGSPEC="$TEST_IMG,discard=unmap,detect-zeroes=unmap"
 
 echo
 echo "== writing zero buffer to image =="
-- 
2.44.0




[PULL 04/15] blockdev: Fix blockdev-snapshot-sync error reporting for no medium

2024-03-18 Thread Kevin Wolf
From: Markus Armbruster 

When external_snapshot_abort() rejects a BlockDriverState without a
medium, it creates an error like this:

error_setg(errp, "Device '%s' has no medium", device);

Trouble is @device can be null.  My system formats null as "(null)",
but other systems might crash.  Reproducer:

1. Create a block device without a medium

-> {"execute": "blockdev-add", "arguments": {"driver": "host_cdrom", "node-name": "blk0", "filename": "/dev/sr0"}}
<- {"return": {}}

3. Attempt to snapshot it

-> {"execute":"blockdev-snapshot-sync", "arguments": { "node-name": "blk0", "snapshot-file":"/tmp/foo.qcow2","format":"qcow2"}}
<- {"error": {"class": "GenericError", "desc": "Device '(null)' has no medium"}}

Broken when commit 0901f67ecdb made @device optional.

Use bdrv_get_device_or_node_name() instead.  Now it fails as it
should:

    <- {"error": {"class": "GenericError", "desc": "Device 'blk0' has no medium"}}

Fixes: 0901f67ecdb7 ("qmp: Allow to take external snapshots on bs graphs node.")
Signed-off-by: Markus Armbruster 
Message-ID: <20240306142831.2514431-1-arm...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 blockdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/blockdev.c b/blockdev.c
index d8fb3399f5..057601dcf0 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1395,7 +1395,8 @@ static void external_snapshot_action(TransactionAction *action,
 bdrv_drained_begin(state->old_bs);
 
 if (!bdrv_is_inserted(state->old_bs)) {
-error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, device);
+error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM,
+   bdrv_get_device_or_node_name(state->old_bs));
 return;
 }
 
-- 
2.44.0




[PULL 01/15] mirror: Don't call job_pause_point() under graph lock

2024-03-18 Thread Kevin Wolf
Calling job_pause_point() while holding the graph reader lock
potentially results in a deadlock: bdrv_graph_wrlock() first drains
everything, including the mirror job, which pauses it. The job is only
unpaused at the end of the drain section, which is when the graph writer
lock has been successfully taken. However, if the job happens to be
paused at a pause point where it still holds the reader lock, the writer
lock can't be taken as long as the job is still paused.

Mark job_pause_point() as GRAPH_UNLOCKED and fix mirror accordingly.

Cc: qemu-sta...@nongnu.org
Buglink: https://issues.redhat.com/browse/RHEL-28125
Fixes: 004915a96a7a ("block: Protect bs->backing with graph_lock")
Signed-off-by: Kevin Wolf 
Message-ID: <20240313153000.33121-1-kw...@redhat.com>
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 include/qemu/job.h |  2 +-
 block/mirror.c | 10 ++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 9ea98b5927..2b873f2576 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -483,7 +483,7 @@ void job_enter(Job *job);
  *
  * Called with job_mutex *not* held.
  */
-void coroutine_fn job_pause_point(Job *job);
+void coroutine_fn GRAPH_UNLOCKED job_pause_point(Job *job);
 
 /**
  * @job: The job that calls the function.
diff --git a/block/mirror.c b/block/mirror.c
index 5145eb53e1..1bdce3b657 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -479,9 +479,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t 
offset,
 return bytes_handled;
 }
 
-static void coroutine_fn GRAPH_RDLOCK mirror_iteration(MirrorBlockJob *s)
+static void coroutine_fn GRAPH_UNLOCKED mirror_iteration(MirrorBlockJob *s)
 {
-BlockDriverState *source = s->mirror_top_bs->backing->bs;
+BlockDriverState *source;
 MirrorOp *pseudo_op;
 int64_t offset;
 /* At least the first dirty chunk is mirrored in one iteration. */
@@ -489,6 +489,10 @@ static void coroutine_fn GRAPH_RDLOCK 
mirror_iteration(MirrorBlockJob *s)
 bool write_zeroes_ok = bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
 int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);
 
+bdrv_graph_co_rdlock();
+source = s->mirror_top_bs->backing->bs;
+bdrv_graph_co_rdunlock();
+
 bdrv_dirty_bitmap_lock(s->dirty_bitmap);
 offset = bdrv_dirty_iter_next(s->dbi);
 if (offset < 0) {
@@ -1066,9 +1070,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_wait_for_free_in_flight_slot(s);
 continue;
 } else if (cnt != 0) {
-bdrv_graph_co_rdlock();
 mirror_iteration(s);
-bdrv_graph_co_rdunlock();
 }
 }
 
-- 
2.44.0




[PULL 08/15] tests/qemu-iotests: Restrict test 114 to the 'file' protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

iotest 114 uses "truncate" and the qcow2.py script on the destination file,
which both cannot deal with URIs. Thus this test needs the "file" protocol,
otherwise it fails with an error message like this:

 truncate: cannot open 
'ssh://127.0.0.1/tmp/qemu-build/tests/qemu-iotests/scratch/qcow2-ssh-114/t.qcow2.orig'
  for writing: No such file or directory

Thus mark this test for "file protocol only" accordingly.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-4-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/114 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/114 b/tests/qemu-iotests/114
index de6fd327ee..dccc71008b 100755
--- a/tests/qemu-iotests/114
+++ b/tests/qemu-iotests/114
@@ -38,7 +38,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 . ./common.filter
 
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 # At least OpenBSD doesn't seem to have truncate
 _supported_os Linux
 # qcow2.py does not work too well with external data files
-- 
2.44.0




[PULL 14/15] tests/qemu-iotests: Restrict tests using "--blockdev file" to the file protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

Tests that use "--blockdev" with the "file" driver cannot work with
other protocols, so we should mark them accordingly.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-10-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/tests/qcow2-internal-snapshots | 2 +-
 tests/qemu-iotests/tests/qsd-jobs | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/tests/qcow2-internal-snapshots 
b/tests/qemu-iotests/tests/qcow2-internal-snapshots
index 36523aba06..9f83aa8903 100755
--- a/tests/qemu-iotests/tests/qcow2-internal-snapshots
+++ b/tests/qemu-iotests/tests/qcow2-internal-snapshots
@@ -39,7 +39,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 
 # This tests qcow2-specific low-level functionality
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 # Internal snapshots are (currently) impossible with refcount_bits=1,
 # and generally impossible with external data files
 _unsupported_imgopts 'compat=0.10' 'refcount_bits=1[^0-9]' data_file
diff --git a/tests/qemu-iotests/tests/qsd-jobs 
b/tests/qemu-iotests/tests/qsd-jobs
index 510bf0a9dc..9b843af631 100755
--- a/tests/qemu-iotests/tests/qsd-jobs
+++ b/tests/qemu-iotests/tests/qsd-jobs
@@ -40,7 +40,7 @@ cd ..
 . ./common.filter
 
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 
 size=128M
 
-- 
2.44.0




[PULL 03/15] iotests: Add test for reset/AioContext switches with NBD exports

2024-03-18 Thread Kevin Wolf
This replicates the scenario in which the bug was reported.
Unfortunately this relies on actually executing a guest (so that the
firmware initialises the virtio-blk device and moves it to its
configured iothread), so this can't make use of the qtest accelerator
like most other test cases. I tried to find a different easy way to
trigger the bug, but couldn't find one.

Signed-off-by: Kevin Wolf 
Message-ID: <20240314165825.40261-3-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/tests/iothreads-nbd-export | 66 +++
 .../tests/iothreads-nbd-export.out| 19 ++
 2 files changed, 85 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/iothreads-nbd-export
 create mode 100644 tests/qemu-iotests/tests/iothreads-nbd-export.out

diff --git a/tests/qemu-iotests/tests/iothreads-nbd-export 
b/tests/qemu-iotests/tests/iothreads-nbd-export
new file mode 100755
index 00..037260729c
--- /dev/null
+++ b/tests/qemu-iotests/tests/iothreads-nbd-export
@@ -0,0 +1,66 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Copyright (C) 2024 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# Creator/Owner: Kevin Wolf 
+
+import time
+import qemu
+import iotests
+
+iotests.script_initialize(supported_fmts=['qcow2'],
+  supported_platforms=['linux'])
+
+with iotests.FilePath('disk1.img') as path, \
+ iotests.FilePath('nbd.sock', base_dir=iotests.sock_dir) as nbd_sock, \
+ qemu.machine.QEMUMachine(iotests.qemu_prog) as vm:
+
+img_size = '10M'
+
+iotests.log('Preparing disk...')
+iotests.qemu_img_create('-f', iotests.imgfmt, path, img_size)
+vm.add_args('-blockdev', f'file,node-name=disk-file,filename={path}')
+vm.add_args('-blockdev', 'qcow2,node-name=disk,file=disk-file')
+vm.add_args('-object', 'iothread,id=iothread0')
+vm.add_args('-device',
+'virtio-blk,drive=disk,iothread=iothread0,share-rw=on')
+
+iotests.log('Launching VM...')
+vm.add_args('-accel', 'kvm', '-accel', 'tcg')
+#vm.add_args('-accel', 'qtest')
+vm.launch()
+
+iotests.log('Exporting to NBD...')
+iotests.log(vm.qmp('nbd-server-start',
+   addr={'type': 'unix', 'data': {'path': nbd_sock}}))
+iotests.log(vm.qmp('block-export-add', type='nbd', id='exp0',
+   node_name='disk', writable=True))
+
+iotests.log('Connecting qemu-img...')
+qemu_io = iotests.QemuIoInteractive('-f', 'raw',
+f'nbd+unix:///disk?socket={nbd_sock}')
+
+iotests.log('Moving the NBD export to a different iothread...')
+for i in range(0, 10):
+iotests.log(vm.qmp('system_reset'))
+time.sleep(0.1)
+
+iotests.log('Checking that it is still alive...')
+iotests.log(vm.qmp('query-status'))
+
+qemu_io.close()
+vm.shutdown()
diff --git a/tests/qemu-iotests/tests/iothreads-nbd-export.out 
b/tests/qemu-iotests/tests/iothreads-nbd-export.out
new file mode 100644
index 00..bc514e35e5
--- /dev/null
+++ b/tests/qemu-iotests/tests/iothreads-nbd-export.out
@@ -0,0 +1,19 @@
+Preparing disk...
+Launching VM...
+Exporting to NBD...
+{"return": {}}
+{"return": {}}
+Connecting qemu-img...
+Moving the NBD export to a different iothread...
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+Checking that it is still alive...
+{"return": {"running": true, "status": "running"}}
-- 
2.44.0




[PULL 15/15] iotests: adapt to output change for recently introduced 'detached header' field

2024-03-18 Thread Kevin Wolf
From: Fiona Ebner 

Failure was noticed when running the tests for the qcow2 image format.

Fixes: 0bd779e27e ("crypto: Introduce 'detached-header' field in 
QCryptoBlockInfoLUKS")
Signed-off-by: Fiona Ebner 
Message-ID: <20240216101415.293769-1-f.eb...@proxmox.com>
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/198.out | 2 ++
 tests/qemu-iotests/206.out | 1 +
 2 files changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/198.out b/tests/qemu-iotests/198.out
index 805494916f..62fb73fa3e 100644
--- a/tests/qemu-iotests/198.out
+++ b/tests/qemu-iotests/198.out
@@ -39,6 +39,7 @@ Format specific information:
 compression type: COMPRESSION_TYPE
 encrypt:
 ivgen alg: plain64
+detached header: false
 hash alg: sha256
 cipher alg: aes-256
 uuid: ----
@@ -84,6 +85,7 @@ Format specific information:
 compression type: COMPRESSION_TYPE
 encrypt:
 ivgen alg: plain64
+detached header: false
 hash alg: sha256
 cipher alg: aes-256
 uuid: ----
diff --git a/tests/qemu-iotests/206.out b/tests/qemu-iotests/206.out
index 7e95694777..979f00f9bf 100644
--- a/tests/qemu-iotests/206.out
+++ b/tests/qemu-iotests/206.out
@@ -114,6 +114,7 @@ Format specific information:
 refcount bits: 16
 encrypt:
 ivgen alg: plain64
+detached header: false
 hash alg: sha1
 cipher alg: aes-128
 uuid: ----
-- 
2.44.0




[PULL 00/15] Block layer patches

2024-03-18 Thread Kevin Wolf
The following changes since commit ba49d760eb04630e7b15f423ebecf6c871b8f77b:

  Merge tag 'pull-maintainer-final-130324-1' of https://gitlab.com/stsquad/qemu 
into staging (2024-03-13 15:12:14 +)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 39a94d7c34ce9d222fa9c0c99a14e20a567456d7:

  iotests: adapt to output change for recently introduced 'detached header' 
field (2024-03-18 13:33:54 +0100)


Block layer patches

- mirror: Fix deadlock
- nbd/server: Fix race in draining the export
- qemu-img snapshot: Fix formatting with large values
- Fix blockdev-snapshot-sync error reporting for no medium
- iotests fixes


Abhiram Tilak (1):
  qemu-img: Fix Column Width and Improve Formatting in snapshot list

Fiona Ebner (1):
  iotests: adapt to output change for recently introduced 'detached header' 
field

Kevin Wolf (3):
  mirror: Don't call job_pause_point() under graph lock
  nbd/server: Fix race in draining the export
  iotests: Add test for reset/AioContext switches with NBD exports

Markus Armbruster (1):
  blockdev: Fix blockdev-snapshot-sync error reporting for no medium

Thomas Huth (9):
  tests/qemu-iotests: Fix test 033 for running with non-file protocols
  tests/qemu-iotests: Restrict test 066 to the 'file' protocol
  tests/qemu-iotests: Restrict test 114 to the 'file' protocol
  tests/qemu-iotests: Restrict test 130 to the 'file' protocol
  tests/qemu-iotests: Restrict test 134 and 158 to the 'file' protocol
  tests/qemu-iotests: Restrict test 156 to the 'file' protocol
  tests/qemu-iotests: Restrict tests that use --image-opts to the 'file' 
protocol
  tests/qemu-iotests: Fix some tests that use --image-opts for other 
protocols
  tests/qemu-iotests: Restrict tests using "--blockdev file" to the file 
protocol

 include/qemu/job.h |  2 +-
 block/mirror.c | 10 ++--
 block/qapi.c   | 10 ++--
 blockdev.c |  3 +-
 nbd/server.c   | 15 +++--
 tests/qemu-iotests/033 |  6 +-
 tests/qemu-iotests/066 |  2 +-
 tests/qemu-iotests/114 |  2 +-
 tests/qemu-iotests/130 |  2 +-
 tests/qemu-iotests/134 |  2 +-
 tests/qemu-iotests/156 |  2 +-
 tests/qemu-iotests/158 |  2 +-
 tests/qemu-iotests/176.out | 16 +++---
 tests/qemu-iotests/188 |  2 +-
 tests/qemu-iotests/189 |  2 +-
 tests/qemu-iotests/198 |  2 +-
 tests/qemu-iotests/198.out |  2 +
 tests/qemu-iotests/206.out |  1 +
 tests/qemu-iotests/261 |  4 +-
 tests/qemu-iotests/263 |  6 +-
 tests/qemu-iotests/267.out | 48 
 tests/qemu-iotests/284 |  7 +--
 tests/qemu-iotests/286 |  3 +-
 tests/qemu-iotests/286.out |  2 +-
 .../tests/detect-zeroes-registered-buf |  4 +-
 tests/qemu-iotests/tests/iothreads-nbd-export  | 66 ++
 tests/qemu-iotests/tests/iothreads-nbd-export.out  | 19 +++
 tests/qemu-iotests/tests/qcow2-internal-snapshots  |  2 +-
 .../tests/qcow2-internal-snapshots.out | 14 ++---
 tests/qemu-iotests/tests/qsd-jobs  |  2 +-
 30 files changed, 178 insertions(+), 82 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/iothreads-nbd-export
 create mode 100644 tests/qemu-iotests/tests/iothreads-nbd-export.out




[PULL 05/15] qemu-img: Fix Column Width and Improve Formatting in snapshot list

2024-03-18 Thread Kevin Wolf
From: Abhiram Tilak 

When running the command `qemu-img snapshot -l SNAPSHOT` the output of
VM_CLOCK (measures the offset between host and VM clock) cannot
accommodate values in the order of thousands (4-digit).

This line [1] hints at the problem. Additionally, the column width for
the VM_CLOCK field was reduced from 15 to 13 spaces in commit b39847a5
in line [2], resulting in a shortage of space.

[1]:
https://gitlab.com/qemu-project/qemu/-/blob/master/block/qapi.c?ref_type=heads#L753
[2]:
https://gitlab.com/qemu-project/qemu/-/blob/master/block/qapi.c?ref_type=heads#L763

This patch restores the column width to 15 spaces and makes adjustments
to the affected iotests accordingly. Furthermore, it addresses a
potential source of confusion by removing whitespace in column headers:
for example, VM CLOCK is modified to VM_CLOCK. Additionally, a '--'
symbol is introduced when ICOUNT returns no output, for clarity.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2062
Fixes: b39847a50553 ("migration: introduce icount field for snapshots")
Signed-off-by: Abhiram Tilak 
Message-ID: <20240123050354.22152-2-atp@gmail.com>
[kwolf: Fixed up qemu-iotests 261 and 286]
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/qapi.c  | 10 ++--
 tests/qemu-iotests/176.out| 16 +++
 tests/qemu-iotests/261|  4 +-
 tests/qemu-iotests/267.out| 48 +--
 tests/qemu-iotests/286|  3 +-
 tests/qemu-iotests/286.out|  2 +-
 .../tests/qcow2-internal-snapshots.out| 14 +++---
 7 files changed, 50 insertions(+), 47 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 31183d4933..2b5793f1d9 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -742,15 +742,15 @@ void bdrv_snapshot_dump(QEMUSnapshotInfo *sn)
 char *sizing = NULL;
 
 if (!sn) {
-qemu_printf("%-10s%-17s%8s%20s%13s%11s",
-"ID", "TAG", "VM SIZE", "DATE", "VM CLOCK", "ICOUNT");
+qemu_printf("%-7s %-16s %8s %19s %15s %10s",
+"ID", "TAG", "VM_SIZE", "DATE", "VM_CLOCK", "ICOUNT");
 } else {
 g_autoptr(GDateTime) date = 
g_date_time_new_from_unix_local(sn->date_sec);
 g_autofree char *date_buf = g_date_time_format(date, "%Y-%m-%d 
%H:%M:%S");
 
 secs = sn->vm_clock_nsec / 1000000000;
 snprintf(clock_buf, sizeof(clock_buf),
- "%02d:%02d:%02d.%03d",
+ "%04d:%02d:%02d.%03d",
  (int)(secs / 3600),
  (int)((secs / 60) % 60),
  (int)(secs % 60),
@@ -759,8 +759,10 @@ void bdrv_snapshot_dump(QEMUSnapshotInfo *sn)
 if (sn->icount != -1ULL) {
 snprintf(icount_buf, sizeof(icount_buf),
 "%"PRId64, sn->icount);
+} else {
+snprintf(icount_buf, sizeof(icount_buf), "--");
 }
-qemu_printf("%-9s %-16s %8s%20s%13s%11s",
+qemu_printf("%-7s %-16s %8s %19s %15s %10s",
 sn->id_str, sn->name,
 sizing,
 date_buf,
diff --git a/tests/qemu-iotests/176.out b/tests/qemu-iotests/176.out
index 45e9153ef3..9c73ef2eea 100644
--- a/tests/qemu-iotests/176.out
+++ b/tests/qemu-iotests/176.out
@@ -37,8 +37,8 @@ Offset  Length  File
 0x7ffe  0x2 TEST_DIR/t.IMGFMT.itmd
 0x8340  0x200   TEST_DIR/t.IMGFMT.itmd
 Snapshot list:
-IDTAG
-1 snap
+ID  TAG
+1   snap
 
 === Test pass snapshot.1 ===
 
@@ -78,8 +78,8 @@ Offset  Length  File
 0x7fff  0x1 TEST_DIR/t.IMGFMT
 0x8340  0x200   TEST_DIR/t.IMGFMT
 Snapshot list:
-IDTAG
-1 snap
+ID  TAG
+1   snap
 
 === Test pass snapshot.2 ===
 
@@ -119,8 +119,8 @@ Offset  Length  File
 0x7fff  0x1 TEST_DIR/t.IMGFMT
 0x8340  0x200   TEST_DIR/t.IMGFMT
 Snapshot list:
-IDTAG
-1 snap
+ID  TAG
+1   snap
 
 === Test pass snapshot.3 ===
 
@@ -157,8 +157,8 @@ Offset  Length  File
 0x7fff  0x1 TEST_DIR/t.IMGFMT
 0x8340  0x200   TEST_DIR/t.IMGFMT
 Snapshot list:
-IDTAG
-1 snap
+ID  TAG
+1   snap
 
 === Test pass bitmap.0 ===
 
diff --git a/tests/qemu-iotests/261 b/tests/qemu-iotests/261
index b73da565da..22b969d310 100755
--- a/tests/qemu-iotests/261
+++ b/tests/qemu-iotests/261
@@ -393,7 +393,7 @@ _check_test_img -r all
 
 echo
 echo "$((sn_count - 1)) snapshots should remain:"
-echo "  qemu-img info reports $(_img_info | grep -c '^ \{32\}') snaps

[PULL 07/15] tests/qemu-iotests: Restrict test 066 to the 'file' protocol

2024-03-18 Thread Kevin Wolf
From: Thomas Huth 

The hand-crafted json statement in this test only works if the test
is run with the "file" protocol, so mark this test accordingly.

Signed-off-by: Thomas Huth 
Message-ID: <2024031508.153201-3-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/066 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/066 b/tests/qemu-iotests/066
index cf63144cb9..336d8565dd 100755
--- a/tests/qemu-iotests/066
+++ b/tests/qemu-iotests/066
@@ -39,7 +39,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 
 # This tests qcow2-specific low-level functionality
 _supported_fmt qcow2
-_supported_proto generic
+_supported_proto file
 # We need zero clusters and snapshots
 # (TODO: Consider splitting the snapshot part into a separate test
 #file, so this one runs with refcount_bits=1 and data_file)
-- 
2.44.0




Re: [PATCH] iotests: adapt to output change for recently introduced 'detached header' field

2024-03-18 Thread Kevin Wolf
Am 16.02.2024 um 11:14 hat Fiona Ebner geschrieben:
> Failure was noticed when running the tests for the qcow2 image format.
> 
> Fixes: 0bd779e27e ("crypto: Introduce 'detached-header' field in 
> QCryptoBlockInfoLUKS")
> Signed-off-by: Fiona Ebner 

Thanks, applied to the block branch.

Kevin




Re: [PATCH 0/9] tests/qemu-iotests: Fix running with "check -ssh -qcow2"

2024-03-18 Thread Kevin Wolf
Am 15.03.2024 um 12:10 hat Thomas Huth geschrieben:
> I recently wanted to check for some changes that I did to the URI handling
> in the block layer code, but I had to discover that a lot of iotests only
> work with the raw file format when using a protocol that is not "file",
> i.e. "./check -ssh -qcow2" shows a lot of failures.
> While some tests could be fixed to work with the "ssh" protocol, too,
> many other tests seem to be written for the "file" protocol only and
> thus have to be marked accordingly.
> 
> After applying these patches, there is still one failure left in test 181
> where I'm unsure whether it's a real bug or whether this test should also
> simply be marked to work with the "file" protocol only. Suggestions are
> welcome!
> 
> Thomas Huth (9):
>   tests/qemu-iotests: Fix test 033 for running with non-file protocols
>   tests/qemu-iotests: Restrict test 066 to the 'file' protocol
>   tests/qemu-iotests: Restrict test 114 to the 'file' protocol
>   tests/qemu-iotests: Restrict test 130 to the 'file' protocol
>   tests/qemu-iotests: Restrict test 134 and 158 to the 'file' protocol
>   tests/qemu-iotests: Restrict test 156 to the 'file' protocol
>   tests/qemu-iotests: Restrict tests that use --image-opts to the 'file'
> protocol
>   tests/qemu-iotests: Fix some tests that use --image-opts for other
> protocols
>   tests/qemu-iotests: Restrict tests using "--blockdev file" to the file
> protocol

Thanks, applied to the block branch.

Kevin




Re: [PATCH] blockdev: Fix blockdev-snapshot-sync error reporting for no medium

2024-03-18 Thread Kevin Wolf
Am 06.03.2024 um 15:28 hat Markus Armbruster geschrieben:
> When external_snapshot_abort() rejects a BlockDriverState without a
> medium, it creates an error like this:
> 
> error_setg(errp, "Device '%s' has no medium", device);
> 
> Trouble is @device can be null.  My system formats null as "(null)",
> but other systems might crash.  Reproducer:
> 
> 1. Create a block device without a medium
> 
> -> {"execute": "blockdev-add", "arguments": {"driver": "host_cdrom", 
> "node-name": "blk0", "filename": "/dev/sr0"}}
> <- {"return": {}}
> 
> 2. Attempt to snapshot it
> 
> -> {"execute":"blockdev-snapshot-sync", "arguments": { "node-name": 
> "blk0", "snapshot-file":"/tmp/foo.qcow2","format":"qcow2"}}
> <- {"error": {"class": "GenericError", "desc": "Device '(null)' has no 
> medium"}}
> 
> Broken when commit 0901f67ecdb made @device optional.
> 
> Use bdrv_get_device_or_node_name() instead.  Now it fails as it
> should:
> 
> <- {"error": {"class": "GenericError", "desc": "Device 'blk0' has no 
> medium"}}
> 
> Fixes: 0901f67ecdb7 (qmp: Allow to take external snapshots on bs graphs node.)
> Signed-off-by: Markus Armbruster 

Thanks, applied to the block branch.

Kevin




Re: [PATCH for-9.0] mirror: Don't call job_pause_point() under graph lock

2024-03-18 Thread Kevin Wolf
Am 14.03.2024 um 15:29 hat Stefan Hajnoczi geschrieben:
> On Wed, Mar 13, 2024 at 04:30:00PM +0100, Kevin Wolf wrote:
> > Calling job_pause_point() while holding the graph reader lock
> > potentially results in a deadlock: bdrv_graph_wrlock() first drains
> > everything, including the mirror job, which pauses it. The job is only
> > unpaused at the end of the drain section, which is when the graph writer
> > lock has been successfully taken. However, if the job happens to be
> > paused at a pause point where it still holds the reader lock, the writer
> > lock can't be taken as long as the job is still paused.
> > 
> > Mark job_pause_point() as GRAPH_UNLOCKED and fix mirror accordingly.
> > 
> > Cc: qemu-sta...@nongnu.org
> > Buglink: https://issues.redhat.com/browse/RHEL-28125
> > Fixes: 004915a96a7a40e942ac85e6d22518cbcd283506
> > Signed-off-by: Kevin Wolf 
> > ---
> >  include/qemu/job.h |  2 +-
> >  block/mirror.c | 10 ++
> >  2 files changed, 7 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/qemu/job.h b/include/qemu/job.h
> > index 9ea98b5927..2b873f2576 100644
> > --- a/include/qemu/job.h
> > +++ b/include/qemu/job.h
> > @@ -483,7 +483,7 @@ void job_enter(Job *job);
> >   *
> >   * Called with job_mutex *not* held.
> >   */
> > -void coroutine_fn job_pause_point(Job *job);
> > +void coroutine_fn GRAPH_UNLOCKED job_pause_point(Job *job);
> >  
> >  /**
> >   * @job: The job that calls the function.
> > diff --git a/block/mirror.c b/block/mirror.c
> > index 5145eb53e1..1bdce3b657 100644
> > --- a/block/mirror.c
> > +++ b/block/mirror.c
> > @@ -479,9 +479,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, 
> > int64_t offset,
> >  return bytes_handled;
> >  }
> >  
> > -static void coroutine_fn GRAPH_RDLOCK mirror_iteration(MirrorBlockJob *s)
> > +static void coroutine_fn GRAPH_UNLOCKED mirror_iteration(MirrorBlockJob *s)
> >  {
> > -BlockDriverState *source = s->mirror_top_bs->backing->bs;
> > +BlockDriverState *source;
> >  MirrorOp *pseudo_op;
> >  int64_t offset;
> >  /* At least the first dirty chunk is mirrored in one iteration. */
> > @@ -489,6 +489,10 @@ static void coroutine_fn GRAPH_RDLOCK 
> > mirror_iteration(MirrorBlockJob *s)
> >  bool write_zeroes_ok = 
> > bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
> >  int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);
> >  
> > +bdrv_graph_co_rdlock();
> > +source = s->mirror_top_bs->backing->bs;
> 
> Is bdrv_ref(source) needed here so that source cannot go away if someone
> else write locks the graph and removes it? Or maybe something else
> protects against that. Either way, please add a comment that explains
> why this is safe.

We didn't even get to looking at this level of detail with the graph
locking work. We probably should, but this is not the only place in
mirror we need to look at then. Commit 004915a9 just took the lazy path
of taking the lock for the whole function, and it turns out that this
was wrong and causes deadlocks, so I'm reverting it and replacing it
with what other parts of the code do - the minimal thing to let it
compile.

I think we already own a reference, we do a block_job_add_bdrv() in
mirror_start_job(). But once it changes, we have a reference to the
wrong node. So it looks to me that mirror has a problem with a changing
source node that is more fundamental than graph locking in one specific
function because it stores BDS pointers in its state.

Active commit already freezes the backing chain between mirror_top_bs
and target, maybe other mirror jobs need to freeze the link between
mirror_top_bs and source at least.

So I agree that it might be worth looking into this more, but I consider
it unrelated to this patch. We just go back to the state in which it has
always been before 8.2 (which might contain a latent bug that apparently
never triggered in practice) to fix a regression that we do see in
practice.

Kevin


signature.asc
Description: PGP signature


[PATCH for-9.0 v3] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-15 Thread Kevin Wolf
VDUSE requires that virtqueues are first enabled before the DRIVER_OK
status flag is set; with the current API of the kernel module, it is
impossible to enforce the opposite order in our block export code because
userspace is not notified when a virtqueue is enabled.

This requirement also matches the normal initialisation order as done by
the generic vhost code in QEMU. However, commit 6c482547 accidentally
changed the order for vdpa-dev and broke access to VDUSE devices with
this.

This changes vdpa-dev to use the normal order again and use the standard
vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
used with vdpa-dev again after this fix.

vhost_net intentionally avoided enabling the vrings for vdpa and does
this manually later while it does enable them for other vhost backends.
Reflect this in the vhost_net code and return early for vdpa, so that
the behaviour doesn't change for this device.

Cc: qemu-sta...@nongnu.org
Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
Signed-off-by: Kevin Wolf 
---
v2:
- Actually make use of the @enable parameter
- Change vhost_net to preserve the current behaviour

v3:
- Updated trace point [Stefano]
- Fixed typo in comment [Stefano]

 hw/net/vhost_net.c | 10 ++
 hw/virtio/vdpa-dev.c   |  5 +
 hw/virtio/vhost-vdpa.c | 29 ++---
 hw/virtio/vhost.c  |  8 +++-
 hw/virtio/trace-events |  2 +-
 5 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index e8e1661646..fd1a93701a 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, int enable)
 VHostNetState *net = get_vhost_net(nc);
 const VhostOps *vhost_ops = net->dev.vhost_ops;
 
+/*
+ * vhost-vdpa network devices need to enable dataplane virtqueues after
+ * DRIVER_OK, so they can recover device state before starting dataplane.
+ * Because of that, we don't enable virtqueues here and leave it to
+ * net/vhost-vdpa.c.
+ */
+if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
+return 0;
+}
+
 nc->vring_enable = enable;
 
 if (vhost_ops && vhost_ops->vhost_set_vring_enable) {
diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
index eb9ecea83b..13e87f06f6 100644
--- a/hw/virtio/vdpa-dev.c
+++ b/hw/virtio/vdpa-dev.c
@@ -253,14 +253,11 @@ static int vhost_vdpa_device_start(VirtIODevice *vdev, 
Error **errp)
 
 s->dev.acked_features = vdev->guest_features;
 
-ret = vhost_dev_start(&s->dev, vdev, false);
+ret = vhost_dev_start(&s->dev, vdev, true);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Error starting vhost");
 goto err_guest_notifiers;
 }
-for (i = 0; i < s->dev.nvqs; ++i) {
-vhost_vdpa_set_vring_ready(&s->vdpa, i);
-}
 s->started = true;
 
 /*
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index ddae494ca8..31453466af 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -886,19 +886,41 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, 
int idx)
 return idx;
 }
 
-int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
+static int vhost_vdpa_set_vring_enable_one(struct vhost_vdpa *v, unsigned idx,
+   int enable)
 {
 struct vhost_dev *dev = v->dev;
 struct vhost_vring_state state = {
 .index = idx,
-.num = 1,
+.num = enable,
 };
 int r = vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
 
-trace_vhost_vdpa_set_vring_ready(dev, idx, r);
+trace_vhost_vdpa_set_vring_enable_one(dev, idx, enable, r);
 return r;
 }
 
+static int vhost_vdpa_set_vring_enable(struct vhost_dev *dev, int enable)
+{
+struct vhost_vdpa *v = dev->opaque;
+unsigned int i;
+int ret;
+
+for (i = 0; i < dev->nvqs; ++i) {
+ret = vhost_vdpa_set_vring_enable_one(v, i, enable);
+if (ret < 0) {
+return ret;
+}
+}
+
+return 0;
+}
+
+int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
+{
+return vhost_vdpa_set_vring_enable_one(v, idx, 1);
+}
+
 static int vhost_vdpa_set_config_call(struct vhost_dev *dev,
int fd)
 {
@@ -1514,6 +1536,7 @@ const VhostOps vdpa_ops = {
 .vhost_set_features = vhost_vdpa_set_features,
 .vhost_reset_device = vhost_vdpa_reset_device,
 .vhost_get_vq_index = vhost_vdpa_get_vq_index,
+.vhost_set_vring_enable = vhost_vdpa_set_vring_enable,
 .vhost_get_config  = vhost_vdpa_get_config,
 .vhost_set_config = vhost_vdpa_set_config,
 .vhost_requires_shm_log = NULL,
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79468..a66186 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1984,7 +1984,13 @@ static int vhost_dev_set_vr

Re: [PATCH for-9.0 v2] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-15 Thread Kevin Wolf
Am 15.03.2024 um 16:07 hat Stefano Garzarella geschrieben:
> On Fri, Mar 15, 2024 at 03:03:31PM +0100, Kevin Wolf wrote:
> > VDUSE requires that virtqueues are first enabled before the DRIVER_OK
> > status flag is set; with the current API of the kernel module, it is
> > impossible to enforce the opposite order in our block export code because
> > userspace is not notified when a virtqueue is enabled.
> > 
> > This requirement also matches the normal initialisation order as done by
> > the generic vhost code in QEMU. However, commit 6c482547 accidentally
> > changed the order for vdpa-dev and broke access to VDUSE devices with
> > this.
> > 
> > This changes vdpa-dev to use the normal order again and use the standard
> > vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
> > used with vdpa-dev again after this fix.
> > 
> > vhost_net intentionally avoided enabling the vrings for vdpa and does
> > this manually later while it does enable them for other vhost backends.
> > Reflect this in the vhost_net code and return early for vdpa, so that
> > the behaviour doesn't change for this device.
> > 
> > Cc: qemu-sta...@nongnu.org
> > Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
> > Signed-off-by: Kevin Wolf 
> > ---
> > v2:
> > - Actually make use of the @enable parameter
> > - Change vhost_net to preserve the current behaviour
> > 
> > hw/net/vhost_net.c | 10 ++
> > hw/virtio/vdpa-dev.c   |  5 +
> > hw/virtio/vhost-vdpa.c | 27 +--
> > hw/virtio/vhost.c  |  8 +++-
> > 4 files changed, 43 insertions(+), 7 deletions(-)
> > 
> > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > index e8e1661646..fd1a93701a 100644
> > --- a/hw/net/vhost_net.c
> > +++ b/hw/net/vhost_net.c
> > @@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, int 
> > enable)
> > VHostNetState *net = get_vhost_net(nc);
> > const VhostOps *vhost_ops = net->dev.vhost_ops;
> > 
> > +/*
> > + * vhost-vdpa network devices need to enable dataplane virtqueues after
> > + * DRIVER_OK, so they can recover device state before starting 
> > dataplane.
> > + * Because of that, we don't enable virtqueues here and leave it to
> > + * net/vhost-vdpa.c.
> > + */
> > +if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
> > +return 0;
> > +}
> > +
> > nc->vring_enable = enable;
> > 
> > if (vhost_ops && vhost_ops->vhost_set_vring_enable) {
> > diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
> > index eb9ecea83b..13e87f06f6 100644
> > --- a/hw/virtio/vdpa-dev.c
> > +++ b/hw/virtio/vdpa-dev.c
> > @@ -253,14 +253,11 @@ static int vhost_vdpa_device_start(VirtIODevice 
> > *vdev, Error **errp)
> > 
> > s->dev.acked_features = vdev->guest_features;
> > 
> > -ret = vhost_dev_start(&s->dev, vdev, false);
> > +ret = vhost_dev_start(&s->dev, vdev, true);
> > if (ret < 0) {
> > error_setg_errno(errp, -ret, "Error starting vhost");
> > goto err_guest_notifiers;
> > }
> > -for (i = 0; i < s->dev.nvqs; ++i) {
> > -vhost_vdpa_set_vring_ready(&s->vdpa, i);
> > -}
> > s->started = true;
> > 
> > /*
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index ddae494ca8..401afac2f5 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -886,12 +886,13 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev 
> > *dev, int idx)
> > return idx;
> > }
> > 
> > -int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
> > +static int vhost_vdpa_set_vring_enable_one(struct vhost_vdpa *v, unsigned 
> > idx,
> > +   int enable)
> > {
> > struct vhost_dev *dev = v->dev;
> > struct vhost_vring_state state = {
> > .index = idx,
> > -.num = 1,
> > +.num = enable,
> > };
> > int r = vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
> > 
> 
> After this line we now have:
> 
>   trace_vhost_vdpa_set_vring_ready(dev, idx, r);
> 
> Should we rename it or move it to the new function?
> 
> If we rename it, we should trace also `enable`.

I think renaming is better so that we cover all code paths that enable a
vring. I'll change this and send a v3.

> > @@ -899

[PATCH for-9.0 v2] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-15 Thread Kevin Wolf
VDUSE requires that virtqueues are first enabled before the DRIVER_OK
status flag is set; with the current API of the kernel module, it is
impossible to support the opposite order in our block export code because
userspace is not notified when a virtqueue is enabled.

This requirement also matches the normal initialisation order as done by
the generic vhost code in QEMU. However, commit 6c482547 accidentally
changed the order for vdpa-dev and broke access to VDUSE devices with
this.

This changes vdpa-dev to use the normal order again and use the standard
vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
used with vdpa-dev again after this fix.

vhost_net intentionally avoided enabling the vrings for vdpa and does
this manually later while it does enable them for other vhost backends.
Reflect this in the vhost_net code and return early for vdpa, so that
the behaviour doesn't change for this device.

Cc: qemu-sta...@nongnu.org
Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
Signed-off-by: Kevin Wolf 
---
v2:
- Actually make use of the @enable parameter
- Change vhost_net to preserve the current behaviour

 hw/net/vhost_net.c | 10 ++
 hw/virtio/vdpa-dev.c   |  5 +
 hw/virtio/vhost-vdpa.c | 27 +--
 hw/virtio/vhost.c  |  8 +++-
 4 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index e8e1661646..fd1a93701a 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, int enable)
 VHostNetState *net = get_vhost_net(nc);
 const VhostOps *vhost_ops = net->dev.vhost_ops;
 
+/*
+ * vhost-vdpa network devices need to enable dataplane virtqueues after
+ * DRIVER_OK, so they can recover device state before starting dataplane.
+ * Because of that, we don't enable virtqueues here and leave it to
+ * net/vhost-vdpa.c.
+ */
+if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
+return 0;
+}
+
 nc->vring_enable = enable;
 
 if (vhost_ops && vhost_ops->vhost_set_vring_enable) {
diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
index eb9ecea83b..13e87f06f6 100644
--- a/hw/virtio/vdpa-dev.c
+++ b/hw/virtio/vdpa-dev.c
@@ -253,14 +253,11 @@ static int vhost_vdpa_device_start(VirtIODevice *vdev, 
Error **errp)
 
 s->dev.acked_features = vdev->guest_features;
 
-ret = vhost_dev_start(&s->dev, vdev, false);
+ret = vhost_dev_start(&s->dev, vdev, true);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Error starting vhost");
 goto err_guest_notifiers;
 }
-for (i = 0; i < s->dev.nvqs; ++i) {
-vhost_vdpa_set_vring_ready(&s->vdpa, i);
-}
 s->started = true;
 
 /*
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index ddae494ca8..401afac2f5 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -886,12 +886,13 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, 
int idx)
 return idx;
 }
 
-int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
+static int vhost_vdpa_set_vring_enable_one(struct vhost_vdpa *v, unsigned idx,
+   int enable)
 {
 struct vhost_dev *dev = v->dev;
 struct vhost_vring_state state = {
 .index = idx,
-.num = 1,
+.num = enable,
 };
int r = vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
 
@@ -899,6 +900,27 @@ int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, 
unsigned idx)
 return r;
 }
 
+static int vhost_vdpa_set_vring_enable(struct vhost_dev *dev, int enable)
+{
+struct vhost_vdpa *v = dev->opaque;
+unsigned int i;
+int ret;
+
+for (i = 0; i < dev->nvqs; ++i) {
+ret = vhost_vdpa_set_vring_enable_one(v, i, enable);
+if (ret < 0) {
+return ret;
+}
+}
+
+return 0;
+}
+
+int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
+{
+return vhost_vdpa_set_vring_enable_one(v, idx, 1);
+}
+
 static int vhost_vdpa_set_config_call(struct vhost_dev *dev,
int fd)
 {
@@ -1514,6 +1536,7 @@ const VhostOps vdpa_ops = {
 .vhost_set_features = vhost_vdpa_set_features,
 .vhost_reset_device = vhost_vdpa_reset_device,
 .vhost_get_vq_index = vhost_vdpa_get_vq_index,
+.vhost_set_vring_enable = vhost_vdpa_set_vring_enable,
 .vhost_get_config  = vhost_vdpa_get_config,
 .vhost_set_config = vhost_vdpa_set_config,
 .vhost_requires_shm_log = NULL,
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79468..decfb85184 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1984,7 +1984,13 @@ static int vhost_dev_set_vring_enable(struct vhost_dev 
*hdev, int enable)
 return hdev->vhost_ops->vhost_set_vring_enable(hdev, enable);
 }
 
-/* Host n

[PATCH for-9.0 0/2] nbd: Fix server crash on reset with iothreads

2024-03-14 Thread Kevin Wolf
Kevin Wolf (2):
  nbd/server: Fix race in draining the export
  iotests: Add test for reset/AioContext switches with NBD exports

 nbd/server.c  | 15 ++---
 tests/qemu-iotests/tests/iothreads-nbd-export | 66 +++
 .../tests/iothreads-nbd-export.out| 19 ++
 3 files changed, 92 insertions(+), 8 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/iothreads-nbd-export
 create mode 100644 tests/qemu-iotests/tests/iothreads-nbd-export.out

-- 
2.44.0




[PATCH for-9.0 1/2] nbd/server: Fix race in draining the export

2024-03-14 Thread Kevin Wolf
When draining an NBD export, nbd_drained_begin() first sets
client->quiescing so that nbd_client_receive_next_request() won't start
any new request coroutines. Then nbd_drained_poll() tries to make sure
that we wait for any existing request coroutines by checking that
client->nb_requests has become 0.

However, there is a small window between creating a new request
coroutine and increasing client->nb_requests. If a coroutine is in this
state, it won't be waited for and drain returns too early.

In the context of switching to a different AioContext, this means that
blk_aio_attached() will see client->recv_coroutine != NULL and fail its
assertion.

Fix this by increasing client->nb_requests immediately when starting the
coroutine. Doing this after the checks if we should create a new
coroutine is okay because client->lock is held.

Cc: qemu-sta...@nongnu.org
Fixes: fd6afc501a019682d1b8468b562355a2887087bd
Signed-off-by: Kevin Wolf 
---
 nbd/server.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 941832f178..c3484cc1eb 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -3007,8 +3007,8 @@ static coroutine_fn int nbd_handle_request(NBDClient 
*client,
 /* Owns a reference to the NBDClient passed as opaque.  */
 static coroutine_fn void nbd_trip(void *opaque)
 {
-NBDClient *client = opaque;
-NBDRequestData *req = NULL;
+NBDRequestData *req = opaque;
+NBDClient *client = req->client;
 NBDRequest request = { 0 };/* GCC thinks it can be used uninitialized 
*/
 int ret;
 Error *local_err = NULL;
@@ -3037,8 +3037,6 @@ static coroutine_fn void nbd_trip(void *opaque)
 goto done;
 }
 
-req = nbd_request_get(client);
-
 /*
  * nbd_co_receive_request() returns -EAGAIN when nbd_drained_begin() has
  * set client->quiescing but by the time we get back nbd_drained_end() may
@@ -3112,9 +3110,7 @@ static coroutine_fn void nbd_trip(void *opaque)
 }
 
 done:
-if (req) {
-nbd_request_put(req);
-}
+nbd_request_put(req);
 
qemu_mutex_unlock(&client->lock);
 
@@ -3143,10 +3139,13 @@ disconnect:
  */
 static void nbd_client_receive_next_request(NBDClient *client)
 {
+NBDRequestData *req;
+
 if (!client->recv_coroutine && client->nb_requests < MAX_NBD_REQUESTS &&
 !client->quiescing) {
 nbd_client_get(client);
-client->recv_coroutine = qemu_coroutine_create(nbd_trip, client);
+req = nbd_request_get(client);
+client->recv_coroutine = qemu_coroutine_create(nbd_trip, req);
 aio_co_schedule(client->exp->common.ctx, client->recv_coroutine);
 }
 }
-- 
2.44.0




[PATCH for-9.0 2/2] iotests: Add test for reset/AioContext switches with NBD exports

2024-03-14 Thread Kevin Wolf
This replicates the scenario in which the bug was reported.
Unfortunately this relies on actually executing a guest (so that the
firmware initialises the virtio-blk device and moves it to its
configured iothread), so this can't make use of the qtest accelerator
like most other test cases. I tried to find a different easy way to
trigger the bug, but couldn't find one.

Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/tests/iothreads-nbd-export | 66 +++
 .../tests/iothreads-nbd-export.out| 19 ++
 2 files changed, 85 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/iothreads-nbd-export
 create mode 100644 tests/qemu-iotests/tests/iothreads-nbd-export.out

diff --git a/tests/qemu-iotests/tests/iothreads-nbd-export 
b/tests/qemu-iotests/tests/iothreads-nbd-export
new file mode 100755
index 0000000000..63cac8fdbf
--- /dev/null
+++ b/tests/qemu-iotests/tests/iothreads-nbd-export
@@ -0,0 +1,66 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Copyright (C) 2024 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# Creator/Owner: Kevin Wolf 
+
+import asyncio
+import iotests
+import qemu
+import time
+
+iotests.script_initialize(supported_fmts=['qcow2'],
+  supported_platforms=['linux'])
+
+with iotests.FilePath('disk1.img') as path, \
+ iotests.FilePath('nbd.sock', base_dir=iotests.sock_dir) as nbd_sock, \
+ qemu.machine.QEMUMachine(iotests.qemu_prog) as vm:
+
+img_size = '10M'
+
+iotests.log('Preparing disk...')
+iotests.qemu_img_create('-f', iotests.imgfmt, path, img_size)
+vm.add_args('-blockdev', f'file,node-name=disk-file,filename={path}')
+vm.add_args('-blockdev', f'qcow2,node-name=disk,file=disk-file')
+vm.add_args('-object', 'iothread,id=iothread0')
+vm.add_args('-device', 
'virtio-blk,drive=disk,iothread=iothread0,share-rw=on')
+
+iotests.log('Launching VM...')
+vm.add_args('-accel', 'kvm', '-accel', 'tcg')
+#vm.add_args('-accel', 'qtest')
+vm.launch()
+
+iotests.log('Exporting to NBD...')
+iotests.log(vm.qmp('nbd-server-start',
+   addr={'type': 'unix', 'data': {'path': nbd_sock}}))
+iotests.log(vm.qmp('block-export-add', type='nbd', id='exp0',
+   node_name='disk', writable=True))
+
+iotests.log('Connecting qemu-img...')
+qemu_io = iotests.QemuIoInteractive('-f', 'raw',
+f'nbd+unix:///disk?socket={nbd_sock}')
+
+iotests.log('Moving the NBD export to a different iothread...')
+for i in range(0, 10):
+iotests.log(vm.qmp('system_reset'))
+time.sleep(0.1)
+
+iotests.log('Checking that it is still alive...')
+iotests.log(vm.qmp('query-status'))
+
+qemu_io.close()
+vm.shutdown()
diff --git a/tests/qemu-iotests/tests/iothreads-nbd-export.out 
b/tests/qemu-iotests/tests/iothreads-nbd-export.out
new file mode 100644
index 0000000000..bc514e35e5
--- /dev/null
+++ b/tests/qemu-iotests/tests/iothreads-nbd-export.out
@@ -0,0 +1,19 @@
+Preparing disk...
+Launching VM...
+Exporting to NBD...
+{"return": {}}
+{"return": {}}
+Connecting qemu-img...
+Moving the NBD export to a different iothread...
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+Checking that it is still alive...
+{"return": {"running": true, "status": "running"}}
-- 
2.44.0




[PATCH for-9.0] mirror: Don't call job_pause_point() under graph lock

2024-03-13 Thread Kevin Wolf
Calling job_pause_point() while holding the graph reader lock
potentially results in a deadlock: bdrv_graph_wrlock() first drains
everything, including the mirror job, which pauses it. The job is only
unpaused at the end of the drain section, which is when the graph writer
lock has been successfully taken. However, if the job happens to be
paused at a pause point where it still holds the reader lock, the writer
lock can't be taken as long as the job is still paused.

Mark job_pause_point() as GRAPH_UNLOCKED and fix mirror accordingly.

Cc: qemu-sta...@nongnu.org
Buglink: https://issues.redhat.com/browse/RHEL-28125
Fixes: 004915a96a7a40e942ac85e6d22518cbcd283506
Signed-off-by: Kevin Wolf 
---
 include/qemu/job.h |  2 +-
 block/mirror.c | 10 ++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 9ea98b5927..2b873f2576 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -483,7 +483,7 @@ void job_enter(Job *job);
  *
  * Called with job_mutex *not* held.
  */
-void coroutine_fn job_pause_point(Job *job);
+void coroutine_fn GRAPH_UNLOCKED job_pause_point(Job *job);
 
 /**
  * @job: The job that calls the function.
diff --git a/block/mirror.c b/block/mirror.c
index 5145eb53e1..1bdce3b657 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -479,9 +479,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t 
offset,
 return bytes_handled;
 }
 
-static void coroutine_fn GRAPH_RDLOCK mirror_iteration(MirrorBlockJob *s)
+static void coroutine_fn GRAPH_UNLOCKED mirror_iteration(MirrorBlockJob *s)
 {
-BlockDriverState *source = s->mirror_top_bs->backing->bs;
+BlockDriverState *source;
 MirrorOp *pseudo_op;
 int64_t offset;
 /* At least the first dirty chunk is mirrored in one iteration. */
@@ -489,6 +489,10 @@ static void coroutine_fn GRAPH_RDLOCK 
mirror_iteration(MirrorBlockJob *s)
 bool write_zeroes_ok = bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
 int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);
 
+bdrv_graph_co_rdlock();
+source = s->mirror_top_bs->backing->bs;
+bdrv_graph_co_rdunlock();
+
 bdrv_dirty_bitmap_lock(s->dirty_bitmap);
 offset = bdrv_dirty_iter_next(s->dbi);
 if (offset < 0) {
@@ -1066,9 +1070,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_wait_for_free_in_flight_slot(s);
 continue;
 } else if (cnt != 0) {
-bdrv_graph_co_rdlock();
 mirror_iteration(s);
-bdrv_graph_co_rdunlock();
 }
 }
 
-- 
2.44.0




Re: [PATCH for 9.0] migration: Skip only empty block devices

2024-03-13 Thread Kevin Wolf
Am 12.03.2024 um 13:27 hat Peter Xu geschrieben:
> On Tue, Mar 12, 2024 at 01:04:31PM +0100, Cédric Le Goater wrote:
> > The block .save_setup() handler calls a helper routine
> > init_blk_migration() which builds a list of block devices to take into
> > account for migration. When one device is found to be empty (sectors
> > == 0), the loop exits and all the remaining devices are ignored. This
> > is a regression introduced when bdrv_iterate() was removed.
> > 
> > Change that by skipping only empty devices.
> > 
> > Cc: Markus Armbruster 
> > Suggested: Kevin Wolf 
> 
> This should be:
> 
> Suggested-by: Kevin Wolf 
> 
> I think the missed "by" caused Kevin not in the cc list, I added Kevin in.
> 
> I'll hold a bit for Kevin to ACK, no repost needed just for above; I can
> fix it.

Thanks.

Reviewed-by: Kevin Wolf 




Re: [PATCH v2 00/10] mirror: allow switching from background to active mode

2024-03-12 Thread Kevin Wolf
Am 12.03.2024 um 14:44 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 11.03.24 18:15, Vladimir Sementsov-Ogievskiy wrote:
> > On 08.03.24 11:52, Kevin Wolf wrote:
> > > Am 07.03.2024 um 20:42 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > > On 04.03.24 14:09, Peter Krempa wrote:
> > > > > On Mon, Mar 04, 2024 at 11:48:54 +0100, Kevin Wolf wrote:
> > > > > > Am 28.02.2024 um 19:07 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > > > > > On 03.11.23 18:56, Markus Armbruster wrote:
> > > > > > > > Kevin Wolf  writes:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > > > Is the job abstraction a failure?
> > > > > > > > 
> > > > > > > > We have
> > > > > > > > 
> > > > > > > >    block-job- command  since   job- command    since
> > > > > > > >    -
> > > > > > > >    block-job-set-speed 1.1
> > > > > > > >    block-job-cancel    1.1 job-cancel  3.0
> > > > > > > >    block-job-pause 1.3 job-pause   3.0
> > > > > > > >    block-job-resume    1.3 job-resume  3.0
> > > > > > > >    block-job-complete  1.3 job-complete    3.0
> > > > > > > >    block-job-dismiss   2.12    job-dismiss 3.0
> > > > > > > >    block-job-finalize  2.12    job-finalize    3.0
> > > > > > > >    block-job-change    8.2
> > > > > > > >    query-block-jobs    1.1 query-jobs
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > I consider these strictly optional. We don't really have strong 
> > > > > > reasons
> > > > > > to deprecate these commands (they are just thin wrappers), and I 
> > > > > > think
> > > > > > libvirt still uses block-job-* in some places.
> > > > > 
> > > > > Libvirt uses 'block-job-cancel' because it has different semantics 
> > > > > from
> > > > > 'job-cancel' which libvirt documented as the behaviour of the API that
> > > > > uses it. (Semantics regarding the expectation of what is written to 
> > > > > the
> > > > > destination node at the point when the job is cancelled).
> > > > > 
> > > > 
> > > > That's the following semantics:
> > > > 
> > > >    # Note that if you issue 'block-job-cancel' after 'drive-mirror' has
> > > >    # indicated (via the event BLOCK_JOB_READY) that the source and
> > > >    # destination are synchronized, then the event triggered by this
> > > >    # command changes to BLOCK_JOB_COMPLETED, to indicate that the
> > > >    # mirroring has ended and the destination now has a point-in-time 
> > > > copy
> > > >    # tied to the time of the cancellation.
> > > > 
> > > > Hmm. Looking at this, it looks to me that this should probably be a
> > > > 'block-job-complete' command (as leading to BLOCK_JOB_COMPLETED).
> > > 
> > > Yes, it's just a different completion mode.
> > > 
> > > > Actually, what is the difference between block-job-complete and
> > > > block-job-cancel(force=false) for mirror in ready state?
> > > > 
> > > > I only see the following differencies:
> > > > 
> > > > 1. block-job-complete documents that it completes the job
> > > >     synchronously.. But looking at mirror code I see it just set
> > > >     s->should_complete = true, which will be then handled
> > > >     asynchronously..  So I doubt that documentation is correct.
> > > > 
> > > > 2. block-job-complete will trigger final graph changes.
> > > >     block-job-cancel will not.
> > > > 
> > > > Is [2] really useful? Seems yes: in case of some failure before
> > > > starting migration target, we'd like to continue executing source. So,
> > > > no reason to break block-graph in source, better keep it unchanged.
> > > > 
> > > > But I think, such behavior better be setup by mirror-job start
> > > > parameter, rather then by special option for cancel (or even
> >

Re: [PATCH] virtio-blk: iothread-vq-mapping coroutine pool sizing

2024-03-12 Thread Kevin Wolf
Am 11.03.2024 um 21:14 hat Stefan Hajnoczi geschrieben:
> It is possible to hit the sysctl vm.max_map_count limit when the
> coroutine pool size becomes large. Each coroutine requires two mappings
> (one for the stack and one for the guard page). QEMU can crash with
> "failed to set up stack guard page" or "failed to allocate memory for
> stack" when this happens.
> 
> Coroutine pool sizing is simple when there is only one AioContext: sum
> up all I/O requests across all virtqueues.
> 
> When the iothread-vq-mapping option is used we should calculate tighter
> bounds: take the maximum number of the number of I/O requests across all
> virtqueues. This number is lower than simply summing all virtqueues when
> only a subset of the virtqueues is handled by each AioContext.

The reasoning is that each thread has its own coroutine pool for which
the pool size applies individually, and it doesn't need to have space
for coroutines running in a different thread, right? I'd like to have
this recorded in the commit message.

Of course, this also makes me wonder if a global coroutine pool size
really makes sense or if it should be per thread. One thread could be
serving only one queue (maybe the main thread with a CD-ROM device) and
another thread 32 queues (the iothread with the interesting disks).
There is no reason for the first thread to have a coroutine pool as big
as the second one.

But before we make the size thread-local, maybe having thread-local
pools wasn't right to begin with because multiple threads can run main
context code and they should therefore share the same coroutine pool (we
already had the problem earlier that coroutines start on the vcpu thread
and terminate on the main thread and this plays havoc with coroutine
pools).

Maybe per-AioContext pools with per-AioContext sizes would make more
sense?

> This is not a solution to hitting vm.max_map_count, but it helps. A
> guest with 64 vCPUs (hence 64 virtqueues) across 4 IOThreads with one
> iothread-vq-mapping virtio-blk device and a root disk without it goes from
> pool_max_size 16,448 to 10,304.
> 
> Reported-by: Sanjay Rao 
> Reported-by: Boaz Ben Shabat 
> Signed-off-by: Stefan Hajnoczi 

Either way, this should already strictly improve the situation, so I'm
happy to apply this change for now.

Kevin




Re: [PATCH v2] linux-aio: add IO_CMD_FDSYNC command support

2024-03-12 Thread Kevin Wolf
Am 11.03.2024 um 20:36 hat Stefan Hajnoczi geschrieben:
> On Mon, Mar 11, 2024 at 04:40:05PM +0100, Kevin Wolf wrote:
> > Am 11.03.2024 um 14:09 hat Stefan Hajnoczi geschrieben:
> > > On Mon, Mar 11, 2024 at 11:13:33AM +0530, Prasad Pandit wrote:
> > > > From: Prasad Pandit 
> > > > 
> > > > Libaio defines IO_CMD_FDSYNC command to sync all outstanding
> > > > asynchronous I/O operations, by flushing out file data to the
> > > > disk storage.
> > > > 
> > > > Enable linux-aio to submit such aio request. This helps to
> > > > reduce latency induced via pthread_create calls by
> > > > thread-pool (aio=threads).
> > > > 
> > > > Signed-off-by: Prasad Pandit 
> > > > ---
> > > >  block/file-posix.c | 12 
> > > >  block/linux-aio.c  |  5 -
> > > >  2 files changed, 16 insertions(+), 1 deletion(-)
> > > > 
> > > > v2: if IO_CMD_FDSYNC is not supported by the kernel,
> > > > fallback on thread-pool flush.
> > > >   -> 
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2024-03/msg01986.html
> > > > 
> > > > diff --git a/block/file-posix.c b/block/file-posix.c
> > > > index 35684f7e21..4f2195d01d 100644
> > > > --- a/block/file-posix.c
> > > > +++ b/block/file-posix.c
> > > > @@ -2599,6 +2599,18 @@ static int coroutine_fn 
> > > > raw_co_flush_to_disk(BlockDriverState *bs)
> > > >  if (raw_check_linux_io_uring(s)) {
> > > >  return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
> > > >  }
> > > > +#endif
> > > > +#ifdef CONFIG_LINUX_AIO
> > > > +if (raw_check_linux_aio(s)) {
> > > > +ret = laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
> > > > +if (ret >= 0) {
> > > > +/*
> > > > + * if AIO_FLUSH is supported return
> > > > + * else fallback on thread-pool flush.
> > > > + */
> > > > +return ret;
> > > > +}
> > > 
> > > Falling back every time on an older host kernel might be a noticeable
> > > performance regression. That can be avoided with a variable that keeps
> > > track of whether -EINVAL was seen before and skips Linux AIO in that
> > > case.
> > > 
> > > However, it appears that popular distributions starting from Debian 10,
> > > Ubuntu 20.04, Fedora 27, CentOS 8, and OpenSUSE Leap 15.5 have the
> > > necessary minimum Linux 4.18 kernel:
> > > https://repology.org/project/linux/versions
> > > 
> > > Fallback should be very rare, so I don't think it needs to be optimized:
> > > 
> > > Reviewed-by: Stefan Hajnoczi 
> > 
> > We might need this approach for a different reason: This is an
> > io_submit() error, so while we retry the flush with the fallback path,
> > other requests in the same batch may incorrectly return errors. This
> > probably explains the errors Prasad saw in the guest when the kernel
> > doesn't have support for flush in Linux AIO.
> > 
> > So in order to avoid this, we'll probably have to send one flush just to
> > probe (while making sure that no other request is pending - maybe
> > immediately when opening the image?) and then remember whether it
> > worked.
> > 
> > Or we'd have to change the error handling around io_submit(), so that we
> > don't always fail the first request in the batch, but first fail any
> > flushes and only then the rest of the requests.
> 
> I don't see the behavior you are describing in the code. My
> interpretation of ioq_submit() is that only the flush request fails.
> Other queued requests (before and after the flush) are submitted
> successfully:
> [...]

You're right. I missed that io_submit() returns failure only if the
first request in the queue is invalid, and returns a "short submission"
for errors in later entries.

Kevin




Re: [PATCH v2] linux-aio: add IO_CMD_FDSYNC command support

2024-03-11 Thread Kevin Wolf
Am 11.03.2024 um 14:09 hat Stefan Hajnoczi geschrieben:
> On Mon, Mar 11, 2024 at 11:13:33AM +0530, Prasad Pandit wrote:
> > From: Prasad Pandit 
> > 
> > Libaio defines IO_CMD_FDSYNC command to sync all outstanding
> > asynchronous I/O operations, by flushing out file data to the
> > disk storage.
> > 
> > Enable linux-aio to submit such aio request. This helps to
> > reduce latency induced via pthread_create calls by
> > thread-pool (aio=threads).
> > 
> > Signed-off-by: Prasad Pandit 
> > ---
> >  block/file-posix.c | 12 
> >  block/linux-aio.c  |  5 -
> >  2 files changed, 16 insertions(+), 1 deletion(-)
> > 
> > v2: if IO_CMD_FDSYNC is not supported by the kernel,
> > fallback on thread-pool flush.
> >   -> https://lists.nongnu.org/archive/html/qemu-devel/2024-03/msg01986.html
> > 
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 35684f7e21..4f2195d01d 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -2599,6 +2599,18 @@ static int coroutine_fn 
> > raw_co_flush_to_disk(BlockDriverState *bs)
> >  if (raw_check_linux_io_uring(s)) {
> >  return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
> >  }
> > +#endif
> > +#ifdef CONFIG_LINUX_AIO
> > +if (raw_check_linux_aio(s)) {
> > +ret = laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
> > +if (ret >= 0) {
> > +/*
> > + * if AIO_FLUSH is supported return
> > + * else fallback on thread-pool flush.
> > + */
> > +return ret;
> > +}
> 
> Falling back every time on an older host kernel might be a noticeable
> performance regression. That can be avoided with a variable that keeps
> track of whether -EINVAL was seen before and skips Linux AIO in that
> case.
> 
> However, it appears that popular distributions starting from Debian 10,
> Ubuntu 20.04, Fedora 27, CentOS 8, and OpenSUSE Leap 15.5 have the
> necessary minimum Linux 4.18 kernel:
> https://repology.org/project/linux/versions
> 
> Fallback should be very rare, so I don't think it needs to be optimized:
> 
> Reviewed-by: Stefan Hajnoczi 

We might need this approach for a different reason: This is an
io_submit() error, so while we retry the flush with the fallback path,
other requests in the same batch may incorrectly return errors. This
probably explains the errors Prasad saw in the guest when the kernel
doesn't have support for flush in Linux AIO.

So in order to avoid this, we'll probably have to send one flush just to
probe (while making sure that no other request is pending - maybe
immediately when opening the image?) and then remember whether it
worked.

Or we'd have to change the error handling around io_submit(), so that we
don't always fail the first request in the batch, but first fail any
flushes and only then the rest of the requests.

Kevin




Re: [PATCH] linux-aio: add IO_CMD_FDSYNC command support

2024-03-08 Thread Kevin Wolf
Am 07.03.2024 um 18:19 hat Prasad Pandit geschrieben:
> Hi,
> 
> On Thu, 7 Mar 2024 at 19:21, Kevin Wolf  wrote:
> > Kernel support for this is "relatively" new, added in 2018. Should we
> > fall back to the thread pool if we receive -EINVAL?
> 
> laio_co_submit
>   laio_do_submit
> ioq_submit
>   io_submit
> 
> * When an aio operation is not supported by the kernel, io_submit()
> call would return '-EINVAL', indicating that the specified (_FDSYNC)
> operation is invalid for the file descriptor, i.e. an error (-EINVAL)
> needs to reach the 'laio_co_submit' routine above, to make its caller
> fall back on the thread-pool functionality as default.

Hm... This might make it more challenging because then not only one
specific request fails, but the whole submission batch. Do we know if it
can include unrelated requests?

> * Strangely the 'ioq_submit' routine above ignores the return value
> from io_submit and returns void. I think we need to change that to be
> able to fall back on the thread-pool functionality. I'll try to see if
> such a change works as expected. Please let me know if there's another
> way to fix this.

It passes the return value to the request:

    if (ret < 0) {
        /* Fail the first request, retry the rest */
        aiocb = QSIMPLEQ_FIRST(&s->io_q.pending);
        QSIMPLEQ_REMOVE_HEAD(&s->io_q.pending, next);
        s->io_q.in_queue--;
        aiocb->ret = ret;
        qemu_laio_process_completion(aiocb);
        continue;
    }

laio_co_submit() then fetches it from laiocb.ret and returns it to its
caller (in this case, your new code).
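So the caller-side fallback could be as simple as checking the value laio_co_submit() hands back. A toy model of that decision (the stub function and enum are illustrative, not QEMU API):

```c
#include <assert.h>
#include <errno.h>

enum submit_path { PATH_LINUX_AIO, PATH_THREAD_POOL };

/* Stand-in for laio_co_submit(): just forwards the kernel's result. */
static int laio_submit_stub(int kernel_ret)
{
    return kernel_ret;
}

/* On -EINVAL for a flush, fall back to the thread pool; any other
 * result stays on the Linux AIO path. */
static enum submit_path submit_flush(int kernel_ret)
{
    if (laio_submit_stub(kernel_ret) == -EINVAL) {
        return PATH_THREAD_POOL;
    }
    return PATH_LINUX_AIO;
}
```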

Kevin




Re: [PATCH v2 00/10] mirror: allow switching from background to active mode

2024-03-08 Thread Kevin Wolf
Am 07.03.2024 um 20:42 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 04.03.24 14:09, Peter Krempa wrote:
> > On Mon, Mar 04, 2024 at 11:48:54 +0100, Kevin Wolf wrote:
> > > Am 28.02.2024 um 19:07 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > > On 03.11.23 18:56, Markus Armbruster wrote:
> > > > > Kevin Wolf  writes:
> > 
> > [...]
> > 
> > > > > Is the job abstraction a failure?
> > > > > 
> > > > > We have
> > > > > 
> > > > >   block-job- command  since   job- commandsince
> > > > >   -
> > > > >   block-job-set-speed 1.1
> > > > >   block-job-cancel1.1 job-cancel  3.0
> > > > >   block-job-pause 1.3 job-pause   3.0
> > > > >   block-job-resume1.3 job-resume  3.0
> > > > >   block-job-complete  1.3 job-complete3.0
> > > > >   block-job-dismiss   2.12job-dismiss 3.0
> > > > >   block-job-finalize  2.12job-finalize3.0
> > > > >   block-job-change8.2
> > > > >   query-block-jobs1.1 query-jobs
> > 
> > [...]
> > 
> > > I consider these strictly optional. We don't really have strong reasons
> > > to deprecate these commands (they are just thin wrappers), and I think
> > > libvirt still uses block-job-* in some places.
> > 
> > Libvirt uses 'block-job-cancel' because it has different semantics from
> > 'job-cancel' which libvirt documented as the behaviour of the API that
> > uses it. (Semantics regarding the expectation of what is written to the
> > destination node at the point when the job is cancelled).
> > 
> 
> That's the following semantics:
> 
>   # Note that if you issue 'block-job-cancel' after 'drive-mirror' has
>   # indicated (via the event BLOCK_JOB_READY) that the source and
>   # destination are synchronized, then the event triggered by this
>   # command changes to BLOCK_JOB_COMPLETED, to indicate that the
>   # mirroring has ended and the destination now has a point-in-time copy
>   # tied to the time of the cancellation.
> 
> Hmm. Looking at this, it seems to me that this should probably be a
> 'block-job-complete' command (as it leads to BLOCK_JOB_COMPLETED).

Yes, it's just a different completion mode.

> Actually, what is the difference between block-job-complete and
> block-job-cancel(force=false) for mirror in ready state?
> 
> I only see the following differences:
> 
> 1. block-job-complete documents that it completes the job
>    synchronously. But looking at the mirror code, I see it just sets
>    s->should_complete = true, which will then be handled
>    asynchronously. So I doubt that documentation is correct.
> 
> 2. block-job-complete will trigger final graph changes.
>    block-job-cancel will not.
> 
> Is [2] really useful? Seems yes: in case of some failure before
> starting the migration target, we'd like to continue executing the
> source. So there is no reason to break the block graph on the source;
> better to keep it unchanged.
> 
> But I think such behavior would better be set up by a mirror-job start
> parameter, rather than by a special option for the cancel (or even
> complete) command that is useful only for mirror.

I'm not sure, having the option on the complete command makes more sense
to me than having it in blockdev-mirror.

I do see the challenge of representing this meaningfully in QAPI,
though. Semantically it should be a union with job-specific options and
only mirror adds the graph-changes option. But the union variant
can't be directly selected from another option - instead we have a job
ID, and the variant is the job type of the job with this ID.

Practically speaking, we would probably indeed end up with an optional
field in the existing completion command.

> So, what about the following substitution for block-job-cancel:
> 
> block-job-cancel(force=true)  -->  use job-cancel
> 
> block-job-cancel(force=false) for backup, stream, commit  -->  use job-cancel
> 
> block-job-cancel(force=false) for mirror in ready mode  -->
> 
>   instead, use block-job-complete. If you don't need the final graph
>   modification that the mirror job normally does, use a
>   graph-change=false parameter for the blockdev-mirror command.

Apart from the open question where to put the option, agreed.

> (I vaguely remember that we've already discussed something like this
> a long time ago, but I don't remember the results.)

I think everyone agreed that this is how things should be, and nobody
did anything

Re: [PATCH] linux-aio: add IO_CMD_FDSYNC command support

2024-03-07 Thread Kevin Wolf
Am 07.03.2024 um 12:47 hat Prasad Pandit geschrieben:
> From: Prasad Pandit 
> 
> Libaio defines IO_CMD_FDSYNC command to sync all outstanding
> asynchronous I/O operations, by flushing out file data to the
> disk storage.
> 
> Enable linux-aio to submit such aio request. This helps to
> reduce latency induced via pthread_create calls by
> thread-pool (aio=threads).
> 
> Signed-off-by: Prasad Pandit 

Kernel support for this is "relatively" new, added in 2018. Should we
fall back to the thread pool if we receive -EINVAL?

Kevin




Re: [PATCH v2 00/10] mirror: allow switching from background to active mode

2024-03-04 Thread Kevin Wolf
Am 28.02.2024 um 19:07 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 03.11.23 18:56, Markus Armbruster wrote:
> > Kevin Wolf  writes:
> > 
> > > Am 03.11.2023 um 10:36 hat Markus Armbruster geschrieben:
> > > > Vladimir Sementsov-Ogievskiy  writes:
> > > > 
> > > > > On 11.10.23 13:18, Fiona Ebner wrote:
> > > > > > Am 10.10.23 um 19:55 schrieb Vladimir Sementsov-Ogievskiy:
> > > > > > > On 09.10.23 12:46, Fiona Ebner wrote:
> > > > > > > > Initially, I tried to go for a more general 'job-change' 
> > > > > > > > command, but
> > > > > > > > I couldn't figure out a way to avoid mutual inclusion between
> > > > > > > > block-core.json and job.json.
> > > > > > > > 
> > > > > > > What is the problem with it? I still think that job-change would 
> > > > > > > be better.
> > > > > > > 
> > > > > > If going for job-change in job.json, the dependencies would be
> > > > > > job-change -> JobChangeOptions -> JobChangeOptionsMirror -> 
> > > > > > MirrorCopyMode
> > > > > > query-jobs -> JobInfo -> JobInfoMirror
> > > > > > and we can't include block-core.json in job.json, because an 
> > > > > > inclusion
> > > > > > loop gives a build error.
> > > > Let me try to understand this.
> > > > 
> > > > Command job-change needs its argument type JobChangeOptions.
> > > > 
> > > > JobChangeOptions is a union, and JobChangeOptionsMirror is one of its
> > > > branches.
> > > > 
> > > > JobChangeOptionsMirror needs MirrorCopyMode from block-core.json.
> > > > 
> > > > block-core.json needs job.json for JobType and JobStatus.
> > > > 
> > > > > > Could be made to work by moving MirrorCopyMode (and
> > > > > > JobChangeOptionsMirror, JobInfoMirror) to job.json or some place 
> > > > > > that
> > > > > > can be included by both job.json and block-core.json. Moving the
> > > > > > type-specific definitions to the general job.json didn't feel right 
> > > > > > to
> > > > > > me. Including another file with type-specific definitions in 
> > > > > > job.json
> > > > > > feels slightly less wrong, but still not quite right and I didn't 
> > > > > > want
> > > > > > to create a new file just for MirrorCopyMode (and
> > > > > > JobChangeOptionsMirror, JobInfoMirror).
> > > > > > And going further and moving all mirror-related things to a separate
> > > > > > file would require moving along things like NewImageMode with it or
> > > > > > create yet another file for such general things used by multiple 
> > > > > > block-jobs.
> > > > > > If preferred, I can try and go with some version of the above.
> > > > > > 
> > > > > OK, I see the problem. Seems, that all requires some good 
> > > > > refactoring. But that's a preexisting big work, and should not hold 
> > > > > up your series. I'm OK to proceed with block-job-change.
> > > > Saving ourselves some internal refactoring is a poor excuse for
> > > > undesirable external interfaces.
> > > I'm not sure how undesirable it is. We have block-job-* commands for
> > > pretty much every other operation, so it's only consistent to have
> > > block-job-change, too.
> > Is the job abstraction a failure?
> > 
> > We have
> > 
> >  block-job- command  since   job- commandsince
> >  -
> >  block-job-set-speed 1.1
> >  block-job-cancel1.1 job-cancel  3.0
> >  block-job-pause 1.3 job-pause   3.0
> >  block-job-resume1.3 job-resume  3.0
> >  block-job-complete  1.3 job-complete3.0
> >  block-job-dismiss   2.12job-dismiss 3.0
> >  block-job-finalize  2.12job-finalize3.0
> >  block-job-change8.2
> >  query-block-jobs1.1 query-jobs
> > 
> > I was under the impression that we added the (more general) job-
> > commands to replace the (less general) block-job commands, and we're
> > keeping the latter just for compa

Re: [PATCH v6 01/15] hw/nvme: Use pcie_sriov_num_vfs()

2024-02-20 Thread Kevin Wolf
Am 20.02.2024 um 15:29 hat Kevin Wolf geschrieben:
> Am 20.02.2024 um 13:24 hat Akihiko Odaki geschrieben:
> > nvme_sriov_pre_write_ctrl() used to directly inspect SR-IOV
> > configurations to know the number of VFs being disabled due to SR-IOV
> > configuration writes, but the logic was flawed and resulted in
> > out-of-bound memory access.
> > 
> > It assumed PCI_SRIOV_NUM_VF always has the number of currently enabled
> > VFs, but it actually doesn't in the following cases:
> > - PCI_SRIOV_NUM_VF has been set but PCI_SRIOV_CTRL_VFE has never been.
> > - PCI_SRIOV_NUM_VF was written after PCI_SRIOV_CTRL_VFE was set.
> > - VFs were only partially enabled because of realization failure.
> > 
> > It is a responsibility of pcie_sriov to interpret SR-IOV configurations
> > and pcie_sriov does it correctly, so use pcie_sriov_num_vfs(), which it
> > provides, to get the number of enabled VFs before and after SR-IOV
> > configuration writes.
> > 
> > Cc: qemu-sta...@nongnu.org
> > Fixes: CVE-2024-26328
> > Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command")
> > Suggested-by: Michael S. Tsirkin 
> > Signed-off-by: Akihiko Odaki 
> > ---
> >  hw/nvme/ctrl.c | 26 --
> >  1 file changed, 8 insertions(+), 18 deletions(-)
> > 
> > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > index f026245d1e9e..7a56e7b79b4d 100644
> > --- a/hw/nvme/ctrl.c
> > +++ b/hw/nvme/ctrl.c
> > @@ -8466,36 +8466,26 @@ static void nvme_pci_reset(DeviceState *qdev)
> >  nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
> >  }
> >  
> > -static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
> > -                                      uint32_t val, int len)
> > +static void nvme_sriov_post_write_config(PCIDevice *dev, uint16_t old_num_vfs)
> >  {
> >      NvmeCtrl *n = NVME(dev);
> >      NvmeSecCtrlEntry *sctrl;
> > -    uint16_t sriov_cap = dev->exp.sriov_cap;
> > -    uint32_t off = address - sriov_cap;
> > -    int i, num_vfs;
> > +    int i;
> >  
> > -    if (!sriov_cap) {
> > -        return;
> > -    }
> > -
> > -    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> > -        if (!(val & PCI_SRIOV_CTRL_VFE)) {
> > -            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> > -            for (i = 0; i < num_vfs; i++) {
> > -                sctrl = &n->sec_ctrl_list.sec[i];
> > -                nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
> > -            }
> > -        }
> > +    for (i = pcie_sriov_num_vfs(dev); i < old_num_vfs; i++) {
> > +        sctrl = &n->sec_ctrl_list.sec[i];
> > +        nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
> >      }
> >  }
> 
> Maybe I'm missing something, but if the concern is that 'i' could run
> beyond the end of the array, I don't see anything that limits
> pcie_sriov_num_vfs() to the static size of 127 that n->sec_ctrl_list.sec
> has. register_vfs() seems to just take whatever 16 bit value the guest
> wrote without imposing additional restrictions.
> 
> If there is some mechanism that makes register_vfs() fail if we exceed
> the limit, maybe an assertion with a comment would be in order because
> it doesn't seem obvious. I couldn't find any code that enforces it,
> sriov_max_vfs only ever seems to be used as a hint for the guest.
> 
> If not, do we need another check that fails gracefully in the error
> case?

Ok, I see now that patch 2 fixes this. But then the commit message is
wrong because it implies that this patch is the only thing you need to
fix the problem with nvme. You can't say "Fixes: CVE-2024-26328" if half
of the fix is still missing.

Maybe asserting old_num_vfs < n->sec_ctrl_list.numcntl would still be a
good idea. But looking at this one, it seems to me that numcntl isn't
completely correct either:

list->numcntl = cpu_to_le16(max_vfs);

Both list->numcntl and max_vfs are uint8_t, so I think this will always
be 0 on big endian hosts?

Kevin




Re: [PATCH v6 01/15] hw/nvme: Use pcie_sriov_num_vfs()

2024-02-20 Thread Kevin Wolf
Am 20.02.2024 um 13:24 hat Akihiko Odaki geschrieben:
> nvme_sriov_pre_write_ctrl() used to directly inspect SR-IOV
> configurations to know the number of VFs being disabled due to SR-IOV
> configuration writes, but the logic was flawed and resulted in
> out-of-bound memory access.
> 
> It assumed PCI_SRIOV_NUM_VF always has the number of currently enabled
> VFs, but it actually doesn't in the following cases:
> - PCI_SRIOV_NUM_VF has been set but PCI_SRIOV_CTRL_VFE has never been.
> - PCI_SRIOV_NUM_VF was written after PCI_SRIOV_CTRL_VFE was set.
> - VFs were only partially enabled because of realization failure.
> 
> It is a responsibility of pcie_sriov to interpret SR-IOV configurations
> and pcie_sriov does it correctly, so use pcie_sriov_num_vfs(), which it
> provides, to get the number of enabled VFs before and after SR-IOV
> configuration writes.
> 
> Cc: qemu-sta...@nongnu.org
> Fixes: CVE-2024-26328
> Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command")
> Suggested-by: Michael S. Tsirkin 
> Signed-off-by: Akihiko Odaki 
> ---
>  hw/nvme/ctrl.c | 26 --
>  1 file changed, 8 insertions(+), 18 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index f026245d1e9e..7a56e7b79b4d 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -8466,36 +8466,26 @@ static void nvme_pci_reset(DeviceState *qdev)
>  nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
>  }
>  
> -static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
> -                                      uint32_t val, int len)
> +static void nvme_sriov_post_write_config(PCIDevice *dev, uint16_t old_num_vfs)
>  {
>      NvmeCtrl *n = NVME(dev);
>      NvmeSecCtrlEntry *sctrl;
> -    uint16_t sriov_cap = dev->exp.sriov_cap;
> -    uint32_t off = address - sriov_cap;
> -    int i, num_vfs;
> +    int i;
>  
> -    if (!sriov_cap) {
> -        return;
> -    }
> -
> -    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> -        if (!(val & PCI_SRIOV_CTRL_VFE)) {
> -            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> -            for (i = 0; i < num_vfs; i++) {
> -                sctrl = &n->sec_ctrl_list.sec[i];
> -                nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
> -            }
> -        }
> +    for (i = pcie_sriov_num_vfs(dev); i < old_num_vfs; i++) {
> +        sctrl = &n->sec_ctrl_list.sec[i];
> +        nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
>      }
>  }

Maybe I'm missing something, but if the concern is that 'i' could run
beyond the end of the array, I don't see anything that limits
pcie_sriov_num_vfs() to the static size of 127 that n->sec_ctrl_list.sec
has. register_vfs() seems to just take whatever 16 bit value the guest
wrote without imposing additional restrictions.

If there is some mechanism that makes register_vfs() fail if we exceed
the limit, maybe an assertion with a comment would be in order because
it doesn't seem obvious. I couldn't find any code that enforces it,
sriov_max_vfs only ever seems to be used as a hint for the guest.

If not, do we need another check that fails gracefully in the error
case?

Kevin




[PATCH] iotests: Make 144 deterministic again

2024-02-09 Thread Kevin Wolf
Since commit effd60c8 changed how QMP commands are processed, the order
of the block-commit return value and job events in iotests 144 wasn't
fixed any more, which caused the test to fail intermittently.

Change the test to cache events first and then print them in a
predefined order.

Waiting three times for JOB_STATUS_CHANGE is a bit uglier than just
waiting for the JOB_STATUS_CHANGE that has "status": "ready", but the
tooling we have doesn't seem to allow the latter easily.

Fixes: effd60c878176bcaf97fa7ce2b12d04bb8ead6f7
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2126
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/144 | 12 +++-
 tests/qemu-iotests/144.out |  2 +-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/144 b/tests/qemu-iotests/144
index bdcc498fa2..d284a0e442 100755
--- a/tests/qemu-iotests/144
+++ b/tests/qemu-iotests/144
@@ -83,12 +83,22 @@ echo
 echo === Performing block-commit on active layer ===
 echo
 
+capture_events="BLOCK_JOB_READY JOB_STATUS_CHANGE"
+
 # Block commit on active layer, push the new overlay into base
 _send_qemu_cmd $h "{ 'execute': 'block-commit',
 'arguments': {
  'device': 'virtio0'
   }
-}" "READY"
+}" "return"
+
+_wait_event $h "JOB_STATUS_CHANGE"
+_wait_event $h "JOB_STATUS_CHANGE"
+_wait_event $h "JOB_STATUS_CHANGE"
+
+_wait_event $h "BLOCK_JOB_READY"
+
+capture_events=
 
 _send_qemu_cmd $h "{ 'execute': 'block-job-complete',
 'arguments': {
diff --git a/tests/qemu-iotests/144.out b/tests/qemu-iotests/144.out
index b3b4812015..2245ddfa10 100644
--- a/tests/qemu-iotests/144.out
+++ b/tests/qemu-iotests/144.out
@@ -25,9 +25,9 @@ Formatting 'TEST_DIR/tmp.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off co
  'device': 'virtio0'
   }
 }
+{"return": {}}
{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "virtio0"}}
{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "virtio0"}}
-{"return": {}}
{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "virtio0"}}
{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_READY", "data": {"device": "virtio0", "len": 0, "offset": 0, "speed": 0, "type": "commit"}}
 { 'execute': 'block-job-complete',
-- 
2.43.0




Re: [RFC/INCOMPLETE PATCH 0/8] Attempt to make qemu-img options consistent and --help working

2024-02-09 Thread Kevin Wolf
Am 07.02.2024 um 18:58 hat Michael Tokarev geschrieben:
> This is an incomplete first attempt only, there's a lot left to do.
> 
> The options in qemu-img are a complete mess - missing, inconsistent
> or incomplete syntax in the documentation, many undocumented options,
> option names used inconsistently and differently for different
> commands, no long options for many short options, the --help output is
> a huge mess of its own, and the way it is all generated is another
> story.  docs/tools/qemu-img.rst with qemu-img-opts.hx is yet another.
> 
> I hoped to fix just an option or two, but it ended up in a large task,
> and I need some help and discussion, hence the RFC.
> 
> The idea is:
> 
>  - create more or less consistent set of options between different
>subcommands
>  - provide long options which can be used without figuring out which
>-T/-t, -f|-F|-O etc to use for which of the two images given
>  - have qemu-img --help provide just a list of subcommands
>  - have qemu-img COMMAND --help to describe just this subcommand

The help desperately needs some cleanup like this, so thank you for
doing something about it.

>  - get rid of qemu-img-opts.hx, instead finish documentation in
>qemu-img.rst based on the actual options implemented in
>qemu-img.c.

You mean qemu-img-cmds.hx? The one advantage it has is that it makes it
obvious if there is a mismatch in the options we show in the help output
and in the documentation.

But I'm not overly concerned either way. I would probably have left it
alone just because leaving it is less work than changing it and the
result isn't very different.

> I started converting subcommands one by one, providing long options
> and --help output.  And immediately faced some questions which needs
> wider discussion.
> 
>  o We have --image-opts and --target-image-opts.  Do we really need both?
>In my view, --image-opts should be sort of global, turning *all*
>filenames on this command line into complete image specifications,
>there's no need to have separate image-opts and --target-image-opts.
>Don't know what to do wrt compatibility though.  It shouldn't have
>been made this way from the beginning.  As a possible solution,
>introduce a new option and deprecate the current set.

I think it's better not to touch things like this. qemu-img is much more
likely to be used directly by human users (and their small scripts) than
QEMU itself, so we need to be even more careful with deprecating things.

In fact, I'm not even sure if combining them would make it easier to
use. Often, it's only source _or_ target that have a complicated setup
that requires blockdev-type descriptions. As a human user, I prefer if I
can still just use the file name for the other image instead of getting
the full -blockdev verbosity there, too.

>  o For conversion (convert, dd, etc), we've source and destination,
>so it's easy to distinguish using long options, like --source-format
>--target-cache etc (-t/-T -f/-F is a mess here already).  What to
>do with compare? --format1 --format2 is ugly, maybe --a-format and
>--b-format?  Maybe we can get by with --source (a) and --target (b)
>instead of filename1 & filename2?
>(--cache in this context is global for both).

For those commands where there is a source and a target, --source-format
and --target-format have a clear advantage, they are easy to remember
and hard to confuse.

For compare, as you saw, there is no clear naming. 1/2 or a/b don't make
the command any clearer than -f/-F. So maybe just don't add long
versions there?

>  o qemu-img convert.  It's the most messy one, and it is not really
>documented (nor in qemu-img.rst , eg there's nothing in there about
>FILENAME2, -B is difficult to understand, etc).
>At the very least, I'd say the options should be
> --source-format, --source-cache etc
> --target-format, --target-options

These seem obvious. (Actually, --target-options vs. just --options isn't
completely obvious - after all, there are fundamentally no create
options for source.)

--source-* are also a bit weird because 'qemu-img convert' takes
multiple source images and then it applies the same format to all of
them. But that's more about how it works, so not a problem with the
option _name_.

> --target-image-opts - this shold go IMHO

This one less so. Even generally speaking, changing interfaces
incompatibly comes with a cost that is probably too big to justify it
just for interfaces that look a bit nicer, but don't provide any new
functionality.

And as I said above, I don't agree that image-opts should be global.

>  o check and commit - inconsistent cache flags?
>In convert, cache is backwards (source/target)?

It's a bit messy because the cache mode options -T/-t work the opposite
way of -f/-F. But I think the commands are consistent with each other?

-T was added as a source cache parameter to all of the subcommands at
the same time, in commit 

Re: [PATCH] iothread: Simplify expression in qemu_in_iothread()

2024-02-08 Thread Kevin Wolf
Am 08.02.2024 um 11:48 hat Philippe Mathieu-Daudé geschrieben:
> On 8/2/24 11:16, Kevin Wolf wrote:
> > 'a == b ? false : true' is a rather convoluted way of writing 'a != b'.
> > Use the more obvious way to write it.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >   iothread.c | 3 +--
> >   1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/iothread.c b/iothread.c
> > index 6c1fc8c856..e1e9e04736 100644
> > --- a/iothread.c
> > +++ b/iothread.c
> > @@ -404,6 +404,5 @@ IOThread *iothread_by_id(const char *id)
> >   bool qemu_in_iothread(void)
> >   {
> > -return qemu_get_current_aio_context() == qemu_get_aio_context() ?
> > -false : true;
> > +return qemu_get_current_aio_context() != qemu_get_aio_context();
> >   }
> 
> Reviewed-by: Philippe Mathieu-Daudé 
> 
> BTW using the same pattern:
> 
> -- >8 --
> diff --git a/hw/nvram/xlnx-zynqmp-efuse.c b/hw/nvram/xlnx-zynqmp-efuse.c
> index ec98456e5d..d074762a25 100644
> --- a/hw/nvram/xlnx-zynqmp-efuse.c
> +++ b/hw/nvram/xlnx-zynqmp-efuse.c
> @@ -582,7 +582,7 @@ static uint64_t
> zynqmp_efuse_cache_load_prew(RegisterInfo *reg,
> 
>  static uint64_t zynqmp_efuse_wr_lock_prew(RegisterInfo *reg, uint64_t val)
>  {
> -return val == 0xDF0D ? 0 : 1;
> +return val != 0xDF0D;
>  }

Maybe. I would have to know that device to tell whether this is really
meant as a boolean. Or maybe it should be written as 0x0 and 0x1 to
signify that it's a register value or something.

> diff --git a/tests/tcg/aarch64/sysregs.c b/tests/tcg/aarch64/sysregs.c
> index 301e61d0dd..bdd73bd181 100644
> --- a/tests/tcg/aarch64/sysregs.c
> +++ b/tests/tcg/aarch64/sysregs.c
> @@ -183,5 +183,5 @@ int main(void)
>  return 1;
>  }
> 
> -return should_fail_count == 6 ? 0 : 1;
> +return should_fail_count != 6;
>  }

This one isn't unclear to me, though. This is EXIT_SUCCESS and
EXIT_FAILURE, just open-coded. I think making your change would make it
only more confusing.

Kevin




[PATCH] iothread: Simplify expression in qemu_in_iothread()

2024-02-08 Thread Kevin Wolf
'a == b ? false : true' is a rather convoluted way of writing 'a != b'.
Use the more obvious way to write it.

Signed-off-by: Kevin Wolf 
---
 iothread.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/iothread.c b/iothread.c
index 6c1fc8c856..e1e9e04736 100644
--- a/iothread.c
+++ b/iothread.c
@@ -404,6 +404,5 @@ IOThread *iothread_by_id(const char *id)
 
 bool qemu_in_iothread(void)
 {
-return qemu_get_current_aio_context() == qemu_get_aio_context() ?
-false : true;
+return qemu_get_current_aio_context() != qemu_get_aio_context();
 }
-- 
2.43.0




Re: [PULL 1/1] virtio-blk: avoid using ioeventfd state in irqfd conditional

2024-02-08 Thread Kevin Wolf
Am 08.02.2024 um 06:37 hat Michael Tokarev geschrieben:
> 06.02.2024 18:31, Stefan Hajnoczi :
> > Requests that complete in an IOThread use irqfd to notify the guest
> > while requests that complete in the main loop thread use the traditional
> > qdev irq code path. The reason for this conditional is that the irq code
> > path requires the BQL:
> > 
> >if (s->ioeventfd_started && !s->ioeventfd_disabled) {
> >virtio_notify_irqfd(vdev, req->vq);
> >} else {
> >virtio_notify(vdev, req->vq);
> >}
> > 
> > There is a corner case where the conditional invokes the irq code path
> > instead of the irqfd code path:
> > 
> >static void virtio_blk_stop_ioeventfd(VirtIODevice *vdev)
> >{
> >...
> >/*
> > * Set ->ioeventfd_started to false before draining so that host 
> > notifiers
> > * are not detached/attached anymore.
> > */
> >s->ioeventfd_started = false;
> > 
> >/* Wait for virtio_blk_dma_restart_bh() and in flight I/O to 
> > complete */
> >blk_drain(s->conf.conf.blk);
> > 
> > During blk_drain() the conditional produces the wrong result because
> > ioeventfd_started is false.
> > 
> > Use qemu_in_iothread() instead of checking the ioeventfd state.
> > 
> > Buglink: https://issues.redhat.com/browse/RHEL-15394
> > Signed-off-by: Stefan Hajnoczi 
> > Message-id: 20240122172625.415386-1-stefa...@redhat.com
> > Signed-off-by: Stefan Hajnoczi 
> 
> Cc qemu-stable?  This smells like a stable material, please let me know
> if it is not.

The patch email itself was CCed to qemu-stable and even contained a note
for backporting to stable:

https://lists.gnu.org/archive/html/qemu-block/2024-01/msg00278.html

It's only missing in the commit message. I'll add the Cc: line to
my pull request (for Stefan's pull request it seems too late because
Peter is already processing it, so we'll probably end up having both
versions in the git history).

Kevin




Re: [PULL 0/1] Block patches

2024-02-07 Thread Kevin Wolf
Am 06.02.2024 um 16:31 hat Stefan Hajnoczi geschrieben:
> The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440:
> 
>   Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into 
> staging (2024-02-03 13:31:58 +)
> 
> are available in the Git repository at:
> 
>   https://gitlab.com/stefanha/qemu.git tags/block-pull-request
> 
> for you to fetch changes up to 1d52cc0ac27761e296b14655c2f5b2649ee69491:
> 
>   virtio-blk: avoid using ioeventfd state in irqfd conditional (2024-02-06 
> 10:22:18 -0500)
> 
> 
> Pull request
> 
> A bug fix for in-flight I/O during ioeventfd shutdown.
> 
> 
> 
> Stefan Hajnoczi (1):
>   virtio-blk: avoid using ioeventfd state in irqfd conditional

I noticed that this patch is also in the pull request I sent, so I
guess if mine goes through, you don't have to process this one
separately.

Kevin




[PULL 07/16] scsi: Await request purging

2024-02-07 Thread Kevin Wolf
From: Hanna Czenczek 

scsi_device_for_each_req_async() currently does not provide any way to
be awaited.  One of its callers is scsi_device_purge_requests(), which
therefore currently does not guarantee that all requests are fully
settled when it returns.

We want all requests to be settled, because scsi_device_purge_requests()
is called through the unrealize path, including the one invoked by
virtio_scsi_hotunplug() through qdev_simple_device_unplug_cb(), which
most likely assumes that all SCSI requests are done then.

In fact, scsi_device_purge_requests() already contains a blk_drain(),
but this will not fully await scsi_device_for_each_req_async(), only the
I/O requests it potentially cancels (not the non-I/O requests).
However, we can have scsi_device_for_each_req_async() increment the BB
in-flight counter, and have scsi_device_for_each_req_async_bh()
decrement it when it is done.  This way, the blk_drain() will fully
await all SCSI requests to be purged.

This also removes the need for scsi_device_for_each_req_async_bh() to
double-check the current context and potentially re-schedule itself,
should it now differ from the BB's context: Changing a BB's AioContext
with a root node is done through bdrv_try_change_aio_context(), which
creates a drained section.  With this patch, we keep the BB in-flight
counter elevated throughout, so we know the BB's context cannot change.
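The in-flight pattern described above can be modeled in a few lines (illustrative stand-ins for blk_inc_in_flight()/blk_dec_in_flight()/blk_drain(), not the real QEMU API):

```c
#include <assert.h>

/* Counter of scheduled-but-unfinished callbacks, like the BB's
 * in-flight counter in the patch. */
static int in_flight;
static int work_done;

/* Like scsi_device_for_each_req_async(): bump the counter when
 * scheduling the BH. */
static void schedule_work(void)
{
    in_flight++;
}

/* Like scsi_device_for_each_req_async_bh(): do the work, then drop
 * the counter. */
static void run_scheduled(void)
{
    work_done = 1;
    in_flight--;
}

/* Like blk_drain(): keep running pending work until nothing is in
 * flight, so the caller knows all callbacks have settled. */
static void drain(void)
{
    while (in_flight > 0) {
        run_scheduled();
    }
}
```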

Signed-off-by: Hanna Czenczek 
Message-ID: <20240202144755.671354-3-hre...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-bus.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 0a2eb11c56..230313022c 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -120,17 +120,13 @@ static void scsi_device_for_each_req_async_bh(void 
*opaque)
 SCSIRequest *next;
 
 /*
- * If the AioContext changed before this BH was called then reschedule into
- * the new AioContext before accessing ->requests. This can happen when
- * scsi_device_for_each_req_async() is called and then the AioContext is
- * changed before BHs are run.
+ * The BB cannot have changed contexts between this BH being scheduled and
+ * now: BBs' AioContexts, when they have a node attached, can only be
+ * changed via bdrv_try_change_aio_context(), in a drained section.  While
+ * we have the in-flight counter incremented, that drain must block.
  */
 ctx = blk_get_aio_context(s->conf.blk);
-if (ctx != qemu_get_current_aio_context()) {
-aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
-g_steal_pointer());
-return;
-}
+assert(ctx == qemu_get_current_aio_context());
 
 QTAILQ_FOREACH_SAFE(req, >requests, next, next) {
 data->fn(req, data->fn_opaque);
@@ -138,11 +134,16 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
 
 /* Drop the reference taken by scsi_device_for_each_req_async() */
 object_unref(OBJECT(s));
+
+/* Paired with blk_inc_in_flight() in scsi_device_for_each_req_async() */
+blk_dec_in_flight(s->conf.blk);
 }
 
 /*
  * Schedule @fn() to be invoked for each enqueued request in device @s. @fn()
  * runs in the AioContext that is executing the request.
+ * Keeps the BlockBackend's in-flight counter incremented until everything is
+ * done, so draining it will settle all scheduled @fn() calls.
  */
 static void scsi_device_for_each_req_async(SCSIDevice *s,
void (*fn)(SCSIRequest *, void *),
@@ -163,6 +164,8 @@ static void scsi_device_for_each_req_async(SCSIDevice *s,
  */
 object_ref(OBJECT(s));
 
+/* Paired with blk_dec_in_flight() in scsi_device_for_each_req_async_bh() */
+blk_inc_in_flight(s->conf.blk);
 aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
 scsi_device_for_each_req_async_bh,
 data);
@@ -1728,11 +1731,20 @@ static void scsi_device_purge_one_req(SCSIRequest *req, void *opaque)
 scsi_req_cancel_async(req, NULL);
 }
 
+/**
+ * Cancel all requests, and block until they are deleted.
+ */
 void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense)
 {
 scsi_device_for_each_req_async(sdev, scsi_device_purge_one_req, NULL);
 
+/*
+ * Await all the scsi_device_purge_one_req() calls scheduled by
+ * scsi_device_for_each_req_async(), and all I/O requests that were
+ * cancelled this way, but may still take a bit of time to settle.
+ */
 blk_drain(sdev->conf.blk);
+
 scsi_device_set_ua(sdev, sense);
 }
 
-- 
2.43.0




[PULL 06/16] block-backend: Allow concurrent context changes

2024-02-07 Thread Kevin Wolf
From: Hanna Czenczek 

Since AioContext locks have been removed, a BlockBackend's AioContext
may really change at any time (only exception is that it is often
confined to a drained section, as noted in this patch).  Therefore,
blk_get_aio_context() cannot rely on its root node's context always
matching that of the BlockBackend.

In practice, whether they match does not matter anymore anyway: Requests
can be sent to BDSs from any context, so anyone who requests the BB's
context should have no reason to require the root node to have the same
context.  Therefore, we can and should remove the assertion to that
effect.

In addition, because the context can be set and queried from different
threads concurrently, it has to be accessed with atomic operations.

Buglink: https://issues.redhat.com/browse/RHEL-19381
Suggested-by: Kevin Wolf 
Signed-off-by: Hanna Czenczek 
Message-ID: <20240202144755.671354-2-hre...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/block-backend.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 209eb07528..9c4de79e6b 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -44,7 +44,7 @@ struct BlockBackend {
 char *name;
 int refcnt;
 BdrvChild *root;
-AioContext *ctx;
+AioContext *ctx; /* access with atomic operations only */
 DriveInfo *legacy_dinfo;/* null unless created by drive_new() */
 QTAILQ_ENTRY(BlockBackend) link; /* for block_backends */
 QTAILQ_ENTRY(BlockBackend) monitor_link; /* for monitor_block_backends */
@@ -2414,22 +2414,22 @@ void blk_op_unblock_all(BlockBackend *blk, Error *reason)
 }
 }
 
+/**
+ * Return BB's current AioContext.  Note that this context may change
+ * concurrently at any time, with one exception: If the BB has a root node
+ * attached, its context will only change through bdrv_try_change_aio_context(),
+ * which creates a drained section.  Therefore, incrementing such a BB's
+ * in-flight counter will prevent its context from changing.
+ */
 AioContext *blk_get_aio_context(BlockBackend *blk)
 {
-BlockDriverState *bs;
 IO_CODE();
 
 if (!blk) {
 return qemu_get_aio_context();
 }
 
-bs = blk_bs(blk);
-if (bs) {
-AioContext *ctx = bdrv_get_aio_context(blk_bs(blk));
-assert(ctx == blk->ctx);
-}
-
-return blk->ctx;
+return qatomic_read(&blk->ctx);
 }
 
 int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
@@ -2442,7 +2442,7 @@ int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
 GLOBAL_STATE_CODE();
 
 if (!bs) {
-blk->ctx = new_context;
+qatomic_set(&blk->ctx, new_context);
 return 0;
 }
 
@@ -2471,7 +2471,7 @@ static void blk_root_set_aio_ctx_commit(void *opaque)
 AioContext *new_context = s->new_ctx;
 ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
 
-blk->ctx = new_context;
+qatomic_set(&blk->ctx, new_context);
 if (tgm->throttle_state) {
 throttle_group_detach_aio_context(tgm);
 throttle_group_attach_aio_context(tgm, new_context);
-- 
2.43.0




[PULL 03/16] virtio-blk: add vq_rq[] bounds check in virtio_blk_dma_restart_cb()

2024-02-07 Thread Kevin Wolf
From: Stefan Hajnoczi 

Hanna Czenczek  noted that the array index in
virtio_blk_dma_restart_cb() is not bounds-checked:

  g_autofree VirtIOBlockReq **vq_rq = g_new0(VirtIOBlockReq *, num_queues);
  ...
  while (rq) {
  VirtIOBlockReq *next = rq->next;
  uint16_t idx = virtio_get_queue_index(rq->vq);

  rq->next = vq_rq[idx];
 ^^

The code is correct because both rq->vq and vq_rq[] depend on
num_queues, but this is indirect and not 100% obvious. Add an assertion.

Suggested-by: Hanna Czenczek 
Reviewed-by: Manos Pitsidianakis 
Reviewed-by: Hanna Czenczek 
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240206190610.107963-4-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Kevin Wolf 
---
 hw/block/virtio-blk.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index e430ba583c..31212506ca 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1209,6 +1209,8 @@ static void virtio_blk_dma_restart_cb(void *opaque, bool running,
 VirtIOBlockReq *next = rq->next;
 uint16_t idx = virtio_get_queue_index(rq->vq);
 
+/* Only num_queues vqs were created so vq_rq[idx] is within bounds */
+assert(idx < num_queues);
 rq->next = vq_rq[idx];
 vq_rq[idx] = rq;
 rq = next;
-- 
2.43.0




[PULL 01/16] virtio-blk: enforce iothread-vq-mapping validation

2024-02-07 Thread Kevin Wolf
From: Stefan Hajnoczi 

Hanna Czenczek  noticed that the safety of
`vq_aio_context[vq->value] = ctx;` with user-defined vq->value inputs is
not obvious.

The code is structured in validate() + apply() steps so input validation
is there, but it happens way earlier and there is nothing that
guarantees apply() can only be called with validated inputs.

This patch moves the validate() call inside the apply() function so
validation is guaranteed. I also added the bounds checking assertion
that Hanna suggested.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Manos Pitsidianakis 
Reviewed-by: Hanna Czenczek 
Message-ID: <20240206190610.107963-2-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Kevin Wolf 
---
 hw/block/virtio-blk.c | 183 +++---
 1 file changed, 102 insertions(+), 81 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 227d83569f..6e3e3a23ee 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1485,68 +1485,6 @@ static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f,
 return 0;
 }
 
-static bool
-validate_iothread_vq_mapping_list(IOThreadVirtQueueMappingList *list,
-uint16_t num_queues, Error **errp)
-{
-g_autofree unsigned long *vqs = bitmap_new(num_queues);
-g_autoptr(GHashTable) iothreads =
-g_hash_table_new(g_str_hash, g_str_equal);
-
-for (IOThreadVirtQueueMappingList *node = list; node; node = node->next) {
-const char *name = node->value->iothread;
-uint16List *vq;
-
-if (!iothread_by_id(name)) {
-error_setg(errp, "IOThread \"%s\" object does not exist", name);
-return false;
-}
-
-if (!g_hash_table_add(iothreads, (gpointer)name)) {
-error_setg(errp,
-"duplicate IOThread name \"%s\" in iothread-vq-mapping",
-name);
-return false;
-}
-
-if (node != list) {
-if (!!node->value->vqs != !!list->value->vqs) {
-error_setg(errp, "either all items in iothread-vq-mapping "
- "must have vqs or none of them must have it");
-return false;
-}
-}
-
-for (vq = node->value->vqs; vq; vq = vq->next) {
-if (vq->value >= num_queues) {
-error_setg(errp, "vq index %u for IOThread \"%s\" must be "
-"less than num_queues %u in iothread-vq-mapping",
-vq->value, name, num_queues);
-return false;
-}
-
-if (test_and_set_bit(vq->value, vqs)) {
-error_setg(errp, "cannot assign vq %u to IOThread \"%s\" "
-"because it is already assigned", vq->value, name);
-return false;
-}
-}
-}
-
-if (list->value->vqs) {
-for (uint16_t i = 0; i < num_queues; i++) {
-if (!test_bit(i, vqs)) {
-error_setg(errp,
-"missing vq %u IOThread assignment in iothread-vq-mapping",
-i);
-return false;
-}
-}
-}
-
-return true;
-}
-
 static void virtio_resize_cb(void *opaque)
 {
 VirtIODevice *vdev = opaque;
@@ -1613,15 +1551,95 @@ static const BlockDevOps virtio_block_ops = {
 .drained_end   = virtio_blk_drained_end,
 };
 
-/* Generate vq:AioContext mappings from a validated iothread-vq-mapping list */
-static void
-apply_vq_mapping(IOThreadVirtQueueMappingList *iothread_vq_mapping_list,
- AioContext **vq_aio_context, uint16_t num_queues)
+static bool
+validate_iothread_vq_mapping_list(IOThreadVirtQueueMappingList *list,
+uint16_t num_queues, Error **errp)
+{
+g_autofree unsigned long *vqs = bitmap_new(num_queues);
+g_autoptr(GHashTable) iothreads =
+g_hash_table_new(g_str_hash, g_str_equal);
+
+for (IOThreadVirtQueueMappingList *node = list; node; node = node->next) {
+const char *name = node->value->iothread;
+uint16List *vq;
+
+if (!iothread_by_id(name)) {
+error_setg(errp, "IOThread \"%s\" object does not exist", name);
+return false;
+}
+
+if (!g_hash_table_add(iothreads, (gpointer)name)) {
+error_setg(errp,
+"duplicate IOThread name \"%s\" in iothread-vq-mapping",
+name);
+return false;
+}
+
+if (node != list) {
+if (!!node->value->vqs != !!list->value->vqs) {
+error_setg(errp, "either all items in iothread-vq-mapping "
+ 

[PULL 10/16] virtio-blk: do not use C99 mixed declarations

2024-02-07 Thread Kevin Wolf
From: Stefan Hajnoczi 

QEMU's coding style generally forbids C99 mixed declarations.

Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240206140410.65650-1-stefa...@redhat.com>
Reviewed-by: Hanna Czenczek 
Reviewed-by: Kevin Wolf 
Acked-by: Michael S. Tsirkin 
Signed-off-by: Kevin Wolf 
---
 hw/block/virtio-blk.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 31212506ca..bda5c117d4 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -661,6 +661,9 @@ static void virtio_blk_zone_report_complete(void *opaque, int ret)
 int64_t zrp_size, n, j = 0;
 int64_t nz = data->zone_report_data.nr_zones;
 int8_t err_status = VIRTIO_BLK_S_OK;
+struct virtio_blk_zone_report zrp_hdr = (struct virtio_blk_zone_report) {
+.nr_zones = cpu_to_le64(nz),
+};
 
 trace_virtio_blk_zone_report_complete(vdev, req, nz, ret);
 if (ret) {
@@ -668,9 +671,6 @@ static void virtio_blk_zone_report_complete(void *opaque, int ret)
 goto out;
 }
 
-struct virtio_blk_zone_report zrp_hdr = (struct virtio_blk_zone_report) {
-.nr_zones = cpu_to_le64(nz),
-};
 zrp_size = sizeof(struct virtio_blk_zone_report)
+ sizeof(struct virtio_blk_zone_descriptor) * nz;
 n = iov_from_buf(in_iov, in_num, 0, &zrp_hdr, sizeof(zrp_hdr));
@@ -898,13 +898,14 @@ static int virtio_blk_handle_zone_append(VirtIOBlockReq *req,
 
 int64_t offset = virtio_ldq_p(vdev, &req->out.sector) << BDRV_SECTOR_BITS;
 int64_t len = iov_size(out_iov, out_num);
+ZoneCmdData *data;
 
 trace_virtio_blk_handle_zone_append(vdev, req, offset >> BDRV_SECTOR_BITS);
 if (!check_zoned_request(s, offset, len, true, _status)) {
 goto out;
 }
 
-ZoneCmdData *data = g_malloc(sizeof(ZoneCmdData));
+data = g_malloc(sizeof(ZoneCmdData));
 data->req = req;
 data->in_iov = in_iov;
 data->in_num = in_num;
@@ -1191,14 +1192,15 @@ static void virtio_blk_dma_restart_cb(void *opaque, bool running,
 {
 VirtIOBlock *s = opaque;
 uint16_t num_queues = s->conf.num_queues;
+g_autofree VirtIOBlockReq **vq_rq = NULL;
+VirtIOBlockReq *rq;
 
 if (!running) {
 return;
 }
 
 /* Split the device-wide s->rq request list into per-vq request lists */
-g_autofree VirtIOBlockReq **vq_rq = g_new0(VirtIOBlockReq *, num_queues);
-VirtIOBlockReq *rq;
+vq_rq = g_new0(VirtIOBlockReq *, num_queues);
 
 WITH_QEMU_LOCK_GUARD(&s->rq_lock) {
 rq = s->rq;
@@ -1961,6 +1963,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
 VirtIOBlock *s = VIRTIO_BLK(dev);
 VirtIOBlkConf *conf = &s->conf;
+BlockDriverState *bs;
 Error *err = NULL;
 unsigned i;
 
@@ -2006,7 +2009,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 return;
 }
 
-BlockDriverState *bs = blk_bs(conf->conf.blk);
+bs = blk_bs(conf->conf.blk);
 if (bs->bl.zoned != BLK_Z_NONE) {
 virtio_add_feature(&s->host_features, VIRTIO_BLK_F_ZONED);
 if (bs->bl.zoned == BLK_Z_HM) {
-- 
2.43.0




[PULL 12/16] blkio: Respect memory-alignment for bounce buffer allocations

2024-02-07 Thread Kevin Wolf
blkio_alloc_mem_region() requires that the requested buffer size is a
multiple of the memory-alignment property. If it isn't, the allocation
fails with a return value of -EINVAL.

Fix the call in blkio_resize_bounce_pool() to make sure the requested
size is properly aligned.

I observed this problem with vhost-vdpa, which requires page aligned
memory. As the virtio-blk device behind it still had 512 byte blocks, we
got bs->bl.request_alignment = 512, but actually any request that needed
a bounce buffer and was not aligned to 4k would fail without this fix.

Suggested-by: Stefano Garzarella 
Signed-off-by: Kevin Wolf 
Message-ID: <20240131173140.42398-1-kw...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Stefano Garzarella 
Signed-off-by: Kevin Wolf 
---
 block/blkio.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blkio.c b/block/blkio.c
index bc2f21784c..882e1c297b 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -89,6 +89,9 @@ static int blkio_resize_bounce_pool(BDRVBlkioState *s, int64_t bytes)
 /* Pad size to reduce frequency of resize calls */
 bytes += 128 * 1024;
 
+/* Align the pool size to avoid blkio_alloc_mem_region() failure */
+bytes = QEMU_ALIGN_UP(bytes, s->mem_region_alignment);
+
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 int ret;
 
-- 
2.43.0




[PULL 05/16] monitor: use aio_co_reschedule_self()

2024-02-07 Thread Kevin Wolf
From: Stefan Hajnoczi 

The aio_co_reschedule_self() API is designed to avoid the race
condition between scheduling the coroutine in another AioContext and
yielding.

The QMP dispatch code uses the open-coded version that appears
susceptible to the race condition at first glance:

  aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
  qemu_coroutine_yield();

The code is actually safe because the iohandler and qemu_aio_context
AioContext run under the Big QEMU Lock. Nevertheless, set a good example
and use aio_co_reschedule_self() so it's obvious that there is no race.

Suggested-by: Hanna Reitz 
Reviewed-by: Manos Pitsidianakis 
Reviewed-by: Hanna Czenczek 
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240206190610.107963-6-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Acked-by: Markus Armbruster 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Kevin Wolf 
---
 qapi/qmp-dispatch.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
index 176b549473..f3488afeef 100644
--- a/qapi/qmp-dispatch.c
+++ b/qapi/qmp-dispatch.c
@@ -212,8 +212,7 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList *cmds, QObject *requ
  * executing the command handler so that it can make progress if it
  * involves an AIO_WAIT_WHILE().
  */
-aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
-qemu_coroutine_yield();
+aio_co_reschedule_self(qemu_get_aio_context());
 }
 
 monitor_set_cur(qemu_coroutine_self(), cur_mon);
@@ -227,9 +226,7 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList *cmds, QObject *requ
  * Move back to iohandler_ctx so that nested event loops for
  * qemu_aio_context don't start new monitor commands.
  */
-aio_co_schedule(iohandler_get_aio_context(),
-qemu_coroutine_self());
-qemu_coroutine_yield();
+aio_co_reschedule_self(iohandler_get_aio_context());
 }
 } else {
/*
-- 
2.43.0




[PULL 09/16] iotests: give tempdir an identifying name

2024-02-07 Thread Kevin Wolf
From: Daniel P. Berrangé 

If something goes wrong causing the iotests not to cleanup their
temporary directory, it is useful if the dir had an identifying
name to show what is to blame.

Signed-off-by: Daniel P. Berrangé 
Message-ID: <20240205155158.1843304-1-berra...@redhat.com>
Reviewed-by: Michael Tokarev 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/testenv.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/testenv.py b/tests/qemu-iotests/testenv.py
index 3ff38f2661..588f30a4f1 100644
--- a/tests/qemu-iotests/testenv.py
+++ b/tests/qemu-iotests/testenv.py
@@ -126,7 +126,7 @@ def init_directories(self) -> None:
 self.tmp_sock_dir = False
 Path(self.sock_dir).mkdir(parents=True, exist_ok=True)
 except KeyError:
-self.sock_dir = tempfile.mkdtemp()
+self.sock_dir = tempfile.mkdtemp(prefix="qemu-iotests-")
 self.tmp_sock_dir = True
 
 self.sample_img_dir = os.getenv('SAMPLE_IMG_DIR',
-- 
2.43.0




[PULL 14/16] virtio: Re-enable notifications after drain

2024-02-07 Thread Kevin Wolf
From: Hanna Czenczek 

During drain, we do not care about virtqueue notifications, which is why
we remove the handlers on it.  When removing those handlers, whether vq
notifications are enabled or not depends on whether we were in polling
mode or not; if not, they are enabled (by default); if so, they have
been disabled by the io_poll_start callback.

Because we do not care about those notifications after removing the
handlers, this is fine.  However, we have to explicitly ensure they are
enabled when re-attaching the handlers, so we will resume receiving
notifications.  We do this in virtio_queue_aio_attach_host_notifier*().
If such a function is called while we are in a polling section,
attaching the notifiers will then invoke the io_poll_start callback,
re-disabling notifications.

Because we will always miss virtqueue updates in the drained section, we
also need to poll the virtqueue once after attaching the notifiers.

Buglink: https://issues.redhat.com/browse/RHEL-3934
Signed-off-by: Hanna Czenczek 
Message-ID: <20240202153158.788922-3-hre...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 include/block/aio.h |  7 ++-
 hw/virtio/virtio.c  | 42 ++
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 5d0a114988..8378553eb9 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -480,9 +480,14 @@ void aio_set_event_notifier(AioContext *ctx,
 AioPollFn *io_poll,
 EventNotifierHandler *io_poll_ready);
 
-/* Set polling begin/end callbacks for an event notifier that has already been
+/*
+ * Set polling begin/end callbacks for an event notifier that has already been
  * registered with aio_set_event_notifier.  Do nothing if the event notifier is
  * not registered.
+ *
+ * Note that if the io_poll_end() callback (or the entire notifier) is removed
+ * during polling, it will not be called, so an io_poll_begin() is not
+ * necessarily always followed by an io_poll_end().
  */
 void aio_set_event_notifier_poll(AioContext *ctx,
  EventNotifier *notifier,
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 7549094154..d229755eae 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3556,6 +3556,17 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
 
 void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
+/*
+ * virtio_queue_aio_detach_host_notifier() can leave notifications disabled.
+ * Re-enable them.  (And if detach has not been used before, notifications
+ * being enabled is still the default state while a notifier is attached;
+ * see virtio_queue_host_notifier_aio_poll_end(), which will always leave
+ * notifications enabled once the polling section is left.)
+ */
+if (!virtio_queue_get_notification(vq)) {
+virtio_queue_set_notification(vq, 1);
+}
+
 aio_set_event_notifier(ctx, &vq->host_notifier,
virtio_queue_host_notifier_read,
virtio_queue_host_notifier_aio_poll,
@@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 aio_set_event_notifier_poll(ctx, &vq->host_notifier,
 virtio_queue_host_notifier_aio_poll_begin,
 virtio_queue_host_notifier_aio_poll_end);
+
+/*
+ * We will have ignored notifications about new requests from the guest
+ * while no notifiers were attached, so "kick" the virt queue to process
+ * those requests now.
+ */
+event_notifier_set(&vq->host_notifier);
 }
 
 /*
@@ -3573,14 +3591,38 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
  */
void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx)
 {
+/* See virtio_queue_aio_attach_host_notifier() */
+if (!virtio_queue_get_notification(vq)) {
+virtio_queue_set_notification(vq, 1);
+}
+
 aio_set_event_notifier(ctx, &vq->host_notifier,
virtio_queue_host_notifier_read,
NULL, NULL);
+
+/*
+ * See virtio_queue_aio_attach_host_notifier().
+ * Note that this may be unnecessary for the type of virtqueues this
+ * function is used for.  Still, it will not hurt to have a quick look into
+ * whether we can/should process any of the virtqueue elements.
+ */
+event_notifier_set(&vq->host_notifier);
 }
 
 void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
 aio_set_event_notifier(ctx, &vq->host_notifier, NULL, NULL, NULL);
+
+/*
+ * aio_set_event_notifier_poll() does not guarantee whether io_poll_end()
+ * will run after io_poll_begin(), so by removing the notifi

[PULL 04/16] virtio-blk: declare VirtIOBlock::rq with a type

2024-02-07 Thread Kevin Wolf
From: Stefan Hajnoczi 

The VirtIOBlock::rq field has had the type void * since its introduction
in commit 869a5c6df19a ("Stop VM on error in virtio-blk. (Gleb
Natapov)").

Perhaps this was done to avoid the forward declaration of
VirtIOBlockReq.

Hanna Czenczek  pointed out the missing type. Specify
the actual type because there is no need to use void * here.

Suggested-by: Hanna Czenczek 
Reviewed-by: Manos Pitsidianakis 
Reviewed-by: Hanna Czenczek 
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240206190610.107963-5-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Kevin Wolf 
---
 include/hw/virtio/virtio-blk.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 833a9a344f..5c14110c4b 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -55,7 +55,7 @@ struct VirtIOBlock {
 VirtIODevice parent_obj;
 BlockBackend *blk;
 QemuMutex rq_lock;
-void *rq; /* protected by rq_lock */
+struct VirtIOBlockReq *rq; /* protected by rq_lock */
 VirtIOBlkConf conf;
 unsigned short sector_mask;
 bool original_wce;
-- 
2.43.0




[PULL 13/16] virtio-scsi: Attach event vq notifier with no_poll

2024-02-07 Thread Kevin Wolf
From: Hanna Czenczek 

As of commit 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi:
don't waste CPU polling the event virtqueue"), we only attach an io_read
notifier for the virtio-scsi event virtqueue instead, and no polling
notifiers.  During operation, the event virtqueue is typically
non-empty, but none of the buffers are intended to be used immediately.
Instead, they only get used when certain events occur.  Therefore, it
makes no sense to continuously poll it when non-empty, because it is
supposed to be and stay non-empty.

We do this by using virtio_queue_aio_attach_host_notifier_no_poll()
instead of virtio_queue_aio_attach_host_notifier() for the event
virtqueue.

Commit 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement
BlockDevOps->drained_begin()") however has virtio_scsi_drained_end() use
virtio_queue_aio_attach_host_notifier() for all virtqueues, including
the event virtqueue.  This can lead to it being polled again, undoing
the benefit of commit 38738f7dbbda90fbc161757b7f4be35b52205552.

Fix it by using virtio_queue_aio_attach_host_notifier_no_poll() for the
event virtqueue.

Reported-by: Fiona Ebner 
Fixes: 766aa2de0f29b657148e04599320d771c36fd126
   ("virtio-scsi: implement BlockDevOps->drained_begin()")
Reviewed-by: Stefan Hajnoczi 
Tested-by: Fiona Ebner 
Reviewed-by: Fiona Ebner 
Signed-off-by: Hanna Czenczek 
Message-ID: <20240202153158.788922-2-hre...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 hw/scsi/virtio-scsi.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 690aceec45..9f02ceea09 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -1149,6 +1149,7 @@ static void virtio_scsi_drained_begin(SCSIBus *bus)
 static void virtio_scsi_drained_end(SCSIBus *bus)
 {
 VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus);
+VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
 VirtIODevice *vdev = VIRTIO_DEVICE(s);
 uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED +
 s->parent_obj.conf.num_queues;
@@ -1166,7 +1167,11 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
 
 for (uint32_t i = 0; i < total_queues; i++) {
 VirtQueue *vq = virtio_get_queue(vdev, i);
-virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+if (vq == vs->event_vq) {
+virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx);
+} else {
+virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+}
 }
 }
 
-- 
2.43.0




[PULL 08/16] iotests: fix leak of tmpdir in dry-run mode

2024-02-07 Thread Kevin Wolf
From: Daniel P. Berrangé 

Creating an instance of the 'TestEnv' class will create a temporary
directory. This dir is only deleted, however, in the __exit__ handler
invoked by a context manager.

In dry-run mode, we don't use the TestEnv via a context manager, so
were leaking the temporary directory. Since meson invokes 'check'
5 times on each configure run, developers /tmp was filling up with
empty temporary directories.

Signed-off-by: Daniel P. Berrangé 
Message-ID: <20240205154019.1841037-1-berra...@redhat.com>
Reviewed-by: Michael Tokarev 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/check | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
index f2e9d27dcf..56d88ca423 100755
--- a/tests/qemu-iotests/check
+++ b/tests/qemu-iotests/check
@@ -184,7 +184,8 @@ if __name__ == '__main__':
 sys.exit(str(e))
 
 if args.dry_run:
-print('\n'.join([os.path.basename(t) for t in tests]))
+with env:
+print('\n'.join([os.path.basename(t) for t in tests]))
 else:
 with TestRunner(env, tap=args.tap,
 color=args.color) as tr:
-- 
2.43.0




[PULL 00/16] Block layer patches

2024-02-07 Thread Kevin Wolf
The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440:

  Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into staging (2024-02-03 13:31:58 +)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 7ccd0415f2d67e6739da756241f60d98d5c80bf8:

  virtio-blk: avoid using ioeventfd state in irqfd conditional (2024-02-07 21:59:07 +0100)


Block layer patches

- Allow concurrent BB context changes
- virtio: Re-enable notifications after drain
- virtio-blk: Fix missing use of irqfd
- scsi: Don't ignore most usb-storage properties
- blkio: Respect memory-alignment for bounce buffer allocations
- iotests tmpdir fixes
- virtio-blk: Code cleanups


Daniel P. Berrangé (2):
  iotests: fix leak of tmpdir in dry-run mode
  iotests: give tempdir an identifying name

Hanna Czenczek (5):
  block-backend: Allow concurrent context changes
  scsi: Await request purging
  virtio-scsi: Attach event vq notifier with no_poll
  virtio: Re-enable notifications after drain
  virtio-blk: Use ioeventfd_attach in start_ioeventfd

Kevin Wolf (2):
  scsi: Don't ignore most usb-storage properties
  blkio: Respect memory-alignment for bounce buffer allocations

Stefan Hajnoczi (7):
  virtio-blk: enforce iothread-vq-mapping validation
  virtio-blk: clarify that there is at least 1 virtqueue
  virtio-blk: add vq_rq[] bounds check in virtio_blk_dma_restart_cb()
  virtio-blk: declare VirtIOBlock::rq with a type
  monitor: use aio_co_reschedule_self()
  virtio-blk: do not use C99 mixed declarations
  virtio-blk: avoid using ioeventfd state in irqfd conditional

 include/block/aio.h|   7 +-
 include/hw/scsi/scsi.h |   5 +-
 include/hw/virtio/virtio-blk.h |   2 +-
 block/blkio.c  |   3 +
 block/block-backend.c  |  22 ++--
 hw/block/virtio-blk.c  | 226 +++--
 hw/scsi/scsi-bus.c |  63 ++--
 hw/scsi/virtio-scsi.c  |   7 +-
 hw/usb/dev-storage-classic.c   |   5 +-
 hw/virtio/virtio.c |  42 
 qapi/qmp-dispatch.c|   7 +-
 tests/qemu-iotests/testenv.py  |   2 +-
 tests/qemu-iotests/check   |   3 +-
 13 files changed, 236 insertions(+), 158 deletions(-)




[PULL 11/16] scsi: Don't ignore most usb-storage properties

2024-02-07 Thread Kevin Wolf
usb-storage is for the most part just a wrapper around an internally
created scsi-disk device. It uses DEFINE_BLOCK_PROPERTIES() to offer all
of the usual block device properties to the user, but then only forwards
a few select properties to the internal device while the rest is
silently ignored.

This changes scsi_bus_legacy_add_drive() to accept a whole BlockConf
instead of some individual values inside of it so that usb-storage can
now pass the whole configuration to the internal scsi-disk. This enables
the remaining block device properties, e.g. logical/physical_block_size
or discard_granularity.

Buglink: https://issues.redhat.com/browse/RHEL-22375
Signed-off-by: Kevin Wolf 
Message-ID: <20240131130607.24117-1-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/hw/scsi/scsi.h   |  5 +
 hw/scsi/scsi-bus.c   | 33 +
 hw/usb/dev-storage-classic.c |  5 +
 3 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 10c4e8288d..c3d5e17e38 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -199,10 +199,7 @@ static inline SCSIBus *scsi_bus_from_device(SCSIDevice *d)
 }
 
 SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk,
-  int unit, bool removable, int bootindex,
-  bool share_rw,
-  BlockdevOnError rerror,
-  BlockdevOnError werror,
+  int unit, bool removable, BlockConf *conf,
   const char *serial, Error **errp);
 void scsi_bus_set_ua(SCSIBus *bus, SCSISense sense);
 void scsi_bus_legacy_handle_cmdline(SCSIBus *bus);
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 230313022c..9e40b0c920 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -376,15 +376,13 @@ static void scsi_qdev_unrealize(DeviceState *qdev)
 
 /* handle legacy '-drive if=scsi,...' cmd line args */
 SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk,
-  int unit, bool removable, int bootindex,
-  bool share_rw,
-  BlockdevOnError rerror,
-  BlockdevOnError werror,
+  int unit, bool removable, BlockConf *conf,
   const char *serial, Error **errp)
 {
 const char *driver;
 char *name;
 DeviceState *dev;
+SCSIDevice *s;
 DriveInfo *dinfo;
 
 if (blk_is_sg(blk)) {
@@ -402,11 +400,10 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk,
 object_property_add_child(OBJECT(bus), name, OBJECT(dev));
 g_free(name);
 
+s = SCSI_DEVICE(dev);
+s->conf = *conf;
+
 qdev_prop_set_uint32(dev, "scsi-id", unit);
-if (bootindex >= 0) {
-object_property_set_int(OBJECT(dev), "bootindex", bootindex,
-_abort);
-}
 if (object_property_find(OBJECT(dev), "removable")) {
 qdev_prop_set_bit(dev, "removable", removable);
 }
@@ -417,19 +414,12 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, 
BlockBackend *blk,
 object_unparent(OBJECT(dev));
 return NULL;
 }
-if (!object_property_set_bool(OBJECT(dev), "share-rw", share_rw, errp)) {
-object_unparent(OBJECT(dev));
-return NULL;
-}
-
-qdev_prop_set_enum(dev, "rerror", rerror);
-qdev_prop_set_enum(dev, "werror", werror);
 
 if (!qdev_realize_and_unref(dev, >qbus, errp)) {
 object_unparent(OBJECT(dev));
 return NULL;
 }
-return SCSI_DEVICE(dev);
+return s;
 }
 
 void scsi_bus_legacy_handle_cmdline(SCSIBus *bus)
@@ -437,6 +427,12 @@ void scsi_bus_legacy_handle_cmdline(SCSIBus *bus)
 Location loc;
 DriveInfo *dinfo;
 int unit;
+BlockConf conf = {
+.bootindex = -1,
+.share_rw = false,
+.rerror = BLOCKDEV_ON_ERROR_AUTO,
+.werror = BLOCKDEV_ON_ERROR_AUTO,
+};
 
 loc_push_none();
 for (unit = 0; unit <= bus->info->max_target; unit++) {
@@ -446,10 +442,7 @@ void scsi_bus_legacy_handle_cmdline(SCSIBus *bus)
 }
 qemu_opts_loc_restore(dinfo->opts);
 scsi_bus_legacy_add_drive(bus, blk_by_legacy_dinfo(dinfo),
-  unit, false, -1, false,
-  BLOCKDEV_ON_ERROR_AUTO,
-  BLOCKDEV_ON_ERROR_AUTO,
-  NULL, _fatal);
+  unit, false, , NULL, _fatal);
 }
 loc_pop();
 }
diff --git a/hw/usb/dev-storage-classic.c b/hw/usb/dev-storage-classic.c
index 84d19752b5..50a3ad6285 100644
---

[PULL 15/16] virtio-blk: Use ioeventfd_attach in start_ioeventfd

2024-02-07 Thread Kevin Wolf
From: Hanna Czenczek 

Commit d3f6f294aeadd5f88caf0155e4360808c95b3146 ("virtio-blk: always set
ioeventfd during startup") has made virtio_blk_start_ioeventfd() always
kick the virtqueue (set the ioeventfd), regardless of whether the BB is
drained.  That is no longer necessary, because attaching the host
notifier will now set the ioeventfd, too; this happens either
immediately right here in virtio_blk_start_ioeventfd(), or later when
the drain ends, in virtio_blk_ioeventfd_attach().

With event_notifier_set() removed, the code becomes the same as the one
in virtio_blk_ioeventfd_attach(), so we can reuse that function.

Signed-off-by: Hanna Czenczek 
Message-ID: <20240202153158.788922-4-hre...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 hw/block/virtio-blk.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index bda5c117d4..4ca5e632ea 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -37,6 +37,8 @@
 #include "hw/virtio/virtio-blk-common.h"
 #include "qemu/coroutine.h"
 
+static void virtio_blk_ioeventfd_attach(VirtIOBlock *s);
+
 static void virtio_blk_init_request(VirtIOBlock *s, VirtQueue *vq,
                                     VirtIOBlockReq *req)
 {
@@ -1847,17 +1849,14 @@ static int virtio_blk_start_ioeventfd(VirtIODevice *vdev)
     s->ioeventfd_started = true;
     smp_wmb(); /* paired with aio_notify_accept() on the read side */
 
-    /* Get this show started by hooking up our callbacks */
-    for (i = 0; i < nvqs; i++) {
-        VirtQueue *vq = virtio_get_queue(vdev, i);
-        AioContext *ctx = s->vq_aio_context[i];
-
-        /* Kick right away to begin processing requests already in vring */
-        event_notifier_set(virtio_queue_get_host_notifier(vq));
-
-        if (!blk_in_drain(s->conf.conf.blk)) {
-            virtio_queue_aio_attach_host_notifier(vq, ctx);
-        }
+    /*
+     * Get this show started by hooking up our callbacks.  If drained now,
+     * virtio_blk_drained_end() will do this later.
+     * Attaching the notifier also kicks the virtqueues, processing any
+     * requests they may already have.
+     */
+    if (!blk_in_drain(s->conf.conf.blk)) {
+        virtio_blk_ioeventfd_attach(s);
     }
     return 0;
 
-- 
2.43.0




[PULL 16/16] virtio-blk: avoid using ioeventfd state in irqfd conditional

2024-02-07 Thread Kevin Wolf
From: Stefan Hajnoczi 

Requests that complete in an IOThread use irqfd to notify the guest
while requests that complete in the main loop thread use the traditional
qdev irq code path. The reason for this conditional is that the irq code
path requires the BQL:

    if (s->ioeventfd_started && !s->ioeventfd_disabled) {
        virtio_notify_irqfd(vdev, req->vq);
    } else {
        virtio_notify(vdev, req->vq);
    }

There is a corner case where the conditional invokes the irq code path
instead of the irqfd code path:

  static void virtio_blk_stop_ioeventfd(VirtIODevice *vdev)
  {
  ...
  /*
   * Set ->ioeventfd_started to false before draining so that host notifiers
   * are not detached/attached anymore.
   */
  s->ioeventfd_started = false;

  /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */
  blk_drain(s->conf.conf.blk);

During blk_drain() the conditional produces the wrong result because
ioeventfd_started is false.

Use qemu_in_iothread() instead of checking the ioeventfd state.

Buglink: https://issues.redhat.com/browse/RHEL-15394
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240122172625.415386-1-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 hw/block/virtio-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 4ca5e632ea..738cb2ac36 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -66,7 +66,7 @@ static void virtio_blk_req_complete(VirtIOBlockReq *req, unsigned char status)
     iov_discard_undo(&req->inhdr_undo);
     iov_discard_undo(&req->outhdr_undo);
     virtqueue_push(req->vq, &req->elem, req->in_len);
-    if (s->ioeventfd_started && !s->ioeventfd_disabled) {
+    if (qemu_in_iothread()) {
 virtio_notify_irqfd(vdev, req->vq);
 } else {
 virtio_notify(vdev, req->vq);
-- 
2.43.0




[PULL 02/16] virtio-blk: clarify that there is at least 1 virtqueue

2024-02-07 Thread Kevin Wolf
From: Stefan Hajnoczi 

It is not possible to instantiate a virtio-blk device with 0 virtqueues.
The following check is located in ->realize():

  if (!conf->num_queues) {
  error_setg(errp, "num-queues property must be larger than 0");
  return;
  }

Later on we access s->vq_aio_context[0] under the assumption that there
is as least one virtqueue. Hanna Czenczek  noted that
it would help to show that the array index is already valid.

Add an assertion to document that s->vq_aio_context[0] is always
safe...and catch future code changes that break this assumption.

Suggested-by: Hanna Czenczek 
Reviewed-by: Manos Pitsidianakis 
Reviewed-by: Hanna Czenczek 
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240206190610.107963-3-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Kevin Wolf 
---
 hw/block/virtio-blk.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 6e3e3a23ee..e430ba583c 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1824,6 +1824,7 @@ static int virtio_blk_start_ioeventfd(VirtIODevice *vdev)
      * Try to change the AioContext so that block jobs and other operations can
      * co-locate their activity in the same AioContext. If it fails, nevermind.
      */
+    assert(nvqs > 0); /* enforced during ->realize() */
     r = blk_set_aio_context(s->conf.conf.blk, s->vq_aio_context[0],
                             &local_err);
 if (r < 0) {
-- 
2.43.0




Re: [PATCH] virtio-blk: avoid using ioeventfd state in irqfd conditional

2024-02-07 Thread Kevin Wolf
On 22.01.2024 at 18:26, Stefan Hajnoczi wrote:
> Requests that complete in an IOThread use irqfd to notify the guest
> while requests that complete in the main loop thread use the traditional
> qdev irq code path. The reason for this conditional is that the irq code
> path requires the BQL:
> 
>   if (s->ioeventfd_started && !s->ioeventfd_disabled) {
>   virtio_notify_irqfd(vdev, req->vq);
>   } else {
>   virtio_notify(vdev, req->vq);
>   }
> 
> There is a corner case where the conditional invokes the irq code path
> instead of the irqfd code path:
> 
>   static void virtio_blk_stop_ioeventfd(VirtIODevice *vdev)
>   {
>   ...
>   /*
>    * Set ->ioeventfd_started to false before draining so that host notifiers
>    * are not detached/attached anymore.
>    */
>   s->ioeventfd_started = false;
> 
>   /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */
>   blk_drain(s->conf.conf.blk);
> 
> During blk_drain() the conditional produces the wrong result because
> ioeventfd_started is false.
> 
> Use qemu_in_iothread() instead of checking the ioeventfd state.
> 
> Buglink: https://issues.redhat.com/browse/RHEL-15394
> Signed-off-by: Stefan Hajnoczi 

Thanks, applied to the block branch.

Kevin




Re: [PATCH v2 0/3] virtio: Re-enable notifications after drain

2024-02-07 Thread Kevin Wolf
On 02.02.2024 at 16:31, Hanna Czenczek wrote:
> Hanna Czenczek (3):
>   virtio-scsi: Attach event vq notifier with no_poll
>   virtio: Re-enable notifications after drain
>   virtio-blk: Use ioeventfd_attach in start_ioeventfd
> 
>  include/block/aio.h   |  7 ++-
>  hw/block/virtio-blk.c | 21 ++---
>  hw/scsi/virtio-scsi.c |  7 ++-
>  hw/virtio/virtio.c| 42 ++
>  4 files changed, 64 insertions(+), 13 deletions(-)

Thanks, applied to the block branch.

Kevin




Re: [PATCH] virtio-blk: do not use C99 mixed declarations

2024-02-07 Thread Kevin Wolf
On 06.02.2024 at 15:04, Stefan Hajnoczi wrote:
> QEMU's coding style generally forbids C99 mixed declarations.
> 
> Signed-off-by: Stefan Hajnoczi 

Thanks, applied to the block branch.

Kevin




Re: [PATCH] iotests: give tempdir an identifying name

2024-02-07 Thread Kevin Wolf
On 05.02.2024 at 16:51, Daniel P. Berrangé wrote:
> If something goes wrong causing the iotests not to cleanup their
> temporary directory, it is useful if the dir had an identifying
> name to show what is to blame.
> 
> Signed-off-by: Daniel P. Berrangé 

Thanks, applied to the block branch.

Kevin




Re: [PATCH] iotests: fix leak of tmpdir in dry-run mode

2024-02-07 Thread Kevin Wolf
On 05.02.2024 at 16:40, Daniel P. Berrangé wrote:
> Creating an instance of the 'TestEnv' class will create a temporary
> directory. This dir is only deleted, however, in the __exit__ handler
> invoked by a context manager.
> 
> In dry-run mode, we don't use the TestEnv via a context manager, so
> were leaking the temporary directory. Since meson invokes 'check'
> 5 times on each configure run, developers /tmp was filling up with
> empty temporary directories.
> 
> Signed-off-by: Daniel P. Berrangé 

Thanks, applied to the block branch.

Kevin




Re: [PATCH 0/2] block: Allow concurrent BB context changes

2024-02-07 Thread Kevin Wolf
On 02.02.2024 at 15:47, Hanna Czenczek wrote:
> Hi,
> 
> Without the AioContext lock, a BB's context may kind of change at any
> time (unless it has a root node, and I/O requests are pending).  That
> also means that its own context (BlockBackend.ctx) and that of its root
> node can differ sometimes (while the context is being changed).
> 
> blk_get_aio_context() doesn't know this yet and asserts that both are
> always equal (if there is a root node).  Because it's no longer true,
> and because callers don't seem to really care about the root node's
> context, we can and should remove the assertion and just return the BB's
> context.
> 
> Beyond that, the question is whether the callers of
> blk_get_aio_context() are OK with the context potentially changing
> concurrently.  Honestly, it isn't entirely clear to me; most look OK,
> except for the virtio-scsi code, which operates under the general
> assumption that the BB's context is always equal to that of the
> virtio-scsi device.  I doubt that this assumption always holds (it is
> definitely not obvious to me that it would), but then again, this series
> will not make matters worse in that regard, and that is what counts for
> me now.
> 
> One clear point of contention is scsi_device_for_each_req_async(), which
> is addressed by patch 2.  Right now, it schedules a BH in the BB
> context, then the BH double-checks whether the context still fits, and
> if not, re-schedules itself.  Because virtio-scsi's context is fixed,
> this seems to indicate to me that it wants to be able to deal with a
> case where BB and virtio-scsi context differ, which seems to break that
> aforementioned general virtio-scsi assumption.
> 
> Unfortunately, I definitely have to touch that code, because accepting
> concurrent changes of AioContexts breaks the double-check (just because
> the BB has the right context in that place does not mean it stays in
> that context); instead, we must prevent any concurrent change until the
> BH is done.  Because changing contexts generally involves a drained
> section, we can prevent it by keeping the BB in-flight counter elevated.
> 
> Question is, how to reason for that.  I’d really rather not include the
> need to follow the BB context in my argument, because I find that part a
> bit fishy.
> 
> Luckily, there’s a second, completely different reason for having
> scsi_device_for_each_req_async() increment the in-flight counter:
> Specifically, scsi_device_purge_requests() probably wants to await full
> completion of scsi_device_for_each_req_async(), and we can do that most
> easily in the very same way by incrementing the in-flight counter.  This
> way, the blk_drain() in scsi_device_purge_requests() will not only await
> all (cancelled) I/O requests, but also the non-I/O requests.
> 
> The fact that this prevents the BB AioContext from changing while the BH
> is scheduled/running then is just a nice side effect.
> 
> 
> Hanna Czenczek (2):
>   block-backend: Allow concurrent context changes
>   scsi: Await request purging
> 
>  block/block-backend.c | 22 +++---
>  hw/scsi/scsi-bus.c| 30 +-
>  2 files changed, 32 insertions(+), 20 deletions(-)

Thanks, applied to the block branch.

Kevin




Re: [PATCH v2 0/5] virtio-blk: iothread-vq-mapping cleanups

2024-02-07 Thread Kevin Wolf
On 06.02.2024 at 20:06, Stefan Hajnoczi wrote:
> v2:
> - Add comment in Patch 3 explaining why bounds check assertion [Manos]
> - Remove redundant nested if in Patch 1 [Hanna]
> 
> Hanna reviewed the iothread-vq-mapping patches after they were applied to
> qemu.git. This series consists of code cleanups that Hanna identified.
> 
> There are no functional changes or bug fixes that need to be backported to the
> stable tree here, but it may make sense to backport them in the future to 
> avoid
> conflicts.

Thanks, applied to the block branch.

Kevin




Re: [PATCH] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-02-07 Thread Kevin Wolf
On 06.02.2024 at 17:44, Eugenio Perez Martin wrote:
> On Mon, Feb 5, 2024 at 2:49 PM Kevin Wolf  wrote:
> >
> > On 05.02.2024 at 13:22, Eugenio Perez Martin wrote:
> > > On Fri, Feb 2, 2024 at 2:25 PM Kevin Wolf  wrote:
> > > >
> > > > VDUSE requires that virtqueues are first enabled before the DRIVER_OK
> > > > status flag is set; with the current API of the kernel module, it is
> > > > impossible to enable the opposite order in our block export code because
> > > > userspace is not notified when a virtqueue is enabled.
> > > >
> > > > This requirement also matches the normal initialisation order as done by
> > > > the generic vhost code in QEMU. However, commit 6c482547 accidentally
> > > > changed the order for vdpa-dev and broke access to VDUSE devices with
> > > > this.
> > > >
> > > > This changes vdpa-dev to use the normal order again and use the standard
> > > > vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
> > > > used with vdpa-dev again after this fix.
> > > >
> > > > Cc: qemu-sta...@nongnu.org
> > > > Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
> > > > Signed-off-by: Kevin Wolf 
> > > > ---
> > > >  hw/virtio/vdpa-dev.c   |  5 +
> > > >  hw/virtio/vhost-vdpa.c | 17 +
> > > >  2 files changed, 18 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
> > > > index eb9ecea83b..13e87f06f6 100644
> > > > --- a/hw/virtio/vdpa-dev.c
> > > > +++ b/hw/virtio/vdpa-dev.c
> > > > @@ -253,14 +253,11 @@ static int vhost_vdpa_device_start(VirtIODevice *vdev, Error **errp)
> > > >
> > > >      s->dev.acked_features = vdev->guest_features;
> > > >
> > > > -    ret = vhost_dev_start(&s->dev, vdev, false);
> > > > +    ret = vhost_dev_start(&s->dev, vdev, true);
> > > >      if (ret < 0) {
> > > >          error_setg_errno(errp, -ret, "Error starting vhost");
> > > >          goto err_guest_notifiers;
> > > >      }
> > > > -    for (i = 0; i < s->dev.nvqs; ++i) {
> > > > -        vhost_vdpa_set_vring_ready(&s->vdpa, i);
> > > > -    }
> > > >      s->started = true;
> > > >
> > > >      /*
> > > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > > index 3a43beb312..c4574d56c5 100644
> > > > --- a/hw/virtio/vhost-vdpa.c
> > > > +++ b/hw/virtio/vhost-vdpa.c
> > > > @@ -904,6 +904,22 @@ int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
> > > >      return r;
> > > >  }
> > > >
> > > > +static int vhost_vdpa_set_vring_enable(struct vhost_dev *dev, int enable)
> > > > +{
> > > > +    struct vhost_vdpa *v = dev->opaque;
> > > > +    unsigned int i;
> > > > +    int ret;
> > > > +
> > > > +    for (i = 0; i < dev->nvqs; ++i) {
> > > > +        ret = vhost_vdpa_set_vring_ready(v, i);
> > > > +        if (ret < 0) {
> > > > +            return ret;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > >  static int vhost_vdpa_set_config_call(struct vhost_dev *dev,
> > > >                                        int fd)
> > > >  {
> > > > @@ -1524,6 +1540,7 @@ const VhostOps vdpa_ops = {
> > > >      .vhost_set_features = vhost_vdpa_set_features,
> > > >      .vhost_reset_device = vhost_vdpa_reset_device,
> > > >      .vhost_get_vq_index = vhost_vdpa_get_vq_index,
> > > > +    .vhost_set_vring_enable = vhost_vdpa_set_vring_enable,
> > > >      .vhost_get_config  = vhost_vdpa_get_config,
> > > >      .vhost_set_config = vhost_vdpa_set_config,
> > > >      .vhost_requires_shm_log = NULL,
> > >
> > > vhost-vdpa net enables CVQ before dataplane ones to configure all the
> > > device in the destination of a live migration. To go back again to
> > > this callback would cause the device to enable the dataplane before
> > > virtqueues are configured again.
> >
> > Not that it makes a difference, but I don't

Re: [PATCH] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-02-05 Thread Kevin Wolf
On 05.02.2024 at 13:22, Eugenio Perez Martin wrote:
> On Fri, Feb 2, 2024 at 2:25 PM Kevin Wolf  wrote:
> >
> > VDUSE requires that virtqueues are first enabled before the DRIVER_OK
> > status flag is set; with the current API of the kernel module, it is
> > impossible to enable the opposite order in our block export code because
> > userspace is not notified when a virtqueue is enabled.
> >
> > This requirement also matches the normal initialisation order as done by
> > the generic vhost code in QEMU. However, commit 6c482547 accidentally
> > changed the order for vdpa-dev and broke access to VDUSE devices with
> > this.
> >
> > This changes vdpa-dev to use the normal order again and use the standard
> > vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
> > used with vdpa-dev again after this fix.
> >
> > Cc: qemu-sta...@nongnu.org
> > Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
> > Signed-off-by: Kevin Wolf 
> > ---
> >  hw/virtio/vdpa-dev.c   |  5 +
> >  hw/virtio/vhost-vdpa.c | 17 +
> >  2 files changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
> > index eb9ecea83b..13e87f06f6 100644
> > --- a/hw/virtio/vdpa-dev.c
> > +++ b/hw/virtio/vdpa-dev.c
> > @@ -253,14 +253,11 @@ static int vhost_vdpa_device_start(VirtIODevice *vdev, Error **errp)
> >
> >      s->dev.acked_features = vdev->guest_features;
> >
> > -    ret = vhost_dev_start(&s->dev, vdev, false);
> > +    ret = vhost_dev_start(&s->dev, vdev, true);
> >      if (ret < 0) {
> >          error_setg_errno(errp, -ret, "Error starting vhost");
> >          goto err_guest_notifiers;
> >      }
> > -    for (i = 0; i < s->dev.nvqs; ++i) {
> > -        vhost_vdpa_set_vring_ready(&s->vdpa, i);
> > -    }
> >      s->started = true;
> >
> >      /*
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 3a43beb312..c4574d56c5 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -904,6 +904,22 @@ int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
> >      return r;
> >  }
> >
> > +static int vhost_vdpa_set_vring_enable(struct vhost_dev *dev, int enable)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    unsigned int i;
> > +    int ret;
> > +
> > +    for (i = 0; i < dev->nvqs; ++i) {
> > +        ret = vhost_vdpa_set_vring_ready(v, i);
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  static int vhost_vdpa_set_config_call(struct vhost_dev *dev,
> >                                        int fd)
> >  {
> > @@ -1524,6 +1540,7 @@ const VhostOps vdpa_ops = {
> >      .vhost_set_features = vhost_vdpa_set_features,
> >      .vhost_reset_device = vhost_vdpa_reset_device,
> >      .vhost_get_vq_index = vhost_vdpa_get_vq_index,
> > +    .vhost_set_vring_enable = vhost_vdpa_set_vring_enable,
> >      .vhost_get_config  = vhost_vdpa_get_config,
> >      .vhost_set_config = vhost_vdpa_set_config,
> >      .vhost_requires_shm_log = NULL,
> 
> vhost-vdpa net enables CVQ before dataplane ones to configure all the
> device in the destination of a live migration. To go back again to
> this callback would cause the device to enable the dataplane before
> virtqueues are configured again.

Not that it makes a difference, but I don't think it would actually be
going back. Even before your commit 6c482547, we were not making use of
the generic callback but just called the function in a slightly
different place (but less different than after commit 6c482547).

But anyway... Why don't the other vhost backend need the same for
vhost-net then? Do they just not support live migration?

I don't know the code well enough to say where the problem is, but if
vhost-vdpa networking code relies on the usual vhost operations not
being implemented and bypasses VhostOps to replace it, that sounds like
a design problem to me. Maybe VhostOps needs a new operation to enable
just a single virtqueue that can be used by the networking code instead?

> How does VDUSE userspace knows how many queues should enable? Can't
> the kernel perform the necessary actions after DRIVER_OK, like
> configuring the kick etc?

Not sure if I understand the question. The vdpa kernel interface always
enables individual queues, so the VDUSE userspace will enable whatever
queues it was asked to enable. The only restriction is that the queues
need to be enabled before setting DRIVER_OK.

The interface that enables all virtqueues at once seems to be just
.vhost_set_vring_enable in QEMU.

Kevin




Re: [PATCH] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-02-05 Thread Kevin Wolf
On 02.02.2024 at 14:25, Kevin Wolf wrote:
> VDUSE requires that virtqueues are first enabled before the DRIVER_OK
> status flag is set; with the current API of the kernel module, it is
> impossible to enable the opposite order in our block export code because
> userspace is not notified when a virtqueue is enabled.
> 
> This requirement also matches the normal initialisation order as done by
> the generic vhost code in QEMU. However, commit 6c482547 accidentally
> changed the order for vdpa-dev and broke access to VDUSE devices with
> this.
> 
> This changes vdpa-dev to use the normal order again and use the standard
> vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
> used with vdpa-dev again after this fix.
> 
> Cc: qemu-sta...@nongnu.org
> Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
> Signed-off-by: Kevin Wolf 
> ---
>  hw/virtio/vdpa-dev.c   |  5 +
>  hw/virtio/vhost-vdpa.c | 17 +
>  2 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
> index eb9ecea83b..13e87f06f6 100644
> --- a/hw/virtio/vdpa-dev.c
> +++ b/hw/virtio/vdpa-dev.c
> @@ -253,14 +253,11 @@ static int vhost_vdpa_device_start(VirtIODevice *vdev, Error **errp)
>  
>      s->dev.acked_features = vdev->guest_features;
>  
> -    ret = vhost_dev_start(&s->dev, vdev, false);
> +    ret = vhost_dev_start(&s->dev, vdev, true);
>      if (ret < 0) {
>          error_setg_errno(errp, -ret, "Error starting vhost");
>          goto err_guest_notifiers;
>      }
> -    for (i = 0; i < s->dev.nvqs; ++i) {
> -        vhost_vdpa_set_vring_ready(&s->vdpa, i);
> -    }
>      s->started = true;
>  
>      /*
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 3a43beb312..c4574d56c5 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -904,6 +904,22 @@ int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
>      return r;
>  }
>  
> +static int vhost_vdpa_set_vring_enable(struct vhost_dev *dev, int enable)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    unsigned int i;
> +    int ret;
> +
> +    for (i = 0; i < dev->nvqs; ++i) {
> +        ret = vhost_vdpa_set_vring_ready(v, i);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
> +}

Oops, this forgets to actually use the @enable parameter, and always
enables the queue even if the caller wanted to disable it.

I'll fix this in a v2, but I'd first like to see if something has to
change to address Stefano's concern, too.

Kevin




[PATCH] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-02-02 Thread Kevin Wolf
VDUSE requires that virtqueues are first enabled before the DRIVER_OK
status flag is set; with the current API of the kernel module, it is
impossible to enable the opposite order in our block export code because
userspace is not notified when a virtqueue is enabled.

This requirement also matches the normal initialisation order as done by
the generic vhost code in QEMU. However, commit 6c482547 accidentally
changed the order for vdpa-dev and broke access to VDUSE devices with
this.

This changes vdpa-dev to use the normal order again and use the standard
vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
used with vdpa-dev again after this fix.

Cc: qemu-sta...@nongnu.org
Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
Signed-off-by: Kevin Wolf 
---
 hw/virtio/vdpa-dev.c   |  5 +
 hw/virtio/vhost-vdpa.c | 17 +
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
index eb9ecea83b..13e87f06f6 100644
--- a/hw/virtio/vdpa-dev.c
+++ b/hw/virtio/vdpa-dev.c
@@ -253,14 +253,11 @@ static int vhost_vdpa_device_start(VirtIODevice *vdev, Error **errp)
 
     s->dev.acked_features = vdev->guest_features;
 
-    ret = vhost_dev_start(&s->dev, vdev, false);
+    ret = vhost_dev_start(&s->dev, vdev, true);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Error starting vhost");
         goto err_guest_notifiers;
     }
-    for (i = 0; i < s->dev.nvqs; ++i) {
-        vhost_vdpa_set_vring_ready(&s->vdpa, i);
-    }
     s->started = true;
 
     /*
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3a43beb312..c4574d56c5 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -904,6 +904,22 @@ int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
     return r;
 }
 
+static int vhost_vdpa_set_vring_enable(struct vhost_dev *dev, int enable)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    unsigned int i;
+    int ret;
+
+    for (i = 0; i < dev->nvqs; ++i) {
+        ret = vhost_vdpa_set_vring_ready(v, i);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 static int vhost_vdpa_set_config_call(struct vhost_dev *dev,
                                       int fd)
 {
@@ -1524,6 +1540,7 @@ const VhostOps vdpa_ops = {
     .vhost_set_features = vhost_vdpa_set_features,
     .vhost_reset_device = vhost_vdpa_reset_device,
     .vhost_get_vq_index = vhost_vdpa_get_vq_index,
+    .vhost_set_vring_enable = vhost_vdpa_set_vring_enable,
     .vhost_get_config  = vhost_vdpa_get_config,
     .vhost_set_config = vhost_vdpa_set_config,
     .vhost_requires_shm_log = NULL,
-- 
2.43.0




Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-02-01 Thread Kevin Wolf
On 01.02.2024 at 10:43, Hanna Czenczek wrote:
> On 31.01.24 11:17, Kevin Wolf wrote:
> > On 29.01.2024 at 17:30, Hanna Czenczek wrote:
> > > I don’t like using drain as a form of lock specifically against AioContext
> > > changes, but maybe Stefan is right, and we should use it in this specific
> > > case to get just the single problem fixed.  (Though it’s not quite trivial
> > > either.  We’d probably still want to remove the assertion from
> > > blk_get_aio_context(), so we don’t have to require all of its callers to
> > > hold a count in the in-flight counter.)
> > Okay, fair, maybe fixing the specific problem is more important that
> > solving the more generic blk_get_aio_context() race.
> > 
> > In this case, wouldn't it be enough to increase the in-flight counter so
> > that the drain before switching AioContexts would run the BH before
> > anything bad can happen? Does the following work?
> 
> Yes, that’s what I had in mind (Stefan, too, I think), and in testing,
> it looks good.

Oh, sorry, I completely misunderstood then. I thought you were talking
about adding a new drained section somewhere and that sounded a bit more
complicated. :-)

If it works, let's do this. Would you like to pick this up and send it
as a formal patch (possibly in a more polished form), or should I do
that?

Kevin




[PATCH] blkio: Respect memory-alignment for bounce buffer allocations

2024-01-31 Thread Kevin Wolf
blkio_alloc_mem_region() requires that the requested buffer size is a
multiple of the memory-alignment property. If it isn't, the allocation
fails with a return value of -EINVAL.

Fix the call in blkio_resize_bounce_pool() to make sure the requested
size is properly aligned.

I observed this problem with vhost-vdpa, which requires page aligned
memory. As the virtio-blk device behind it still had 512 byte blocks, we
got bs->bl.request_alignment = 512, but actually any request that needed
a bounce buffer and was not aligned to 4k would fail without this fix.

Suggested-by: Stefano Garzarella 
Signed-off-by: Kevin Wolf 
---
 block/blkio.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blkio.c b/block/blkio.c
index 0a0a6c0f5f..b989617608 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -89,6 +89,9 @@ static int blkio_resize_bounce_pool(BDRVBlkioState *s, int64_t bytes)
     /* Pad size to reduce frequency of resize calls */
    bytes += 128 * 1024;
 
+    /* Align the pool size to avoid blkio_alloc_mem_region() failure */
+    bytes = QEMU_ALIGN_UP(bytes, s->mem_region_alignment);
+
    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
        int ret;
 
 
-- 
2.43.0




[PATCH] scsi: Don't ignore most usb-storage properties

2024-01-31 Thread Kevin Wolf
usb-storage is for the most part just a wrapper around an internally
created scsi-disk device. It uses DEFINE_BLOCK_PROPERTIES() to offer all
of the usual block device properties to the user, but then only forwards
a few select properties to the internal device while the rest is
silently ignored.

This changes scsi_bus_legacy_add_drive() to accept a whole BlockConf
instead of some individual values inside of it so that usb-storage can
now pass the whole configuration to the internal scsi-disk. This enables
the remaining block device properties, e.g. logical/physical_block_size
or discard_granularity.

Buglink: https://issues.redhat.com/browse/RHEL-22375
Signed-off-by: Kevin Wolf 
---
 include/hw/scsi/scsi.h   |  5 +
 hw/scsi/scsi-bus.c   | 33 +
 hw/usb/dev-storage-classic.c |  5 +
 3 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 10c4e8288d..c3d5e17e38 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -199,10 +199,7 @@ static inline SCSIBus *scsi_bus_from_device(SCSIDevice *d)
 }
 
 SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk,
-                                      int unit, bool removable, int bootindex,
-                                      bool share_rw,
-                                      BlockdevOnError rerror,
-                                      BlockdevOnError werror,
+                                      int unit, bool removable, BlockConf *conf,
                                       const char *serial, Error **errp);
 void scsi_bus_set_ua(SCSIBus *bus, SCSISense sense);
 void scsi_bus_legacy_handle_cmdline(SCSIBus *bus);
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 0a2eb11c56..f37737e6b6 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -373,15 +373,13 @@ static void scsi_qdev_unrealize(DeviceState *qdev)
 
 /* handle legacy '-drive if=scsi,...' cmd line args */
 SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk,
-  int unit, bool removable, int bootindex,
-  bool share_rw,
-  BlockdevOnError rerror,
-  BlockdevOnError werror,
+  int unit, bool removable, BlockConf *conf,
   const char *serial, Error **errp)
 {
 const char *driver;
 char *name;
 DeviceState *dev;
+SCSIDevice *s;
 DriveInfo *dinfo;
 
 if (blk_is_sg(blk)) {
@@ -399,11 +397,10 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk,
 object_property_add_child(OBJECT(bus), name, OBJECT(dev));
 g_free(name);
 
+s = SCSI_DEVICE(dev);
+s->conf = *conf;
+
 qdev_prop_set_uint32(dev, "scsi-id", unit);
-if (bootindex >= 0) {
-object_property_set_int(OBJECT(dev), "bootindex", bootindex,
-&error_abort);
-}
 if (object_property_find(OBJECT(dev), "removable")) {
 qdev_prop_set_bit(dev, "removable", removable);
 }
@@ -414,19 +411,12 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk,
 object_unparent(OBJECT(dev));
 return NULL;
 }
-if (!object_property_set_bool(OBJECT(dev), "share-rw", share_rw, errp)) {
-object_unparent(OBJECT(dev));
-return NULL;
-}
-
-qdev_prop_set_enum(dev, "rerror", rerror);
-qdev_prop_set_enum(dev, "werror", werror);
 
 if (!qdev_realize_and_unref(dev, &bus->qbus, errp)) {
 object_unparent(OBJECT(dev));
 return NULL;
 }
-return SCSI_DEVICE(dev);
+return s;
 }
 
 void scsi_bus_legacy_handle_cmdline(SCSIBus *bus)
@@ -434,6 +424,12 @@ void scsi_bus_legacy_handle_cmdline(SCSIBus *bus)
 Location loc;
 DriveInfo *dinfo;
 int unit;
+BlockConf conf = {
+.bootindex = -1,
+.share_rw = false,
+.rerror = BLOCKDEV_ON_ERROR_AUTO,
+.werror = BLOCKDEV_ON_ERROR_AUTO,
+};
 
 loc_push_none(&loc);
 for (unit = 0; unit <= bus->info->max_target; unit++) {
@@ -443,10 +439,7 @@ void scsi_bus_legacy_handle_cmdline(SCSIBus *bus)
 }
 qemu_opts_loc_restore(dinfo->opts);
 scsi_bus_legacy_add_drive(bus, blk_by_legacy_dinfo(dinfo),
-  unit, false, -1, false,
-  BLOCKDEV_ON_ERROR_AUTO,
-  BLOCKDEV_ON_ERROR_AUTO,
-  NULL, &error_fatal);
+  unit, false, &conf, NULL, &error_fatal);
 }
 loc_pop(&loc);
 }
diff --git a/hw/usb/dev-storage-classic.c b/hw/usb/dev-storage-classic.c
index 84d19752b5..50a3ad6285 100644
--- a/hw/usb/dev-storage-classic.c
+++ b/hw/usb/dev-storage-classic.c
@@ -67,10 +67,7 @@

Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-01-31 Thread Kevin Wolf
Am 29.01.2024 um 17:30 hat Hanna Czenczek geschrieben:
> I don’t like using drain as a form of lock specifically against AioContext
> changes, but maybe Stefan is right, and we should use it in this specific
> case to get just the single problem fixed.  (Though it’s not quite trivial
> either.  We’d probably still want to remove the assertion from
> blk_get_aio_context(), so we don’t have to require all of its callers to
> hold a count in the in-flight counter.)

Okay, fair, maybe fixing the specific problem is more important than
solving the more generic blk_get_aio_context() race.

In this case, wouldn't it be enough to increase the in-flight counter so
that the drain before switching AioContexts would run the BH before
anything bad can happen? Does the following work?

Kevin

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 0a2eb11c56..dc09eb8024 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -120,17 +120,11 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
 SCSIRequest *next;
 
 /*
- * If the AioContext changed before this BH was called then reschedule into
- * the new AioContext before accessing ->requests. This can happen when
- * scsi_device_for_each_req_async() is called and then the AioContext is
- * changed before BHs are run.
+ * The AioContext can't have changed because we increased the in-flight
+ * counter for s->conf.blk.
  */
 ctx = blk_get_aio_context(s->conf.blk);
-if (ctx != qemu_get_current_aio_context()) {
-aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
-g_steal_pointer(&data));
-return;
-}
+assert(ctx == qemu_get_current_aio_context());
 
 QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
 data->fn(req, data->fn_opaque);
@@ -138,6 +132,7 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
 
 /* Drop the reference taken by scsi_device_for_each_req_async() */
 object_unref(OBJECT(s));
+blk_dec_in_flight(s->conf.blk);
 }
 
 /*
@@ -163,6 +158,7 @@ static void scsi_device_for_each_req_async(SCSIDevice *s,
  */
 object_ref(OBJECT(s));
 
+blk_inc_in_flight(s->conf.blk);
 aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
 scsi_device_for_each_req_async_bh,
 data);




Re: [PATCH [repost]] block/blkio: Don't assume size_t is 64 bit

2024-01-30 Thread Kevin Wolf
Am 30.01.2024 um 11:30 hat Richard W.M. Jones geschrieben:
> On Tue, Jan 30, 2024 at 09:51:59AM +0100, Kevin Wolf wrote:
> > Am 29.01.2024 um 19:53 hat Richard W.M. Jones geschrieben:
> > > With GCC 14 the code failed to compile on i686 (and was wrong for any
> > > version of GCC):
> > > 
> > > ../block/blkio.c: In function ‘blkio_file_open’:
> > > ../block/blkio.c:857:28: error: passing argument 3 of ‘blkio_get_uint64’ 
> > > from incompatible pointer type [-Wincompatible-pointer-types]
> > >   857 |    &s->mem_region_alignment);
> > >   |^~~~
> > >   ||
> > >   |size_t * {aka unsigned int *}
> > > In file included from ../block/blkio.c:12:
> > > /usr/include/blkio.h:49:67: note: expected ‘uint64_t *’ {aka ‘long long 
> > > unsigned int *’} but argument is of type ‘size_t *’ {aka ‘unsigned int *’}
> > >49 | int blkio_get_uint64(struct blkio *b, const char *name, uint64_t 
> > > *value);
> > >   | 
> > > ~~^
> > > 
> > > Signed-off-by: Richard W.M. Jones 
> > 
> > Why not simply make BDRVBlkioState.mem_region_alignment a uint64_t
> > instead of keeping it size_t and doing an additional conversion with
> > a check that requires an #if (probably to avoid a warning on 64 bit
> > hosts because the condition is never true)?
> 
> The smaller change (attached) does work on i686, but this worries me a
> little (although it doesn't give any error or warning):
> 
> if (((uintptr_t)host | size) % s->mem_region_alignment) {
> error_setg(errp, "unaligned buf %p with size %zu", host, size);
> return BMRR_FAIL;
> }

I don't see the problem? The calculation will now be done in 64 bits
even on a 32 bit host, but that seems fine to me. Is there a trap I'm
missing?

Kevin

> From 500f3a81652dcefa79a4864c1f3fa6747c16952e Mon Sep 17 00:00:00 2001
> From: "Richard W.M. Jones" 
> Date: Mon, 29 Jan 2024 18:20:46 +
> Subject: [PATCH] block/blkio: Make s->mem_region_alignment be 64 bits
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> With GCC 14 the code failed to compile on i686 (and was wrong for any
> version of GCC):
> 
> ../block/blkio.c: In function ‘blkio_file_open’:
> ../block/blkio.c:857:28: error: passing argument 3 of ‘blkio_get_uint64’ from 
> incompatible pointer type [-Wincompatible-pointer-types]
>   857 |    &s->mem_region_alignment);
>   |^~~~
>   ||
>   |size_t * {aka unsigned int *}
> In file included from ../block/blkio.c:12:
> /usr/include/blkio.h:49:67: note: expected ‘uint64_t *’ {aka ‘long long 
> unsigned int *’} but argument is of type ‘size_t *’ {aka ‘unsigned int *’}
>49 | int blkio_get_uint64(struct blkio *b, const char *name, uint64_t 
> *value);
>   | 
> ~~^
> 
> Signed-off-by: Richard W.M. Jones 
> ---
>  block/blkio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blkio.c b/block/blkio.c
> index 0a0a6c0f5fd..bc2f21784c7 100644
> --- a/block/blkio.c
> +++ b/block/blkio.c
> @@ -68,7 +68,7 @@ typedef struct {
>  CoQueue bounce_available;
>  
>  /* The value of the "mem-region-alignment" property */
> -size_t mem_region_alignment;
> +uint64_t mem_region_alignment;
>  
>  /* Can we skip adding/deleting blkio_mem_regions? */
>  bool needs_mem_regions;
> -- 
> 2.43.0
> 




Re: [PATCH [repost]] block/blkio: Don't assume size_t is 64 bit

2024-01-30 Thread Kevin Wolf
Am 29.01.2024 um 19:53 hat Richard W.M. Jones geschrieben:
> With GCC 14 the code failed to compile on i686 (and was wrong for any
> version of GCC):
> 
> ../block/blkio.c: In function ‘blkio_file_open’:
> ../block/blkio.c:857:28: error: passing argument 3 of ‘blkio_get_uint64’ from 
> incompatible pointer type [-Wincompatible-pointer-types]
>   857 |    &s->mem_region_alignment);
>   |^~~~
>   ||
>   |size_t * {aka unsigned int *}
> In file included from ../block/blkio.c:12:
> /usr/include/blkio.h:49:67: note: expected ‘uint64_t *’ {aka ‘long long 
> unsigned int *’} but argument is of type ‘size_t *’ {aka ‘unsigned int *’}
>49 | int blkio_get_uint64(struct blkio *b, const char *name, uint64_t 
> *value);
>   | 
> ~~^
> 
> Signed-off-by: Richard W.M. Jones 

Why not simply make BDRVBlkioState.mem_region_alignment a uint64_t
instead of keeping it size_t and doing an additional conversion with
a check that requires an #if (probably to avoid a warning on 64 bit
hosts because the condition is never true)?

Kevin

>  block/blkio.c | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blkio.c b/block/blkio.c
> index 0a0a6c0f5fd..52d78935147 100644
> --- a/block/blkio.c
> +++ b/block/blkio.c
> @@ -794,6 +794,7 @@ static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
>  const char *blkio_driver = bs->drv->protocol_name;
>  BDRVBlkioState *s = bs->opaque;
>  int ret;
> +uint64_t val;
>  
>  ret = blkio_create(blkio_driver, &s->blkio);
>  if (ret < 0) {
> @@ -854,7 +855,7 @@ static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
>  
>  ret = blkio_get_uint64(s->blkio,
> "mem-region-alignment",
> -   &s->mem_region_alignment);
> +   &val);
>  if (ret < 0) {
>  error_setg_errno(errp, -ret,
>   "failed to get mem-region-alignment: %s",
> @@ -862,6 +863,15 @@ static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
>  blkio_destroy(&s->blkio);
>  return ret;
>  }
> +#if HOST_LONG_BITS == 32
> +if (val > SIZE_MAX) {
> +error_setg_errno(errp, ERANGE,
> + "mem-region-alignment too large for size_t");
> +blkio_destroy(&s->blkio);
> +return -ERANGE;
> +}
> +#endif
> +s->mem_region_alignment = (size_t)val;
>  
>  ret = blkio_get_bool(s->blkio,
>   "may-pin-mem-regions",
> -- 
> 2.43.0
> 



