Re: [PULL 0/1] Block patches

2024-03-19 Thread Peter Maydell
On Tue, 19 Mar 2024 at 15:09, Stefan Hajnoczi  wrote:
>
> The following changes since commit ddc27d2ad9361a81c2b3800d14143bf420dae172:
>
>   Merge tag 'pull-request-2024-03-18' of https://gitlab.com/thuth/qemu into 
> staging (2024-03-19 10:25:25 +)
>
> are available in the Git repository at:
>
>   https://gitlab.com/stefanha/qemu.git tags/block-pull-request
>
> for you to fetch changes up to 86a637e48104ae74d8be53bed6441ce32be33433:
>
>   coroutine: cap per-thread local pool size (2024-03-19 10:49:31 -0400)
>
> 
> Pull request
>
> This fix solves the "failed to set up stack guard page" error that has been
> reported on Linux hosts where the QEMU coroutine pool exceeds the
> vm.max_map_count limit.
>
> 


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/9.0
for any user-visible changes.

-- PMM



Re: [PATCH] block/io: accept NULL qiov in bdrv_pad_request

2024-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 10:13:41AM +0100, Fiona Ebner wrote:
> From: Stefan Reiter 
> 
> Some operations, e.g. block-stream, perform reads while discarding the
> results (only copy-on-read matters). In this case, they will pass NULL
> as the target QEMUIOVector, which will however trip bdrv_pad_request,
> since it wants to extend its passed vector. In particular, this is the
> case for the blk_co_preadv() call in stream_populate().
> 
> If there is no qiov, no operation can be done with it, but the bytes
> and offset still need to be updated, so the subsequent aligned read
> will actually be aligned and not run into an assertion failure.
> 
> In particular, this can happen when the request alignment of the top
> node is larger than the allocated part of the bottom node, in which
> case padding becomes necessary. For example:
> 
> > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> > ./qemu-system-x86_64 --qmp stdio \
> > --blockdev 
> > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> > < > {"execute": "qmp_capabilities"}
> > {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": 
> > "node0", "node-name": "node1" } }
> > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": 
> > "node1" } }
> > EOF

Hi Fiona,
Can you add a qemu-iotests test case for this issue?

Thanks,
Stefan

> 
> Originally-by: Stefan Reiter 
> Signed-off-by: Thomas Lamprecht 
> [FE: do update bytes and offset in any case
>  add reproducer to commit message]
> Signed-off-by: Fiona Ebner 
> ---
>  block/io.c | 31 +++
>  1 file changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 33150c0359..395bea3bac 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
>  return 0;
>  }
>  
> -sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
> -  _head, _tail,
> -  _niov);
> +/*
> + * For prefetching in stream_populate(), no qiov is passed along, because
> + * only copy-on-read matters.
> + */
> +if (qiov && *qiov) {
> +sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
> +  _head, _tail,
> +  _niov);
>  
> -/* Guaranteed by bdrv_check_request32() */
> -assert(*bytes <= SIZE_MAX);
> -ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
> -  sliced_head, *bytes);
> -if (ret < 0) {
> -bdrv_padding_finalize(pad);
> -return ret;
> +/* Guaranteed by bdrv_check_request32() */
> +assert(*bytes <= SIZE_MAX);
> +ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
> +  sliced_head, *bytes);
> +if (ret < 0) {
> +bdrv_padding_finalize(pad);
> +return ret;
> +}
> +*qiov = >local_qiov;
> +*qiov_offset = 0;
>  }
> +
>  *bytes += pad->head + pad->tail;
>  *offset -= pad->head;
> -*qiov = >local_qiov;
> -*qiov_offset = 0;
>  if (padded) {
>  *padded = true;
>  }
> -- 
> 2.39.2
> 
> 


signature.asc
Description: PGP signature


Re: [PULL 1/1] coroutine: cap per-thread local pool size

2024-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 03:14:07PM +, Daniel P. Berrangé wrote:
> Sending this PULL feels little rushed, as I still have
> un-answered questions on the inital patch posting just
> a few hours ago

Sorry, I hadn't seen your email. I'll update this email thread once the
discussion has finished.

Stefan

> 
> On Tue, Mar 19, 2024 at 11:09:38AM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> > 
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > The per-thread pool sizes are non-uniform and depend on past coroutine
> > usage in each thread, so it's possible for one thread to have a large
> > pool while another thread's pool is empty.
> > 
> > Switch to a new coroutine pool implementation with a global pool that
> > grows to a maximum number of coroutines and per-thread local pools that
> > are capped at hardcoded small number of coroutines.
> > 
> > This approach does not leave large numbers of coroutines pooled in a
> > thread that may not use them again. In order to perform well it
> > amortizes the cost of global pool accesses by working in batches of
> > coroutines instead of individual coroutines.
> > 
> > The global pool is a list. Threads donate batches of coroutines to when
> > they have too many and take batches from when they have too few:
> > 
> > .---.
> > | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> > `---'
> > 
> > Each thread has up to 2 batches of coroutines:
> > 
> > .---.
> > | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> > `---'
> > 
> > The goal of this change is to reduce the excessive number of pooled
> > coroutines that cause QEMU to abort when vm.max_map_count is reached
> > without losing the performance of an adequately sized coroutine pool.
> > 
> > Here are virtio-blk disk I/O benchmark results:
> > 
> >   RW BLKSIZE IODEPTHOLDNEW CHANGE
> > randread  4k   1 113725 117451 +3.3%
> > randread  4k   8 192968 198510 +2.9%
> > randread  4k  16 207138 209429 +1.1%
> > randread  4k  32 212399 215145 +1.3%
> > randread  4k  64 218319 221277 +1.4%
> > randread128k   1  17587  17535 -0.3%
> > randread128k   8  17614  17616 +0.0%
> > randread128k  16  17608  17609 +0.0%
> > randread128k  32  17552  17553 +0.0%
> > randread128k  64  17484  17484 +0.0%
> > 
> > See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> > 
> > Buglink: https://issues.redhat.com/browse/RHEL-28947
> > Reported-by: Sanjay Rao 
> > Reported-by: Boaz Ben Shabat 
> > Reported-by: Joe Mario 
> > Reviewed-by: Kevin Wolf 
> > Signed-off-by: Stefan Hajnoczi 
> > Message-ID: <20240318183429.1039340-1-stefa...@redhat.com>
> > ---
> >  util/qemu-coroutine.c | 282 +-
> >  1 file changed, 223 insertions(+), 59 deletions(-)
> > 
> > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > index 5fd2dbaf8b..2790959eaf 100644
> > --- a/util/qemu-coroutine.c
> > +++ b/util/qemu-coroutine.c
> > @@ -18,39 +18,200 @@
> >  #include "qemu/atomic.h"
> >  #include "qemu/coroutine_int.h"
> >  #include "qemu/coroutine-tls.h"
> > +#include "qemu/cutils.h"
> >  #include "block/aio.h"
> >  
> > -/**
> > - * The minimal batch size is always 64, coroutines from the release_pool 
> > are
> > - * reused as soon as there are 64 coroutines in it. The maximum pool size 
> > starts
> > - * with 64 and is increased on demand so that coroutines are not deleted 
> > even if
> > - * they are not immediately reused.
> > - */
> >  enum {
> > -POOL_MIN_BATCH_SIZE = 64,
> > -POOL_INITIAL_MAX_SIZE = 64,
> > +COROUTINE_POOL_BATCH_MAX_SIZE = 128,
> >  };
> >  
> > -/** Free list to speed up creation */
> > -static QSLIST_HEAD(, Coroutine) release_pool = 
> > QSLIST_HEAD_INITIALIZER(pool);
> > -static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
> > -static unsigned int release_pool_size;
> > +/*
> > + * Coroutine creation and deletion is expensive so a pool of unused 
> > coroutines
> > + * is kept as a cache. When the pool has coroutines available, they are
> > + * recycled instead of creating new ones from scratch. Coroutines are 
> > added to
> > + * the pool upon termination.
> > + *
> > + * The pool is global but each thread maintains a small local pool to avoid
> > + * global pool contention. Threads fetch and return batches of coroutines 
> > from
> > + * the global pool to maintain their local pool. The local 

Re: [PATCH for-9.0 1/2] nbd/server: Fix race in draining the export

2024-03-19 Thread Michael Tokarev

14.03.2024 19:58, Kevin Wolf wrote:

When draining an NBD export, nbd_drained_begin() first sets
client->quiescing so that nbd_client_receive_next_request() won't start
any new request coroutines. Then nbd_drained_poll() tries to makes sure
that we wait for any existing request coroutines by checking that
client->nb_requests has become 0.

However, there is a small window between creating a new request
coroutine and increasing client->nb_requests. If a coroutine is in this
state, it won't be waited for and drain returns too early.

In the context of switching to a different AioContext, this means that
blk_aio_attached() will see client->recv_coroutine != NULL and fail its
assertion.

Fix this by increasing client->nb_requests immediately when starting the
coroutine. Doing this after the checks if we should create a new
coroutine is okay because client->lock is held.

Cc: qemu-sta...@nongnu.org
Fixes: fd6afc501a019682d1b8468b562355a2887087bd
Signed-off-by: Kevin Wolf 


Kevin, Stefan,

This change in master, which is Cc'ed stable, touches (refines) exactly the
same areas as f816310d0c32c "nbd/server: only traverse NBDExport->clients
from main loop thread", which is not (yet?) in stable, neither 7.2 nor 8.2.

Also, 7075d235114b4 "nbd/server: introduce NBDClient->lock to protect fields"
touches one of the places too.

I can try to construct something out of the two, but I think it is better
if either of you can do that, - if this seems a good thing to do anyway.
This way it is definitely much saner than my possible attempts.

Or we can just pick f816310d0c32c and 7075d235114b4 into stable too, - when
I evaluated f816310d0c32c for stable before I thought it isn't needed there
because AioContext lock isn't removed in 8.2 yet.  And I haven't thought
about 7075d235114b4 at all.  All 3 applies cleanly and the result passes
check-block, but it smells a bit too much for stable.

What do you think?

Thanks,

/mjt


diff --git a/nbd/server.c b/nbd/server.c
index 941832f178..c3484cc1eb 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -3007,8 +3007,8 @@ static coroutine_fn int nbd_handle_request(NBDClient 
*client,
  /* Owns a reference to the NBDClient passed as opaque.  */
  static coroutine_fn void nbd_trip(void *opaque)
  {
-NBDClient *client = opaque;
-NBDRequestData *req = NULL;
+NBDRequestData *req = opaque;
+NBDClient *client = req->client;
  NBDRequest request = { 0 };/* GCC thinks it can be used uninitialized 
*/
  int ret;
  Error *local_err = NULL;
@@ -3037,8 +3037,6 @@ static coroutine_fn void nbd_trip(void *opaque)
  goto done;
  }
  
-req = nbd_request_get(client);

-
  /*
   * nbd_co_receive_request() returns -EAGAIN when nbd_drained_begin() has
   * set client->quiescing but by the time we get back nbd_drained_end() may
@@ -3112,9 +3110,7 @@ static coroutine_fn void nbd_trip(void *opaque)
  }
  
  done:

-if (req) {
-nbd_request_put(req);
-}
+nbd_request_put(req);
  
  qemu_mutex_unlock(>lock);
  
@@ -3143,10 +3139,13 @@ disconnect:

   */
  static void nbd_client_receive_next_request(NBDClient *client)
  {
+NBDRequestData *req;
+
  if (!client->recv_coroutine && client->nb_requests < MAX_NBD_REQUESTS &&
  !client->quiescing) {
  nbd_client_get(client);
-client->recv_coroutine = qemu_coroutine_create(nbd_trip, client);
+req = nbd_request_get(client);
+client->recv_coroutine = qemu_coroutine_create(nbd_trip, req);
  aio_co_schedule(client->exp->common.ctx, client->recv_coroutine);
  }
  }





Re: [PULL 1/1] coroutine: cap per-thread local pool size

2024-03-19 Thread Daniel P . Berrangé
Sending this PULL feels little rushed, as I still have
un-answered questions on the inital patch posting just
a few hours ago

On Tue, Mar 19, 2024 at 11:09:38AM -0400, Stefan Hajnoczi wrote:
> The coroutine pool implementation can hit the Linux vm.max_map_count
> limit, causing QEMU to abort with "failed to allocate memory for stack"
> or "failed to set up stack guard page" during coroutine creation.
> 
> This happens because per-thread pools can grow to tens of thousands of
> coroutines. Each coroutine causes 2 virtual memory areas to be created.
> Eventually vm.max_map_count is reached and memory-related syscalls fail.
> The per-thread pool sizes are non-uniform and depend on past coroutine
> usage in each thread, so it's possible for one thread to have a large
> pool while another thread's pool is empty.
> 
> Switch to a new coroutine pool implementation with a global pool that
> grows to a maximum number of coroutines and per-thread local pools that
> are capped at hardcoded small number of coroutines.
> 
> This approach does not leave large numbers of coroutines pooled in a
> thread that may not use them again. In order to perform well it
> amortizes the cost of global pool accesses by working in batches of
> coroutines instead of individual coroutines.
> 
> The global pool is a list. Threads donate batches of coroutines to when
> they have too many and take batches from when they have too few:
> 
> .---.
> | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> `---'
> 
> Each thread has up to 2 batches of coroutines:
> 
> .---.
> | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> `---'
> 
> The goal of this change is to reduce the excessive number of pooled
> coroutines that cause QEMU to abort when vm.max_map_count is reached
> without losing the performance of an adequately sized coroutine pool.
> 
> Here are virtio-blk disk I/O benchmark results:
> 
>   RW BLKSIZE IODEPTHOLDNEW CHANGE
> randread  4k   1 113725 117451 +3.3%
> randread  4k   8 192968 198510 +2.9%
> randread  4k  16 207138 209429 +1.1%
> randread  4k  32 212399 215145 +1.3%
> randread  4k  64 218319 221277 +1.4%
> randread128k   1  17587  17535 -0.3%
> randread128k   8  17614  17616 +0.0%
> randread128k  16  17608  17609 +0.0%
> randread128k  32  17552  17553 +0.0%
> randread128k  64  17484  17484 +0.0%
> 
> See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> 
> Buglink: https://issues.redhat.com/browse/RHEL-28947
> Reported-by: Sanjay Rao 
> Reported-by: Boaz Ben Shabat 
> Reported-by: Joe Mario 
> Reviewed-by: Kevin Wolf 
> Signed-off-by: Stefan Hajnoczi 
> Message-ID: <20240318183429.1039340-1-stefa...@redhat.com>
> ---
>  util/qemu-coroutine.c | 282 +-
>  1 file changed, 223 insertions(+), 59 deletions(-)
> 
> diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> index 5fd2dbaf8b..2790959eaf 100644
> --- a/util/qemu-coroutine.c
> +++ b/util/qemu-coroutine.c
> @@ -18,39 +18,200 @@
>  #include "qemu/atomic.h"
>  #include "qemu/coroutine_int.h"
>  #include "qemu/coroutine-tls.h"
> +#include "qemu/cutils.h"
>  #include "block/aio.h"
>  
> -/**
> - * The minimal batch size is always 64, coroutines from the release_pool are
> - * reused as soon as there are 64 coroutines in it. The maximum pool size 
> starts
> - * with 64 and is increased on demand so that coroutines are not deleted 
> even if
> - * they are not immediately reused.
> - */
>  enum {
> -POOL_MIN_BATCH_SIZE = 64,
> -POOL_INITIAL_MAX_SIZE = 64,
> +COROUTINE_POOL_BATCH_MAX_SIZE = 128,
>  };
>  
> -/** Free list to speed up creation */
> -static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
> -static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
> -static unsigned int release_pool_size;
> +/*
> + * Coroutine creation and deletion is expensive so a pool of unused 
> coroutines
> + * is kept as a cache. When the pool has coroutines available, they are
> + * recycled instead of creating new ones from scratch. Coroutines are added 
> to
> + * the pool upon termination.
> + *
> + * The pool is global but each thread maintains a small local pool to avoid
> + * global pool contention. Threads fetch and return batches of coroutines 
> from
> + * the global pool to maintain their local pool. The local pool holds up to 
> two
> + * batches whereas the maximum size of the global pool is controlled by the
> + * qemu_coroutine_inc_pool_size() API.
> + *
> + * .---.
> + * | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> + * `---'
> + *
> + * .---.
> + * | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> + * 

[PULL 0/1] Block patches

2024-03-19 Thread Stefan Hajnoczi
The following changes since commit ddc27d2ad9361a81c2b3800d14143bf420dae172:

  Merge tag 'pull-request-2024-03-18' of https://gitlab.com/thuth/qemu into 
staging (2024-03-19 10:25:25 +)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 86a637e48104ae74d8be53bed6441ce32be33433:

  coroutine: cap per-thread local pool size (2024-03-19 10:49:31 -0400)


Pull request

This fix solves the "failed to set up stack guard page" error that has been
reported on Linux hosts where the QEMU coroutine pool exceeds the
vm.max_map_count limit.



Stefan Hajnoczi (1):
  coroutine: cap per-thread local pool size

 util/qemu-coroutine.c | 282 +-
 1 file changed, 223 insertions(+), 59 deletions(-)

-- 
2.44.0




[PULL 1/1] coroutine: cap per-thread local pool size

2024-03-19 Thread Stefan Hajnoczi
The coroutine pool implementation can hit the Linux vm.max_map_count
limit, causing QEMU to abort with "failed to allocate memory for stack"
or "failed to set up stack guard page" during coroutine creation.

This happens because per-thread pools can grow to tens of thousands of
coroutines. Each coroutine causes 2 virtual memory areas to be created.
Eventually vm.max_map_count is reached and memory-related syscalls fail.
The per-thread pool sizes are non-uniform and depend on past coroutine
usage in each thread, so it's possible for one thread to have a large
pool while another thread's pool is empty.

Switch to a new coroutine pool implementation with a global pool that
grows to a maximum number of coroutines and per-thread local pools that
are capped at hardcoded small number of coroutines.

This approach does not leave large numbers of coroutines pooled in a
thread that may not use them again. In order to perform well it
amortizes the cost of global pool accesses by working in batches of
coroutines instead of individual coroutines.

The global pool is a list. Threads donate batches of coroutines to when
they have too many and take batches from when they have too few:

.---.
| Batch 1 | Batch 2 | Batch 3 | ... | global_pool
`---'

Each thread has up to 2 batches of coroutines:

.---.
| Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
`---'

The goal of this change is to reduce the excessive number of pooled
coroutines that cause QEMU to abort when vm.max_map_count is reached
without losing the performance of an adequately sized coroutine pool.

Here are virtio-blk disk I/O benchmark results:

  RW BLKSIZE IODEPTHOLDNEW CHANGE
randread  4k   1 113725 117451 +3.3%
randread  4k   8 192968 198510 +2.9%
randread  4k  16 207138 209429 +1.1%
randread  4k  32 212399 215145 +1.3%
randread  4k  64 218319 221277 +1.4%
randread128k   1  17587  17535 -0.3%
randread128k   8  17614  17616 +0.0%
randread128k  16  17608  17609 +0.0%
randread128k  32  17552  17553 +0.0%
randread128k  64  17484  17484 +0.0%

See files/{fio.sh,test.xml.j2} for the benchmark configuration:
https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing

Buglink: https://issues.redhat.com/browse/RHEL-28947
Reported-by: Sanjay Rao 
Reported-by: Boaz Ben Shabat 
Reported-by: Joe Mario 
Reviewed-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240318183429.1039340-1-stefa...@redhat.com>
---
 util/qemu-coroutine.c | 282 +-
 1 file changed, 223 insertions(+), 59 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5fd2dbaf8b..2790959eaf 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -18,39 +18,200 @@
 #include "qemu/atomic.h"
 #include "qemu/coroutine_int.h"
 #include "qemu/coroutine-tls.h"
+#include "qemu/cutils.h"
 #include "block/aio.h"
 
-/**
- * The minimal batch size is always 64, coroutines from the release_pool are
- * reused as soon as there are 64 coroutines in it. The maximum pool size 
starts
- * with 64 and is increased on demand so that coroutines are not deleted even 
if
- * they are not immediately reused.
- */
 enum {
-POOL_MIN_BATCH_SIZE = 64,
-POOL_INITIAL_MAX_SIZE = 64,
+COROUTINE_POOL_BATCH_MAX_SIZE = 128,
 };
 
-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
-static unsigned int release_pool_size;
+/*
+ * Coroutine creation and deletion is expensive so a pool of unused coroutines
+ * is kept as a cache. When the pool has coroutines available, they are
+ * recycled instead of creating new ones from scratch. Coroutines are added to
+ * the pool upon termination.
+ *
+ * The pool is global but each thread maintains a small local pool to avoid
+ * global pool contention. Threads fetch and return batches of coroutines from
+ * the global pool to maintain their local pool. The local pool holds up to two
+ * batches whereas the maximum size of the global pool is controlled by the
+ * qemu_coroutine_inc_pool_size() API.
+ *
+ * .---.
+ * | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
+ * `---'
+ *
+ * .---.
+ * | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
+ * `---'
+ */
+typedef struct CoroutinePoolBatch {
+/* Batches are kept in a list */
+QSLIST_ENTRY(CoroutinePoolBatch) next;
 
-typedef QSLIST_HEAD(, Coroutine) CoroutineQSList;
-QEMU_DEFINE_STATIC_CO_TLS(CoroutineQSList, alloc_pool);
-QEMU_DEFINE_STATIC_CO_TLS(unsigned int, alloc_pool_size);
-QEMU_DEFINE_STATIC_CO_TLS(Notifier, coroutine_pool_cleanup_notifier);
+/* This batch holds up to 

Re: [PATCH for-9.0 v3] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-19 Thread Eugenio Perez Martin
On Tue, Mar 19, 2024 at 11:00 AM Kevin Wolf  wrote:
>
> Am 18.03.2024 um 20:27 hat Eugenio Perez Martin geschrieben:
> > On Mon, Mar 18, 2024 at 10:02 AM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Mar 18, 2024 at 12:31:26PM +0800, Jason Wang wrote:
> > > > On Fri, Mar 15, 2024 at 11:59 PM Kevin Wolf  wrote:
> > > > >
> > > > > VDUSE requires that virtqueues are first enabled before the DRIVER_OK
> > > > > status flag is set; with the current API of the kernel module, it is
> > > > > impossible to enable the opposite order in our block export code 
> > > > > because
> > > > > userspace is not notified when a virtqueue is enabled.
> > > > >
> > > > > This requirement also mathces the normal initialisation order as done 
> > > > > by
> > > > > the generic vhost code in QEMU. However, commit 6c482547 accidentally
> > > > > changed the order for vdpa-dev and broke access to VDUSE devices with
> > > > > this.
> > > > >
> > > > > This changes vdpa-dev to use the normal order again and use the 
> > > > > standard
> > > > > vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
> > > > > used with vdpa-dev again after this fix.
> > > > >
> > > > > vhost_net intentionally avoided enabling the vrings for vdpa and does
> > > > > this manually later while it does enable them for other vhost 
> > > > > backends.
> > > > > Reflect this in the vhost_net code and return early for vdpa, so that
> > > > > the behaviour doesn't change for this device.
> > > > >
> > > > > Cc: qemu-sta...@nongnu.org
> > > > > Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
> > > > > Signed-off-by: Kevin Wolf 
> > > > > ---
> > > > > v2:
> > > > > - Actually make use of the @enable parameter
> > > > > - Change vhost_net to preserve the current behaviour
> > > > >
> > > > > v3:
> > > > > - Updated trace point [Stefano]
> > > > > - Fixed typo in comment [Stefano]
> > > > >
> > > > >  hw/net/vhost_net.c | 10 ++
> > > > >  hw/virtio/vdpa-dev.c   |  5 +
> > > > >  hw/virtio/vhost-vdpa.c | 29 ++---
> > > > >  hw/virtio/vhost.c  |  8 +++-
> > > > >  hw/virtio/trace-events |  2 +-
> > > > >  5 files changed, 45 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > > > > index e8e1661646..fd1a93701a 100644
> > > > > --- a/hw/net/vhost_net.c
> > > > > +++ b/hw/net/vhost_net.c
> > > > > @@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, 
> > > > > int enable)
> > > > >  VHostNetState *net = get_vhost_net(nc);
> > > > >  const VhostOps *vhost_ops = net->dev.vhost_ops;
> > > > >
> > > > > +/*
> > > > > + * vhost-vdpa network devices need to enable dataplane 
> > > > > virtqueues after
> > > > > + * DRIVER_OK, so they can recover device state before starting 
> > > > > dataplane.
> > > > > + * Because of that, we don't enable virtqueues here and leave it 
> > > > > to
> > > > > + * net/vhost-vdpa.c.
> > > > > + */
> > > > > +if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
> > > > > +return 0;
> > > > > +}
> > > >
> > > > I think we need some inputs from Eugenio, this is only needed for
> > > > shadow virtqueue during live migration but not other cases.
> > > >
> > > > Thanks
> > >
> > >
> > > Yes I think we had a backend flag for this, right? Eugenio can you
> > > comment please?
> > >
> >
> > We have the VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend flag,
> > right. If the backend does not offer it, it is better to enable all
> > the queues here and add a migration blocker in net/vhost-vdpa.c.
> >
> > So the check should be:
> > nc->info->type == VHOST_VDPA && (backend_features &
> > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK).
> >
> > I can manage to add the migration blocker on top of this patch.
>
> Note that my patch preserves the current behaviour for vhost_net. The
> callback wasn't implemented for vdpa so far, so we never called anything
> even if the flag wasn't set. This patch adds an implementation for the
> callback, so we have to skip it here to have everything in vhost_net
> work as before - which is what the condition as written does.
>
> If we add a check for the flag now (I don't know if that's correct or
> not), that would be a second, unrelated change of behaviour in the same
> patch. So if it's necessary, that's a preexisting problem and I'd argue
> it doesn't belong in this patch, but should be done separately.
>

Right, that's a very good point. I'll add proper checking on top of
your patch when it is merged.

Reviewed-by: Eugenio Pérez 

Thanks!




Re: [PULL 00/15] Block layer patches

2024-03-19 Thread Peter Maydell
On Mon, 18 Mar 2024 at 13:04, Kevin Wolf  wrote:
>
> The following changes since commit ba49d760eb04630e7b15f423ebecf6c871b8f77b:
>
>   Merge tag 'pull-maintainer-final-130324-1' of 
> https://gitlab.com/stsquad/qemu into staging (2024-03-13 15:12:14 +)
>
> are available in the Git repository at:
>
>   https://repo.or.cz/qemu/kevin.git tags/for-upstream
>
> for you to fetch changes up to 39a94d7c34ce9d222fa9c0c99a14e20a567456d7:
>
>   iotests: adapt to output change for recently introduced 'detached header' 
> field (2024-03-18 13:33:54 +0100)
>
> 
> Block layer patches
>
> - mirror: Fix deadlock
> - nbd/server: Fix race in draining the export
> - qemu-img snapshot: Fix formatting with large values
> - Fix blockdev-snapshot-sync error reporting for no medium
> - iotests fixes
>


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/9.0
for any user-visible changes.

-- PMM



Re: [PATCH for-9.0 v3] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-19 Thread Kevin Wolf
Am 18.03.2024 um 20:27 hat Eugenio Perez Martin geschrieben:
> On Mon, Mar 18, 2024 at 10:02 AM Michael S. Tsirkin  wrote:
> >
> > On Mon, Mar 18, 2024 at 12:31:26PM +0800, Jason Wang wrote:
> > > On Fri, Mar 15, 2024 at 11:59 PM Kevin Wolf  wrote:
> > > >
> > > > VDUSE requires that virtqueues are first enabled before the DRIVER_OK
> > > > status flag is set; with the current API of the kernel module, it is
> > > > impossible to enable the opposite order in our block export code because
> > > > userspace is not notified when a virtqueue is enabled.
> > > >
> > > > This requirement also mathces the normal initialisation order as done by
> > > > the generic vhost code in QEMU. However, commit 6c482547 accidentally
> > > > changed the order for vdpa-dev and broke access to VDUSE devices with
> > > > this.
> > > >
> > > > This changes vdpa-dev to use the normal order again and use the standard
> > > > vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
> > > > used with vdpa-dev again after this fix.
> > > >
> > > > vhost_net intentionally avoided enabling the vrings for vdpa and does
> > > > this manually later while it does enable them for other vhost backends.
> > > > Reflect this in the vhost_net code and return early for vdpa, so that
> > > > the behaviour doesn't change for this device.
> > > >
> > > > Cc: qemu-sta...@nongnu.org
> > > > Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
> > > > Signed-off-by: Kevin Wolf 
> > > > ---
> > > > v2:
> > > > - Actually make use of the @enable parameter
> > > > - Change vhost_net to preserve the current behaviour
> > > >
> > > > v3:
> > > > - Updated trace point [Stefano]
> > > > - Fixed typo in comment [Stefano]
> > > >
> > > >  hw/net/vhost_net.c | 10 ++
> > > >  hw/virtio/vdpa-dev.c   |  5 +
> > > >  hw/virtio/vhost-vdpa.c | 29 ++---
> > > >  hw/virtio/vhost.c  |  8 +++-
> > > >  hw/virtio/trace-events |  2 +-
> > > >  5 files changed, 45 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > > > index e8e1661646..fd1a93701a 100644
> > > > --- a/hw/net/vhost_net.c
> > > > +++ b/hw/net/vhost_net.c
> > > > @@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, int 
> > > > enable)
> > > >  VHostNetState *net = get_vhost_net(nc);
> > > >  const VhostOps *vhost_ops = net->dev.vhost_ops;
> > > >
> > > > +/*
> > > > + * vhost-vdpa network devices need to enable dataplane virtqueues 
> > > > after
> > > > + * DRIVER_OK, so they can recover device state before starting 
> > > > dataplane.
> > > > + * Because of that, we don't enable virtqueues here and leave it to
> > > > + * net/vhost-vdpa.c.
> > > > + */
> > > > +if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
> > > > +return 0;
> > > > +}
> > >
> > > I think we need some inputs from Eugenio, this is only needed for
> > > shadow virtqueue during live migration but not other cases.
> > >
> > > Thanks
> >
> >
> > Yes I think we had a backend flag for this, right? Eugenio can you
> > comment please?
> >
> 
> We have the VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend flag,
> right. If the backend does not offer it, it is better to enable all
> the queues here and add a migration blocker in net/vhost-vdpa.c.
> 
> So the check should be:
> nc->info->type == VHOST_VDPA && (backend_features &
> VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK).
> 
> I can manage to add the migration blocker on top of this patch.

Note that my patch preserves the current behaviour for vhost_net. The
callback wasn't implemented for vdpa so far, so we never called anything
even if the flag wasn't set. This patch adds an implementation for the
callback, so we have to skip it here to have everything in vhost_net
work as before - which is what the condition as written does.

If we add a check for the flag now (I don't know if that's correct or
not), that would be a second, unrelated change of behaviour in the same
patch. So if it's necessary, that's a preexisting problem and I'd argue
it doesn't belong in this patch, but should be done separately.

Kevin




[PATCH] block/io: accept NULL qiov in bdrv_pad_request

2024-03-19 Thread Fiona Ebner
From: Stefan Reiter 

Some operations, e.g. block-stream, perform reads while discarding the
results (only copy-on-read matters). In this case, they will pass NULL
as the target QEMUIOVector, which will however trip bdrv_pad_request,
since it wants to extend its passed vector. In particular, this is the
case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes
and offset still need to be updated, so the subsequent aligned read
will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top
node is larger than the allocated part of the bottom node, in which
case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev 
> qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> < {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": 
> "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": 
> "node1" } }
> EOF

Originally-by: Stefan Reiter 
Signed-off-by: Thomas Lamprecht 
[FE: do update bytes and offset in any case
 add reproducer to commit message]
Signed-off-by: Fiona Ebner 
---
 block/io.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
 return 0;
 }
 
-sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-  _head, _tail,
-  _niov);
+/*
+ * For prefetching in stream_populate(), no qiov is passed along, because
+ * only copy-on-read matters.
+ */
+if (qiov && *qiov) {
+sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+  _head, _tail,
+  _niov);
 
-/* Guaranteed by bdrv_check_request32() */
-assert(*bytes <= SIZE_MAX);
-ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-  sliced_head, *bytes);
-if (ret < 0) {
-bdrv_padding_finalize(pad);
-return ret;
+/* Guaranteed by bdrv_check_request32() */
+assert(*bytes <= SIZE_MAX);
+ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+  sliced_head, *bytes);
+if (ret < 0) {
+bdrv_padding_finalize(pad);
+return ret;
+}
+*qiov = >local_qiov;
+*qiov_offset = 0;
 }
+
 *bytes += pad->head + pad->tail;
 *offset -= pad->head;
-*qiov = >local_qiov;
-*qiov_offset = 0;
 if (padded) {
 *padded = true;
 }
-- 
2.39.2





Re: [PATCH] aspeed/smc: Only wire flash devices at reset

2024-03-19 Thread Thomas Huth

On 19/03/2024 08.33, Cédric Le Goater wrote:

The Aspeed machines have many Static Memory Controllers (SMC), up to
8, which can only drive flash memory devices. Commit 27a2c66c92ec
("aspeed/smc: Wire CS lines at reset") tried to ease the definitions
of these devices by allowing flash devices from the command line to be
attached to a SSI bus. For that, the wiring of the CS lines of the
Aspeed SMC controller was moved at reset. Two assumptions are made
though, first that the device has a SSI_GPIO_CS GPIO line, which is
not always the case, and second that it is flash device.

Correct this problem by ensuring that the devices attached to the bus
are the correct flash type. This fixes a QEMU abort when devices
without a CS line, such as the max111x, are passed on the command
line.

While at it, export TYPE_M25P80 used in the Xilinx Versal Virtual
machine.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2228
Fixes: 27a2c66c92ec ("aspeed/smc: Wire CS lines at reset")
Reported-by: Thomas Huth 
Signed-off-by: Cédric Le Goater 
---


Thanks!

Reviewed-by: Thomas Huth 
Tested-by: Thomas Huth 




[PATCH] aspeed/smc: Only wire flash devices at reset

2024-03-19 Thread Cédric Le Goater
The Aspeed machines have many Static Memory Controllers (SMC), up to
8, which can only drive flash memory devices. Commit 27a2c66c92ec
("aspeed/smc: Wire CS lines at reset") tried to ease the definitions
of these devices by allowing flash devices from the command line to be
attached to a SSI bus. For that, the wiring of the CS lines of the
Aspeed SMC controller was moved at reset. Two assumptions are made
though, first that the device has a SSI_GPIO_CS GPIO line, which is
not always the case, and second that it is flash device.

Correct this problem by ensuring that the devices attached to the bus
are the correct flash type. This fixes a QEMU abort when devices
without a CS line, such as the max111x, are passed on the command
line.

While at it, export TYPE_M25P80 used in the Xilinx Versal Virtual
machine.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2228
Fixes: 27a2c66c92ec ("aspeed/smc: Wire CS lines at reset")
Reported-by: Thomas Huth 
Signed-off-by: Cédric Le Goater 
---
 include/hw/block/flash.h  | 2 ++
 hw/arm/xlnx-versal-virt.c | 3 ++-
 hw/block/m25p80.c | 1 -
 hw/ssi/aspeed_smc.c   | 9 +
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/hw/block/flash.h b/include/hw/block/flash.h
index de93756cbe8f..2b5ccd92f463 100644
--- a/include/hw/block/flash.h
+++ b/include/hw/block/flash.h
@@ -78,6 +78,8 @@ extern const VMStateDescription vmstate_ecc_state;
 
 /* m25p80.c */
 
+#define TYPE_M25P80 "m25p80-generic"
+
 BlockBackend *m25p80_get_blk(DeviceState *dev);
 
 #endif
diff --git a/hw/arm/xlnx-versal-virt.c b/hw/arm/xlnx-versal-virt.c
index bfaed1aebfc6..962f98fee2ea 100644
--- a/hw/arm/xlnx-versal-virt.c
+++ b/hw/arm/xlnx-versal-virt.c
@@ -13,6 +13,7 @@
 #include "qemu/error-report.h"
 #include "qapi/error.h"
 #include "sysemu/device_tree.h"
+#include "hw/block/flash.h"
 #include "hw/boards.h"
 #include "hw/sysbus.h"
 #include "hw/arm/fdt.h"
@@ -759,7 +760,7 @@ static void versal_virt_init(MachineState *machine)
 flash_klass = object_class_by_name(s->ospi_model);
 if (!flash_klass ||
 object_class_is_abstract(flash_klass) ||
-!object_class_dynamic_cast(flash_klass, "m25p80-generic")) {
+!object_class_dynamic_cast(flash_klass, TYPE_M25P80)) {
 error_setg(_fatal, "'%s' is either abstract or"
" not a subtype of m25p80", s->ospi_model);
 return;
diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 08a00a6d9b89..8dec134832a1 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -515,7 +515,6 @@ struct M25P80Class {
 FlashPartInfo *pi;
 };
 
-#define TYPE_M25P80 "m25p80-generic"
 OBJECT_DECLARE_TYPE(Flash, M25P80Class, M25P80)
 
 static inline Manufacturer get_man(Flash *s)
diff --git a/hw/ssi/aspeed_smc.c b/hw/ssi/aspeed_smc.c
index 0dff1d910778..de57e690e124 100644
--- a/hw/ssi/aspeed_smc.c
+++ b/hw/ssi/aspeed_smc.c
@@ -23,6 +23,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "hw/block/flash.h"
 #include "hw/sysbus.h"
 #include "migration/vmstate.h"
 #include "qemu/log.h"
@@ -711,6 +712,14 @@ static void aspeed_smc_reset(DeviceState *d)
 for (i = 0; i < asc->cs_num_max; i++) {
 DeviceState *dev = ssi_get_cs(s->spi, i);
 if (dev) {
+Object *o = OBJECT(dev);
+
+if (!object_dynamic_cast(o, TYPE_M25P80)) {
+warn_report("Aspeed SMC %s.%d : Invalid %s device type",
+BUS(s->spi)->name, i, object_get_typename(o));
+continue;
+}
+
 qemu_irq cs_line = qdev_get_gpio_in_named(dev, SSI_GPIO_CS, 0);
 qdev_connect_gpio_out_named(DEVICE(s), "cs", i, cs_line);
 }
-- 
2.44.0