Re: [PATCH v2 0/3] docs: define policy forbidding use of "AI" / LLM code generators

2024-05-21 Thread Stefan Hajnoczi
On Thu, 16 May 2024 at 12:23, Daniel P. Berrangé  wrote:
>
> This patch kicks the hornet's nest of AI / LLM code generators.
>
> With the increasing interest in code generators in recent times,
> it is inevitable that QEMU contributions will include AI generated
> code. Thus far we have remained silent on the matter. Given that
> everyone knows these tools exist, our current position has to be
> considered tacit acceptance of the use of AI generated code in QEMU.
>
> The question for the project is whether that is a good position for
> QEMU to take or not?
>
> IANAL, but I like to think I'm reasonably proficient at understanding
> open source licensing. I am not inherently against the use of AI tools;
> rather, I am anti-risk. I also want to see OSS licenses respected and
> complied with.
>
> AFAICT at its current state of (im)maturity the question of licensing
> of AI code generator output does not have a broadly accepted / settled
> legal position. This is an inherent bias/self-interest from the vendors
> promoting their usage, who tend to minimize/dismiss the legal questions.
> From my POV, this puts such tools in a position of elevated legal risk.
>
> Given the fuzziness over the legal position of generated code from
> such tools, I don't consider it credible (today) for a contributor
> to assert compliance with the DCO terms (b) or (c) (which is a stated
> pre-requisite for QEMU accepting patches) when a patch includes (or is
> derived from) AI generated code.
>
> By implication, I think that QEMU must (for now) explicitly decline
> to (knowingly) accept AI generated code.
>
> Perhaps a few years down the line the legal uncertainty will have
> reduced and we can re-evaluate this policy.
>
> Discuss...

Although this policy is unenforceable, I think it's a valid position
to take until the legal situation becomes clear.

Acked-by: Stefan Hajnoczi 



Re: [PATCH 00/20] qapi: new sphinx qapi domain pre-requisites

2024-05-16 Thread Stefan Hajnoczi
  |   4 +-
>  scripts/qapi/visit.py |   9 +-
>  tests/qapi-schema/doc-empty-section.err   |   2 +-
>  tests/qapi-schema/doc-empty-section.json  |   2 +-
>  tests/qapi-schema/doc-good.json   |  18 +-
>  tests/qapi-schema/doc-good.out|  61 +++---
>  tests/qapi-schema/doc-good.txt    |  31 +--
>  .../qapi-schema/doc-interleaved-section.json  |   2 +-
>  47 files changed, 1152 insertions(+), 753 deletions(-)
>  create mode 100755 scripts/qapi-lint.sh
>  create mode 100644 scripts/qapi/Makefile
> 
> -- 
> 2.44.0
> 
> 

For block-core.json/block-export.json/block.json:

Acked-by: Stefan Hajnoczi 




Re: [PATCH] scripts/simpletrace: Mark output with unstable timestamp as WARN

2024-05-14 Thread Stefan Hajnoczi
On Tue, May 14, 2024, 03:57 Zhao Liu  wrote:

> Hi Stefan,
>
> > QEMU uses clock_gettime(CLOCK_MONOTONIC) on Linux hosts. The man page
> > says:
> >
> >   All CLOCK_MONOTONIC variants guarantee that the time returned by
> >   consecutive  calls  will  not go backwards, but successive calls
> >   may—depending  on  the  architecture—return  identical  (not-in‐
> >   creased) time values.
> >
> > trace_record_start() calls clock_gettime(CLOCK_MONOTONIC) so trace events
> > should have monotonically increasing timestamps.
> >
> > I don't see a scenario where trace record A's timestamp is greater than
> > trace record B's timestamp unless the clock is non-monotonic.
> >
> > Which host CPU architecture and operating system are you running?
>
> I tested on these 2 machines:
> * CML (intel 10th) with Ubuntu 22.04 + kernel v6.5.0-28
> * MTL (intel 14th) with Ubuntu 22.04.2 + kernel v6.9.0
>
> > Please attach to the QEMU process with gdb and print out the value of
> > the use_rt_clock variable or add a printf in init_get_clock(). The value
> > should be 1.
>
> Thanks. On both of the above machines, use_rt_clock is 1, and both show
> the timestamp reversal issue with the following debug print:
>
> diff --git a/include/qemu/timer.h b/include/qemu/timer.h
> index 9a366e551fb3..7657785c27dc 100644
> --- a/include/qemu/timer.h
> +++ b/include/qemu/timer.h
> @@ -831,10 +831,17 @@ extern int use_rt_clock;
>
>  static inline int64_t get_clock(void)
>  {
> +static int64_t clock = 0;
>

Please try with a thread local variable (__thread) to check whether this
happens within a single thread.

If it only happens with a global variable then we'd need to look more
closely at race conditions in the patch below. I don't think the patch is a
reliable way to detect non-monotonic timestamps in a multi-threaded program.
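I.e. something like this (untested sketch, same hack but per-thread; it could
also live in a small helper instead of get_clock() itself):

    /* untested sketch: per-thread monotonicity check */
    static void check_monotonic(int64_t now)
    {
        /* thread-local: only compares timestamps taken on this thread */
        static __thread int64_t prev;

        if (now < prev) {
            printf("clock went backwards on this thread: %ld -> %ld\n",
                   prev, now);
        }
        prev = now;
    }

called with the value get_clock() is about to return.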

>  if (use_rt_clock) {
>  struct timespec ts;
>  clock_gettime(CLOCK_MONOTONIC, &ts);
> -return ts.tv_sec * 1000000000LL + ts.tv_nsec;
> +int64_t tmp = ts.tv_sec * 1000000000LL + ts.tv_nsec;
> +if (tmp <= clock) {
> +printf("get_clock: strange, clock: %ld, tmp: %ld\n", clock,
> tmp);
> +}
> +assert(tmp > clock);
> +clock = tmp;
> +return clock;
>  } else {
>  /* XXX: using gettimeofday leads to problems if the date
> changes, so it should be avoided. */
> diff --git a/util/qemu-timer-common.c b/util/qemu-timer-common.c
> index cc1326f72646..3bf06eb4a4ce 100644
> --- a/util/qemu-timer-common.c
> +++ b/util/qemu-timer-common.c
> @@ -59,5 +59,6 @@ static void __attribute__((constructor))
> init_get_clock(void)
>  use_rt_clock = 1;
>  }
>  clock_start = get_clock();
> +printf("init_get_clock: use_rt_clock: %d\n", use_rt_clock);
>  }
>  #endif
>
> ---
> The timestamp interval is very small, for example:
> get_clock: strange, clock: 3302130503505, tmp: 3302130503503
>
> or
>
> get_clock: strange, clock: 2761577819846455, tmp: 2761577819846395
>
> I also tried to use CLOCK_MONOTONIC_RAW, but there's still the reversal
> issue.
>
> Thanks,
> Zhao
>
>


Re: [PATCH v4 00/12] vhost-user: support any POSIX system (tested on macOS, FreeBSD, OpenBSD)

2024-05-09 Thread Stefan Hajnoczi
jects/libvhost-user/libvhost-user.h |   2 +-
>  backends/hostmem-shm.c| 123 ++
>  contrib/vhost-user-blk/vhost-user-blk.c   |  27 +++--
>  contrib/vhost-user-input/main.c   |  16 +--
>  hw/net/vhost_net.c|   5 +
>  subprojects/libvhost-user/libvhost-user.c |  76 -
>  tests/qtest/vhost-user-blk-test.c |   2 +-
>  tests/qtest/vhost-user-test.c |  23 
>  util/vhost-user-server.c  |  12 +++
>  backends/meson.build  |   1 +
>  hw/block/Kconfig  |   2 +-
>  qemu-options.hx   |  13 +++
>  util/meson.build  |   4 +-
>  16 files changed, 305 insertions(+), 28 deletions(-)
>  create mode 100644 backends/hostmem-shm.c
> 
> -- 
> 2.45.0
> 

Acked-by: Stefan Hajnoczi 




Re: [PATCH 5/9] hw/scsi: add persistent reservation in/out api for scsi device

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:25PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations in the
> SCSI device layer. By introducing the persistent
> reservation in/out api, this enables the SCSI device
> to perform reservation-related tasks, including querying
> keys, querying reservation status, registering reservation
> keys, initiating and releasing reservations, as well as
> clearing and preempting reservations held by other keys.
> 
> These operations are crucial for management and control of
> shared storage resources in a persistent manner.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/scsi/scsi-disk.c | 302 
>  1 file changed, 302 insertions(+)

Does sg_persist --report-capabilities
(https://linux.die.net/man/8/sg_persist) show that persistent
reservations are supported? Maybe some INQUIRY data still needs to be
added.

> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 4bd7af9d0c..bdd66c4026 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -32,6 +32,7 @@
>  #include "migration/vmstate.h"
>  #include "hw/scsi/emulation.h"
>  #include "scsi/constants.h"
> +#include "scsi/utils.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
> @@ -1474,6 +1475,296 @@ static void scsi_disk_emulate_read_data(SCSIRequest 
> *req)
>  scsi_req_complete(&r->req, GOOD);
>  }
>  
> +typedef struct SCSIPrReadKeys {
> +uint32_t generation;
> +uint32_t num_keys;
> +uint64_t *keys;
> +void *req;
> +} SCSIPrReadKeys;
> +
> +typedef struct SCSIPrReadReservation {
> +uint32_t generation;
> +uint64_t key;
> +BlockPrType type;
> +void *req;
> +} SCSIPrReadReservation;
> +
> +static void scsi_pr_read_keys_complete(void *opaque, int ret)
> +{
> +int num_keys;
> +uint8_t *buf;
> +SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +num_keys = MIN(blk_keys->num_keys, ret);
> +blk_keys->generation = cpu_to_be32(blk_keys->generation);
> +memcpy(&buf[0], &blk_keys->generation, 4);
> +for (int i = 0; i < num_keys; i++) {
> +blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]);
> +memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8);
> +}
> +num_keys = cpu_to_be32(num_keys * 8);
> +memcpy(&buf[4], &num_keys, 4);
> +
> +scsi_req_data(&r->req, r->buflen);
> +done:
> +scsi_req_unref(&r->req);
> +g_free(blk_keys->keys);
> +g_free(blk_keys);
> +}
> +
> +static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req)
> +{
> +SCSIPrReadKeys *blk_keys;
> +SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +int buflen = MIN(r->req.cmd.xfer, r->buflen);
> +int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t);
> +
> +blk_keys = g_new0(SCSIPrReadKeys, 1);
> +blk_keys->generation = 0;
> +/* num_keys is the maximum number of keys that can be transmitted */
> +blk_keys->num_keys = num_keys;
> +blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys);
> +blk_keys->req = r;
> +
> +scsi_req_ref(&r->req);
> +r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk,
> &blk_keys->generation,
> +blk_keys->num_keys, blk_keys->keys,
> +scsi_pr_read_keys_complete, 
> blk_keys);
> +return 0;
> +}
> +
> +static void scsi_pr_read_reservation_complete(void *opaque, int ret)
> +{
> +uint8_t *buf;
> +uint32_t num_keys = 0;
> +SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +blk_rsv->generation = cpu_to_be32(blk_rsv->generation);
> +memcpy(&buf[0], &blk_rsv->generation, 4);
> +if (ret) {
> +num_keys = cpu_to_be32(16);
> +blk_rsv->key = cpu_to_be64(blk_rsv->key);
> +memcpy(&buf[8], &blk_rsv->key, 8);
> +buf[21] = block_pr_type_to_scsi(blk_rsv->type) & 0xf;
> +} else {
> +num_keys = cpu_to_be32(0);
> +}
> +
> +memcpy(&buf[4], &num_keys, 4);
> +scsi_req_data(&r->req, r->buflen);
> +
> +done:
> +

Re: [PATCH 0/9] Support persistent reservation operations

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:20PM +0800, Changqi Lu wrote:
> Hi,
> 
> I am going to introduce persistent reservation for QEMU block.
> There are three parts in this series:
> 
> Firstly, at the block layer, the commit abstracts seven APIs related to
> the persistent reservation command. These APIs including reading keys,
> reading reservations, registering, reserving, releasing, clearing and 
> preempting.
> 
> Next, the commit implements the necessary pr-related operation APIs for both 
> the
> SCSI protocol and NVMe protocol at the device layer. This ensures that the 
> necessary
> functionality is available for handling persistent reservations in these 
> protocols.
> 
> Finally, the commit includes adaptations to the iscsi driver at the driver 
> layer
> to verify the correct implementation and functionality of the changes.
> 
> With these changes, GFS works fine in the guest. Also, sg-utils(for SCSI 
> block) and
> nvme-cli(for NVMe block) work fine too.

What is the relationship to the existing PRManager functionality
(docs/interop/pr-helper.rst) where block/file-posix.c interprets SCSI
ioctls and sends persistent reservation requests to an external helper
process?

I wonder if block/file-posix.c can implement the new block driver
callbacks using pr_mgr (while keeping the existing scsi-generic
support).

Stefan




Re: [PATCH 7/9] hw/nvme: add helper functions for converting reservation types

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:27PM +0800, Changqi Lu wrote:
> This commit introduces two helper functions
> that facilitate the conversion between the
> reservation types used in the NVME protocol
> and those used in the block layer.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/nvme/nvme.h | 40 
>  1 file changed, 40 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 6/9] block/nvme: add reservation command protocol constants

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:26PM +0800, Changqi Lu wrote:
> Add constants for the NVMe persistent command protocol.
> The constants include the reservation command opcode and
> reservation type values defined in section 7 of the NVMe
> 2.0 specification.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/block/nvme.h | 30 ++
>  1 file changed, 30 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 5/9] hw/scsi: add persistent reservation in/out api for scsi device

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:25PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations in the
> SCSI device layer. By introducing the persistent
> reservation in/out api, this enables the SCSI device
> to perform reservation-related tasks, including querying
> keys, querying reservation status, registering reservation
> keys, initiating and releasing reservations, as well as
> clearing and preempting reservations held by other keys.
> 
> These operations are crucial for management and control of
> shared storage resources in a persistent manner.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/scsi/scsi-disk.c | 302 
>  1 file changed, 302 insertions(+)

Paolo: Please review for buffer overflows. I find the buffer size
assumption in the SCSI layer mysterious (e.g. scsi_req_get_buf() returns
a void* and it's not obvious how large the buffer is), so I didn't check
for buffer overflows.

> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 4bd7af9d0c..bdd66c4026 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -32,6 +32,7 @@
>  #include "migration/vmstate.h"
>  #include "hw/scsi/emulation.h"
>  #include "scsi/constants.h"
> +#include "scsi/utils.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
> @@ -1474,6 +1475,296 @@ static void scsi_disk_emulate_read_data(SCSIRequest 
> *req)
>  scsi_req_complete(&r->req, GOOD);
>  }
>  
> +typedef struct SCSIPrReadKeys {
> +uint32_t generation;
> +uint32_t num_keys;
> +uint64_t *keys;
> +void *req;
> +} SCSIPrReadKeys;
> +
> +typedef struct SCSIPrReadReservation {
> +uint32_t generation;
> +uint64_t key;
> +BlockPrType type;
> +void *req;
> +} SCSIPrReadReservation;
> +
> +static void scsi_pr_read_keys_complete(void *opaque, int ret)
> +{
> +int num_keys;
> +uint8_t *buf;
> +SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +num_keys = MIN(blk_keys->num_keys, ret);
> +blk_keys->generation = cpu_to_be32(blk_keys->generation);
> +memcpy(&buf[0], &blk_keys->generation, 4);
> +for (int i = 0; i < num_keys; i++) {
> +blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]);
> +memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8);
> +}
> +num_keys = cpu_to_be32(num_keys * 8);
> +memcpy(&buf[4], &num_keys, 4);
> +
> +scsi_req_data(&r->req, r->buflen);
> +done:
> +scsi_req_unref(&r->req);
> +g_free(blk_keys->keys);
> +g_free(blk_keys);
> +}
> +
> +static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req)
> +{
> +SCSIPrReadKeys *blk_keys;
> +SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +int buflen = MIN(r->req.cmd.xfer, r->buflen);
> +int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t);
> +
> +blk_keys = g_new0(SCSIPrReadKeys, 1);
> +blk_keys->generation = 0;
> +/* num_keys is the maximum number of keys that can be transmitted */
> +blk_keys->num_keys = num_keys;
> +blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys);
> +blk_keys->req = r;
> +
> +scsi_req_ref(&r->req);
> +r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk,
> &blk_keys->generation,
> +blk_keys->num_keys, blk_keys->keys,
> +scsi_pr_read_keys_complete, 
> blk_keys);
> +return 0;
> +}
> +
> +static void scsi_pr_read_reservation_complete(void *opaque, int ret)
> +{
> +uint8_t *buf;
> +uint32_t num_keys = 0;
> +SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +blk_rsv->generation = cpu_to_be32(blk_rsv->generation);
> +memcpy(&buf[0], &blk_rsv->generation, 4);
> +if (ret) {
> +num_keys = cpu_to_be32(16);

The SPC-6 calls this field Additional Length, which is clearer than
num_keys (there is only 1 key here!). Could you rename it to
additional_len to avoid confusion?
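I.e. something like (untested sketch):

    uint32_t additional_len = cpu_to_be32(16);
    memcpy(&buf[4], &additional_len, 4);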

> +blk_rsv->key = cpu_to_be64(blk_rsv->key);
> +memcpy(&buf[8], &blk_rsv->key, 8);
> +

Re: [PATCH 4/9] scsi/util: add helper functions for persistent reservation types conversion

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:24PM +0800, Changqi Lu wrote:
> This commit introduces two helper functions
> that facilitate the conversion between the
> persistent reservation types used in the SCSI
> protocol and those used in the block layer.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/scsi/utils.h |  5 +
>  scsi/utils.c | 40 
>  2 files changed, 45 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 3/9] scsi/constant: add persistent reservation in/out protocol constants

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:23PM +0800, Changqi Lu wrote:
> Add constants for the persistent reservation in/out protocol
> in the scsi/constant module. The constants include the persistent
> reservation command, type, and scope values defined in sections
> 6.13 and 6.14 of the SCSI Primary Commands-4 (SPC-4) specification.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/scsi/constants.h | 29 +
>  1 file changed, 29 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 2/9] block/raw: add persistent reservation in/out driver

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:22PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations for raw driver.
> The following methods are implemented: bdrv_co_pr_read_keys,
> bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
> bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/raw-format.c | 55 ++
>  1 file changed, 55 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 1/9] block: add persistent reservation in/out api

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:21PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations
> at the block level. The following operations
> are included:
> 
> - read_keys:retrieves the list of registered keys.
> - read_reservation: retrieves the current reservation status.
> - register: registers a new reservation key.
> - reserve:  initiates a reservation for a specific key.
> - release:  releases a reservation for a specific key.
> - clear:clears all existing reservations.
> - preempt:  preempts a reservation held by another key.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/block-backend.c | 386 ++
>  block/io.c| 161 +
>  include/block/block-common.h  |   9 +
>  include/block/block-io.h  |  19 ++
>  include/block/block_int-common.h  |  31 +++
>  include/sysemu/block-backend-io.h |  22 ++
>  6 files changed, 628 insertions(+)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index db6f9b92a3..ec67937d28 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1770,6 +1770,392 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned 
> long int req, void *buf,
>  return blk_aio_prwv(blk, req, 0, buf, blk_aio_ioctl_entry, 0, cb, 
> opaque);
>  }
>  
> +typedef struct BlkPrInCo {
> +BlockBackend *blk;
> +uint32_t *generation;
> +uint32_t num_keys;
> +BlockPrType *type;
> +uint64_t *keys;
> +int ret;
> +} BlkPrInCo;
> +
> +typedef struct BlkPrInCB {
> +BlockAIOCB common;
> +BlkPrInCo prco;
> +bool has_returned;
> +} BlkPrInCB;
> +
> +static const AIOCBInfo blk_pr_in_aiocb_info = {
> +.aiocb_size = sizeof(BlkPrInCB),
> +};
> +
> +static void blk_pr_in_complete(BlkPrInCB *acb)
> +{
> +if (acb->has_returned) {
> +acb->common.cb(acb->common.opaque, acb->prco.ret);
> +blk_dec_in_flight(acb->prco.blk);

Please add a comment identifying which blk_inc_in_flight() call this dec
is paired with. That makes it easier for people trying to understand
in-flight reference counting.
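For example (sketch):

    /* Paired with blk_inc_in_flight() in blk_aio_pr_in() */
    blk_dec_in_flight(acb->prco.blk);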

> +qemu_aio_unref(acb);
> +}
> +}
> +
> +static void blk_pr_in_complete_bh(void *opaque)
> +{
> +BlkPrInCB *acb = opaque;
> +assert(acb->has_returned);
> +blk_pr_in_complete(acb);
> +}
> +
> +static BlockAIOCB *blk_aio_pr_in(BlockBackend *blk, uint32_t *generation,
> + uint32_t num_keys, BlockPrType *type,
> + uint64_t *keys, CoroutineEntry co_entry,
> + BlockCompletionFunc *cb, void *opaque)
> +{
> +BlkPrInCB *acb;
> +Coroutine *co;
> +
> +blk_inc_in_flight(blk);

Please add a comment identifying which blk_dec_in_flight() call this inc
is paired with.

> +acb = blk_aio_get(&blk_pr_in_aiocb_info, blk, cb, opaque);
> +acb->prco = (BlkPrInCo) {
> +.blk= blk,
> +.generation = generation,
> +.num_keys   = num_keys,
> +.type   = type,
> +.ret= NOT_DONE,
> +.keys   = keys,
> +};
> +acb->has_returned = false;
> +
> +co = qemu_coroutine_create(co_entry, acb);
> +aio_co_enter(qemu_get_current_aio_context(), co);
> +
> +acb->has_returned = true;
> +if (acb->prco.ret != NOT_DONE) {
> +replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
> + blk_pr_in_complete_bh, acb);
> +}
> +
> +return &acb->common;
> +}
> +
> +
> +static int coroutine_fn
> +blk_aio_pr_do_read_keys(BlockBackend *blk, uint32_t *generation,
> +uint32_t num_keys, uint64_t *keys)
> +{
> +IO_CODE();
> +
> +blk_wait_while_drained(blk);
> +GRAPH_RDLOCK_GUARD();
> +
> +if (!blk_co_is_available(blk)) {
> +return -ENOMEDIUM;
> +}
> +
> +return bdrv_co_pr_read_keys(blk_bs(blk), generation, num_keys, keys);
> +}
> +
> +static void coroutine_fn blk_aio_pr_read_keys_entry(void *opaque)
> +{
> +BlkPrInCB *acb = opaque;
> +BlkPrInCo *prco = &acb->prco;
> +
> +prco->ret = blk_aio_pr_do_read_keys(prco->blk, prco->generation,
> +prco->num_keys, prco->keys);
> +blk_pr_in_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_pr_read_keys(BlockBackend *blk, uint32_t *generation,
> + uint32_t num_keys, uint64_t *keys,
> + BlockCompletionFunc *cb, void *opaque)
> +{
> +IO_CODE();
> +return blk_aio_pr_in(blk, generation, num_keys, NULL, keys,
> + blk_aio_pr_read_keys_entry, cb, opaque);
> +}
> +
> +
> +static int coroutine_fn
> +blk_aio_pr_do_read_reservation(BlockBackend *blk, uint32_t *generation,
> +   uint64_t *key, BlockPrType *type)
> +{
> +IO_CODE();
> +
> +blk_wait_while_drained(blk);
> +

Re: [PATCH] scripts/simpletrace: Mark output with unstable timestamp as WARN

2024-05-09 Thread Stefan Hajnoczi
On Thu, May 09, 2024 at 11:59:10AM +0800, Zhao Liu wrote:
> On Wed, May 08, 2024 at 02:05:04PM -0400, Stefan Hajnoczi wrote:
> > Date: Wed, 8 May 2024 14:05:04 -0400
> > From: Stefan Hajnoczi 
> > Subject: Re: [PATCH] scripts/simpletrace: Mark output with unstable
> >  timestamp as WARN
> > 
> > On Wed, 8 May 2024 at 00:19, Zhao Liu  wrote:
> > >
> > > In some trace logs, there are unstable timestamps breaking the
> > > temporal ordering of trace records. For example:
> > >
> > > kvm_run_exit -0.015 pid=3289596 cpu_index=0x0 reason=0x6
> > > kvm_vm_ioctl -0.020 pid=3289596 type=0xc008ae67 arg=0x7ffeefb5aa60
> > > kvm_vm_ioctl -0.021 pid=3289596 type=0xc008ae67 arg=0x7ffeefb5aa60
> > >
> > > Negative delta intervals tend to get drowned in the massive trace logs,
> > > and an unstable timestamp can corrupt the calculation of intervals
> > > between two events adjacent to it.
> > >
> > > Therefore, mark the outputs with unstable timestamps as WARN like:
> > 
> > Why are the timestamps non-monotonic?
> > 
> > In a situation like that maybe not only the negative timestamps are
> > useless but even some positive timestamps are incorrect. I think it's
> > worth understanding the nature of the instability before merging a
> > fix.
> 
> I grabbed more traces (with -trace "*" to cover as many events as
> possible) and have a couple of observations:
> 
> * It's not an issue specific to KVM's ioctls; qemu_mutex_lock/
>   object_dynamic_cast_assert accounted for the vast majority of all
>   anomalous timestamps.
> * Overall, anomalous timestamps occurred at a rate of roughly 0.013%
>   (608 / 4,616,938) in a 398M trace file.
> * The intervals between a "wrong" timestamp and its preceding event are
>   almost all within 50ns, mostly concentrated within 20ns (there are
>   even quite a few in the 1~10ns range).
> 
> Just a guess:
> 
> Could it be that a trace event occurs within a very short interval of
> the previous one, and happens to hit a delay when signaling
> writeout_thread?
>
> Such a short interval seems to point to a lack of real-time guarantees
> in the underlying mechanism...
>
> If it were QEMU's own issue, I would expect the intervals to be random,
> not concentrated within 50ns...
> 
> May I ask what you think? Any suggestions for researching this situation
> ;-)

QEMU uses clock_gettime(CLOCK_MONOTONIC) on Linux hosts. The man page
says:

  All CLOCK_MONOTONIC variants guarantee that the time returned by
  consecutive  calls  will  not go backwards, but successive calls
  may—depending  on  the  architecture—return  identical  (not-in‐
  creased) time values.

trace_record_start() calls clock_gettime(CLOCK_MONOTONIC) so trace events
should have monotonically increasing timestamps.

I don't see a scenario where trace record A's timestamp is greater than
trace record B's timestamp unless the clock is non-monotonic.

Which host CPU architecture and operating system are you running?

Please attach to the QEMU process with gdb and print out the value of
the use_rt_clock variable or add a printf in init_get_clock(). The value
should be 1.
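If you can reproduce it easily, it may also be worth checking whether the
reversal shows up outside QEMU. A standalone checker (untested sketch):

    /* clockcheck.c - build with: gcc -O2 -pthread clockcheck.c */
    #include <inttypes.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static void *worker(void *arg)
    {
        int64_t prev = 0;

        for (long i = 0; i < 100000000; i++) {
            struct timespec ts;
            int64_t now;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            now = ts.tv_sec * 1000000000LL + ts.tv_nsec;
            if (now < prev) {
                printf("went backwards: %" PRId64 " -> %" PRId64 "\n",
                       prev, now);
            }
            prev = now;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[4];

        for (int i = 0; i < 4; i++) {
            pthread_create(&threads[i], NULL, worker, NULL);
        }
        for (int i = 0; i < 4; i++) {
            pthread_join(threads[i], NULL);
        }
        return 0;
    }

If this reproduces the reversal, the problem is in the kernel or hardware
rather than in QEMU's tracing code.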

Stefan




Re: [PATCH] scripts/simpletrace: Mark output with unstable timestamp as WARN

2024-05-08 Thread Stefan Hajnoczi
On Wed, 8 May 2024 at 00:19, Zhao Liu  wrote:
>
> In some trace logs, there are unstable timestamps breaking the temporal
> ordering of trace records. For example:
>
> kvm_run_exit -0.015 pid=3289596 cpu_index=0x0 reason=0x6
> kvm_vm_ioctl -0.020 pid=3289596 type=0xc008ae67 arg=0x7ffeefb5aa60
> kvm_vm_ioctl -0.021 pid=3289596 type=0xc008ae67 arg=0x7ffeefb5aa60
>
> Negative delta intervals tend to get drowned in the massive trace logs,
> and an unstable timestamp can corrupt the calculation of intervals
> between two events adjacent to it.
>
> Therefore, mark the outputs with unstable timestamps as WARN like:

Why are the timestamps non-monotonic?

In a situation like that maybe not only the negative timestamps are
useless but even some positive timestamps are incorrect. I think it's
worth understanding the nature of the instability before merging a
fix.

Stefan

>
> WARN: skip unstable timestamp: kvm_run_exit 
> cur(8497404907761146)-pre(8497404907761161) pid=3289596 cpu_index=0x0 
> reason=0x6
> WARN: skip unstable timestamp: kvm_vm_ioctl 
> cur(8497404908603653)-pre(8497404908603673) pid=3289596 
> type=0xc008ae67 arg=0x7ffeefb5aa60
> WARN: skip unstable timestamp: kvm_vm_ioctl 
> cur(8497404908625787)-pre(8497404908625808) pid=3289596 
> type=0xc008ae67 arg=0x7ffeefb5aa60
>
> This would help to identify unusual events.
>
> And skip them without updating Formatter2.last_timestamp_ns to avoid
> time back.
>
> Signed-off-by: Zhao Liu 
> ---
>  scripts/simpletrace.py | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/scripts/simpletrace.py b/scripts/simpletrace.py
> index cef81b0707f0..23911dddb8a6 100755
> --- a/scripts/simpletrace.py
> +++ b/scripts/simpletrace.py
> @@ -343,6 +343,17 @@ def __init__(self):
>  def catchall(self, *rec_args, event, timestamp_ns, pid, event_id):
>  if self.last_timestamp_ns is None:
>  self.last_timestamp_ns = timestamp_ns
> +
> +if timestamp_ns < self.last_timestamp_ns:
> +fields = [
> +f'{name}={r}' if is_string(type) else f'{name}=0x{r:x}'
> +for r, (type, name) in zip(rec_args, event.args)
> +]
> +print(f'WARN: skip unstable timestamp: {event.name} '
> +  f'cur({timestamp_ns})-pre({self.last_timestamp_ns}) 
> {pid=} ' +
> +  f' '.join(fields))
> +return
> +
>  delta_ns = timestamp_ns - self.last_timestamp_ns
>  self.last_timestamp_ns = timestamp_ns
>
> --
> 2.34.1
>
>



[PATCH] qemu-io: add cvtnum() error handling for zone commands

2024-05-07 Thread Stefan Hajnoczi
cvtnum() parses positive int64_t values and returns a negative errno on
failure. Print errors and return early when cvtnum() fails.

While we're at it, also reject nr_zones values greater than or equal to 2^32
since they cannot be represented.

Reported-by: Peter Maydell 
Cc: Sam Li 
Signed-off-by: Stefan Hajnoczi 
---
 qemu-io-cmds.c | 48 +++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index f5d7202a13..e2fab57183 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1739,12 +1739,26 @@ static int zone_report_f(BlockBackend *blk, int argc, 
char **argv)
 {
 int ret;
 int64_t offset;
+int64_t val;
 unsigned int nr_zones;
 
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
-nr_zones = cvtnum(argv[optind]);
+val = cvtnum(argv[optind]);
+if (val < 0) {
+print_cvtnum_err(val, argv[optind]);
+return val;
+}
+if (val > UINT_MAX) {
+printf("Number of zones must be less than 2^32\n");
+return -ERANGE;
+}
+nr_zones = val;
 
 g_autofree BlockZoneDescriptor *zones = NULL;
 zones = g_new(BlockZoneDescriptor, nr_zones);
@@ -1780,8 +1794,16 @@ static int zone_open_f(BlockBackend *blk, int argc, char 
**argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
 if (ret < 0) {
 printf("zone open failed: %s\n", strerror(-ret));
@@ -1805,8 +1827,16 @@ static int zone_close_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
 if (ret < 0) {
 printf("zone close failed: %s\n", strerror(-ret));
@@ -1830,8 +1860,16 @@ static int zone_finish_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
 if (ret < 0) {
 printf("zone finish failed: %s\n", strerror(-ret));
@@ -1855,8 +1893,16 @@ static int zone_reset_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
 if (ret < 0) {
 printf("zone reset failed: %s\n", strerror(-ret));
-- 
2.45.0




Re: [PULL v2 03/16] block/block-backend: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2024-05-07 Thread Stefan Hajnoczi
On Fri, May 03, 2024 at 01:33:51PM +0100, Peter Maydell wrote:
> On Mon, 15 May 2023 at 17:07, Stefan Hajnoczi  wrote:
> >
> > From: Sam Li 
> >
> > Add zoned device option to host_device BlockDriver. It will be presented 
> > only
> > for zoned host block devices. By adding zone management operations to the
> > host_block_device BlockDriver, users can use the new block layer APIs
> > including Report Zone and four zone management operations
> > (open, close, finish, reset, reset_all).
> >
> > Qemu-io uses the new APIs to perform zoned storage commands of the device:
> > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > zone_finish(zf).
> >
> > For example, to test zone_report, use following command:
> > $ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
> > -c "zrp offset nr_zones"
> 
> Hi; Coverity points out an issue in this commit (CID 1544771):
> 
> > +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +int ret;
> > +int64_t offset;
> > +unsigned int nr_zones;
> > +
> > +++optind;
> > +offset = cvtnum(argv[optind]);
> > +++optind;
> > +nr_zones = cvtnum(argv[optind]);
> 
> cvtnum() can fail and return a negative value on error
> (e.g. if the number in the string is out of range),
> but we are not checking for that. Instead we stuff
> the value into an 'unsigned int' and then pass that to
> g_new(), which will result in our trying to allocate a large
> amount of memory.
> 
> Here, and also in the other functions below that use cvtnum(),
> I think we should follow the pattern for use of that function
> that is used in the pre-existing code in this function:
> 
>  int64_t foo; /* NB: not an unsigned or some smaller type */
> 
>  foo = cvtnum(arg)
>  if (foo < 0) {
>  print_cvtnum_err(foo, arg);
>  return foo; /* or otherwise handle returning an error upward */
>  }
> 
> It looks like all the uses of cvtnum in this patch should be
> adjusted to handle errors.

Thanks for letting me know. I will send a patch.

Stefan




[PATCH 2/2] aio: warn about iohandler_ctx special casing

2024-05-06 Thread Stefan Hajnoczi
The main loop has two AioContexts: qemu_aio_context and iohandler_ctx.
The main loop runs them both, but nested aio_poll() calls on
qemu_aio_context exclude iohandler_ctx.

Which one should qemu_get_current_aio_context() return when called from
the main loop? Document that it's always qemu_aio_context.

This has subtle effects on functions that use
qemu_get_current_aio_context(). For example, aio_co_reschedule_self()
does not work when moving from iohandler_ctx to qemu_aio_context because
qemu_get_current_aio_context() does not differentiate these two
AioContexts.

Document this in order to reduce the chance of future bugs.

Signed-off-by: Stefan Hajnoczi 
---
 include/block/aio.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/block/aio.h b/include/block/aio.h
index 8378553eb9..4ee81936ed 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -629,6 +629,9 @@ void aio_co_schedule(AioContext *ctx, Coroutine *co);
  *
  * Move the currently running coroutine to new_ctx. If the coroutine is already
  * running in new_ctx, do nothing.
+ *
+ * Note that this function cannot reschedule from iohandler_ctx to
+ * qemu_aio_context.
  */
 void coroutine_fn aio_co_reschedule_self(AioContext *new_ctx);
 
@@ -661,6 +664,9 @@ void aio_co_enter(AioContext *ctx, Coroutine *co);
  * If called from an IOThread this will be the IOThread's AioContext.  If
  * called from the main thread or with the "big QEMU lock" taken it
  * will be the main loop AioContext.
+ *
+ * Note that the return value is never the main loop's iohandler_ctx and the
+ * return value is the main loop AioContext instead.
  */
 AioContext *qemu_get_current_aio_context(void);
 
-- 
2.45.0




[PATCH 1/2] Revert "monitor: use aio_co_reschedule_self()"

2024-05-06 Thread Stefan Hajnoczi
Commit 1f25c172f837 ("monitor: use aio_co_reschedule_self()") was a code
cleanup that uses aio_co_reschedule_self() instead of open coding
coroutine rescheduling.

Bug RHEL-34618 was reported and Kevin Wolf  identified
the root cause. I missed that aio_co_reschedule_self() ->
qemu_get_current_aio_context() only knows about
qemu_aio_context/IOThread AioContexts and not about iohandler_ctx. It
does not function correctly when going back from the iohandler_ctx to
qemu_aio_context.

Go back to open coding the AioContext transitions to avoid this bug.

This reverts commit 1f25c172f83704e350c0829438d832384084a74d.

Buglink: https://issues.redhat.com/browse/RHEL-34618
Signed-off-by: Stefan Hajnoczi 
---
 qapi/qmp-dispatch.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
index f3488afeef..176b549473 100644
--- a/qapi/qmp-dispatch.c
+++ b/qapi/qmp-dispatch.c
@@ -212,7 +212,8 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList 
*cmds, QObject *requ
  * executing the command handler so that it can make progress if it
  * involves an AIO_WAIT_WHILE().
  */
-aio_co_reschedule_self(qemu_get_aio_context());
+aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
+qemu_coroutine_yield();
 }
 
 monitor_set_cur(qemu_coroutine_self(), cur_mon);
@@ -226,7 +227,9 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList 
*cmds, QObject *requ
  * Move back to iohandler_ctx so that nested event loops for
  * qemu_aio_context don't start new monitor commands.
  */
-aio_co_reschedule_self(iohandler_get_aio_context());
+aio_co_schedule(iohandler_get_aio_context(),
+qemu_coroutine_self());
+qemu_coroutine_yield();
 }
 } else {
/*
-- 
2.45.0




[PATCH 0/2] Revert "monitor: use aio_co_reschedule_self()"

2024-05-06 Thread Stefan Hajnoczi
This series fixes RHEL-34618 "qemu crash on Assertion `luringcb->co->ctx ==
s->aio_context' failed when do block_resize on hotplug disk with aio=io_uring":
https://issues.redhat.com/browse/RHEL-34618

Kevin identified commit 1f25c172f837 ("monitor: use aio_co_reschedule_self()")
as the root cause. There is a subtlety regarding how
qemu_get_current_aio_context() returns qemu_aio_context even though we may be
running in iohandler_ctx.

Revert commit 1f25c172f837, it was just intended as a code cleanup.

Stefan Hajnoczi (2):
  Revert "monitor: use aio_co_reschedule_self()"
  aio: warn about iohandler_ctx special casing

 include/block/aio.h | 6 ++
 qapi/qmp-dispatch.c | 7 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

-- 
2.45.0




Re: qemu-img cache modes with Linux cgroup v1

2024-05-06 Thread Stefan Hajnoczi
On Mon, May 06, 2024 at 08:10:25PM +0300, Alex Kalenyuk wrote:
> Hey, just FYI about tmpfs, during some development on Fedora 39 I noticed
> O_DIRECT is now supported on tmpfs (as opposed to our CI which runs CentOS
> 9 Stream).
> `qemu-img convert -t none -O raw tests/images/cirros-qcow2.img
> /tmp/cirros.raw`
> where /tmp is indeed a tmpfs.
> 
> I might be missing something so feel free to call that out

Yes, it was added by:

commit e88e0d366f9cfbb810b0c8509dc5d130d5a53e02
Author: Hugh Dickins 
Date:   Thu Aug 10 23:27:07 2023 -0700

tmpfs: trivial support for direct IO

It's fairly new but great to have.
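An easy way to check for O_DIRECT support on a given filesystem is a tiny
probe like this (sketch; the path is illustrative):

    #define _GNU_SOURCE /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* open() fails with EINVAL if the filesystem rejects O_DIRECT */
        int fd = open("/tmp/odirect-probe",
                      O_CREAT | O_WRONLY | O_DIRECT, 0600);

        if (fd < 0) {
            perror("open(O_DIRECT)");
            return 1;
        }
        close(fd);
        unlink("/tmp/odirect-probe");
        return 0;
    }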

Stefan




Re: [PATCH v2 5/5] monitor: use aio_co_reschedule_self()

2024-05-06 Thread Stefan Hajnoczi
On Fri, May 03, 2024 at 07:33:17PM +0200, Kevin Wolf wrote:
> Am 06.02.2024 um 20:06 hat Stefan Hajnoczi geschrieben:
> > The aio_co_reschedule_self() API is designed to avoid the race
> > condition between scheduling the coroutine in another AioContext and
> > yielding.
> > 
> > The QMP dispatch code uses the open-coded version that appears
> > susceptible to the race condition at first glance:
> > 
> >   aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
> >   qemu_coroutine_yield();
> > 
> > The code is actually safe because the iohandler and qemu_aio_context
> > AioContext run under the Big QEMU Lock. Nevertheless, set a good example
> > and use aio_co_reschedule_self() so it's obvious that there is no race.
> > 
> > Suggested-by: Hanna Reitz 
> > Reviewed-by: Manos Pitsidianakis 
> > Reviewed-by: Hanna Czenczek 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >  qapi/qmp-dispatch.c | 7 ++-
> >  1 file changed, 2 insertions(+), 5 deletions(-)
> > 
> > diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
> > index 176b549473..f3488afeef 100644
> > --- a/qapi/qmp-dispatch.c
> > +++ b/qapi/qmp-dispatch.c
> > @@ -212,8 +212,7 @@ QDict *coroutine_mixed_fn qmp_dispatch(const 
> > QmpCommandList *cmds, QObject *requ
> >   * executing the command handler so that it can make progress 
> > if it
> >   * involves an AIO_WAIT_WHILE().
> >   */
> > -aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
> > -qemu_coroutine_yield();
> > +aio_co_reschedule_self(qemu_get_aio_context());
> 
> Turns out that this one actually causes a regression. [1] This code is
> run in iohandler_ctx, aio_co_reschedule_self() looks at the new context
> and compares it with qemu_get_current_aio_context() - and because both
> are qemu_aio_context, it decides that it has nothing to do. So the
> command handler coroutine actually still runs in iohandler_ctx now,
> which is not what we want.
> 
> We could just revert this patch because it was only meant as a cleanup
> without a semantic difference.
> 
> Or aio_co_reschedule_self() could look at qemu_coroutine_self()->ctx
> instead of using qemu_get_current_aio_context(). That would be a little
> more indirect, though, and I'm not sure if co->ctx is always up to date.
> 
> Any opinions on what is the best way to fix this?

If the commit is reverted then similar bugs may be introduced again in
the future. The qemu_get_current_aio_context() API is unaware of
iohandler_ctx and this can lead to unexpected results.

I will send patches to revert the commit and add doc comments explaining
iohandler_ctx's special behavior. This will reduce, but not eliminate,
the risk of future bugs.

Modifying aio_co_reschedule_self() might be a better long-term fix, but
I'm afraid it will create more bugs because it will expose the subtle
distinction between the current coroutine AioContext and non-coroutine
AioContext in new places. I think the root cause is that iohandler_ctx
isn't a full-fledged AioContext with its own event loop. iohandler_ctx
is a special superset of qemu_aio_context that the main loop monitors.
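For context, the no-op case looks roughly like this (paraphrased sketch, not
the verbatim source):

    void coroutine_fn aio_co_reschedule_self(AioContext *new_ctx)
    {
        AioContext *old_ctx = qemu_get_current_aio_context();

        /*
         * When running in iohandler_ctx, old_ctx is still reported as
         * qemu_aio_context, so old_ctx == new_ctx and the coroutine is
         * never moved.
         */
        if (old_ctx != new_ctx) {
            aio_co_schedule(new_ctx, qemu_coroutine_self());
            qemu_coroutine_yield();
        }
    }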

Stefan

> 
> Kevin
> 
> [1] https://issues.redhat.com/browse/RHEL-34618
> 




Re: [PATCH v8 0/5] Support message-based DMA in vfio-user server

2024-05-06 Thread Stefan Hajnoczi
On Thu, 28 Mar 2024 at 03:54, Mattias Nissler  wrote:
>
> Stefan, to the best of my knowledge this is fully reviewed and ready
> to go in - can you kindly pick it up or advise in case there's
> something I missed? Thanks!

This code is outside the areas that I maintain. I think it would make
sense for Jag to merge it and send a pull request as vfio-user
maintainer.

Stefan



Re: [qemu-web PATCH] blog: KVM Forum 2024 CFP

2024-05-06 Thread Stefan Hajnoczi
The mistakes were mine. Thanks for pointing them out, Thomas!

Acked-by: Stefan Hajnoczi 



[PULL 1/1] hw/ufs: Fix buffer overflow bug

2024-04-29 Thread Stefan Hajnoczi
From: Jeuk Kim 

This fixes a buffer overflow vulnerability in the UFS device.
The bug was detected by sanitizers.

You can reproduce it by:

cat << EOF |\
qemu-system-x86_64 \
-display none -machine accel=qtest -m 512M -M q35 -nodefaults -drive \
file=null-co://,if=none,id=disk0 -device ufs,id=ufs_bus -device \
ufs-lu,drive=disk0,bus=ufs_bus -qtest stdio
outl 0xcf8 0x8810
outl 0xcfc 0xe000
outl 0xcf8 0x8804
outw 0xcfc 0x06
write 0xe058 0x1 0xa7
write 0xa 0x1 0x50
EOF

Resolves: #2299
Fixes: 329f16624499 ("hw/ufs: Support for Query Transfer Requests")
Reported-by: Zheyu Ma 
Signed-off-by: Jeuk Kim 
Signed-off-by: Stefan Hajnoczi 
Message-ID: 

---
 hw/ufs/ufs.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/ufs/ufs.c b/hw/ufs/ufs.c
index eccdb852a0..bac78a32bb 100644
--- a/hw/ufs/ufs.c
+++ b/hw/ufs/ufs.c
@@ -126,6 +126,10 @@ static MemTxResult ufs_dma_read_req_upiu(UfsRequest *req)
 copy_size = sizeof(UtpUpiuHeader) + UFS_TRANSACTION_SPECIFIC_FIELD_SIZE +
 data_segment_length;
 
+if (copy_size > sizeof(req->req_upiu)) {
+copy_size = sizeof(req->req_upiu);
+}
+
 ret = ufs_addr_read(u, req_upiu_base_addr, &req->req_upiu, copy_size);
 if (ret) {
 trace_ufs_err_dma_read_req_upiu(req->slot, req_upiu_base_addr);
@@ -225,6 +229,10 @@ static MemTxResult ufs_dma_write_rsp_upiu(UfsRequest *req)
 copy_size = rsp_upiu_byte_len;
 }
 
+if (copy_size > sizeof(req->rsp_upiu)) {
+copy_size = sizeof(req->rsp_upiu);
+}
+
 ret = ufs_addr_write(u, rsp_upiu_base_addr, &req->rsp_upiu, copy_size);
 if (ret) {
 trace_ufs_err_dma_write_rsp_upiu(req->slot, rsp_upiu_base_addr);
-- 
2.44.0




[PULL 0/1] Block patches

2024-04-29 Thread Stefan Hajnoczi
The following changes since commit fd87be1dada5672f877e03c2ca8504458292c479:

  Merge tag 'accel-20240426' of https://github.com/philmd/qemu into staging 
(2024-04-26 15:28:13 -0700)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to d1c4580662bf75bf6875bb5e1ad446b300816ac7:

  hw/ufs: Fix buffer overflow bug (2024-04-29 09:33:06 -0400)


Pull request

Buffer overflow fix for Universal Flash Storage (UFS) emulation.



Jeuk Kim (1):
  hw/ufs: Fix buffer overflow bug

 hw/ufs/ufs.c | 8 
 1 file changed, 8 insertions(+)

-- 
2.44.0




Re: [PULL 0/1] ufs queue

2024-04-29 Thread Stefan Hajnoczi
On Mon, Apr 29, 2024 at 12:25:37PM +0900, Jeuk Kim wrote:
> From: Jeuk Kim 
> 
> The following changes since commit fd87be1dada5672f877e03c2ca8504458292c479:
> 
>   Merge tag 'accel-20240426' of https://github.com/philmd/qemu into staging 
> (2024-04-26 15:28:13 -0700)
> 
> are available in the Git repository at:
> 
>   https://gitlab.com/jeuk20.kim/qemu.git tags/pull-ufs-20240429
> 
> for you to fetch changes up to f2c8aeb1afefcda92054c448b21fc59cdd99db30:
> 
>   hw/ufs: Fix buffer overflow bug (2024-04-29 12:13:35 +0900)
> 
> 
> ufs queue
> 
> - Fix ufs sanitizer vulnerability
> 
> 
> Jeuk Kim (1):
>   hw/ufs: Fix buffer overflow bug
> 
>  hw/ufs/ufs.c | 8 
>  1 file changed, 8 insertions(+)
> 

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

It will be included in my next block pull request.

You are welcome to send pull requests directly to the qemu.git/master
maintainer (Richard Henderson is on duty for this release cycle). If you
do that, make sure to GPG sign your pull request.

Stefan




Re: [PATCH for-9.1] util/log: add cleanup function

2024-04-22 Thread Stefan Hajnoczi
On Wed, Apr 17, 2024 at 09:33:33PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We leak global_filename, and do not close global_file. Let's fix that.

What is the goal?

Leaking global_filename does not cause unbounded memory consumption. I
guess the goal in freeing global_filename is to keep leak checker
reports tidy?

Closing global_file doesn't improve anything AFAICT. It might cause
problems if another component still wants to log something from a
destructor function. I'm not sure if the order of destructors is
defined.

What about qemu_mutex_destroy(&global_mutex) to balance startup()?

What about debug_regions?

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> Interesting: seems, nobody is maintainer of util/log.c
> 
>  util/log.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/util/log.c b/util/log.c
> index d36c98da0b..30de209210 100644
> --- a/util/log.c
> +++ b/util/log.c
> @@ -85,6 +85,15 @@ static void qemu_log_thread_cleanup(Notifier *n, void 
> *unused)
>  }
>  }
>  
> +static void __attribute__((__destructor__)) cleanup(void)
> +{
> +g_free(global_filename);
> +if (global_file && global_file != stderr) {
> +fclose(global_file);
> +global_file = NULL;
> +}
> +}
> +
>  /* Lock/unlock output. */
>  
>  static FILE *qemu_log_trylock_with_err(Error **errp)
> -- 
> 2.34.1
> 




Re: [PATCH v2] tests/unit: Remove debug statements in test-nested-aio-poll.c

2024-04-22 Thread Stefan Hajnoczi
On Mon, Apr 22, 2024 at 01:22:46PM +0200, Philippe Mathieu-Daudé wrote:
> We have been running this test for almost a year; it
> is safe to remove its debug statements, which clutter
> CI job output:
> 
>   ▶  88/100 /nested-aio-poll  OK
>   io_read 0x16bb26158
>   io_poll_true 0x16bb26158
>   > io_poll_ready
>   io_read 0x16bb26164
>   < io_poll_ready
>   io_poll_true 0x16bb26158
>   io_poll_false 0x16bb26164
>   > io_poll_ready
>   io_poll_false 0x16bb26164
>   io_poll_false 0x16bb26164
>   io_poll_false 0x16bb26164
>   io_poll_false 0x16bb26164
>   io_poll_false 0x16bb26164
>   io_poll_false 0x16bb26164
>   io_poll_false 0x16bb26164
>   io_poll_false 0x16bb26164
>   io_poll_false 0x16bb26164
>   io_read 0x16bb26164
>   < io_poll_ready
>   88/100 qemu:unit / test-nested-aio-pollOK
> 
> Reviewed-by: Eric Blake 
> Reviewed-by: Richard Henderson 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  tests/unit/test-nested-aio-poll.c | 7 ---
>  1 file changed, 7 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH v3 03/27] util/hexdump: Use a GString for qemu_hexdump_line

2024-04-13 Thread Stefan Hajnoczi
On Sat, 13 Apr 2024 at 05:46, Philippe Mathieu-Daudé  wrote:
>
> On 12/4/24 20:59, Richard Henderson wrote:
> > On 4/12/24 10:41, Philippe Mathieu-Daudé wrote:
> >>> -void qemu_hexdump_line(char *line, const void *bufptr, size_t len)
> >>> +GString *qemu_hexdump_line(GString *str, const void *vbuf, size_t len)
> >>>   {
> >>> -const char *buf = bufptr;
> >>> -int i, c;
> >>> +const uint8_t *buf = vbuf;
> >>> +size_t i;
> >>> -if (len > QEMU_HEXDUMP_LINE_BYTES) {
> >>> -len = QEMU_HEXDUMP_LINE_BYTES;
> >>> +if (str == NULL) {
> >>> +/* Estimate the length of the output to avoid reallocs. */
> >>> +i = len * 3 + len / 4;
> >>> +str = g_string_sized_new(i + 1);
> >>>   }
> >>
> >> [*]
> >>   else {
> >> g_string_truncate(str, 0);
> >>   }
> >>
> > ...
> >>> @@ -49,24 +52,26 @@ static void asciidump_line(char *line, const void
> >>> *bufptr, size_t len)
> >>>   *line = '\0';
> >>>   }
> >>> +#define QEMU_HEXDUMP_LINE_BYTES 16
> >>>   #define QEMU_HEXDUMP_LINE_WIDTH \
> >>>   (QEMU_HEXDUMP_LINE_BYTES * 2 + QEMU_HEXDUMP_LINE_BYTES / 4)
> >>>   void qemu_hexdump(FILE *fp, const char *prefix,
> >>> const void *bufptr, size_t size)
> >>>   {
> >>> -char line[QEMU_HEXDUMP_LINE_LEN];
> >>> +g_autoptr(GString) str =
> >>> g_string_sized_new(QEMU_HEXDUMP_LINE_WIDTH + 1);
> >>>   char ascii[QEMU_HEXDUMP_LINE_BYTES + 1];
> >>>   size_t b, len;
> >>>   for (b = 0; b < size; b += len) {
> >>>   len = MIN(size - b, QEMU_HEXDUMP_LINE_BYTES);
> >>> -qemu_hexdump_line(line, bufptr + b, len);
> >>> +g_string_truncate(str, 0);
> >>
> >> Shouldn't we truncate in [*] ?
> >
> > The usage in tpm puts several lines together in one string,
> > adding \n in between, for output in one go.
>
> I see the trace_tpm_util_show_buffer() call. However this
> isn't a recommended use of the tracing API (Cc'ing Stefan).
> It breaks the "log" backend output, and is sub-optimal for
> all other backends.
>
> IMHO the TPM buffer should be traced by multiple calls of
> (offset, hexbuf) instead.

I think so too.
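Roughly like this (untested sketch; trace_tpm_util_show_buffer_line() is a
hypothetical per-line trace event, and this uses the new
qemu_hexdump_line() signature from the patch under review):

    for (size_t off = 0; off < size; off += 16) {
        g_autoptr(GString) line =
            qemu_hexdump_line(NULL, buf + off, MIN(size - off, 16));

        trace_tpm_util_show_buffer_line(off, line->str);
    }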

Stefan



[PULL for-9.0 0/1] Block patches

2024-04-04 Thread Stefan Hajnoczi
The following changes since commit 786fd793b81410fb2a28914315e2f05d2ff6733b:

  Merge tag 'for-upstream' of https://gitlab.com/bonzini/qemu into staging 
(2024-04-03 12:52:03 +0100)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to bbdf9023665f409113cb07b463732861af63fb47:

  block/virtio-blk: Fix memory leak from virtio_blk_zone_report (2024-04-04 
09:29:42 -0400)


Pull request

Fix a memory leak in virtio-blk zone report emulation code when the request is
invalid.



Zheyu Ma (1):
  block/virtio-blk: Fix memory leak from virtio_blk_zone_report

 hw/block/virtio-blk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

-- 
2.44.0




[PULL for-9.0 1/1] block/virtio-blk: Fix memory leak from virtio_blk_zone_report

2024-04-04 Thread Stefan Hajnoczi
From: Zheyu Ma 

This modification ensures that in scenarios where the buffer size is
insufficient for a zone report, the function will now properly set an
error status and proceed to a cleanup label, instead of merely
returning.

The following ASAN log reveals it:

==1767400==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 312 byte(s) in 1 object(s) allocated from:
#0 0x64ac7b3280cd in malloc 
llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:129:3
#1 0x735b02fb9738 in g_malloc 
(/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x5e738)
#2 0x64ac7d23be96 in virtqueue_split_pop hw/virtio/virtio.c:1612:12
#3 0x64ac7d23728a in virtqueue_pop hw/virtio/virtio.c:1783:16
#4 0x64ac7cfcaacd in virtio_blk_get_request hw/block/virtio-blk.c:228:27
#5 0x64ac7cfca7c7 in virtio_blk_handle_vq hw/block/virtio-blk.c:1123:23
#6 0x64ac7cfecb95 in virtio_blk_handle_output hw/block/virtio-blk.c:1157:5

Signed-off-by: Zheyu Ma 
Message-id: 20240404120040.1951466-1-zheyum...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 hw/block/virtio-blk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 92de315f17..bb86e65f65 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -768,7 +768,8 @@ static void virtio_blk_handle_zone_report(VirtIOBlockReq 
*req,
 sizeof(struct virtio_blk_zone_report) +
 sizeof(struct virtio_blk_zone_descriptor)) {
 virtio_error(vdev, "in buffer too small for zone report");
-return;
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
 }
 
 /* start byte offset of the zone report */
-- 
2.44.0




Re: [PATCH] block/virtio-blk: Fix memory leak from virtio_blk_zone_report

2024-04-04 Thread Stefan Hajnoczi
On Thu, Apr 04, 2024 at 02:00:40PM +0200, Zheyu Ma wrote:
> This modification ensures that in scenarios where the buffer size is
> insufficient for a zone report, the function will now properly set an
> error status and proceed to a cleanup label, instead of merely
> returning.
> 
> The following ASAN log reveals it:
> 
> ==1767400==ERROR: LeakSanitizer: detected memory leaks
> Direct leak of 312 byte(s) in 1 object(s) allocated from:
> #0 0x64ac7b3280cd in malloc 
> llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:129:3
> #1 0x735b02fb9738 in g_malloc 
> (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x5e738)
> #2 0x64ac7d23be96 in virtqueue_split_pop hw/virtio/virtio.c:1612:12
> #3 0x64ac7d23728a in virtqueue_pop hw/virtio/virtio.c:1783:16
> #4 0x64ac7cfcaacd in virtio_blk_get_request hw/block/virtio-blk.c:228:27
> #5 0x64ac7cfca7c7 in virtio_blk_handle_vq hw/block/virtio-blk.c:1123:23
> #6 0x64ac7cfecb95 in virtio_blk_handle_output hw/block/virtio-blk.c:1157:5
> 
> Signed-off-by: Zheyu Ma 
> ---
>  hw/block/virtio-blk.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index 92de315f17..bb86e65f65 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -768,7 +768,8 @@ static void virtio_blk_handle_zone_report(VirtIOBlockReq 
> *req,
>  sizeof(struct virtio_blk_zone_report) +
>  sizeof(struct virtio_blk_zone_descriptor)) {
>  virtio_error(vdev, "in buffer too small for zone report");
> -return;
> +err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
> +goto out;
>  }
>  
>  /* start byte offset of the zone report */
> -- 
> 2.34.1
> 

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 13/19] hw/virtio-blk: fix -Werror=maybe-uninitialized false-positive

2024-04-02 Thread Stefan Hajnoczi
On Thu, Mar 28, 2024 at 02:20:46PM +0400, marcandre.lur...@redhat.com wrote:
> From: Marc-André Lureau 
> 
> ../hw/block/virtio-blk.c:1212:12: error: ‘rq’ may be used uninitialized 
> [-Werror=maybe-uninitialized]
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  hw/block/virtio-blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH 01/19] util/coroutine: fix -Werror=maybe-uninitialized false-positive

2024-04-02 Thread Stefan Hajnoczi
On Thu, Mar 28, 2024 at 02:20:34PM +0400, marcandre.lur...@redhat.com wrote:
> From: Marc-André Lureau 
> 
> ../util/qemu-coroutine.c:150:8: error: ‘batch’ may be used uninitialized 
> [-Werror=maybe-uninitialized]
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  util/qemu-coroutine.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: Backdoor in xz, should we switch compression format for tarballs?

2024-03-30 Thread Stefan Hajnoczi
On Fri, 29 Mar 2024 at 14:00, Paolo Bonzini  wrote:
>
> For more info, see 
> https://lwn.net/ml/oss-security/20240329155126.kjjfduxw2yrlx...@awork3.anarazel.de/
>  but, essentially, xz was backdoored and it seems like upstream was directly 
> responsible for this.
>
> Based on this, should we switch our distribution from bz2+xz to bz2+zstd or 
> bz2+lzip?

I think it's reasonable to drop xz as a precaution due to the
long-term control the attacker may have had over the code base. I
haven't researched the alternatives though.

I CCed Michael Tokarev because he looked at compression formats for
distributing QEMU recently and may have thoughts on which alternative
is suitable.

For the record, I confirmed that the following QEMU servers do not
have xz-utils 5.6.0 or 5.6.1 packages installed:
- shell1.qemu.org
- node1.qemu.org
- ci1 at OSUOSL
- qemu2.osuosl.org

Stefan



bdrv_pad_request() Coverity report

2024-03-27 Thread Stefan Hajnoczi
Hi Fiona,
The Coverity static checker sent a report about commit 3f934817c82c
("block/io: accept NULL qiov in bdrv_pad_request").

Please take a look and send a follow-up patch, if necessary:

*** CID 1542668:  Null pointer dereferences  (REVERSE_INULL)
/builds/qemu-project/qemu/block/io.c: 1733 in bdrv_pad_request()
1727 }
1728
1729 /*
1730  * For prefetching in stream_populate(), no qiov is passed along, because
1731  * only copy-on-read matters.
1732  */
>>> CID 1542668:  Null pointer dereferences  (REVERSE_INULL)
>>> Null-checking "qiov" suggests that it may be null, but it has already 
>>> been dereferenced on all paths leading to the check.
1733 if (qiov && *qiov) {
1734 sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
1735   &sliced_head, &sliced_tail,
1736   &sliced_niov);
1737
1738 /* Guaranteed by bdrv_check_request32() */

Thanks,
Stefan



Re: [PATCH v3 0/4] fix two edge cases related to stream block jobs

2024-03-25 Thread Stefan Hajnoczi
On Fri, Mar 22, 2024 at 10:50:05AM +0100, Fiona Ebner wrote:
> Changes in v3:
> * Also deal with edge case in bdrv_next_cleanup(). Haven't run
>   into an actual issue there, but at least the caller in
>   migration/block.c uses bdrv_nb_sectors() which, while not a
>   coroutine wrapper itself (it's written manually), may call
>   bdrv_refresh_total_sectors(), which is a generated coroutine
>   wrapper, so AFAIU, the block graph can change during that call.
>   And even without that, it's just better to be more consistent
>   with bdrv_next().
> 
> Changes in v2:
> * Ran into another issue while writing the IO test Stefan wanted
>   to have (good call :)), so include a fix for that and add the
>   test. I didn't notice during manual testing, because I hadn't
>   used a scripted QMP 'quit', so there was no race.
> 
> Fiona Ebner (3):
>   block-backend: fix edge case in bdrv_next() where BDS associated to BB
> changes
>   block-backend: fix edge case in bdrv_next_cleanup() where BDS
> associated to BB changes
>   iotests: add test for stream job with an unaligned prefetch read
> 
> Stefan Reiter (1):
>   block/io: accept NULL qiov in bdrv_pad_request
> 
>  block/block-backend.c | 18 ++--
>  block/io.c| 31 ---
>  .../tests/stream-unaligned-prefetch   | 86 +++
>  .../tests/stream-unaligned-prefetch.out   |  5 ++
>  4 files changed, 117 insertions(+), 23 deletions(-)
>  create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
>  create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

Looks good to me. I will wait until Thursday before merging in case
Hanna, Vladimir, or Kevin have comments. Thanks!

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read

2024-03-25 Thread Stefan Hajnoczi
On Fri, Mar 22, 2024 at 10:50:09AM +0100, Fiona Ebner wrote:
> Previously, bdrv_pad_request() could not deal with a NULL qiov when
> a read needed to be aligned. During prefetch, a stream job will pass a
> NULL qiov. Add a test case to cover this scenario.
> 
> By accident, also covers a previous race during shutdown, where block
> graph changes during iteration in bdrv_flush_all() could lead to
> unreferencing the wrong block driver state and an assertion failure
> later.
> 
> Signed-off-by: Fiona Ebner 
> ---
> 
> No changes in v3.
> New in v2.
> 
>  .../tests/stream-unaligned-prefetch   | 86 +++
>  .../tests/stream-unaligned-prefetch.out   |  5 ++
>  2 files changed, 91 insertions(+)
>  create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
>  create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes

2024-03-25 Thread Stefan Hajnoczi
On Fri, Mar 22, 2024 at 10:50:08AM +0100, Fiona Ebner wrote:
> Same rationale as for commit "block-backend: fix edge case in
> bdrv_next() where BDS associated to BB changes". The block graph might
> change between the bdrv_next() call and the bdrv_next_cleanup() call,
> so it could be that the associated BDS is not the same that was
> referenced previously anymore. Instead, rely on bdrv_next() to set
> it->bs to the BDS it referenced and unreference that one in any case.
> 
> Signed-off-by: Fiona Ebner 
> ---
> 
> New in v3.
> 
>  block/block-backend.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-25 Thread Stefan Hajnoczi
On Fri, Mar 22, 2024 at 10:50:07AM +0100, Fiona Ebner wrote:
> The old_bs variable in bdrv_next() is currently determined by looking
> at the old block backend. However, if the block graph changes before
> the next bdrv_next() call, it might be that the associated BDS is not
> the same that was referenced previously. In that case, the wrong BDS
> is unreferenced, leading to an assertion failure later:
> 
> > bdrv_unref: Assertion `bs->refcnt > 0' failed.
> 
> In particular, this can happen in the context of bdrv_flush_all(),
> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> a graph change (for example with a stream block job [0]).
> 
> A racy reproducer:
> 
> > #!/bin/bash
> > rm -f /tmp/backing.qcow2
> > rm -f /tmp/top.qcow2
> > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> > ./qemu-system-x86_64 --qmp stdio \
> > --blockdev 
> > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> > <<EOF
> > {"execute": "qmp_capabilities"}
> > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": 
> > "node0" } }
> > {"execute": "quit"}
> > EOF
> 
> [0]:
> 
> > #0  bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> > #1  bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., 
> > errp=...)
> > #2  bdrv_replace_node_common (from=..., to=..., auto_skip=..., 
> > detach_subchain=..., errp=...)
> > #3  bdrv_drop_filter (bs=..., errp=...)
> > #4  bdrv_cor_filter_drop (cor_filter_bs=...)
> > #5  stream_prepare (job=...)
> > #6  job_prepare_locked (job=...)
> > #7  job_txn_apply_locked (fn=..., job=...)
> > #8  job_do_finalize_locked (job=...)
> > #9  job_exit (opaque=...)
> > #10 aio_bh_poll (ctx=...)
> > #11 aio_poll (ctx=..., blocking=...)
> > #12 bdrv_poll_co (s=...)
> > #13 bdrv_flush (bs=...)
> > #14 bdrv_flush_all ()
> > #15 do_vm_stop (state=..., send_stop=...)
> > #16 vm_shutdown ()
> 
> Signed-off-by: Fiona Ebner 
> ---
> 
> No changes in v3.
> New in v2.
> 
>  block/block-backend.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request

2024-03-25 Thread Stefan Hajnoczi
On Fri, Mar 22, 2024 at 10:50:06AM +0100, Fiona Ebner wrote:
> From: Stefan Reiter 
> 
> Some operations, e.g. block-stream, perform reads while discarding the
> results (only copy-on-read matters). In this case, they will pass NULL
> as the target QEMUIOVector, which will however trip bdrv_pad_request,
> since it wants to extend its passed vector. In particular, this is the
> case for the blk_co_preadv() call in stream_populate().
> 
> If there is no qiov, no operation can be done with it, but the bytes
> and offset still need to be updated, so the subsequent aligned read
> will actually be aligned and not run into an assertion failure.
> 
> In particular, this can happen when the request alignment of the top
> node is larger than the allocated part of the bottom node, in which
> case padding becomes necessary. For example:
> 
> > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> > ./qemu-system-x86_64 --qmp stdio \
> > --blockdev 
> > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> > <<EOF
> > {"execute": "qmp_capabilities"}
> > {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": 
> > "node0", "node-name": "node1" } }
> > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": 
> > "node1" } }
> > EOF
> 
> Originally-by: Stefan Reiter 
> Signed-off-by: Thomas Lamprecht 
> [FE: do update bytes and offset in any case
>  add reproducer to commit message]
> Signed-off-by: Fiona Ebner 
> ---
> 
> No changes in v3.
> No changes in v2.
> 
>  block/io.c | 31 +++
>  1 file changed, 19 insertions(+), 12 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup

2024-03-25 Thread Stefan Hajnoczi
On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> If g_main_loop_run()/aio_poll() is called in the coroutine context,
> the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> may be disordered.

aio_poll() must not be called from coroutine context:

  bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
   ^^^

Coroutines are not supposed to block. Instead, they should yield.

> When the poll() syscall returns in g_main_loop_run()/aio_poll(), it means
> some monitored events have completed. Therefore, the completion callback
> function is dispatched.
> 
> If this callback function needs to invoke aio_co_enter(), it will only
> wake up the coroutine (because we are already in coroutine context),
> which may mean that the data on this monitored event_fd/socket_fd
> is not read/cleared. When the next poll() exits, the coroutine will be
> woken up again and inserted into the wakeup queue again.
> 
> For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> in coroutine context, and repeatedly wake up the io_read event on a socket.
> The call stack is as follows:
> 
> aio_co_enter()
> aio_co_wake()
> qio_channel_restart_read()
> aio_dispatch_handler()
> aio_dispatch_handlers()
> aio_dispatch()
> aio_ctx_dispatch()
> g_main_context_dispatch()
> g_main_loop_run()
> nbd_negotiate_handle_starttls()

This code does not look like it was designed to run in coroutine
context. Two options:

1. Don't run it in coroutine context (e.g. use a BH instead). This
   avoids blocking the coroutine but calling g_main_loop_run() is still
   ugly, in my opinion.

2. Get rid of data.loop and use coroutine APIs instead:

   while (!data.complete) {
       qemu_coroutine_yield();
   }

   and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
   of g_main_loop_quit(data->loop).

   This requires auditing the code to check whether the event loop might
   invoke something that interferes with
   nbd_negotiate_handle_starttls(). Typically this means monitor
   commands or fd activity that could change the state of this
   connection while it is yielded. This is where the real work is but
   hopefully it will not be that hard to figure out.
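
   A rough sketch of option 2 (untested; it assumes a Coroutine *co field
   replaces the GMainLoop *loop field in the handshake data struct, and
   that the callback keeps its current QIOTask signature):

   static void nbd_tls_handshake(QIOTask *task, void *opaque)
   {
       struct NBDTLSHandshakeData *data = opaque;

       qio_task_propagate_error(task, &data->error);
       data->complete = true;
       aio_co_wake(data->co); /* instead of g_main_loop_quit(data->loop) */
   }

   /* in the coroutine that currently spins g_main_loop_run() */
   data.co = qemu_coroutine_self();
   qio_channel_tls_handshake(tioc, nbd_tls_handshake, &data, NULL, NULL);
   while (!data.complete) {
       qemu_coroutine_yield();
   }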

> nbd_negotiate_options()
> nbd_negotiate()
> nbd_co_client_start()
> coroutine_trampoline()
> 
> Signed-off-by: zhuyangyang 
> ---
>  util/async.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/util/async.c b/util/async.c
> index 0467890052..25fc1e6083 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -705,7 +705,18 @@ void aio_co_enter(AioContext *ctx, Coroutine *co)
>  if (qemu_in_coroutine()) {
>  Coroutine *self = qemu_coroutine_self();
>  assert(self != co);
> -QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> +/*
> + * If the Coroutine *co is already in the co_queue_wakeup, this
> + * repeated insertion will cause the loss of other queue elements
> + * or an infinite loop.
> + * For example:
> + * Head->a->b->c->NULL, after insert_tail(head, b) => Head->a->b->NULL
> + * Head->a->b->NULL, after insert_tail(head, b) => Head->a->b->b...
> + */
> +if (!co->co_queue_next.sqe_next &&
> +self->co_queue_wakeup.sqh_last != &co->co_queue_next.sqe_next) {
> +QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> +}
>  } else {
>  qemu_aio_coroutine_enter(ctx, co);
>  }
> -- 
> 2.33.0
> 


signature.asc
Description: PGP signature


[PULL for-9.0 0/1] Block patches

2024-03-21 Thread Stefan Hajnoczi
The following changes since commit fea445e8fe9acea4f775a832815ee22bdf2b0222:

  Merge tag 'pull-maintainer-final-for-real-this-time-200324-1' of 
https://gitlab.com/stsquad/qemu into staging (2024-03-21 10:31:56 +)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 9352f80cd926fe2dde7c89b93ee33bb0356ff40e:

  coroutine: reserve 5,000 mappings (2024-03-21 13:14:30 -0400)


Pull request

I was too quick in sending the coroutine pool sizing change for -rc0 and still
needed to address feedback from Daniel Berrangé.



Stefan Hajnoczi (1):
  coroutine: reserve 5,000 mappings

 util/qemu-coroutine.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

-- 
2.44.0




[PULL for-9.0 1/1] coroutine: reserve 5,000 mappings

2024-03-21 Thread Stefan Hajnoczi
Daniel P. Berrangé  pointed out that the coroutine
pool size heuristic is very conservative. Instead of halving
max_map_count, he suggested reserving 5,000 mappings for non-coroutine
users based on observations of guests he has access to.

Fixes: 86a637e48104 ("coroutine: cap per-thread local pool size")
Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Daniel P. Berrangé 
Message-id: 20240320181232.1464819-1-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/qemu-coroutine.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 2790959eaf..eb4eebefdf 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -377,12 +377,17 @@ static unsigned int get_global_pool_hard_max_size(void)
 NULL) &&
 qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
 /*
- * This is a conservative upper bound that avoids exceeding
- * max_map_count. Leave half for non-coroutine users like library
- * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
- * halve the amount again.
+ * This is an upper bound that avoids exceeding max_map_count. Leave a
+ * fixed amount for non-coroutine users like library dependencies,
+ * vhost-user, etc. Each coroutine takes up 2 VMAs so halve the
+ * remaining amount.
  */
-return max_map_count / 4;
+if (max_map_count > 5000) {
+return (max_map_count - 5000) / 2;
+} else {
+/* Disable the global pool but threads still have local pools */
+return 0;
+}
 }
 #endif
 
-- 
2.44.0




Re: [PATCH] coroutine: reserve 5,000 mappings

2024-03-21 Thread Stefan Hajnoczi
On Wed, Mar 20, 2024 at 02:12:32PM -0400, Stefan Hajnoczi wrote:
> Daniel P. Berrangé  pointed out that the coroutine
> pool size heuristic is very conservative. Instead of halving
> max_map_count, he suggested reserving 5,000 mappings for non-coroutine
> users based on observations of guests he has access to.
> 
> Fixes: 86a637e48104 ("coroutine: cap per-thread local pool size")
> Signed-off-by: Stefan Hajnoczi 
> ---
>  util/qemu-coroutine.c | 15 ++-
>  1 file changed, 10 insertions(+), 5 deletions(-)

Discussed with Kevin and applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] coroutine: cap per-thread local pool size

2024-03-21 Thread Stefan Hajnoczi
On Thu, 21 Mar 2024 at 08:22, Kevin Wolf  wrote:
>
> Am 20.03.2024 um 15:09 hat Daniel P. Berrangé geschrieben:
> > On Wed, Mar 20, 2024 at 09:35:39AM -0400, Stefan Hajnoczi wrote:
> > > On Tue, Mar 19, 2024 at 08:10:49PM +, Daniel P. Berrangé wrote:
> > > > On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > > > > On Tue, Mar 19, 2024 at 01:43:32PM +, Daniel P. Berrangé wrote:
> > > > > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > > > > --- a/util/qemu-coroutine.c
> > > > > > > +++ b/util/qemu-coroutine.c
> > > > > >
> > > > > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > > > > +{
> > > > > > > +#ifdef __linux__
> > > > > > > +g_autofree char *contents = NULL;
> > > > > > > +int max_map_count;
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Linux processes can have up to max_map_count virtual 
> > > > > > > memory areas
> > > > > > > + * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond 
> > > > > > > this limit. We
> > > > > > > + * must limit the coroutine pool to a safe size to avoid 
> > > > > > > running out of
> > > > > > > + * VMAs.
> > > > > > > + */
> > > > > > > +if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > > > > +NULL) &&
> > > > > > > +qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > > > > +/*
> > > > > > > + * This is a conservative upper bound that avoids 
> > > > > > > exceeding
> > > > > > > + * max_map_count. Leave half for non-coroutine users 
> > > > > > > like library
> > > > > > > + * dependencies, vhost-user, etc. Each coroutine takes 
> > > > > > > up 2 VMAs so
> > > > > > > + * halve the amount again.
> > > >
> > > > Leaving half for loaded libraries, etc is quite conservative
> > > > if max_map_count is the small-ish 64k default.
> > > >
> > > > That reservation could perhaps be a fixed number like 5,000?
> > >
> > > While I don't want QEMU to abort, once this heuristic is in the code it
> > > will be scary to make it more optimistic and we may never change it. So
> > > now is the best time to try 5,000.
> > >
> > > I'll send a follow-up patch that reserves 5,000 mappings. If that turns
> > > out to be too optimistic we can increase the reservation.
> >
> > BTW, I suggested 5,000, because I looked at a few QEM processes I have
> > running on Fedora and saw just under 1,000 lines in /proc/$PID/maps,
> > of which only a subset is library mappings. So multiplying that x5 felt
> > like a fairly generous overhead for more complex build configurations.
>
> On my system, the boring desktop VM with no special hardware or other
> advanced configuration takes ~1500 mappings, most of which are
> libraries. I'm not concerned about the library mappings, it's unlikely
> that we'll double the number of libraries soon.
>
> But I'm not sure about dynamic mappings outside of coroutines, maybe
> when enabling features my simple desktop VM doesn't even use at all. If
> we're sure that nothing else uses any number worth mentioning, fine with
> me. But I couldn't tell.
>
> Staying in the area we know reasonably well, how many libblkio bounce
> buffers could be in use at the same time? I think each one is an
> individual mmap(), right?

libblkio's mapping requirements are similar to vhost-user. There is
one general-purpose bounce buffer mapping plus a mapping for each QEMU
RAMBlock. That means the number of in-flight I/Os does not directly
influence the number of mappings.

Stefan



[PATCH] coroutine: reserve 5,000 mappings

2024-03-20 Thread Stefan Hajnoczi
Daniel P. Berrangé  pointed out that the coroutine
pool size heuristic is very conservative. Instead of halving
max_map_count, he suggested reserving 5,000 mappings for non-coroutine
users based on observations of guests he has access to.

Fixes: 86a637e48104 ("coroutine: cap per-thread local pool size")
Signed-off-by: Stefan Hajnoczi 
---
 util/qemu-coroutine.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 2790959eaf..eb4eebefdf 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -377,12 +377,17 @@ static unsigned int get_global_pool_hard_max_size(void)
 NULL) &&
 qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
 /*
- * This is a conservative upper bound that avoids exceeding
- * max_map_count. Leave half for non-coroutine users like library
- * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
- * halve the amount again.
+ * This is an upper bound that avoids exceeding max_map_count. Leave a
+ * fixed amount for non-coroutine users like library dependencies,
+ * vhost-user, etc. Each coroutine takes up 2 VMAs so halve the
+ * remaining amount.
  */
-return max_map_count / 4;
+if (max_map_count > 5000) {
+return (max_map_count - 5000) / 2;
+} else {
+/* Disable the global pool but threads still have local pools */
+return 0;
+}
 }
 #endif
 
-- 
2.44.0




Re: [PATCH] libqos/virtio.c: Correct 'flags' reading in qvirtqueue_kick

2024-03-20 Thread Stefan Hajnoczi
On Wed, 20 Mar 2024 at 09:10, Zheyu Ma  wrote:
>
> In qvirtqueue_kick(), the 'flags' were previously being incorrectly read from
> vq->avail instead of the correct vq->used location. This update ensures 
> 'flags'
> are read from the correct location as per the virtio standard.
>
> Signed-off-by: Zheyu Ma 
> ---
>  tests/qtest/libqos/virtio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 
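
For reference, the notification-suppression flag that qvirtqueue_kick()
wants is published by the device in the used ring of the split virtqueue
layout (as in <linux/virtio_ring.h>, types abbreviated here):

    struct vring_used {
        uint16_t flags; /* e.g. VRING_USED_F_NO_NOTIFY */
        uint16_t idx;
        struct vring_used_elem ring[];
    };

Reading relative to vq->avail instead returned the driver's own avail->flags
word, which holds VRING_AVAIL_F_NO_INTERRUPT rather than the device's
notification-suppression state.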

> diff --git a/tests/qtest/libqos/virtio.c b/tests/qtest/libqos/virtio.c
> index 82a6e122bf..a21b6eee9c 100644
> --- a/tests/qtest/libqos/virtio.c
> +++ b/tests/qtest/libqos/virtio.c
> @@ -394,7 +394,7 @@ void qvirtqueue_kick(QTestState *qts, QVirtioDevice *d, 
> QVirtQueue *vq,
>  qvirtio_writew(d, qts, vq->avail + 2, idx + 1);
>
>  /* Must read after idx is updated */
> -flags = qvirtio_readw(d, qts, vq->avail);
> +flags = qvirtio_readw(d, qts, vq->used);
>  avail_event = qvirtio_readw(d, qts, vq->used + 4 +
>  sizeof(struct vring_used_elem) * vq->size);
>
> --
> 2.34.1
>
>



Re: [PATCH] coroutine: cap per-thread local pool size

2024-03-20 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 08:10:49PM +, Daniel P. Berrangé wrote:
> On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > On Tue, Mar 19, 2024 at 01:43:32PM +, Daniel P. Berrangé wrote:
> > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > --- a/util/qemu-coroutine.c
> > > > +++ b/util/qemu-coroutine.c
> > > 
> > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > +{
> > > > +#ifdef __linux__
> > > > +g_autofree char *contents = NULL;
> > > > +int max_map_count;
> > > > +
> > > > +/*
> > > > + * Linux processes can have up to max_map_count virtual memory 
> > > > areas
> > > > + * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this 
> > > > limit. We
> > > > + * must limit the coroutine pool to a safe size to avoid running 
> > > > out of
> > > > + * VMAs.
> > > > + */
> > > > +if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > +NULL) &&
> > > > +qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > +/*
> > > > + * This is a conservative upper bound that avoids exceeding
> > > > + * max_map_count. Leave half for non-coroutine users like 
> > > > library
> > > > + * dependencies, vhost-user, etc. Each coroutine takes up 2 
> > > > VMAs so
> > > > + * halve the amount again.
> 
> Leaving half for loaded libraries, etc is quite conservative
> if max_map_count is the small-ish 64k default.
> 
> That reservation could perhaps be a fixed number like 5,000?

While I don't want QEMU to abort, once this heuristic is in the code it
will be scary to make it more optimistic and we may never change it. So
now is the best time to try 5,000.

I'll send a follow-up patch that reserves 5,000 mappings. If that turns
out to be too optimistic we can increase the reservation.

> > > > + */
> > > > +return max_map_count / 4;
> > > 
> > > That's 256,000 coroutines, which still sounds incredibly large
> > > to me.
> > 
> > Any ideas for tweaking this heuristic?
> 
> The awkward thing about this limit is that its hardcoded, and
> since it is indeed a "heuristic", we know it is going to be
> sub-optimal for some use cases / scenarios.
> 
> The worst case upper limit is
> 
>num virtio-blk * num threads * num queues
> 
> Reducing the number of devices isn't practical if the guest
> genuinely needs that many volumes.
> 
> Reducing the threads or queues artificially limits the peak
> performance of a single disk handling in isolation, while
> other disks are idle, so that's not desirable.
> 
> So there's no way to cap the worst case scenario, while
> still maximising the single disk performance possibilities.
> 
> With large VMs with many CPUs and many disks, it could be
> reasonable to not expect a real guest to need to maximise
> I/O on every disk at the same time, and thus want to put
> some cap there to control worst case resource usage.
> 
> It feels like it leans towards being able to control the
> coroutine pool limit explicitly, as a CLI option, to override
> this default heuristic.
> 
> > > > +}
> > > > +#endif
> > > > +
> > > > +return UINT_MAX;
> > > 
> > > Why UINT_MAX as a default ?  If we can't read procfs, we should
> > > assume some much smaller sane default IMHO, that corresponds to
> > > what current linux default max_map_count would be.
> > 
> > This line is not Linux-specific. I don't know if other OSes have an
> > equivalent to max_map_count.
> > 
> > I agree with defaulting to 64k-ish on Linux.
> 
> 
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
> 


signature.asc
Description: PGP signature


Re: [PATCH] block/io: accept NULL qiov in bdrv_pad_request

2024-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 10:13:41AM +0100, Fiona Ebner wrote:
> From: Stefan Reiter 
> 
> Some operations, e.g. block-stream, perform reads while discarding the
> results (only copy-on-read matters). In this case, they will pass NULL
> as the target QEMUIOVector, which will however trip bdrv_pad_request,
> since it wants to extend its passed vector. In particular, this is the
> case for the blk_co_preadv() call in stream_populate().
> 
> If there is no qiov, no operation can be done with it, but the bytes
> and offset still need to be updated, so the subsequent aligned read
> will actually be aligned and not run into an assertion failure.
> 
> In particular, this can happen when the request alignment of the top
> node is larger than the allocated part of the bottom node, in which
> case padding becomes necessary. For example:
> 
> > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> > ./qemu-system-x86_64 --qmp stdio \
> > --blockdev 
> > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> > <<EOF
> > {"execute": "qmp_capabilities"}
> > {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": 
> > "node0", "node-name": "node1" } }
> > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": 
> > "node1" } }
> > EOF

Hi Fiona,
Can you add a qemu-iotests test case for this issue?

Thanks,
Stefan

> 
> Originally-by: Stefan Reiter 
> Signed-off-by: Thomas Lamprecht 
> [FE: do update bytes and offset in any case
>  add reproducer to commit message]
> Signed-off-by: Fiona Ebner 
> ---
>  block/io.c | 31 +++
>  1 file changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 33150c0359..395bea3bac 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
>  return 0;
>  }
>  
> -sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
> -  &sliced_head, &sliced_tail,
> -  &sliced_niov);
> +/*
> + * For prefetching in stream_populate(), no qiov is passed along, because
> + * only copy-on-read matters.
> + */
> +if (qiov && *qiov) {
> +sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
> +  &sliced_head, &sliced_tail,
> +  &sliced_niov);
>  
> -/* Guaranteed by bdrv_check_request32() */
> -assert(*bytes <= SIZE_MAX);
> -ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
> -  sliced_head, *bytes);
> -if (ret < 0) {
> -bdrv_padding_finalize(pad);
> -return ret;
> +/* Guaranteed by bdrv_check_request32() */
> +assert(*bytes <= SIZE_MAX);
> +ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
> +  sliced_head, *bytes);
> +if (ret < 0) {
> +bdrv_padding_finalize(pad);
> +return ret;
> +}
> +*qiov = &pad->local_qiov;
> +*qiov_offset = 0;
>  }
> +
>  *bytes += pad->head + pad->tail;
>  *offset -= pad->head;
> -*qiov = &pad->local_qiov;
> -*qiov_offset = 0;
>  if (padded) {
>  *padded = true;
>  }
> -- 
> 2.39.2
> 
> 


signature.asc
Description: PGP signature


Re: [PULL 1/1] coroutine: cap per-thread local pool size

2024-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 03:14:07PM +, Daniel P. Berrangé wrote:
> Sending this PULL feels a little rushed, as I still have
> unanswered questions on the initial patch posting just
> a few hours ago.

Sorry, I hadn't seen your email. I'll update this email thread once the
discussion has finished.

Stefan

> 
> On Tue, Mar 19, 2024 at 11:09:38AM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> > 
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > The per-thread pool sizes are non-uniform and depend on past coroutine
> > usage in each thread, so it's possible for one thread to have a large
> > pool while another thread's pool is empty.
> > 
> > Switch to a new coroutine pool implementation with a global pool that
> > grows to a maximum number of coroutines and per-thread local pools that
> > are capped at a hardcoded small number of coroutines.
> > 
> > This approach does not leave large numbers of coroutines pooled in a
> > thread that may not use them again. In order to perform well it
> > amortizes the cost of global pool accesses by working in batches of
> > coroutines instead of individual coroutines.
> > 
> > The global pool is a list. Threads donate batches of coroutines to it when
> > they have too many and take batches from it when they have too few:
> > 
> > .---.
> > | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> > `---'
> > 
> > Each thread has up to 2 batches of coroutines:
> > 
> > .---.
> > | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> > `---'
> > 
> > The goal of this change is to reduce the excessive number of pooled
> > coroutines that cause QEMU to abort when vm.max_map_count is reached
> > without losing the performance of an adequately sized coroutine pool.
> > 
> > Here are virtio-blk disk I/O benchmark results:
> > 
> >   RW BLKSIZE IODEPTHOLDNEW CHANGE
> > randread  4k   1 113725 117451 +3.3%
> > randread  4k   8 192968 198510 +2.9%
> > randread  4k  16 207138 209429 +1.1%
> > randread  4k  32 212399 215145 +1.3%
> > randread  4k  64 218319 221277 +1.4%
> > randread128k   1  17587  17535 -0.3%
> > randread128k   8  17614  17616 +0.0%
> > randread128k  16  17608  17609 +0.0%
> > randread128k  32  17552  17553 +0.0%
> > randread128k  64  17484  17484 +0.0%
> > 
> > See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> > 
> > Buglink: https://issues.redhat.com/browse/RHEL-28947
> > Reported-by: Sanjay Rao 
> > Reported-by: Boaz Ben Shabat 
> > Reported-by: Joe Mario 
> > Reviewed-by: Kevin Wolf 
> > Signed-off-by: Stefan Hajnoczi 
> > Message-ID: <20240318183429.1039340-1-stefa...@redhat.com>
> > ---
> >  util/qemu-coroutine.c | 282 +-
> >  1 file changed, 223 insertions(+), 59 deletions(-)
> > 
> > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > index 5fd2dbaf8b..2790959eaf 100644
> > --- a/util/qemu-coroutine.c
> > +++ b/util/qemu-coroutine.c
> > @@ -18,39 +18,200 @@
> >  #include "qemu/atomic.h"
> >  #include "qemu/coroutine_int.h"
> >  #include "qemu/coroutine-tls.h"
> > +#include "qemu/cutils.h"
> >  #include "block/aio.h"
> >  
> > -/**
> > - * The minimal batch size is always 64, coroutines from the release_pool 
> > are
> > - * reused as soon as there are 64 coroutines in it. The maximum pool size 
> > starts
> > - * with 64 and is increased on demand so that coroutines are not deleted 
> > even if
> > - * they are not immediately reused.
> > - */
> >  enum {
> > -POOL_MIN_BATCH_SIZE = 64,
> > -POOL_INITIAL_MAX_SIZE = 64,
> > +COROUTINE_POOL_BATCH_MAX_SIZE = 128,
> >  };
> >  
> > -/** Free list to speed up creation */
> > -static QSLIST_HEAD(, Coroutine) release_pool = 
> > QSLIST_HEAD_INITIALIZER

Re: [PATCH] coroutine: cap per-thread local pool size

2024-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 01:43:32PM +, Daniel P. Berrangé wrote:
> On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> > 
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> 
> This sounds quite alarming. What usage scenario justifies
> creating so many coroutines?

The coroutine pool hides creation and deletion latency. The pool
initially has a modest size of 64, but each virtio-blk device increases
the pool size by num_queues * queue_size (256) / 2.

The issue pops up with large SMP guests (i.e. large num_queues) with
multiple virtio-blk devices.
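
To put rough numbers on that (illustrative, assuming the default
num_queues = num_vcpus and queue_size = 256): a 64-vCPU guest gives each
virtio-blk device 64 queues, so each device raises the pool limit by
64 * 256 / 2 = 8,192 coroutines. Four such devices plus the initial 64
allow 32,832 pooled coroutines, which is 65,664 VMAs at 2 VMAs per
coroutine, already past the historical vm.max_map_count default of 65,530.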

> IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> coroutines implies 10's of GB of memory just on stacks alone.
> 
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> 
> On my system max_map_count is 1048576, quite alot higher than
> 10's of 1000's. Hitting that would imply ~500,000 coroutines and
> ~500 GB of stacks !

Fedora recently increased the limit to 1048576. Before that it was
65k-ish and still is on most other distros.

Regarding why QEMU might have 65k coroutines pooled, it's because the
existing coroutine pool algorithm is per thread. So if the max pool size
is 15k but you have 4 IOThreads then up to 4 x 15k total coroutines can
be sitting in pools. This patch addresses this by setting a small fixed
size on per thread pools (256).

> 
> > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > index 5fd2dbaf8b..2790959eaf 100644
> > --- a/util/qemu-coroutine.c
> > +++ b/util/qemu-coroutine.c
> 
> > +static unsigned int get_global_pool_hard_max_size(void)
> > +{
> > +#ifdef __linux__
> > +g_autofree char *contents = NULL;
> > +int max_map_count;
> > +
> > +/*
> > + * Linux processes can have up to max_map_count virtual memory areas
> > + * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this 
> > limit. We
> > + * must limit the coroutine pool to a safe size to avoid running out of
> > + * VMAs.
> > + */
> > +if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > +NULL) &&
> > +qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > +/*
> > + * This is a conservative upper bound that avoids exceeding
> > + * max_map_count. Leave half for non-coroutine users like library
> > + * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > + * halve the amount again.
> > + */
> > +return max_map_count / 4;
> 
> That's 256,000 coroutines, which still sounds incredibly large
> to me.

Any ideas for tweaking this heuristic?

> 
> > +}
> > +#endif
> > +
> > +return UINT_MAX;
> 
> Why UINT_MAX as a default ?  If we can't read procfs, we should
> assume some much smaller sane default IMHO, that corresponds to
> what current linux default max_map_count would be.

This line is not Linux-specific. I don't know if other OSes have an
equivalent to max_map_count.

I agree with defaulting to 64k-ish on Linux.
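
A sketch of that fallback (untested; 65530 is just the historical Linux
default, not a number from this thread):

#ifdef __linux__
    /* procfs read failed: assume the historical vm.max_map_count default */
    return 65530 / 4;
#else
    return UINT_MAX; /* no known VMA limit on other host OSes */
#endif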

Stefan

> 
> > +}
> > +
> > +static void __attribute__((constructor)) qemu_coroutine_init(void)
> > +{
> > > > +qemu_mutex_init(&global_pool_lock);
> > +global_pool_hard_max_size = get_global_pool_hard_max_size();
> >  }
> > -- 
> > 2.44.0
> > 
> > 
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
> 


signature.asc
Description: PGP signature


[PULL 1/1] coroutine: cap per-thread local pool size

2024-03-19 Thread Stefan Hajnoczi
The coroutine pool implementation can hit the Linux vm.max_map_count
limit, causing QEMU to abort with "failed to allocate memory for stack"
or "failed to set up stack guard page" during coroutine creation.

This happens because per-thread pools can grow to tens of thousands of
coroutines. Each coroutine causes 2 virtual memory areas to be created.
Eventually vm.max_map_count is reached and memory-related syscalls fail.
The per-thread pool sizes are non-uniform and depend on past coroutine
usage in each thread, so it's possible for one thread to have a large
pool while another thread's pool is empty.

Switch to a new coroutine pool implementation with a global pool that
grows to a maximum number of coroutines and per-thread local pools that
are capped at a hardcoded small number of coroutines.

This approach does not leave large numbers of coroutines pooled in a
thread that may not use them again. In order to perform well it
amortizes the cost of global pool accesses by working in batches of
coroutines instead of individual coroutines.

The global pool is a list. Threads donate batches of coroutines to it when
they have too many and take batches from it when they have too few:

.---.
| Batch 1 | Batch 2 | Batch 3 | ... | global_pool
`---'

Each thread has up to 2 batches of coroutines:

.---.
| Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
`---'

The goal of this change is to reduce the excessive number of pooled
coroutines that cause QEMU to abort when vm.max_map_count is reached
without losing the performance of an adequately sized coroutine pool.

Here are virtio-blk disk I/O benchmark results:

  RW BLKSIZE IODEPTHOLDNEW CHANGE
randread  4k   1 113725 117451 +3.3%
randread  4k   8 192968 198510 +2.9%
randread  4k  16 207138 209429 +1.1%
randread  4k  32 212399 215145 +1.3%
randread  4k  64 218319 221277 +1.4%
randread128k   1  17587  17535 -0.3%
randread128k   8  17614  17616 +0.0%
randread128k  16  17608  17609 +0.0%
randread128k  32  17552  17553 +0.0%
randread128k  64  17484  17484 +0.0%

See files/{fio.sh,test.xml.j2} for the benchmark configuration:
https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing

Buglink: https://issues.redhat.com/browse/RHEL-28947
Reported-by: Sanjay Rao 
Reported-by: Boaz Ben Shabat 
Reported-by: Joe Mario 
Reviewed-by: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240318183429.1039340-1-stefa...@redhat.com>
---
 util/qemu-coroutine.c | 282 +-
 1 file changed, 223 insertions(+), 59 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5fd2dbaf8b..2790959eaf 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -18,39 +18,200 @@
 #include "qemu/atomic.h"
 #include "qemu/coroutine_int.h"
 #include "qemu/coroutine-tls.h"
+#include "qemu/cutils.h"
 #include "block/aio.h"
 
-/**
- * The minimal batch size is always 64, coroutines from the release_pool are
- * reused as soon as there are 64 coroutines in it. The maximum pool size 
starts
- * with 64 and is increased on demand so that coroutines are not deleted even 
if
- * they are not immediately reused.
- */
 enum {
-POOL_MIN_BATCH_SIZE = 64,
-POOL_INITIAL_MAX_SIZE = 64,
+COROUTINE_POOL_BATCH_MAX_SIZE = 128,
 };
 
-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
-static unsigned int release_pool_size;
+/*
+ * Coroutine creation and deletion is expensive so a pool of unused coroutines
+ * is kept as a cache. When the pool has coroutines available, they are
+ * recycled instead of creating new ones from scratch. Coroutines are added to
+ * the pool upon termination.
+ *
+ * The pool is global but each thread maintains a small local pool to avoid
+ * global pool contention. Threads fetch and return batches of coroutines from
+ * the global pool to maintain their local pool. The local pool holds up to two
+ * batches whereas the maximum size of the global pool is controlled by the
+ * qemu_coroutine_inc_pool_size() API.
+ *
+ * .---.
+ * | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
+ * `---'
+ *
+ * .---.
+ * | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
+ * `---'
+ */
+typedef struct CoroutinePoolBatch {
+/* Batches are kept in a list */
+QSLIST_ENTRY(CoroutinePoolBatch) next;
 
-typedef QSLIST_HEAD(, Coroutine) CoroutineQSList;
-QEMU_DEFINE_STATIC_CO_TLS(CoroutineQSList, alloc_pool);
-QEMU_DEFINE_STATIC_CO_TLS(unsigned int, alloc_pool_size);
-QEMU_DEFINE_STATIC_CO

[PULL 0/1] Block patches

2024-03-19 Thread Stefan Hajnoczi
The following changes since commit ddc27d2ad9361a81c2b3800d14143bf420dae172:

  Merge tag 'pull-request-2024-03-18' of https://gitlab.com/thuth/qemu into 
staging (2024-03-19 10:25:25 +)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 86a637e48104ae74d8be53bed6441ce32be33433:

  coroutine: cap per-thread local pool size (2024-03-19 10:49:31 -0400)


Pull request

This fix solves the "failed to set up stack guard page" error that has been
reported on Linux hosts where the QEMU coroutine pool exceeds the
vm.max_map_count limit.

----

Stefan Hajnoczi (1):
  coroutine: cap per-thread local pool size

 util/qemu-coroutine.c | 282 +-
 1 file changed, 223 insertions(+), 59 deletions(-)

-- 
2.44.0




Re: [PATCH] coroutine: cap per-thread local pool size

2024-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 02:32:06PM +0100, Kevin Wolf wrote:
> Am 18.03.2024 um 19:34 hat Stefan Hajnoczi geschrieben:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> > 
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > The per-thread pool sizes are non-uniform and depend on past coroutine
> > usage in each thread, so it's possible for one thread to have a large
> > pool while another thread's pool is empty.
> > 
> > Switch to a new coroutine pool implementation with a global pool that
> > grows to a maximum number of coroutines and per-thread local pools that
> > are capped at a hardcoded small number of coroutines.
> > 
> > This approach does not leave large numbers of coroutines pooled in a
> > thread that may not use them again. In order to perform well it
> > amortizes the cost of global pool accesses by working in batches of
> > coroutines instead of individual coroutines.
> > 
> > The global pool is a list. Threads donate batches of coroutines to it when
> > they have too many and take batches from it when they have too few:
> > 
> > .---.
> > | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> > `---'
> > 
> > Each thread has up to 2 batches of coroutines:
> > 
> > .---.
> > | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> > `---'
> > 
> > The goal of this change is to reduce the excessive number of pooled
> > coroutines that cause QEMU to abort when vm.max_map_count is reached
> > without losing the performance of an adequately sized coroutine pool.
> > 
> > Here are virtio-blk disk I/O benchmark results:
> > 
> >   RW BLKSIZE IODEPTHOLDNEW CHANGE
> > randread  4k   1 113725 117451 +3.3%
> > randread  4k   8 192968 198510 +2.9%
> > randread  4k  16 207138 209429 +1.1%
> > randread  4k  32 212399 215145 +1.3%
> > randread  4k  64 218319 221277 +1.4%
> > randread128k   1  17587  17535 -0.3%
> > randread128k   8  17614  17616 +0.0%
> > randread128k  16  17608  17609 +0.0%
> > randread128k  32  17552  17553 +0.0%
> > randread128k  64  17484  17484 +0.0%
> > 
> > See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> > 
> > Buglink: https://issues.redhat.com/browse/RHEL-28947
> > Reported-by: Sanjay Rao 
> > Reported-by: Boaz Ben Shabat 
> > Reported-by: Joe Mario 
> > Signed-off-by: Stefan Hajnoczi 
> 
> Reviewed-by: Kevin Wolf 
> 
> Though I do wonder if we can do something about the slight performance
> degradation that Sanjay reported. We seem to stay well under the hard
> limit, so the reduced global pool size shouldn't be the issue. Maybe
> it's the locking?

I'm not sure if it's the lock. When writing the code I tried to avoid
thresholds that cause batches to bounce between the global and local
thread pools, because that is another way to lose performance. So maybe
it's something related to the algorithm.

> Either way, even though it could be called a fix, I don't think this is
> for 9.0, right?

There is a bug report for the max_map_count issue (RHEL-28947), so I
think this fix should go into QEMU 9.0.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v2] virtio-blk: iothread-vq-mapping coroutine pool sizing

2024-03-18 Thread Stefan Hajnoczi
On Tue, Mar 12, 2024 at 11:12:04AM -0400, Stefan Hajnoczi wrote:
> It is possible to hit the sysctl vm.max_map_count limit when the
> coroutine pool size becomes large. Each coroutine requires two mappings
> (one for the stack and one for the guard page). QEMU can crash with
> "failed to set up stack guard page" or "failed to allocate memory for
> stack" when this happens.
> 
> Coroutine pool sizing is simple when there is only one thread: sum up all
> I/O requests across all virtqueues.
> 
> When the iothread-vq-mapping option is used we should calculate tighter
> bounds because each thread may serve a subset of the device's virtqueues:
> take the maximum, across all threads, of the number of I/O requests for
> the virtqueues each thread serves. A thread does not need coroutine pool
> space for I/O requests that are handled by other threads.
> 
> This is not a solution to hitting vm.max_map_count, but it helps. A
> guest with 64 vCPUs (hence 64 virtqueues) across 4 IOThreads with one
> iothread-vq-mapping virtio-blk device and a root disk without it goes from
> pool_max_size 16,448 to 10,304.
> 
> Reported-by: Sanjay Rao 
> Reported-by: Boaz Ben Shabat 
> Reported-by: Joe Mario 
> Signed-off-by: Stefan Hajnoczi 
> ---
> v2:
> - State the the tighter bounds reflect the fact that threads may only
>   process a subset of the total I/O requests from a device [Kevin]
> - Add Reported-by: Joe Mario, he has been investigating this issue.

I have sent a new patch series that obsoletes this patch. Please do not
apply this patch.

The new series is here:
https://lore.kernel.org/qemu-devel/20240318183429.1039340-1-stefa...@redhat.com/T/#u

> 
>  include/hw/virtio/virtio-blk.h |  2 ++
>  hw/block/virtio-blk.c  | 34 --
>  2 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
> index 5c14110c4b..ac29700ad4 100644
> --- a/include/hw/virtio/virtio-blk.h
> +++ b/include/hw/virtio/virtio-blk.h
> @@ -74,6 +74,8 @@ struct VirtIOBlock {
>  uint64_t host_features;
>  size_t config_size;
>  BlockRAMRegistrar blk_ram_registrar;
> +
> +unsigned coroutine_pool_size;
>  };
>  
>  typedef struct VirtIOBlockReq {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index 738cb2ac36..0a14b2b175 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1957,6 +1957,35 @@ static void virtio_blk_stop_ioeventfd(VirtIODevice 
> *vdev)
>  s->ioeventfd_stopping = false;
>  }
>  
> +/* Increase the coroutine pool size to include our I/O requests */
> +static void virtio_blk_inc_coroutine_pool_size(VirtIOBlock *s)
> +{
> +VirtIOBlkConf *conf = &s->conf;
> +unsigned max_requests = 0;
> +
> +/* Tracks the total number of requests for AioContext */
> +g_autoptr(GHashTable) counters = g_hash_table_new(NULL, NULL);
> +
> +/* Call this function after setting up vq_aio_context[] */
> +assert(s->vq_aio_context);
> +
> +for (unsigned i = 0; i < conf->num_queues; i++) {
> +AioContext *ctx = s->vq_aio_context[i];
> +unsigned n = GPOINTER_TO_UINT(g_hash_table_lookup(counters, ctx));
> +
> +n += conf->queue_size / 2; /* this is a heuristic */
> +
> +g_hash_table_insert(counters, ctx, GUINT_TO_POINTER(n));
> +
> +if (n > max_requests) {
> +max_requests = n;
> +}
> +}
> +
> +qemu_coroutine_inc_pool_size(max_requests);
> +s->coroutine_pool_size = max_requests; /* stash it for ->unrealize() */
> +}
> +
>  static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
>  {
>  VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> @@ -2048,7 +2077,6 @@ static void virtio_blk_device_realize(DeviceState *dev, 
> Error **errp)
>  for (i = 0; i < conf->num_queues; i++) {
>  virtio_add_queue(vdev, conf->queue_size, virtio_blk_handle_output);
>  }
> -qemu_coroutine_inc_pool_size(conf->num_queues * conf->queue_size / 2);
>  
>  /* Don't start ioeventfd if transport does not support notifiers. */
>  if (!virtio_device_ioeventfd_enabled(vdev)) {
> @@ -2065,6 +2093,8 @@ static void virtio_blk_device_realize(DeviceState *dev, 
> Error **errp)
>  return;
>  }
>  
> +virtio_blk_inc_coroutine_pool_size(s);
> +
>  /*
>   * This must be after virtio_init() so virtio_blk_dma_restart_cb() gets
>   * called after ->start_ioeventfd() has already set blk's AioContext.
> @@ -2096,7 +2126,7 @@ static void virtio_blk_device_unrealize(DeviceState 
> *dev)
>  for (i = 0; i < conf->num_queues; i++) {
>  virtio_del_queue(vdev, i);
>  }
> -qemu_coroutine_dec_pool_size(conf->num_queues * conf->queue_size / 2);
> +qemu_coroutine_dec_pool_size(s->coroutine_pool_size);
>  qemu_mutex_destroy(&s->rq_lock);
>  blk_ram_registrar_destroy(&s->blk_ram_registrar);
>  qemu_del_vm_change_state_handler(s->change);
> -- 
> 2.44.0
> 


signature.asc
Description: PGP signature


[PATCH] coroutine: cap per-thread local pool size

2024-03-18 Thread Stefan Hajnoczi
The coroutine pool implementation can hit the Linux vm.max_map_count
limit, causing QEMU to abort with "failed to allocate memory for stack"
or "failed to set up stack guard page" during coroutine creation.

This happens because per-thread pools can grow to tens of thousands of
coroutines. Each coroutine causes 2 virtual memory areas to be created.
Eventually vm.max_map_count is reached and memory-related syscalls fail.
The per-thread pool sizes are non-uniform and depend on past coroutine
usage in each thread, so it's possible for one thread to have a large
pool while another thread's pool is empty.

Switch to a new coroutine pool implementation with a global pool that
grows to a maximum number of coroutines and per-thread local pools that
are capped at a hardcoded small number of coroutines.

This approach does not leave large numbers of coroutines pooled in a
thread that may not use them again. In order to perform well it
amortizes the cost of global pool accesses by working in batches of
coroutines instead of individual coroutines.

The global pool is a list. Threads donate batches of coroutines to it when
they have too many and take batches from it when they have too few:

.---.
| Batch 1 | Batch 2 | Batch 3 | ... | global_pool
`---'

Each thread has up to 2 batches of coroutines:

.---.
| Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
`---'

The goal of this change is to reduce the excessive number of pooled
coroutines that cause QEMU to abort when vm.max_map_count is reached
without losing the performance of an adequately sized coroutine pool.

Here are virtio-blk disk I/O benchmark results:

  RW BLKSIZE IODEPTHOLDNEW CHANGE
randread  4k   1 113725 117451 +3.3%
randread  4k   8 192968 198510 +2.9%
randread  4k  16 207138 209429 +1.1%
randread  4k  32 212399 215145 +1.3%
randread  4k  64 218319 221277 +1.4%
randread128k   1  17587  17535 -0.3%
randread128k   8  17614  17616 +0.0%
randread128k  16  17608  17609 +0.0%
randread128k  32  17552  17553 +0.0%
randread128k  64  17484  17484 +0.0%

See files/{fio.sh,test.xml.j2} for the benchmark configuration:
https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing

Buglink: https://issues.redhat.com/browse/RHEL-28947
Reported-by: Sanjay Rao 
Reported-by: Boaz Ben Shabat 
Reported-by: Joe Mario 
Signed-off-by: Stefan Hajnoczi 
---
This patch obsoletes "[PATCH v2] virtio-blk: iothread-vq-mapping
coroutine pool sizing" because the pool size is now global instead of
per thread:
https://lore.kernel.org/qemu-devel/20240312151204.412624-1-stefa...@redhat.com/

Please don't apply "[PATCH v2] virtio-blk: iothread-vq-mapping coroutine
pool sizing".

 util/qemu-coroutine.c | 282 +-
 1 file changed, 223 insertions(+), 59 deletions(-)

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5fd2dbaf8b..2790959eaf 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -18,39 +18,200 @@
 #include "qemu/atomic.h"
 #include "qemu/coroutine_int.h"
 #include "qemu/coroutine-tls.h"
+#include "qemu/cutils.h"
 #include "block/aio.h"
 
-/**
- * The minimal batch size is always 64, coroutines from the release_pool are
- * reused as soon as there are 64 coroutines in it. The maximum pool size 
starts
- * with 64 and is increased on demand so that coroutines are not deleted even 
if
- * they are not immediately reused.
- */
 enum {
-POOL_MIN_BATCH_SIZE = 64,
-POOL_INITIAL_MAX_SIZE = 64,
+COROUTINE_POOL_BATCH_MAX_SIZE = 128,
 };
 
-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
-static unsigned int release_pool_size;
+/*
+ * Coroutine creation and deletion is expensive so a pool of unused coroutines
+ * is kept as a cache. When the pool has coroutines available, they are
+ * recycled instead of creating new ones from scratch. Coroutines are added to
+ * the pool upon termination.
+ *
+ * The pool is global but each thread maintains a small local pool to avoid
+ * global pool contention. Threads fetch and return batches of coroutines from
+ * the global pool to maintain their local pool. The local pool holds up to two
+ * batches whereas the maximum size of the global pool is controlled by the
+ * qemu_coroutine_inc_pool_size() API.
+ *
+ * .---.
+ * | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
+ * `---'
+ *
+ * .---.
+ * | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
+ * `---'
+ */
+typedef struct CoroutinePoolBatch {
+/* Batches are kept in a list */
+

Re: [PATCH for-9.0 0/2] nbd: Fix server crash on reset with iothreads

2024-03-18 Thread Stefan Hajnoczi
On Thu, Mar 14, 2024 at 05:58:23PM +0100, Kevin Wolf wrote:
> Kevin Wolf (2):
>   nbd/server: Fix race in draining the export
>   iotests: Add test for reset/AioContext switches with NBD exports
> 
>  nbd/server.c  | 15 ++---
>  tests/qemu-iotests/tests/iothreads-nbd-export | 66 +++
>  .../tests/iothreads-nbd-export.out| 19 ++
>  3 files changed, 92 insertions(+), 8 deletions(-)
>  create mode 100755 tests/qemu-iotests/tests/iothreads-nbd-export
>  create mode 100644 tests/qemu-iotests/tests/iothreads-nbd-export.out
> 
> -- 
> 2.44.0
> 

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH for-9.0] mirror: Don't call job_pause_point() under graph lock

2024-03-18 Thread Stefan Hajnoczi
On Mon, Mar 18, 2024 at 12:37:19PM +0100, Kevin Wolf wrote:
> Am 14.03.2024 um 15:29 hat Stefan Hajnoczi geschrieben:
> > On Wed, Mar 13, 2024 at 04:30:00PM +0100, Kevin Wolf wrote:
> > > Calling job_pause_point() while holding the graph reader lock
> > > potentially results in a deadlock: bdrv_graph_wrlock() first drains
> > > everything, including the mirror job, which pauses it. The job is only
> > > unpaused at the end of the drain section, which is when the graph writer
> > > lock has been successfully taken. However, if the job happens to be
> > > paused at a pause point where it still holds the reader lock, the writer
> > > lock can't be taken as long as the job is still paused.
> > > 
> > > Mark job_pause_point() as GRAPH_UNLOCKED and fix mirror accordingly.
> > > 
> > > Cc: qemu-sta...@nongnu.org
> > > Buglink: https://issues.redhat.com/browse/RHEL-28125
> > > Fixes: 004915a96a7a40e942ac85e6d22518cbcd283506
> > > Signed-off-by: Kevin Wolf 
> > > ---
> > >  include/qemu/job.h |  2 +-
> > >  block/mirror.c | 10 ++
> > >  2 files changed, 7 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/qemu/job.h b/include/qemu/job.h
> > > index 9ea98b5927..2b873f2576 100644
> > > --- a/include/qemu/job.h
> > > +++ b/include/qemu/job.h
> > > @@ -483,7 +483,7 @@ void job_enter(Job *job);
> > >   *
> > >   * Called with job_mutex *not* held.
> > >   */
> > > -void coroutine_fn job_pause_point(Job *job);
> > > +void coroutine_fn GRAPH_UNLOCKED job_pause_point(Job *job);
> > >  
> > >  /**
> > >   * @job: The job that calls the function.
> > > diff --git a/block/mirror.c b/block/mirror.c
> > > index 5145eb53e1..1bdce3b657 100644
> > > --- a/block/mirror.c
> > > +++ b/block/mirror.c
> > > @@ -479,9 +479,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, 
> > > int64_t offset,
> > >  return bytes_handled;
> > >  }
> > >  
> > > -static void coroutine_fn GRAPH_RDLOCK mirror_iteration(MirrorBlockJob *s)
> > > +static void coroutine_fn GRAPH_UNLOCKED mirror_iteration(MirrorBlockJob 
> > > *s)
> > >  {
> > > -BlockDriverState *source = s->mirror_top_bs->backing->bs;
> > > +BlockDriverState *source;
> > >  MirrorOp *pseudo_op;
> > >  int64_t offset;
> > >  /* At least the first dirty chunk is mirrored in one iteration. */
> > > @@ -489,6 +489,10 @@ static void coroutine_fn GRAPH_RDLOCK 
> > > mirror_iteration(MirrorBlockJob *s)
> > >  bool write_zeroes_ok = 
> > > bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
> > >  int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);
> > >  
> > > +bdrv_graph_co_rdlock();
> > > +source = s->mirror_top_bs->backing->bs;
> > 
> > Is bdrv_ref(source) needed here so that source cannot go away if someone
> > else write locks the graph and removes it? Or maybe something else
> > protects against that. Either way, please add a comment that explains
> > why this is safe.
> 
> We didn't even get to looking at this level of detail with the graph
> locking work. We probably should, but this is not the only place in
> mirror we need to look at then. Commit 004915a9 just took the lazy path
> of taking the lock for the whole function, and it turns out that this
> was wrong and causes deadlocks, so I'm reverting it and replacing it
> with what other parts of the code do - the minimal thing to let it
> compile.
> 
> I think we already own a reference, we do a block_job_add_bdrv() in
> mirror_start_job(). But once it changes, we have a reference to the
> wrong node. So it looks to me that mirror has a problem with a changing
> source node that is more fundamental than graph locking in one specific
> function because it stores BDS pointers in its state.
> 
> Active commit already freezes the backing chain between mirror_top_bs
> and target, maybe other mirror jobs need to freeze the link between
> mirror_top_bs and source at least.
> 
> So I agree that it might be worth looking into this more, but I consider
> it unrelated to this patch. We just go back to the state in which it has
> always been before 8.2 (which might contain a latent bug that apparently
> never triggered in practice) to fix a regression that we do see in
> practice.
> 
> Kevin

Okay:

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH for-9.0] mirror: Don't call job_pause_point() under graph lock

2024-03-14 Thread Stefan Hajnoczi
On Wed, Mar 13, 2024 at 04:30:00PM +0100, Kevin Wolf wrote:
> Calling job_pause_point() while holding the graph reader lock
> potentially results in a deadlock: bdrv_graph_wrlock() first drains
> everything, including the mirror job, which pauses it. The job is only
> unpaused at the end of the drain section, which is when the graph writer
> lock has been successfully taken. However, if the job happens to be
> paused at a pause point where it still holds the reader lock, the writer
> lock can't be taken as long as the job is still paused.
> 
> Mark job_pause_point() as GRAPH_UNLOCKED and fix mirror accordingly.
> 
> Cc: qemu-sta...@nongnu.org
> Buglink: https://issues.redhat.com/browse/RHEL-28125
> Fixes: 004915a96a7a40e942ac85e6d22518cbcd283506
> Signed-off-by: Kevin Wolf 
> ---
>  include/qemu/job.h |  2 +-
>  block/mirror.c | 10 ++
>  2 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/include/qemu/job.h b/include/qemu/job.h
> index 9ea98b5927..2b873f2576 100644
> --- a/include/qemu/job.h
> +++ b/include/qemu/job.h
> @@ -483,7 +483,7 @@ void job_enter(Job *job);
>   *
>   * Called with job_mutex *not* held.
>   */
> -void coroutine_fn job_pause_point(Job *job);
> +void coroutine_fn GRAPH_UNLOCKED job_pause_point(Job *job);
>  
>  /**
>   * @job: The job that calls the function.
> diff --git a/block/mirror.c b/block/mirror.c
> index 5145eb53e1..1bdce3b657 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -479,9 +479,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t 
> offset,
>  return bytes_handled;
>  }
>  
> -static void coroutine_fn GRAPH_RDLOCK mirror_iteration(MirrorBlockJob *s)
> +static void coroutine_fn GRAPH_UNLOCKED mirror_iteration(MirrorBlockJob *s)
>  {
> -BlockDriverState *source = s->mirror_top_bs->backing->bs;
> +BlockDriverState *source;
>  MirrorOp *pseudo_op;
>  int64_t offset;
>  /* At least the first dirty chunk is mirrored in one iteration. */
> @@ -489,6 +489,10 @@ static void coroutine_fn GRAPH_RDLOCK 
> mirror_iteration(MirrorBlockJob *s)
>  bool write_zeroes_ok = 
> bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
>  int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);
>  
> +bdrv_graph_co_rdlock();
> +source = s->mirror_top_bs->backing->bs;

Is bdrv_ref(source) needed here so that source cannot go away if someone
else write locks the graph and removes it? Or maybe something else
protects against that. Either way, please add a comment that explains
why this is safe.

> +bdrv_graph_co_rdunlock();
> +
>  bdrv_dirty_bitmap_lock(s->dirty_bitmap);
>  offset = bdrv_dirty_iter_next(s->dbi);
>  if (offset < 0) {
> @@ -1066,9 +1070,7 @@ static int coroutine_fn mirror_run(Job *job, Error 
> **errp)
>  mirror_wait_for_free_in_flight_slot(s);
>  continue;
>  } else if (cnt != 0) {
> -bdrv_graph_co_rdlock();
>  mirror_iteration(s);
> -bdrv_graph_co_rdunlock();
>  }
>  }
>  
> -- 
> 2.44.0
> 




Re: [PATCH] file-posix: rearrange BDRVRawState fields

2024-03-14 Thread Stefan Hajnoczi
On Thu, Mar 14, 2024 at 04:47:41PM +0530, Prasad Pandit wrote:
> From: Prasad Pandit 
> 
> Rearrange BRDVRawState structure fields to avoid memory
> fragments in its object's memory and save some(~8) bytes
> per object.
> 
> Signed-off-by: Prasad Pandit 
> ---
>  block/file-posix.c | 39 +++
>  1 file changed, 19 insertions(+), 20 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH v4] linux-aio: add IO_CMD_FDSYNC command support

2024-03-14 Thread Stefan Hajnoczi
On Thu, Mar 14, 2024 at 04:46:28PM +0530, Prasad Pandit wrote:
> From: Prasad Pandit 
> 
> Libaio defines IO_CMD_FDSYNC command to sync all outstanding
> asynchronous I/O operations, by flushing out file data to the
> disk storage.
> 
> Enable linux-aio to submit such aio request. This helps to
> reduce latency induced via pthread_create calls by
> thread-pool (aio=threads).
> 
> Signed-off-by: Prasad Pandit 
> ---
>  block/file-posix.c  |  7 +++
>  block/linux-aio.c   | 21 -
>  include/block/raw-aio.h |  1 +
>  3 files changed, 28 insertions(+), 1 deletion(-)
> 
> v4: New boolean field to indicate if aio_fdsync is available or not.
> It is set at file open time and checked before AIO_FLUSH call.
>   - https://lists.nongnu.org/archive/html/qemu-devel/2024-03/msg03701.html

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH v3] linux-aio: add IO_CMD_FDSYNC command support

2024-03-13 Thread Stefan Hajnoczi
On Wed, Mar 13, 2024 at 10:49:31PM +0530, Prasad Pandit wrote:
> On Wed, 13 Mar 2024 at 20:48, Stefan Hajnoczi  wrote:
> > > +extern bool laio_has_fdsync(int);
> > Please declare this in include/block/raw-aio.h alongside the other laio 
> > APIs.
> >
> > FDSYNC support should be probed at open() time and the result should be
> > stored in a new bool field like s->laio_supports_fdsync. That way the
> > cost of laio_has_fdsync() on every flush request is avoided.
> 
> * Okay. Here 's' is a BDRVRawState object and file open seems to
> happen in the raw_open_common() function? I'll move the
> laio_has_fdsync() call there and see how it works.

Yes. Thanks!
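
Roughly like this (just a sketch; s->laio_supports_fdsync is only a
suggested field name and I haven't compiled it):

/* in raw_open_common(), once s->fd is set up: probe exactly once */
#ifdef CONFIG_LINUX_AIO
s->laio_supports_fdsync = laio_has_fdsync(s->fd);
#endif

/* in raw_co_flush_to_disk(): no per-request probing */
#ifdef CONFIG_LINUX_AIO
if (raw_check_linux_aio(s) && s->laio_supports_fdsync) {
    return laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
}
#endif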

Stefan




Re: [PATCH v3] linux-aio: add IO_CMD_FDSYNC command support

2024-03-13 Thread Stefan Hajnoczi
On Wed, Mar 13, 2024 at 02:19:35PM +0530, Prasad Pandit wrote:
> From: Prasad Pandit 
> 
> Libaio defines IO_CMD_FDSYNC command to sync all outstanding
> asynchronous I/O operations, by flushing out file data to the
> disk storage.
> 
> Enable linux-aio to submit such aio request. This helps to
> reduce latency induced via pthread_create calls by
> thread-pool (aio=threads).
> 
> Signed-off-by: Prasad Pandit 
> ---
>  block/file-posix.c |  7 +++
>  block/linux-aio.c  | 22 +-
>  2 files changed, 28 insertions(+), 1 deletion(-)
> 
> v3: check if host kernel supports aio_fsync call
>   -> https://lists.nongnu.org/archive/html/qemu-devel/2024-03/msg03210.html
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 35684f7e21..a30494d1a8 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -2453,6 +2453,8 @@ static inline bool 
> raw_check_linux_io_uring(BDRVRawState *s)
>  #endif
>  
>  #ifdef CONFIG_LINUX_AIO
> +extern bool laio_has_fdsync(int);

Please declare this in include/block/raw-aio.h alongside the other laio APIs.

> +
>  static inline bool raw_check_linux_aio(BDRVRawState *s)
>  {
>  Error *local_err = NULL;
> @@ -2599,6 +2601,11 @@ static int coroutine_fn 
> raw_co_flush_to_disk(BlockDriverState *bs)
>  if (raw_check_linux_io_uring(s)) {
>  return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
>  }
> +#endif
> +#ifdef CONFIG_LINUX_AIO
> +if (raw_check_linux_aio(s) && laio_has_fdsync(s->fd)) {

FDSYNC support should be probed at open() time and the result should be
stored in a new bool field like s->laio_supports_fdsync. That way the
cost of laio_has_fdsync() on every flush request is avoided.

> +return laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
> +}
>  #endif
>  return raw_thread_pool_submit(handle_aiocb_flush, &acb);
>  }
> diff --git a/block/linux-aio.c b/block/linux-aio.c
> index ec05d946f3..ed20a503e9 100644
> --- a/block/linux-aio.c
> +++ b/block/linux-aio.c
> @@ -67,6 +67,7 @@ struct LinuxAioState {
>  int event_max;
>  };
>  
> +bool laio_has_fdsync(int);
>  static void ioq_submit(LinuxAioState *s);
>  
>  static inline ssize_t io_event_ret(struct io_event *ev)
> @@ -384,6 +385,9 @@ static int laio_do_submit(int fd, struct qemu_laiocb 
> *laiocb, off_t offset,
>  case QEMU_AIO_READ:
>  io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
>  break;
> +case QEMU_AIO_FLUSH:
> +io_prep_fdsync(iocbs, fd);
> +break;
>  /* Currently Linux kernel does not support other operations */
>  default:
>  fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
> @@ -412,7 +416,7 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, 
> QEMUIOVector *qiov,
>  AioContext *ctx = qemu_get_current_aio_context();
>  struct qemu_laiocb laiocb = {
>  .co = qemu_coroutine_self(),
> -.nbytes = qiov->size,
> +.nbytes = qiov ? qiov->size : 0,
>  .ctx= aio_get_linux_aio(ctx),
>  .ret= -EINPROGRESS,
>  .is_read= (type == QEMU_AIO_READ),
> @@ -486,3 +490,19 @@ void laio_cleanup(LinuxAioState *s)
>  }
>  g_free(s);
>  }
> +
> +bool laio_has_fdsync(int fd)
> +{
> +struct iocb cb;
> +struct iocb *cbs[] = {&cb, NULL};
> +
> +io_context_t ctx = 0;
> +io_setup(1, &ctx);
> +
> +/* check if host kernel supports IO_CMD_FDSYNC */
> +io_prep_fdsync(&cb, fd);
> +int ret = io_submit(ctx, 1, cbs);
> +
> +io_destroy(ctx);
> +return (ret == -EINVAL) ? false : true;
> +}
> -- 
> 2.44.0
> 




Re: [PATCH for 9.0] migration: Skip only empty block devices

2024-03-12 Thread Stefan Hajnoczi
On Tue, Mar 12, 2024 at 09:22:06PM +0100, Cédric Le Goater wrote:
> On 3/12/24 19:41, Stefan Hajnoczi wrote:
> > On Tue, Mar 12, 2024 at 01:04:31PM +0100, Cédric Le Goater wrote:
> > > The block .save_setup() handler calls a helper routine
> > > init_blk_migration() which builds a list of block devices to take into
> > > account for migration. When one device is found to be empty (sectors
> > > == 0), the loop exits and all the remaining devices are ignored. This
> > > is a regression introduced when bdrv_iterate() was removed.
> > > 
> > > Change that by skipping only empty devices.
> > > 
> > > Cc: Markus Armbruster 
> > Suggested-by: Kevin Wolf 
> > > Fixes: fea68bb6e9fa ("block: Eliminate bdrv_iterate(), use bdrv_next()")
> > 
> > It's not clear to me that fea68bb6e9fa introduced the bug. The
> > "sectors <= 0" check is still there and I don't see anything else that
> > skips empty devices.
> > Can you explain the bug in fea68bb6e9fa?
> 
> Let me try. Initially, the code was :
> static void init_blk_migration_it(void *opaque, BlockDriverState *bs)
> {
> BlockDriverState *bs;
> BlkMigDevState *bmds;
> int64_t sectors;
> if (!bdrv_is_read_only(bs)) {
>  sectors = bdrv_nb_sectors(bs);
>  if (sectors <= 0) {
>  return;
> ...
> }
>   
> static void init_blk_migration(QEMUFile *f)
> {
> block_mig_state.submitted = 0;
> block_mig_state.read_done = 0;
> block_mig_state.transferred = 0;
> block_mig_state.total_sector_sum = 0;
> block_mig_state.prev_progress = -1;
> block_mig_state.bulk_completed = 0;
> block_mig_state.zero_blocks = migrate_zero_blocks();
> bdrv_iterate(init_blk_migration_it, NULL);
> }
> 
> bdrv_iterate() was iterating on all devices and exiting one iteration
> early if the device was empty or an error detected. The loop applied
> on all devices always, only empty devices and the ones with a failure
> were skipped.
> It was replaced by :
> 
> static void init_blk_migration(QEMUFile *f)
> {
>  BlockDriverState *bs;
>  BlkMigDevState *bmds;
>  int64_t sectors;
>  block_mig_state.submitted = 0;
>  block_mig_state.read_done = 0;
>  block_mig_state.transferred = 0;
>  block_mig_state.total_sector_sum = 0;
>  block_mig_state.prev_progress = -1;
>  block_mig_state.bulk_completed = 0;
>  block_mig_state.zero_blocks = migrate_zero_blocks();
>  for (bs = bdrv_next(NULL); bs; bs = bdrv_next(bs)) {
>  if (bdrv_is_read_only(bs)) {
>  continue;
>  }
>  sectors = bdrv_nb_sectors(bs);
>  if (sectors <= 0) {
>  return;
>   
>...
>   }
> 
> The loop and function exit at first failure or first empty device,
> skipping the remaining devices. This is a different behavior.
> 
> What we would want today is ignore empty devices, as it done for
> the read only ones, return at first failure and let the caller of
> init_blk_migration() report.
> 
> This is a short term fix for 9.0 because the block migration code
> is deprecated and should be removed in 9.1.

I understand now. I missed that returning from init_blk_migration_it()
did not abort iteration. Thank you!
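
(For the archive: the fix then presumably amounts to turning the early
return in the new-style loop into a continue, i.e. something like

 sectors = bdrv_nb_sectors(bs);
 if (sectors <= 0) {
-return;
+continue;
 }

but I haven't diffed this against the actual patch.)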

Stefan




[PULL 0/2] Tracing patches

2024-03-12 Thread Stefan Hajnoczi
The following changes since commit 8f3f329f5e0117bd1a23a79ab751f8a7d3471e4b:

  Merge tag 'migration-20240311-pull-request' of https://gitlab.com/peterx/qemu 
into staging (2024-03-12 11:35:41 +)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/tracing-pull-request

for you to fetch changes up to 2b608e16ca00017509fa2a211b7b49aacdedb760:

  meson: generate .stp files for tools too (2024-03-12 14:52:07 -0400)


Pull request



Daniel P. Berrangé (2):
  tracetool: remove redundant --target-type / --target-name args
  meson: generate .stp files for tools too

 docs/devel/tracing.rst |  3 +-
 meson.build| 63 +++---
 scripts/tracetool.py   | 24 
 3 files changed, 46 insertions(+), 44 deletions(-)

-- 
2.44.0




Re: [PATCH 0/2] trace: fix ability to use systemtap with qemu tools

2024-03-12 Thread Stefan Hajnoczi
On Mon, Jan 08, 2024 at 05:13:54PM +, Daniel P. Berrangé wrote:
> Currently we're only generating .stp definitions for the system and
> user emulators forgetting all about the tools which support tracing
> too.
> 
> Daniel P. Berrangé (2):
>   tracetool: remove redundant --target-type / --target-name args
>   meson: generate .stp files for tools too
> 
>  docs/devel/tracing.rst |  3 +-
>  meson.build| 63 +++---
>  scripts/tracetool.py   | 24 
>  3 files changed, 46 insertions(+), 44 deletions(-)
> 
> -- 
> 2.43.0
> 

Thanks, applied to my tracing tree:
https://gitlab.com/stefanha/qemu/commits/tracing

Stefan




Re: [PATCH 0/2] trace: fix ability to use systemtap with qemu tools

2024-03-12 Thread Stefan Hajnoczi
On Tue, Mar 12, 2024 at 11:34:21AM +, Daniel P. Berrangé wrote:
> Ping again for these patches to get into this release.
> 
> On Fri, Feb 16, 2024 at 01:09:33PM +, Daniel P. Berrangé wrote:
> > Ping: Stefan, are you OK with picking up these trace patches
> > for merge ?

Sorry, I missed them.

Stefan




[PULL 2/2] meson: generate .stp files for tools too

2024-03-12 Thread Stefan Hajnoczi
From: Daniel P. Berrangé 

The qemu-img, qemu-io, qemu-nbd, qemu-storage-daemon tools all have
support for systemtap tracing built-in, so should be given corresponding
.stp files to define their probes.

Signed-off-by: Daniel P. Berrangé 
Reviewed-by: Eric Blake 
Message-id: 20240108171356.1037059-3-berra...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 meson.build | 61 +++--
 1 file changed, 40 insertions(+), 21 deletions(-)

diff --git a/meson.build b/meson.build
index b8d40d7c0a..520f862d7b 100644
--- a/meson.build
+++ b/meson.build
@@ -3808,6 +3808,7 @@ if host_os == 'darwin'
   entitlement = find_program('scripts/entitlement.sh')
 endif
 
+traceable = []
 emulators = {}
 foreach target : target_dirs
   config_target = config_target_mak[target]
@@ -3976,27 +3977,11 @@ foreach target : target_dirs
   emulators += {exe['name']: emulator}
 endif
 
-if stap.found()
-  foreach stp: [
-{'ext': '.stp-build', 'fmt': 'stap', 'bin': meson.current_build_dir() 
/ exe['name'], 'install': false},
-{'ext': '.stp', 'fmt': 'stap', 'bin': get_option('prefix') / 
get_option('bindir') / exe['name'], 'install': true},
-{'ext': '-simpletrace.stp', 'fmt': 'simpletrace-stap', 'bin': '', 
'install': true},
-{'ext': '-log.stp', 'fmt': 'log-stap', 'bin': '', 'install': true},
-  ]
-custom_target(exe['name'] + stp['ext'],
-  input: trace_events_all,
-  output: exe['name'] + stp['ext'],
-  install: stp['install'],
-  install_dir: get_option('datadir') / 'systemtap/tapset',
-  command: [
-tracetool, '--group=all', '--format=' + stp['fmt'],
-'--binary=' + stp['bin'],
-'--probe-prefix=qemu.' + target_type + '.' + 
target_name,
-'@INPUT@', '@OUTPUT@'
-  ],
-  depend_files: tracetool_depends)
-  endforeach
-endif
+traceable += [{
+  'exe': exe['name'],
+  'probe-prefix': 'qemu.' + target_type + '.' + target_name,
+}]
+
   endforeach
 endforeach
 
@@ -4031,6 +4016,14 @@ if have_tools
install: true)
 
   subdir('storage-daemon')
+
+  foreach exe: [ 'qemu-img', 'qemu-io', 'qemu-nbd', 'qemu-storage-daemon']
+traceable += [{
+  'exe': exe,
+  'probe-prefix': 'qemu.' + exe.substring(5).replace('-', '_')
+}]
+  endforeach
+
   subdir('contrib/rdmacm-mux')
   subdir('contrib/elf2dmp')
 
@@ -4063,6 +4056,32 @@ if have_tools
   endif
 endif
 
+if stap.found()
+  foreach t: traceable
+foreach stp: [
+  {'ext': '.stp-build', 'fmt': 'stap', 'bin': meson.current_build_dir() / 
t['exe'], 'install': false},
+  {'ext': '.stp', 'fmt': 'stap', 'bin': get_option('prefix') / 
get_option('bindir') / t['exe'], 'install': true},
+  {'ext': '-simpletrace.stp', 'fmt': 'simpletrace-stap', 'bin': '', 
'install': true},
+  {'ext': '-log.stp', 'fmt': 'log-stap', 'bin': '', 'install': true},
+]
+  cmd = [
+tracetool, '--group=all', '--format=' + stp['fmt'],
+'--binary=' + stp['bin'],
+'--probe-prefix=' + t['probe-prefix'],
+'@INPUT@', '@OUTPUT@'
+  ]
+
+  custom_target(t['exe'] + stp['ext'],
+input: trace_events_all,
+output: t['exe'] + stp['ext'],
+install: stp['install'],
+install_dir: get_option('datadir') / 'systemtap/tapset',
+command: cmd,
+depend_files: tracetool_depends)
+endforeach
+  endforeach
+endif
+
 subdir('scripts')
 subdir('tools')
 subdir('pc-bios')
-- 
2.44.0




Re: [PATCH 1/2] tracetool: remove redundant --target-type / --target-name args

2024-03-12 Thread Stefan Hajnoczi
On Mon, Jan 08, 2024 at 04:50:40PM -0500, John Snow wrote:
> On Mon, Jan 8, 2024 at 12:14 PM Daniel P. Berrangé  
> wrote:
> >
> > The --target-type and --target-name args are used to construct
> > the default probe prefix if '--probe-prefix' is not given. The
> > meson.build will always pass '--probe-prefix', so the other args
> > are effectively redundant.
> >
> > Signed-off-by: Daniel P. Berrangé 
> 
> Fine by me, provided there's no reason anyone is calling tracetool
> manually for some reason I haven't thought about. I assume we'll hear
> about it if so...

That's okay. tracetool.py is internal to QEMU so we're free to change
the command-line options.

Stefan




[PULL 1/2] tracetool: remove redundant --target-type / --target-name args

2024-03-12 Thread Stefan Hajnoczi
From: Daniel P. Berrangé 

The --target-type and --target-name args are used to construct
the default probe prefix if '--probe-prefix' is not given. The
meson.build will always pass '--probe-prefix', so the other args
are effectively redundant.

Signed-off-by: Daniel P. Berrangé 
Reviewed-by: John Snow 
Message-id: 20240108171356.1037059-2-berra...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 docs/devel/tracing.rst |  3 +--
 meson.build|  2 --
 scripts/tracetool.py   | 24 +---
 3 files changed, 6 insertions(+), 23 deletions(-)

diff --git a/docs/devel/tracing.rst b/docs/devel/tracing.rst
index d288480db1..043bed7fd0 100644
--- a/docs/devel/tracing.rst
+++ b/docs/devel/tracing.rst
@@ -357,8 +357,7 @@ probes::
 
 scripts/tracetool.py --backends=dtrace --format=stap \
  --binary path/to/qemu-binary \
- --target-type system \
- --target-name x86_64 \
+ --probe-prefix qemu.system.x86_64 \
  --group=all \
  trace-events-all \
  qemu.stp
diff --git a/meson.build b/meson.build
index f9dbe7634e..b8d40d7c0a 100644
--- a/meson.build
+++ b/meson.build
@@ -3991,8 +3991,6 @@ foreach target : target_dirs
   command: [
 tracetool, '--group=all', '--format=' + stp['fmt'],
 '--binary=' + stp['bin'],
-'--target-name=' + target_name,
-'--target-type=' + target_type,
 '--probe-prefix=qemu.' + target_type + '.' + 
target_name,
 '@INPUT@', '@OUTPUT@'
   ],
diff --git a/scripts/tracetool.py b/scripts/tracetool.py
index ab7653a5ce..5de9ce96d3 100755
--- a/scripts/tracetool.py
+++ b/scripts/tracetool.py
@@ -44,12 +44,9 @@ def error_opt(msg = None):
 --help   This help message.
 --list-backends  Print list of available backends.
 --check-backends Check if the given backend is valid.
---binary   Full path to QEMU binary.
---target-type  QEMU emulator target type ('system' or 'user').
---target-name  QEMU emulator target name.
---groupName of the event group
---probe-prefix   Prefix for dtrace probe names
- (default: qemu--).\
+--binary   Full path to QEMU binary (required for 'stap' 
backend).
+--groupName of the event group.
+--probe-prefix   Prefix for dtrace probe names (required for 
'stap' backend).
 """ % {
 "script" : _SCRIPT,
 "backends" : backend_descr,
@@ -67,7 +64,7 @@ def main(args):
 
 long_opts = ["backends=", "format=", "help", "list-backends",
  "check-backends", "group="]
-long_opts += ["binary=", "target-type=", "target-name=", "probe-prefix="]
+long_opts += ["binary=", "probe-prefix="]
 
 try:
 opts, args = getopt.getopt(args[1:], "", long_opts)
@@ -79,8 +76,6 @@ def main(args):
 arg_format = ""
 arg_group = None
 binary = None
-target_type = None
-target_name = None
 probe_prefix = None
 for opt, arg in opts:
 if opt == "--help":
@@ -102,10 +97,6 @@ def main(args):
 
 elif opt == "--binary":
 binary = arg
-elif opt == '--target-type':
-target_type = arg
-elif opt == '--target-name':
-target_name = arg
 elif opt == '--probe-prefix':
 probe_prefix = arg
 
@@ -127,13 +118,8 @@ def main(args):
 if arg_format == "stap":
 if binary is None:
 error_opt("--binary is required for SystemTAP tapset generator")
-if probe_prefix is None and target_type is None:
-error_opt("--target-type is required for SystemTAP tapset 
generator")
-if probe_prefix is None and target_name is None:
-error_opt("--target-name is required for SystemTAP tapset 
generator")
-
 if probe_prefix is None:
-probe_prefix = ".".join(["qemu", target_type, target_name])
+error_opt("--probe-prefix is required for SystemTAP tapset 
generator")
 
 if len(args) < 2:
 error_opt("missing trace-events and output filepaths")
-- 
2.44.0




Re: [PATCH v7 3/4] qcow2: add zoned emulation capability

2024-03-12 Thread Stefan Hajnoczi
On Mon, Jan 22, 2024 at 07:48:29PM +0100, Sam Li wrote:
> By adding zone operations and zoned metadata, the zoned emulation
> capability enables full emulation support of zoned device using
> a qcow2 file. The zoned device metadata includes zone type,
> zoned device state and write pointer of each zone, which is stored
> to an array of unsigned integers.
> 
> Each zone of a zoned device makes state transitions following
> the zone state machine. The zone state machine mainly describes
> five states, IMPLICIT OPEN, EXPLICIT OPEN, FULL, EMPTY and CLOSED.
> READ ONLY and OFFLINE states will generally be affected by device
> internal events. The operations on zones cause corresponding state
> changes.
> 
> Zoned devices have a limit on zone resources, which puts constraints on
> write operations into zones. It is managed by active zone lists
> following LRU policy.
> 
> Signed-off-by: Sam Li 
> ---
>  block/qcow2.c| 791 ++-
>  block/trace-events   |   2 +
>  include/qemu/queue.h |   1 +
>  3 files changed, 792 insertions(+), 2 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index b987f1e751..db28585b82 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -195,6 +195,274 @@ qcow2_extract_crypto_opts(QemuOpts *opts, const char 
> *fmt, Error **errp)
>  return cryptoopts_qdict;
>  }
>  
> +#define QCOW2_ZT_IS_CONV(wp) (wp & 1ULL << 59)
> +
> +/*
> + * To emulate a real zoned device, closed, empty and full states are
> + * preserved after a power cycle. The open states are in-memory and will
> + * be lost after closing the device. Read-only and offline states are
> + * device-internal events, which are not considered for simplicity.
> + */
> +static inline BlockZoneState qcow2_get_zone_state(BlockDriverState *bs,
> +  uint32_t index)
> +{
> +BDRVQcow2State *s = bs->opaque;
> +Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +uint64_t zone_wp = bs->wps->wp[index];
> +uint64_t zone_start;
> +
> +if (QCOW2_ZT_IS_CONV(zone_wp)) {
> +return BLK_ZS_NOT_WP;
> +}
> +
> +if (QLIST_IS_INSERTED(zone_entry, exp_open_zone_entry)) {
> +return BLK_ZS_EOPEN;
> +}
> +if (QLIST_IS_INSERTED(zone_entry, imp_open_zone_entry)) {
> +return BLK_ZS_IOPEN;
> +}
> +
> +zone_start = index * bs->bl.zone_size;
> +if (zone_wp == zone_start) {
> +return BLK_ZS_EMPTY;
> +}
> +if (zone_wp >= zone_start + bs->bl.zone_capacity) {
> +return BLK_ZS_FULL;
> +}
> +if (zone_wp > zone_start) {
> +if (!QLIST_IS_INSERTED(zone_entry, closed_zone_entry)) {
> +/*
> + * The number of closed zones is not always updated in time when
> + * the device is closed. However, it only matters when doing
> + * zone report. Refresh the count and list of closed zones to
> + * provide correct zone states for zone report.
> + */
> +QLIST_INSERT_HEAD(&s->closed_zones, zone_entry, 
> closed_zone_entry);
> +s->nr_zones_closed++;
> +}
> +return BLK_ZS_CLOSED;
> +}
> +return BLK_ZS_NOT_WP;
> +}
> +
> +static void qcow2_rm_exp_open_zone(BDRVQcow2State *s,
> +   uint32_t index)
> +{
> +Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +QLIST_REMOVE(zone_entry, exp_open_zone_entry);
> +s->nr_zones_exp_open--;
> +}
> +
> +static void qcow2_rm_imp_open_zone(BDRVQcow2State *s,
> +   int32_t index)
> +{
> +Qcow2ZoneListEntry *zone_entry;
> +if (index < 0) {
> +/* Apply LRU when the index is not specified. */
> +zone_entry = QLIST_LAST(&s->imp_open_zones, imp_open_zone_entry);
> +} else {
> +zone_entry = &s->zone_list_entries[index];
> +}
> +
> +QLIST_REMOVE(zone_entry, imp_open_zone_entry);
> +s->nr_zones_imp_open--;
> +}
> +
> +static void qcow2_rm_open_zone(BDRVQcow2State *s,
> +   uint32_t index)
> +{
> +Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +if (QLIST_IS_INSERTED(zone_entry, exp_open_zone_entry)) {
> +qcow2_rm_exp_open_zone(s, index);
> +} else if (QLIST_IS_INSERTED(zone_entry, imp_open_zone_entry)) {
> +qcow2_rm_imp_open_zone(s, index);
> +}
> +}
> +
> +static void qcow2_rm_closed_zone(BDRVQcow2State *s,
> + uint32_t index)
> +{
> +Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +QLIST_REMOVE(zone_entry, closed_zone_entry);
> +s->nr_zones_closed--;
> +}
> +
> +static void qcow2_do_imp_open_zone(BDRVQcow2State *s,
> +   uint32_t index,
> +   BlockZoneState zs)
> +{
> +Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +switch (zs) {
> +case BLK_ZS_EMPTY:
> + 

Re: [PATCH for 9.0] migration: Skip only empty block devices

2024-03-12 Thread Stefan Hajnoczi
On Tue, Mar 12, 2024 at 01:04:31PM +0100, Cédric Le Goater wrote:
> The block .save_setup() handler calls a helper routine
> init_blk_migration() which builds a list of block devices to take into
> account for migration. When one device is found to be empty (sectors
> == 0), the loop exits and all the remaining devices are ignored. This
> is a regression introduced when bdrv_iterate() was removed.
> 
> Change that by skipping only empty devices.
> 
> Cc: Markus Armbruster 
> Suggested-by: Kevin Wolf 
> Fixes: fea68bb6e9fa ("block: Eliminate bdrv_iterate(), use bdrv_next()")

It's not clear to me that fea68bb6e9fa introduced the bug. The
"sectors <= 0" check is still there and I don't see anything else that
skips empty devices.
Can you explain the bug in fea68bb6e9fa?

Otherwise:
Reviewed-by: Stefan Hajnoczi 




[PATCH v2] virtio-blk: iothread-vq-mapping coroutine pool sizing

2024-03-12 Thread Stefan Hajnoczi
It is possible to hit the sysctl vm.max_map_count limit when the
coroutine pool size becomes large. Each coroutine requires two mappings
(one for the stack and one for the guard page). QEMU can crash with
"failed to set up stack guard page" or "failed to allocate memory for
stack" when this happens.

Coroutine pool sizing is simple when there is only one thread: sum up all
I/O requests across all virtqueues.

When the iothread-vq-mapping option is used we should calculate tighter
bounds because each thread may serve only a subset of the device's
virtqueues: take the maximum of the per-thread I/O request totals. A
thread does not need coroutine pool space for I/O requests
that are handled by other threads.

This is not a solution to hitting vm.max_map_count, but it helps. A
guest with 64 vCPUs (hence 64 virtqueues) across 4 IOThreads with one
iothread-vq-mapping virtio-blk device and a root disk without goes from
pool_max_size 16,448 to 10,304.

Reported-by: Sanjay Rao 
Reported-by: Boaz Ben Shabat 
Reported-by: Joe Mario 
Signed-off-by: Stefan Hajnoczi 
---
v2:
- State the the tighter bounds reflect the fact that threads may only
  process a subset of the total I/O requests from a device [Kevin]
- Add Reported-by: Joe Mario, he has been investigating this issue.

 include/hw/virtio/virtio-blk.h |  2 ++
 hw/block/virtio-blk.c  | 34 --
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 5c14110c4b..ac29700ad4 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -74,6 +74,8 @@ struct VirtIOBlock {
 uint64_t host_features;
 size_t config_size;
 BlockRAMRegistrar blk_ram_registrar;
+
+unsigned coroutine_pool_size;
 };
 
 typedef struct VirtIOBlockReq {
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 738cb2ac36..0a14b2b175 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1957,6 +1957,35 @@ static void virtio_blk_stop_ioeventfd(VirtIODevice *vdev)
 s->ioeventfd_stopping = false;
 }
 
+/* Increase the coroutine pool size to include our I/O requests */
+static void virtio_blk_inc_coroutine_pool_size(VirtIOBlock *s)
+{
+VirtIOBlkConf *conf = &s->conf;
+unsigned max_requests = 0;
+
+/* Tracks the total number of requests for AioContext */
+g_autoptr(GHashTable) counters = g_hash_table_new(NULL, NULL);
+
+/* Call this function after setting up vq_aio_context[] */
+assert(s->vq_aio_context);
+
+for (unsigned i = 0; i < conf->num_queues; i++) {
+AioContext *ctx = s->vq_aio_context[i];
+unsigned n = GPOINTER_TO_UINT(g_hash_table_lookup(counters, ctx));
+
+n += conf->queue_size / 2; /* this is a heuristic */
+
+g_hash_table_insert(counters, ctx, GUINT_TO_POINTER(n));
+
+if (n > max_requests) {
+max_requests = n;
+}
+}
+
+qemu_coroutine_inc_pool_size(max_requests);
+s->coroutine_pool_size = max_requests; /* stash it for ->unrealize() */
+}
+
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -2048,7 +2077,6 @@ static void virtio_blk_device_realize(DeviceState *dev, 
Error **errp)
 for (i = 0; i < conf->num_queues; i++) {
 virtio_add_queue(vdev, conf->queue_size, virtio_blk_handle_output);
 }
-qemu_coroutine_inc_pool_size(conf->num_queues * conf->queue_size / 2);
 
 /* Don't start ioeventfd if transport does not support notifiers. */
 if (!virtio_device_ioeventfd_enabled(vdev)) {
@@ -2065,6 +2093,8 @@ static void virtio_blk_device_realize(DeviceState *dev, 
Error **errp)
 return;
 }
 
+virtio_blk_inc_coroutine_pool_size(s);
+
 /*
  * This must be after virtio_init() so virtio_blk_dma_restart_cb() gets
  * called after ->start_ioeventfd() has already set blk's AioContext.
@@ -2096,7 +2126,7 @@ static void virtio_blk_device_unrealize(DeviceState *dev)
 for (i = 0; i < conf->num_queues; i++) {
 virtio_del_queue(vdev, i);
 }
-qemu_coroutine_dec_pool_size(conf->num_queues * conf->queue_size / 2);
+qemu_coroutine_dec_pool_size(s->coroutine_pool_size);
 qemu_mutex_destroy(&s->rq_lock);
 blk_ram_registrar_destroy(&s->blk_ram_registrar);
 qemu_del_vm_change_state_handler(s->change);
-- 
2.44.0




Re: [PATCH v7 2/4] qcow2: add configurations for zoned format extension

2024-03-12 Thread Stefan Hajnoczi
On Mon, Jan 22, 2024 at 07:48:28PM +0100, Sam Li wrote:
> To configure the zoned format feature on the qcow2 driver, it
> requires settings as: the device size, zone model, zone size,
> zone capacity, number of conventional zones, limits on zone
> resources (max append bytes, max open zones, and max_active_zones).
> 
> To create a qcow2 image with zoned format feature, use command like
> this:
> qemu-img create -f qcow2 zbc.qcow2 -o size=768M \
> -o zone.size=64M -o zone.capacity=64M -o zone.conventional_zones=0 \
> -o zone.max_append_bytes=4096 -o zone.max_open_zones=6 \
> -o zone.max_active_zones=8 -o zone.mode=host-managed
> 
> Signed-off-by: Sam Li 
> ---
>  block/qcow2.c| 252 ++-
>  block/qcow2.h|  36 -
>  docs/interop/qcow2.txt   | 107 -
>  include/block/block_int-common.h |  13 ++
>  qapi/block-core.json |  67 +++-
>  5 files changed, 469 insertions(+), 6 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 9bee66fff5..b987f1e751 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -73,6 +73,7 @@ typedef struct {
>  #define  QCOW2_EXT_MAGIC_CRYPTO_HEADER 0x0537be77
>  #define  QCOW2_EXT_MAGIC_BITMAPS 0x23852875
>  #define  QCOW2_EXT_MAGIC_DATA_FILE 0x44415441
> +#define  QCOW2_EXT_MAGIC_ZONED_FORMAT 0x007a6264
>  
>  static int coroutine_fn
>  qcow2_co_preadv_compressed(BlockDriverState *bs,
> @@ -194,6 +195,68 @@ qcow2_extract_crypto_opts(QemuOpts *opts, const char 
> *fmt, Error **errp)
>  return cryptoopts_qdict;
>  }
>  
> +/*
> + * Passing by the zoned device configurations by a zoned_header struct, check
> + * if the zone device options are under constraints. Return false when some
> + * option is invalid
> + */
> +static inline bool
> +qcow2_check_zone_options(Qcow2ZonedHeaderExtension *zone_opt)
> +{
> +if (zone_opt) {
> +uint32_t sequential_zones;
> +
> +if (zone_opt->zone_size == 0) {
> +error_report("Zoned extension header zone_size field "
> + "can not be 0");

Please add an Error **errp argument to qcow2_check_zone_options() and
use error_setg() instead of calling error_report(). That way the caller
can decide how to print or propagate error messages.
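
Something along these lines (untested sketch, only the first checks
shown; the remaining ones would be converted the same way):

static bool
qcow2_check_zone_options(Qcow2ZonedHeaderExtension *zone_opt, Error **errp)
{
    if (!zone_opt) {
        error_setg(errp, "Missing zoned header extension");
        return false;
    }

    if (zone_opt->zone_size == 0) {
        error_setg(errp,
                   "Zoned extension header zone_size field can not be 0");
        return false;
    }

    /* ... */
    return true;
}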

> +return false;
> +}
> +
> +if (zone_opt->zone_capacity > zone_opt->zone_size) {
> +error_report("zone capacity %" PRIu32 "B exceeds zone size "
> + "%" PRIu32 "B", zone_opt->zone_capacity,
> + zone_opt->zone_size);
> +return false;
> +}
> +
> +if (zone_opt->max_append_bytes + BDRV_SECTOR_SIZE >=
> +zone_opt->zone_capacity) {
> +error_report("max append bytes %" PRIu32 "B exceeds zone "
> + "capacity %" PRIu32 "B by more than block size",
> + zone_opt->zone_capacity,
> + zone_opt->max_append_bytes);
> +return false;
> +}
> +
> +if (zone_opt->max_active_zones > zone_opt->nr_zones) {
> +error_report("Max_active_zones %" PRIu32 " exceeds "
> + "nr_zones %" PRIu32 ". Set it to nr_zones.",
> + zone_opt->max_active_zones, zone_opt->nr_zones);
> +zone_opt->max_active_zones = zone_opt->nr_zones;
> +}
> +
> +if (zone_opt->max_open_zones > zone_opt->max_active_zones) {
> +error_report("Max_open_zones %" PRIu32 " exceeds "
> + "max_active_zones %" PRIu32 ". Set it to "
> + "max_active_zones.",
> + zone_opt->max_open_zones,
> + zone_opt->max_active_zones);
> +zone_opt->max_open_zones = zone_opt->max_active_zones;
> +}
> +
> +sequential_zones = zone_opt->nr_zones - zone_opt->conventional_zones;
> +if (zone_opt->max_open_zones > sequential_zones) {

Please check that conventional_zones < zone_opt->nr_zones first to avoid
integer underflow.

> +error_report("Max_open_zones field can not be larger "
> + "than the number of SWR zones. Set it to number of "
> + "SWR zones %" PRIu32 ".", sequential_zones);
> +zone_opt->max_open_zones = sequential_zones;
> +}
> +
> +return true;
> +}
> +return false;
> +}
> +
>  /*
>   * read qcow2 extension and fill bs
>   * start reading from start_offset
> @@ -211,6 +274,7 @@ qcow2_read_extensions(BlockDriverState *bs, uint64_t 
> start_offset,
>  uint64_t offset;
>  int ret;
>  Qcow2BitmapHeaderExt bitmaps_ext;
> +Qcow2ZonedHeaderExtension zoned_ext;
>  
>  if (need_update_header != NULL) {
>  *need_update_header = false;
> @@ -432,6 +496,51 @@ qcow2_read_extensions(BlockDriverState *bs, uint64_t 
> start_offset,
>  break;
>  }
>  

Re: [PATCH v2] linux-aio: add IO_CMD_FDSYNC command support

2024-03-12 Thread Stefan Hajnoczi
On Tue, Mar 12, 2024 at 07:07:04PM +0530, Prasad Pandit wrote:
> Hello,
> 
> On Tue, 12 Mar 2024 at 15:15, Kevin Wolf  wrote:
> > Am 11.03.2024 um 20:36 hat Stefan Hajnoczi geschrieben:
> > > > > That can be avoided with a variable that keeps track of whether 
> > > > > -EINVAL was seen before and skips Linux AIO in that
> > > > > case.
> > > > >
> > > > > Fallback should be very rare, so I don't think it needs to be 
> > > > > optimized:
> > You're right. I missed that io_submit() returns failure only if the
> > first request in the queue is invalid, and returns a "short submission"
> > for errors in later entries.
> 
> ===
> +bool laio_has_fdsync(int fd)
> +{
> +AioContext *ctx = qemu_get_current_aio_context();
> +struct qemu_laiocb cb = {
> +.co = qemu_coroutine_self(),
> +.ctx= aio_get_linux_aio(ctx),
> +};
> +struct iocb *iocbs[] = {&cb.iocb, NULL};
> +
> +/* check if host kernel supports IO_CMD_FDSYNC */
> +io_prep_fdsync(&cb.iocb, fd);
> +int ret = io_submit(cb.ctx->ctx, 1, iocbs);
> +
> +return ret != -EINVAL;
> +}
> ===
> 
> To confirm:
> * Do we need a revised patch V3? I'm testing one with the above
> function to check if IO_CMD_FDSYNC is supported. If it returns true we
> call laio_co_submit(..., QEMU_AIO_FLUSH, ), else fallback to
> thread-pool.

If you are already developing and testing it then I think a v3 with
laio_has_fdsync() would be great.

Thanks,
Stefan




Re: [PATCH] virtio-blk: iothread-vq-mapping coroutine pool sizing

2024-03-12 Thread Stefan Hajnoczi
On Tue, Mar 12, 2024 at 12:24:30PM +0100, Kevin Wolf wrote:
> Am 11.03.2024 um 21:14 hat Stefan Hajnoczi geschrieben:
> > It is possible to hit the sysctl vm.max_map_count limit when the
> > coroutine pool size becomes large. Each coroutine requires two mappings
> > (one for the stack and one for the guard page). QEMU can crash with
> > "failed to set up stack guard page" or "failed to allocate memory for
> > stack" when this happens.
> > 
> > Coroutine pool sizing is simple when there is only one AioContext: sum
> > up all I/O requests across all virtqueues.
> > 
> > When the iothread-vq-mapping option is used we should calculate tighter
> > bounds: take the maximum number of the number of I/O requests across all
> > virtqueues. This number is lower than simply summing all virtqueues when
> > only a subset of the virtqueues is handled by each AioContext.
> 
> The reasoning is that each thread has its own coroutine pool for which
> the pool size applies individually, and it doesn't need to have space
> for coroutines running in a different thread, right? I'd like to have
> this recorded in the commit message.

Right, I'll update the commit description to mention this.

I also forgot to include Joe Mario in the Reported-by tags. Joe has been
helping investigate the issue.

> Of course, this also makes me wonder if a global coroutine pool size
> really makes sense or if it should be per thread. One thread could be
> serving only one queue (maybe the main thread with a CD-ROM device) and
> another thread 32 queues (the iothread with the interesting disks).
> There is no reason for the first thread to have a coroutine pool as big
> as the second one.

Agreed. The main loop thread, in particular, has very different
coroutine pool sizing requirements from the IOThreads that are
performing the bulk of the I/O.

> But before we make the size thread-local, maybe having thread-local
> pools wasn't right to begin with because multiple threads can run main
> context code and they should therefore share the same coroutine pool (we
> already had the problem earlier that coroutines start on the vcpu thread
> and terminate on the main thread and this plays havoc with coroutine
> pools).
> 
> Maybe per-AioContext pools with per-AioContext sizes would make more
> sense?

That's a good observation. Originally I hoped we could keep the
coroutine code independent of AioContext, but it is already connected in
other ways (e.g. co->ctx) so there's no reason to avoid it.

I'm not happy with how coroutine pools perform. I reproduced the
max_map_count failures that Sanjay, Boaz, and Joe have encountered and
found that one IOThread had 10k coroutines pooled while another
IOThread's pool was empty. Using timers to decay the pool size when it's
not actively in use has come to mind.

On the other hand, the fundamental problem is that the guest has access
to more virtqueue space than QEMU can allocate coroutines. Adding timers
to decay the pool size might hide issues but a guest could fill all
virtqueues at once and still crash.

QEMU needs to be reliable and not run out of resources. There is no
accounting code in place capable of tracking all possible coroutine
usage, but maybe QEMU should refuse to launch when the virtqueue
resources promised to the guest exceed vm.max_map_count...
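
As a purely hypothetical illustration (invented name and policy, not
proposed code), such a launch-time check could boil down to:

#include <stdbool.h>
#include <stdio.h>

static bool vq_resources_fit_max_map_count(unsigned long total_vq_requests)
{
    unsigned long max_map_count = 65530; /* Linux default */
    FILE *f = fopen("/proc/sys/vm/max_map_count", "r");

    if (f) {
        if (fscanf(f, "%lu", &max_map_count) != 1) {
            max_map_count = 65530;
        }
        fclose(f);
    }

    /* each coroutine costs 2 mappings (stack + guard page); leave half
     * of the limit as headroom for everything else */
    return 2 * total_vq_requests <= max_map_count / 2;
}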

> > This is not a solution to hitting vm.max_map_count, but it helps. A
> > guest with 64 vCPUs (hence 64 virtqueues) across 4 IOThreads with one
> > iothread-vq-mapping virtio-blk device and a root disk without goes from
> > pool_max_size 16,448 to 10,304.
> > 
> > Reported-by: Sanjay Rao 
> > Reported-by: Boaz Ben Shabat 
> > Signed-off-by: Stefan Hajnoczi 
> 
> Either way, this should already strictly improve the situation, so I'm
> happy to apply this change for now.

Thanks, I will send a v2.

Stefan




[PATCH] virtio-blk: iothread-vq-mapping coroutine pool sizing

2024-03-11 Thread Stefan Hajnoczi
It is possible to hit the sysctl vm.max_map_count limit when the
coroutine pool size becomes large. Each coroutine requires two mappings
(one for the stack and one for the guard page). QEMU can crash with
"failed to set up stack guard page" or "failed to allocate memory for
stack" when this happens.

Coroutine pool sizing is simple when there is only one AioContext: sum
up all I/O requests across all virtqueues.

When the iothread-vq-mapping option is used we should calculate tighter
bounds: take the maximum of the per-AioContext I/O request totals. This
number is lower than simply summing all virtqueues when
only a subset of the virtqueues is handled by each AioContext.

This is not a solution to hitting vm.max_map_count, but it helps. A
guest with 64 vCPUs (hence 64 virtqueues) across 4 IOThreads with one
iothread-vq-mapping virtio-blk device and a root disk without it goes from
pool_max_size 16,448 to 10,304.

Reported-by: Sanjay Rao 
Reported-by: Boaz Ben Shabat 
Signed-off-by: Stefan Hajnoczi 
---
 include/hw/virtio/virtio-blk.h |  2 ++
 hw/block/virtio-blk.c  | 34 --
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 5c14110c4b..ac29700ad4 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -74,6 +74,8 @@ struct VirtIOBlock {
 uint64_t host_features;
 size_t config_size;
 BlockRAMRegistrar blk_ram_registrar;
+
+unsigned coroutine_pool_size;
 };
 
 typedef struct VirtIOBlockReq {
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 738cb2ac36..0a14b2b175 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1957,6 +1957,35 @@ static void virtio_blk_stop_ioeventfd(VirtIODevice *vdev)
 s->ioeventfd_stopping = false;
 }
 
+/* Increase the coroutine pool size to include our I/O requests */
+static void virtio_blk_inc_coroutine_pool_size(VirtIOBlock *s)
+{
+VirtIOBlkConf *conf = &s->conf;
+unsigned max_requests = 0;
+
+/* Tracks the total number of requests for AioContext */
+g_autoptr(GHashTable) counters = g_hash_table_new(NULL, NULL);
+
+/* Call this function after setting up vq_aio_context[] */
+assert(s->vq_aio_context);
+
+for (unsigned i = 0; i < conf->num_queues; i++) {
+AioContext *ctx = s->vq_aio_context[i];
+unsigned n = GPOINTER_TO_UINT(g_hash_table_lookup(counters, ctx));
+
+n += conf->queue_size / 2; /* this is a heuristic */
+
+g_hash_table_insert(counters, ctx, GUINT_TO_POINTER(n));
+
+if (n > max_requests) {
+max_requests = n;
+}
+}
+
+qemu_coroutine_inc_pool_size(max_requests);
+s->coroutine_pool_size = max_requests; /* stash it for ->unrealize() */
+}
+
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -2048,7 +2077,6 @@ static void virtio_blk_device_realize(DeviceState *dev, 
Error **errp)
 for (i = 0; i < conf->num_queues; i++) {
 virtio_add_queue(vdev, conf->queue_size, virtio_blk_handle_output);
 }
-qemu_coroutine_inc_pool_size(conf->num_queues * conf->queue_size / 2);
 
 /* Don't start ioeventfd if transport does not support notifiers. */
 if (!virtio_device_ioeventfd_enabled(vdev)) {
@@ -2065,6 +2093,8 @@ static void virtio_blk_device_realize(DeviceState *dev, 
Error **errp)
 return;
 }
 
+virtio_blk_inc_coroutine_pool_size(s);
+
 /*
  * This must be after virtio_init() so virtio_blk_dma_restart_cb() gets
  * called after ->start_ioeventfd() has already set blk's AioContext.
@@ -2096,7 +2126,7 @@ static void virtio_blk_device_unrealize(DeviceState *dev)
 for (i = 0; i < conf->num_queues; i++) {
 virtio_del_queue(vdev, i);
 }
-qemu_coroutine_dec_pool_size(conf->num_queues * conf->queue_size / 2);
+qemu_coroutine_dec_pool_size(s->coroutine_pool_size);
 qemu_mutex_destroy(&s->rq_lock);
 blk_ram_registrar_destroy(&s->blk_ram_registrar);
 qemu_del_vm_change_state_handler(s->change);
-- 
2.44.0




Re: [PATCH v2] linux-aio: add IO_CMD_FDSYNC command support

2024-03-11 Thread Stefan Hajnoczi
On Mon, Mar 11, 2024 at 04:40:05PM +0100, Kevin Wolf wrote:
> Am 11.03.2024 um 14:09 hat Stefan Hajnoczi geschrieben:
> > On Mon, Mar 11, 2024 at 11:13:33AM +0530, Prasad Pandit wrote:
> > > From: Prasad Pandit 
> > > 
> > > Libaio defines IO_CMD_FDSYNC command to sync all outstanding
> > > asynchronous I/O operations, by flushing out file data to the
> > > disk storage.
> > > 
> > > Enable linux-aio to submit such aio request. This helps to
> > > reduce latency induced via pthread_create calls by
> > > thread-pool (aio=threads).
> > > 
> > > Signed-off-by: Prasad Pandit 
> > > ---
> > >  block/file-posix.c | 12 
> > >  block/linux-aio.c  |  5 -
> > >  2 files changed, 16 insertions(+), 1 deletion(-)
> > > 
> > > v2: if IO_CMD_FDSYNC is not supported by the kernel,
> > > fallback on thread-pool flush.
> > >   -> 
> > > https://lists.nongnu.org/archive/html/qemu-devel/2024-03/msg01986.html
> > > 
> > > diff --git a/block/file-posix.c b/block/file-posix.c
> > > index 35684f7e21..4f2195d01d 100644
> > > --- a/block/file-posix.c
> > > +++ b/block/file-posix.c
> > > @@ -2599,6 +2599,18 @@ static int coroutine_fn 
> > > raw_co_flush_to_disk(BlockDriverState *bs)
> > >  if (raw_check_linux_io_uring(s)) {
> > >  return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
> > >  }
> > > +#endif
> > > +#ifdef CONFIG_LINUX_AIO
> > > +if (raw_check_linux_aio(s)) {
> > > +ret = laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
> > > +if (ret >= 0) {
> > > +/*
> > > + * if AIO_FLUSH is supported return
> > > + * else fallback on thread-pool flush.
> > > + */
> > > +return ret;
> > > +}
> > 
> > Falling back every time on an older host kernel might be a noticeable
> > performance regression. That can be avoided with a variable that keeps
> > track of whether -EINVAL was seen before and skips Linux AIO in that
> > case.
> > 
> > However, it appears that popular distributions starting from Debian 10,
> > Ubuntu 20.04, Fedora 27, CentOS 8, and OpenSUSE Leap 15.5 have the
> > necessary minimum Linux 4.18 kernel:
> > https://repology.org/project/linux/versions
> > 
> > Fallback should be very rare, so I don't think it needs to be optimized:
> > 
> > Reviewed-by: Stefan Hajnoczi 
> 
> We might need this approach for a different reason: This is an
> io_submit() error, so while we retry the flush with the fallback path,
> other requests in the same batch may incorrectly return errors. This
> probably explains the errors Prasad saw in the guest when the kernel
> doesn't have support for flush in Linux AIO.
> 
> So in order to avoid this, we'll probably have to send one flush just to
> probe (while making sure that no other request is pending - maybe
> immediately when opening the image?) and then remember whether it
> worked.
> 
> Or we'd have to change the error handling around io_submit(), so that we
> don't always fail the first request in the batch, but first fail any
> flushes and only then the rest of the requests.

I don't see the behavior you are describing in the code. My
interpretation of ioq_submit() is that only the flush request fails.
Other queued requests (before and after the flush) are submitted
successfully:

  static void ioq_submit(LinuxAioState *s)
  {
  int ret, len;
  struct qemu_laiocb *aiocb;
  struct iocb *iocbs[MAX_EVENTS];
  QSIMPLEQ_HEAD(, qemu_laiocb) completed;
  
  do {
  if (s->io_q.in_flight >= MAX_EVENTS) {
  break;
  }
  len = 0;
   QSIMPLEQ_FOREACH(aiocb, &s->io_q.pending, next) {
  iocbs[len++] = >iocb;
  if (s->io_q.in_flight + len >= MAX_EVENTS) {
  break;
  }
  }
  
  ret = io_submit(s->ctx, len, iocbs);
  if (ret == -EAGAIN) {
  break;
  }
  if (ret < 0) {
  /* Fail the first request, retry the rest */
   aiocb = QSIMPLEQ_FIRST(&s->io_q.pending);
   QSIMPLEQ_REMOVE_HEAD(&s->io_q.pending, next);
  s->io_q.in_queue--;
  aiocb->ret = ret;
  qemu_laio_process_completion(aiocb);
  continue;
  }
  
  s->io_q.in_flight += ret;
  s->io_q.in_queue  -= ret;
  aiocb = container_of(iocbs[ret - 1], struct qemu_laiocb, iocb);
   QSIMPLEQ_SPLIT_AFTER(&s->io_q.pending, aiocb, next, &completed);
   } while (ret == len && !QSIMPLEQ_EMPTY(&s->io_q.pending));

Have I missed something?

Thanks,
Stefan




Re: [PATCH v2] linux-aio: add IO_CMD_FDSYNC command support

2024-03-11 Thread Stefan Hajnoczi
On Mon, Mar 11, 2024 at 11:13:33AM +0530, Prasad Pandit wrote:
> From: Prasad Pandit 
> 
> Libaio defines IO_CMD_FDSYNC command to sync all outstanding
> asynchronous I/O operations, by flushing out file data to the
> disk storage.
> 
> Enable linux-aio to submit such aio request. This helps to
> reduce latency induced via pthread_create calls by
> thread-pool (aio=threads).
> 
> Signed-off-by: Prasad Pandit 
> ---
>  block/file-posix.c | 12 
>  block/linux-aio.c  |  5 -
>  2 files changed, 16 insertions(+), 1 deletion(-)
> 
> v2: if IO_CMD_FDSYNC is not supported by the kernel,
> fallback on thread-pool flush.
>   -> https://lists.nongnu.org/archive/html/qemu-devel/2024-03/msg01986.html
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 35684f7e21..4f2195d01d 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -2599,6 +2599,18 @@ static int coroutine_fn 
> raw_co_flush_to_disk(BlockDriverState *bs)
>      if (raw_check_linux_io_uring(s)) {
>          return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
>      }
> +#endif
> +#ifdef CONFIG_LINUX_AIO
> +    if (raw_check_linux_aio(s)) {
> +        ret = laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
> +        if (ret >= 0) {
> +            /*
> +             * if AIO_FLUSH is supported return
> +             * else fallback on thread-pool flush.
> +             */
> +            return ret;
> +        }

Falling back every time on an older host kernel might be a noticeable
performance regression. That can be avoided with a variable that keeps
track of whether -EINVAL was seen before and skips Linux AIO in that
case.
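
For illustration, the caching could look something like this (only a
sketch inside raw_co_flush_to_disk(); the s->laio_flush_broken field is
made up for this example and is not an existing BDRVRawState member):

  #ifdef CONFIG_LINUX_AIO
      if (raw_check_linux_aio(s) && !s->laio_flush_broken) {
          ret = laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
          if (ret != -EINVAL) {
              return ret;
          }
          /* The kernel lacks IO_CMD_FDSYNC; skip Linux AIO from now on */
          s->laio_flush_broken = true;
      }
  #endif
      /* ... fall through to the thread-pool flush ... */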

However, it appears that popular distributions starting from Debian 10,
Ubuntu 20.04, Fedora 27, CentOS 8, and OpenSUSE Leap 15.5 have the
necessary minimum Linux 4.18 kernel:
https://repology.org/project/linux/versions

Fallback should be very rare, so I don't think it needs to be optimized:

Reviewed-by: Stefan Hajnoczi 

> +    }
>  #endif
>      return raw_thread_pool_submit(handle_aiocb_flush, &acb);
>  }
> diff --git a/block/linux-aio.c b/block/linux-aio.c
> index ec05d946f3..d940d029e3 100644
> --- a/block/linux-aio.c
> +++ b/block/linux-aio.c
> @@ -384,6 +384,9 @@ static int laio_do_submit(int fd, struct qemu_laiocb 
> *laiocb, off_t offset,
>      case QEMU_AIO_READ:
>          io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
>          break;
> +    case QEMU_AIO_FLUSH:
> +        io_prep_fdsync(iocbs, fd);
> +        break;
>      /* Currently Linux kernel does not support other operations */
>      default:
>          fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
> @@ -412,7 +415,7 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, 
> QEMUIOVector *qiov,
>      AioContext *ctx = qemu_get_current_aio_context();
>      struct qemu_laiocb laiocb = {
>          .co         = qemu_coroutine_self(),
> -        .nbytes     = qiov->size,
> +        .nbytes     = qiov ? qiov->size : 0,
>          .ctx        = aio_get_linux_aio(ctx),
>          .ret        = -EINPROGRESS,
>          .is_read    = (type == QEMU_AIO_READ),
> -- 
> 2.44.0
> 




Re: no target for a link in the post "How to access libvirt domains in KubeVirt"

2024-03-06 Thread Stefan Hajnoczi
On Wed, 6 Mar 2024 at 15:09,  wrote:
> BTW thank you for your meaningful posts.

Thanks for the kind words!

Stefan



Re: no target for a link in the post "How to access libvirt domains in KubeVirt"

2024-03-06 Thread Stefan Hajnoczi
On Wed, 6 Mar 2024 at 14:52,  wrote:
>
> Hello, thank you for the post.
>
> The href in the link "Virtualization Debugging"
> (https://kubevirt.io/user-guide/debug_virt_stack/)
>
> should be replaced by
> https://kubevirt.io/user-guide/debug_virt_stack/logging/

Hi Alexei,
Thanks for pointing out the broken link! The blog post has been updated.

Stefan



Re: [PATCH] make-release: switch to .xz format by default

2024-03-05 Thread Stefan Hajnoczi
On Mon, 4 Mar 2024 at 13:52, Michael Tokarev  wrote:
>
> For a long time, we provide two compression formats in the
> download area, .bz2 and .xz.  There's absolutely no reason
> to provide two in parallel, .xz compresses better, and all
> the links we use point to .xz.  Downstream distributions
> mostly use .xz too.
>
> For release maintenance, providing two formats is definitely an
> extra burden too.
>
> Signed-off-by: Michael Tokarev 
> ---
>  scripts/make-release | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 

> diff --git a/scripts/make-release b/scripts/make-release
> index 9c570b87f4..6e0433de24 100755
> --- a/scripts/make-release
> +++ b/scripts/make-release
> @@ -47,5 +47,5 @@ meson subprojects download $SUBPROJECTS
>  CryptoPkg/Library/OpensslLib/openssl \
>  MdeModulePkg/Library/BrotliCustomDecompressLib/brotli)
>  popd
> -tar --exclude=.git -cjf ${destination}.tar.bz2 ${destination}
> +tar --exclude=.git -cJf ${destination}.tar.xz ${destination}
>  rm -rf ${destination}
> --
> 2.39.2
>
>



Re: [PATCH] make-release: switch to .xz format by default

2024-03-05 Thread Stefan Hajnoczi
On Tue, 5 Mar 2024 at 04:52, Peter Maydell  wrote:
>
> On Mon, 4 Mar 2024 at 18:52, Michael Tokarev  wrote:
> >
> > For a long time, we provide two compression formats in the
> > download area, .bz2 and .xz.  There's absolutely no reason
> > to provide two in parallel, .xz compresses better, and all
> > the links we use points to .xz.  Downstream distributions
> > mostly use .xz too.
>
> Seems reasonable. Out of curiosity, do we have the
> download stats on how many .xz vs .bz2 downloads we get?
> Stefan or Paolo, do you have the webserver info?
> (Probably not worth bothering if it's a big pain to
> get the data.)

I don't have access to the CDN stats. Maybe Paolo does.

Stefan



Re: [PATCH 13/16] block/virtio-blk: Fix missing ERRP_GUARD() for error_prepend()

2024-02-28 Thread Stefan Hajnoczi
On Thu, Feb 29, 2024 at 12:37:20AM +0800, Zhao Liu wrote:
> From: Zhao Liu 
> 
> As the comment in qapi/error, passing @errp to error_prepend() requires
> ERRP_GUARD():
> 
> * = Why, when and how to use ERRP_GUARD() =
> *
> * Without ERRP_GUARD(), use of the @errp parameter is restricted:
> ...
> * - It should not be passed to error_prepend(), error_vprepend() or
> *   error_append_hint(), because that doesn't work with _fatal.
> * ERRP_GUARD() lifts these restrictions.
> *
> * To use ERRP_GUARD(), add it right at the beginning of the function.
> * @errp can then be used without worrying about the argument being
> * NULL or _fatal.
> 
> ERRP_GUARD() avoids the case where @errp points to error_fatal and
> the user can't see this additional information, because exit() happens
> in error_setg() before the information is added [1].
> 
> The virtio_blk_vq_aio_context_init() passes @errp to error_prepend().
> 
> Though its @errp points to its caller's local @err variable, to follow the
> requirement of @errp, add missing ERRP_GUARD() at the beginning of
> virtio_blk_vq_aio_context_init().
> 
> [1]: Issue description in the commit message of commit ae7c80a7bd73
>  ("error: New macro ERRP_GUARD()").
> 
> Cc: "Michael S. Tsirkin" 
> Cc: Stefan Hajnoczi 
> Cc: Kevin Wolf 
> Cc: Hanna Reitz 
> Signed-off-by: Zhao Liu 
> ---
>  hw/block/virtio-blk.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Stefan Hajnoczi 
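
For anyone skimming the series, the pattern being applied is roughly the
following (a generic sketch; setup_thing() is a made-up helper, not QEMU
code):

  #include "qapi/error.h"

  static bool setup_thing(Error **errp); /* made-up helper */

  static void setup_device(Error **errp)
  {
      ERRP_GUARD(); /* *errp stays usable even if the caller passed error_fatal */

      if (!setup_thing(errp)) {
          /*
           * Without ERRP_GUARD(), error_setg(&error_fatal) inside
           * setup_thing() would have exit()ed before this prepend runs.
           */
          error_prepend(errp, "device setup failed: ");
      }
  }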




Re: [PATCH 09/16] block/qed: Fix missing ERRP_GUARD() for error_prepend()

2024-02-28 Thread Stefan Hajnoczi
On Thu, Feb 29, 2024 at 12:37:16AM +0800, Zhao Liu wrote:
> From: Zhao Liu 
> 
> As the comment in qapi/error, passing @errp to error_prepend() requires
> ERRP_GUARD():
> 
> * = Why, when and how to use ERRP_GUARD() =
> *
> * Without ERRP_GUARD(), use of the @errp parameter is restricted:
> ...
> * - It should not be passed to error_prepend(), error_vprepend() or
> *   error_append_hint(), because that doesn't work with _fatal.
> * ERRP_GUARD() lifts these restrictions.
> *
> * To use ERRP_GUARD(), add it right at the beginning of the function.
> * @errp can then be used without worrying about the argument being
> * NULL or _fatal.
> 
> ERRP_GUARD() avoids the case where @errp points to error_fatal and
> the user can't see this additional information, because exit() happens
> in error_setg() before the information is added [1].
> 
> The bdrv_qed_co_invalidate_cache() passes @errp to error_prepend()
> without ERRP_GUARD().
> 
> Though it is a BlockDriver.bdrv_co_invalidate_cache() method, and
> currently its @errp parameter only points to callers' local_err, to
> follow the requirement of @errp, add missing ERRP_GUARD() at the
> beginning of this function.
> 
> [1]: Issue description in the commit message of commit ae7c80a7bd73
>  ("error: New macro ERRP_GUARD()").
> 
> Cc: Stefan Hajnoczi 
> Cc: Kevin Wolf 
> Cc: Hanna Reitz 
> Signed-off-by: Zhao Liu 
> ---
>  block/qed.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 06/16] block/nvme: Fix missing ERRP_GUARD() for error_prepend()

2024-02-28 Thread Stefan Hajnoczi
On Thu, Feb 29, 2024 at 12:37:13AM +0800, Zhao Liu wrote:
> From: Zhao Liu 
> 
> As the comment in qapi/error, passing @errp to error_prepend() requires
> ERRP_GUARD():
> 
> * = Why, when and how to use ERRP_GUARD() =
> *
> * Without ERRP_GUARD(), use of the @errp parameter is restricted:
> ...
> * - It should not be passed to error_prepend(), error_vprepend() or
> *   error_append_hint(), because that doesn't work with _fatal.
> * ERRP_GUARD() lifts these restrictions.
> *
> * To use ERRP_GUARD(), add it right at the beginning of the function.
> * @errp can then be used without worrying about the argument being
> * NULL or _fatal.
> 
> ERRP_GUARD() avoids the case where @errp points to error_fatal and
> the user can't see this additional information, because exit() happens
> in error_setg() before the information is added [1].
> 
> In nvme.c, there're 3 functions passing @errp to error_prepend()
> without ERRP_GUARD():
> - nvme_init_queue()
> - nvme_create_queue_pair()
> - nvme_identify()
> 
> All these 3 functions take their @errp parameters from the
> nvme_file_open(), which is a BlockDriver.bdrv_file_open() method and its
> @errp points to its caller's local_err.
> 
> Though these 3 cases haven't triggered the issue described in [1], to
> follow the requirement of @errp, add missing ERRP_GUARD() at their
> beginning.
> 
> [1]: Issue description in the commit message of commit ae7c80a7bd73
>  ("error: New macro ERRP_GUARD()").
> 
> Cc: Stefan Hajnoczi 
> Cc: Fam Zheng 
> Cc: "Philippe Mathieu-Daudé" 
> Cc: Kevin Wolf 
> Cc: Hanna Reitz 
> Signed-off-by: Zhao Liu 
> ---
>  block/nvme.c | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH] Print tool binary names in ./configure --help

2024-02-26 Thread Stefan Hajnoczi
On Sat, 17 Feb 2024 at 08:21, Manos Pitsidianakis
 wrote:
>
> configure --help currently outputs the following line for the tools
> option:
>
> -->8---
>   tcg             TCG support
>   tools           build support utilities that come with QEMU
>   tpm             TPM support
>   u2f             U2F emulation support
> ---8<--
>
> Which does not convey much information if you don't already know what
> these utilities are.
>
> This commit uses script/meson-buildoptions.py to parse the hard-coded
> test binary names in meson.build and update the --help output to include
> their names, like as follows:
>
> -->8---
>   tcg             TCG support
>   tools           build utility tool binaries like qemu-edid, qemu-img,
>                   qemu-io, qemu-nbd, qemu-bridge-helper, qemu-pr-helper
>   tpm             TPM support
>   u2f             U2F emulation support
> ---8<--
>
> Since it uses the meson.build AST to find those values, only hard-coded
> binary names are selected and the description is non-exhaustive.

How about updating the description in meson_options.txt to what's
shown above and dropping the meson introspect part?

I think the complexity is a little high given that the generated list
is incomplete. Less custom build system code makes it easier to
understand and reduces the chance of breakage.
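
Concretely, I mean just spelling the names out in the option description
itself, e.g. (sketch):

  option('tools', type : 'feature', value : 'auto',
         description: 'build support utilities that come with QEMU, '
                      + 'such as qemu-img, qemu-io, qemu-nbd')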

>
> Signed-off-by: Manos Pitsidianakis 
> ---
>  Makefile  |  8 +--
>  meson_options.txt |  2 +-
>  scripts/meson-buildoptions.py | 43 ---
>  scripts/meson-buildoptions.sh |  3 ++-
>  4 files changed, 49 insertions(+), 7 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index 8f36990335..79ab594c4b 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -128,8 +128,12 @@ Makefile.mtest: build.ninja scripts/mtest2make.py
>  .PHONY: update-buildoptions
>  all update-buildoptions: $(SRC_PATH)/scripts/meson-buildoptions.sh
>  $(SRC_PATH)/scripts/meson-buildoptions.sh: $(SRC_PATH)/meson_options.txt
> -   $(MESON) introspect --buildoptions $(SRC_PATH)/meson.build | 
> $(PYTHON) \
> - scripts/meson-buildoptions.py > $@.tmp && mv $@.tmp $@
> +   { printf '{"buildoptions":'; \
> +   $(MESON) introspect --buildoptions $(SRC_PATH)/meson.build 2> 
> >(grep -v "Unable to evaluate subdir(\[\])" >&2) \
> +   && printf ',"ast":' \
> +   && $(MESON) introspect --ast $(SRC_PATH)/meson.build 2> 
> >(grep -v "Unable to evaluate subdir(\[\])" >&2) \
> +   && printf "}" ; } \
> +   | $(PYTHON) scripts/meson-buildoptions.py > $@.tmp && mv 
> $@.tmp $@
>  endif
>
>  # 4. Rules to bridge to other makefiles
> diff --git a/meson_options.txt b/meson_options.txt
> index 0a99a059ec..53a8b6b3e2 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -58,7 +58,7 @@ option('guest_agent', type : 'feature', value : 'auto',
>  option('guest_agent_msi', type : 'feature', value : 'auto',
> description: 'Build MSI package for the QEMU Guest Agent')
>  option('tools', type : 'feature', value : 'auto',
> -   description: 'build support utilities that come with QEMU')
> +   description: 'build utility tool binaries')
>  option('qga_vss', type : 'feature', value: 'auto',
> description: 'build QGA VSS support (broken with MinGW)')
>
> diff --git a/scripts/meson-buildoptions.py b/scripts/meson-buildoptions.py
> index 4814a8ff61..4abdfc1d05 100644
> --- a/scripts/meson-buildoptions.py
> +++ b/scripts/meson-buildoptions.py
> @@ -24,6 +24,7 @@
>  import textwrap
>  import shlex
>  import sys
> +from collections import deque
>
>  # Options with nonstandard names (e.g. --with/--without) or OS-dependent
>  # defaults.  Try not to add any.
> @@ -182,7 +183,7 @@ def cli_metavar(opt):
>  return "CHOICE"
>
>
> -def print_help(options):
> +def print_help(options, tools: list[str]):
>  print("meson_options_help() {")
>  feature_opts = []
>  for opt in sorted(options, key=cli_help_key):
> @@ -212,6 +213,8 @@ def print_help(options):
>  sh_print()
>  for opt in sorted(feature_opts, key=cli_option):
>  key = cli_option(opt)
> +if key == "tools":
> +opt["description"] += " like " + ", ".join(tools)
>  help_line(key, opt, 18, False)
>  print("}")
>
> @@ -242,7 +245,41 @@ def print_parse(options):
>  print("}")
>
>
> -options = load_options(json.load(sys.stdin))
> +# Returns hard-coded executables from meson.build AST
> +def tool_executables(d: 

QEMU was not accepted into GSoC 2024

2024-02-21 Thread Stefan Hajnoczi
Dear QEMU and KVM community,
Unfortunately QEMU was not accepted into Google Summer of Code this
year and we will not be able to run the projects we had proposed.

This will come as a disappointment to both applicants and mentors, but
please read on. For applicants, we encourage you to look at the other
great organizations participating in Google Summer of Code this year,
like libvirt, the Linux Foundation, or AFLplusplus.

I have not had detailed feedback from Google yet, but at first glance
this seems similar to 2012 when QEMU was also not selected. Sometimes
veteran organizations like QEMU cannot participate due to the finite
amount of funding available.

Applicants can still contribute to QEMU outside GSoC although funding
will not be available this summer. Don't hesitate to reach out to the
mentors for the project idea you are interested in!
https://wiki.qemu.org/Google_Summer_of_Code_2024

QEMU will apply to Google Summer of Code again next year and we hope
to work with you in the future.

Stefan



Re: Sponsorship via Github

2024-02-20 Thread Stefan Hajnoczi
Hi Jakob,
Thanks for the generous offer to sponsor QEMU. It's great to hear you
are finding QEMU useful!

I am forwarding the request to Software Freedom Conservancy, which
handles donations and expenses for the QEMU project.

I'll let you know when we have an answer.

Thanks,
Stefan




Re: [PATCH 3/3] monitor: only run coroutine commands in qemu_aio_context

2024-02-13 Thread Stefan Hajnoczi
On Sat, 3 Feb 2024 at 06:30, Michael Tokarev  wrote:
>
> 03.02.2024 12:01, Michael Tokarev wrote:
> ...
> > This change broke something in 7.2. I'm still debugging it, will
> > come with a follow-up once some more details are found, I'll also
> > check current master with and without this commit.
> >
> > The prob happens with multiple suspend-resume cycles, - with this
> > change applied, guest does not work as expected after *second*
> > suspend-resume.
>
> So, it turned out the prob here exists on master too, and manifests
> itself the same way on 7.2.9 or on 8.2.1, - in all cases where we
> have this change applied it works (or breaks) equally.
>
> A (simple) reproducer so far is a hibernate test, - it fails *only*
> after suspend-to-ram, but works fine after just hibernate.
>
> I used just an initrd (with a drive image used for swap -
> for hibernation space).
>
>   qemu-img create s.img 256M
>   mkswap s.img
>   qemu-system-x86_64 \
>    -serial stdio -vga none -display none -parallel none -net none \
>    -machine q35 \
>    -drive file=s.img,if=ide,format=raw \
>    -m 256 \
>    -monitor unix:ttyS0,server,nowait \
>    -kernel /boot/vmlinuz-6.1.0-15-amd64 \
>    -initrd /boot/initrd.img-6.1.0-15-amd64 \
>    -append "shell=/bin/sh console=ttyS0 root=none"
>
>   There, in the guest (it has busybox only here):
>   # swapon /dev/sda
>   # echo mem > /sys/power/state
>   (system_wakeup on the monitor)
>   # echo disk > /sys/power/state
>
> The system will hibernate but *not* turn off power; qemu
> will continue running, while all console messages are the
> same as when it works fine.  The qemu process spins with
> 100% cpu usage at this stage.
>
> Without the intermediate suspend-to-ram or without the
> commit in question, qemu process will exit normally at
> this stage.
>
> This is a somewhat pathological test case, but I see it as an
> indicator of something else being wrong, as if we aren't saving
> or restoring some state that we should.
>
> The tight loop also suggests we're not having success in there.

I'm unable to reproduce this. QEMU v8.2.0, v8.1.0, and even v7.2.0 all
spin with 100% CPU usage on my machine. It looks to me like this
behavior is not a regression caused by this commit.

Do v7.2.0, v8.1.0, and v8.2.0 terminate QEMU on your machine?

Stefan



Re: [PATCH] iotests: Make 144 deterministic again

2024-02-12 Thread Stefan Hajnoczi
On Fri, Feb 09, 2024 at 06:31:03PM +0100, Kevin Wolf wrote:
> Since commit effd60c8 changed how QMP commands are processed, the order
> of the block-commit return value and job events in iotests 144 was no
> longer fixed and caused the test to fail intermittently.
> 
> Change the test to cache events first and then print them in a
> predefined order.
> 
> Waiting three times for JOB_STATUS_CHANGE is a bit uglier than just
> waiting for the JOB_STATUS_CHANGE that has "status": "ready", but the
> tooling we have doesn't seem to allow the latter easily.
> 
> Fixes: effd60c878176bcaf97fa7ce2b12d04bb8ead6f7
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2126
> Signed-off-by: Kevin Wolf 
> ---
>  tests/qemu-iotests/144 | 12 +++-
>  tests/qemu-iotests/144.out |  2 +-
>  2 files changed, 12 insertions(+), 2 deletions(-)

Thank you!

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH] pflash: fix sectors vs bytes confusion in blk_pread_nonzeroes()

2024-02-08 Thread Stefan Hajnoczi
On Thu, 1 Feb 2024 at 06:37, Michael Tokarev  wrote:
>
> 30.01.2024 03:27, Stefan Hajnoczi wrote:
> > The following expression is incorrect because blk_pread_nonzeroes()
> > deals in units of bytes, not sectors:
> >
> >bytes = MIN(size - offset, BDRV_REQUEST_MAX_SECTORS)
> >^^^
> >
> > BDRV_REQUEST_MAX_BYTES is the appropriate constant.
> >
> > Fixes: a4b15a8b9ef2 ("pflash: Only read non-zero parts of backend image")
> > Cc: Xiang Zheng 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >   hw/block/block.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/hw/block/block.c b/hw/block/block.c
> > index 9f52ee6e72..ff503002aa 100644
> > --- a/hw/block/block.c
> > +++ b/hw/block/block.c
> > @@ -30,7 +30,7 @@ static int blk_pread_nonzeroes(BlockBackend *blk, hwaddr 
> > size, void *buf)
> >       BlockDriverState *bs = blk_bs(blk);
> >
> >       for (;;) {
> > -        bytes = MIN(size - offset, BDRV_REQUEST_MAX_SECTORS);
> > +        bytes = MIN(size - offset, BDRV_REQUEST_MAX_BYTES);
>
> Hmm.  This smells like -stable material, but you know better not
> to Cc: qemu-stable@ for unrelated stuff...  Is it not for stable?

This is not a user-visible bug. The code still works with the smaller
MAX_SECTORS value thanks to the loop.

It doesn't hurt to include it in -stable but I also think it doesn't
help :-). It's just an inconsistency in the code.
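
(For scale: assuming the usual block-layer definitions, where
BDRV_REQUEST_MAX_SECTORS equals BDRV_REQUEST_MAX_BYTES / 512, the bug
capped each loop iteration at roughly 4 MiB instead of the intended
~2 GiB, so large images simply took more iterations.)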

Stefan



Re: [PATCH 0/2] block: Allow concurrent BB context changes

2024-02-08 Thread Stefan Hajnoczi
On Wed, 7 Feb 2024 at 04:36, Hanna Czenczek  wrote:
>
> On 06.02.24 17:53, Stefan Hajnoczi wrote:
>
> On Fri, Feb 02, 2024 at 03:47:53PM +0100, Hanna Czenczek wrote:
>
> Hi,
>
> Without the AioContext lock, a BB's context may kind of change at any
> time (unless it has a root node, and I/O requests are pending).  That
> also means that its own context (BlockBackend.ctx) and that of its root
> node can differ sometimes (while the context is being changed).
>
> blk_get_aio_context() doesn't know this yet and asserts that both are
> always equal (if there is a root node).  Because it's no longer true,
> and because callers don't seem to really care about the root node's
> context, we can and should remove the assertion and just return the BB's
> context.
>
> Beyond that, the question is whether the callers of
> blk_get_aio_context() are OK with the context potentially changing
> concurrently.  Honestly, it isn't entirely clear to me; most look OK,
> except for the virtio-scsi code, which operates under the general
> assumption that the BB's context is always equal to that of the
> virtio-scsi device.  I doubt that this assumption always holds (it is
> definitely not obvious to me that it would), but then again, this series
> will not make matters worse in that regard, and that is what counts for
> me now.
>
> One clear point of contention is scsi_device_for_each_req_async(), which
> is addressed by patch 2.  Right now, it schedules a BH in the BB
> context, then the BH double-checks whether the context still fits, and
> if not, re-schedules itself.  Because virtio-scsi's context is fixed,
> this seems to indicate to me that it wants to be able to deal with a
> case where BB and virtio-scsi context differ, which seems to break that
> aforementioned general virtio-scsi assumption.
>
> I don't agree with the last sentence: virtio-scsi's context isn't fixed.
>
> The AioContext changes when dataplane is started/stopped. virtio-scsi
> switches AioContext between the IOThread's AioContext and the main
> loop's qemu_aio_context.
>
> However, virtio-scsi virtqueue processing only happens in the IOThread's
> AioContext. Maybe this is what you meant when you said the AioContext is
> fixed?
>
>
> Specifically, I meant VirtIOSCSI.ctx, which is set only once in 
> virtio_scsi_dataplane_setup().  That’s at least where the virtqueue notifiers 
> are registered, so yes, virtqueue processing should at least be fixed to that 
> context.  It seems like it’s always possible some things are processed in the 
> main thread (not just setup/teardown, but also e.g. TMF_LOGICAL_UNIT_RESET), 
> so to me it seems like virtio-scsi kind of runs in two contexts 
> simultaneously.  Yes, when virtqueue processing is paused, all processing 
> VirtIOSCSI.ctx is stopped, but I wouldn’t say it switches contexts there.  It 
> just stops processing some requests.
>
> Either way, virtio-scsi request processing doesn’t stop just because a 
> scsi-hd device is hot-plugged or -unplugged.  If the BB changes contexts in 
> the hot-unplug path (while vq request processing is continuing in the I/O 
> thread), its context will differ from that of virtio-scsi.
>
> So should I just replace the “the context is fixed” and say that in this 
> specific instance, virtio-scsi vq processing continues in the I/O thread?
>
> The BH function is aware that the current AioContext might not be the
> same as the AioContext at the time the BH was scheduled. That doesn't
> break assumptions in the code.
>
> (It may be possible to rewrite virtio-blk, virtio-scsi, and core
> VirtIODevice ioeventfd code to use the simpler model where the
> AioContext really is fixed because things have changed significantly
> over the years, but I looked a few weeks ago and it's difficult work.)
>
> I'm just pointing out that I think this description is incomplete. I
> *do* agree with what this patch series is doing :).
>
>
> Well, this description won’t land in any commit log, so from my side, I’m not 
> too worried about its correctness. O:)

Okay, I think we're in agreement. What you described in your reply
matches how I understand the code. No need to resend anything.

Stefan



Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread

2024-02-06 Thread Stefan Hajnoczi
On Fri, Feb 02, 2024 at 01:32:39PM +0100, Hanna Czenczek wrote:
> On 01.02.24 16:25, Hanna Czenczek wrote:
> > On 01.02.24 15:28, Stefan Hajnoczi wrote:
> 
> [...]
> 
> > > Did you find a scenario where the virtio-scsi AioContext is different
> > > from the scsi-hd BB's Aiocontext?
> > 
> > Technically, that’s the reason for this thread, specifically that
> > virtio_scsi_hotunplug() switches the BB back to the main context while
> > scsi_device_for_each_req_async_bh() is running.  Yes, we can fix that
> > specific case via the in-flight counter, but I’m wondering whether
> > there’s really any merit in requiring the BB to always be in
> > virtio-scsi’s context, or whether it would make more sense to schedule
> > everything in virtio-scsi’s context.  Now that BBs/BDSs can receive
> > requests from any context, that is.
> 
> Now that I know that wouldn’t be easy, let me turn this around: As far as I
> understand, scsi_device_for_each_req_async_bh() should still run in
> virtio-scsi’s context, but that’s hard, so we take the BB’s context, which
> we therefore require to be the same one. Further, (again AFAIU,)
> virtio-scsi’s context cannot change (only set in
> virtio_scsi_dataplane_setup(), which is run in
> virtio_scsi_device_realize()).  Therefore, why does the
> scsi_device_for_each_req_async() code accommodate for BB context changes?

1. scsi_disk_reset() -> scsi_device_purge_requests() is called without
   in-flight requests.
2. The BH is scheduled by scsi_device_purge_requests() ->
   scsi_device_for_each_req_async().
3. blk_drain() is a nop when there are no in-flight requests and does not
   flush BHs.
4. The AioContext changes when the virtio-scsi device resets.
5. The BH executes.
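
In code, the guard in the BH looks roughly like this (a sketch of the
double-check described above, not a verbatim quote of the scsi-bus.c
code):

  typedef struct {
      SCSIDevice *s;
      /* ... iteration callback and opaque pointer ... */
  } SCSIDeviceForEachReqAsyncData;

  static void scsi_device_for_each_req_async_bh(void *opaque)
  {
      SCSIDeviceForEachReqAsyncData *data = opaque;
      AioContext *ctx = blk_get_aio_context(data->s->conf.blk);

      /* The BB may have moved to another AioContext since scheduling */
      if (ctx != qemu_get_current_aio_context()) {
          aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
                                  data);
          return;
      }

      /* ... now safe to walk the device's request list ... */
  }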

Kevin and I touched on the idea of flushing BHs in bdrv_drain() even
when there are no requests in flight. This hasn't been implemented as of
today, but would also reduce the chance of scenarios like the one I
mentioned.

I think it's safer to handle the case where the BH runs after an
AioContext change until either everything is thread-safe or the
AioContext never changes.

Stefan




[PATCH v2 3/5] virtio-blk: add vq_rq[] bounds check in virtio_blk_dma_restart_cb()

2024-02-06 Thread Stefan Hajnoczi
Hanna Czenczek  noted that the array index in
virtio_blk_dma_restart_cb() is not bounds-checked:

  g_autofree VirtIOBlockReq **vq_rq = g_new0(VirtIOBlockReq *, num_queues);
  ...
  while (rq) {
      VirtIOBlockReq *next = rq->next;
      uint16_t idx = virtio_get_queue_index(rq->vq);

      rq->next = vq_rq[idx];
                 ^^^^^^^^^^

The code is correct because both rq->vq and vq_rq[] depend on
num_queues, but this is indirect and not 100% obvious. Add an assertion.

Suggested-by: Hanna Czenczek 
Reviewed-by: Manos Pitsidianakis 
Reviewed-by: Hanna Czenczek 
Signed-off-by: Stefan Hajnoczi 
---
 hw/block/virtio-blk.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index e430ba583c..31212506ca 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1209,6 +1209,8 @@ static void virtio_blk_dma_restart_cb(void *opaque, bool 
running,
         VirtIOBlockReq *next = rq->next;
         uint16_t idx = virtio_get_queue_index(rq->vq);
 
+        /* Only num_queues vqs were created so vq_rq[idx] is within bounds */
+        assert(idx < num_queues);
         rq->next = vq_rq[idx];
         vq_rq[idx] = rq;
         rq = next;
-- 
2.43.0




[PATCH v2 4/5] virtio-blk: declare VirtIOBlock::rq with a type

2024-02-06 Thread Stefan Hajnoczi
The VirtIOBlock::rq field has had the type void * since its introduction
in commit 869a5c6df19a ("Stop VM on error in virtio-blk. (Gleb
Natapov)").

Perhaps this was done to avoid the forward declaration of
VirtIOBlockReq.

Hanna Czenczek  pointed out the missing type. Specify
the actual type because there is no need to use void * here.

Suggested-by: Hanna Czenczek 
Reviewed-by: Manos Pitsidianakis 
Reviewed-by: Hanna Czenczek 
Signed-off-by: Stefan Hajnoczi 
---
 include/hw/virtio/virtio-blk.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 833a9a344f..5c14110c4b 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -55,7 +55,7 @@ struct VirtIOBlock {
 VirtIODevice parent_obj;
 BlockBackend *blk;
 QemuMutex rq_lock;
-void *rq; /* protected by rq_lock */
+struct VirtIOBlockReq *rq; /* protected by rq_lock */
 VirtIOBlkConf conf;
 unsigned short sector_mask;
 bool original_wce;
-- 
2.43.0




[PATCH v2 5/5] monitor: use aio_co_reschedule_self()

2024-02-06 Thread Stefan Hajnoczi
The aio_co_reschedule_self() API is designed to avoid the race
condition between scheduling the coroutine in another AioContext and
yielding.

The QMP dispatch code uses the open-coded version that appears
susceptible to the race condition at first glance:

  aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
  qemu_coroutine_yield();

The code is actually safe because the iohandler and qemu_aio_context
AioContext run under the Big QEMU Lock. Nevertheless, set a good example
and use aio_co_reschedule_self() so it's obvious that there is no race.

Suggested-by: Hanna Reitz 
Reviewed-by: Manos Pitsidianakis 
Reviewed-by: Hanna Czenczek 
Signed-off-by: Stefan Hajnoczi 
---
 qapi/qmp-dispatch.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
index 176b549473..f3488afeef 100644
--- a/qapi/qmp-dispatch.c
+++ b/qapi/qmp-dispatch.c
@@ -212,8 +212,7 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList 
*cmds, QObject *requ
          * executing the command handler so that it can make progress if it
          * involves an AIO_WAIT_WHILE().
          */
-        aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
-        qemu_coroutine_yield();
+        aio_co_reschedule_self(qemu_get_aio_context());
     }
 
     monitor_set_cur(qemu_coroutine_self(), cur_mon);
@@ -227,9 +226,7 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList 
*cmds, QObject *requ
              * Move back to iohandler_ctx so that nested event loops for
              * qemu_aio_context don't start new monitor commands.
              */
-            aio_co_schedule(iohandler_get_aio_context(),
-                            qemu_coroutine_self());
-            qemu_coroutine_yield();
+            aio_co_reschedule_self(iohandler_get_aio_context());
         }
     } else {
         /*
-- 
2.43.0




[PATCH v2 0/5] virtio-blk: iothread-vq-mapping cleanups

2024-02-06 Thread Stefan Hajnoczi
v2:
- Add a comment in Patch 3 explaining why the bounds check assertion holds [Manos]
- Remove redundant nested if in Patch 1 [Hanna]

Hanna reviewed the iothread-vq-mapping patches after they were applied to
qemu.git. This series consists of code cleanups that Hanna identified.

There are no functional changes or bug fixes that need to be backported to the
stable tree here, but it may make sense to backport them in the future to avoid
conflicts.

Stefan Hajnoczi (5):
  virtio-blk: enforce iothread-vq-mapping validation
  virtio-blk: clarify that there is at least 1 virtqueue
  virtio-blk: add vq_rq[] bounds check in virtio_blk_dma_restart_cb()
  virtio-blk: declare VirtIOBlock::rq with a type
  monitor: use aio_co_reschedule_self()

 include/hw/virtio/virtio-blk.h |   2 +-
 hw/block/virtio-blk.c  | 194 ++---
 qapi/qmp-dispatch.c|   7 +-
 3 files changed, 112 insertions(+), 91 deletions(-)

-- 
2.43.0



