Re: [PATCH maybe-7.2 1/3] hw/i2c: only schedule pending master when bus is idle

2022-11-16 Thread Klaus Jensen
On Nov 17 07:56, Cédric Le Goater wrote:
> On 11/17/22 07:40, Klaus Jensen wrote:
> > On Nov 16 16:58, Cédric Le Goater wrote:
> > > On 11/16/22 09:43, Klaus Jensen wrote:
> > > > From: Klaus Jensen 
> > > > 
> > > > It is not given that the current master will release the bus after a
> > > > transfer ends. Only schedule a pending master if the bus is idle.
> > > > 
> > > > Fixes: 37fa5ca42623 ("hw/i2c: support multiple masters")
> > > > Signed-off-by: Klaus Jensen 
> > > > ---
> > > >hw/i2c/aspeed_i2c.c  |  2 ++
> > > >hw/i2c/core.c| 37 ++---
> > > >include/hw/i2c/i2c.h |  2 ++
> > > >3 files changed, 26 insertions(+), 15 deletions(-)
> > > > 
> > > > diff --git a/hw/i2c/aspeed_i2c.c b/hw/i2c/aspeed_i2c.c
> > > > index c166fd20fa11..1f071a3811f7 100644
> > > > --- a/hw/i2c/aspeed_i2c.c
> > > > +++ b/hw/i2c/aspeed_i2c.c
> > > > @@ -550,6 +550,8 @@ static void aspeed_i2c_bus_handle_cmd(AspeedI2CBus 
> > > > *bus, uint64_t value)
> > > >}
> > > >SHARED_ARRAY_FIELD_DP32(bus->regs, reg_cmd, M_STOP_CMD, 0);
> > > >aspeed_i2c_set_state(bus, I2CD_IDLE);
> > > > +
> > > > +i2c_schedule_pending_master(bus->bus);
> > > 
> > > Shouldn't it be i2c_bus_release() ?
> > > 
> > 
> > The reason for having both i2c_bus_release() and
> > i2c_schedule_pending_master() is that i2c_bus_release() sort of pairs
> > with i2c_bus_master(). They either set or clear the bus->bh member.
> > 
> > In the current design, the controller (in this case the Aspeed I2C) is
> > an "implicit" master (it does not have a bottom half driving it), so
> > there is no bus->bh to clear.
> > 
> > I should (and will) write some documentation on the asynchronous API.
> 
> I found the routine names confusing. Thanks for the clarification.
> 
> Maybe we could do this rename:
> 
>   i2c_bus_release() -> i2c_bus_release_and_clear()
>   i2c_schedule_pending_master() -> i2c_bus_release()
> 
> and keep i2c_schedule_pending_master() internal to the I2C core subsystem.
> 

How about renaming i2c_bus_master() to i2c_bus_acquire() such that it
pairs with i2c_bus_release()?

And then add an i2c_bus_yield() to be used by the controller? I think we
should be able to assert in i2c_bus_yield() that bus->bh is NULL. But
I'll take a closer look at that.


signature.asc
Description: PGP signature


Re: [PATCH v3 2/2] nvme: Add physical writes/reads from OCP log

2022-11-16 Thread Klaus Jensen
On Nov 16 18:14, Joel Granados wrote:
> In order to evaluate write amplification factor (WAF) within the storage
> stack it is important to know the number of bytes written to the
> controller. The existing SMART log value of Data Units Written is too
> coarse (given in units of 500 KB) and so we add the SMART health
> information extended from the OCP specification (given in units of bytes)
> 
> We add a controller argument (ocp) that toggles on/off the SMART log
> extended structure.  To accommodate different vendor specific specifications
> like OCP, we add a multiplexing function (nvme_vendor_specific_log) which
> will route to the different log functions based on arguments and log ids.
> We only return the OCP extended SMART log when the command is 0xC0 and ocp
> has been turned on in the args.
> 
> Though we add the whole nvme SMART log extended structure, we only populate
> the physical_media_units_{read,written}, log_page_version and
> log_page_uuid.
> 
> Signed-off-by: Joel Granados 
> ---
>  docs/system/devices/nvme.rst |  7 +
>  hw/nvme/ctrl.c   | 55 
>  hw/nvme/nvme.h   |  1 +
>  include/block/nvme.h | 36 +++
>  4 files changed, 99 insertions(+)
> 
> diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
> index 30f841ef62..1cc5e52c00 100644
> --- a/docs/system/devices/nvme.rst
> +++ b/docs/system/devices/nvme.rst
> @@ -53,6 +53,13 @@ parameters.
>Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
>previously used.
>  
> +``ocp`` (default: ``off``)
> +  The Open Compute Project defines the Datacenter NVMe SSD Specification that
> +  sits on top of NVMe. It describes additional commands and NVMe behaviors
> +  specific for the Datacenter. When this option is ``on`` OCP features such 
> as
> +  the SMART / Health information extended log become available in the
> +  controller.
> +
>  Additional Namespaces
>  -
>  
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index bf291f7ffe..c7215a4ed1 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -4455,6 +4455,41 @@ static void nvme_set_blk_stats(NvmeNamespace *ns, 
> struct nvme_stats *stats)
>  stats->write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
>  }
>  
> +static uint16_t nvme_ocp_extended_smart_info(NvmeCtrl *n, uint8_t rae,
> + uint32_t buf_len, uint64_t off,
> + NvmeRequest *req)
> +{
> +NvmeNamespace *ns = NULL;
> +NvmeSmartLogExtended smart_l = { 0 };
> +struct nvme_stats stats = { 0 };
> +uint32_t trans_len;
> +
> +if (off >= sizeof(smart_l)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +/* accumulate all stats from all namespaces */
> +for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
> +ns = nvme_ns(n, i);
> +if (ns) {
> +nvme_set_blk_stats(ns, &stats);
> +}
> +}
> +
> +smart_l.physical_media_units_written[0] = 
> cpu_to_le32(stats.units_written);
> +smart_l.physical_media_units_read[0] = cpu_to_le32(stats.units_read);

These are uint64s, so should be cpu_to_le64().

> +smart_l.log_page_version = 0x0003;
> +smart_l.log_page_uuid[0] = 0xA4F2BFEA2810AFC5;
> +smart_l.log_page_uuid[1] = 0xAFD514C97C6F4F9C;

Technically the field is called the "Log Page GUID", not the UUID.
Perhaps this is a bit of Microsoft leaking in, or it is to differentiate
it from the UUID Index functionality, who knows.

It looks like you byte swapped the two 64 bit parts, but not the
individual bytes. It's super confusing when the spec just says "shall be
set to VALUE". Is that VALUE already in little endian? Sigh.

Anyway, I think it is fair to assume that, so just make
log_page_uuid/guid a uint8_t 16-array and do something like:

static const uint8_t uuid[16] = {
0xAF, 0xD5, 0x14, 0xC9, 0x7C, 0x6F, 0x4F, 0x9C,
0xA4, 0xF2, 0xBF, 0xEA, 0x28, 0x10, 0xAF, 0xC5,
};

memcpy(smart_l.log_page_guid, uuid, sizeof(smart_l.log_page_guid));




Re: [PATCH maybe-7.2 1/3] hw/i2c: only schedule pending master when bus is idle

2022-11-16 Thread Cédric Le Goater

On 11/17/22 07:40, Klaus Jensen wrote:
> On Nov 16 16:58, Cédric Le Goater wrote:
> > On 11/16/22 09:43, Klaus Jensen wrote:
> > > From: Klaus Jensen 
> > > 
> > > It is not given that the current master will release the bus after a
> > > transfer ends. Only schedule a pending master if the bus is idle.
> > > 
> > > Fixes: 37fa5ca42623 ("hw/i2c: support multiple masters")
> > > Signed-off-by: Klaus Jensen 
> > > ---
> > >   hw/i2c/aspeed_i2c.c  |  2 ++
> > >   hw/i2c/core.c| 37 ++---
> > >   include/hw/i2c/i2c.h |  2 ++
> > >   3 files changed, 26 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/hw/i2c/aspeed_i2c.c b/hw/i2c/aspeed_i2c.c
> > > index c166fd20fa11..1f071a3811f7 100644
> > > --- a/hw/i2c/aspeed_i2c.c
> > > +++ b/hw/i2c/aspeed_i2c.c
> > > @@ -550,6 +550,8 @@ static void aspeed_i2c_bus_handle_cmd(AspeedI2CBus *bus, uint64_t value)
> > >   }
> > >   SHARED_ARRAY_FIELD_DP32(bus->regs, reg_cmd, M_STOP_CMD, 0);
> > >   aspeed_i2c_set_state(bus, I2CD_IDLE);
> > > +
> > > +i2c_schedule_pending_master(bus->bus);
> > 
> > Shouldn't it be i2c_bus_release() ?
> > 
> 
> The reason for having both i2c_bus_release() and
> i2c_schedule_pending_master() is that i2c_bus_release() sort of pairs
> with i2c_bus_master(). They either set or clear the bus->bh member.
> 
> In the current design, the controller (in this case the Aspeed I2C) is
> an "implicit" master (it does not have a bottom half driving it), so
> there is no bus->bh to clear.
> 
> I should (and will) write some documentation on the asynchronous API.

I found the routine names confusing. Thanks for the clarification.

Maybe we could do this rename:

  i2c_bus_release() -> i2c_bus_release_and_clear()
  i2c_schedule_pending_master() -> i2c_bus_release()

and keep i2c_schedule_pending_master() internal to the I2C core subsystem.

C.




Re: [PATCH RFC 2/3] hw/i2c: add mctp core

2022-11-16 Thread Klaus Jensen
On Nov 16 08:27, Corey Minyard wrote:
> On Wed, Nov 16, 2022 at 09:43:11AM +0100, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add an abstract MCTP over I2C endpoint model. This implements MCTP
> > control message handling as well as handling the actual I2C transport
> > (packetization).
> > 
> > Devices are intended to derive from this and implement the class
> > methods.
> > 
> > Parts of this implementation is inspired by code[1] previously posted by
> > Jonathan Cameron.
> 
> I have some comments inline, mostly about buffer handling.  Buffer
> handling is scary to me, so you might see some paranoia here :-).
> 

Totally understood :) Thanks for the review!

> > 
> >   [1]: 
> > https://lore.kernel.org/qemu-devel/20220520170128.4436-1-jonathan.came...@huawei.com/
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/arm/Kconfig |   1 +
> >  hw/i2c/Kconfig |   4 +
> >  hw/i2c/mctp.c  | 365 +
> >  hw/i2c/meson.build |   1 +
> >  hw/i2c/trace-events|  12 ++
> >  include/hw/i2c/mctp.h  |  83 ++
> >  include/hw/misc/mctp.h |  43 +
> >  7 files changed, 509 insertions(+)
> >  create mode 100644 hw/i2c/mctp.c
> >  create mode 100644 include/hw/i2c/mctp.h
> >  create mode 100644 include/hw/misc/mctp.h
> > 
> > diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
> > index 17fcde8e1ccc..3233bdc193d7 100644
> > --- a/hw/arm/Kconfig
> > +++ b/hw/arm/Kconfig
> > @@ -444,6 +444,7 @@ config ASPEED_SOC
> >  select DS1338
> >  select FTGMAC100
> >  select I2C
> > +select MCTP_I2C
> >  select DPS310
> >  select PCA9552
> >  select SERIAL
> > diff --git a/hw/i2c/Kconfig b/hw/i2c/Kconfig
> > index 9bb8870517f8..5dd43d550c32 100644
> > --- a/hw/i2c/Kconfig
> > +++ b/hw/i2c/Kconfig
> > @@ -41,3 +41,7 @@ config PCA954X
> >  config PMBUS
> >  bool
> >  select SMBUS
> > +
> > +config MCTP_I2C
> > +bool
> > +select I2C
> > diff --git a/hw/i2c/mctp.c b/hw/i2c/mctp.c
> > new file mode 100644
> > index ..46376de95a98
> > --- /dev/null
> > +++ b/hw/i2c/mctp.c
> > @@ -0,0 +1,365 @@
> > +/*
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + * SPDX-FileCopyrightText: Copyright (c) 2022 Samsung Electronics Co., Ltd.
> > + * SPDX-FileContributor: Klaus Jensen 
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qemu/main-loop.h"
> > +
> > +#include "hw/qdev-properties.h"
> > +#include "hw/i2c/i2c.h"
> > +#include "hw/i2c/mctp.h"
> > +
> > +#include "trace.h"
> > +
> > +static uint8_t crc8(uint16_t data)
> > +{
> > +#define POLY (0x1070U << 3)
> > +int i;
> > +
> > +for (i = 0; i < 8; i++) {
> > +if (data & 0x8000) {
> > +data = data ^ POLY;
> > +}
> > +
> > +data = data << 1;
> > +}
> > +
> > +return (uint8_t)(data >> 8);
> > +#undef POLY
> > +}
> > +
> > +static uint8_t i2c_smbus_pec(uint8_t crc, uint8_t *buf, size_t len)
> > +{
> > +int i;
> > +
> > +for (i = 0; i < len; i++) {
> > +crc = crc8((crc ^ buf[i]) << 8);
> > +}
> > +
> > +return crc;
> > +}
> 
> The PEC calculation probably belongs in it's own smbus.c file, since
> it's generic, so someone looking will find it.
> 

Makes sense. I'll move it.

> > +
> > +void i2c_mctp_schedule_send(MCTPI2CEndpoint *mctp)
> > +{
> > +I2CBus *i2c = I2C_BUS(qdev_get_parent_bus(DEVICE(mctp)));
> > +
> > +mctp->tx.state = I2C_MCTP_STATE_TX_START_SEND;
> > +
> > +i2c_bus_master(i2c, mctp->tx.bh);
> > +}
> > +
> > +static void i2c_mctp_tx(void *opaque)
> > +{
> > +DeviceState *dev = DEVICE(opaque);
> > +I2CBus *i2c = I2C_BUS(qdev_get_parent_bus(dev));
> > +I2CSlave *slave = I2C_SLAVE(dev);
> > +MCTPI2CEndpoint *mctp = MCTP_I2C_ENDPOINT(dev);
> > +MCTPI2CEndpointClass *mc = MCTP_I2C_ENDPOINT_GET_CLASS(mctp);
> > +MCTPI2CPacket *pkt = (MCTPI2CPacket *)mctp->buffer;
> > +uint8_t flags = 0;
> > +
> > +switch (mctp->tx.state) {
> > +case I2C_MCTP_STATE_TX_SEND_BYTE:
> > +if (mctp->pos < mctp->len) {
> > +uint8_t byte = mctp->buffer[mctp->pos];
> > +
> > +trace_i2c_mctp_tx_send_byte(mctp->pos, byte);
> > +
> > +/* send next byte */
> > +i2c_send_async(i2c, byte);
> > +
> > +mctp->pos++;
> > +
> > +break;
> > +}
> > +
> > +/* packet sent */
> > +i2c_end_transfer(i2c);
> > +
> > +/* fall through */
> > +
> > +case I2C_MCTP_STATE_TX_START_SEND:
> > +if (mctp->tx.is_control) {
> > +/* packet payload is already in buffer */
> > +flags |= MCTP_H_FLAGS_SOM | MCTP_H_FLAGS_EOM;
> > +} else {
> > +/* get message bytes from derived device */
> > +mctp->len = mc->get_message_bytes(mctp, pkt->mctp.payload,
> > +  I2C_MCTP_MAXMTU, &flags);
> > +}
> > +
> > +if (!mctp->len) {
> > +trace_i2c_mctp_tx_done();

Re: [PATCH maybe-7.2 1/3] hw/i2c: only schedule pending master when bus is idle

2022-11-16 Thread Klaus Jensen
On Nov 16 16:58, Cédric Le Goater wrote:
> On 11/16/22 09:43, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > It is not given that the current master will release the bus after a
> > transfer ends. Only schedule a pending master if the bus is idle.
> > 
> > Fixes: 37fa5ca42623 ("hw/i2c: support multiple masters")
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/i2c/aspeed_i2c.c  |  2 ++
> >   hw/i2c/core.c| 37 ++---
> >   include/hw/i2c/i2c.h |  2 ++
> >   3 files changed, 26 insertions(+), 15 deletions(-)
> > 
> > diff --git a/hw/i2c/aspeed_i2c.c b/hw/i2c/aspeed_i2c.c
> > index c166fd20fa11..1f071a3811f7 100644
> > --- a/hw/i2c/aspeed_i2c.c
> > +++ b/hw/i2c/aspeed_i2c.c
> > @@ -550,6 +550,8 @@ static void aspeed_i2c_bus_handle_cmd(AspeedI2CBus 
> > *bus, uint64_t value)
> >   }
> >   SHARED_ARRAY_FIELD_DP32(bus->regs, reg_cmd, M_STOP_CMD, 0);
> >   aspeed_i2c_set_state(bus, I2CD_IDLE);
> > +
> > +i2c_schedule_pending_master(bus->bus);
> 
> Shouldn't it be i2c_bus_release() ?
> 

The reason for having both i2c_bus_release() and
i2c_schedule_pending_master() is that i2c_bus_release() sort of pairs
with i2c_bus_master(). They either set or clear the bus->bh member.

In the current design, the controller (in this case the Aspeed I2C) is
an "implicit" master (it does not have a bottom half driving it), so
there is no bus->bh to clear.

I should (and will) write some documentation on the asynchronous API.




Re: [PATCH v2 3/3] nvme: Add physical writes/reads from OCP log

2022-11-16 Thread Klaus Jensen
On Nov 16 17:19, Joel Granados wrote:
> On Tue, Nov 15, 2022 at 12:26:17PM +0100, Klaus Jensen wrote:
> > On Nov 14 14:50, Joel Granados wrote:
> > >  
> > > +static uint16_t nvme_vendor_specific_log(uint8_t lid, NvmeCtrl *n, 
> > > uint8_t rae,
> > > + uint32_t buf_len, uint64_t off,
> > > + NvmeRequest *req)
> > 
> > `NvmeCtrl *n` must be first parameter.
> Any reason why this is the case? I'll change it in my code, but would be
> nice to understand the reason.
> 

No other reason than consistency with existing code.




Re: [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails

2022-11-16 Thread Alex Williamson
On Thu, 3 Nov 2022 18:16:17 +0200
Avihai Horon  wrote:

> If vfio_migration_set_state() fails to set the device in the requested
> state it tries to put it in a recover state. If setting the device in
> the recover state fails as well, hw_error is triggered and the VM is
> aborted.
> 
> To improve user experience and avoid VM data loss, reset the device with
> VFIO_DEVICE_RESET instead of aborting the VM.
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/migration.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index f8c3228314..e8068b9147 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -92,8 +92,18 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
>  
>  mig_state->device_state = recover_state;
>  if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> -hw_error("%s: Failed setting device in recover state, err: %s",
> - vbasedev->name, strerror(errno));
> +error_report(
> +"%s: Failed setting device in recover state, err: %s. 
> Resetting device",
> + vbasedev->name, strerror(errno));
> +
> +if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
> +hw_error("%s: Failed resetting device, err: %s", 
> vbasedev->name,
> + strerror(errno));
> +}
> +
> +migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> +
> +return -1;
>  }
>  
>  migration->device_state = recover_state;

This addresses one of my comments on 12/ and should probably be rolled
in there.  Thanks,

Alex




Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2

2022-11-16 Thread Alex Williamson
On Thu, 3 Nov 2022 18:16:15 +0200
Avihai Horon  wrote:

> Add implementation of VFIO migration protocol v2. The two protocols, v1
> and v2, will co-exist and in next patch v1 protocol will be removed.
> 
> There are several main differences between v1 and v2 protocols:
> - VFIO device state is now represented as a finite state machine instead
>   of a bitmap.
> 
> - Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
>   ioctl and normal read() and write() instead of the migration region.
> 
> - VFIO migration protocol v2 currently doesn't support the pre-copy
>   phase of migration.
> 
> Detailed information about VFIO migration protocol v2 and difference
> compared to v1 can be found here [1].
> 
> [1]
> https://lore.kernel.org/all/20220224142024.147653-10-yish...@nvidia.com/
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/common.c  |  19 +-
>  hw/vfio/migration.c   | 386 ++
>  hw/vfio/trace-events  |   4 +
>  include/hw/vfio/vfio-common.h |   5 +
>  4 files changed, 375 insertions(+), 39 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 617e6cd901..0bdbd1586b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -355,10 +355,18 @@ static bool 
> vfio_devices_all_dirty_tracking(VFIOContainer *container)
>  return false;
>  }
>  
> -if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) 
> &&
> +if (!migration->v2 &&
> +(vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) 
> &&
>  (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) 
> {
>  return false;
>  }
> +
> +if (migration->v2 &&
> +(vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) 
> &&
> +(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> + migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
> +return false;
> +}
>  }
>  }
>  return true;
> @@ -385,7 +393,14 @@ static bool 
> vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>  return false;
>  }
>  
> -if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> +if (!migration->v2 &&
> +migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> +continue;
> +}
> +
> +if (migration->v2 &&
> +(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> + migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
>  continue;
>  } else {
>  return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index e784374453..62afc23a8c 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -44,8 +44,84 @@
>  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
>  #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL)
>  
> +#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)

Add comment explaining heuristic of this size.

> +
>  static int64_t bytes_transferred;
>  
> +static const char *mig_state_to_str(enum vfio_device_mig_state state)
> +{
> +switch (state) {
> +case VFIO_DEVICE_STATE_ERROR:
> +return "ERROR";
> +case VFIO_DEVICE_STATE_STOP:
> +return "STOP";
> +case VFIO_DEVICE_STATE_RUNNING:
> +return "RUNNING";
> +case VFIO_DEVICE_STATE_STOP_COPY:
> +return "STOP_COPY";
> +case VFIO_DEVICE_STATE_RESUMING:
> +return "RESUMING";
> +case VFIO_DEVICE_STATE_RUNNING_P2P:
> +return "RUNNING_P2P";
> +default:
> +return "UNKNOWN STATE";
> +}
> +}
> +
> +static int vfio_migration_set_state(VFIODevice *vbasedev,
> +enum vfio_device_mig_state new_state,
> +enum vfio_device_mig_state recover_state)
> +{
> +VFIOMigration *migration = vbasedev->migration;
> +uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> +  sizeof(struct vfio_device_feature_mig_state),
> +  sizeof(uint64_t))] = {};
> +struct vfio_device_feature *feature = (void *)buf;
> +struct vfio_device_feature_mig_state *mig_state = (void *)feature->data;

We can cast to the actual types rather than void* here.

> +
> +feature->argsz = sizeof(buf);
> +feature->flags =
> +VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
> +mig_state->device_state = new_state;
> +if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> +/* Try to set the device in some good state */
> +error_report(
> +"%s: Failed setting device state to %s, err: %s. Setting device 
> in recover state %s",
> + vbasedev->name, 

[PATCH v3 1/2] nvme: Move adjustment of data_units{read,written}

2022-11-16 Thread Joel Granados
In order to return the units_{read/written} required by the SMART log we
need to shift the byte count right by BDRV_SECTOR_BITS and divide by 1000.
This is a prep patch that moves this adjustment to where the SMART
log is calculated in order to use the stats struct for calculating OCP
extended smart log values.

Signed-off-by: Joel Granados 
---
 hw/nvme/ctrl.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 87aeba0564..bf291f7ffe 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4449,8 +4449,8 @@ static void nvme_set_blk_stats(NvmeNamespace *ns, struct 
nvme_stats *stats)
 {
 BlockAcctStats *s = blk_get_stats(ns->blkconf.blk);
 
-stats->units_read += s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
-stats->units_written += s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
+stats->units_read += s->nr_bytes[BLOCK_ACCT_READ];
+stats->units_written += s->nr_bytes[BLOCK_ACCT_WRITE];
 stats->read_commands += s->nr_ops[BLOCK_ACCT_READ];
 stats->write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
 }
@@ -4464,6 +4464,7 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, 
uint32_t buf_len,
 uint32_t trans_len;
 NvmeNamespace *ns;
 time_t current_ms;
+uint64_t u_read, u_written;
 
 if (off >= sizeof(smart)) {
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -4490,10 +4491,11 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t 
rae, uint32_t buf_len,
 trans_len = MIN(sizeof(smart) - off, buf_len);
 smart.critical_warning = n->smart_critical_warning;
 
-smart.data_units_read[0] = cpu_to_le64(DIV_ROUND_UP(stats.units_read,
-1000));
-smart.data_units_written[0] = cpu_to_le64(DIV_ROUND_UP(stats.units_written,
-   1000));
+u_read = DIV_ROUND_UP(stats.units_read >> BDRV_SECTOR_BITS, 1000);
+u_written = DIV_ROUND_UP(stats.units_written >> BDRV_SECTOR_BITS, 1000);
+
+smart.data_units_read[0] = cpu_to_le64(u_read);
+smart.data_units_written[0] = cpu_to_le64(u_written);
 smart.host_read_commands[0] = cpu_to_le64(stats.read_commands);
 smart.host_write_commands[0] = cpu_to_le64(stats.write_commands);
 
-- 
2.30.2




[PATCH v3 0/2] Add OCP extended log to nvme QEMU

2022-11-16 Thread Joel Granados
The motivation and description are contained in the last patch in this set.
Will copy paste it here for convenience:

In order to evaluate write amplification factor (WAF) within the storage
stack it is important to know the number of bytes written to the
controller. The existing SMART log value of Data Units Written is too
coarse (given in units of 500 KB) and so we add the SMART health
information extended from the OCP specification (given in units of bytes).

To accommodate different vendor specific specifications like OCP, we add a
multiplexing function (nvme_vendor_specific_log) which will route to the
different log functions based on arguments and log ids. We only return the
OCP extended smart log when the command is 0xC0 and ocp has been turned on
in the args.

Though we add the whole nvme smart log extended structure, we only populate
the physical_media_units_{read,written}, log_page_version and
log_page_uuid.

V3 changes:
1. Corrected a bunch of checkpatch issues. Since I changed the first patch
   I did not include the reviewed-by.
2. Included some documentation in nvme.rst for the ocp argument
3. Squashed the ocp arg changes into the main patch.
4. Fixed several comments and an open parenthesis
5. Hex values are now in lower case.
6. Change the reserved format to rsvd
7. Made sure that NvmeCtrl is the first arg in all the functions.
8. Fixed comment on commit of main patch

V2 changes:
1. I moved the ocp parameter from the namespace to the subsystem as it is
   defined there in the OCP specification
2. I now accumulate statistics from all namespaces and report them back on
   the extended log as per the spec.
3. I removed the default case in the switch in nvme_vendor_specific_log as
   it does not have any special function.

Joel Granados (2):
  nvme: Move adjustment of data_units{read,written}
  nvme: Add physical writes/reads from OCP log

 docs/system/devices/nvme.rst |  7 
 hw/nvme/ctrl.c   | 69 
 hw/nvme/nvme.h   |  1 +
 include/block/nvme.h | 36 +++
 4 files changed, 107 insertions(+), 6 deletions(-)

-- 
2.30.2




[PATCH v3 2/2] nvme: Add physical writes/reads from OCP log

2022-11-16 Thread Joel Granados
In order to evaluate write amplification factor (WAF) within the storage
stack it is important to know the number of bytes written to the
controller. The existing SMART log value of Data Units Written is too
coarse (given in units of 500 KB) and so we add the SMART health
information extended from the OCP specification (given in units of bytes)

We add a controller argument (ocp) that toggles on/off the SMART log
extended structure.  To accommodate different vendor specific specifications
like OCP, we add a multiplexing function (nvme_vendor_specific_log) which
will route to the different log functions based on arguments and log ids.
We only return the OCP extended SMART log when the command is 0xC0 and ocp
has been turned on in the args.

Though we add the whole nvme SMART log extended structure, we only populate
the physical_media_units_{read,written}, log_page_version and
log_page_uuid.

Signed-off-by: Joel Granados 
---
 docs/system/devices/nvme.rst |  7 +
 hw/nvme/ctrl.c   | 55 
 hw/nvme/nvme.h   |  1 +
 include/block/nvme.h | 36 +++
 4 files changed, 99 insertions(+)

diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index 30f841ef62..1cc5e52c00 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -53,6 +53,13 @@ parameters.
   Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
   previously used.
 
+``ocp`` (default: ``off``)
+  The Open Compute Project defines the Datacenter NVMe SSD Specification that
+  sits on top of NVMe. It describes additional commands and NVMe behaviors
+  specific for the Datacenter. When this option is ``on`` OCP features such as
+  the SMART / Health information extended log become available in the
+  controller.
+
 Additional Namespaces
 -
 
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index bf291f7ffe..c7215a4ed1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4455,6 +4455,41 @@ static void nvme_set_blk_stats(NvmeNamespace *ns, struct 
nvme_stats *stats)
 stats->write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
 }
 
+static uint16_t nvme_ocp_extended_smart_info(NvmeCtrl *n, uint8_t rae,
+ uint32_t buf_len, uint64_t off,
+ NvmeRequest *req)
+{
+NvmeNamespace *ns = NULL;
+NvmeSmartLogExtended smart_l = { 0 };
+struct nvme_stats stats = { 0 };
+uint32_t trans_len;
+
+if (off >= sizeof(smart_l)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+/* accumulate all stats from all namespaces */
+for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ns = nvme_ns(n, i);
+if (ns) {
+nvme_set_blk_stats(ns, &stats);
+}
+}
+
+smart_l.physical_media_units_written[0] = cpu_to_le32(stats.units_written);
+smart_l.physical_media_units_read[0] = cpu_to_le32(stats.units_read);
+smart_l.log_page_version = 0x0003;
+smart_l.log_page_uuid[0] = 0xA4F2BFEA2810AFC5;
+smart_l.log_page_uuid[1] = 0xAFD514C97C6F4F9C;
+
+if (!rae) {
+nvme_clear_events(n, NVME_AER_TYPE_SMART);
+}
+
+trans_len = MIN(sizeof(smart_l) - off, buf_len);
+return nvme_c2h(n, (uint8_t *)&smart_l + off, trans_len, req);
+}
+
 static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
 uint64_t off, NvmeRequest *req)
 {
@@ -4642,6 +4677,23 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t 
csi, uint32_t buf_len,
 return nvme_c2h(n, ((uint8_t *)&log) + off, trans_len, req);
 }
 
+static uint16_t nvme_vendor_specific_log(NvmeCtrl *n, uint8_t rae,
+ uint32_t buf_len, uint64_t off,
+ NvmeRequest *req, uint8_t lid)
+{
+switch (lid) {
+case 0xc0:
+if (n->params.ocp) {
+return nvme_ocp_extended_smart_info(n, rae, buf_len, off, req);
+}
+break;
+/* add a case for each additional vendor specific log id */
+}
+
+trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 {
 NvmeCmd *cmd = &req->cmd;
@@ -4683,6 +4735,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest 
*req)
 return nvme_error_info(n, rae, len, off, req);
 case NVME_LOG_SMART_INFO:
 return nvme_smart_info(n, rae, len, off, req);
+case NVME_LOG_VENDOR_START...NVME_LOG_VENDOR_END:
+return nvme_vendor_specific_log(n, rae, len, off, req, lid);
 case NVME_LOG_FW_SLOT_INFO:
 return nvme_fw_log_info(n, len, off, req);
 case NVME_LOG_CHANGED_NSLIST:
@@ -7685,6 +7739,7 @@ static Property nvme_props[] = {
   params.sriov_max_vi_per_vf, 0),
 DEFINE_PROP_UINT8("sriov_max_vq_per_vf", NvmeCtrl,
   

Re: [PATCH v3] block/rbd: Add support for layered encryption

2022-11-16 Thread Ilya Dryomov
On Wed, Nov 16, 2022 at 12:15 PM Daniel P. Berrangé  wrote:
>
> On Wed, Nov 16, 2022 at 10:23:52AM +, Daniel P. Berrangé wrote:
> > On Wed, Nov 16, 2022 at 09:03:31AM +, Or Ozeri wrote:
> > > > -Original Message-
> > > > From: Daniel P. Berrangé 
> > > > Sent: 15 November 2022 19:47
> > > > To: Or Ozeri 
> > > > Cc: qemu-de...@nongnu.org; qemu-block@nongnu.org; Danny Harnik
> > > > ; idryo...@gmail.com
> > > > Subject: [EXTERNAL] Re: [PATCH v3] block/rbd: Add support for layered
> > > > encryption
> > > >
> > > > AFAICT, supporting layered encryption shouldn't require anything other 
> > > > than
> > > > the 'parent' addition.
> > > >
> > >
> > > Since the layered encryption API is new in librbd, we don't have to
> > > support "luks" and "luks2" at all.
> > > In librbd we are actually deprecating the use of "luks" and "luks2",
> > > and instead ask users to use "luks-any".
> >
> > Deprecating that is a bad idea. The security characteristics and
> > feature set of LUKSv1 and LUKSv2 can be quite different. If a mgmt
> > app is expecting the volume to be protected with LUKSv2, it should
> > be stating that explicitly and not permit a silent downgrade if
> > the volume was unexpectedly using LUKSv1.
> >
> > > If we don't add "luks-any" here, we will need to implement
> > > explicit cases for "luks" and "luks2" in the qemu_rbd_encryption_load2.
> > > This looks like a kind of wasteful coding that won't be actually used
> > > by users of the rbd driver in qemu.
> >
> > It isn't wasteful - supporting the formats explicitly is desirable
> > to prevent format downgrades.
> >
> > > Anyhow, we need the "luks-any" option for our use-case, so if you
> > > insist, I will first submit a patch to add "luks-any", before this
> > > patch.
> >
> > I'm pretty wary of any kind of automatic encryption format detection
> > in QEMU. The automatic block driver format probing has been a long
> > standing source of CVEs in QEMU and every single mgmt app above QEMU.
>
> Having said that, normal linux LUKS tools like cryptsetup or systemd
> LUKS integration will auto-detect  luks1 vs luks2. All cryptsetup
> commands also have an option to explicitly specify the format version.
>
> So with that precedent I guess it is ok to add 'luks-any'.

Yeah, I think we may need to reconsider the intent to deprecate
LUKS1 and LUKS2 options for loading encryption in librbd in favor
of a generic LUKS(-ANY) option.  But, just on its own, LUKS(-ANY)
is definitely a thing and having it exposed in QEMU seems natural.

Thanks,

Ilya



Re: [PATCH v2 2/9] block-copy: add missing coroutine_fn annotations

2022-11-16 Thread Paolo Bonzini

On 11/15/22 16:41, Emanuele Giuseppe Esposito wrote:

To sum up what was discussed in this series: I don't really see any
strong objection against these patches, so I will soon send v3, which is
pretty much the same except that patch 1 will be removed.

I think these patches are useful and will be even more meaningful to
reviewers when I send all the rwlock patches in the next few days.


Yes, I agree.

FWIW I implemented path search in vrc and it found 133 candidates 
(functions that are only called by coroutine_fns but are not coroutine_fns 
themselves).  I only list them after the signature because, as expected, 
most of them are pointless; however, some are obviously correct:


1) some have _co_ in their name :)

2) these five directly call a generated_co_wrapper so they're an easy catch:

vhdx_log_write_and_flush-> bdrv_flush
vhdx_log_write_and_flush-> bdrv_pread
vhdx_log_write_and_flush-> bdrv_pwrite
mirror_flush-> blk_flush
qcow2_check_refcounts   -> bdrv_pwrite
qcow2_check_refcounts   -> bdrv_pwrite_sync
qcow2_check_refcounts   -> bdrv_pread
qcow2_read_extensions   -> bdrv_pread
check_directory_consistency -> bdrv_pwrite

(vrc lets me query this with "paths [coroutine_fn_candidate] 
[no_coroutine_fn]")


3) I can also query (with "paths [coroutine_fn_candidate] ... 
[no_coroutine_fn]") those that end up calling a generated_co_wrapper. 
Among these, vrc catches block_copy_reset_unallocated from this patch:


block_copy_reset_unallocated
block_crypto_co_create_generic
calculate_l2_meta
check_directory_consistency
commit_direntries
commit_one_file
is_zero
mirror_flush
qcow2_alloc_bytes
qcow2_alloc_cluster_abort
qcow2_alloc_clusters_at
qcow2_check_refcounts
qcow2_get_last_cluster
qcow2_read_extensions
qcow2_read_snapshots
qcow2_truncate_bitmaps_check
qcow2_update_options
vhdx_log_write_and_flush
vmdk_is_cid_valid
zero_l2_subclusters

Another possibility is to identify common "entry points" in the paths to 
the no_coroutine_fn and make them generated_co_wrappers.  For example in 
qcow2 these include bitmap_list_load, update_refcount and 
get_cluster_table and the qcow2_snapshot_* functions.


Of course the analysis would have to be rerun after doing every change.

The most time consuming part is labeling coroutine_fn/no_coroutine_fn, 
which would be useful to do with clang (and at this point you might as 
well extract the CFG with it).  Doing the queries totally by hand 
doesn't quite scale (for example vrc's blind spot is inlining and I 
forgot to disable it, but I only noticed too late...), but it should be 
scriptable since after all VRC is just a Python package + a nice CLI.


Thanks,

Paolo



label coroutine_fn_candidate aio_get_thread_pool
label coroutine_fn_candidate aio_task_pool_free
label coroutine_fn_candidate aio_task_pool_status
label coroutine_fn_candidate bdrv_bsc_fill
label coroutine_fn_candidate bdrv_bsc_invalidate_range
label coroutine_fn_candidate bdrv_bsc_is_data
label coroutine_fn_candidate bdrv_can_write_zeroes_with_unmap
label coroutine_fn_candidate bdrv_check_request
label coroutine_fn_candidate bdrv_dirty_bitmap_get
label coroutine_fn_candidate bdrv_dirty_bitmap_get_locked
label coroutine_fn_candidate bdrv_dirty_bitmap_lock
label coroutine_fn_candidate bdrv_dirty_bitmap_next_dirty_area
label coroutine_fn_candidate bdrv_dirty_bitmap_next_zero
label coroutine_fn_candidate bdrv_dirty_bitmap_set_inconsistent
label coroutine_fn_candidate bdrv_dirty_bitmap_status
label coroutine_fn_candidate bdrv_dirty_bitmap_truncate
label coroutine_fn_candidate bdrv_dirty_bitmap_unlock
label coroutine_fn_candidate bdrv_dirty_iter_free
label coroutine_fn_candidate bdrv_dirty_iter_new
label coroutine_fn_candidate bdrv_dirty_iter_next
label coroutine_fn_candidate bdrv_has_readonly_bitmaps
label coroutine_fn_candidate bdrv_inc_in_flight
label coroutine_fn_candidate bdrv_min_mem_align
label coroutine_fn_candidate bdrv_pad_request
label coroutine_fn_candidate bdrv_probe_all
label coroutine_fn_candidate bdrv_reset_dirty_bitmap_locked
label coroutine_fn_candidate bdrv_round_to_clusters
label coroutine_fn_candidate bdrv_set_dirty
label coroutine_fn_candidate bdrv_set_dirty_iter
label coroutine_fn_candidate bdrv_write_threshold_check_write
label coroutine_fn_candidate blk_check_byte_request
label coroutine_fn_candidate blkverify_err
label coroutine_fn_candidate block_copy_async
label coroutine_fn_candidate block_copy_call_cancel
label coroutine_fn_candidate block_copy_call_cancelled
label coroutine_fn_candidate block_copy_call_failed
label coroutine_fn_candidate block_copy_call_finished
label coroutine_fn_candidate block_copy_call_free
label coroutine_fn_candidate block_copy_call_status
label coroutine_fn_candidate block_copy_call_succeeded
label coroutine_fn_candidate block_copy_reset_unallocated
label coroutine_fn_candidate block_copy_set_skip_unallocated
label coroutine_fn_candidate block_crypto_co_create_generic
label coroutine_fn_candidate 

Re: [PATCH v2 3/3] nvme: Add physical writes/reads from OCP log

2022-11-16 Thread Joel Granados
On Tue, Nov 15, 2022 at 12:26:17PM +0100, Klaus Jensen wrote:
> On Nov 14 14:50, Joel Granados wrote:
> > In order to evaluate write amplification factor (WAF) within the storage
> > stack it is important to know the number of bytes written to the
> > controller. The existing SMART log value of Data Units Written is too
> > coarse (given in units of 500 KB) and so we add the SMART health
> > information extended from the OCP specification (given in units of bytes).
> > 
> > To accommodate different vendor-specific specifications like OCP, we add a
> > multiplexing function (nvme_vendor_specific_log) which will route to the
> > different log functions based on arguments and log ids. We only return the
> > OCP extended smart log when the command is 0xC0 and ocp has been turned on
> > in the args.
> > 
> > Though we add the whole nvme smart log extended structure, we only populate
> > the physical_media_units_{read,written}, log_page_version and
> > log_page_uuid.
> > 
> > Signed-off-by: Joel Granados 
> > 
> > squash with main
> > 
> > Signed-off-by: Joel Granados 
> 
> Looks like you slightly messed up the squash ;)
oops. that is my bad

> 
> Also, squash the previous patch (adding the ocp parameter) into this.
Here I wanted to keep the introduction of the argument separate. In any
case, I'll squash it with the other one.

> Please add a note in the documentation (docs/system/devices/nvme.rst)
> about this parameter.
Of course. I always forget documentation. I'll add it under the
"Controller Emulation" section and I'll call it ``ocp``

> 
> > ---
> >  hw/nvme/ctrl.c   | 56 
> >  include/block/nvme.h | 36 
> >  2 files changed, 92 insertions(+)
> > 
> > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > index 220683201a..5e6a8150a2 100644
> > --- a/hw/nvme/ctrl.c
> > +++ b/hw/nvme/ctrl.c
> > @@ -4455,6 +4455,42 @@ static void nvme_set_blk_stats(NvmeNamespace *ns, 
> > struct nvme_stats *stats)
> >  stats->write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
> >  }
> >  
> > +static uint16_t nvme_ocp_extended_smart_info(NvmeCtrl *n, uint8_t rae,
> > + uint32_t buf_len, uint64_t 
> > off,
> > + NvmeRequest *req)
> > +{
> > +NvmeNamespace *ns = NULL;
> > +NvmeSmartLogExtended smart_ext = { 0 };
> > +struct nvme_stats stats = { 0 };
> > +uint32_t trans_len;
> > +
> > +if (off >= sizeof(smart_ext)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +// Accumulate all stats from all namespaces
> 
> Use /* lower-case and no period */ for one sentence, one line comments.
> 
> I think scripts/checkpatch.pl picks this up.
There is a checkpatch like in the kernel. Fantastic! I'll make a note to
use it from now on.


> 
> > +for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
> > +ns = nvme_ns(n, i);
> > +if (ns)
> > +{
> 
> Parentheses go on the same line as the `if`.
of course

> 
> > +nvme_set_blk_stats(ns, &stats);
> > +}
> > +}
> > +
> > +smart_ext.physical_media_units_written[0] = 
> > cpu_to_le32(stats.units_written);
> > +smart_ext.physical_media_units_read[0] = cpu_to_le32(stats.units_read);
> > +smart_ext.log_page_version = 0x0003;
> > +smart_ext.log_page_uuid[0] = 0xA4F2BFEA2810AFC5;
> > +smart_ext.log_page_uuid[1] = 0xAFD514C97C6F4F9C;
> > +
> > +if (!rae) {
> > +nvme_clear_events(n, NVME_AER_TYPE_SMART);
> > +}
> > +
> > +trans_len = MIN(sizeof(smart_ext) - off, buf_len);
> > +return nvme_c2h(n, (uint8_t *)&smart_ext + off, trans_len, req);
> > +}
> > +
> >  static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
> >  uint64_t off, NvmeRequest *req)
> >  {
> > @@ -4642,6 +4678,24 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, 
> > uint8_t csi, uint32_t buf_len,
> >  return nvme_c2h(n, ((uint8_t *)&log) + off, trans_len, req);
> >  }
> >  
> > +static uint16_t nvme_vendor_specific_log(uint8_t lid, NvmeCtrl *n, uint8_t 
> > rae,
> > + uint32_t buf_len, uint64_t off,
> > + NvmeRequest *req)
> 
> `NvmeCtrl *n` must be first parameter.
Any reason why this is the case? I'll change it in my code, but would be
nice to understand the reason.


> 
> > +{
> > +NvmeSubsystem *subsys = n->subsys;
> > +switch (lid) {
> > +case NVME_LOG_VENDOR_START:
> 
> In this particular case, I think it is more clear if you simply use the
> hex value directly. The "meaning" of the log page id depends on if or
> not this is an controller implementing the OCP spec.
Agreed

> 
> > +if (subsys->params.ocp) {
> > +return nvme_ocp_extended_smart_info(n, rae, buf_len, off, 
> > req);
> > +}
> > +break;
> > +/* Add a case for each additional vendor 

Re: [PATCH maybe-7.2 1/3] hw/i2c: only schedule pending master when bus is idle

2022-11-16 Thread Cédric Le Goater

On 11/16/22 09:43, Klaus Jensen wrote:

From: Klaus Jensen 

It is not given that the current master will release the bus after a
transfer ends. Only schedule a pending master if the bus is idle.

Fixes: 37fa5ca42623 ("hw/i2c: support multiple masters")
Signed-off-by: Klaus Jensen 
---
  hw/i2c/aspeed_i2c.c  |  2 ++
  hw/i2c/core.c| 37 ++---
  include/hw/i2c/i2c.h |  2 ++
  3 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/hw/i2c/aspeed_i2c.c b/hw/i2c/aspeed_i2c.c
index c166fd20fa11..1f071a3811f7 100644
--- a/hw/i2c/aspeed_i2c.c
+++ b/hw/i2c/aspeed_i2c.c
@@ -550,6 +550,8 @@ static void aspeed_i2c_bus_handle_cmd(AspeedI2CBus *bus, 
uint64_t value)
  }
  SHARED_ARRAY_FIELD_DP32(bus->regs, reg_cmd, M_STOP_CMD, 0);
  aspeed_i2c_set_state(bus, I2CD_IDLE);
+
+i2c_schedule_pending_master(bus->bus);


Shouldn't it be i2c_bus_release() ?

Thanks,

C.



  }
  
  if (aspeed_i2c_bus_pkt_mode_en(bus)) {

diff --git a/hw/i2c/core.c b/hw/i2c/core.c
index d4ba8146bffb..bed594fe599b 100644
--- a/hw/i2c/core.c
+++ b/hw/i2c/core.c
@@ -185,22 +185,39 @@ int i2c_start_transfer(I2CBus *bus, uint8_t address, bool 
is_recv)
  
  void i2c_bus_master(I2CBus *bus, QEMUBH *bh)

  {
+I2CPendingMaster *node = g_new(struct I2CPendingMaster, 1);
+node->bh = bh;
+
+QSIMPLEQ_INSERT_TAIL(&bus->pending_masters, node, entry);
+}
+
+void i2c_schedule_pending_master(I2CBus *bus)
+{
+I2CPendingMaster *node;
+
  if (i2c_bus_busy(bus)) {
-I2CPendingMaster *node = g_new(struct I2CPendingMaster, 1);
-node->bh = bh;
-
-QSIMPLEQ_INSERT_TAIL(&bus->pending_masters, node, entry);
+/* someone is already controlling the bus; wait for it to release it */
+return;
+}
  
+if (QSIMPLEQ_EMPTY(&bus->pending_masters)) {

  return;
  }
  
-bus->bh = bh;

+node = QSIMPLEQ_FIRST(&bus->pending_masters);
+bus->bh = node->bh;
+
+QSIMPLEQ_REMOVE_HEAD(&bus->pending_masters, entry);
+g_free(node);
+
  qemu_bh_schedule(bus->bh);
  }
  
  void i2c_bus_release(I2CBus *bus)

  {
  bus->bh = NULL;
+
+i2c_schedule_pending_master(bus);
  }
  
  int i2c_start_recv(I2CBus *bus, uint8_t address)

@@ -234,16 +251,6 @@ void i2c_end_transfer(I2CBus *bus)
  g_free(node);
  }
  bus->broadcast = false;
-
-if (!QSIMPLEQ_EMPTY(&bus->pending_masters)) {
-I2CPendingMaster *node = QSIMPLEQ_FIRST(&bus->pending_masters);
-bus->bh = node->bh;
-
-QSIMPLEQ_REMOVE_HEAD(&bus->pending_masters, entry);
-g_free(node);
-
-qemu_bh_schedule(bus->bh);
-}
  }
  
  int i2c_send(I2CBus *bus, uint8_t data)

diff --git a/include/hw/i2c/i2c.h b/include/hw/i2c/i2c.h
index 9b9581d23097..2a3abacd1ba6 100644
--- a/include/hw/i2c/i2c.h
+++ b/include/hw/i2c/i2c.h
@@ -141,6 +141,8 @@ int i2c_start_send(I2CBus *bus, uint8_t address);
   */
  int i2c_start_send_async(I2CBus *bus, uint8_t address);
  
+void i2c_schedule_pending_master(I2CBus *bus);

+
  void i2c_end_transfer(I2CBus *bus);
  void i2c_nack(I2CBus *bus);
  void i2c_ack(I2CBus *bus);





Fwd: [FOSDEM] CfP Software Defined Storage devroom FOSDEM23

2022-11-16 Thread Niels de Vos
Hi!

In a few months' time FOSDEM will host an in-person Software Defined
Storage devroom again. It would be a great opportunity to show what
storage-related things the QEMU project has been doing, and what is
planned for the future. Please consider proposing a talk!

Thanks,
Niels


- Forwarded message from Jan Fajerski  -

> From: Jan Fajerski 
> To: fos...@lists.fosdem.org
> Cc: devroom-manag...@lists.fosdem.org
> Date: Thu, 10 Nov 2022 10:49:51 +0100
> Subject: [FOSDEM] CfP Software Defined Storage devroom FOSDEM23
> 
> FOSDEM is a free software event that offers open source communities a place to
> meet, share ideas and collaborate.  It is well known for being highly
> developer-oriented and in the past brought together 8000+ participants from
> all over the world.  Its home is in the city of Brussels (Belgium).
> 
> FOSDEM 2023 will take place as an in-person event during the weekend of 
> February
> 4./5. 2023. More details about the event can be found at http://fosdem.org/
> 
> ** Call For Participation
> 
> The Software Defined Storage devroom will go into its seventh round for talks
> around Open Source Software Defined Storage projects, management tools
> and real world deployments.
> 
> Presentation topics could include but are not limited to:
> 
> - Your work on a SDS project like Ceph, Gluster, OpenEBS, CORTX or Longhorn
> 
> - Your work on or with SDS related projects like OpenStack SWIFT or Container
> Storage Interface
> 
> - Management tools for SDS deployments
> 
> - Monitoring tools for SDS clusters
> 
> ** Important dates:
> 
> - Dec 10th 2022:  submission deadline for talk proposals
> - Dec 15th 2022:  announcement of the final schedule
> - Feb  4th 2023:  Software Defined Storage dev room
> 
> Talk proposals will be reviewed by a steering committee:
> - Niels de Vos (Red Hat)
> - Jan Fajerski (Red Hat)
> - TBD
> 
> We also welcome additional volunteers to help with making this devroom a
> success.
> 
> Use the FOSDEM 'pentabarf' tool to submit your proposal:
> https://penta.fosdem.org/submission/FOSDEM23
> 
> - If necessary, create a Pentabarf account and activate it.
> Please reuse your account from previous years if you have
> already created it.
> https://penta.fosdem.org/user/new_account/FOSDEM23
> 
> - In the "Person" section, provide First name, Last name
> (in the "General" tab), Email (in the "Contact" tab)
> and Bio ("Abstract" field in the "Description" tab).
> 
> - Submit a proposal by clicking on "Create event".
> 
> - If you plan to register your proposal in several tracks to increase your
> chances, don't! Register your talk once, in the most accurate track.
> 
> - Presentations have to be pre-recorded before the event and will be streamed
> on   the event weekend.
> 
> - Important! Select the "Software Defined Storage devroom" track
> (on the "General" tab).
> 
> - Provide the title of your talk ("Event title" in the "General" tab).
> 
> - Provide a description of the subject of the talk and the
> intended audience (in the "Abstract" field of the "Description" tab)
> 
> - Provide a rough outline of the talk or goals of the session (a short
> list of bullet points covering topics that will be discussed) in the
> "Full description" field in the "Description" tab
> 
> - Provide an expected length of your talk in the "Duration" field.
>   We suggest a length between 15 and 45 minutes.
> 
> ** Recording of talks
> 
> The FOSDEM organizers plan to have live streaming and recording fully working,
> both for remote/later viewing of talks, and so that people can watch streams
> in the hallways when rooms are full. This requires speakers to consent to
> being recorded and streamed. If you plan to be a speaker, please understand
> that by doing so you implicitly give consent for your talk to be recorded and
> streamed. The recordings will be published under the same license as all
> FOSDEM content (CC-BY).
> 
> Hope to hear from you soon! And please forward this announcement.
> 
> If you have any further questions, please write to the mailing list at
> storage-devr...@lists.fosdem.org and we will try to answer as soon as
> possible.
> 
> Thanks!
> 
> ___
> FOSDEM mailing list
> fos...@lists.fosdem.org
> https://lists.fosdem.org/listinfo/fosdem

- End forwarded message -




RE: [PULL 00/30] Next patches

2022-11-16 Thread Xu, Ling1
Hi, All,
  We greatly appreciate your time spent reviewing our patch.
  The second CI failure caused by our patch has been addressed. One simple 
fix is moving "#endif" in qemu/tests/bench/xbzrle-bench.c from line 46 to line 
450.
We have submitted patch v7 with this modification. Thanks for your time 
again.

Best Regards,
Ling
  

-Original Message-
From: Stefan Hajnoczi  
Sent: Wednesday, November 16, 2022 2:58 AM
To: Juan Quintela ; Xu, Ling1 ; Zhao, 
Zhou ; Jin, Jun I 
Cc: qemu-de...@nongnu.org; Michael Tokarev ; Marc-André Lureau 
; David Hildenbrand ; Laurent 
Vivier ; Paolo Bonzini ; Daniel P. 
Berrangé ; Peter Xu ; Stefan Hajnoczi 
; Dr. David Alan Gilbert ; Thomas 
Huth ; qemu-block@nongnu.org; qemu-triv...@nongnu.org; 
Philippe Mathieu-Daudé ; Fam Zheng 
Subject: Re: [PULL 00/30] Next patches

On Tue, 15 Nov 2022 at 10:40, Juan Quintela  wrote:
>
> The following changes since commit 98f10f0e2613ba1ac2ad3f57a5174014f6dcb03d:
>
>   Merge tag 'pull-target-arm-20221114' of 
> https://git.linaro.org/people/pmaydell/qemu-arm into staging 
> (2022-11-14 13:31:17 -0500)
>
> are available in the Git repository at:
>
>   https://gitlab.com/juan.quintela/qemu.git tags/next-pull-request
>
> for you to fetch changes up to d896a7a40db13fc2d05828c94ddda2747530089c:
>
>   migration: Block migration comment or code is wrong (2022-11-15 
> 10:31:06 +0100)
>
> 
> Migration PULL request (take 2)
>
> Hi
>
> This time properly signed.
>
> [take 1]
> It includes:
> - Leonardo fix for zero_copy flush
> - Fiona fix for return value of readv/writev
> - Peter Xu cleanups
> - Peter Xu preempt patches
> - Patches ready from zero page (me)
> - AVX2 support (ling)
> - fix for slow networking and reordering of first packets (manish)
>
> Please, apply.
>
> 
>
> Fiona Ebner (1):
>   migration/channel-block: fix return value for
> qio_channel_block_{readv,writev}
>
> Juan Quintela (5):
>   multifd: Create page_size fields into both MultiFD{Recv,Send}Params
>   multifd: Create page_count fields into both MultiFD{Recv,Send}Params
>   migration: Export ram_transferred_ram()
>   migration: Export ram_release_page()
>   migration: Block migration comment or code is wrong
>
> Leonardo Bras (1):
>   migration/multifd/zero-copy: Create helper function for flushing
>
> Peter Xu (20):
>   migration: Fix possible infinite loop of ram save process
>   migration: Fix race on qemu_file_shutdown()
>   migration: Disallow postcopy preempt to be used with compress
>   migration: Use non-atomic ops for clear log bitmap
>   migration: Disable multifd explicitly with compression
>   migration: Take bitmap mutex when completing ram migration
>   migration: Add postcopy_preempt_active()
>   migration: Cleanup xbzrle zero page cache update logic
>   migration: Trivial cleanup save_page_header() on same block check
>   migration: Remove RAMState.f references in compression code
>   migration: Yield bitmap_mutex properly when sending/sleeping
>   migration: Use atomic ops properly for page accountings
>   migration: Teach PSS about host page
>   migration: Introduce pss_channel
>   migration: Add pss_init()
>   migration: Make PageSearchStatus part of RAMState
>   migration: Move last_sent_block into PageSearchStatus
>   migration: Send requested page directly in rp-return thread
>   migration: Remove old preempt code around state maintainance
>   migration: Drop rs->f
>
> ling xu (2):
>   Update AVX512 support for xbzrle_encode_buffer
>   Unit test code and benchmark code

This commit causes the following CI failure:

cc -m64 -mcx16 -Ilibauthz.fa.p -I. -I.. -Iqapi -Itrace -Iui/shader
-I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include
-fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem 
/builds/qemu-project/qemu/linux-headers -isystem linux-headers -iquote . 
-iquote /builds/qemu-project/qemu -iquote /builds/qemu-project/qemu/include 
-iquote
/builds/qemu-project/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE
-D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE 
-Wstrict-prototypes -Wredundant-decls -Wundef -Wwrite-strings 
-Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv 
-Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security 
-Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs 
-Wendif-labels -Wexpansion-to-defined
-Wimplicit-fallthrough=2 -Wno-missing-include-dirs -Wno-shift-negative-value 
-Wno-psabi -fstack-protector-strong -fPIE -MD -MQ 
libauthz.fa.p/authz_simple.c.o -MF libauthz.fa.p/authz_simple.c.o.d -o 
libauthz.fa.p/authz_simple.c.o -c ../authz/simple.c In file included from 
../authz/simple.c:23:
../authz/trace.h:1:10: fatal error: trace/trace-authz.h: No such file or 
directory
1 | #include "trace/trace-authz.h"
| ^


Re: [PATCH v2 2/3] nvme: Add ocp to the subsys

2022-11-16 Thread Joel Granados
On Tue, Nov 15, 2022 at 12:11:50PM +0100, Klaus Jensen wrote:
> On Nov 14 14:50, Joel Granados wrote:
> > The Open Compute Project defines a Datacenter NVMe SSD Spec that sits on
> > top of the NVMe spec. Additional commands and NVMe behaviors specific for
> > the Datacenter. This is a preparation patch that introduces an argument to
> > activate OCP in nvme.
> > 
> > Signed-off-by: Joel Granados 
> > ---
> >  hw/nvme/nvme.h   | 1 +
> >  hw/nvme/subsys.c | 4 ++--
> >  2 files changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> > index 79f5c281c2..aa99c0c57c 100644
> > --- a/hw/nvme/nvme.h
> > +++ b/hw/nvme/nvme.h
> > @@ -56,6 +56,7 @@ typedef struct NvmeSubsystem {
> >  
> >  struct {
> >  char *nqn;
> > +bool ocp;
> >  } params;
> >  } NvmeSubsystem;
> >  
> > diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
> > index 9d2643678b..ecca28449c 100644
> > --- a/hw/nvme/subsys.c
> > +++ b/hw/nvme/subsys.c
> > @@ -129,8 +129,8 @@ static void nvme_subsys_realize(DeviceState *dev, Error 
> > **errp)
> >  
> >  static Property nvme_subsystem_props[] = {
> >  DEFINE_PROP_STRING("nqn", NvmeSubsystem, params.nqn),
> > -DEFINE_PROP_END_OF_LIST(),
> > -};
> > +DEFINE_PROP_BOOL("ocp", NvmeSubsystem, params.ocp, false),
> 
> It is the controller that implements the OCP specification, not the
> namespace or the subsystem. The parameter should be on the controller
> device.
Makes sense. I'll put the option in hw/nvme/ctrl.c

> 
> We discussed that the Get Log Page was subsystem scoped and not
> namespace scoped, but that is unrelated to this.
Yep, this was the confusion. Thx for clarifying.

> 
> > +DEFINE_PROP_END_OF_LIST(), };
> >  
> >  static void nvme_subsys_class_init(ObjectClass *oc, void *data)
> >  {
> > -- 
> > 2.30.2
> > 
> > 






Re: [PATCH RFC 2/3] hw/i2c: add mctp core

2022-11-16 Thread Corey Minyard
On Wed, Nov 16, 2022 at 09:43:11AM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Add an abstract MCTP over I2C endpoint model. This implements MCTP
> control message handling as well as handling the actual I2C transport
> (packetization).
> 
> Devices are intended to derive from this and implement the class
> methods.
> 
> Parts of this implementation are inspired by code[1] previously posted by
> Jonathan Cameron.

I have some comments inline, mostly about buffer handling.  Buffer
handling is scary to me, so you might see some paranoia here :-).

> 
>   [1]: 
> https://lore.kernel.org/qemu-devel/20220520170128.4436-1-jonathan.came...@huawei.com/
> 
> Signed-off-by: Klaus Jensen 
> ---
>  hw/arm/Kconfig |   1 +
>  hw/i2c/Kconfig |   4 +
>  hw/i2c/mctp.c  | 365 +
>  hw/i2c/meson.build |   1 +
>  hw/i2c/trace-events|  12 ++
>  include/hw/i2c/mctp.h  |  83 ++
>  include/hw/misc/mctp.h |  43 +
>  7 files changed, 509 insertions(+)
>  create mode 100644 hw/i2c/mctp.c
>  create mode 100644 include/hw/i2c/mctp.h
>  create mode 100644 include/hw/misc/mctp.h
> 
> diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
> index 17fcde8e1ccc..3233bdc193d7 100644
> --- a/hw/arm/Kconfig
> +++ b/hw/arm/Kconfig
> @@ -444,6 +444,7 @@ config ASPEED_SOC
>  select DS1338
>  select FTGMAC100
>  select I2C
> +select MCTP_I2C
>  select DPS310
>  select PCA9552
>  select SERIAL
> diff --git a/hw/i2c/Kconfig b/hw/i2c/Kconfig
> index 9bb8870517f8..5dd43d550c32 100644
> --- a/hw/i2c/Kconfig
> +++ b/hw/i2c/Kconfig
> @@ -41,3 +41,7 @@ config PCA954X
>  config PMBUS
>  bool
>  select SMBUS
> +
> +config MCTP_I2C
> +bool
> +select I2C
> diff --git a/hw/i2c/mctp.c b/hw/i2c/mctp.c
> new file mode 100644
> index ..46376de95a98
> --- /dev/null
> +++ b/hw/i2c/mctp.c
> @@ -0,0 +1,365 @@
> +/*
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + * SPDX-FileCopyrightText: Copyright (c) 2022 Samsung Electronics Co., Ltd.
> + * SPDX-FileContributor: Klaus Jensen 
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
> +
> +#include "hw/qdev-properties.h"
> +#include "hw/i2c/i2c.h"
> +#include "hw/i2c/mctp.h"
> +
> +#include "trace.h"
> +
> +static uint8_t crc8(uint16_t data)
> +{
> +#define POLY (0x1070U << 3)
> +int i;
> +
> +for (i = 0; i < 8; i++) {
> +if (data & 0x8000) {
> +data = data ^ POLY;
> +}
> +
> +data = data << 1;
> +}
> +
> +return (uint8_t)(data >> 8);
> +#undef POLY
> +}
> +
> +static uint8_t i2c_smbus_pec(uint8_t crc, uint8_t *buf, size_t len)
> +{
> +int i;
> +
> +for (i = 0; i < len; i++) {
> +crc = crc8((crc ^ buf[i]) << 8);
> +}
> +
> +return crc;
> +}

The PEC calculation probably belongs in its own smbus.c file, since
it's generic, so someone looking will find it.

> +
> +void i2c_mctp_schedule_send(MCTPI2CEndpoint *mctp)
> +{
> +I2CBus *i2c = I2C_BUS(qdev_get_parent_bus(DEVICE(mctp)));
> +
> +mctp->tx.state = I2C_MCTP_STATE_TX_START_SEND;
> +
> +i2c_bus_master(i2c, mctp->tx.bh);
> +}
> +
> +static void i2c_mctp_tx(void *opaque)
> +{
> +DeviceState *dev = DEVICE(opaque);
> +I2CBus *i2c = I2C_BUS(qdev_get_parent_bus(dev));
> +I2CSlave *slave = I2C_SLAVE(dev);
> +MCTPI2CEndpoint *mctp = MCTP_I2C_ENDPOINT(dev);
> +MCTPI2CEndpointClass *mc = MCTP_I2C_ENDPOINT_GET_CLASS(mctp);
> +MCTPI2CPacket *pkt = (MCTPI2CPacket *)mctp->buffer;
> +uint8_t flags = 0;
> +
> +switch (mctp->tx.state) {
> +case I2C_MCTP_STATE_TX_SEND_BYTE:
> +if (mctp->pos < mctp->len) {
> +uint8_t byte = mctp->buffer[mctp->pos];
> +
> +trace_i2c_mctp_tx_send_byte(mctp->pos, byte);
> +
> +/* send next byte */
> +i2c_send_async(i2c, byte);
> +
> +mctp->pos++;
> +
> +break;
> +}
> +
> +/* packet sent */
> +i2c_end_transfer(i2c);
> +
> +/* fall through */
> +
> +case I2C_MCTP_STATE_TX_START_SEND:
> +if (mctp->tx.is_control) {
> +/* packet payload is already in buffer */
> +flags |= MCTP_H_FLAGS_SOM | MCTP_H_FLAGS_EOM;
> +} else {
> +/* get message bytes from derived device */
> +mctp->len = mc->get_message_bytes(mctp, pkt->mctp.payload,
> +  I2C_MCTP_MAXMTU, &flags);
> +}
> +
> +if (!mctp->len) {
> +trace_i2c_mctp_tx_done();
> +
> +/* no more packets needed; release the bus */
> +i2c_bus_release(i2c);
> +
> +mctp->state = I2C_MCTP_STATE_IDLE;
> +mctp->tx.is_control = false;
> +
> +break;
> +}
> +
> +mctp->state = I2C_MCTP_STATE_TX;
> +
> +pkt->i2c = (MCTPI2CPacketHeader) {
> +.dest = mctp->tx.addr & ~0x1,
> +.prot = 0xf,
> +.byte_count 

[PATCH 05/15] block: use bdrv_co_refresh_total_sectors when possible

2022-11-16 Thread Emanuele Giuseppe Esposito
In some places we are sure we are always running in a
coroutine, so there is no point in calling the generated_co_wrapper;
call the _co_ function directly instead.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c | 6 +++---
 block/copy-on-read.c  | 2 +-
 block/io.c| 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index fc19cf423e..6a84a2a019 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1235,8 +1235,8 @@ void blk_set_disable_request_queuing(BlockBackend *blk, 
bool disable)
 blk->disable_request_queuing = disable;
 }
 
-static int blk_check_byte_request(BlockBackend *blk, int64_t offset,
-  int64_t bytes)
+static coroutine_fn int blk_check_byte_request(BlockBackend *blk,
+   int64_t offset, int64_t bytes)
 {
 int64_t len;
 
@@ -1253,7 +1253,7 @@ static int blk_check_byte_request(BlockBackend *blk, 
int64_t offset,
 }
 
 if (!blk->allow_write_beyond_eof) {
-len = bdrv_getlength(blk_bs(blk));
+len = bdrv_co_getlength(blk_bs(blk));
 if (len < 0) {
 return len;
 }
diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 815ac1d835..74f7727a02 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -122,7 +122,7 @@ static void cor_child_perm(BlockDriverState *bs, BdrvChild 
*c,
 
 static int64_t cor_getlength(BlockDriverState *bs)
 {
-return bdrv_getlength(bs->file->bs);
+return bdrv_co_getlength(bs->file->bs);
 }
 
 
diff --git a/block/io.c b/block/io.c
index 99867fe148..3f65c57f82 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3381,7 +3381,7 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, 
int64_t offset, bool exact,
 if (new_bytes && backing) {
 int64_t backing_len;
 
-backing_len = bdrv_getlength(backing->bs);
+backing_len = bdrv_co_getlength(backing->bs);
 if (backing_len < 0) {
 ret = backing_len;
 error_setg_errno(errp, -ret, "Could not get backing file size");
@@ -3411,7 +3411,7 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, 
int64_t offset, bool exact,
 goto out;
 }
 
-ret = bdrv_refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
+ret = bdrv_co_refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Could not refresh total sector count");
 } else {
-- 
2.31.1




[PATCH 03/15] block-backend: use bdrv_getlength instead of blk_getlength

2022-11-16 Thread Emanuele Giuseppe Esposito
The only difference is that blk_ checks if the block is available,
but this check is already performed above in blk_check_byte_request().

This is in preparation for the graph rdlock, which will be taken
by both the callers of blk_check_byte_request() and blk_getlength().

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 6f0dd15808..4af9a3179e 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1253,7 +1253,7 @@ static int blk_check_byte_request(BlockBackend *blk, 
int64_t offset,
 }
 
 if (!blk->allow_write_beyond_eof) {
-len = blk_getlength(blk);
+len = bdrv_getlength(blk_bs(blk));
 if (len < 0) {
 return len;
 }
-- 
2.31.1




[PATCH 09/15] block-coroutine-wrapper: support void functions

2022-11-16 Thread Emanuele Giuseppe Esposito
Just omit the various 'return' when the return type is
void.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 scripts/block-coroutine-wrapper.py | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/scripts/block-coroutine-wrapper.py 
b/scripts/block-coroutine-wrapper.py
index 05267761f0..8d7aa5d7f4 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -115,7 +115,7 @@ def create_g_c_w(func: FuncDecl) -> str:
 {func.return_type} {func.name}({ func.gen_list('{decl}') })
 {{
 if (qemu_in_coroutine()) {{
-return {name}({ func.gen_list('{name}') });
+{func.co_ret}{name}({ func.gen_list('{name}') });
 }} else {{
 {struct_name} s = {{
 .poll_state.bs = {func.bs},
@@ -127,7 +127,7 @@ def create_g_c_w(func: FuncDecl) -> str:
 s.poll_state.co = qemu_coroutine_create({name}_entry, &s);
 
 bdrv_poll_co(_state);
-return s.ret;
+{func.ret}
 }}
 }}"""
 
@@ -150,7 +150,7 @@ def create_coroutine_only(func: FuncDecl) -> str:
 s.poll_state.co = qemu_coroutine_create({name}_entry, &s);
 
 bdrv_poll_co(_state);
-return s.ret;
+{func.ret}
 }}"""
 
 
@@ -168,6 +168,15 @@ def gen_wrapper(func: FuncDecl) -> str:
 graph_lock='bdrv_graph_co_rdlock();'
 graph_unlock='bdrv_graph_co_rdunlock();'
 
+func.get_result = 's->ret = '
+func.ret = 'return s.ret;'
+func.co_ret = 'return '
+func.return_field = func.return_type + " ret;"
+if func.return_type == 'void':
+func.get_result = ''
+func.ret = ''
+func.co_ret = ''
+func.return_field = ''
 
 t = func.args[0].type
 if t == 'BlockDriverState *':
@@ -193,7 +202,7 @@ def gen_wrapper(func: FuncDecl) -> str:
 
 typedef struct {struct_name} {{
 BdrvPollCo poll_state;
-{func.return_type} ret;
+{func.return_field}
 { func.gen_block('{decl};') }
 }} {struct_name};
 
@@ -202,7 +211,7 @@ def gen_wrapper(func: FuncDecl) -> str:
 {struct_name} *s = opaque;
 
 {graph_lock}
-s->ret = {name}({ func.gen_list('s->{name}') });
+{func.get_result}{name}({ func.gen_list('s->{name}') });
 {graph_unlock}
 s->poll_state.in_progress = false;
 
-- 
2.31.1
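The four snippets the patch threads through `func` (`get_result`, `ret`, `co_ret`, `return_field`) all follow one rule: every fragment that touches the return value collapses to the empty string for void. A standalone sketch of that rule (the helper name is invented for illustration and is not part of the QEMU script):

```python
def wrapper_snippets(return_type: str) -> dict:
    """Code fragments to splice into a generated wrapper (toy version)."""
    if return_type == 'void':
        # A void function has no result to store, return, or declare.
        return {'get_result': '', 'ret': '', 'co_ret': '', 'return_field': ''}
    return {
        'get_result': 's->ret = ',              # capture the coroutine's result
        'ret': 'return s.ret;',                 # return it after polling
        'co_ret': 'return ',                    # direct call in coroutine context
        'return_field': f'{return_type} ret;',  # field in the state struct
    }

print(wrapper_snippets('int64_t')['return_field'])   # int64_t ret;
```

Setting the non-void defaults first and overwriting them in a single `if`, as the patch does, keeps the template strings in `gen_wrapper()` free of conditionals.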




[PATCH 08/15] block: convert bdrv_is_inserted in generated_co_wrapper_simple

2022-11-16 Thread Emanuele Giuseppe Esposito
BlockDriver->bdrv_is_inserted is categorized as an IO callback, and
it currently doesn't run in a coroutine.
This makes it very difficult to add the graph rdlock, since the
callback traverses the block nodes graph.

Therefore use generated_co_wrapper_simple to automatically
create a wrapper with the same name that takes care of
always calling the function in a coroutine.
This is a generated_co_wrapper_simple callback, which means
it assumes that bdrv_is_inserted is never called in coroutine
context.

At the same time, also add blk_is_inserted as g_c_w, since it
is called in both coroutine and non-coroutine contexts.

Because this function now creates a new coroutine and polls,
we need to take the AioContext lock where it is missing,
for the sole reason that g_c_w internally calls AIO_WAIT_WHILE,
which expects the AioContext lock to be held so it can release
it while waiting.
Once the rwlock is finalized and placed in every place it needs
to be, we will poll using AIO_WAIT_WHILE_UNLOCKED and remove
the AioContext lock.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c   |  5 +++--
 block/block-backend.c |  5 +++--
 block/io.c| 12 ++--
 blockdev.c|  8 +++-
 include/block/block-io.h  |  5 -
 include/block/block_int-common.h  |  4 ++--
 include/sysemu/block-backend-io.h |  5 -
 7 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/block.c b/block.c
index 2cdede9c01..4205735308 100644
--- a/block.c
+++ b/block.c
@@ -6778,11 +6778,12 @@ out:
 /**
  * Return TRUE if the media is present
  */
-bool bdrv_is_inserted(BlockDriverState *bs)
+bool coroutine_fn bdrv_co_is_inserted(BlockDriverState *bs)
 {
 BlockDriver *drv = bs->drv;
 BdrvChild *child;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (!drv) {
 return false;
@@ -6791,7 +6792,7 @@ bool bdrv_is_inserted(BlockDriverState *bs)
 return drv->bdrv_is_inserted(bs);
 }
 QLIST_FOREACH(child, &bs->children, next) {
-if (!bdrv_is_inserted(child->bs)) {
+if (!bdrv_co_is_inserted(child->bs)) {
 return false;
 }
 }
diff --git a/block/block-backend.c b/block/block-backend.c
index 6a84a2a019..9a500fdde3 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1994,12 +1994,13 @@ void blk_activate(BlockBackend *blk, Error **errp)
 bdrv_activate(bs, errp);
 }
 
-bool blk_is_inserted(BlockBackend *blk)
+bool coroutine_fn blk_co_is_inserted(BlockBackend *blk)
 {
 BlockDriverState *bs = blk_bs(blk);
 IO_CODE();
+assert_bdrv_graph_readable();
 
-return bs && bdrv_is_inserted(bs);
+return bs && bdrv_co_is_inserted(bs);
 }
 
 bool blk_is_available(BlockBackend *blk)
diff --git a/block/io.c b/block/io.c
index 99ef9a8cb9..88da9470c3 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1598,7 +1598,7 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
 
 trace_bdrv_co_preadv_part(bs, offset, bytes, flags);
 
-if (!bdrv_is_inserted(bs)) {
+if (!bdrv_co_is_inserted(bs)) {
 return -ENOMEDIUM;
 }
 
@@ -2044,7 +2044,7 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
 
 trace_bdrv_co_pwritev_part(child->bs, offset, bytes, flags);
 
-if (!bdrv_is_inserted(bs)) {
+if (!bdrv_co_is_inserted(bs)) {
 return -ENOMEDIUM;
 }
 
@@ -2764,7 +2764,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
 assert_bdrv_graph_readable();
 bdrv_inc_in_flight(bs);
 
-if (!bdrv_is_inserted(bs) || bdrv_is_read_only(bs) ||
+if (!bdrv_co_is_inserted(bs) || bdrv_is_read_only(bs) ||
 bdrv_is_sg(bs)) {
 goto early_exit;
 }
@@ -2889,7 +2889,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, 
int64_t offset,
 IO_CODE();
 assert_bdrv_graph_readable();
 
-if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
+if (!bs || !bs->drv || !bdrv_co_is_inserted(bs)) {
 return -ENOMEDIUM;
 }
 
@@ -3173,7 +3173,7 @@ static int coroutine_fn bdrv_co_copy_range_internal(
 assert(!(read_flags & BDRV_REQ_NO_WAIT));
 assert(!(write_flags & BDRV_REQ_NO_WAIT));
 
-if (!dst || !dst->bs || !bdrv_is_inserted(dst->bs)) {
+if (!dst || !dst->bs || !bdrv_co_is_inserted(dst->bs)) {
 return -ENOMEDIUM;
 }
 ret = bdrv_check_request32(dst_offset, bytes, NULL, 0);
@@ -3184,7 +3184,7 @@ static int coroutine_fn bdrv_co_copy_range_internal(
 return bdrv_co_pwrite_zeroes(dst, dst_offset, bytes, write_flags);
 }
 
-if (!src || !src->bs || !bdrv_is_inserted(src->bs)) {
+if (!src || !src->bs || !bdrv_co_is_inserted(src->bs)) {
 return -ENOMEDIUM;
 }
 ret = bdrv_check_request32(src_offset, bytes, NULL, 0);
diff --git a/blockdev.c b/blockdev.c
index 8ffb3d9537..bbe5207e8f 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1023,6 +1023,7 @@ fail:
 static BlockDriverState *qmp_get_root_bs(const char *name, Error **errp)
 {
 BlockDriverState *bs;
+AioContext *aio_context;
 
  

[PATCH 06/15] block: convert bdrv_get_allocated_file_size in generated_co_wrapper_simple

2022-11-16 Thread Emanuele Giuseppe Esposito
BlockDriver->bdrv_get_allocated_file_size is categorized as an IO
callback, and it currently doesn't run in a coroutine.
This makes it very difficult to add the graph rdlock, since the
callback traverses the block nodes graph.

Therefore use generated_co_wrapper_simple to automatically
create a wrapper with the same name that takes care of
always calling the function in a coroutine.
This is a generated_co_wrapper_simple callback, which means
it assumes that bdrv_get_allocated_file_size is never called
in coroutine context.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c  | 7 ---
 block/qcow2-refcount.c   | 2 +-
 include/block/block-io.h | 5 -
 include/block/block_int-common.h | 3 ++-
 4 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index c7b32ba17a..5650d6fe63 100644
--- a/block.c
+++ b/block.c
@@ -5708,7 +5708,7 @@ static int64_t 
bdrv_sum_allocated_file_size(BlockDriverState *bs)
 if (child->role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
BDRV_CHILD_FILTERED))
 {
-child_size = bdrv_get_allocated_file_size(child->bs);
+child_size = bdrv_co_get_allocated_file_size(child->bs);
 if (child_size < 0) {
 return child_size;
 }
@@ -5723,10 +5723,11 @@ static int64_t 
bdrv_sum_allocated_file_size(BlockDriverState *bs)
  * Length of a allocated file in bytes. Sparse files are counted by actual
  * allocated space. Return < 0 if error or unknown.
  */
-int64_t bdrv_get_allocated_file_size(BlockDriverState *bs)
+int64_t coroutine_fn bdrv_co_get_allocated_file_size(BlockDriverState *bs)
 {
 BlockDriver *drv = bs->drv;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (!drv) {
 return -ENOMEDIUM;
@@ -5744,7 +5745,7 @@ int64_t bdrv_get_allocated_file_size(BlockDriverState *bs)
 return -ENOTSUP;
 } else if (drv->is_filter) {
 /* Filter drivers default to the size of their filtered child */
-return bdrv_get_allocated_file_size(bdrv_filter_bs(bs));
+return bdrv_co_get_allocated_file_size(bdrv_filter_bs(bs));
 } else {
 /* Other drivers default to summing their children's sizes */
 return bdrv_sum_allocated_file_size(bs);
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 81264740f0..487681d85e 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -3719,7 +3719,7 @@ int coroutine_fn 
qcow2_detect_metadata_preallocation(BlockDriverState *bs)
 return file_length;
 }
 
-real_allocation = bdrv_get_allocated_file_size(bs->file->bs);
+real_allocation = bdrv_co_get_allocated_file_size(bs->file->bs);
 if (real_allocation < 0) {
 return real_allocation;
 }
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 4fb95e9b7a..ac509c461f 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -74,7 +74,10 @@ int64_t generated_co_wrapper 
bdrv_nb_sectors(BlockDriverState *bs);
 int64_t coroutine_fn bdrv_co_getlength(BlockDriverState *bs);
 int64_t generated_co_wrapper bdrv_getlength(BlockDriverState *bs);
 
-int64_t bdrv_get_allocated_file_size(BlockDriverState *bs);
+int64_t generated_co_wrapper_simple bdrv_get_allocated_file_size(
+BlockDriverState *bs);
+int64_t coroutine_fn bdrv_co_get_allocated_file_size(BlockDriverState *bs);
+
 BlockMeasureInfo *bdrv_measure(BlockDriver *drv, QemuOpts *opts,
BlockDriverState *in_bs, Error **errp);
 void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index d1cf52d4f7..42daf86c65 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -725,7 +725,8 @@ struct BlockDriver {
  BdrvRequestFlags flags, Error **errp);
 /* Called with graph rdlock held. */
 int64_t coroutine_fn (*bdrv_getlength)(BlockDriverState *bs);
-int64_t (*bdrv_get_allocated_file_size)(BlockDriverState *bs);
+/* Called with graph rdlock held. */
+int64_t coroutine_fn (*bdrv_get_allocated_file_size)(BlockDriverState *bs);
 
 /* Does not need graph rdlock, since it does not traverse the graph */
 BlockMeasureInfo *(*bdrv_measure)(QemuOpts *opts, BlockDriverState *in_bs,
-- 
2.31.1




[PATCH 14/15] block: convert bdrv_io_unplug in generated_co_wrapper_simple

2022-11-16 Thread Emanuele Giuseppe Esposito
BlockDriver->bdrv_io_unplug is categorized as an IO callback, and
it currently doesn't run in a coroutine.
This makes it very difficult to add the graph rdlock, since the
callback traverses the block nodes graph.

The only caller of this function is blk_io_unplug, therefore
make blk_io_unplug a generated_co_wrapper_simple, so that
it always creates a new coroutine, and then make bdrv_io_unplug
coroutine_fn.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c | 5 +++--
 block/io.c| 5 +++--
 include/block/block-io.h  | 3 +--
 include/block/block_int-common.h  | 2 +-
 include/sysemu/block-backend-io.h | 4 +++-
 5 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 826a936beb..3b10e35ea4 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2340,13 +2340,14 @@ void coroutine_fn blk_co_io_plug(BlockBackend *blk)
 }
 }
 
-void blk_io_unplug(BlockBackend *blk)
+void coroutine_fn blk_co_io_unplug(BlockBackend *blk)
 {
 BlockDriverState *bs = blk_bs(blk);
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (bs) {
-bdrv_io_unplug(bs);
+bdrv_co_io_unplug(bs);
 }
 }
 
diff --git a/block/io.c b/block/io.c
index d3b8c1e4b2..48a94dd384 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3086,10 +3086,11 @@ void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs)
 }
 }
 
-void bdrv_io_unplug(BlockDriverState *bs)
+void coroutine_fn bdrv_co_io_unplug(BlockDriverState *bs)
 {
 BdrvChild *child;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 assert(bs->io_plugged);
 if (qatomic_fetch_dec(&bs->io_plugged) == 1) {
@@ -3100,7 +3101,7 @@ void bdrv_io_unplug(BlockDriverState *bs)
 }
 
 QLIST_FOREACH(child, &bs->children, next) {
-bdrv_io_unplug(child->bs);
+bdrv_co_io_unplug(child->bs);
 }
 }
 
diff --git a/include/block/block-io.h b/include/block/block-io.h
index a045643b26..f93357681a 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -216,8 +216,7 @@ void bdrv_coroutine_enter(BlockDriverState *bs, Coroutine 
*co);
 AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c);
 
 void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs);
-
-void bdrv_io_unplug(BlockDriverState *bs);
+void coroutine_fn bdrv_co_io_unplug(BlockDriverState *bs);
 
 bool coroutine_fn bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs,
  const char *name,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index ed96bc3241..3ab3fa45a2 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -787,7 +787,7 @@ struct BlockDriver {
 
 /* io queue for linux-aio. Called with graph rdlock taken. */
 void coroutine_fn (*bdrv_io_plug)(BlockDriverState *bs);
-void (*bdrv_io_unplug)(BlockDriverState *bs);
+void coroutine_fn (*bdrv_io_unplug)(BlockDriverState *bs);
 
 /**
  * bdrv_drain_begin is called if implemented in the beginning of a
diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index 703fcc3ac5..c8df7320f7 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -90,7 +90,9 @@ int blk_get_max_hw_iov(BlockBackend *blk);
 void coroutine_fn blk_co_io_plug(BlockBackend *blk);
 void generated_co_wrapper_simple blk_io_plug(BlockBackend *blk);
 
-void blk_io_unplug(BlockBackend *blk);
+void coroutine_fn blk_co_io_unplug(BlockBackend *blk);
+void generated_co_wrapper_simple blk_io_unplug(BlockBackend *blk);
+
 AioContext *blk_get_aio_context(BlockBackend *blk);
 BlockAcctStats *blk_get_stats(BlockBackend *blk);
 void *blk_aio_get(const AIOCBInfo *aiocb_info, BlockBackend *blk,
-- 
2.31.1




[PATCH 11/15] block: convert bdrv_lock_medium in generated_co_wrapper_simple

2022-11-16 Thread Emanuele Giuseppe Esposito
BlockDriver->bdrv_lock_medium is categorized as an IO callback, and
it currently doesn't run in a coroutine.
This makes it very difficult to add the graph rdlock, since the
callback traverses the block nodes graph.

The only caller of this function is blk_lock_medium, therefore
make blk_lock_medium a generated_co_wrapper_simple, so that
it always creates a new coroutine, and then make bdrv_lock_medium
coroutine_fn.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c   | 3 ++-
 block/block-backend.c | 5 +++--
 block/copy-on-read.c  | 2 +-
 block/filter-compress.c   | 2 +-
 block/raw-format.c| 2 +-
 include/block/block-io.h  | 2 +-
 include/block/block_int-common.h  | 2 +-
 include/sysemu/block-backend-io.h | 5 -
 8 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index ffbb8c602f..afc5735b82 100644
--- a/block.c
+++ b/block.c
@@ -6817,10 +6817,11 @@ void coroutine_fn bdrv_co_eject(BlockDriverState *bs, 
bool eject_flag)
  * Lock or unlock the media (if it is locked, the user won't be able
  * to eject it manually).
  */
-void bdrv_lock_medium(BlockDriverState *bs, bool locked)
+void coroutine_fn bdrv_co_lock_medium(BlockDriverState *bs, bool locked)
 {
 BlockDriver *drv = bs->drv;
 IO_CODE();
+assert_bdrv_graph_readable();
 trace_bdrv_lock_medium(bs, locked);
 
 if (drv && drv->bdrv_lock_medium) {
diff --git a/block/block-backend.c b/block/block-backend.c
index 308dd2070a..75e2f2124f 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2009,13 +2009,14 @@ bool blk_is_available(BlockBackend *blk)
 return blk_is_inserted(blk) && !blk_dev_is_tray_open(blk);
 }
 
-void blk_lock_medium(BlockBackend *blk, bool locked)
+void coroutine_fn blk_co_lock_medium(BlockBackend *blk, bool locked)
 {
 BlockDriverState *bs = blk_bs(blk);
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (bs) {
-bdrv_lock_medium(bs, locked);
+bdrv_co_lock_medium(bs, locked);
 }
 }
 
diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 76f884a6ae..ccc767f37b 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -224,7 +224,7 @@ static void cor_eject(BlockDriverState *bs, bool eject_flag)
 
 static void cor_lock_medium(BlockDriverState *bs, bool locked)
 {
-bdrv_lock_medium(bs->file->bs, locked);
+bdrv_co_lock_medium(bs->file->bs, locked);
 }
 
 
diff --git a/block/filter-compress.c b/block/filter-compress.c
index 571e4684dd..e10312c225 100644
--- a/block/filter-compress.c
+++ b/block/filter-compress.c
@@ -124,7 +124,7 @@ static void compress_eject(BlockDriverState *bs, bool 
eject_flag)
 
 static void compress_lock_medium(BlockDriverState *bs, bool locked)
 {
-bdrv_lock_medium(bs->file->bs, locked);
+bdrv_co_lock_medium(bs->file->bs, locked);
 }
 
 
diff --git a/block/raw-format.c b/block/raw-format.c
index 9b23cf17bb..96a9b33384 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -410,7 +410,7 @@ static void raw_eject(BlockDriverState *bs, bool eject_flag)
 
 static void raw_lock_medium(BlockDriverState *bs, bool locked)
 {
-bdrv_lock_medium(bs->file->bs, locked);
+bdrv_co_lock_medium(bs->file->bs, locked);
 }
 
 static int coroutine_fn raw_co_ioctl(BlockDriverState *bs,
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 204adeb701..497580fc28 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -124,7 +124,7 @@ int bdrv_get_flags(BlockDriverState *bs);
 bool coroutine_fn bdrv_co_is_inserted(BlockDriverState *bs);
 bool generated_co_wrapper_simple bdrv_is_inserted(BlockDriverState *bs);
 
-void bdrv_lock_medium(BlockDriverState *bs, bool locked);
+void coroutine_fn bdrv_co_lock_medium(BlockDriverState *bs, bool locked);
 void coroutine_fn bdrv_co_eject(BlockDriverState *bs, bool eject_flag);
 
 const char *bdrv_get_format_name(BlockDriverState *bs);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index d01b3d44f5..3e1eba518c 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -762,7 +762,7 @@ struct BlockDriver {
 /* removable device specific. Called with graph rdlock held. */
 bool coroutine_fn (*bdrv_is_inserted)(BlockDriverState *bs);
 void coroutine_fn (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
-void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);
+void coroutine_fn (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);
 
 /* to control generic scsi devices. Called with graph rdlock taken. */
 BlockAIOCB *coroutine_fn (*bdrv_aio_ioctl)(BlockDriverState *bs,
diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index cc706c03d8..dd8566ee69 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -58,7 +58,10 @@ bool coroutine_fn blk_co_is_inserted(BlockBackend *blk);
 bool 

[PATCH 13/15] block: convert bdrv_io_plug in generated_co_wrapper_simple

2022-11-16 Thread Emanuele Giuseppe Esposito
BlockDriver->bdrv_io_plug is categorized as an IO callback, and
it currently doesn't run in a coroutine.
This makes it very difficult to add the graph rdlock, since the
callback traverses the block nodes graph.

The only caller of this function is blk_io_plug, therefore
make blk_io_plug a generated_co_wrapper_simple, so that
it always creates a new coroutine, and then make bdrv_io_plug
coroutine_fn.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c | 5 +++--
 block/io.c| 5 +++--
 include/block/block-io.h  | 3 ++-
 include/block/block_int-common.h  | 4 ++--
 include/sysemu/block-backend-io.h | 4 +++-
 5 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 75e2f2124f..826a936beb 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2329,13 +2329,14 @@ void blk_add_insert_bs_notifier(BlockBackend *blk, 
Notifier *notify)
 notifier_list_add(&blk->insert_bs_notifiers, notify);
 }
 
-void blk_io_plug(BlockBackend *blk)
+void coroutine_fn blk_co_io_plug(BlockBackend *blk)
 {
 BlockDriverState *bs = blk_bs(blk);
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (bs) {
-bdrv_io_plug(bs);
+bdrv_co_io_plug(bs);
 }
 }
 
diff --git a/block/io.c b/block/io.c
index 11d2c5dcde..d3b8c1e4b2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3068,13 +3068,14 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t 
size)
 return mem;
 }
 
-void bdrv_io_plug(BlockDriverState *bs)
+void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs)
 {
 BdrvChild *child;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 QLIST_FOREACH(child, &bs->children, next) {
-bdrv_io_plug(child->bs);
+bdrv_co_io_plug(child->bs);
 }
 
 if (qatomic_fetch_inc(&bs->io_plugged) == 0) {
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 176e3cc734..a045643b26 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -215,7 +215,8 @@ void bdrv_coroutine_enter(BlockDriverState *bs, Coroutine 
*co);
 
 AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c);
 
-void bdrv_io_plug(BlockDriverState *bs);
+void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs);
+
 void bdrv_io_unplug(BlockDriverState *bs);
 
 bool coroutine_fn bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index b509855c19..ed96bc3241 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -785,8 +785,8 @@ struct BlockDriver {
 void coroutine_fn (*bdrv_debug_event)(BlockDriverState *bs,
   BlkdebugEvent event);
 
-/* io queue for linux-aio */
-void (*bdrv_io_plug)(BlockDriverState *bs);
+/* io queue for linux-aio. Called with graph rdlock taken. */
+void coroutine_fn (*bdrv_io_plug)(BlockDriverState *bs);
 void (*bdrv_io_unplug)(BlockDriverState *bs);
 
 /**
diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index dd8566ee69..703fcc3ac5 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -87,7 +87,9 @@ void blk_iostatus_set_err(BlockBackend *blk, int error);
 int blk_get_max_iov(BlockBackend *blk);
 int blk_get_max_hw_iov(BlockBackend *blk);
 
-void blk_io_plug(BlockBackend *blk);
+void coroutine_fn blk_co_io_plug(BlockBackend *blk);
+void generated_co_wrapper_simple blk_io_plug(BlockBackend *blk);
+
 void blk_io_unplug(BlockBackend *blk);
 AioContext *blk_get_aio_context(BlockBackend *blk);
 BlockAcctStats *blk_get_stats(BlockBackend *blk);
-- 
2.31.1




[PATCH 07/15] block: convert bdrv_get_info in generated_co_wrapper

2022-11-16 Thread Emanuele Giuseppe Esposito
BlockDriver->bdrv_get_info is categorized as an IO callback, and
it currently doesn't run in a coroutine.
This makes it very difficult to add the graph rdlock, since the
callback traverses the block nodes graph.

Therefore use generated_co_wrapper to automatically
create a wrapper with the same name.
Unfortunately we cannot use generated_co_wrapper_simple,
because the function is called by mixed functions that run both
in coroutine and non-coroutine context (for example block_load
in migration/block.c).

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c  |  6 --
 block/crypto.c   |  2 +-
 block/io.c   |  8 
 block/mirror.c   | 10 +++---
 block/raw-format.c   |  2 +-
 block/stream.c   |  4 +++-
 include/block/block-io.h |  6 +-
 include/block/block_int-common.h |  4 +++-
 8 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/block.c b/block.c
index 5650d6fe63..2cdede9c01 100644
--- a/block.c
+++ b/block.c
@@ -6279,11 +6279,13 @@ void bdrv_get_backing_filename(BlockDriverState *bs,
 pstrcpy(filename, filename_size, bs->backing_file);
 }
 
-int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+int coroutine_fn bdrv_co_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 int ret;
 BlockDriver *drv = bs->drv;
 IO_CODE();
+assert_bdrv_graph_readable();
+
 /* if bs->drv == NULL, bs is closed, so there's nothing to do here */
 if (!drv) {
 return -ENOMEDIUM;
@@ -6291,7 +6293,7 @@ int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo 
*bdi)
 if (!drv->bdrv_get_info) {
 BlockDriverState *filtered = bdrv_filter_bs(bs);
 if (filtered) {
-return bdrv_get_info(filtered, bdi);
+return bdrv_co_get_info(filtered, bdi);
 }
 return -ENOTSUP;
 }
diff --git a/block/crypto.c b/block/crypto.c
index 2fb8add458..12a84dd1cd 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -743,7 +743,7 @@ static int block_crypto_get_info_luks(BlockDriverState *bs,
 BlockDriverInfo subbdi;
 int ret;
 
-ret = bdrv_get_info(bs->file->bs, &subbdi);
+ret = bdrv_co_get_info(bs->file->bs, &subbdi);
 if (ret != 0) {
 return ret;
 }
diff --git a/block/io.c b/block/io.c
index 3f65c57f82..99ef9a8cb9 100644
--- a/block/io.c
+++ b/block/io.c
@@ -694,14 +694,14 @@ BdrvTrackedRequest *coroutine_fn 
bdrv_co_get_self_request(BlockDriverState *bs)
 /**
  * Round a region to cluster boundaries
  */
-void bdrv_round_to_clusters(BlockDriverState *bs,
+void coroutine_fn bdrv_round_to_clusters(BlockDriverState *bs,
 int64_t offset, int64_t bytes,
 int64_t *cluster_offset,
 int64_t *cluster_bytes)
 {
 BlockDriverInfo bdi;
 IO_CODE();
-if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
+if (bdrv_co_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
 *cluster_offset = offset;
 *cluster_bytes = bytes;
 } else {
@@ -711,12 +711,12 @@ void bdrv_round_to_clusters(BlockDriverState *bs,
 }
 }
 
-static int bdrv_get_cluster_size(BlockDriverState *bs)
+static coroutine_fn int bdrv_get_cluster_size(BlockDriverState *bs)
 {
 BlockDriverInfo bdi;
 int ret;
 
-ret = bdrv_get_info(bs, &bdi);
+ret = bdrv_co_get_info(bs, &bdi);
 if (ret < 0 || bdi.cluster_size == 0) {
 return bs->bl.request_alignment;
 } else {
diff --git a/block/mirror.c b/block/mirror.c
index aecc895b73..8dc136ebbe 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -576,8 +576,10 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 } else if (ret >= 0 && !(ret & BDRV_BLOCK_DATA)) {
 int64_t target_offset;
 int64_t target_bytes;
-bdrv_round_to_clusters(blk_bs(s->target), offset, io_bytes,
-   &target_offset, &target_bytes);
+WITH_GRAPH_RDLOCK_GUARD() {
+bdrv_round_to_clusters(blk_bs(s->target), offset, io_bytes,
+   &target_offset, &target_bytes);
+}
 if (target_offset == offset &&
 target_bytes == io_bytes) {
 mirror_method = ret & BDRV_BLOCK_ZERO ?
@@ -965,11 +967,13 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  */
 bdrv_get_backing_filename(target_bs, backing_filename,
   sizeof(backing_filename));
-if (!bdrv_get_info(target_bs, &bdi) && bdi.cluster_size) {
+bdrv_graph_co_rdlock();
+if (!bdrv_co_get_info(target_bs, &bdi) && bdi.cluster_size) {
 s->target_cluster_size = bdi.cluster_size;
 } else {
 s->target_cluster_size = BDRV_SECTOR_SIZE;
 }
+bdrv_graph_co_rdunlock();
 if (backing_filename[0] && !bdrv_backing_chain_next(target_bs) &&
 s->granularity < s->target_cluster_size) {
 s->buf_size = MAX(s->buf_size, 

[PATCH 04/15] block: convert bdrv_refresh_total_sectors in generated_co_wrapper

2022-11-16 Thread Emanuele Giuseppe Esposito
BlockDriver->bdrv_getlength is categorized as an IO callback, and
it currently doesn't run in a coroutine.
This makes it very difficult to add the graph rdlock, since the
callback traverses the block nodes graph.

Therefore use generated_co_wrapper to automatically
create a wrapper for bdrv_refresh_total_sectors.
Unfortunately we cannot use generated_co_wrapper_simple,
because the function is called by mixed functions that run both
in coroutine and non-coroutine context (for example
bdrv_open_driver()).

Unfortunately this callback requires multiple bdrv_* and blk_*
callbacks to be converted, because there are different mixed
callers (coroutine and not) calling this callback from different
function paths.

Because this function now creates a new coroutine and polls,
we need to take the AioContext lock where it is missing,
for the sole reason that g_c_w internally calls AIO_WAIT_WHILE
and it expects to release the AioContext lock.
This is especially messy when a g_c_w creates a coroutine and
polls in bdrv_open_driver, because this function has so many
callers in so many contexts that it can easily lead to deadlocks.
Therefore the new rule for bdrv_open_driver is that the caller
must always hold the AioContext lock of the given bs (except
if it is a coroutine), because the function calls
bdrv_refresh_total_sectors(), which is a g_c_w.

Once the rwlock is finalized and placed in every place it needs
to be, we will poll using AIO_WAIT_WHILE_UNLOCKED and remove
the AioContext lock.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c   | 29 -
 block/block-backend.c | 12 
 block/commit.c|  4 ++--
 block/meson.build |  1 +
 block/mirror.c|  7 +--
 hw/scsi/scsi-disk.c   |  5 +
 include/block/block-io.h  |  8 ++--
 include/block/block_int-common.h  |  3 ++-
 include/block/block_int-io.h  |  5 -
 include/sysemu/block-backend-io.h | 10 --
 tests/unit/test-block-iothread.c  |  7 +++
 11 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/block.c b/block.c
index ba4bbb42aa..c7b32ba17a 100644
--- a/block.c
+++ b/block.c
@@ -1044,10 +1044,12 @@ static int find_image_format(BlockBackend *file, const 
char *filename,
  * Set the current 'total_sectors' value
  * Return 0 on success, -errno on error.
  */
-int bdrv_refresh_total_sectors(BlockDriverState *bs, int64_t hint)
+int coroutine_fn bdrv_co_refresh_total_sectors(BlockDriverState *bs,
+   int64_t hint)
 {
 BlockDriver *drv = bs->drv;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (!drv) {
 return -ENOMEDIUM;
@@ -1611,6 +1613,11 @@ out:
 g_free(gen_node_name);
 }
 
+/*
+ * The caller must always hold @bs AioContext lock, because this function calls
+ * bdrv_refresh_total_sectors() which polls when called from non-coroutine
+ * context.
+ */
 static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
 const char *node_name, QDict *options,
 int open_flags, Error **errp)
@@ -3750,6 +3757,10 @@ out:
  * The reference parameter may be used to specify an existing block device 
which
  * should be opened. If specified, neither options nor a filename may be given,
  * nor can an existing BDS be reused (that is, *pbs has to be NULL).
+ *
+ * The caller must always hold @filename AioContext lock, because this
+ * function eventually calls bdrv_refresh_total_sectors() which polls
+ * when called from non-coroutine context.
  */
 static BlockDriverState *bdrv_open_inherit(const char *filename,
const char *reference,
@@ -4038,6 +4049,11 @@ close_and_fail:
 return NULL;
 }
 
+/*
+ * The caller must always hold @filename AioContext lock, because this
+ * function eventually calls bdrv_refresh_total_sectors() which polls
+ * when called from non-coroutine context.
+ */
 BlockDriverState *bdrv_open(const char *filename, const char *reference,
 QDict *options, int flags, Error **errp)
 {
@@ -5774,16 +5790,17 @@ BlockMeasureInfo *bdrv_measure(BlockDriver *drv, 
QemuOpts *opts,
 /**
  * Return number of sectors on success, -errno on error.
  */
-int64_t bdrv_nb_sectors(BlockDriverState *bs)
+int64_t coroutine_fn bdrv_co_nb_sectors(BlockDriverState *bs)
 {
 BlockDriver *drv = bs->drv;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (!drv)
 return -ENOMEDIUM;
 
 if (drv->has_variable_length) {
-int ret = bdrv_refresh_total_sectors(bs, bs->total_sectors);
+int ret = bdrv_co_refresh_total_sectors(bs, bs->total_sectors);
 if (ret < 0) {
 return ret;
 }
@@ -5795,11 +5812,13 @@ int64_t bdrv_nb_sectors(BlockDriverState *bs)
  * Return length in bytes on success, -errno on error.
  * The length is always a multiple 

[PATCH 12/15] block: convert bdrv_debug_event in generated_co_wrapper

2022-11-16 Thread Emanuele Giuseppe Esposito
BlockDriver->bdrv_debug_event is categorized as an IO callback, and
it currently doesn't run in a coroutine.
This makes it very difficult to add the graph rdlock, since the
callback traverses the block nodes graph.

Therefore, use generated_co_wrapper to automatically
create a wrapper with the same name.
Unfortunately we cannot use a generated_co_wrapper_simple,
because the function is called by mixed functions that run both
in coroutine and non-coroutine context (for example blkdebug_open).
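To picture what the generated wrapper does, here is a minimal self-contained sketch of that dispatch; the `in_coroutine` flag and all names are illustrative stand-ins for QEMU's qemu_in_coroutine()/coroutine-creation machinery, not the real API:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for qemu_in_coroutine(): here just a flag set by the caller. */
static bool in_coroutine;

static int debug_events;    /* counts how often the coroutine_fn body ran */

/* Stand-in for the wrapped coroutine_fn (bdrv_co_debug_event). */
static void co_debug_event_sketch(void)
{
    debug_events++;
}

/* Shape of a plain generated_co_wrapper: safe in both calling contexts. */
static void debug_event_wrapper_sketch(void)
{
    if (in_coroutine) {
        /* Caller is already a coroutine (e.g. the blkdebug_open path):
         * call the coroutine_fn directly. */
        co_debug_event_sketch();
    } else {
        /* Caller is outside coroutine context: the generated code creates
         * a coroutine and polls until it completes; modeled as a call. */
        co_debug_event_sketch();
    }
}
```

generated_co_wrapper_simple omits the first branch and instead asserts that the caller is never a coroutine, which is exactly why it cannot be used for a mixed-context function like this one.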

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c  |  4 +++-
 block/io.c   | 22 +++---
 include/block/block-io.h |  7 +--
 include/block/block_int-common.h |  4 +++-
 4 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/block.c b/block.c
index afc5735b82..a2e3550d5c 100644
--- a/block.c
+++ b/block.c
@@ -6331,9 +6331,11 @@ BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs)
 return drv->bdrv_get_specific_stats(bs);
 }
 
-void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
+void coroutine_fn bdrv_co_debug_event(BlockDriverState *bs, BlkdebugEvent event)
 {
 IO_CODE();
+assert_bdrv_graph_readable();
+
 if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
 return;
 }
diff --git a/block/io.c b/block/io.c
index 88da9470c3..11d2c5dcde 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1227,7 +1227,7 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
 goto err;
 }
 
-bdrv_debug_event(bs, BLKDBG_COR_WRITE);
+bdrv_co_debug_event(bs, BLKDBG_COR_WRITE);
 if (drv->bdrv_co_pwrite_zeroes &&
 buffer_is_zero(bounce_buffer, pnum)) {
 /* FIXME: Should we (perhaps conditionally) be setting
@@ -1472,10 +1472,10 @@ static coroutine_fn int bdrv_padding_rmw_read(BdrvChild *child,
 qemu_iovec_init_buf(&local_qiov, pad->buf, bytes);
 
 if (pad->head) {
-bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_HEAD);
+bdrv_co_debug_event(bs, BLKDBG_PWRITEV_RMW_HEAD);
 }
 if (pad->merge_reads && pad->tail) {
-bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_TAIL);
+bdrv_co_debug_event(bs, BLKDBG_PWRITEV_RMW_TAIL);
 }
 ret = bdrv_aligned_preadv(child, req, req->overlap_offset, bytes,
  align, &local_qiov, 0, 0);
@@ -1483,10 +1483,10 @@ static coroutine_fn int bdrv_padding_rmw_read(BdrvChild *child,
 return ret;
 }
 if (pad->head) {
-bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_AFTER_HEAD);
+bdrv_co_debug_event(bs, BLKDBG_PWRITEV_RMW_AFTER_HEAD);
 }
 if (pad->merge_reads && pad->tail) {
-bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_AFTER_TAIL);
+bdrv_co_debug_event(bs, BLKDBG_PWRITEV_RMW_AFTER_TAIL);
 }
 
 if (pad->merge_reads) {
@@ -1497,7 +1497,7 @@ static coroutine_fn int bdrv_padding_rmw_read(BdrvChild *child,
 if (pad->tail) {
 qemu_iovec_init_buf(&local_qiov, pad->tail_buf, align);
 
-bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_TAIL);
+bdrv_co_debug_event(bs, BLKDBG_PWRITEV_RMW_TAIL);
 ret = bdrv_aligned_preadv(
 child, req,
 req->overlap_offset + req->overlap_bytes - align,
@@ -1505,7 +1505,7 @@ static coroutine_fn int bdrv_padding_rmw_read(BdrvChild *child,
 if (ret < 0) {
 return ret;
 }
-bdrv_debug_event(bs, BLKDBG_PWRITEV_RMW_AFTER_TAIL);
+bdrv_co_debug_event(bs, BLKDBG_PWRITEV_RMW_AFTER_TAIL);
 }
 
 zero_mem:
@@ -1908,16 +1908,16 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
 if (ret < 0) {
 /* Do nothing, write notifier decided to fail this request */
 } else if (flags & BDRV_REQ_ZERO_WRITE) {
-bdrv_debug_event(bs, BLKDBG_PWRITEV_ZERO);
+bdrv_co_debug_event(bs, BLKDBG_PWRITEV_ZERO);
 ret = bdrv_co_do_pwrite_zeroes(bs, offset, bytes, flags);
 } else if (flags & BDRV_REQ_WRITE_COMPRESSED) {
 ret = bdrv_driver_pwritev_compressed(bs, offset, bytes,
  qiov, qiov_offset);
 } else if (bytes <= max_transfer) {
-bdrv_debug_event(bs, BLKDBG_PWRITEV);
+bdrv_co_debug_event(bs, BLKDBG_PWRITEV);
 ret = bdrv_driver_pwritev(bs, offset, bytes, qiov, qiov_offset, flags);
 } else {
-bdrv_debug_event(bs, BLKDBG_PWRITEV);
+bdrv_co_debug_event(bs, BLKDBG_PWRITEV);
 while (bytes_remaining) {
 int num = MIN(bytes_remaining, max_transfer);
 int local_flags = flags;
@@ -1940,7 +1940,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
 bytes_remaining -= num;
 }
 }
-bdrv_debug_event(bs, BLKDBG_PWRITEV_DONE);
+bdrv_co_debug_event(bs, BLKDBG_PWRITEV_DONE);
 
 if (ret >= 0) {
   

[PATCH 02/15] block: rename refresh_total_sectors in bdrv_refresh_total_sectors

2022-11-16 Thread Emanuele Giuseppe Esposito
The name is not right, since we are going to convert this into
a generated_co_wrapper.

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c  | 8 
 block/io.c   | 8 +---
 include/block/block_int-io.h | 2 +-
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/block.c b/block.c
index 1a6ae08879..ba4bbb42aa 100644
--- a/block.c
+++ b/block.c
@@ -1044,7 +1044,7 @@ static int find_image_format(BlockBackend *file, const char *filename,
  * Set the current 'total_sectors' value
  * Return 0 on success, -errno on error.
  */
-int refresh_total_sectors(BlockDriverState *bs, int64_t hint)
+int bdrv_refresh_total_sectors(BlockDriverState *bs, int64_t hint)
 {
 BlockDriver *drv = bs->drv;
 IO_CODE();
@@ -1662,7 +1662,7 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
 bs->supported_read_flags |= BDRV_REQ_REGISTERED_BUF;
 bs->supported_write_flags |= BDRV_REQ_REGISTERED_BUF;
 
-ret = refresh_total_sectors(bs, bs->total_sectors);
+ret = bdrv_refresh_total_sectors(bs, bs->total_sectors);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Could not refresh total sector count");
 return ret;
@@ -5783,7 +5783,7 @@ int64_t bdrv_nb_sectors(BlockDriverState *bs)
 return -ENOMEDIUM;
 
 if (drv->has_variable_length) {
-int ret = refresh_total_sectors(bs, bs->total_sectors);
+int ret = bdrv_refresh_total_sectors(bs, bs->total_sectors);
 if (ret < 0) {
 return ret;
 }
@@ -6565,7 +6565,7 @@ int bdrv_activate(BlockDriverState *bs, Error **errp)
 bdrv_dirty_bitmap_skip_store(bm, false);
 }
 
-ret = refresh_total_sectors(bs, bs->total_sectors);
+ret = bdrv_refresh_total_sectors(bs, bs->total_sectors);
 if (ret < 0) {
 bs->open_flags |= BDRV_O_INACTIVE;
error_setg_errno(errp, -ret, "Could not refresh total sector count");
diff --git a/block/io.c b/block/io.c
index 7d1d0c48b0..99867fe148 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3411,15 +3411,17 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
 goto out;
 }
 
-ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
+ret = bdrv_refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Could not refresh total sector count");
 } else {
 offset = bs->total_sectors * BDRV_SECTOR_SIZE;
 }
-/* It's possible that truncation succeeded but refresh_total_sectors
+/*
+ * It's possible that truncation succeeded but bdrv_refresh_total_sectors
  * failed, but the latter doesn't affect how we should finish the request.
- * Pass 0 as the last parameter so that dirty bitmaps etc. are handled. */
+ * Pass 0 as the last parameter so that dirty bitmaps etc. are handled.
+ */
 bdrv_co_write_req_finish(child, offset - new_bytes, new_bytes, &req, 0);
 
 out:
diff --git a/include/block/block_int-io.h b/include/block/block_int-io.h
index ac6ad3b3ff..453855e651 100644
--- a/include/block/block_int-io.h
+++ b/include/block/block_int-io.h
@@ -122,7 +122,7 @@ int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, int64_t src_offset,
BdrvRequestFlags read_flags,
BdrvRequestFlags write_flags);
 
-int refresh_total_sectors(BlockDriverState *bs, int64_t hint);
+int bdrv_refresh_total_sectors(BlockDriverState *bs, int64_t hint);
 
 BdrvChild *bdrv_cow_child(BlockDriverState *bs);
 BdrvChild *bdrv_filter_child(BlockDriverState *bs);
-- 
2.31.1




[PATCH 10/15] block: convert bdrv_eject in generated_co_wrapper_simple

2022-11-16 Thread Emanuele Giuseppe Esposito
BlockDriver->bdrv_eject is categorized as an IO callback, and
it currently doesn't run in a coroutine.
This makes it very difficult to add the graph rdlock, since the
callback traverses the block nodes graph.

The only caller of this function is blk_eject, so
make blk_eject a generated_co_wrapper_simple, so that
it always creates a new coroutine, and then make bdrv_eject
a coroutine_fn.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c   | 3 ++-
 block/block-backend.c | 5 +++--
 block/copy-on-read.c  | 2 +-
 block/filter-compress.c   | 2 +-
 block/raw-format.c| 2 +-
 include/block/block-io.h  | 3 ++-
 include/block/block_int-common.h  | 2 +-
 include/sysemu/block-backend-io.h | 4 +++-
 8 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index 4205735308..ffbb8c602f 100644
--- a/block.c
+++ b/block.c
@@ -6802,10 +6802,11 @@ bool coroutine_fn bdrv_co_is_inserted(BlockDriverState *bs)
 /**
  * If eject_flag is TRUE, eject the media. Otherwise, close the tray
  */
-void bdrv_eject(BlockDriverState *bs, bool eject_flag)
+void coroutine_fn bdrv_co_eject(BlockDriverState *bs, bool eject_flag)
 {
 BlockDriver *drv = bs->drv;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (drv && drv->bdrv_eject) {
 drv->bdrv_eject(bs, eject_flag);
diff --git a/block/block-backend.c b/block/block-backend.c
index 9a500fdde3..308dd2070a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2019,14 +2019,15 @@ void blk_lock_medium(BlockBackend *blk, bool locked)
 }
 }
 
-void blk_eject(BlockBackend *blk, bool eject_flag)
+void coroutine_fn blk_co_eject(BlockBackend *blk, bool eject_flag)
 {
 BlockDriverState *bs = blk_bs(blk);
 char *id;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (bs) {
-bdrv_eject(bs, eject_flag);
+bdrv_co_eject(bs, eject_flag);
 }
 
 /* Whether or not we ejected on the backend,
diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 74f7727a02..76f884a6ae 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -218,7 +218,7 @@ static int coroutine_fn cor_co_pwritev_compressed(BlockDriverState *bs,
 
 static void cor_eject(BlockDriverState *bs, bool eject_flag)
 {
-bdrv_eject(bs->file->bs, eject_flag);
+bdrv_co_eject(bs->file->bs, eject_flag);
 }
 
 
diff --git a/block/filter-compress.c b/block/filter-compress.c
index 305716c86c..571e4684dd 100644
--- a/block/filter-compress.c
+++ b/block/filter-compress.c
@@ -118,7 +118,7 @@ static void compress_refresh_limits(BlockDriverState *bs, Error **errp)
 
 static void compress_eject(BlockDriverState *bs, bool eject_flag)
 {
-bdrv_eject(bs->file->bs, eject_flag);
+bdrv_co_eject(bs->file->bs, eject_flag);
 }
 
 
diff --git a/block/raw-format.c b/block/raw-format.c
index 4773bf9cda..9b23cf17bb 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -405,7 +405,7 @@ static int coroutine_fn raw_co_truncate(BlockDriverState *bs, int64_t offset,
 
 static void raw_eject(BlockDriverState *bs, bool eject_flag)
 {
-bdrv_eject(bs->file->bs, eject_flag);
+bdrv_co_eject(bs->file->bs, eject_flag);
 }
 
 static void raw_lock_medium(BlockDriverState *bs, bool locked)
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 3432e6ad3e..204adeb701 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -125,7 +125,8 @@ bool coroutine_fn bdrv_co_is_inserted(BlockDriverState *bs);
 bool generated_co_wrapper_simple bdrv_is_inserted(BlockDriverState *bs);
 
 void bdrv_lock_medium(BlockDriverState *bs, bool locked);
-void bdrv_eject(BlockDriverState *bs, bool eject_flag);
+void coroutine_fn bdrv_co_eject(BlockDriverState *bs, bool eject_flag);
+
 const char *bdrv_get_format_name(BlockDriverState *bs);
 
 bool bdrv_supports_compressed_writes(BlockDriverState *bs);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 4cad48b2ad..d01b3d44f5 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -761,7 +761,7 @@ struct BlockDriver {
 
 /* removable device specific. Called with graph rdlock held. */
 bool coroutine_fn (*bdrv_is_inserted)(BlockDriverState *bs);
-void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
+void coroutine_fn (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
 void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);
 
 /* to control generic scsi devices. Called with graph rdlock taken. */
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index bf88f7699e..cc706c03d8 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -59,7 +59,9 @@ bool generated_co_wrapper blk_is_inserted(BlockBackend *blk);
 
 bool blk_is_available(BlockBackend *blk);
 void blk_lock_medium(BlockBackend *blk, bool locked);
-void blk_eject(BlockBackend *blk, 

[PATCH 15/15] block: rename newly converted BlockDriver IO coroutine functions

2022-11-16 Thread Emanuele Giuseppe Esposito
Since these functions always run in coroutine context, adjust
their names to include "_co_", just like all other BlockDriver callbacks.

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c  | 32 ++---
 block/blkdebug.c |  4 +--
 block/blkio.c|  6 ++--
 block/blklogwrites.c |  2 +-
 block/blkreplay.c|  2 +-
 block/blkverify.c|  2 +-
 block/copy-on-read.c |  6 ++--
 block/crypto.c   |  4 +--
 block/curl.c |  8 +++---
 block/file-posix.c   | 48 
 block/file-win32.c   |  8 +++---
 block/filter-compress.c  |  6 ++--
 block/gluster.c  | 16 +--
 block/io.c   | 16 +--
 block/iscsi.c|  8 +++---
 block/nbd.c  |  6 ++--
 block/nfs.c  |  2 +-
 block/null.c |  8 +++---
 block/nvme.c |  6 ++--
 block/preallocate.c  |  2 +-
 block/qcow.c |  2 +-
 block/qcow2.c|  6 ++--
 block/qed.c  |  4 +--
 block/quorum.c   |  2 +-
 block/raw-format.c   |  8 +++---
 block/rbd.c  |  4 +--
 block/replication.c  |  2 +-
 block/ssh.c  |  2 +-
 block/throttle.c |  2 +-
 block/vdi.c  |  2 +-
 block/vhdx.c |  2 +-
 block/vmdk.c |  4 +--
 block/vpc.c  |  2 +-
 include/block/block_int-common.h | 27 +-
 34 files changed, 131 insertions(+), 130 deletions(-)

diff --git a/block.c b/block.c
index a2e3550d5c..78b3e7fcd6 100644
--- a/block.c
+++ b/block.c
@@ -1055,13 +1055,13 @@ int coroutine_fn bdrv_co_refresh_total_sectors(BlockDriverState *bs,
 return -ENOMEDIUM;
 }
 
-/* Do not attempt drv->bdrv_getlength() on scsi-generic devices */
+/* Do not attempt drv->bdrv_co_getlength() on scsi-generic devices */
 if (bdrv_is_sg(bs))
 return 0;
 
 /* query actual device if possible, otherwise just trust the hint */
-if (drv->bdrv_getlength) {
-int64_t length = drv->bdrv_getlength(bs);
+if (drv->bdrv_co_getlength) {
+int64_t length = drv->bdrv_co_getlength(bs);
 if (length < 0) {
 return length;
 }
@@ -5695,7 +5695,7 @@ exit:
 }
 
 /**
- * Implementation of BlockDriver.bdrv_get_allocated_file_size() that
+ * Implementation of BlockDriver.bdrv_co_get_allocated_file_size() that
  * sums the size of all data-bearing children.  (This excludes backing
  * children.)
  */
@@ -5732,8 +5732,8 @@ int64_t coroutine_fn bdrv_co_get_allocated_file_size(BlockDriverState *bs)
 if (!drv) {
 return -ENOMEDIUM;
 }
-if (drv->bdrv_get_allocated_file_size) {
-return drv->bdrv_get_allocated_file_size(bs);
+if (drv->bdrv_co_get_allocated_file_size) {
+return drv->bdrv_co_get_allocated_file_size(bs);
 }
 
 if (drv->bdrv_file_open) {
@@ -6290,7 +6290,7 @@ int coroutine_fn bdrv_co_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 if (!drv) {
 return -ENOMEDIUM;
 }
-if (!drv->bdrv_get_info) {
+if (!drv->bdrv_co_get_info) {
 BlockDriverState *filtered = bdrv_filter_bs(bs);
 if (filtered) {
 return bdrv_co_get_info(filtered, bdi);
@@ -6298,7 +6298,7 @@ int coroutine_fn bdrv_co_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 return -ENOTSUP;
 }
 memset(bdi, 0, sizeof(*bdi));
-ret = drv->bdrv_get_info(bs, bdi);
+ret = drv->bdrv_co_get_info(bs, bdi);
 if (ret < 0) {
 return ret;
 }
@@ -6336,11 +6336,11 @@ void coroutine_fn bdrv_co_debug_event(BlockDriverState *bs, BlkdebugEvent event)
 IO_CODE();
 assert_bdrv_graph_readable();
 
-if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
+if (!bs || !bs->drv || !bs->drv->bdrv_co_debug_event) {
 return;
 }
 
-bs->drv->bdrv_debug_event(bs, event);
+bs->drv->bdrv_co_debug_event(bs, event);
 }
 
 static BlockDriverState *bdrv_find_debug_node(BlockDriverState *bs)
@@ -6790,8 +6790,8 @@ bool coroutine_fn bdrv_co_is_inserted(BlockDriverState *bs)
 if (!drv) {
 return false;
 }
-if (drv->bdrv_is_inserted) {
-return drv->bdrv_is_inserted(bs);
+if (drv->bdrv_co_is_inserted) {
+return drv->bdrv_co_is_inserted(bs);
 }
 QLIST_FOREACH(child, >children, next) {
 if (!bdrv_co_is_inserted(child->bs)) {
@@ -6810,8 +6810,8 @@ void coroutine_fn bdrv_co_eject(BlockDriverState *bs, bool eject_flag)
 IO_CODE();
 assert_bdrv_graph_readable();
 
-if (drv && drv->bdrv_eject) {
-drv->bdrv_eject(bs, eject_flag);
+if (drv && drv->bdrv_co_eject) {

[PATCH 01/15] block/qed: add missing graph rdlock in qed_need_check_timer_entry

2022-11-16 Thread Emanuele Giuseppe Esposito
This function is called from two different places:
- the timer callback, which does not take the graph rdlock.
- bdrv_qed_drain_begin(), which is a .bdrv_drain_begin()
  callback that will soon take the lock.

Since it calls recursive functions that traverse the
graph, we need to protect them with the graph rdlock.
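The locking pattern can be sketched in a few lines of self-contained C; the depth counter stands in for QEMU's graph rdlock and assert_bdrv_graph_readable(), and every name here is illustrative rather than the real API:

```c
#include <assert.h>

/* Stand-in for the block-graph rdlock: >0 means "held by this context". */
static int rd_depth;

static void graph_rdlock_sketch(void)    { rd_depth++; }
static void graph_rdunlock_sketch(void)  { assert(rd_depth > 0); rd_depth--; }
static void assert_graph_readable_sketch(void) { assert(rd_depth > 0); }

/* Inner helper that traverses the graph: it never locks by itself (that
 * would be recursive locking); it only asserts a caller holds the lock. */
static void need_check_sketch(void)
{
    assert_graph_readable_sketch();
    /* ... walk the block nodes here ... */
}

/* The coroutine entry point (reached from the timer callback): the one
 * place that takes the lock, like GRAPH_RDLOCK_GUARD() in the patch. */
static void need_check_timer_entry_sketch(void)
{
    graph_rdlock_sketch();
    need_check_sketch();
    graph_rdunlock_sketch();
}
```

Taking the lock only at the outermost entry point is what lets the inner helpers stay recursion-safe.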

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/qed.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/qed.c b/block/qed.c
index c2691a85b1..778b23d0f6 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -282,11 +282,13 @@ static void coroutine_fn qed_unplug_allocating_write_reqs(BDRVQEDState *s)
 qemu_co_mutex_unlock(>table_lock);
 }
 
+/* Called with graph rdlock taken */
 static void coroutine_fn qed_need_check_timer(BDRVQEDState *s)
 {
 int ret;
 
 trace_qed_need_check_timer_cb(s);
+assert_bdrv_graph_readable();
 
 if (!qed_plug_allocating_write_reqs(s)) {
 return;
@@ -312,6 +314,7 @@ static void coroutine_fn qed_need_check_timer(BDRVQEDState *s)
 static void coroutine_fn qed_need_check_timer_entry(void *opaque)
 {
 BDRVQEDState *s = opaque;
+GRAPH_RDLOCK_GUARD();
 
 qed_need_check_timer(opaque);
 bdrv_dec_in_flight(s->bs);
-- 
2.31.1




[PATCH 00/15] Protect the block layer with a rwlock: part 3

2022-11-16 Thread Emanuele Giuseppe Esposito
Please read "Protect the block layer with a rwlock: part 1" and
"Protect the block layer with a rwlock: part 2" for an
additional introduction and aim of this series.

In this series, we cover the remaining BlockDriver IO callbacks that were not
running in a coroutine, and therefore not using the graph rdlock.
We convert them to coroutines, using either g_c_w or a new
variant introduced in this series (see below).

We need to convert these callbacks into coroutines because non-coroutine code
is tied to the main thread, even though it will still delegate I/O accesses to
the iothread (via the bdrv_coroutine_enter call in generated_co_wrappers).
Making callbacks run in coroutines provides more flexibility, because they run
entirely in iothreads and can use CoMutexes for mutual exclusion.

Here we introduce generated_co_wrapper_simple, a simplification of g_c_w that
only considers the case where the caller is not in a coroutine.
This simplifies the code and clarifies whether the caller is a coroutine or
not, and in the future will hopefully replace g_c_w.

While we are at it, we directly call the _co_ counterpart of a g_c_w when
we already know that the function always runs in a coroutine.
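As a rough self-contained sketch of what g_c_w_simple buys (the flag and names below are stand-ins for QEMU's real machinery, not its API): since the caller is known to be outside coroutine context, the wrapper needs no run-time branch at all and can unconditionally create a new coroutine.

```c
#include <assert.h>
#include <stdbool.h>

static bool in_coroutine;   /* stand-in for qemu_in_coroutine() */
static int entered;         /* times the wrapped coroutine_fn body ran */

/* Stand-in for the wrapped coroutine_fn (e.g. blk_co_eject). */
static void co_fn_sketch(void)
{
    entered++;
}

/* Shape of generated_co_wrapper_simple: no context check, just an
 * assertion, then create a coroutine and poll it (modeled as a call). */
static void gcw_simple_sketch(void)
{
    assert(!in_coroutine);
    co_fn_sketch();
}
```

A plain g_c_w would instead branch on the calling context, which is the complexity this variant removes.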

Based-on: <20221116135331.3052923-1-eespo...@redhat.com>

Thank you,
Emanuele

Emanuele Giuseppe Esposito (15):
  block/qed: add missing graph rdlock in qed_need_check_timer_entry
  block: rename refresh_total_sectors in bdrv_refresh_total_sectors
  block-backend: use bdrv_getlength instead of blk_getlength
  block: convert bdrv_refresh_total_sectors in generated_co_wrapper
  block: use bdrv_co_refresh_total_sectors when possible
  block: convert bdrv_get_allocated_file_size in
generated_co_wrapper_simple
  block: convert bdrv_get_info in generated_co_wrapper
  block: convert bdrv_is_inserted in generated_co_wrapper_simple
  block-coroutine-wrapper: support void functions
  block: convert bdrv_eject in generated_co_wrapper_simple
  block: convert bdrv_lock_medium in generated_co_wrapper_simple
  block: convert bdrv_debug_event in generated_co_wrapper
  block: convert bdrv_io_plug in generated_co_wrapper_simple
  block: convert bdrv_io_unplug in generated_co_wrapper_simple
  block: rename newly converted BlockDriver IO coroutine functions

 block.c| 93 +++---
 block/blkdebug.c   |  4 +-
 block/blkio.c  |  6 +-
 block/blklogwrites.c   |  2 +-
 block/blkreplay.c  |  2 +-
 block/blkverify.c  |  2 +-
 block/block-backend.c  | 43 --
 block/commit.c |  4 +-
 block/copy-on-read.c   | 12 ++--
 block/crypto.c |  6 +-
 block/curl.c   |  8 +--
 block/file-posix.c | 48 +++
 block/file-win32.c |  8 +--
 block/filter-compress.c| 10 ++--
 block/gluster.c| 16 ++---
 block/io.c | 78 +
 block/iscsi.c  |  8 +--
 block/meson.build  |  1 +
 block/mirror.c | 17 --
 block/nbd.c|  6 +-
 block/nfs.c|  2 +-
 block/null.c   |  8 +--
 block/nvme.c   |  6 +-
 block/preallocate.c|  2 +-
 block/qcow.c   |  2 +-
 block/qcow2-refcount.c |  2 +-
 block/qcow2.c  |  6 +-
 block/qed.c|  7 ++-
 block/quorum.c |  2 +-
 block/raw-format.c | 14 ++---
 block/rbd.c|  4 +-
 block/replication.c|  2 +-
 block/ssh.c|  2 +-
 block/stream.c |  4 +-
 block/throttle.c   |  2 +-
 block/vdi.c|  2 +-
 block/vhdx.c   |  2 +-
 block/vmdk.c   |  4 +-
 block/vpc.c|  2 +-
 blockdev.c |  8 ++-
 hw/scsi/scsi-disk.c|  5 ++
 include/block/block-io.h   | 40 +
 include/block/block_int-common.h   | 37 +++-
 include/block/block_int-io.h   |  5 +-
 include/sysemu/block-backend-io.h  | 32 +++---
 scripts/block-coroutine-wrapper.py | 19 --
 tests/unit/test-block-iothread.c   |  7 +++
 47 files changed, 364 insertions(+), 238 deletions(-)

-- 
2.31.1




Re: [PATCH 0/3] hw/{i2c,nvme}: mctp endpoint, nvme management interface model

2022-11-16 Thread Jeremy Kerr
Hi Klaus,

[+CC Matt]

> This adds a generic MCTP endpoint model that other devices may derive
> from. I'm not 100% happy with the design of the class methods, but
> it's a start.

Thanks for posting these! I'll have a more thorough look through soon,
but wanted to tackle some of the larger design-points first (and we've
already spoken a bit about these, but rehashing a little of that for
others CCed too).

For me, the big decision here is where we want to run the NVMe-MI
device model. Doing it in the qemu process certainly makes things
easier to set up, and we can just configure the machine+nvme-mi device
as the one operation.

The alternative would be to have the NVMe-MI model run as an external
process, and not part of the qemu tree; it looks like Peter D is going
for that approach with [1]. The advantage there is that we would be
able to test against closer-to-reality "MI firmware" (say, a device
vendor running their NVMe-MI firmware directly in another emulator? are
folks interested in doing that?)

The complexity around the latter approach will be where we split the
processes, and arrange for IPC. [1] suggests at the i2c layer, but that
does seem to have complexities with i2c controller model compatibility;
we could certainly extend that to a "generic" i2c-over-something
protocol (which would also be handy for other things), or go higher up
and use MCTP directly as the transport (say, the serial binding over a
chardev). The former would be more useful for direct firmware
emulation.

My interest is mainly in testing the software stack, so either approach
is fine; I assume your interest is from the device implementation side?

Cheers,


Jeremy

[1]: 
https://github.com/facebook/openbmc/blob/helium/common/recipes-devtools/qemu/qemu/0007-hw-misc-Add-i2c-netdev-device.patch




[PATCH 3/6] block: assert that BlockDriver->bdrv_co_copy_range_{from/to} is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
The only non-protected caller is convert_co_copy_range(); all other
callers are BlockDriver callbacks that already take the rdlock.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c| 2 ++
 block/io.c   | 5 +
 include/block/block_int-common.h | 4 
 qemu-img.c   | 4 +++-
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9e1c689e84..6f0dd15808 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2631,6 +2631,8 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
 if (r) {
 return r;
 }
+
+GRAPH_RDLOCK_GUARD();
 return bdrv_co_copy_range(blk_in->root, off_in,
   blk_out->root, off_out,
   bytes, read_flags, write_flags);
diff --git a/block/io.c b/block/io.c
index 831f277e85..62c0b3a390 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3165,6 +3165,7 @@ static int coroutine_fn bdrv_co_copy_range_internal(
 {
 BdrvTrackedRequest req;
 int ret;
+assert_bdrv_graph_readable();
 
 /* TODO We can support BDRV_REQ_NO_FALLBACK here */
 assert(!(read_flags & BDRV_REQ_NO_FALLBACK));
@@ -3246,6 +3247,7 @@ int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, int64_t src_offset,
  BdrvRequestFlags write_flags)
 {
 IO_CODE();
+assert_bdrv_graph_readable();
 trace_bdrv_co_copy_range_from(src, src_offset, dst, dst_offset, bytes,
   read_flags, write_flags);
 return bdrv_co_copy_range_internal(src, src_offset, dst, dst_offset,
@@ -3263,6 +3265,7 @@ int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, int64_t src_offset,
BdrvRequestFlags write_flags)
 {
 IO_CODE();
+assert_bdrv_graph_readable();
 trace_bdrv_co_copy_range_to(src, src_offset, dst, dst_offset, bytes,
 read_flags, write_flags);
 return bdrv_co_copy_range_internal(src, src_offset, dst, dst_offset,
@@ -3275,6 +3278,8 @@ int coroutine_fn bdrv_co_copy_range(BdrvChild *src, int64_t src_offset,
 BdrvRequestFlags write_flags)
 {
 IO_CODE();
+assert_bdrv_graph_readable();
+
 return bdrv_co_copy_range_from(src, src_offset,
dst, dst_offset,
bytes, read_flags, write_flags);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 1e9bb91c98..9e441cb93b 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -574,6 +574,8 @@ struct BlockDriver {
  *
  * See the comment of bdrv_co_copy_range for the parameter and return value
  * semantics.
+ *
+ * Called with graph rdlock taken.
  */
 int coroutine_fn (*bdrv_co_copy_range_from)(BlockDriverState *bs,
 BdrvChild *src,
@@ -592,6 +594,8 @@ struct BlockDriver {
  *
  * See the comment of bdrv_co_copy_range for the parameter and return value
  * semantics.
+ *
+ * Called with graph rdlock taken.
  */
 int coroutine_fn (*bdrv_co_copy_range_to)(BlockDriverState *bs,
   BdrvChild *src,
diff --git a/qemu-img.c b/qemu-img.c
index 33703a6d92..2086cf6eed 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2027,7 +2027,9 @@ retry:
 
 if (s->ret == -EINPROGRESS) {
 if (copy_range) {
-ret = convert_co_copy_range(s, sector_num, n);
+WITH_GRAPH_RDLOCK_GUARD() {
+ret = convert_co_copy_range(s, sector_num, n);
+}
 if (ret) {
 s->copy_range = false;
 goto retry;
-- 
2.31.1




[PATCH 4/6] block/dirty-bitmap: assert that BlockDriver->bdrv_co_*_dirty_bitmap are always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
The only callers are the respective bdrv_*_dirty_bitmap() functions, which
take care of creating a new coroutine (one that already takes the graph
rdlock).

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/dirty-bitmap.c | 2 ++
 include/block/block_int-common.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 21cf592889..92c70a7282 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -392,6 +392,7 @@ int coroutine_fn
 bdrv_co_remove_persistent_dirty_bitmap(BlockDriverState *bs, const char *name,
Error **errp)
 {
+assert_bdrv_graph_readable();
 if (bs->drv && bs->drv->bdrv_co_remove_persistent_dirty_bitmap) {
 return bs->drv->bdrv_co_remove_persistent_dirty_bitmap(bs, name, errp);
 }
@@ -413,6 +414,7 @@ bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
uint32_t granularity, Error **errp)
 {
 BlockDriver *drv = bs->drv;
+assert_bdrv_graph_readable();
 
 if (!drv) {
 error_setg_errno(errp, ENOMEDIUM,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 9e441cb93b..3064822508 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -789,9 +789,11 @@ struct BlockDriver {
 void (*bdrv_drain_end)(BlockDriverState *bs);
 
 bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
+/* Called with graph rdlock held. */
 bool coroutine_fn (*bdrv_co_can_store_new_dirty_bitmap)(
 BlockDriverState *bs, const char *name, uint32_t granularity,
 Error **errp);
+/* Called with graph rdlock held. */
 int coroutine_fn (*bdrv_co_remove_persistent_dirty_bitmap)(
 BlockDriverState *bs, const char *name, Error **errp);
 };
-- 
2.31.1




[PATCH 5/6] block/io: assert that BlockDriver->bdrv_co_*_snapshot_* are always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
The only callers are other callback functions that already run with the
graph rdlock taken.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/io.c   | 2 ++
 include/block/block_int-common.h | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/block/io.c b/block/io.c
index 62c0b3a390..7d1d0c48b0 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3449,6 +3449,7 @@ bdrv_co_preadv_snapshot(BdrvChild *child, int64_t offset, int64_t bytes,
 BlockDriver *drv = bs->drv;
 int ret;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (!drv) {
 return -ENOMEDIUM;
@@ -3474,6 +3475,7 @@ bdrv_co_snapshot_block_status(BlockDriverState *bs,
 BlockDriver *drv = bs->drv;
 int ret;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (!drv) {
 return -ENOMEDIUM;
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 3064822508..03bd28e3c9 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -652,9 +652,12 @@ struct BlockDriver {
  * - be able to select a specific snapshot
  * - receive the snapshot's actual length (which may differ from bs's
  *   length)
+ *
+ * Called with graph rdlock taken.
  */
 int coroutine_fn (*bdrv_co_preadv_snapshot)(BlockDriverState *bs,
 int64_t offset, int64_t bytes, QEMUIOVector *qiov, size_t qiov_offset);
+/* Called with graph rdlock taken. */
 int coroutine_fn (*bdrv_co_snapshot_block_status)(BlockDriverState *bs,
 bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
 int64_t *map, BlockDriverState **file);
-- 
2.31.1




[PATCH 11/20] block-gen: assert that bdrv_co_{check/invalidate_cache} are always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
The only callers of these functions are the respective
generated_co_wrappers, and they already take the lock.

Protecting bdrv_co_{check/invalidate_cache}() implies that
BlockDriver->bdrv_co_{check/invalidate_cache}() is always called with
graph rdlock taken.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c  | 2 ++
 include/block/block_int-common.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/block.c b/block.c
index 1c870d85e6..c7611bed9e 100644
--- a/block.c
+++ b/block.c
@@ -5375,6 +5375,7 @@ int coroutine_fn bdrv_co_check(BlockDriverState *bs,
BdrvCheckResult *res, BdrvCheckMode fix)
 {
 IO_CODE();
+assert_bdrv_graph_readable();
 if (bs->drv == NULL) {
 return -ENOMEDIUM;
 }
@@ -6590,6 +6591,7 @@ int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
 IO_CODE();
 
 assert(!(bs->open_flags & BDRV_O_INACTIVE));
+assert_bdrv_graph_readable();
 
 if (bs->drv->bdrv_co_invalidate_cache) {
 bs->drv->bdrv_co_invalidate_cache(bs, &local_err);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index d666b0c441..f285a6b8f7 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -641,6 +641,7 @@ struct BlockDriver {
 
 /*
  * Invalidate any cached meta-data.
+ * Called with graph rdlock held.
  */
 void coroutine_fn (*bdrv_co_invalidate_cache)(BlockDriverState *bs,
   Error **errp);
@@ -726,6 +727,7 @@ struct BlockDriver {
 /*
  * Returns 0 for completed check, -errno for internal errors.
  * The check results are stored in result.
+ * Called with graph rdlock held.
  */
 int coroutine_fn (*bdrv_co_check)(BlockDriverState *bs,
   BdrvCheckResult *result,
-- 
2.31.1




[PATCH 12/20] block-gen: assert that bdrv_co_pwrite is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
This function, in addition to being called by a generated_co_wrapper,
is also called elsewhere.
The strategy is to always take the lock in the function that runs
when the coroutine is created, to avoid recursive locking.

By protecting bdrv_co_pwrite, we also automatically protect
the following other generated_co_wrappers:
blk_co_pwrite
blk_co_pwritev
blk_co_pwritev_part
blk_co_pwrite_compressed
blk_co_pwrite_zeroes

Protecting bdrv_driver_pwritev() and bdrv_driver_pwritev_compressed()
implies that the following BlockDriver callbacks are always called with
graph rdlock taken:
- bdrv_aio_pwritev
- bdrv_co_writev
- bdrv_co_pwritev
- bdrv_co_pwritev_part
- bdrv_co_pwritev_compressed
- bdrv_co_pwritev_compressed_part

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c| 1 +
 block/block-copy.c   | 8 ++--
 block/io.c   | 2 ++
 include/block/block_int-common.h | 6 ++
 include/block/block_int-io.h | 1 +
 5 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 0686cd6942..d48ec3a2ac 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1363,6 +1363,7 @@ blk_co_do_pwritev_part(BlockBackend *blk, int64_t offset, int64_t bytes,
 IO_CODE();
 
 blk_wait_while_drained(blk);
+GRAPH_RDLOCK_GUARD();
 
 /* Call blk_bs() only after waiting, the graph may have changed */
 bs = blk_bs(blk);
diff --git a/block/block-copy.c b/block/block-copy.c
index f33ab1d0b6..dabf461112 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -464,6 +464,8 @@ static coroutine_fn int block_copy_task_run(AioTaskPool *pool,
  * a full-size buffer or disabled if the copy_range attempt fails.  The output
  * value of @method should be used for subsequent tasks.
  * Returns 0 on success.
+ *
+ * Called with graph rdlock taken.
  */
 static int coroutine_fn block_copy_do_copy(BlockCopyState *s,
int64_t offset, int64_t bytes,
@@ -554,8 +556,10 @@ static coroutine_fn int block_copy_task_entry(AioTask *task)
 BlockCopyMethod method = t->method;
 int ret;
 
-ret = block_copy_do_copy(s, t->req.offset, t->req.bytes, &method,
- &error_is_read);
+WITH_GRAPH_RDLOCK_GUARD() {
+ret = block_copy_do_copy(s, t->req.offset, t->req.bytes, &method,
+ &error_is_read);
+}
 
 WITH_QEMU_LOCK_GUARD(&s->lock) {
 if (s->method == t->method) {
diff --git a/block/io.c b/block/io.c
index ac12725fb2..9280fb9f38 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1012,6 +1012,7 @@ static int coroutine_fn bdrv_driver_pwritev(BlockDriverState *bs,
 unsigned int nb_sectors;
 QEMUIOVector local_qiov;
 int ret;
+assert_bdrv_graph_readable();
 
 bdrv_check_qiov_request(offset, bytes, qiov, qiov_offset, &error_abort);
 
@@ -1090,6 +1091,7 @@ bdrv_driver_pwritev_compressed(BlockDriverState *bs, int64_t offset,
 BlockDriver *drv = bs->drv;
 QEMUIOVector local_qiov;
 int ret;
+assert_bdrv_graph_readable();
 
 bdrv_check_qiov_request(offset, bytes, qiov, qiov_offset, &error_abort);
 
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index f285a6b8f7..d44f825d95 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -479,6 +479,7 @@ struct BlockDriver {
 BlockAIOCB *(*bdrv_aio_preadv)(BlockDriverState *bs,
 int64_t offset, int64_t bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque);
+/* Called with graph rdlock taken. */
 BlockAIOCB *(*bdrv_aio_pwritev)(BlockDriverState *bs,
 int64_t offset, int64_t bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque);
@@ -515,6 +516,7 @@ struct BlockDriver {
 QEMUIOVector *qiov, size_t qiov_offset,
 BdrvRequestFlags flags);
 
+/* Called with graph rdlock taken. */
 int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 int flags);
@@ -532,10 +534,12 @@ struct BlockDriver {
  * no larger than 'max_transfer'.
  *
  * The buffer in @qiov may point directly to guest memory.
+ * Called with graph rdlock taken.
  */
 int coroutine_fn (*bdrv_co_pwritev)(BlockDriverState *bs,
 int64_t offset, int64_t bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags);
+/* Called with graph rdlock taken. */
 int coroutine_fn (*bdrv_co_pwritev_part)(BlockDriverState *bs,
 int64_t offset, int64_t bytes, QEMUIOVector *qiov, size_t qiov_offset,
 BdrvRequestFlags flags);
@@ -693,8 +697,10 @@ struct BlockDriver {
 BlockMeasureInfo *(*bdrv_measure)(QemuOpts *opts, BlockDriverState *in_bs,
   Error **errp);
 
+/* Called with graph rdlock held. */
 int coroutine_fn 

[PATCH 6/6] block: assert that BlockDriver->bdrv_co_delete_file is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
The only callers are other callback functions that already run with the graph
rdlock taken.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c  | 1 +
 include/block/block_int-common.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index e54ed300d7..1a6ae08879 100644
--- a/block.c
+++ b/block.c
@@ -747,6 +747,7 @@ int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, Error **errp)
 
 IO_CODE();
 assert(bs != NULL);
+assert_bdrv_graph_readable();
 
 if (!bs->drv) {
 error_setg(errp, "Block node '%s' is not opened", bs->filename);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 03bd28e3c9..20308376c6 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -681,7 +681,7 @@ struct BlockDriver {
  */
 int coroutine_fn (*bdrv_co_flush)(BlockDriverState *bs);
 
-/* Delete a created file. */
+/* Delete a created file. Called with graph rdlock taken. */
 int coroutine_fn (*bdrv_co_delete_file)(BlockDriverState *bs,
 Error **errp);
 
-- 
2.31.1




[PATCH 19/20] block-gen: assert that bdrv_co_ioctl is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
The only caller of this function is blk_ioctl, a generated_co_wrapper
function that needs to take the graph read lock.

Protecting bdrv_co_ioctl() implies that
BlockDriver->bdrv_co_ioctl() is always called with
graph rdlock taken, and BlockDriver->bdrv_aio_ioctl is
a coroutine_fn callback (also called with rdlock taken).

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c| 1 +
 block/io.c   | 1 +
 include/block/block_int-common.h | 5 +++--
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 20b772a476..9e1c689e84 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1672,6 +1672,7 @@ blk_co_do_ioctl(BlockBackend *blk, unsigned long int req, void *buf)
 IO_CODE();
 
 blk_wait_while_drained(blk);
+GRAPH_RDLOCK_GUARD();
 
 if (!blk_is_available(blk)) {
 return -ENOMEDIUM;
diff --git a/block/io.c b/block/io.c
index c5b3bb0a6d..831f277e85 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3007,6 +3007,7 @@ int coroutine_fn bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf)
 };
 BlockAIOCB *acb;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 bdrv_inc_in_flight(bs);
 if (!drv || (!drv->bdrv_aio_ioctl && !drv->bdrv_co_ioctl)) {
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 9d9cd59f1e..db97d61836 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -743,10 +743,11 @@ struct BlockDriver {
 void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
 void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);
 
-/* to control generic scsi devices */
-BlockAIOCB *(*bdrv_aio_ioctl)(BlockDriverState *bs,
+/* to control generic scsi devices. Called with graph rdlock taken. */
+BlockAIOCB *coroutine_fn (*bdrv_aio_ioctl)(BlockDriverState *bs,
 unsigned long int req, void *buf,
 BlockCompletionFunc *cb, void *opaque);
+/* Called with graph rdlock taken. */
 int coroutine_fn (*bdrv_co_ioctl)(BlockDriverState *bs,
   unsigned long int req, void *buf);
 
-- 
2.31.1




[PATCH 0/6] Protect the block layer with a rwlock: part 2

2022-11-16 Thread Emanuele Giuseppe Esposito
Please read "Protect the block layer with a rwlock: part 1" for an additional
introduction and aim of this series.

This second part aims to add the graph rdlock to the BlockDriver functions
that already run in coroutine context and are classified as IO.
Such functions will recursively traverse the BlockDriverState graph, therefore
they need to be protected with the rdlock.

Based-on: <20221116134850.3051419-1-eespo...@redhat.com>

Thank you,
Emanuele

Emanuele Giuseppe Esposito (6):
  block: assert that bdrv_co_create is always called with graph rdlock
taken
  block: assert that BlockDriver->bdrv_co_{amend/create} are called with
graph rdlock taken
  block: assert that BlockDriver->bdrv_co_copy_range_{from/to} is always
called with graph rdlock taken
  block/dirty-bitmap: assert that BlockDriver->bdrv_co_*_dirty_bitmap
are always called with graph rdlock taken
  block/io: assert that BlockDriver->bdrv_co_*_snapshot_* are always
called with graph rdlock taken
  block: assert that BlockDriver->bdrv_co_delete_file is always called
with graph rdlock taken

 block.c  |  2 ++
 block/amend.c|  1 +
 block/block-backend.c|  2 ++
 block/create.c   |  1 +
 block/dirty-bitmap.c |  2 ++
 block/io.c   |  7 +++
 include/block/block_int-common.h | 14 +-
 qemu-img.c   |  4 +++-
 8 files changed, 31 insertions(+), 2 deletions(-)

-- 
2.31.1




[PATCH 04/20] block.c: wrlock in bdrv_replace_child_noperm

2022-11-16 Thread Emanuele Giuseppe Esposito
Protect the main function where the graph is modified.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c  | 6 --
 include/block/block_int-common.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index d3e168408a..4ef537a9f2 100644
--- a/block.c
+++ b/block.c
@@ -1416,6 +1416,7 @@ static void bdrv_child_cb_attach(BdrvChild *child)
 
 assert_bdrv_graph_writable(bs);
 QLIST_INSERT_HEAD(&bs->children, child, next);
+
 if (bs->drv->is_filter || (child->role & BDRV_CHILD_FILTERED)) {
 /*
  * Here we handle filters and block/raw-format.c when it behave like
@@ -2829,24 +2830,25 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
 }
 
+bdrv_graph_wrlock();
 if (old_bs) {
 if (child->klass->detach) {
 child->klass->detach(child);
 }
-assert_bdrv_graph_writable(old_bs);
+
 QLIST_REMOVE(child, next_parent);
 }
 
 child->bs = new_bs;
 
 if (new_bs) {
-assert_bdrv_graph_writable(new_bs);
 QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
 
 if (child->klass->attach) {
 child->klass->attach(child);
 }
 }
+bdrv_graph_wrunlock();
 
 /*
  * If the old child node was drained but the new one is not, allow
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 791dddfd7d..fd9f40a815 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -71,6 +71,7 @@ enum BdrvTrackedRequestType {
 BDRV_TRACKED_TRUNCATE,
 };
 
+
 /*
  * That is not quite good that BdrvTrackedRequest structure is public,
  * as block/io.c is very careful about incoming offset/bytes being
-- 
2.31.1




[PATCH 07/20] graph-lock: implement WITH_GRAPH_RDLOCK_GUARD and GRAPH_RDLOCK_GUARD macros

2022-11-16 Thread Emanuele Giuseppe Esposito
Similar to the implementation in lockable.h, implement macros to
automatically take and release the rdlock.
Create the empty GraphLockable struct only to use it as a type for
G_DEFINE_AUTOPTR_CLEANUP_FUNC.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/block/graph-lock.h | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index 9430707dca..0d886a9ca3 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -141,5 +141,40 @@ void assert_bdrv_graph_readable(void);
  */
 void assert_bdrv_graph_writable(void);
 
+typedef struct GraphLockable { } GraphLockable;
+
+/*
+ * In C, compound literals have the lifetime of an automatic variable.
+ * In C++ it would be different, but then C++ wouldn't need QemuLockable
+ * either...
+ */
+#define GML_OBJ_() (&(GraphLockable) { })
+
+static inline GraphLockable *graph_lockable_auto_lock(GraphLockable *x)
+{
+bdrv_graph_co_rdlock();
+return x;
+}
+
+static inline void graph_lockable_auto_unlock(GraphLockable *x)
+{
+bdrv_graph_co_rdunlock();
+}
+
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(GraphLockable, graph_lockable_auto_unlock)
+
+#define WITH_GRAPH_RDLOCK_GUARD_(var) \
+for (g_autoptr(GraphLockable) var = graph_lockable_auto_lock(GML_OBJ_()); \
+ var; \
+ graph_lockable_auto_unlock(var), var = NULL)
+
+#define WITH_GRAPH_RDLOCK_GUARD() \
+WITH_GRAPH_RDLOCK_GUARD_(glue(graph_lockable_auto, __COUNTER__))
+
+#define GRAPH_RDLOCK_GUARD(x)   \
+g_autoptr(GraphLockable)\
+glue(graph_lockable_auto, __COUNTER__) G_GNUC_UNUSED =  \
+graph_lockable_auto_lock(GML_OBJ_())
+
 #endif /* GRAPH_LOCK_H */
 
-- 
2.31.1




[PATCH 2/6] block: assert that BlockDriver->bdrv_co_{amend/create} are called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
Both functions are only called from Job->run() callbacks, so the
lock is taken in the respective *_run() implementations.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/amend.c| 1 +
 block/create.c   | 1 +
 include/block/block_int-common.h | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/block/amend.c b/block/amend.c
index f696a006e3..a155b6889b 100644
--- a/block/amend.c
+++ b/block/amend.c
@@ -45,6 +45,7 @@ static int coroutine_fn blockdev_amend_run(Job *job, Error **errp)
 {
 BlockdevAmendJob *s = container_of(job, BlockdevAmendJob, common);
 int ret;
+GRAPH_RDLOCK_GUARD();
 
 job_progress_set_remaining(&s->common, 1);
 ret = s->bs->drv->bdrv_co_amend(s->bs, s->opts, s->force, errp);
diff --git a/block/create.c b/block/create.c
index 4df43f11f4..4048d71265 100644
--- a/block/create.c
+++ b/block/create.c
@@ -43,6 +43,7 @@ static int coroutine_fn blockdev_create_run(Job *job, Error **errp)
 int ret;
 
 GLOBAL_STATE_CODE();
+GRAPH_RDLOCK_GUARD();
 
 job_progress_set_remaining(&s->common, 1);
 ret = s->drv->bdrv_co_create(s->opts, errp);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index d45961a1d1..1e9bb91c98 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -251,6 +251,7 @@ struct BlockDriver {
   Error **errp);
 void (*bdrv_close)(BlockDriverState *bs);
 
+/* Called with graph rdlock taken */
 int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
Error **errp);
 /* Called with graph rdlock taken */
@@ -471,6 +472,7 @@ struct BlockDriver {
 
 int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char *filename);
 
+/* Called with graph rdlock taken. */
 int coroutine_fn (*bdrv_co_amend)(BlockDriverState *bs,
   BlockdevAmendOptions *opts,
   bool force,
-- 
2.31.1




[PATCH 01/20] block: introduce a lock to protect graph operations

2022-11-16 Thread Emanuele Giuseppe Esposito
From: Paolo Bonzini 

Block layer graph operations are always run under BQL in the
main loop. This is enforced by the assertion qemu_in_main_thread()
and its wrapper macro GLOBAL_STATE_CODE.
However, there are also concurrent coroutines running in other
iothreads that traverse the graph.
Currently this is protected (among other things) by the
AioContext lock, but once that is removed we need to make
sure that reads do not happen while the graph is being modified.

We distinguish between the writer (main loop, under BQL), which modifies the
graph, and readers (all other coroutines running in various AioContexts),
which walk the graph edges, reading ->parents and ->children.

The writer (main loop) has exclusive access, so it first waits for
all current reads to finish, and then prevents incoming ones from
entering while it holds exclusive access.

The readers (coroutines in multiple AioContexts) are free to
access the graph as long as the writer is not modifying it.
If it is, they go into a CoQueue and sleep until the writer
is done.

If a coroutine changes AioContext, the counters in the original and new
AioContext are left intact, since the writer does not care where the
reader is, only whether there is one.
As a result, some AioContexts might have a negative reader count, to
balance the positive count of the AioContext that took the lock.
This also means that when an AioContext is deleted it may have a nonzero
reader count. In that case we transfer the count to a global shared counter
so that the writer is always aware of all readers.

Co-developed-with: Emanuele Giuseppe Esposito 
Signed-off-by: Kevin Wolf 
Signed-off-by: Paolo Bonzini 
---
 block/graph-lock.c | 221 +
 block/meson.build  |   1 +
 include/block/aio.h|   9 ++
 include/block/block_int.h  |   1 +
 include/block/graph-lock.h | 129 ++
 5 files changed, 361 insertions(+)
 create mode 100644 block/graph-lock.c
 create mode 100644 include/block/graph-lock.h

diff --git a/block/graph-lock.c b/block/graph-lock.c
new file mode 100644
index 00..b608a89d7c
--- /dev/null
+++ b/block/graph-lock.c
@@ -0,0 +1,221 @@
+/*
+ * Graph lock: rwlock to protect block layer graph manipulations (add/remove
+ * edges and nodes)
+ *
+ *  Copyright (c) 2022 Red Hat
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/main-loop.h"
+#include "block/graph-lock.h"
+#include "block/block.h"
+#include "block/block_int.h"
+
+/* Protects the list of aiocontext and orphaned_reader_count */
+static QemuMutex aio_context_list_lock;
+
+/* Written and read with atomic operations. */
+static int has_writer;
+
+/*
+ * A reader coroutine could move from an AioContext to another.
+ * If this happens, there is no problem from the point of view of
+ * counters. The problem is that the total count becomes
+ * unbalanced if one of the two AioContexts gets deleted.
+ * The count of readers must remain correct, so the AioContext's
+ * balance is transferred to this global variable.
+ * Protected by aio_context_list_lock.
+ */
+static uint32_t orphaned_reader_count;
+
+/* Queue of readers waiting for the writer to finish */
+static CoQueue reader_queue;
+
+/*
+ * List of AioContext. This list ensures that each AioContext
+ * can safely modify only its own counter, avoid reading/writing
+ * others and thus improving performances by avoiding cacheline bounces.
+ */
+static QTAILQ_HEAD(, AioContext) aio_context_list =
+QTAILQ_HEAD_INITIALIZER(aio_context_list);
+
+static void __attribute__((__constructor__)) bdrv_init_graph_lock(void)
+{
+qemu_mutex_init(&aio_context_list_lock);
+qemu_co_queue_init(&reader_queue);
+}
+
+void register_aiocontext(AioContext *ctx)
+{
+QEMU_LOCK_GUARD(&aio_context_list_lock);
+assert(ctx->reader_count == 0);
+QTAILQ_INSERT_TAIL(&aio_context_list, ctx, next_aio);
+}
+
+void unregister_aiocontext(AioContext *ctx)
+{
+QEMU_LOCK_GUARD(&aio_context_list_lock);
+orphaned_reader_count += ctx->reader_count;
+QTAILQ_REMOVE(&aio_context_list, ctx, next_aio);
+}
+
+static uint32_t reader_count(void)
+{
+AioContext *ctx;
+uint32_t rd;
+
+QEMU_LOCK_GUARD(&aio_context_list_lock);
+
+/* rd can temporarily be negative, but the total will *always* be >= 0 */
+rd = orphaned_reader_count;
+ 

[PATCH 18/20] block-gen: assert that bdrv_co_common_block_status_above is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
This function, in addition to being called by a generated_co_wrapper,
is also called elsewhere.
The strategy is to always take the lock in the function that runs
when the coroutine is created, to avoid recursive locking.

Protecting bdrv_co_block_status() called by
bdrv_co_common_block_status_above() implies that
BlockDriver->bdrv_co_block_status() is always called with
graph rdlock taken.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/backup.c   |  3 +++
 block/block-backend.c|  2 ++
 block/block-copy.c   |  2 ++
 block/io.c   |  2 ++
 block/mirror.c   | 14 +-
 block/stream.c   | 32 ++--
 include/block/block_int-common.h |  2 ++
 qemu-img.c   |  4 +++-
 8 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 6a9ad97a53..42b16d0136 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -269,7 +269,10 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 return -ECANCELED;
 }
 
+/* rdlock protects the subsequent call to bdrv_is_allocated() */
+bdrv_graph_co_rdlock();
 ret = block_copy_reset_unallocated(s->bcs, offset, );
+bdrv_graph_co_rdunlock();
 if (ret < 0) {
 return ret;
 }
diff --git a/block/block-backend.c b/block/block-backend.c
index 211a813523..20b772a476 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1433,6 +1433,7 @@ int coroutine_fn blk_block_status_above(BlockBackend *blk,
 BlockDriverState **file)
 {
 IO_CODE();
+GRAPH_RDLOCK_GUARD();
 return bdrv_block_status_above(blk_bs(blk), base, offset, bytes, pnum, map,
file);
 }
@@ -1443,6 +1444,7 @@ int coroutine_fn blk_is_allocated_above(BlockBackend *blk,
 int64_t bytes, int64_t *pnum)
 {
 IO_CODE();
+GRAPH_RDLOCK_GUARD();
 return bdrv_is_allocated_above(blk_bs(blk), base, include_base, offset,
bytes, pnum);
 }
diff --git a/block/block-copy.c b/block/block-copy.c
index dabf461112..e20d2b2f78 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -630,6 +630,7 @@ static int coroutine_fn block_copy_is_cluster_allocated(BlockCopyState *s,
 assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
 
 while (true) {
+/* protected in backup_run() */
 ret = bdrv_is_allocated(bs, offset, bytes, );
 if (ret < 0) {
 return ret;
@@ -892,6 +893,7 @@ static int coroutine_fn block_copy_common(BlockCopyCallState *call_state)
 
 static void coroutine_fn block_copy_async_co_entry(void *opaque)
 {
+GRAPH_RDLOCK_GUARD();
 block_copy_common(opaque);
 }
 
diff --git a/block/io.c b/block/io.c
index bc9f47538c..c5b3bb0a6d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2215,6 +2215,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
 bool has_filtered_child;
 
 assert(pnum);
+assert_bdrv_graph_readable();
 *pnum = 0;
 total_size = bdrv_getlength(bs);
 if (total_size < 0) {
@@ -2445,6 +2446,7 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
 IO_CODE();
 
 assert(!include_base || base); /* Can't include NULL base */
+assert_bdrv_graph_readable();
 
 if (!depth) {
 depth = 
diff --git a/block/mirror.c b/block/mirror.c
index f509cc1cb1..02ee7bba08 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -559,9 +559,11 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 MirrorMethod mirror_method = MIRROR_METHOD_COPY;
 
 assert(!(offset % s->granularity));
-ret = bdrv_block_status_above(source, NULL, offset,
-  nb_chunks * s->granularity,
-  &io_bytes, NULL, NULL);
+WITH_GRAPH_RDLOCK_GUARD() {
+ret = bdrv_block_status_above(source, NULL, offset,
+nb_chunks * s->granularity,
+&io_bytes, NULL, NULL);
+}
 if (ret < 0) {
 io_bytes = MIN(nb_chunks * s->granularity, max_io_bytes);
 } else if (ret & BDRV_BLOCK_DATA) {
@@ -864,8 +866,10 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 return 0;
 }
 
-ret = bdrv_is_allocated_above(bs, s->base_overlay, true, offset, bytes,
-  );
+WITH_GRAPH_RDLOCK_GUARD() {
+ret = bdrv_is_allocated_above(bs, s->base_overlay, true, offset,
+  bytes, );
+}
 if (ret < 0) {
 return ret;
 }
diff --git a/block/stream.c b/block/stream.c
index 8744ad103f..22368ce186 100644
--- a/block/stream.c
+++ 

[PATCH 10/20] block-gen: assert that {bdrv/blk}_co_truncate is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
This function, in addition to being called by a generated_co_wrapper,
is also called by the blk_* API.
The strategy is to always take the lock in the function that runs
when the coroutine is created, to avoid recursive locking.

Protecting bdrv_co_truncate() implies that
BlockDriver->bdrv_co_truncate() is always called with
graph rdlock taken.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c| 1 +
 block/io.c   | 1 +
 include/block/block_int-common.h | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 333d50fb3f..0686cd6942 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2370,6 +2370,7 @@ int coroutine_fn blk_co_truncate(BlockBackend *blk, int64_t offset, bool exact,
  Error **errp)
 {
 IO_OR_GS_CODE();
+GRAPH_RDLOCK_GUARD();
 if (!blk_is_available(blk)) {
 error_setg(errp, "No medium inserted");
 return -ENOMEDIUM;
diff --git a/block/io.c b/block/io.c
index 9bcb19e5ee..ac12725fb2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3295,6 +3295,7 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
 int64_t old_size, new_bytes;
 int ret;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 /* if bs->drv == NULL, bs is closed, so there's nothing to do here */
 if (!drv) {
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index fd9f40a815..d666b0c441 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -681,6 +681,8 @@ struct BlockDriver {
  *
  * If @exact is true and this function fails but would succeed
  * with @exact = false, it should return -ENOTSUP.
+ *
+ * Called with graph rdlock held.
  */
 int coroutine_fn (*bdrv_co_truncate)(BlockDriverState *bs, int64_t offset,
  bool exact, PreallocMode prealloc,
-- 
2.31.1




[PATCH 08/20] block-coroutine-wrapper.py: take the graph rdlock in bdrv_* functions

2022-11-16 Thread Emanuele Giuseppe Esposito
All generated_co_wrapper functions create a coroutine when
called from non-coroutine context.

The format can be one of these two:

bdrv_something()
if(qemu_in_coroutine()):
bdrv_co_something();
else:
// create coroutine that calls bdrv_co_something();

blk_something()
if(qemu_in_coroutine()):
blk_co_something();
else:
// create coroutine that calls blk_co_something();
// blk_co_something() then eventually calls bdrv_co_something()

The bdrv_co_something functions recursively traverse the graph,
so they all need to run with the graph rdlock held.
blk_co_something(), on the other hand, calls bdrv_co_something() and is
always called at the root of the graph (not from recursive callbacks),
so it should take the graph rdlock itself.

The contract is simple: from now on, all bdrv_co_* functions called by g_c_w
callbacks assume that the graph rdlock is taken at coroutine
creation, i.e. in g_c_w or in specific coroutines (right now we only
consider the g_c_w case).

All the blk_co_* functions are responsible for taking the rdlock (at this
point still a TBD).

Suggested-by: Kevin Wolf 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 scripts/block-coroutine-wrapper.py | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/scripts/block-coroutine-wrapper.py b/scripts/block-coroutine-wrapper.py
index 21ecb3e896..05267761f0 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -67,8 +67,11 @@ def __init__(self, return_type: str, name: str, args: str,
 self.return_type = return_type.strip()
 self.name = name.strip()
 self.args = [ParamDecl(arg.strip()) for arg in args.split(',')]
+self.lock = True
 self.create_only_co = False
 
+if variant == '_blk':
+self.lock = False
 if variant == '_simple':
 self.create_only_co = True
 
@@ -86,7 +89,6 @@ def gen_block(self, format: str) -> str:
   r'(?P[a-z][a-z0-9_]*)'
   r'\((?P[^)]*)\);$', re.MULTILINE)
 
-
 def func_decl_iter(text: str) -> Iterator:
 for m in func_decl_re.finditer(text):
 yield FuncDecl(return_type=m.group('return_type'),
@@ -160,6 +162,13 @@ def gen_wrapper(func: FuncDecl) -> str:
 func.co_name = f'{subsystem}_co_{subname}'
 name = func.co_name
 
+graph_lock=''
+graph_unlock=''
+if func.lock:
+graph_lock='bdrv_graph_co_rdlock();'
+graph_unlock='bdrv_graph_co_rdunlock();'
+
+
 t = func.args[0].type
 if t == 'BlockDriverState *':
 bs = 'bs'
@@ -192,7 +201,9 @@ def gen_wrapper(func: FuncDecl) -> str:
 {{
 {struct_name} *s = opaque;
 
+{graph_lock}
 s->ret = {name}({ func.gen_list('s->{name}') });
+{graph_unlock}
 s->poll_state.in_progress = false;
 
 aio_wait_kick();
-- 
2.31.1




[PATCH 1/6] block: assert that bdrv_co_create is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
This function is either called by bdrv_create(), which always takes
care of creating a new coroutine, or by bdrv_create_file(), which
is only called by BlockDriver->bdrv_co_create_opts callbacks,
invoked by bdrv_co_create().

Protecting bdrv_co_create() implies that BlockDriver->bdrv_co_create_opts
is always called with graph rdlock taken.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c  | 1 +
 include/block/block_int-common.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block.c b/block.c
index c7611bed9e..e54ed300d7 100644
--- a/block.c
+++ b/block.c
@@ -537,6 +537,7 @@ int coroutine_fn bdrv_co_create(BlockDriver *drv, const char *filename,
 assert(qemu_in_coroutine());
 assert(*errp == NULL);
 assert(drv);
+assert_bdrv_graph_readable();
 
 if (!drv->bdrv_co_create_opts) {
 error_setg(errp, "Driver '%s' does not support image creation",
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index db97d61836..d45961a1d1 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -253,6 +253,7 @@ struct BlockDriver {
 
 int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
Error **errp);
+/* Called with graph rdlock taken */
 int coroutine_fn (*bdrv_co_create_opts)(BlockDriver *drv,
 const char *filename,
 QemuOpts *opts,
-- 
2.31.1




[PATCH 02/20] graph-lock: introduce BdrvGraphRWlock structure

2022-11-16 Thread Emanuele Giuseppe Esposito
Just a wrapper to simplify what is available to the struct AioContext.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/graph-lock.c | 59 ++
 include/block/aio.h| 12 
 include/block/graph-lock.h |  1 +
 3 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/block/graph-lock.c b/block/graph-lock.c
index b608a89d7c..c3c6eeedad 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -44,12 +44,23 @@ static uint32_t orphaned_reader_count;
 /* Queue of readers waiting for the writer to finish */
 static CoQueue reader_queue;
 
+struct BdrvGraphRWlock {
+/* How many readers are currently reading the graph. */
+uint32_t reader_count;
+
+/*
+ * List of BdrvGraphRWlock kept in graph-lock.c
+ * Protected by aio_context_list_lock
+ */
+QTAILQ_ENTRY(BdrvGraphRWlock) next_aio;
+};
+
 /*
- * List of AioContext. This list ensures that each AioContext
+ * List of BdrvGraphRWlock. This list ensures that each BdrvGraphRWlock
  * can safely modify only its own counter, avoid reading/writing
  * others and thus improving performances by avoiding cacheline bounces.
  */
-static QTAILQ_HEAD(, AioContext) aio_context_list =
+static QTAILQ_HEAD(, BdrvGraphRWlock) aio_context_list =
 QTAILQ_HEAD_INITIALIZER(aio_context_list);
 
 static void __attribute__((__constructor__)) bdrv_init_graph_lock(void)
@@ -60,29 +71,31 @@ static void __attribute__((__constructor__)) bdrv_init_graph_lock(void)
 
 void register_aiocontext(AioContext *ctx)
 {
+ctx->bdrv_graph = g_new0(BdrvGraphRWlock, 1);
 QEMU_LOCK_GUARD(&aio_context_list_lock);
-assert(ctx->reader_count == 0);
-QTAILQ_INSERT_TAIL(&aio_context_list, ctx, next_aio);
+assert(ctx->bdrv_graph->reader_count == 0);
+QTAILQ_INSERT_TAIL(&aio_context_list, ctx->bdrv_graph, next_aio);
 }
 
 void unregister_aiocontext(AioContext *ctx)
 {
 QEMU_LOCK_GUARD(&aio_context_list_lock);
-orphaned_reader_count += ctx->reader_count;
-QTAILQ_REMOVE(&aio_context_list, ctx, next_aio);
+orphaned_reader_count += ctx->bdrv_graph->reader_count;
+QTAILQ_REMOVE(&aio_context_list, ctx->bdrv_graph, next_aio);
+g_free(ctx->bdrv_graph);
 }
 
 static uint32_t reader_count(void)
 {
-AioContext *ctx;
+BdrvGraphRWlock *brdv_graph;
 uint32_t rd;
 
 QEMU_LOCK_GUARD(&aio_context_list_lock);
 
 /* rd can temporarly be negative, but the total will *always* be >= 0 */
 rd = orphaned_reader_count;
-QTAILQ_FOREACH(ctx, &aio_context_list, next_aio) {
-rd += qatomic_read(&ctx->reader_count);
+QTAILQ_FOREACH(brdv_graph, &aio_context_list, next_aio) {
+rd += qatomic_read(&brdv_graph->reader_count);
 }
 
 /* shouldn't overflow unless there are 2^31 readers */
@@ -138,12 +151,17 @@ void bdrv_graph_wrunlock(void)
 
 void coroutine_fn bdrv_graph_co_rdlock(void)
 {
-AioContext *aiocontext;
-aiocontext = qemu_get_current_aio_context();
+BdrvGraphRWlock *bdrv_graph;
+bdrv_graph = qemu_get_current_aio_context()->bdrv_graph;
+
+/* Do not lock if in main thread */
+if (qemu_in_main_thread()) {
+return;
+}
 
 for (;;) {
-qatomic_set(&aiocontext->reader_count,
-aiocontext->reader_count + 1);
+qatomic_set(&bdrv_graph->reader_count,
+bdrv_graph->reader_count + 1);
 /* make sure writer sees reader_count before we check has_writer */
 smp_mb();
 
@@ -192,7 +210,7 @@ void coroutine_fn bdrv_graph_co_rdlock(void)
 }
 
 /* slow path where reader sleeps */
-aiocontext->reader_count--;
+bdrv_graph->reader_count--;
 aio_wait_kick();
 qemu_co_queue_wait(&reader_queue, &aio_context_list_lock);
 }
@@ -201,11 +219,16 @@ void coroutine_fn bdrv_graph_co_rdlock(void)
 
 void coroutine_fn bdrv_graph_co_rdunlock(void)
 {
-AioContext *aiocontext;
-aiocontext = qemu_get_current_aio_context();
+BdrvGraphRWlock *bdrv_graph;
+bdrv_graph = qemu_get_current_aio_context()->bdrv_graph;
+
+/* Do not lock if in main thread */
+if (qemu_in_main_thread()) {
+return;
+}
 
-qatomic_store_release(&aiocontext->reader_count,
-  aiocontext->reader_count - 1);
+qatomic_store_release(&bdrv_graph->reader_count,
+  bdrv_graph->reader_count - 1);
 /* make sure writer sees reader_count before we check has_writer */
 smp_mb();
 
diff --git a/include/block/aio.h b/include/block/aio.h
index 8e64f81d01..0f65a3cc9e 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -22,6 +22,7 @@
 #include "qemu/event_notifier.h"
 #include "qemu/thread.h"
 #include "qemu/timer.h"
+#include "block/graph-lock.h"
 
 typedef struct BlockAIOCB BlockAIOCB;
 typedef void BlockCompletionFunc(void *opaque, int ret);
@@ -127,14 +128,13 @@ struct AioContext {
 /* Used by AioContext users to protect from multi-threaded access.  */
 QemuRecMutex lock;
 
-/* How many readers in this 

[PATCH 14/20] block-gen: assert that bdrv_co_pread is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
This function, in addition to being called by a generated_co_wrapper,
is also called elsewhere.
The strategy is to always take the lock in the function invoked when
the coroutine is created, to avoid recursive locking.

By protecting bdrv_co_pread, we also automatically protect
the following other generated_co_wrappers:
blk_co_pread
blk_co_preadv
blk_co_preadv_part

Protecting bdrv_driver_preadv() implies that the following BlockDriver
callbacks are always called with graph rdlock taken:
- bdrv_co_preadv_part
- bdrv_co_preadv
- bdrv_aio_preadv
- bdrv_co_readv

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c| 1 +
 block/io.c   | 1 +
 block/mirror.c   | 6 --
 include/block/block_int-common.h | 5 +
 include/block/block_int-io.h | 1 +
 tests/unit/test-bdrv-drain.c | 2 ++
 6 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d48ec3a2ac..083ed6009e 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1289,6 +1289,7 @@ blk_co_do_preadv_part(BlockBackend *blk, int64_t offset, int64_t bytes,
 IO_CODE();
 
 blk_wait_while_drained(blk);
+GRAPH_RDLOCK_GUARD();
 
 /* Call blk_bs() only after waiting, the graph may have changed */
 bs = blk_bs(blk);
diff --git a/block/io.c b/block/io.c
index 92c74648fb..cfc201ef91 100644
--- a/block/io.c
+++ b/block/io.c
@@ -942,6 +942,7 @@ static int coroutine_fn bdrv_driver_preadv(BlockDriverState *bs,
 unsigned int nb_sectors;
 QEMUIOVector local_qiov;
 int ret;
+assert_bdrv_graph_readable();
 
 bdrv_check_qiov_request(offset, bytes, qiov, qiov_offset, &error_abort);
 assert(!(flags & ~bs->supported_read_flags));
diff --git a/block/mirror.c b/block/mirror.c
index 251adc5ae0..f509cc1cb1 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -389,8 +389,10 @@ static void coroutine_fn mirror_co_read(void *opaque)
 op->is_in_flight = true;
 trace_mirror_one_iteration(s, op->offset, op->bytes);
 
-ret = bdrv_co_preadv(s->mirror_top_bs->backing, op->offset, op->bytes,
- &op->qiov, 0);
+WITH_GRAPH_RDLOCK_GUARD() {
+ret = bdrv_co_preadv(s->mirror_top_bs->backing, op->offset, op->bytes,
+ &op->qiov, 0);
+}
 mirror_read_complete(op, ret);
 }
 
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index e8d2e4b6c7..64c5bb64de 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -476,6 +476,7 @@ struct BlockDriver {
   Error **errp);
 
 /* aio */
+/* Called with graph rdlock held. */
 BlockAIOCB *(*bdrv_aio_preadv)(BlockDriverState *bs,
 int64_t offset, int64_t bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque);
@@ -489,6 +490,7 @@ struct BlockDriver {
 int64_t offset, int bytes,
 BlockCompletionFunc *cb, void *opaque);
 
+/* Called with graph rdlock held. */
 int coroutine_fn (*bdrv_co_readv)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
 
@@ -506,11 +508,14 @@ struct BlockDriver {
  * no larger than 'max_transfer'.
  *
  * The buffer in @qiov may point directly to guest memory.
+ *
+ * Called with graph rdlock held.
  */
 int coroutine_fn (*bdrv_co_preadv)(BlockDriverState *bs,
 int64_t offset, int64_t bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags);
 
+/* Called with graph rdlock held. */
 int coroutine_fn (*bdrv_co_preadv_part)(BlockDriverState *bs,
 int64_t offset, int64_t bytes,
 QEMUIOVector *qiov, size_t qiov_offset,
diff --git a/include/block/block_int-io.h b/include/block/block_int-io.h
index ae88507d6a..ac6ad3b3ff 100644
--- a/include/block/block_int-io.h
+++ b/include/block/block_int-io.h
@@ -60,6 +60,7 @@ static inline int coroutine_fn bdrv_co_pread(BdrvChild *child,
 {
 QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
 IO_CODE();
+assert_bdrv_graph_readable();
 
 return bdrv_co_preadv(child, offset, bytes, &qiov, flags);
 }
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 2686a8acee..90edc2f5bf 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -967,6 +967,8 @@ static void coroutine_fn test_co_delete_by_drain(void *opaque)
 void *buffer = g_malloc(65536);
 QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buffer, 65536);
 
+GRAPH_RDLOCK_GUARD();
+
 /* Pretend some internal write operation from parent to child.
  * Important: We have to read from the child, not from the parent!
  * Draining works by first propagating it all up the tree to the
-- 
2.31.1




[PATCH 06/20] block: assert that graph read and writes are performed correctly

2022-11-16 Thread Emanuele Giuseppe Esposito
Remove the old assert_bdrv_graph_writable, and replace it with
the new version using graph-lock API.
See the function documentation for more information.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c|  4 ++--
 block/graph-lock.c | 11 +++
 include/block/block_int-global-state.h | 17 -
 include/block/graph-lock.h | 15 +++
 4 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/block.c b/block.c
index afab74d4da..1c870d85e6 100644
--- a/block.c
+++ b/block.c
@@ -1414,7 +1414,7 @@ static void bdrv_child_cb_attach(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
 
-assert_bdrv_graph_writable(bs);
+assert_bdrv_graph_writable();
 QLIST_INSERT_HEAD(&bs->children, child, next);
 
 if (bs->drv->is_filter || (child->role & BDRV_CHILD_FILTERED)) {
@@ -1461,7 +1461,7 @@ static void bdrv_child_cb_detach(BdrvChild *child)
 bdrv_backing_detach(child);
 }
 
-assert_bdrv_graph_writable(bs);
+assert_bdrv_graph_writable();
 QLIST_REMOVE(child, next);
 if (child == bs->backing) {
 assert(child != bs->file);
diff --git a/block/graph-lock.c b/block/graph-lock.c
index c3c6eeedad..07476fd7c8 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -242,3 +242,14 @@ void coroutine_fn bdrv_graph_co_rdunlock(void)
 aio_wait_kick();
 }
 }
+
+void assert_bdrv_graph_readable(void)
+{
+assert(qemu_in_main_thread() || reader_count());
+}
+
+void assert_bdrv_graph_writable(void)
+{
+assert(qemu_in_main_thread());
+assert(qatomic_read(&has_writer));
+}
diff --git a/include/block/block_int-global-state.h b/include/block/block_int-global-state.h
index b49f4eb35b..2f0993f6e9 100644
--- a/include/block/block_int-global-state.h
+++ b/include/block/block_int-global-state.h
@@ -310,21 +310,4 @@ void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
  */
 void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
 
-/**
- * Make sure that the function is running under both drain and BQL.
- * The latter protects from concurrent writings
- * from the GS API, while the former prevents concurrent reads
- * from I/O.
- */
-static inline void assert_bdrv_graph_writable(BlockDriverState *bs)
-{
-/*
- * TODO: this function is incomplete. Because the users of this
- * assert lack the necessary drains, check only for BQL.
- * Once the necessary drains are added,
- * assert also for qatomic_read(&bs->quiesce_counter) > 0
- */
-assert(qemu_in_main_thread());
-}
-
 #endif /* BLOCK_INT_GLOBAL_STATE_H */
diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index fc806aefa3..9430707dca 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -126,5 +126,20 @@ void coroutine_fn bdrv_graph_co_rdlock(void);
  */
 void coroutine_fn bdrv_graph_co_rdunlock(void);
 
+/*
+ * assert_bdrv_graph_readable:
+ * Make sure that the reader is either the main loop,
+ * or there is at least a reader holding the rdlock.
+ * In this way an incoming writer is aware of the read and waits.
+ */
+void assert_bdrv_graph_readable(void);
+
+/*
+ * assert_bdrv_graph_writable:
+ * Make sure that the writer is the main loop and has set @has_writer,
+ * so that incoming readers will pause.
+ */
+void assert_bdrv_graph_writable(void);
+
 #endif /* GRAPH_LOCK_H */
 
-- 
2.31.1




[PATCH 05/20] block: remove unnecessary assert_bdrv_graph_writable()

2022-11-16 Thread Emanuele Giuseppe Esposito
We don't protect bdrv->aio_context with the graph rwlock,
so these assertions are not needed.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block.c b/block.c
index 4ef537a9f2..afab74d4da 100644
--- a/block.c
+++ b/block.c
@@ -7183,7 +7183,6 @@ static void bdrv_detach_aio_context(BlockDriverState *bs)
 if (bs->quiesce_counter) {
 aio_enable_external(bs->aio_context);
 }
-assert_bdrv_graph_writable(bs);
 bs->aio_context = NULL;
 }
 
@@ -7197,7 +7196,6 @@ static void bdrv_attach_aio_context(BlockDriverState *bs,
 aio_disable_external(new_context);
 }
 
-assert_bdrv_graph_writable(bs);
 bs->aio_context = new_context;
 
 if (bs->drv && bs->drv->bdrv_attach_aio_context) {
@@ -7278,7 +7276,6 @@ static void bdrv_set_aio_context_commit(void *opaque)
 BlockDriverState *bs = (BlockDriverState *) state->bs;
 AioContext *new_context = state->new_ctx;
 AioContext *old_context = bdrv_get_aio_context(bs);
-assert_bdrv_graph_writable(bs);
 
 /*
  * Take the old AioContex when detaching it from bs.
-- 
2.31.1




[PATCH 16/20] block-gen: assert that bdrv_co_{read/write}v_vmstate are always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
The only caller of these functions is bdrv_{read/write}v_vmstate, a
generated_co_wrapper function that already takes the
graph read lock.

Protecting bdrv_co_{read/write}v_vmstate() implies that
BlockDriver->bdrv_{load/save}_vmstate() is always called with
graph rdlock taken.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/io.c   | 2 ++
 include/block/block_int-common.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/block/io.c b/block/io.c
index 0bf3919939..c9b451fecd 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2633,6 +2633,7 @@ bdrv_co_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
 BlockDriverState *child_bs = bdrv_primary_bs(bs);
 int ret;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 ret = bdrv_check_qiov_request(pos, qiov->size, qiov, 0, NULL);
 if (ret < 0) {
@@ -2665,6 +2666,7 @@ bdrv_co_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
 BlockDriverState *child_bs = bdrv_primary_bs(bs);
 int ret;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 ret = bdrv_check_qiov_request(pos, qiov->size, qiov, 0, NULL);
 if (ret < 0) {
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index bab0521943..568c2d3092 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -724,9 +724,11 @@ struct BlockDriver {
  Error **errp);
 BlockStatsSpecific *(*bdrv_get_specific_stats)(BlockDriverState *bs);
 
+/* Called with graph rdlock held. */
 int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
   QEMUIOVector *qiov,
   int64_t pos);
+/* Called with graph rdlock held. */
 int coroutine_fn (*bdrv_load_vmstate)(BlockDriverState *bs,
   QEMUIOVector *qiov,
   int64_t pos);
-- 
2.31.1




[PATCH 09/20] block-backend: introduce new generated_co_wrapper_blk annotation

2022-11-16 Thread Emanuele Giuseppe Esposito
This annotation will be used to distinguish the blk_* API from the
bdrv_* API in block-gen.c. The reason for this distinction is that
the blk_* API always ends up calling bdrv_*, which has
implications when we introduce the read graph lock.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/block/block-common.h  |  1 +
 include/sysemu/block-backend-io.h | 69 ---
 2 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index 683e3d1c51..f70f1560c5 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -42,6 +42,7 @@
  */
 #define generated_co_wrapper
 #define generated_co_wrapper_simple
+#define generated_co_wrapper_blk
 
 #include "block/dirty-bitmap.h"
 #include "block/blockjob.h"
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index a47cb825e5..887a29dc59 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -110,77 +110,78 @@ int coroutine_fn blk_is_allocated_above(BlockBackend *blk,
  * the "I/O or GS" API.
  */
 
-int generated_co_wrapper blk_pread(BlockBackend *blk, int64_t offset,
-   int64_t bytes, void *buf,
-   BdrvRequestFlags flags);
+int generated_co_wrapper_blk blk_pread(BlockBackend *blk, int64_t offset,
+   int64_t bytes, void *buf,
+   BdrvRequestFlags flags);
 int coroutine_fn blk_co_pread(BlockBackend *blk, int64_t offset, int64_t bytes,
   void *buf, BdrvRequestFlags flags);
 
-int generated_co_wrapper blk_preadv(BlockBackend *blk, int64_t offset,
-int64_t bytes, QEMUIOVector *qiov,
-BdrvRequestFlags flags);
+int generated_co_wrapper_blk blk_preadv(BlockBackend *blk, int64_t offset,
+int64_t bytes, QEMUIOVector *qiov,
+BdrvRequestFlags flags);
 int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
int64_t bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags);
 
-int generated_co_wrapper blk_preadv_part(BlockBackend *blk, int64_t offset,
- int64_t bytes, QEMUIOVector *qiov,
- size_t qiov_offset,
- BdrvRequestFlags flags);
+int generated_co_wrapper_blk blk_preadv_part(BlockBackend *blk, int64_t offset,
+ int64_t bytes, QEMUIOVector *qiov,
+ size_t qiov_offset,
+ BdrvRequestFlags flags);
 int coroutine_fn blk_co_preadv_part(BlockBackend *blk, int64_t offset,
 int64_t bytes, QEMUIOVector *qiov,
 size_t qiov_offset, BdrvRequestFlags flags);
 
-int generated_co_wrapper blk_pwrite(BlockBackend *blk, int64_t offset,
-int64_t bytes, const void *buf,
-BdrvRequestFlags flags);
+int generated_co_wrapper_blk blk_pwrite(BlockBackend *blk, int64_t offset,
+int64_t bytes, const void *buf,
+BdrvRequestFlags flags);
int coroutine_fn blk_co_pwrite(BlockBackend *blk, int64_t offset, int64_t bytes,
const void *buf, BdrvRequestFlags flags);
 
-int generated_co_wrapper blk_pwritev(BlockBackend *blk, int64_t offset,
- int64_t bytes, QEMUIOVector *qiov,
- BdrvRequestFlags flags);
+int generated_co_wrapper_blk blk_pwritev(BlockBackend *blk, int64_t offset,
+ int64_t bytes, QEMUIOVector *qiov,
+ BdrvRequestFlags flags);
 int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
 int64_t bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags);
 
-int generated_co_wrapper blk_pwritev_part(BlockBackend *blk, int64_t offset,
-  int64_t bytes, QEMUIOVector *qiov,
-  size_t qiov_offset,
-  BdrvRequestFlags flags);
+int generated_co_wrapper_blk blk_pwritev_part(BlockBackend *blk, int64_t offset,
+  int64_t bytes, QEMUIOVector *qiov,
+  size_t qiov_offset,
+  BdrvRequestFlags flags);
 int coroutine_fn blk_co_pwritev_part(BlockBackend *blk, int64_t offset,

[PATCH 03/20] async: register/unregister aiocontext in graph lock list

2022-11-16 Thread Emanuele Giuseppe Esposito
Add/remove the AioContext in aio_context_list in graph-lock.c only when
it is actually being created/destroyed.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 util/async.c | 4 
 util/meson.build | 1 +
 2 files changed, 5 insertions(+)

diff --git a/util/async.c b/util/async.c
index 63434ddae4..14d63b3091 100644
--- a/util/async.c
+++ b/util/async.c
@@ -27,6 +27,7 @@
 #include "qapi/error.h"
 #include "block/aio.h"
 #include "block/thread-pool.h"
+#include "block/graph-lock.h"
 #include "qemu/main-loop.h"
 #include "qemu/atomic.h"
 #include "qemu/rcu_queue.h"
@@ -376,6 +377,7 @@ aio_ctx_finalize(GSource *source)
 qemu_rec_mutex_destroy(&ctx->lock);
 qemu_lockcnt_destroy(&ctx->list_lock);
 timerlistgroup_deinit(&ctx->tlg);
+unregister_aiocontext(ctx);
 aio_context_destroy(ctx);
 }
 
@@ -574,6 +576,8 @@ AioContext *aio_context_new(Error **errp)
 ctx->thread_pool_min = 0;
 ctx->thread_pool_max = THREAD_POOL_MAX_THREADS_DEFAULT;
 
+register_aiocontext(ctx);
+
 return ctx;
 fail:
 g_source_destroy(&ctx->source);
diff --git a/util/meson.build b/util/meson.build
index 59c1f467bb..ecee2ba899 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -70,6 +70,7 @@ endif
 
 if have_block
   util_ss.add(files('aiocb.c', 'async.c', 'aio-wait.c'))
+  util_ss.add(files('../block/graph-lock.c'))
   util_ss.add(files('base64.c'))
   util_ss.add(files('buffer.c'))
   util_ss.add(files('bufferiszero.c'))
-- 
2.31.1




[PATCH 17/20] block-gen: assert that bdrv_co_pdiscard is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
This function, in addition to being called by a generated_co_wrapper,
is also called by the blk_* API.
The strategy is to always take the lock in the function invoked when
the coroutine is created, to avoid recursive locking.

Protecting bdrv_co_pdiscard{_snapshot}() implies that the following BlockDriver
callbacks are always called with graph rdlock taken:
- bdrv_co_pdiscard
- bdrv_aio_pdiscard
- bdrv_co_pdiscard_snapshot

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c| 1 +
 block/io.c   | 2 ++
 include/block/block_int-common.h | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index d660772375..211a813523 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1716,6 +1716,7 @@ blk_co_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes)
 IO_CODE();
 
 blk_wait_while_drained(blk);
+GRAPH_RDLOCK_GUARD();
 
 ret = blk_check_byte_request(blk, offset, bytes);
 if (ret < 0) {
diff --git a/block/io.c b/block/io.c
index c9b451fecd..bc9f47538c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2885,6 +2885,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
 int head, tail, align;
 BlockDriverState *bs = child->bs;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
 return -ENOMEDIUM;
@@ -3488,6 +3489,7 @@ bdrv_co_pdiscard_snapshot(BlockDriverState *bs, int64_t offset, int64_t bytes)
 BlockDriver *drv = bs->drv;
 int ret;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 if (!drv) {
 return -ENOMEDIUM;
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 568c2d3092..7c34a8e40f 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -486,6 +486,7 @@ struct BlockDriver {
 BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *(*bdrv_aio_flush)(BlockDriverState *bs,
 BlockCompletionFunc *cb, void *opaque);
+/* Called with graph rdlock taken. */
 BlockAIOCB *(*bdrv_aio_pdiscard)(BlockDriverState *bs,
 int64_t offset, int bytes,
 BlockCompletionFunc *cb, void *opaque);
@@ -559,6 +560,7 @@ struct BlockDriver {
  */
 int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs,
 int64_t offset, int64_t bytes, BdrvRequestFlags flags);
+/* Called with graph rdlock taken. */
 int coroutine_fn (*bdrv_co_pdiscard)(BlockDriverState *bs,
 int64_t offset, int64_t bytes);
 
@@ -647,6 +649,7 @@ struct BlockDriver {
 int coroutine_fn (*bdrv_co_snapshot_block_status)(BlockDriverState *bs,
 bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
 int64_t *map, BlockDriverState **file);
+/* Called with graph rdlock taken. */
 int coroutine_fn (*bdrv_co_pdiscard_snapshot)(BlockDriverState *bs,
 int64_t offset, int64_t bytes);
 
-- 
2.31.1




[PATCH 20/20] block-gen: assert that nbd_co_do_establish_connection is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
The only caller of this function is nbd_do_establish_connection, a
generated_co_wrapper that already takes the graph read lock.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/nbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/nbd.c b/block/nbd.c
index 7d485c86d2..5cad58aaf6 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -322,6 +322,7 @@ int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
 int ret;
 IO_CODE();
 
+assert_bdrv_graph_readable();
 assert(!s->ioc);
 
 s->ioc = nbd_co_establish_connection(s->conn, &s->info, blocking, errp);
-- 
2.31.1




[PATCH 15/20] block-gen: assert that {bdrv/blk}_co_flush is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
This function, in addition to being called by a generated_co_wrapper,
is also called by the blk_* API.
The strategy is to always take the lock in the function invoked when
the coroutine is created, to avoid recursive locking.

Protecting bdrv_co_flush() implies that the following BlockDriver
callbacks are always called with graph rdlock taken:
- bdrv_co_flush
- bdrv_co_flush_to_os
- bdrv_co_flush_to_disk

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c| 3 ++-
 block/io.c   | 1 +
 include/block/block_int-common.h | 6 ++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 083ed6009e..d660772375 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1759,8 +1759,9 @@ int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
 /* To be called between exactly one pair of blk_inc/dec_in_flight() */
 static int coroutine_fn blk_co_do_flush(BlockBackend *blk)
 {
-blk_wait_while_drained(blk);
 IO_CODE();
+blk_wait_while_drained(blk);
+GRAPH_RDLOCK_GUARD();
 
 if (!blk_is_available(blk)) {
 return -ENOMEDIUM;
diff --git a/block/io.c b/block/io.c
index cfc201ef91..0bf3919939 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2757,6 +2757,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
 int ret = 0;
 IO_CODE();
 
+assert_bdrv_graph_readable();
 bdrv_inc_in_flight(bs);
 
 if (!bdrv_is_inserted(bs) || bdrv_is_read_only(bs) ||
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 64c5bb64de..bab0521943 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -661,6 +661,8 @@ struct BlockDriver {
  * Flushes all data for all layers by calling bdrv_co_flush for underlying
  * layers, if needed. This function is needed for deterministic
  * synchronization of the flush finishing callback.
+ *
+ * Called with graph rdlock taken.
  */
 int coroutine_fn (*bdrv_co_flush)(BlockDriverState *bs);
 
@@ -671,6 +673,8 @@ struct BlockDriver {
 /*
  * Flushes all data that was already written to the OS all the way down to
  * the disk (for example file-posix.c calls fsync()).
+ *
+ * Called with graph rdlock taken.
  */
 int coroutine_fn (*bdrv_co_flush_to_disk)(BlockDriverState *bs);
 
@@ -678,6 +682,8 @@ struct BlockDriver {
  * Flushes all internal caches to the OS. The data may still sit in a
  * writeback cache of the host OS, but it will survive a crash of the qemu
  * process.
+ *
+ * Called with graph rdlock held.
  */
 int coroutine_fn (*bdrv_co_flush_to_os)(BlockDriverState *bs);
 
-- 
2.31.1




[PATCH 00/20] Protect the block layer with a rwlock: part 1

2022-11-16 Thread Emanuele Giuseppe Esposito
This series is the first of four that aim to introduce and use a new
graph rwlock in the QEMU block layer.
The aim is to replace the current AioContext lock with much finer-grained
locks, each protecting only specific data.
Currently the AioContext lock is used pretty much everywhere, and it's not
even clear what it is protecting exactly.

The aim of the rwlock is to cover graph modifications: more precisely,
cases where a BlockDriverState parent or child list is modified or read,
since it can be concurrently accessed by the main loop and iothreads.

The main assumption is that the main loop is the only one allowed to perform
graph modifications, and so far this has always held true in the current code.

The rwlock is inspired from cpus-common.c implementation, and aims to
reduce cacheline bouncing by having per-aiocontext counter of readers.
All details and implementation of the lock are in patch 1.

We distinguish between writer (main loop, under BQL) that modifies the
graph, and readers (all other coroutines running in various AioContext),
that go through the graph edges, reading ->parents and ->children.
The writer (main loop) has "exclusive" access: it first waits for
in-flight reads to finish, and then prevents incoming readers from
entering while it holds this exclusive access.
The readers (coroutines in multiple AioContext) are free to
access the graph as long the writer is not modifying the graph.
In case it is, they go in a CoQueue and sleep until the writer
is done.
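
The protocol above can be sketched in plain C. This is only an illustrative,
single-threaded model, not the QEMU API: graph_rdlock_tryfast() and the other
names are invented for the sketch, and the real implementation additionally
uses per-AioContext counters, memory barriers, and a CoQueue where readers
sleep until the writer wakes them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool has_writer;        /* set by the main loop while it modifies the graph */
static uint32_t reader_count;  /* per-AioContext in the real code */

static bool graph_rdlock_tryfast(void)
{
    /* Readers enter freely as long as no writer holds the lock. */
    if (has_writer) {
        return false;          /* real code: sleep in the reader CoQueue */
    }
    reader_count++;
    return true;
}

static void graph_rdunlock(void)
{
    assert(reader_count > 0);
    reader_count--;
    /* real code: aio_wait_kick() so a waiting writer can proceed */
}

static void graph_wrlock(void)
{
    has_writer = true;
    /* real code: AIO_WAIT_WHILE() until all in-flight readers are gone */
    assert(reader_count == 0);
}

static void graph_wrunlock(void)
{
    has_writer = false;
    /* real code: wake every coroutine sleeping in the reader CoQueue */
}
```

The point of the model is the ordering: a reader only enters when no writer is
active, and the writer only proceeds once the reader count has drained to zero.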

In this first series, my aim is to introduce the lock (patches 1-3,6), cover the
main graph writer (patch 4), define assertions (patch 5) and start using the
read lock in the generated_co_wrapper functions (7-20).
Such functions recursively traverse the BlockDriverState graph, so they must
take the graph rdlock.

We distinguish two cases related to the generated_co_wrapper (often shortened
to g_c_w):
- qemu_in_coroutine(), which means the function is already running in a
  coroutine. This means we don't take the lock, because the coroutine must
  have taken it when it started
- !qemu_in_coroutine(), which means we need to create a new coroutine that
  performs the operation requested. In this case we take the rdlock as soon as
  the coroutine starts, and release only before finishing.

In this and following series, we try to follow the following locking pattern:
- bdrv_co_* functions that call BlockDriver callbacks always expect the lock
  to be taken, therefore they assert.
- blk_co_* functions usually call blk_wait_while_drained(), therefore they must
  take the lock internally. Therefore we introduce generated_co_wrapper_blk,
  which does not take the rdlock when starting the coroutine.
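
The two conventions can be summarized with a small sketch. The function names
below are hypothetical, and GRAPH_RDLOCK_GUARD() — a scope guard in QEMU — is
modelled with explicit lock/unlock calls around the bdrv_co_* call.

```c
#include <assert.h>

static int reader_count;  /* stands in for the per-AioContext counters */

static void graph_rdlock(void)   { reader_count++; }
static void graph_rdunlock(void) { reader_count--; }

static void assert_graph_readable(void)
{
    /* what assert_bdrv_graph_readable() boils down to outside the main loop */
    assert(reader_count > 0);
}

static int bdrv_co_pdiscard_sketch(void)
{
    /* bdrv_co_* style: expects the caller to already hold the rdlock */
    assert_graph_readable();
    return 0;
}

static int blk_co_do_pdiscard_sketch(void)
{
    /* blk_co_* style: takes the rdlock itself (after waiting for drains) */
    graph_rdlock();
    int ret = bdrv_co_pdiscard_sketch();
    graph_rdunlock();
    return ret;
}
```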

The long term goal of this series is to eventually replace the AioContext lock,
so that we can get rid of it once and for all.

This series is based on v4 of "Still more coroutine and various fixes in block 
layer".

Based-on: <20221116122241.2856527-1-eespo...@redhat.com>

Thank you,
Emanuele

Emanuele Giuseppe Esposito (19):
  graph-lock: introduce BdrvGraphRWlock structure
  async: register/unregister aiocontext in graph lock list
  block.c: wrlock in bdrv_replace_child_noperm
  block: remove unnecessary assert_bdrv_graph_writable()
  block: assert that graph read and writes are performed correctly
  graph-lock: implement WITH_GRAPH_RDLOCK_GUARD and GRAPH_RDLOCK_GUARD
macros
  block-coroutine-wrapper.py: take the graph rdlock in bdrv_* functions
  block-backend: introduce new generated_co_wrapper_blk annotation
  block-gen: assert that {bdrv/blk}_co_truncate is always called with
graph rdlock taken
  block-gen: assert that bdrv_co_{check/invalidate_cache} are always
called with graph rdlock taken
  block-gen: assert that bdrv_co_pwrite is always called with graph
rdlock taken
  block-gen: assert that bdrv_co_pwrite_{zeros/sync} is always called
with graph rdlock taken
  block-gen: assert that bdrv_co_pread is always called with graph
rdlock taken
  block-gen: assert that {bdrv/blk}_co_flush is always called with graph
rdlock taken
  block-gen: assert that bdrv_co_{read/write}v_vmstate are always called
with graph rdlock taken
  block-gen: assert that bdrv_co_pdiscard is always called with graph
rdlock taken
  block-gen: assert that bdrv_co_common_block_status_above is always
called with graph rdlock taken
  block-gen: assert that bdrv_co_ioctl is always called with graph
rdlock taken
  block-gen: assert that nbd_co_do_establish_connection is always called
with graph rdlock taken

Paolo Bonzini (1):
  block: introduce a lock to protect graph operations

 block.c|  15 +-
 block/backup.c |   3 +
 block/block-backend.c  |  10 +-
 block/block-copy.c |  10 +-
 block/graph-lock.c | 255 +
 block/io.c |  15 ++
 block/meson.build  

[PATCH 13/20] block-gen: assert that bdrv_co_pwrite_{zeros/sync} is always called with graph rdlock taken

2022-11-16 Thread Emanuele Giuseppe Esposito
Already protected by bdrv_co_pwrite callers.

Protecting bdrv_co_do_pwrite_zeroes() implies that
BlockDriver->bdrv_co_pwrite_zeroes() is always called with
graph rdlock taken.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/io.c   | 3 +++
 include/block/block_int-common.h | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/io.c b/block/io.c
index 9280fb9f38..92c74648fb 100644
--- a/block/io.c
+++ b/block/io.c
@@ -904,6 +904,7 @@ int coroutine_fn bdrv_co_pwrite_sync(BdrvChild *child, int64_t offset,
 {
 int ret;
 IO_CODE();
+assert_bdrv_graph_readable();
 
 ret = bdrv_co_pwrite(child, offset, bytes, buf, flags);
 if (ret < 0) {
@@ -1660,6 +1661,7 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 bs->bl.request_alignment);
 int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer, MAX_BOUNCE_BUFFER);
 
+assert_bdrv_graph_readable();
 bdrv_check_request(offset, bytes, &error_abort);
 
 if (!drv) {
@@ -2124,6 +2126,7 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, 
int64_t offset,
 {
 IO_CODE();
 trace_bdrv_co_pwrite_zeroes(child->bs, offset, bytes, flags);
+assert_bdrv_graph_readable();
 
 if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
 flags &= ~BDRV_REQ_MAY_UNMAP;
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index d44f825d95..e8d2e4b6c7 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -549,6 +549,8 @@ struct BlockDriver {
  * would use a compact metadata representation to implement this.  This
  * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev()
  * will be called instead.
+ *
+ * Called with graph rdlock taken.
  */
 int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs,
 int64_t offset, int64_t bytes, BdrvRequestFlags flags);
-- 
2.31.1




Re: [PATCH maybe-7.2 1/3] hw/i2c: only schedule pending master when bus is idle

2022-11-16 Thread Corey Minyard
On Wed, Nov 16, 2022 at 09:43:10AM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> It is not given that the current master will release the bus after a
> transfer ends. Only schedule a pending master if the bus is idle.
> 

Yes, I think this is correct.

Acked-by: Corey Minyard 

Is there a reason you are thinking this is needed for 7.2?  There's no
code in qemu proper that uses this yet.  I had assumed that was coming
soon after the patch.

-corey

> Fixes: 37fa5ca42623 ("hw/i2c: support multiple masters")
> Signed-off-by: Klaus Jensen 
> ---
>  hw/i2c/aspeed_i2c.c  |  2 ++
>  hw/i2c/core.c| 37 ++---
>  include/hw/i2c/i2c.h |  2 ++
>  3 files changed, 26 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/i2c/aspeed_i2c.c b/hw/i2c/aspeed_i2c.c
> index c166fd20fa11..1f071a3811f7 100644
> --- a/hw/i2c/aspeed_i2c.c
> +++ b/hw/i2c/aspeed_i2c.c
> @@ -550,6 +550,8 @@ static void aspeed_i2c_bus_handle_cmd(AspeedI2CBus *bus, 
> uint64_t value)
>  }
>  SHARED_ARRAY_FIELD_DP32(bus->regs, reg_cmd, M_STOP_CMD, 0);
>  aspeed_i2c_set_state(bus, I2CD_IDLE);
> +
> +i2c_schedule_pending_master(bus->bus);
>  }
>  
>  if (aspeed_i2c_bus_pkt_mode_en(bus)) {
> diff --git a/hw/i2c/core.c b/hw/i2c/core.c
> index d4ba8146bffb..bed594fe599b 100644
> --- a/hw/i2c/core.c
> +++ b/hw/i2c/core.c
> @@ -185,22 +185,39 @@ int i2c_start_transfer(I2CBus *bus, uint8_t address, 
> bool is_recv)
>  
>  void i2c_bus_master(I2CBus *bus, QEMUBH *bh)
>  {
> +I2CPendingMaster *node = g_new(struct I2CPendingMaster, 1);
> +node->bh = bh;
> +
> +QSIMPLEQ_INSERT_TAIL(&bus->pending_masters, node, entry);
> +}
> +
> +void i2c_schedule_pending_master(I2CBus *bus)
> +{
> +I2CPendingMaster *node;
> +
>  if (i2c_bus_busy(bus)) {
> -I2CPendingMaster *node = g_new(struct I2CPendingMaster, 1);
> -node->bh = bh;
> -
> -QSIMPLEQ_INSERT_TAIL(&bus->pending_masters, node, entry);
> +/* someone is already controlling the bus; wait for it to release it */
> +return;
> +}
>  
> +if (QSIMPLEQ_EMPTY(&bus->pending_masters)) {
>  return;
>  }
>  
> -bus->bh = bh;
> +node = QSIMPLEQ_FIRST(&bus->pending_masters);
> +bus->bh = node->bh;
> +
> +QSIMPLEQ_REMOVE_HEAD(&bus->pending_masters, entry);
> +g_free(node);
> +
>  qemu_bh_schedule(bus->bh);
>  }
>  
>  void i2c_bus_release(I2CBus *bus)
>  {
>  bus->bh = NULL;
> +
> +i2c_schedule_pending_master(bus);
>  }
>  
>  int i2c_start_recv(I2CBus *bus, uint8_t address)
> @@ -234,16 +251,6 @@ void i2c_end_transfer(I2CBus *bus)
>  g_free(node);
>  }
>  bus->broadcast = false;
> -
> -if (!QSIMPLEQ_EMPTY(&bus->pending_masters)) {
> -I2CPendingMaster *node = QSIMPLEQ_FIRST(&bus->pending_masters);
> -bus->bh = node->bh;
> -
> -QSIMPLEQ_REMOVE_HEAD(&bus->pending_masters, entry);
> -g_free(node);
> -
> -qemu_bh_schedule(bus->bh);
> -}
>  }
>  
>  int i2c_send(I2CBus *bus, uint8_t data)
> diff --git a/include/hw/i2c/i2c.h b/include/hw/i2c/i2c.h
> index 9b9581d23097..2a3abacd1ba6 100644
> --- a/include/hw/i2c/i2c.h
> +++ b/include/hw/i2c/i2c.h
> @@ -141,6 +141,8 @@ int i2c_start_send(I2CBus *bus, uint8_t address);
>   */
>  int i2c_start_send_async(I2CBus *bus, uint8_t address);
>  
> +void i2c_schedule_pending_master(I2CBus *bus);
> +
>  void i2c_end_transfer(I2CBus *bus);
>  void i2c_nack(I2CBus *bus);
>  void i2c_ack(I2CBus *bus);
> -- 
> 2.38.1
> 
> 
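The scheduling change in the patch above can be illustrated with a toy model (Python, not QEMU code; the attribute names are simplified stand-ins for the C structures, and `busy` stands in for `i2c_bus_busy()`):

```python
from collections import deque

class I2CBus:
    """Toy model of the patch's pending-master scheduling -- not QEMU code."""
    def __init__(self):
        self.current_master = None      # stands in for bus->bh
        self.pending = deque()          # stands in for bus->pending_masters
        self.busy = False               # stands in for i2c_bus_busy()

    def bus_master(self, master):
        # After the patch, i2c_bus_master() only queues; it never schedules.
        self.pending.append(master)

    def schedule_pending_master(self):
        # Only schedule if the bus is idle and someone is waiting.
        if self.busy or self.current_master is not None or not self.pending:
            return
        self.current_master = self.pending.popleft()

    def bus_release(self):
        self.current_master = None
        self.schedule_pending_master()

bus = I2CBus()
bus.busy = True                 # the current master still holds the bus
bus.bus_master("bmc")
bus.schedule_pending_master()   # must NOT schedule: the bus is not idle
assert bus.current_master is None

bus.busy = False                # transfer ended, the bus went idle
bus.schedule_pending_master()   # now the pending master gets the bus
assert bus.current_master == "bmc"
```

This is exactly the fix's point: scheduling a pending master while the bus is still held would hand the bus to two masters at once.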



[PATCH 00/20] Protect the block layer with a rwlock: part 1

2022-11-16 Thread Emanuele Giuseppe Esposito
This series is the first of four that aim to introduce and use a new
graph rwlock in the QEMU block layer.
The aim is to replace the current AioContext lock with much finer-grained
locks, each aimed at protecting only specific data.
Currently the AioContext lock is used pretty much everywhere, and it is not
even clear what exactly it is protecting.

The aim of the rwlock is to cover graph modifications: more precisely,
when a BlockDriverState parent or child list is modified or read, since it can
be concurrently accessed by the main loop and iothreads.

The main assumption is that the main loop is the only one allowed to perform
graph modifications, and so far this assumption has always held in the current
code.

The rwlock is inspired by the cpus-common.c implementation, and aims to
reduce cacheline bouncing by keeping a per-AioContext counter of readers.
All details and implementation of the lock are in patch 1.

We distinguish between the writer (main loop, under BQL) that modifies the
graph, and readers (all other coroutines running in various AioContexts)
that traverse the graph edges, reading ->parents and ->children.
The writer (main loop) has exclusive access, so it first waits for
current reads to finish, and then prevents incoming ones from
entering while it has exclusive access.
The readers (coroutines in multiple AioContexts) are free to
access the graph as long as the writer is not modifying it.
If it is, they wait in a CoQueue and sleep until the writer
is done.
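As a single-threaded toy model of that protocol (this ignores the per-AioContext reader counters and coroutine machinery entirely; it is a sketch of the state machine, not QEMU's graph lock):

```python
class GraphLock:
    """Toy model of the reader/writer protocol -- not QEMU's graph-lock."""
    def __init__(self):
        self.readers = 0
        self.writer = False
        self.queued_readers = 0     # stands in for the CoQueue

    def rdlock(self):
        if self.writer:
            # A reader arriving during a graph modification queues up
            # (in QEMU it would yield in the CoQueue).
            self.queued_readers += 1
            return False
        self.readers += 1
        return True

    def rdunlock(self):
        self.readers -= 1

    def wrlock(self):
        # The writer runs in the main loop and must first drain readers.
        assert self.readers == 0, "writer must wait for in-flight readers"
        self.writer = True

    def wrunlock(self):
        self.writer = False
        # Wake everyone that queued while the graph was being modified.
        self.readers += self.queued_readers
        self.queued_readers = 0

lock = GraphLock()
assert lock.rdlock()            # a reader enters freely
lock.rdunlock()
lock.wrlock()                   # no readers in flight: the writer enters
assert not lock.rdlock()        # a reader queues while the writer is active
lock.wrunlock()                 # writer done: the queued reader is woken
assert lock.readers == 1
```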

In this first series, my aim is to introduce the lock (patches 1-3,6), cover the
main graph writer (patch 4), define assertions (patch 5) and start using the
read lock in the generated_co_wrapper functions (7-20).
Such functions recursively traverse the BlockDriverState graph, so they must
take the graph rdlock.

We distinguish two cases related to the generated_co_wrapper (often shortened
to g_c_w):
- qemu_in_coroutine(), which means the function is already running in a
  coroutine. This means we don't take the lock, because the coroutine must
  have taken it when it started
- !qemu_in_coroutine(), which means we need to create a new coroutine that
  performs the operation requested. In this case we take the rdlock as soon as
  the coroutine starts, and release only before finishing.
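The two dispatch cases above can be sketched as follows (a hedged Python model of what the generated wrapper does; `in_coroutine` stands in for `qemu_in_coroutine()` and the lock bookkeeping is deliberately simplified):

```python
calls = []

def g_c_w(in_coroutine: bool):
    """Toy model of a generated_co_wrapper dispatch -- not the generator's real output."""
    if in_coroutine:
        # Already running in a coroutine: the coroutine took the rdlock
        # when it started, so just call the bdrv_co_* function directly.
        calls.append("direct call, lock assumed held")
    else:
        # Not in a coroutine: create one that takes the rdlock as soon
        # as it starts and releases it only before finishing.
        calls.append("rdlock")
        calls.append("coroutine body")
        calls.append("rdunlock")

g_c_w(in_coroutine=True)
assert calls == ["direct call, lock assumed held"]

calls.clear()
g_c_w(in_coroutine=False)
assert calls == ["rdlock", "coroutine body", "rdunlock"]
```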

In this and following series, we try to follow the following locking pattern:
- bdrv_co_* functions that call BlockDriver callbacks always expect the lock
  to be taken, therefore they assert.
- blk_co_* functions usually call blk_wait_while_drained(), therefore they must
  take the lock internally. Therefore we introduce generated_co_wrapper_blk,
  which does not take the rdlock when starting the coroutine.

The long term goal of this series is to eventually replace the AioContext lock,
so that we can get rid of it once and for all.

This series is based on v4 of "Still more coroutine and various fixes in block
layer".

Based-on: <20221116122241.2856527-1-eespo...@redhat.com>

Thank you,
Emanuele

Emanuele Giuseppe Esposito (19):
  graph-lock: introduce BdrvGraphRWlock structure
  async: register/unregister aiocontext in graph lock list
  block.c: wrlock in bdrv_replace_child_noperm
  block: remove unnecessary assert_bdrv_graph_writable()
  block: assert that graph read and writes are performed correctly
  graph-lock: implement WITH_GRAPH_RDLOCK_GUARD and GRAPH_RDLOCK_GUARD
macros
  block-coroutine-wrapper.py: take the graph rdlock in bdrv_* functions
  block-backend: introduce new generated_co_wrapper_blk annotation
  block-gen: assert that {bdrv/blk}_co_truncate is always called with
graph rdlock taken
  block-gen: assert that bdrv_co_{check/invalidate_cache} are always
called with graph rdlock taken
  block-gen: assert that bdrv_co_pwrite is always called with graph
rdlock taken
  block-gen: assert that bdrv_co_pwrite_{zeros/sync} is always called
with graph rdlock taken
  block-gen: assert that bdrv_co_pread is always called with graph
rdlock taken
  block-gen: assert that {bdrv/blk}_co_flush is always called with graph
rdlock taken
  block-gen: assert that bdrv_co_{read/write}v_vmstate are always called
with graph rdlock taken
  block-gen: assert that bdrv_co_pdiscard is always called with graph
rdlock taken
  block-gen: assert that bdrv_co_common_block_status_above is always
called with graph rdlock taken
  block-gen: assert that bdrv_co_ioctl is always called with graph
rdlock taken
  block-gen: assert that nbd_co_do_establish_connection is always called
with graph rdlock taken

Paolo Bonzini (1):
  block: introduce a lock to protect graph operations

 block.c|  15 +-
 block/backup.c |   3 +
 block/block-backend.c  |  10 +-
 block/block-copy.c |  10 +-
 block/graph-lock.c | 255 +
 block/io.c |  15 ++
 block/meson.build  


Re: [PATCH v3 10/17] vfio/migration: Move migration v1 logic to vfio_migration_init()

2022-11-16 Thread Avihai Horon



On 16/11/2022 1:56, Alex Williamson wrote:



On Thu, 3 Nov 2022 18:16:13 +0200
Avihai Horon  wrote:


Move vfio_dev_get_region_info() logic from vfio_migration_probe() to
vfio_migration_init(). This logic is specific to v1 protocol and moving
it will make it easier to add the v2 protocol implementation later.
No functional changes intended.

Signed-off-by: Avihai Horon 
---
  hw/vfio/migration.c  | 30 +++---
  hw/vfio/trace-events |  2 +-
  2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 99ffb75782..0e3a950746 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -785,14 +785,14 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
  vbasedev->migration = NULL;
  }

-static int vfio_migration_init(VFIODevice *vbasedev,
-   struct vfio_region_info *info)
+static int vfio_migration_init(VFIODevice *vbasedev)
  {
  int ret;
  Object *obj;
  VFIOMigration *migration;
  char id[256] = "";
  g_autofree char *path = NULL, *oid = NULL;
+struct vfio_region_info *info = NULL;

Nit, I'm not spotting any cases where we need this initialization.  The
same is not true in the code the info handling was extracted from.
Thanks,


You are right. I will drop the initialization in v4.
Thanks!


Alex


  if (!vbasedev->ops->vfio_get_object) {
  return -EINVAL;
@@ -803,6 +803,14 @@ static int vfio_migration_init(VFIODevice *vbasedev,
  return -EINVAL;
  }

+ret = vfio_get_dev_region_info(vbasedev,
+   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
+   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
+   &info);
+if (ret) {
+return ret;
+}
+
  vbasedev->migration = g_new0(VFIOMigration, 1);
  vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
  vbasedev->migration->vm_running = runstate_is_running();
@@ -822,6 +830,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
  goto err;
  }

+g_free(info);
+
  migration = vbasedev->migration;
  migration->vbasedev = vbasedev;

@@ -844,6 +854,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
  return 0;

  err:
+g_free(info);
  vfio_migration_exit(vbasedev);
  return ret;
  }
@@ -857,34 +868,23 @@ int64_t vfio_mig_bytes_transferred(void)

  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
  {
-struct vfio_region_info *info = NULL;
  int ret = -ENOTSUP;

  if (!vbasedev->enable_migration) {
  goto add_blocker;
  }

-ret = vfio_get_dev_region_info(vbasedev,
-   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
-   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
-   &info);
+ret = vfio_migration_init(vbasedev);
  if (ret) {
  goto add_blocker;
  }

-ret = vfio_migration_init(vbasedev, info);
-if (ret) {
-goto add_blocker;
-}
-
-trace_vfio_migration_probe(vbasedev->name, info->index);
-g_free(info);
+trace_vfio_migration_probe(vbasedev->name);
  return 0;

  add_blocker:
   error_setg(&vbasedev->migration_blocker,
 "VFIO device doesn't support migration");
-g_free(info);

  ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
  if (ret < 0) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a21cbd2a56..27c059f96e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -148,7 +148,7 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) 
"%ux%u"
  vfio_display_edid_write_error(void) ""

  # migration.c
-vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
+vfio_migration_probe(const char *name) " (%s)"
  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
  vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t 
dev_state) " (%s) running %d reason %s device state %d"
  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state 
%s"




Re: [PATCH v3 07/17] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support

2022-11-16 Thread Avihai Horon



On 16/11/2022 1:36, Alex Williamson wrote:



On Thu, 3 Nov 2022 18:16:10 +0200
Avihai Horon  wrote:


Currently, if IOMMU of a VFIO container doesn't support dirty page
tracking, migration is blocked. This is because a DMA-able VFIO device
can dirty RAM pages without updating QEMU about it, thus breaking the
migration.

However, this doesn't mean that migration can't be done at all.
In such case, allow migration and let QEMU VFIO code mark the entire
bitmap dirty.

This guarantees that all pages that might have gotten dirty are reported
back, and thus guarantees a valid migration even without VFIO IOMMU
dirty tracking support.

The motivation for this patch is the future introduction of iommufd [1].
iommufd will directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into its internal ops, allowing the usage of these IOCTLs
over iommufd. However, VFIO IOMMU dirty tracking will not be supported
by this VFIO compatibility API.

This patch will allow migration by hosts that use the VFIO compatibility
API and prevent migration regressions caused by the lack of VFIO IOMMU
dirty tracking support.

[1] https://lore.kernel.org/kvm/0-v2-f9436d0bde78+4bb-iommufd_...@nvidia.com/

Signed-off-by: Avihai Horon 
---
  hw/vfio/common.c| 84 +
  hw/vfio/migration.c |  3 +-
  2 files changed, 70 insertions(+), 17 deletions(-)

This duplicates quite a bit of code, I think we can integrate this into
a common flow quite a bit more.  See below, only compile tested. Thanks,


Oh, great, thanks!
I will test it and add it as part of v4.


Alex

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6b5d8c0bf694..4117b40fd9b0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -397,17 +397,33 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
   IOMMUTLBEntry *iotlb)
  {
  struct vfio_iommu_type1_dma_unmap *unmap;
-struct vfio_bitmap *bitmap;
+struct vfio_bitmap *vbitmap;
+unsigned long *bitmap;
+uint64_t bitmap_size;
  uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
  int ret;

-unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+unmap = g_malloc0(sizeof(*unmap) + sizeof(*vbitmap));

-unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+unmap->argsz = sizeof(*unmap);
  unmap->iova = iova;
  unmap->size = size;
-unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
-bitmap = (struct vfio_bitmap *)&unmap->data;
+
+bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+  BITS_PER_BYTE;
+bitmap = g_try_malloc0(bitmap_size);
+if (!bitmap) {
+ret = -ENOMEM;
+goto unmap_exit;
+}
+
+if (!container->dirty_pages_supported) {
+bitmap_set(bitmap, 0, pages);
+goto do_unmap;
+}
+
+unmap->argsz += sizeof(*vbitmap);
+unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;

  /*
   * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
@@ -415,33 +431,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
   * to qemu_real_host_page_size.
   */

-bitmap->pgsize = qemu_real_host_page_size();
-bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-   BITS_PER_BYTE;
+vbitmap = (struct vfio_bitmap *)&unmap->data;
+vbitmap->data = (__u64 *)bitmap;
+vbitmap->pgsize = qemu_real_host_page_size();
+vbitmap->size = bitmap_size;

-if (bitmap->size > container->max_dirty_bitmap_size) {
-error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
- (uint64_t)bitmap->size);
+if (bitmap_size > container->max_dirty_bitmap_size) {
+error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, bitmap_size);
  ret = -E2BIG;
  goto unmap_exit;
  }

-bitmap->data = g_try_malloc0(bitmap->size);
-if (!bitmap->data) {
-ret = -ENOMEM;
-goto unmap_exit;
-}
-
+do_unmap:
  ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
  if (!ret) {
-cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
-iotlb->translated_addr, pages);
+cpu_physical_memory_set_dirty_lebitmap(bitmap, iotlb->translated_addr,
+   pages);
  } else {
  error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
  }

-g_free(bitmap->data);
  unmap_exit:
+g_free(bitmap);
  g_free(unmap);
  return ret;
  }
@@ -460,8 +471,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
  .size = size,
  };

-if (iotlb && container->dirty_pages_supported &&
-vfio_devices_all_running_and_saving(container)) {
+if (iotlb && vfio_devices_all_running_and_saving(container)) {
  return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
  }

@@ -1257,6 +1267,10 @@ 
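As a sanity check on the bitmap sizing used in the code above -- one bit per `qemu_real_host_page_size()` page, rounded up to a whole number of `__u64` words -- here is the same arithmetic in Python (a sketch; the 4 KiB page size is an assumption, not something the patch fixes):

```python
BITS_PER_BYTE = 8
U64_SIZE = 8          # sizeof(__u64)
PAGE_SIZE = 4096      # qemu_real_host_page_size() on most hosts (assumption)

def round_up(x, align):
    return (x + align - 1) // align * align

def dirty_bitmap_size(size_bytes):
    # Mirrors: ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / BITS_PER_BYTE
    pages = round_up(size_bytes, PAGE_SIZE) // PAGE_SIZE
    return round_up(pages, U64_SIZE * BITS_PER_BYTE) // BITS_PER_BYTE

assert dirty_bitmap_size(4096) == 8         # 1 page -> one 64-bit word
assert dirty_bitmap_size(1 << 20) == 32     # 256 pages -> four 64-bit words
```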

Re: [PATCH maybe-7.2 1/3] hw/i2c: only schedule pending master when bus is idle

2022-11-16 Thread Peter Maydell
On Wed, 16 Nov 2022 at 08:43, Klaus Jensen  wrote:
>
> From: Klaus Jensen 
>
> It is not given that the current master will release the bus after a
> transfer ends. Only schedule a pending master if the bus is idle.
>
> Fixes: 37fa5ca42623 ("hw/i2c: support multiple masters")
> Signed-off-by: Klaus Jensen 

If this is a bug fix we should consider for 7.2, you should
send it as a separate patch with a commit message that
describes the consequences of the bug (e.g. whether it
affects common workloads or if it's just an odd corner case).
As one patch in an otherwise RFC series it's going to get lost
otherwise.

thanks
-- PMM



[PATCH v4 04/11] block-coroutine-wrapper.py: introduce generated_co_wrapper_simple

2022-11-16 Thread Emanuele Giuseppe Esposito
This new annotation creates just a function wrapper that creates
a new coroutine. It assumes the caller is not a coroutine.

This is much better than g_c_w, because it makes clear whether the caller
is a coroutine or not, and it provides the advantage of automating
the code creation. In the future all g_c_w functions will be
substituted with g_c_w_simple.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/block/block-common.h   |  1 +
 scripts/block-coroutine-wrapper.py | 87 ++
 2 files changed, 66 insertions(+), 22 deletions(-)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index 297704c1e9..8ae750c7cf 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -43,6 +43,7 @@
  * Read more in docs/devel/block-coroutine-wrapper.rst
  */
 #define generated_co_wrapper
+#define generated_co_wrapper_simple
 
 /* block.c */
 typedef struct BlockDriver BlockDriver;
diff --git a/scripts/block-coroutine-wrapper.py 
b/scripts/block-coroutine-wrapper.py
index 08be813407..f88ef53964 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -62,10 +62,15 @@ def __init__(self, param_decl: str) -> None:
 
 
 class FuncDecl:
-def __init__(self, return_type: str, name: str, args: str) -> None:
+def __init__(self, return_type: str, name: str, args: str,
+ variant: str) -> None:
 self.return_type = return_type.strip()
 self.name = name.strip()
 self.args = [ParamDecl(arg.strip()) for arg in args.split(',')]
+self.create_only_co = False
+
+if variant == '_simple':
+self.create_only_co = True
 
 def gen_list(self, format: str) -> str:
 return ', '.join(format.format_map(arg.__dict__) for arg in self.args)
@@ -75,7 +80,8 @@ def gen_block(self, format: str) -> str:
 
 
 # Match wrappers declared with a generated_co_wrapper mark
-func_decl_re = re.compile(r'^int\s*generated_co_wrapper\s*'
+func_decl_re = re.compile(r'^int\s*generated_co_wrapper'
+  r'(?P<variant>(_[a-z][a-z0-9_]*)?)\s*'
   r'(?P<wrapper_name>[a-z][a-z0-9_]*)'
   r'\((?P<args>[^)]*)\);$', re.MULTILINE)
 
@@ -84,7 +90,8 @@ def func_decl_iter(text: str) -> Iterator:
 for m in func_decl_re.finditer(text):
 yield FuncDecl(return_type='int',
name=m.group('wrapper_name'),
-   args=m.group('args'))
+   args=m.group('args'),
+   variant=m.group('variant'))
 
 
 def snake_to_camel(func_name: str) -> str:
@@ -97,6 +104,51 @@ def snake_to_camel(func_name: str) -> str:
 return ''.join(words)
 
 
+# Checks if we are already in coroutine
+def create_g_c_w(func: FuncDecl) -> str:
+name = func.co_name
+struct_name = func.struct_name
+return f"""\
+int {func.name}({ func.gen_list('{decl}') })
+{{
+if (qemu_in_coroutine()) {{
+return {name}({ func.gen_list('{name}') });
+}} else {{
+{struct_name} s = {{
+.poll_state.bs = {func.bs},
+.poll_state.in_progress = true,
+
+{ func.gen_block('.{name} = {name},') }
+}};
+
+s.poll_state.co = qemu_coroutine_create({name}_entry, &s);
+
+return bdrv_poll_co(&s.poll_state);
+}}
+}}"""
+
+
+# Assumes we are not in coroutine, and creates one
+def create_coroutine_only(func: FuncDecl) -> str:
+name = func.co_name
+struct_name = func.struct_name
+return f"""\
+int {func.name}({ func.gen_list('{decl}') })
+{{
+assert(!qemu_in_coroutine());
+{struct_name} s = {{
+.poll_state.bs = {func.bs},
+.poll_state.in_progress = true,
+
+{ func.gen_block('.{name} = {name},') }
+}};
+
+s.poll_state.co = qemu_coroutine_create({name}_entry, &s);
+
+return bdrv_poll_co(&s.poll_state);
+}}"""
+
+
 def gen_wrapper(func: FuncDecl) -> str:
 assert not '_co_' in func.name
 assert func.return_type == 'int'
@@ -105,7 +157,8 @@ def gen_wrapper(func: FuncDecl) -> str:
 
 subsystem, subname = func.name.split('_', 1)
 
-name = f'{subsystem}_co_{subname}'
+func.co_name = f'{subsystem}_co_{subname}'
+name = func.co_name
 
 t = func.args[0].type
 if t == 'BlockDriverState *':
@@ -114,7 +167,13 @@ def gen_wrapper(func: FuncDecl) -> str:
 bs = 'child->bs'
 else:
 bs = 'blk_bs(blk)'
-struct_name = snake_to_camel(name)
+func.bs = bs
+func.struct_name = snake_to_camel(func.name)
+struct_name = func.struct_name
+
+creation_function = create_g_c_w
+if func.create_only_co:
+creation_function = create_coroutine_only
 
 return f"""\
 /*
@@ -136,23 +195,7 @@ def gen_wrapper(func: FuncDecl) -> str:
 aio_wait_kick();
 }}
 
-int {func.name}({ func.gen_list('{decl}') })
-{{
-if (qemu_in_coroutine()) {{
-return {name}({ func.gen_list('{name}') });
-}} else {{
-{struct_name} s = {{
-.poll_state.bs = {bs},
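The variant-matching regex this patch introduces can be exercised standalone. This sketch spells out the named groups (`variant`, `wrapper_name`, `args`) that the script reads back via `m.group(...)`:

```python
import re

# Pattern from block-coroutine-wrapper.py, with the named groups spelled out.
func_decl_re = re.compile(r'^int\s*generated_co_wrapper'
                          r'(?P<variant>(_[a-z][a-z0-9_]*)?)\s*'
                          r'(?P<wrapper_name>[a-z][a-z0-9_]*)'
                          r'\((?P<args>[^)]*)\);$', re.MULTILINE)

text = """\
int generated_co_wrapper bdrv_pread(BdrvChild *child, int64_t offset);
int generated_co_wrapper_simple blk_flush(BlockBackend *blk);
"""

matches = [(m.group('variant'), m.group('wrapper_name'))
           for m in func_decl_re.finditer(text)]
assert matches == [('', 'bdrv_pread'), ('_simple', 'blk_flush')]
```

The empty variant selects the existing g_c_w behaviour, while `_simple` selects the new coroutine-only wrapper.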

[PATCH v4 05/11] block-coroutine-wrapper.py: default to main loop aiocontext if function does not have a BlockDriverState parameter

2022-11-16 Thread Emanuele Giuseppe Esposito
Basically BdrvPollCo->bs is only used by bdrv_poll_co(), and the
functions that it uses both rely on bdrv_get_aio_context(), which
defaults to qemu_get_aio_context() if bs is NULL.

Therefore pass NULL to BdrvPollCo to automatically generate a function
that creates and runs a coroutine in the main loop.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 scripts/block-coroutine-wrapper.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/scripts/block-coroutine-wrapper.py 
b/scripts/block-coroutine-wrapper.py
index f88ef53964..0f842386d4 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -152,8 +152,6 @@ def create_coroutine_only(func: FuncDecl) -> str:
 def gen_wrapper(func: FuncDecl) -> str:
 assert not '_co_' in func.name
 assert func.return_type == 'int'
-assert func.args[0].type in ['BlockDriverState *', 'BdrvChild *',
- 'BlockBackend *']
 
 subsystem, subname = func.name.split('_', 1)
 
@@ -165,8 +163,10 @@ def gen_wrapper(func: FuncDecl) -> str:
 bs = 'bs'
 elif t == 'BdrvChild *':
 bs = 'child->bs'
-else:
+elif t == 'BlockBackend *':
 bs = 'blk_bs(blk)'
+else:
+bs = 'NULL'
 func.bs = bs
 func.struct_name = snake_to_camel(func.name)
 struct_name = func.struct_name
-- 
2.31.1




[PATCH v4 09/11] block: bdrv_create_file is a coroutine_fn

2022-11-16 Thread Emanuele Giuseppe Esposito
It is always called in coroutine_fn callbacks, therefore
it can directly call bdrv_co_create().

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 6 --
 include/block/block-global-state.h | 3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index c610a32e77..7a4c3eb540 100644
--- a/block.c
+++ b/block.c
@@ -534,6 +534,7 @@ static int coroutine_fn bdrv_co_create(BlockDriver *drv, 
const char *filename,
 int ret;
 char *filename_copy;
 GLOBAL_STATE_CODE();
+assert(qemu_in_coroutine());
 assert(*errp == NULL);
 assert(drv);
 
@@ -725,7 +726,8 @@ out:
 return ret;
 }
 
-int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp)
+int coroutine_fn bdrv_create_file(const char *filename, QemuOpts *opts,
+  Error **errp)
 {
 QemuOpts *protocol_opts;
 BlockDriver *drv;
@@ -766,7 +768,7 @@ int bdrv_create_file(const char *filename, QemuOpts *opts, 
Error **errp)
 goto out;
 }
 
-ret = bdrv_create(drv, filename, protocol_opts, errp);
+ret = bdrv_co_create(drv, filename, protocol_opts, errp);
 out:
 qemu_opts_del(protocol_opts);
 qobject_unref(qdict);
diff --git a/include/block/block-global-state.h 
b/include/block/block-global-state.h
index 00e0cf8aea..6f35ed99e3 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -57,7 +57,8 @@ BlockDriver *bdrv_find_protocol(const char *filename,
 BlockDriver *bdrv_find_format(const char *format_name);
 int bdrv_create(BlockDriver *drv, const char* filename,
 QemuOpts *opts, Error **errp);
-int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp);
+int coroutine_fn bdrv_create_file(const char *filename, QemuOpts *opts,
+  Error **errp);
 
 BlockDriverState *bdrv_new(void);
 int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
-- 
2.31.1




[PATCH v4 01/11] block-copy: add missing coroutine_fn annotations

2022-11-16 Thread Emanuele Giuseppe Esposito
These functions end up calling bdrv_common_block_status_above(), a
generated_co_wrapper function.
In addition, they also happen to be always called in coroutine context,
meaning all callers are coroutine_fn.
This means that the g_c_w function will enter the qemu_in_coroutine()
case and eventually suspend (or in other words call qemu_coroutine_yield()).
Therefore we need to mark such functions coroutine_fn too.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-copy.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index bb947afdda..f33ab1d0b6 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -577,8 +577,9 @@ static coroutine_fn int block_copy_task_entry(AioTask *task)
 return ret;
 }
 
-static int block_copy_block_status(BlockCopyState *s, int64_t offset,
-   int64_t bytes, int64_t *pnum)
+static coroutine_fn int block_copy_block_status(BlockCopyState *s,
+int64_t offset,
+int64_t bytes, int64_t *pnum)
 {
 int64_t num;
 BlockDriverState *base;
@@ -613,8 +614,9 @@ static int block_copy_block_status(BlockCopyState *s, int64_t offset,
  * Check if the cluster starting at offset is allocated or not.
  * return via pnum the number of contiguous clusters sharing this allocation.
  */
-static int block_copy_is_cluster_allocated(BlockCopyState *s, int64_t offset,
-   int64_t *pnum)
+static int coroutine_fn block_copy_is_cluster_allocated(BlockCopyState *s,
+int64_t offset,
+int64_t *pnum)
 {
 BlockDriverState *bs = s->source->bs;
 int64_t count, total_count = 0;
@@ -669,8 +671,9 @@ void block_copy_reset(BlockCopyState *s, int64_t offset, int64_t bytes)
  * @return 0 when the cluster at @offset was unallocated,
  * 1 otherwise, and -ret on error.
  */
-int64_t block_copy_reset_unallocated(BlockCopyState *s,
- int64_t offset, int64_t *count)
+int64_t coroutine_fn block_copy_reset_unallocated(BlockCopyState *s,
+  int64_t offset,
+  int64_t *count)
 {
 int ret;
 int64_t clusters, bytes;
-- 
2.31.1




[PATCH v4 07/11] block/vmdk: add missing coroutine_fn annotations

2022-11-16 Thread Emanuele Giuseppe Esposito
These functions end up calling bdrv_create(), implemented as a
generated_co_wrapper function.
In addition, they also happen to be always called in coroutine context,
meaning all callers are coroutine_fn.
This means that the g_c_w function will enter the qemu_in_coroutine()
case and eventually suspend (or in other words call qemu_coroutine_yield()).
Therefore we need to mark such functions coroutine_fn too.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/vmdk.c | 36 +++-
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index 26376352b9..0c32bf2e83 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -2285,10 +2285,11 @@ exit:
 return ret;
 }
 
-static int vmdk_create_extent(const char *filename, int64_t filesize,
-  bool flat, bool compress, bool zeroed_grain,
-  BlockBackend **pbb,
-  QemuOpts *opts, Error **errp)
+static int coroutine_fn vmdk_create_extent(const char *filename,
+   int64_t filesize, bool flat,
+   bool compress, bool zeroed_grain,
+   BlockBackend **pbb,
+   QemuOpts *opts, Error **errp)
 {
 int ret;
 BlockBackend *blk = NULL;
@@ -2366,14 +2367,14 @@ static int filename_decompose(const char *filename, char *path, char *prefix,
  *   non-split format.
  * idx >= 1: get the n-th extent if in a split subformat
  */
-typedef BlockBackend *(*vmdk_create_extent_fn)(int64_t size,
-   int idx,
-   bool flat,
-   bool split,
-   bool compress,
-   bool zeroed_grain,
-   void *opaque,
-   Error **errp);
+typedef BlockBackend * coroutine_fn (*vmdk_create_extent_fn)(int64_t size,
+ int idx,
+ bool flat,
+ bool split,
+ bool compress,
+ bool zeroed_grain,
+ void *opaque,
+ Error **errp);
 
 static void vmdk_desc_add_extent(GString *desc,
  const char *extent_line_fmt,
@@ -2616,7 +2617,7 @@ typedef struct {
 QemuOpts *opts;
 } VMDKCreateOptsData;
 
-static BlockBackend *vmdk_co_create_opts_cb(int64_t size, int idx,
+static BlockBackend * coroutine_fn vmdk_co_create_opts_cb(int64_t size, int idx,
 bool flat, bool split, bool compress,
 bool zeroed_grain, void *opaque,
 Error **errp)
@@ -2768,10 +2769,11 @@ exit:
 return ret;
 }
 
-static BlockBackend *vmdk_co_create_cb(int64_t size, int idx,
-   bool flat, bool split, bool compress,
-   bool zeroed_grain, void *opaque,
-   Error **errp)
+static BlockBackend * coroutine_fn vmdk_co_create_cb(int64_t size, int idx,
+ bool flat, bool split,
+ bool compress,
+ bool zeroed_grain,
+ void *opaque, Error **errp)
 {
 int ret;
 BlockDriverState *bs;
-- 
2.31.1




[PATCH v4 02/11] nbd/server.c: add missing coroutine_fn annotations

2022-11-16 Thread Emanuele Giuseppe Esposito
These functions end up calling bdrv_*() implemented as generated_co_wrapper
functions.
In addition, they also happen to be always called in coroutine context,
meaning all callers are coroutine_fn.
This means that the g_c_w function will enter the qemu_in_coroutine()
case and eventually suspend (or in other words call qemu_coroutine_yield()).
Therefore we need to mark such functions coroutine_fn too.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 nbd/server.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index ada16089f3..e2eec0cf46 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2141,8 +2141,9 @@ static int nbd_extent_array_add(NBDExtentArray *ea,
 return 0;
 }
 
-static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
-  uint64_t bytes, NBDExtentArray *ea)
+static int coroutine_fn blockstatus_to_extents(BlockDriverState *bs,
+   uint64_t offset, uint64_t bytes,
+   NBDExtentArray *ea)
 {
 while (bytes) {
 uint32_t flags;
@@ -2168,8 +2169,9 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
 return 0;
 }
 
-static int blockalloc_to_extents(BlockDriverState *bs, uint64_t offset,
- uint64_t bytes, NBDExtentArray *ea)
+static int coroutine_fn blockalloc_to_extents(BlockDriverState *bs,
+  uint64_t offset, uint64_t bytes,
+  NBDExtentArray *ea)
 {
 while (bytes) {
 int64_t num;
@@ -2220,11 +2222,12 @@ static int nbd_co_send_extents(NBDClient *client, uint64_t handle,
 }
 
 /* Get block status from the exported device and send it to the client */
-static int nbd_co_send_block_status(NBDClient *client, uint64_t handle,
-BlockDriverState *bs, uint64_t offset,
-uint32_t length, bool dont_fragment,
-bool last, uint32_t context_id,
-Error **errp)
+static int
+coroutine_fn nbd_co_send_block_status(NBDClient *client, uint64_t handle,
+  BlockDriverState *bs, uint64_t offset,
+  uint32_t length, bool dont_fragment,
+  bool last, uint32_t context_id,
+  Error **errp)
 {
 int ret;
 unsigned int nb_extents = dont_fragment ? 1 : NBD_MAX_BLOCK_STATUS_EXTENTS;
-- 
2.31.1




[PATCH v4 03/11] block-backend: replace bdrv_*_above with blk_*_above

2022-11-16 Thread Emanuele Giuseppe Esposito
Avoid mixing bdrv_* functions with blk_*, so create blk_* counterparts
for:
- bdrv_block_status_above
- bdrv_is_allocated_above

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c | 21 +
 block/commit.c|  4 ++--
 include/sysemu/block-backend-io.h |  9 +
 nbd/server.c  | 28 ++--
 qemu-img.c|  4 ++--
 5 files changed, 48 insertions(+), 18 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 742efa7955..333d50fb3f 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1424,6 +1424,27 @@ int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
 return blk_co_pwritev_part(blk, offset, bytes, qiov, 0, flags);
 }
 
+int coroutine_fn blk_block_status_above(BlockBackend *blk,
+BlockDriverState *base,
+int64_t offset, int64_t bytes,
+int64_t *pnum, int64_t *map,
+BlockDriverState **file)
+{
+IO_CODE();
+return bdrv_block_status_above(blk_bs(blk), base, offset, bytes, pnum, map,
+   file);
+}
+
+int coroutine_fn blk_is_allocated_above(BlockBackend *blk,
+BlockDriverState *base,
+bool include_base, int64_t offset,
+int64_t bytes, int64_t *pnum)
+{
+IO_CODE();
+return bdrv_is_allocated_above(blk_bs(blk), base, include_base, offset,
+   bytes, pnum);
+}
+
 typedef struct BlkRwCo {
 BlockBackend *blk;
 int64_t offset;
diff --git a/block/commit.c b/block/commit.c
index 0029b31944..9d4d908344 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -155,8 +155,8 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
 break;
 }
 /* Copy if allocated above the base */
-ret = bdrv_is_allocated_above(blk_bs(s->top), s->base_overlay, true,
-  offset, COMMIT_BUFFER_SIZE, &n);
+ret = blk_is_allocated_above(s->top, s->base_overlay, true,
+ offset, COMMIT_BUFFER_SIZE, &n);
 copy = (ret > 0);
 trace_commit_one_iteration(s, offset, n, ret);
 if (copy) {
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index 50f5aa2e07..a47cb825e5 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -92,6 +92,15 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
int64_t bytes, BdrvRequestFlags read_flags,
BdrvRequestFlags write_flags);
 
+int coroutine_fn blk_block_status_above(BlockBackend *blk,
+BlockDriverState *base,
+int64_t offset, int64_t bytes,
+int64_t *pnum, int64_t *map,
+BlockDriverState **file);
+int coroutine_fn blk_is_allocated_above(BlockBackend *blk,
+BlockDriverState *base,
+bool include_base, int64_t offset,
+int64_t bytes, int64_t *pnum);
 
 /*
  * "I/O or GS" API functions. These functions can run without
diff --git a/nbd/server.c b/nbd/server.c
index e2eec0cf46..6389b46396 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1991,7 +1991,7 @@ static int coroutine_fn nbd_co_send_structured_error(NBDClient *client,
 }
 
 /* Do a sparse read and send the structured reply to the client.
- * Returns -errno if sending fails. bdrv_block_status_above() failure is
+ * Returns -errno if sending fails. blk_block_status_above() failure is
  * reported to the client, at which point this function succeeds.
  */
 static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
@@ -2007,10 +2007,10 @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
 
 while (progress < size) {
 int64_t pnum;
-int status = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
- offset + progress,
- size - progress, &pnum, NULL,
- NULL);
+int status = blk_block_status_above(exp->common.blk, NULL,
+offset + progress,
+size - progress, &pnum, NULL,
+NULL);
 bool final;
 
 if (status < 0) {
@@ -2141,14 +2141,14 @@ static int nbd_extent_array_add(NBDExtentArray *ea,
 return 0;
 }
 
-static int coroutine_fn 

[PATCH v4 11/11] block/dirty-bitmap: convert coroutine-only functions to generated_co_wrapper_simple

2022-11-16 Thread Emanuele Giuseppe Esposito
bdrv_can_store_new_dirty_bitmap and bdrv_remove_persistent_dirty_bitmap
check if they are running in a coroutine, directly calling the
coroutine callback if it's the case.
However, no coroutine actually calls these functions, so that check
can be removed and the wrapper creation can be offloaded to
g_c_w_simple.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/dirty-bitmap.c | 88 +---
 block/meson.build|  1 +
 include/block/block-common.h |  5 +-
 include/block/block-io.h |  9 +++-
 include/block/dirty-bitmap.h | 10 +++-
 5 files changed, 21 insertions(+), 92 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index bf3dc0512a..21cf592889 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -388,7 +388,7 @@ void bdrv_release_named_dirty_bitmaps(BlockDriverState *bs)
  * not fail.
  * This function doesn't release corresponding BdrvDirtyBitmap.
  */
-static int coroutine_fn
+int coroutine_fn
 bdrv_co_remove_persistent_dirty_bitmap(BlockDriverState *bs, const char *name,
Error **errp)
 {
@@ -399,45 +399,6 @@ bdrv_co_remove_persistent_dirty_bitmap(BlockDriverState *bs, const char *name,
 return 0;
 }
 
-typedef struct BdrvRemovePersistentDirtyBitmapCo {
-BlockDriverState *bs;
-const char *name;
-Error **errp;
-int ret;
-} BdrvRemovePersistentDirtyBitmapCo;
-
-static void coroutine_fn
-bdrv_co_remove_persistent_dirty_bitmap_entry(void *opaque)
-{
-BdrvRemovePersistentDirtyBitmapCo *s = opaque;
-
-s->ret = bdrv_co_remove_persistent_dirty_bitmap(s->bs, s->name, s->errp);
-aio_wait_kick();
-}
-
-int bdrv_remove_persistent_dirty_bitmap(BlockDriverState *bs, const char *name,
-Error **errp)
-{
-if (qemu_in_coroutine()) {
-return bdrv_co_remove_persistent_dirty_bitmap(bs, name, errp);
-} else {
-Coroutine *co;
-BdrvRemovePersistentDirtyBitmapCo s = {
-.bs = bs,
-.name = name,
-.errp = errp,
-.ret = -EINPROGRESS,
-};
-
-co = qemu_coroutine_create(bdrv_co_remove_persistent_dirty_bitmap_entry,
-   &s);
-bdrv_coroutine_enter(bs, co);
-BDRV_POLL_WHILE(bs, s.ret == -EINPROGRESS);
-
-return s.ret;
-}
-}
-
 bool
 bdrv_supports_persistent_dirty_bitmap(BlockDriverState *bs)
 {
@@ -447,7 +408,7 @@ bdrv_supports_persistent_dirty_bitmap(BlockDriverState *bs)
 return false;
 }
 
-static bool coroutine_fn
+bool coroutine_fn
 bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
uint32_t granularity, Error **errp)
 {
@@ -470,51 +431,6 @@ bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
 return drv->bdrv_co_can_store_new_dirty_bitmap(bs, name, granularity, errp);
 }
 
-typedef struct BdrvCanStoreNewDirtyBitmapCo {
-BlockDriverState *bs;
-const char *name;
-uint32_t granularity;
-Error **errp;
-bool ret;
-
-bool in_progress;
-} BdrvCanStoreNewDirtyBitmapCo;
-
-static void coroutine_fn bdrv_co_can_store_new_dirty_bitmap_entry(void *opaque)
-{
-BdrvCanStoreNewDirtyBitmapCo *s = opaque;
-
-s->ret = bdrv_co_can_store_new_dirty_bitmap(s->bs, s->name, s->granularity,
-s->errp);
-s->in_progress = false;
-aio_wait_kick();
-}
-
-bool bdrv_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
- uint32_t granularity, Error **errp)
-{
-IO_CODE();
-if (qemu_in_coroutine()) {
-return bdrv_co_can_store_new_dirty_bitmap(bs, name, granularity, errp);
-} else {
-Coroutine *co;
-BdrvCanStoreNewDirtyBitmapCo s = {
-.bs = bs,
-.name = name,
-.granularity = granularity,
-.errp = errp,
-.in_progress = true,
-};
-
-co = qemu_coroutine_create(bdrv_co_can_store_new_dirty_bitmap_entry,
-   &s);
-bdrv_coroutine_enter(bs, co);
-BDRV_POLL_WHILE(bs, s.in_progress);
-
-return s.ret;
-}
-}
-
 void bdrv_disable_dirty_bitmap(BdrvDirtyBitmap *bitmap)
 {
 bdrv_dirty_bitmaps_lock(bitmap->bs);
diff --git a/block/meson.build b/block/meson.build
index b7c68b83a3..c48a59571e 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -137,6 +137,7 @@ block_gen_c = custom_target('block-gen.c',
 output: 'block-gen.c',
 input: files(
   '../include/block/block-io.h',
+  '../include/block/dirty-bitmap.h',
   '../include/block/block-global-state.h',
   '../include/sysemu/block-backend-io.h',
   

[PATCH v4 00/11] Still more coroutine and various fixes in block layer

2022-11-16 Thread Emanuele Giuseppe Esposito
This is a dump of all minor coroutine-related fixes found while looking
around and testing various things in the QEMU block layer.

Patches aim to:
- add missing coroutine_fn annotation to the functions
- simplify to avoid the typical "if in coroutine: fn()
  // else create_coroutine(fn)" pattern already present in generated_co_wrapper
  functions.
- make sure that if a BlockDriver callback is defined as coroutine_fn, then
  it is always running in a coroutine.
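
The "if in coroutine: fn() // else create_coroutine(fn)" dispatch mentioned
above can be sketched as a small Python model (purely illustrative: asyncio
stands in for qemu_in_coroutine() and BDRV_POLL_WHILE(), and the function
names below are placeholders, not the real QEMU C code):

```python
import asyncio

async def bdrv_co_create(name):
    # Placeholder for the coroutine_fn; real code would yield while doing I/O.
    await asyncio.sleep(0)
    return f"created {name}"

def bdrv_create(name):
    # Placeholder for the generated_co_wrapper.
    try:
        asyncio.get_running_loop()   # rough analogue of qemu_in_coroutine()
    except RuntimeError:
        # Not in coroutine context: create a coroutine and poll until it
        # finishes, like qemu_coroutine_create() + BDRV_POLL_WHILE().
        return asyncio.run(bdrv_co_create(name))
    # Fast path: already in coroutine context, call the coroutine_fn
    # directly (the caller is itself a coroutine and awaits the result).
    return bdrv_co_create(name)

print(bdrv_create("disk.img"))   # prints "created disk.img"
```

The point of the series is that this boilerplate never needs to be written by
hand: the generator emits it, and purely-coroutine callers skip it entirely.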

This series is based on Kevin Wolf's series "block: Simplify drain".

Based-on: <20221108123738.530873-1-kw...@redhat.com>

Emanuele
---
v4:
* use v2 commit messages
* introduce generated_co_wrapper_simple to simplify patches

v3:
* Remove patch 1, rebase on Kevin's "drain simplification" series

v2:
* clarified commit message in patches 2/3/6 on why we add coroutine_fn

Emanuele Giuseppe Esposito (11):
  block-copy: add missing coroutine_fn annotations
  nbd/server.c: add missing coroutine_fn annotations
  block-backend: replace bdrv_*_above with blk_*_above
  block-coroutine-wrapper.py: introduce generated_co_wrapper_simple
  block-coroutine-wrapper.py: default to main loop aiocontext if
function does not have a BlockDriverState parameter
  block-coroutine-wrapper.py: support also basic return types
  block/vmdk: add missing coroutine_fn annotations
  block: distinguish between bdrv_create running in coroutine and not
  block: bdrv_create_file is a coroutine_fn
  block: convert bdrv_create to generated_co_wrapper_simple
  block/dirty-bitmap: convert coroutine-only functions to
generated_co_wrapper_simple

 block.c|  68 +--
 block/block-backend.c  |  21 ++
 block/block-copy.c |  15 +++--
 block/block-gen.h  |   5 +-
 block/commit.c |   4 +-
 block/dirty-bitmap.c   |  88 +
 block/meson.build  |   1 +
 block/vmdk.c   |  36 +-
 include/block/block-common.h   |   6 +-
 include/block/block-global-state.h |  13 +++-
 include/block/block-io.h   |   9 ++-
 include/block/dirty-bitmap.h   |  10 ++-
 include/sysemu/block-backend-io.h  |   9 +++
 nbd/server.c   |  43 ++--
 qemu-img.c |   4 +-
 scripts/block-coroutine-wrapper.py | 102 +
 16 files changed, 209 insertions(+), 225 deletions(-)

-- 
2.31.1




[PATCH v4 06/11] block-coroutine-wrapper.py: support also basic return types

2022-11-16 Thread Emanuele Giuseppe Esposito
Extend the regex to also cover the return type, pointers included.
This implies that the value returned by the generated function is no
longer necessarily a plain "int", but the custom return type.
Therefore remove poll_state->ret and instead use a per-function
custom "ret" field.
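
The extended match can be sanity-checked outside the build. A quick sketch
(group names written out to match the m.group() calls in the patch; the two
declarations below are made up for illustration):

```python
import re

# The extended pattern from the patch, with its named groups spelled out:
# return_type, variant, wrapper_name, args.
func_decl_re = re.compile(r'^(?P<return_type>[a-zA-Z][a-zA-Z0-9_]* [*]?)'
                          r'\s*generated_co_wrapper'
                          r'(?P<variant>(_[a-z][a-z0-9_]*)?)\s*'
                          r'(?P<wrapper_name>[a-z][a-z0-9_]*)'
                          r'\((?P<args>[^)]*)\);$', re.MULTILINE)

# Hypothetical declarations: a plain "int" return and a pointer return.
decls = (
    "int generated_co_wrapper_simple bdrv_create(BlockDriver *drv, "
    "QemuOpts *opts, Error **errp);\n"
    "BlockBackend * generated_co_wrapper blk_example(BlockDriverState *bs);\n"
)

for m in func_decl_re.finditer(decls):
    print(m.group('return_type').strip(),
          m.group('variant'), m.group('wrapper_name'))
```

The second declaration shows why the `[*]?` part matters: without it the
pointer-returning wrappers could not be matched at all.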

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-gen.h  |  5 +
 scripts/block-coroutine-wrapper.py | 19 +++
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/block-gen.h b/block/block-gen.h
index f80cf4897d..8ac9d5bd4f 100644
--- a/block/block-gen.h
+++ b/block/block-gen.h
@@ -32,18 +32,15 @@
 typedef struct BdrvPollCo {
 BlockDriverState *bs;
 bool in_progress;
-int ret;
 Coroutine *co; /* Keep pointer here for debugging */
 } BdrvPollCo;
 
-static inline int bdrv_poll_co(BdrvPollCo *s)
+static inline void bdrv_poll_co(BdrvPollCo *s)
 {
 assert(!qemu_in_coroutine());
 
 bdrv_coroutine_enter(s->bs, s->co);
 BDRV_POLL_WHILE(s->bs, s->in_progress);
-
-return s->ret;
 }
 
 #endif /* BLOCK_BLOCK_GEN_H */
diff --git a/scripts/block-coroutine-wrapper.py b/scripts/block-coroutine-wrapper.py
index 0f842386d4..21ecb3e896 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -80,7 +80,8 @@ def gen_block(self, format: str) -> str:
 
 
 # Match wrappers declared with a generated_co_wrapper mark
-func_decl_re = re.compile(r'^int\s*generated_co_wrapper'
+func_decl_re = re.compile(r'^(?P<return_type>[a-zA-Z][a-zA-Z0-9_]* [*]?)'
+  r'\s*generated_co_wrapper'
   r'(?P<variant>(_[a-z][a-z0-9_]*)?)\s*'
   r'(?P<wrapper_name>[a-z][a-z0-9_]*)'
   r'\((?P<args>[^)]*)\);$', re.MULTILINE)
@@ -88,7 +89,7 @@ def gen_block(self, format: str) -> str:
 
 def func_decl_iter(text: str) -> Iterator:
 for m in func_decl_re.finditer(text):
-yield FuncDecl(return_type='int',
+yield FuncDecl(return_type=m.group('return_type'),
name=m.group('wrapper_name'),
args=m.group('args'),
variant=m.group('variant'))
@@ -109,7 +110,7 @@ def create_g_c_w(func: FuncDecl) -> str:
 name = func.co_name
 struct_name = func.struct_name
 return f"""\
-int {func.name}({ func.gen_list('{decl}') })
+{func.return_type} {func.name}({ func.gen_list('{decl}') })
 {{
 if (qemu_in_coroutine()) {{
 return {name}({ func.gen_list('{name}') });
@@ -123,7 +124,8 @@ def create_g_c_w(func: FuncDecl) -> str:
 
 s.poll_state.co = qemu_coroutine_create({name}_entry, &s);
 
-return bdrv_poll_co(&s.poll_state);
+bdrv_poll_co(&s.poll_state);
+return s.ret;
 }}
 }}"""
 
@@ -133,7 +135,7 @@ def create_coroutine_only(func: FuncDecl) -> str:
 name = func.co_name
 struct_name = func.struct_name
 return f"""\
-int {func.name}({ func.gen_list('{decl}') })
+{func.return_type} {func.name}({ func.gen_list('{decl}') })
 {{
 assert(!qemu_in_coroutine());
 {struct_name} s = {{
@@ -145,13 +147,13 @@ def create_coroutine_only(func: FuncDecl) -> str:
 
 s.poll_state.co = qemu_coroutine_create({name}_entry, &s);
 
-return bdrv_poll_co(&s.poll_state);
+bdrv_poll_co(&s.poll_state);
+return s.ret;
 }}"""
 
 
 def gen_wrapper(func: FuncDecl) -> str:
 assert not '_co_' in func.name
-assert func.return_type == 'int'
 
 subsystem, subname = func.name.split('_', 1)
 
@@ -182,6 +184,7 @@ def gen_wrapper(func: FuncDecl) -> str:
 
 typedef struct {struct_name} {{
 BdrvPollCo poll_state;
+{func.return_type} ret;
 { func.gen_block('{decl};') }
 }} {struct_name};
 
@@ -189,7 +192,7 @@ def gen_wrapper(func: FuncDecl) -> str:
 {{
 {struct_name} *s = opaque;
 
-s->poll_state.ret = {name}({ func.gen_list('s->{name}') });
+s->ret = {name}({ func.gen_list('s->{name}') });
 s->poll_state.in_progress = false;
 
 aio_wait_kick();
-- 
2.31.1




[PATCH v4 08/11] block: distinguish between bdrv_create running in coroutine and not

2022-11-16 Thread Emanuele Giuseppe Esposito
Call two different functions depending on whether bdrv_create
is in coroutine or not, following the same pattern as
generated_co_wrapper functions.

This also makes it possible to call the coroutine function directly,
without using CreateCo or relying on bdrv_create().

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c | 76 -
 1 file changed, 38 insertions(+), 38 deletions(-)

diff --git a/block.c b/block.c
index 577639c7e0..c610a32e77 100644
--- a/block.c
+++ b/block.c
@@ -528,66 +528,66 @@ typedef struct CreateCo {
 Error *err;
 } CreateCo;
 
-static void coroutine_fn bdrv_create_co_entry(void *opaque)
+static int coroutine_fn bdrv_co_create(BlockDriver *drv, const char *filename,
+   QemuOpts *opts, Error **errp)
 {
-Error *local_err = NULL;
 int ret;
+char *filename_copy;
+GLOBAL_STATE_CODE();
+assert(*errp == NULL);
+assert(drv);
+
+if (!drv->bdrv_co_create_opts) {
+error_setg(errp, "Driver '%s' does not support image creation",
+   drv->format_name);
+return -ENOTSUP;
+}
 
+filename_copy = g_strdup(filename);
+ret = drv->bdrv_co_create_opts(drv, filename_copy, opts, errp);
+
+if (ret < 0 && !*errp) {
+error_setg_errno(errp, -ret, "Could not create image");
+}
+
+g_free(filename_copy);
+return ret;
+}
+
+static void coroutine_fn bdrv_create_co_entry(void *opaque)
+{
 CreateCo *cco = opaque;
-assert(cco->drv);
 GLOBAL_STATE_CODE();
 
-ret = cco->drv->bdrv_co_create_opts(cco->drv,
-cco->filename, cco->opts, &local_err);
-error_propagate(&cco->err, local_err);
-cco->ret = ret;
+cco->ret = bdrv_co_create(cco->drv, cco->filename, cco->opts, &cco->err);
 }
 
 int bdrv_create(BlockDriver *drv, const char* filename,
 QemuOpts *opts, Error **errp)
 {
-int ret;
-
 GLOBAL_STATE_CODE();
 
-Coroutine *co;
-CreateCo cco = {
-.drv = drv,
-.filename = g_strdup(filename),
-.opts = opts,
-.ret = NOT_DONE,
-.err = NULL,
-};
-
-if (!drv->bdrv_co_create_opts) {
-error_setg(errp, "Driver '%s' does not support image creation", drv->format_name);
-ret = -ENOTSUP;
-goto out;
-}
-
 if (qemu_in_coroutine()) {
 /* Fast-path if already in coroutine context */
-bdrv_create_co_entry();
+return bdrv_co_create(drv, filename, opts, errp);
 } else {
+Coroutine *co;
+CreateCo cco = {
+.drv = drv,
+.filename = filename,
+.opts = opts,
+.ret = NOT_DONE,
+.err = NULL,
+};
+
 co = qemu_coroutine_create(bdrv_create_co_entry, &cco);
 qemu_coroutine_enter(co);
 while (cco.ret == NOT_DONE) {
 aio_poll(qemu_get_aio_context(), true);
 }
+error_propagate(errp, cco.err);
+return cco.ret;
 }
-
-ret = cco.ret;
-if (ret < 0) {
-if (cco.err) {
-error_propagate(errp, cco.err);
-} else {
-error_setg_errno(errp, -ret, "Could not create image");
-}
-}
-
-out:
-g_free(cco.filename);
-return ret;
 }
 
 /**
-- 
2.31.1




[PATCH v4 10/11] block: convert bdrv_create to generated_co_wrapper_simple

2022-11-16 Thread Emanuele Giuseppe Esposito
This function is never called in coroutine context; therefore, instead
of manually creating a new coroutine, delegate the wrapper to the
block-coroutine-wrapper script by defining it as g_c_w_simple.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 38 +-
 include/block/block-global-state.h | 10 ++--
 2 files changed, 9 insertions(+), 39 deletions(-)

diff --git a/block.c b/block.c
index 7a4c3eb540..d3e168408a 100644
--- a/block.c
+++ b/block.c
@@ -528,7 +528,7 @@ typedef struct CreateCo {
 Error *err;
 } CreateCo;
 
-static int coroutine_fn bdrv_co_create(BlockDriver *drv, const char *filename,
+int coroutine_fn bdrv_co_create(BlockDriver *drv, const char *filename,
QemuOpts *opts, Error **errp)
 {
 int ret;
@@ -555,42 +555,6 @@ static int coroutine_fn bdrv_co_create(BlockDriver *drv, const char *filename,
 return ret;
 }
 
-static void coroutine_fn bdrv_create_co_entry(void *opaque)
-{
-CreateCo *cco = opaque;
-GLOBAL_STATE_CODE();
-
-cco->ret = bdrv_co_create(cco->drv, cco->filename, cco->opts, &cco->err);
-}
-
-int bdrv_create(BlockDriver *drv, const char* filename,
-QemuOpts *opts, Error **errp)
-{
-GLOBAL_STATE_CODE();
-
-if (qemu_in_coroutine()) {
-/* Fast-path if already in coroutine context */
-return bdrv_co_create(drv, filename, opts, errp);
-} else {
-Coroutine *co;
-CreateCo cco = {
-.drv = drv,
-.filename = filename,
-.opts = opts,
-.ret = NOT_DONE,
-.err = NULL,
-};
-
-co = qemu_coroutine_create(bdrv_create_co_entry, &cco);
-qemu_coroutine_enter(co);
-while (cco.ret == NOT_DONE) {
-aio_poll(qemu_get_aio_context(), true);
-}
-error_propagate(errp, cco.err);
-return cco.ret;
-}
-}
-
 /**
  * Helper function for bdrv_create_file_fallback(): Resize @blk to at
  * least the given @minimum_size.
diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 6f35ed99e3..305336bdb9 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -55,8 +55,14 @@ BlockDriver *bdrv_find_protocol(const char *filename,
 bool allow_protocol_prefix,
 Error **errp);
 BlockDriver *bdrv_find_format(const char *format_name);
-int bdrv_create(BlockDriver *drv, const char* filename,
-QemuOpts *opts, Error **errp);
+
+int coroutine_fn bdrv_co_create(BlockDriver *drv, const char* filename,
+QemuOpts *opts, Error **errp);
+int generated_co_wrapper_simple bdrv_create(BlockDriver *drv,
+const char* filename,
+QemuOpts *opts,
+Error **errp);
+
 int coroutine_fn bdrv_create_file(const char *filename, QemuOpts *opts,
   Error **errp);
 
-- 
2.31.1




Re: [PATCH v2] m25p80: Improve error when the backend file size does not match the device

2022-11-16 Thread Francisco Iglesias
On [2022 Nov 15] Tue 16:10:00, Cédric Le Goater wrote:
> Currently, when a block backend is attached to a m25p80 device and the
> associated file size does not match the flash model, QEMU complains
> with the error message "failed to read the initial flash content".
> This is confusing for the user.
> 
> Use blk_check_size_and_read_all() instead of blk_pread() to improve
> the reported error.
> 
> Signed-off-by: Cédric Le Goater 

Reviewed-by: Francisco Iglesias 

> ---
>  hw/block/m25p80.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
> index 02adc87527..68a757abf3 100644
> --- a/hw/block/m25p80.c
> +++ b/hw/block/m25p80.c
> @@ -24,6 +24,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "sysemu/block-backend.h"
> +#include "hw/block/block.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/qdev-properties-system.h"
>  #include "hw/ssi/ssi.h"
> @@ -1614,8 +1615,7 @@ static void m25p80_realize(SSIPeripheral *ss, Error **errp)
>  trace_m25p80_binding(s);
>  s->storage = blk_blockalign(s->blk, s->size);
>  
> -if (blk_pread(s->blk, 0, s->size, s->storage, 0) < 0) {
> -error_setg(errp, "failed to read the initial flash content");
> +if (!blk_check_size_and_read_all(s->blk, s->storage, s->size, errp)) {
>  return;
>  }
>  } else {
> -- 
> 2.38.1
> 



Re: [PATCH v3] block/rbd: Add support for layered encryption

2022-11-16 Thread Daniel P . Berrangé
On Wed, Nov 16, 2022 at 10:23:52AM +, Daniel P. Berrangé wrote:
> On Wed, Nov 16, 2022 at 09:03:31AM +, Or Ozeri wrote:
> > > -Original Message-
> > > From: Daniel P. Berrangé 
> > > Sent: 15 November 2022 19:47
> > > To: Or Ozeri 
> > > Cc: qemu-de...@nongnu.org; qemu-block@nongnu.org; Danny Harnik
> > > ; idryo...@gmail.com
> > > Subject: [EXTERNAL] Re: [PATCH v3] block/rbd: Add support for layered
> > > encryption
> > > 
> > > AFAICT, supporting layered encryption shouldn't require anything other 
> > > than
> > > the 'parent' addition.
> > > 
> > 
> > Since the layered encryption API is new in librbd, we don't have to
> > support "luks" and "luks2" at all.
> > In librbd we are actually deprecating the use of "luks" and "luks2",
> > and instead ask users to use "luks-any".
> 
> Deprecating that is a bad idea. The security characteristics and
> feature set of LUKSv1 and LUKSv2 can be quite different. If a mgmt
> app is expecting the volume to be protected with LUKSv2, it should
> be stating that explicitly and not permit a silent downgrade if
> the volume was unexpectedly using LUKSv1.
> 
> > If we don't add "luks-any" here, we will need to implement
> > explicit cases for "luks" and "luks2" in the qemu_rbd_encryption_load2.
> > This looks like a kind of wasteful coding that won't be actually used
> > by users of the rbd driver in qemu.
> 
> It isn't wasteful - supporting the formats explicitly is desirable
> to prevent format downgrades.
> 
> > Anyhow, we need the "luks-any" option for our use-case, so if you
> > insist, I will first submit a patch to add "luks-any", before this
> > patch.
> 
> I'm pretty wary of any kind of automatic encryption format detection
> in QEMU. The automatic block driver format probing has been a long
> standing source of CVEs in QEMU and every single mgmt app above QEMU.

Having said that, normal Linux LUKS tools like cryptsetup or the systemd
LUKS integration will auto-detect luks1 vs luks2. All cryptsetup
commands also have an option to explicitly specify the format version.

So with that precedent I guess it is ok to add 'luks-any'.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v3 0/8] Still more coroutine and various fixes in block layer

2022-11-16 Thread Emanuele Giuseppe Esposito
I apologize, as discussed also in v2 I just realized I could introduce
generated_co_wrapper_simple already here and simplify patches 6 and 8.

Also I think commit messages are the old ones from v1.

I'll resend. Please ignore this series.

Emanuele

Am 16/11/2022 um 09:50 schrieb Emanuele Giuseppe Esposito:
> This is a dump of all minor coroutine-related fixes found while looking
> around and testing various things in the QEMU block layer.
> 
> Patches aim to:
> - add missing coroutine_fn annotation to the functions
> - simplify to avoid the typical "if in coroutine: fn()
>   // else create_coroutine(fn)" already present in generated_co_wraper
>   functions.
> - make sure that if a BlockDriver callback is defined as coroutine_fn, then
>   it is always running in a coroutine.
> 
> This serie is based on Kevin Wolf's series "block: Simplify drain".
> 
> Based-on: <20221108123738.530873-1-kw...@redhat.com>
> 
> Emanuele
> ---
> v3:
> * Remove patch 1, base on kevin "drain semplification serie"
> 
> v2:
> * clarified commit message in patches 2/3/6 on why we add coroutine_fn
> 
> Emanuele Giuseppe Esposito (8):
>   block-copy: add missing coroutine_fn annotations
>   nbd/server.c: add missing coroutine_fn annotations
>   block-backend: replace bdrv_*_above with blk_*_above
>   block: distinguish between bdrv_create running in coroutine and not
>   block/vmdk: add missing coroutine_fn annotations
>   block: bdrv_create_file is a coroutine_fn
>   block: bdrv_create is never called in coroutine context
>   block/dirty-bitmap: remove unnecessary qemu_in_coroutine() case
> 
>  block.c| 75 ++
>  block/block-backend.c  | 21 +
>  block/block-copy.c | 15 +++---
>  block/commit.c |  4 +-
>  block/dirty-bitmap.c   | 66 --
>  block/vmdk.c   | 36 +++---
>  include/block/block-global-state.h |  3 +-
>  include/sysemu/block-backend-io.h  |  9 
>  nbd/server.c   | 43 +
>  qemu-img.c |  4 +-
>  10 files changed, 151 insertions(+), 125 deletions(-)
> 




Re: [PATCH v3] block/rbd: Add support for layered encryption

2022-11-16 Thread Daniel P . Berrangé
On Wed, Nov 16, 2022 at 09:03:31AM +, Or Ozeri wrote:
> > -Original Message-
> > From: Daniel P. Berrangé 
> > Sent: 15 November 2022 19:47
> > To: Or Ozeri 
> > Cc: qemu-de...@nongnu.org; qemu-block@nongnu.org; Danny Harnik
> > ; idryo...@gmail.com
> > Subject: [EXTERNAL] Re: [PATCH v3] block/rbd: Add support for layered
> > encryption
> > 
> > AFAICT, supporting layered encryption shouldn't require anything other than
> > the 'parent' addition.
> > 
> 
> Since the layered encryption API is new in librbd, we don't have to
> support "luks" and "luks2" at all.
> In librbd we are actually deprecating the use of "luks" and "luks2",
> and instead ask users to use "luks-any".

Deprecating that is a bad idea. The security characteristics and
feature set of LUKSv1 and LUKSv2 can be quite different. If a mgmt
app is expecting the volume to be protected with LUKSv2, it should
be stating that explicitly and not permit a silent downgrade if
the volume was unexpectedly using LUKSv1.

> If we don't add "luks-any" here, we will need to implement
> explicit cases for "luks" and "luks2" in the qemu_rbd_encryption_load2.
> This looks like a kind of wasteful coding that won't be actually used
> by users of the rbd driver in qemu.

It isn't wasteful - supporting the formats explicitly is desirable
to prevent format downgrades.

> Anyhow, we need the "luks-any" option for our use-case, so if you
> insist, I will first submit a patch to add "luks-any", before this
> patch.

I'm pretty wary of any kind of automatic encryption format detection
in QEMU. The automatic block driver format probing has been a long
standing source of CVEs in QEMU and every single mgmt app above QEMU.

What is the problem with specifying the desired LUKS format explicitly?
The mgmt app should know what formats it wants to be using. It should
be possible to query format for existing volumes too.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
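
Daniel's closing point — that a mgmt app can query the format of an existing volume rather than rely on probing — works because the LUKS superblock carries an explicit, big-endian version field right after the magic. A minimal, self-contained sketch of such a check (illustrative only; this is not the librbd or QEMU API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A LUKS superblock starts with a 6-byte magic followed by a big-endian
 * 16-bit version field (1 for LUKSv1, 2 for LUKSv2). */
static const uint8_t luks_magic[6] = { 'L', 'U', 'K', 'S', 0xba, 0xbe };

/* Return the LUKS version found in the header, or -1 if it is not LUKS. */
static int luks_header_version(const uint8_t *hdr, size_t len)
{
    if (len < 8 || memcmp(hdr, luks_magic, sizeof(luks_magic)) != 0) {
        return -1;
    }
    return (hdr[6] << 8) | hdr[7]; /* big-endian uint16 */
}
```

With the version known up front, the mgmt app can request "luks" or "luks2" explicitly instead of asking for any-format detection.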




RE: [PATCH v3] block/rbd: Add support for layered encryption

2022-11-16 Thread Or Ozeri
> -Original Message-
> From: Daniel P. Berrangé 
> Sent: 15 November 2022 19:47
> To: Or Ozeri 
> Cc: qemu-de...@nongnu.org; qemu-block@nongnu.org; Danny Harnik
> ; idryo...@gmail.com
> Subject: [EXTERNAL] Re: [PATCH v3] block/rbd: Add support for layered
> encryption
> 
> AFAICT, supporting layered encryption shouldn't require anything other than
> the 'parent' addition.
> 

Since the layered encryption API is new in librbd, we don't have to support 
"luks" and "luks2" at all.
In librbd we are actually deprecating the use of "luks" and "luks2", and 
instead ask users to use "luks-any".
If we don't add "luks-any" here, we will need to implement explicit cases for 
"luks" and "luks2" in the qemu_rbd_encryption_load2. This looks like a kind of 
wasteful coding that won't be actually used by users of the rbd driver in qemu.
Anyhow, we need the "luks-any" option for our use-case, so if you insist, I 
will first submit a patch to add "luks-any", before this patch.


[PATCH v3 2/8] nbd/server.c: add missing coroutine_fn annotations

2022-11-16 Thread Emanuele Giuseppe Esposito
There are probably more missing, but right now it is necessary to
extend coroutine_fn to block{alloc,status}_to_extents, because
they use bdrv_* functions calling the generated_co_wrapper API, which
checks for the qemu_in_coroutine() case.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 nbd/server.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index ada16089f3..e2eec0cf46 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2141,8 +2141,9 @@ static int nbd_extent_array_add(NBDExtentArray *ea,
 return 0;
 }
 
-static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
-  uint64_t bytes, NBDExtentArray *ea)
+static int coroutine_fn blockstatus_to_extents(BlockDriverState *bs,
+   uint64_t offset, uint64_t bytes,
+   NBDExtentArray *ea)
 {
 while (bytes) {
 uint32_t flags;
@@ -2168,8 +2169,9 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
 return 0;
 }
 
-static int blockalloc_to_extents(BlockDriverState *bs, uint64_t offset,
- uint64_t bytes, NBDExtentArray *ea)
+static int coroutine_fn blockalloc_to_extents(BlockDriverState *bs,
+  uint64_t offset, uint64_t bytes,
+  NBDExtentArray *ea)
 {
 while (bytes) {
 int64_t num;
@@ -2220,11 +2222,12 @@ static int nbd_co_send_extents(NBDClient *client, uint64_t handle,
 }
 
 /* Get block status from the exported device and send it to the client */
-static int nbd_co_send_block_status(NBDClient *client, uint64_t handle,
-BlockDriverState *bs, uint64_t offset,
-uint32_t length, bool dont_fragment,
-bool last, uint32_t context_id,
-Error **errp)
+static int
+coroutine_fn nbd_co_send_block_status(NBDClient *client, uint64_t handle,
+  BlockDriverState *bs, uint64_t offset,
+  uint32_t length, bool dont_fragment,
+  bool last, uint32_t context_id,
+  Error **errp)
 {
 int ret;
 unsigned int nb_extents = dont_fragment ? 1 : NBD_MAX_BLOCK_STATUS_EXTENTS;
-- 
2.31.1




[PATCH v3 4/8] block: distinguish between bdrv_create running in coroutine and not

2022-11-16 Thread Emanuele Giuseppe Esposito
Call two different functions depending on whether bdrv_create
is in coroutine or not, following the same pattern as
generated_co_wrapper functions.

This also allows calling the coroutine function directly,
without using CreateCo or relying on bdrv_create().
It will also be useful when we add the graph rdlock to the
coroutine case.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c | 74 -
 1 file changed, 36 insertions(+), 38 deletions(-)

diff --git a/block.c b/block.c
index 577639c7e0..375c8056a3 100644
--- a/block.c
+++ b/block.c
@@ -528,66 +528,64 @@ typedef struct CreateCo {
 Error *err;
 } CreateCo;
 
-static void coroutine_fn bdrv_create_co_entry(void *opaque)
+static int coroutine_fn bdrv_co_create(BlockDriver *drv, const char *filename,
+   QemuOpts *opts, Error **errp)
 {
-Error *local_err = NULL;
 int ret;
+GLOBAL_STATE_CODE();
+assert(*errp == NULL);
+
+if (!drv->bdrv_co_create_opts) {
+error_setg(errp, "Driver '%s' does not support image creation",
+   drv->format_name);
+return -ENOTSUP;
+}
+
+ret = drv->bdrv_co_create_opts(drv, filename, opts, errp);
 
+if (ret < 0 && !*errp) {
+error_setg_errno(errp, -ret, "Could not create image");
+}
+
+return ret;
+}
+
+static void coroutine_fn bdrv_create_co_entry(void *opaque)
+{
 CreateCo *cco = opaque;
-assert(cco->drv);
 GLOBAL_STATE_CODE();
+assert(cco->drv);
 
-ret = cco->drv->bdrv_co_create_opts(cco->drv,
-cco->filename, cco->opts, &local_err);
-error_propagate(&cco->err, local_err);
-cco->ret = ret;
+cco->ret = bdrv_co_create(cco->drv, cco->filename, cco->opts, &cco->err);
 }
 
 int bdrv_create(BlockDriver *drv, const char* filename,
 QemuOpts *opts, Error **errp)
 {
-int ret;
-
 GLOBAL_STATE_CODE();
 
-Coroutine *co;
-CreateCo cco = {
-.drv = drv,
-.filename = g_strdup(filename),
-.opts = opts,
-.ret = NOT_DONE,
-.err = NULL,
-};
-
-if (!drv->bdrv_co_create_opts) {
-error_setg(errp, "Driver '%s' does not support image creation", drv->format_name);
-ret = -ENOTSUP;
-goto out;
-}
-
 if (qemu_in_coroutine()) {
 /* Fast-path if already in coroutine context */
-bdrv_create_co_entry(&cco);
+return bdrv_co_create(drv, filename, opts, errp);
 } else {
+Coroutine *co;
+CreateCo cco = {
+.drv = drv,
+.filename = g_strdup(filename),
+.opts = opts,
+.ret = NOT_DONE,
+.err = NULL,
+};
+
co = qemu_coroutine_create(bdrv_create_co_entry, &cco);
 qemu_coroutine_enter(co);
 while (cco.ret == NOT_DONE) {
 aio_poll(qemu_get_aio_context(), true);
 }
+error_propagate(errp, cco.err);
+g_free(cco.filename);
+return cco.ret;
 }
-
-ret = cco.ret;
-if (ret < 0) {
-if (cco.err) {
-error_propagate(errp, cco.err);
-} else {
-error_setg_errno(errp, -ret, "Could not create image");
-}
-}
-
-out:
-g_free(cco.filename);
-return ret;
 }
 
 /**
-- 
2.31.1




[PATCH v3 8/8] block/dirty-bitmap: remove unnecessary qemu_in_coroutine() case

2022-11-16 Thread Emanuele Giuseppe Esposito
Some functions check whether they are running in a coroutine and, if
so, call the coroutine callback directly. However, no coroutine ever
calls these functions, so that case can be removed.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/dirty-bitmap.c | 66 +++-
 1 file changed, 29 insertions(+), 37 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index bf3dc0512a..8092d08261 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -418,24 +418,20 @@ bdrv_co_remove_persistent_dirty_bitmap_entry(void *opaque)
 int bdrv_remove_persistent_dirty_bitmap(BlockDriverState *bs, const char *name,
 Error **errp)
 {
-if (qemu_in_coroutine()) {
-return bdrv_co_remove_persistent_dirty_bitmap(bs, name, errp);
-} else {
-Coroutine *co;
-BdrvRemovePersistentDirtyBitmapCo s = {
-.bs = bs,
-.name = name,
-.errp = errp,
-.ret = -EINPROGRESS,
-};
-
-co = qemu_coroutine_create(bdrv_co_remove_persistent_dirty_bitmap_entry,
-   &s);
-bdrv_coroutine_enter(bs, co);
-BDRV_POLL_WHILE(bs, s.ret == -EINPROGRESS);
-
-return s.ret;
-}
+Coroutine *co;
+BdrvRemovePersistentDirtyBitmapCo s = {
+.bs = bs,
+.name = name,
+.errp = errp,
+.ret = -EINPROGRESS,
+};
+assert(!qemu_in_coroutine());
+co = qemu_coroutine_create(bdrv_co_remove_persistent_dirty_bitmap_entry,
+   &s);
+bdrv_coroutine_enter(bs, co);
+BDRV_POLL_WHILE(bs, s.ret == -EINPROGRESS);
+
+return s.ret;
 }
 
 bool
@@ -494,25 +490,21 @@ bool bdrv_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
  uint32_t granularity, Error **errp)
 {
 IO_CODE();
-if (qemu_in_coroutine()) {
-return bdrv_co_can_store_new_dirty_bitmap(bs, name, granularity, errp);
-} else {
-Coroutine *co;
-BdrvCanStoreNewDirtyBitmapCo s = {
-.bs = bs,
-.name = name,
-.granularity = granularity,
-.errp = errp,
-.in_progress = true,
-};
-
-co = qemu_coroutine_create(bdrv_co_can_store_new_dirty_bitmap_entry,
-   &s);
-bdrv_coroutine_enter(bs, co);
-BDRV_POLL_WHILE(bs, s.in_progress);
-
-return s.ret;
-}
+Coroutine *co;
+BdrvCanStoreNewDirtyBitmapCo s = {
+.bs = bs,
+.name = name,
+.granularity = granularity,
+.errp = errp,
+.in_progress = true,
+};
+assert(!qemu_in_coroutine());
+co = qemu_coroutine_create(bdrv_co_can_store_new_dirty_bitmap_entry,
+   &s);
+bdrv_coroutine_enter(bs, co);
+BDRV_POLL_WHILE(bs, s.in_progress);
+
+return s.ret;
 }
 
 void bdrv_disable_dirty_bitmap(BdrvDirtyBitmap *bitmap)
-- 
2.31.1




[PATCH v3 1/8] block-copy: add missing coroutine_fn annotations

2022-11-16 Thread Emanuele Giuseppe Esposito
block_copy_reset_unallocated and block_copy_is_cluster_allocated are
only called by backup_run, a coroutine_fn itself.

Same applies to block_copy_block_status, called by
block_copy_dirty_clusters.

Therefore mark them as coroutine_fn too.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-copy.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index bb947afdda..f33ab1d0b6 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -577,8 +577,9 @@ static coroutine_fn int block_copy_task_entry(AioTask *task)
 return ret;
 }
 
-static int block_copy_block_status(BlockCopyState *s, int64_t offset,
-   int64_t bytes, int64_t *pnum)
+static coroutine_fn int block_copy_block_status(BlockCopyState *s,
+int64_t offset,
+int64_t bytes, int64_t *pnum)
 {
 int64_t num;
 BlockDriverState *base;
@@ -613,8 +614,9 @@ static int block_copy_block_status(BlockCopyState *s, int64_t offset,
  * Check if the cluster starting at offset is allocated or not.
  * return via pnum the number of contiguous clusters sharing this allocation.
  */
-static int block_copy_is_cluster_allocated(BlockCopyState *s, int64_t offset,
-   int64_t *pnum)
+static int coroutine_fn block_copy_is_cluster_allocated(BlockCopyState *s,
+int64_t offset,
+int64_t *pnum)
 {
 BlockDriverState *bs = s->source->bs;
 int64_t count, total_count = 0;
@@ -669,8 +671,9 @@ void block_copy_reset(BlockCopyState *s, int64_t offset, int64_t bytes)
  * @return 0 when the cluster at @offset was unallocated,
  * 1 otherwise, and -ret on error.
  */
-int64_t block_copy_reset_unallocated(BlockCopyState *s,
- int64_t offset, int64_t *count)
+int64_t coroutine_fn block_copy_reset_unallocated(BlockCopyState *s,
+  int64_t offset,
+  int64_t *count)
 {
 int ret;
 int64_t clusters, bytes;
-- 
2.31.1




[PATCH v3 5/8] block/vmdk: add missing coroutine_fn annotations

2022-11-16 Thread Emanuele Giuseppe Esposito
vmdk_co_create_opts() is a coroutine_fn, and calls vmdk_co_do_create()
which in turn can call two callbacks: vmdk_co_create_opts_cb and
vmdk_co_create_cb.

Mark all these functions as coroutine_fn, since vmdk_co_create_opts()
is the only caller.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/vmdk.c | 36 +++-
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index 26376352b9..0c32bf2e83 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -2285,10 +2285,11 @@ exit:
 return ret;
 }
 
-static int vmdk_create_extent(const char *filename, int64_t filesize,
-  bool flat, bool compress, bool zeroed_grain,
-  BlockBackend **pbb,
-  QemuOpts *opts, Error **errp)
+static int coroutine_fn vmdk_create_extent(const char *filename,
+   int64_t filesize, bool flat,
+   bool compress, bool zeroed_grain,
+   BlockBackend **pbb,
+   QemuOpts *opts, Error **errp)
 {
 int ret;
 BlockBackend *blk = NULL;
@@ -2366,14 +2367,14 @@ static int filename_decompose(const char *filename, char *path, char *prefix,
  *   non-split format.
  * idx >= 1: get the n-th extent if in a split subformat
  */
-typedef BlockBackend *(*vmdk_create_extent_fn)(int64_t size,
-   int idx,
-   bool flat,
-   bool split,
-   bool compress,
-   bool zeroed_grain,
-   void *opaque,
-   Error **errp);
+typedef BlockBackend * coroutine_fn (*vmdk_create_extent_fn)(int64_t size,
+ int idx,
+ bool flat,
+ bool split,
+ bool compress,
+ bool zeroed_grain,
+ void *opaque,
+ Error **errp);
 
 static void vmdk_desc_add_extent(GString *desc,
  const char *extent_line_fmt,
@@ -2616,7 +2617,7 @@ typedef struct {
 QemuOpts *opts;
 } VMDKCreateOptsData;
 
-static BlockBackend *vmdk_co_create_opts_cb(int64_t size, int idx,
+static BlockBackend * coroutine_fn vmdk_co_create_opts_cb(int64_t size, int idx,
bool flat, bool split, bool compress,
 bool zeroed_grain, void *opaque,
 Error **errp)
@@ -2768,10 +2769,11 @@ exit:
 return ret;
 }
 
-static BlockBackend *vmdk_co_create_cb(int64_t size, int idx,
-   bool flat, bool split, bool compress,
-   bool zeroed_grain, void *opaque,
-   Error **errp)
+static BlockBackend * coroutine_fn vmdk_co_create_cb(int64_t size, int idx,
+ bool flat, bool split,
+ bool compress,
+ bool zeroed_grain,
+ void *opaque, Error **errp)
 {
 int ret;
 BlockDriverState *bs;
-- 
2.31.1




[PATCH v3 6/8] block: bdrv_create_file is a coroutine_fn

2022-11-16 Thread Emanuele Giuseppe Esposito
It is always called in coroutine_fn callbacks, therefore
it can directly call bdrv_co_create().

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 6 --
 include/block/block-global-state.h | 3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 375c8056a3..dcac28756c 100644
--- a/block.c
+++ b/block.c
@@ -533,6 +533,7 @@ static int coroutine_fn bdrv_co_create(BlockDriver *drv, const char *filename,
 {
 int ret;
 GLOBAL_STATE_CODE();
+assert(qemu_in_coroutine());
 assert(*errp == NULL);
 
 if (!drv->bdrv_co_create_opts) {
@@ -723,7 +724,8 @@ out:
 return ret;
 }
 
-int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp)
+int coroutine_fn bdrv_create_file(const char *filename, QemuOpts *opts,
+  Error **errp)
 {
 QemuOpts *protocol_opts;
 BlockDriver *drv;
@@ -764,7 +766,7 @@ int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp)
 goto out;
 }
 
-ret = bdrv_create(drv, filename, protocol_opts, errp);
+ret = bdrv_co_create(drv, filename, protocol_opts, errp);
 out:
 qemu_opts_del(protocol_opts);
 qobject_unref(qdict);
diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 00e0cf8aea..6f35ed99e3 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -57,7 +57,8 @@ BlockDriver *bdrv_find_protocol(const char *filename,
 BlockDriver *bdrv_find_format(const char *format_name);
 int bdrv_create(BlockDriver *drv, const char* filename,
 QemuOpts *opts, Error **errp);
-int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp);
+int coroutine_fn bdrv_create_file(const char *filename, QemuOpts *opts,
+  Error **errp);
 
 BlockDriverState *bdrv_new(void);
 int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
-- 
2.31.1




[PATCH v3 3/8] block-backend: replace bdrv_*_above with blk_*_above

2022-11-16 Thread Emanuele Giuseppe Esposito
Avoid mixing bdrv_* functions with blk_*, so create blk_* counterparts
for:
- bdrv_block_status_above
- bdrv_is_allocated_above

Note that these functions will take the rdlock, so they must always run
in a coroutine.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c | 21 +
 block/commit.c|  4 ++--
 include/sysemu/block-backend-io.h |  9 +
 nbd/server.c  | 28 ++--
 qemu-img.c|  4 ++--
 5 files changed, 48 insertions(+), 18 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 742efa7955..333d50fb3f 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1424,6 +1424,27 @@ int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
 return blk_co_pwritev_part(blk, offset, bytes, qiov, 0, flags);
 }
 
+int coroutine_fn blk_block_status_above(BlockBackend *blk,
+BlockDriverState *base,
+int64_t offset, int64_t bytes,
+int64_t *pnum, int64_t *map,
+BlockDriverState **file)
+{
+IO_CODE();
+return bdrv_block_status_above(blk_bs(blk), base, offset, bytes, pnum, map,
+   file);
+}
+
+int coroutine_fn blk_is_allocated_above(BlockBackend *blk,
+BlockDriverState *base,
+bool include_base, int64_t offset,
+int64_t bytes, int64_t *pnum)
+{
+IO_CODE();
+return bdrv_is_allocated_above(blk_bs(blk), base, include_base, offset,
+   bytes, pnum);
+}
+
 typedef struct BlkRwCo {
 BlockBackend *blk;
 int64_t offset;
diff --git a/block/commit.c b/block/commit.c
index 0029b31944..9d4d908344 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -155,8 +155,8 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
 break;
 }
 /* Copy if allocated above the base */
-ret = bdrv_is_allocated_above(blk_bs(s->top), s->base_overlay, true,
-  offset, COMMIT_BUFFER_SIZE, );
+ret = blk_is_allocated_above(s->top, s->base_overlay, true,
+ offset, COMMIT_BUFFER_SIZE, );
 copy = (ret > 0);
 trace_commit_one_iteration(s, offset, n, ret);
 if (copy) {
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index 50f5aa2e07..a47cb825e5 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -92,6 +92,15 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
int64_t bytes, BdrvRequestFlags read_flags,
BdrvRequestFlags write_flags);
 
+int coroutine_fn blk_block_status_above(BlockBackend *blk,
+BlockDriverState *base,
+int64_t offset, int64_t bytes,
+int64_t *pnum, int64_t *map,
+BlockDriverState **file);
+int coroutine_fn blk_is_allocated_above(BlockBackend *blk,
+BlockDriverState *base,
+bool include_base, int64_t offset,
+int64_t bytes, int64_t *pnum);
 
 /*
  * "I/O or GS" API functions. These functions can run without
diff --git a/nbd/server.c b/nbd/server.c
index e2eec0cf46..6389b46396 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1991,7 +1991,7 @@ static int coroutine_fn nbd_co_send_structured_error(NBDClient *client,
 }
 
 /* Do a sparse read and send the structured reply to the client.
- * Returns -errno if sending fails. bdrv_block_status_above() failure is
+ * Returns -errno if sending fails. blk_block_status_above() failure is
  * reported to the client, at which point this function succeeds.
  */
 static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
@@ -2007,10 +2007,10 @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
 
 while (progress < size) {
 int64_t pnum;
-int status = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
- offset + progress,
- size - progress, &pnum, NULL,
- NULL);
+int status = blk_block_status_above(exp->common.blk, NULL,
+offset + progress,
+size - progress, &pnum, NULL,
+NULL);
 bool final;
 
 if (status < 0) {
@@ -2141,14 +2141,14 @@ static int 

[PATCH v3 0/8] Still more coroutine and various fixes in block layer

2022-11-16 Thread Emanuele Giuseppe Esposito
This is a dump of all minor coroutine-related fixes found while looking
around and testing various things in the QEMU block layer.

Patches aim to:
- add missing coroutine_fn annotation to the functions
- simplify to avoid the typical "if in coroutine: fn()
  // else create_coroutine(fn)" already present in generated_co_wrapper
  functions.
- make sure that if a BlockDriver callback is defined as coroutine_fn, then
  it is always running in a coroutine.

This series is based on Kevin Wolf's series "block: Simplify drain".

Based-on: <20221108123738.530873-1-kw...@redhat.com>

Emanuele
---
v3:
* Remove patch 1, based on Kevin's "drain simplification" series

v2:
* clarified commit message in patches 2/3/6 on why we add coroutine_fn

Emanuele Giuseppe Esposito (8):
  block-copy: add missing coroutine_fn annotations
  nbd/server.c: add missing coroutine_fn annotations
  block-backend: replace bdrv_*_above with blk_*_above
  block: distinguish between bdrv_create running in coroutine and not
  block/vmdk: add missing coroutine_fn annotations
  block: bdrv_create_file is a coroutine_fn
  block: bdrv_create is never called in coroutine context
  block/dirty-bitmap: remove unnecessary qemu_in_coroutine() case

 block.c| 75 ++
 block/block-backend.c  | 21 +
 block/block-copy.c | 15 +++---
 block/commit.c |  4 +-
 block/dirty-bitmap.c   | 66 --
 block/vmdk.c   | 36 +++---
 include/block/block-global-state.h |  3 +-
 include/sysemu/block-backend-io.h  |  9 
 nbd/server.c   | 43 +
 qemu-img.c |  4 +-
 10 files changed, 151 insertions(+), 125 deletions(-)

-- 
2.31.1
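
The "fast-path if already in coroutine" dispatch that patches 4, 7 and 8 remove can be condensed into a stand-alone sketch (all names here are hypothetical; in the real generated_co_wrapper code the slow path creates a coroutine with qemu_coroutine_create() and aio_poll()s until it finishes):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for qemu_in_coroutine(): tells us whether the caller is
 * already running inside a coroutine. */
static bool in_coroutine_ctx;

/* The coroutine_fn implementation proper. */
static int do_co_work(int x)
{
    return x + 1;
}

/* Wrapper-style entry point: take the fast path when the caller is
 * already a coroutine, otherwise spawn one and poll until done.  The
 * slow path is condensed to a direct call in this sketch. */
static int do_work(int x)
{
    if (in_coroutine_ctx) {
        return do_co_work(x);   /* fast path: call directly */
    }
    return do_co_work(x);       /* stands in for create + enter + poll */
}
```

The series replaces this dual-context dispatch with functions that are either always coroutine_fn or provably never called from a coroutine (asserted with `assert(!qemu_in_coroutine())`).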




[PATCH v3 7/8] block: bdrv_create is never called in coroutine context

2022-11-16 Thread Emanuele Giuseppe Esposito
Delete the in-coroutine case and make sure bdrv_create() won't be
called from coroutine context again.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c | 37 -
 1 file changed, 16 insertions(+), 21 deletions(-)

diff --git a/block.c b/block.c
index dcac28756c..7a4ce7948c 100644
--- a/block.c
+++ b/block.c
@@ -563,30 +563,25 @@ static void coroutine_fn bdrv_create_co_entry(void *opaque)
 int bdrv_create(BlockDriver *drv, const char* filename,
 QemuOpts *opts, Error **errp)
 {
+Coroutine *co;
+CreateCo cco = {
+.drv = drv,
+.filename = g_strdup(filename),
+.opts = opts,
+.ret = NOT_DONE,
+.err = NULL,
+};
 GLOBAL_STATE_CODE();
+assert(!qemu_in_coroutine());
 
-if (qemu_in_coroutine()) {
-/* Fast-path if already in coroutine context */
-return bdrv_co_create(drv, filename, opts, errp);
-} else {
-Coroutine *co;
-CreateCo cco = {
-.drv = drv,
-.filename = g_strdup(filename),
-.opts = opts,
-.ret = NOT_DONE,
-.err = NULL,
-};
-
-co = qemu_coroutine_create(bdrv_create_co_entry, &cco);
-qemu_coroutine_enter(co);
-while (cco.ret == NOT_DONE) {
-aio_poll(qemu_get_aio_context(), true);
-}
-error_propagate(errp, cco.err);
-g_free(cco.filename);
-return cco.ret;
+co = qemu_coroutine_create(bdrv_create_co_entry, &cco);
+qemu_coroutine_enter(co);
+while (cco.ret == NOT_DONE) {
+aio_poll(qemu_get_aio_context(), true);
 }
+error_propagate(errp, cco.err);
+g_free(cco.filename);
+return cco.ret;
 }
 
 /**
-- 
2.31.1




[PATCH RFC 2/3] hw/i2c: add mctp core

2022-11-16 Thread Klaus Jensen
From: Klaus Jensen 

Add an abstract MCTP over I2C endpoint model. This implements MCTP
control message handling as well as handling the actual I2C transport
(packetization).

Devices are intended to derive from this and implement the class
methods.

Parts of this implementation are inspired by code[1] previously posted by
Jonathan Cameron.

  [1]: https://lore.kernel.org/qemu-devel/20220520170128.4436-1-jonathan.came...@huawei.com/

Signed-off-by: Klaus Jensen 
---
 hw/arm/Kconfig |   1 +
 hw/i2c/Kconfig |   4 +
 hw/i2c/mctp.c  | 365 +
 hw/i2c/meson.build |   1 +
 hw/i2c/trace-events|  12 ++
 include/hw/i2c/mctp.h  |  83 ++
 include/hw/misc/mctp.h |  43 +
 7 files changed, 509 insertions(+)
 create mode 100644 hw/i2c/mctp.c
 create mode 100644 include/hw/i2c/mctp.h
 create mode 100644 include/hw/misc/mctp.h

diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index 17fcde8e1ccc..3233bdc193d7 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -444,6 +444,7 @@ config ASPEED_SOC
 select DS1338
 select FTGMAC100
 select I2C
+select MCTP_I2C
 select DPS310
 select PCA9552
 select SERIAL
diff --git a/hw/i2c/Kconfig b/hw/i2c/Kconfig
index 9bb8870517f8..5dd43d550c32 100644
--- a/hw/i2c/Kconfig
+++ b/hw/i2c/Kconfig
@@ -41,3 +41,7 @@ config PCA954X
 config PMBUS
 bool
 select SMBUS
+
+config MCTP_I2C
+bool
+select I2C
diff --git a/hw/i2c/mctp.c b/hw/i2c/mctp.c
new file mode 100644
index 000000000000..46376de95a98
--- /dev/null
+++ b/hw/i2c/mctp.c
@@ -0,0 +1,365 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ * SPDX-FileCopyrightText: Copyright (c) 2022 Samsung Electronics Co., Ltd.
+ * SPDX-FileContributor: Klaus Jensen 
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/main-loop.h"
+
+#include "hw/qdev-properties.h"
+#include "hw/i2c/i2c.h"
+#include "hw/i2c/mctp.h"
+
+#include "trace.h"
+
+static uint8_t crc8(uint16_t data)
+{
+#define POLY (0x1070U << 3)
+int i;
+
+for (i = 0; i < 8; i++) {
+if (data & 0x8000) {
+data = data ^ POLY;
+}
+
+data = data << 1;
+}
+
+return (uint8_t)(data >> 8);
+#undef POLY
+}
+
+static uint8_t i2c_smbus_pec(uint8_t crc, uint8_t *buf, size_t len)
+{
+int i;
+
+for (i = 0; i < len; i++) {
+crc = crc8((crc ^ buf[i]) << 8);
+}
+
+return crc;
+}
+
+void i2c_mctp_schedule_send(MCTPI2CEndpoint *mctp)
+{
+I2CBus *i2c = I2C_BUS(qdev_get_parent_bus(DEVICE(mctp)));
+
+mctp->tx.state = I2C_MCTP_STATE_TX_START_SEND;
+
+i2c_bus_master(i2c, mctp->tx.bh);
+}
+
+static void i2c_mctp_tx(void *opaque)
+{
+DeviceState *dev = DEVICE(opaque);
+I2CBus *i2c = I2C_BUS(qdev_get_parent_bus(dev));
+I2CSlave *slave = I2C_SLAVE(dev);
+MCTPI2CEndpoint *mctp = MCTP_I2C_ENDPOINT(dev);
+MCTPI2CEndpointClass *mc = MCTP_I2C_ENDPOINT_GET_CLASS(mctp);
+MCTPI2CPacket *pkt = (MCTPI2CPacket *)mctp->buffer;
+uint8_t flags = 0;
+
+switch (mctp->tx.state) {
+case I2C_MCTP_STATE_TX_SEND_BYTE:
+if (mctp->pos < mctp->len) {
+uint8_t byte = mctp->buffer[mctp->pos];
+
+trace_i2c_mctp_tx_send_byte(mctp->pos, byte);
+
+/* send next byte */
+i2c_send_async(i2c, byte);
+
+mctp->pos++;
+
+break;
+}
+
+/* packet sent */
+i2c_end_transfer(i2c);
+
+/* fall through */
+
+case I2C_MCTP_STATE_TX_START_SEND:
+if (mctp->tx.is_control) {
+/* packet payload is already in buffer */
+flags |= MCTP_H_FLAGS_SOM | MCTP_H_FLAGS_EOM;
+} else {
+/* get message bytes from derived device */
+mctp->len = mc->get_message_bytes(mctp, pkt->mctp.payload,
+  I2C_MCTP_MAXMTU, &flags);
+}
+
+if (!mctp->len) {
+trace_i2c_mctp_tx_done();
+
+/* no more packets needed; release the bus */
+i2c_bus_release(i2c);
+
+mctp->state = I2C_MCTP_STATE_IDLE;
+mctp->tx.is_control = false;
+
+break;
+}
+
+mctp->state = I2C_MCTP_STATE_TX;
+
+pkt->i2c = (MCTPI2CPacketHeader) {
+.dest = mctp->tx.addr & ~0x1,
+.prot = 0xf,
+.byte_count = 5 + mctp->len,
+.source = slave->address << 1 | 0x1,
+};
+
+pkt->mctp.hdr = (MCTPPacketHeader) {
+.version = 0x1,
+.eid.dest = mctp->tx.eid,
+.eid.source = mctp->my_eid,
+.flags = flags | (mctp->tx.pktseq++ & 0x3) << 4 | mctp->tx.flags,
+};
+
+mctp->len += sizeof(MCTPI2CPacket);
+mctp->buffer[mctp->len] = i2c_smbus_pec(0, mctp->buffer, mctp->len);
+mctp->len++;
+
+trace_i2c_mctp_tx_start_send(mctp->len);
+
+i2c_start_send_async(i2c, pkt->i2c.dest >> 1);
+
+/* already "sent" the destination slave address */
+   

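As a cross-check on the PEC helpers in this patch: crc8()/i2c_smbus_pec() implement the standard SMBus CRC-8 (polynomial x^8 + x^2 + x + 1, MSB-first, init 0), so a stand-alone copy can be verified against the catalogued CRC-8 check value for "123456789":

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-wise CRC-8 step over one byte held in the top half of a 16-bit
 * register; 0x1070 << 3 is polynomial 0x07 pre-shifted for this layout. */
static uint8_t crc8(uint16_t data)
{
    const uint16_t poly = 0x1070U << 3;
    int i;

    for (i = 0; i < 8; i++) {
        if (data & 0x8000) {
            data ^= poly;
        }
        data <<= 1;
    }

    return (uint8_t)(data >> 8);
}

/* Fold a buffer into the running PEC, as SMBus does for a transaction. */
static uint8_t i2c_smbus_pec(uint8_t crc, const uint8_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        crc = crc8((crc ^ buf[i]) << 8);
    }

    return crc;
}
```

The same pair exists in the Linux kernel's SMBus core, which is where this style of implementation comes from.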
[PATCH maybe-7.2 1/3] hw/i2c: only schedule pending master when bus is idle

2022-11-16 Thread Klaus Jensen
From: Klaus Jensen 

It is not given that the current master will release the bus after a
transfer ends. Only schedule a pending master if the bus is idle.

Fixes: 37fa5ca42623 ("hw/i2c: support multiple masters")
Signed-off-by: Klaus Jensen 
---
 hw/i2c/aspeed_i2c.c  |  2 ++
 hw/i2c/core.c| 37 ++---
 include/hw/i2c/i2c.h |  2 ++
 3 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/hw/i2c/aspeed_i2c.c b/hw/i2c/aspeed_i2c.c
index c166fd20fa11..1f071a3811f7 100644
--- a/hw/i2c/aspeed_i2c.c
+++ b/hw/i2c/aspeed_i2c.c
@@ -550,6 +550,8 @@ static void aspeed_i2c_bus_handle_cmd(AspeedI2CBus *bus, uint64_t value)
 }
 SHARED_ARRAY_FIELD_DP32(bus->regs, reg_cmd, M_STOP_CMD, 0);
 aspeed_i2c_set_state(bus, I2CD_IDLE);
+
+i2c_schedule_pending_master(bus->bus);
 }
 
 if (aspeed_i2c_bus_pkt_mode_en(bus)) {
diff --git a/hw/i2c/core.c b/hw/i2c/core.c
index d4ba8146bffb..bed594fe599b 100644
--- a/hw/i2c/core.c
+++ b/hw/i2c/core.c
@@ -185,22 +185,39 @@ int i2c_start_transfer(I2CBus *bus, uint8_t address, bool is_recv)
 
 void i2c_bus_master(I2CBus *bus, QEMUBH *bh)
 {
+I2CPendingMaster *node = g_new(struct I2CPendingMaster, 1);
+node->bh = bh;
+
+QSIMPLEQ_INSERT_TAIL(&bus->pending_masters, node, entry);
+}
+
+void i2c_schedule_pending_master(I2CBus *bus)
+{
+I2CPendingMaster *node;
+
 if (i2c_bus_busy(bus)) {
-I2CPendingMaster *node = g_new(struct I2CPendingMaster, 1);
-node->bh = bh;
-
-QSIMPLEQ_INSERT_TAIL(&bus->pending_masters, node, entry);
+/* someone is already controlling the bus; wait for it to release it */
+return;
+}
 
+if (QSIMPLEQ_EMPTY(&bus->pending_masters)) {
 return;
 }
 
-bus->bh = bh;
+node = QSIMPLEQ_FIRST(&bus->pending_masters);
+bus->bh = node->bh;
+
+QSIMPLEQ_REMOVE_HEAD(&bus->pending_masters, entry);
+g_free(node);
+
 qemu_bh_schedule(bus->bh);
 }
 
 void i2c_bus_release(I2CBus *bus)
 {
 bus->bh = NULL;
+
+i2c_schedule_pending_master(bus);
 }
 
 int i2c_start_recv(I2CBus *bus, uint8_t address)
@@ -234,16 +251,6 @@ void i2c_end_transfer(I2CBus *bus)
 g_free(node);
 }
 bus->broadcast = false;
-
-if (!QSIMPLEQ_EMPTY(&bus->pending_masters)) {
-I2CPendingMaster *node = QSIMPLEQ_FIRST(&bus->pending_masters);
-bus->bh = node->bh;
-
-QSIMPLEQ_REMOVE_HEAD(&bus->pending_masters, entry);
-g_free(node);
-
-qemu_bh_schedule(bus->bh);
-}
 }
 
 int i2c_send(I2CBus *bus, uint8_t data)
diff --git a/include/hw/i2c/i2c.h b/include/hw/i2c/i2c.h
index 9b9581d23097..2a3abacd1ba6 100644
--- a/include/hw/i2c/i2c.h
+++ b/include/hw/i2c/i2c.h
@@ -141,6 +141,8 @@ int i2c_start_send(I2CBus *bus, uint8_t address);
  */
 int i2c_start_send_async(I2CBus *bus, uint8_t address);
 
+void i2c_schedule_pending_master(I2CBus *bus);
+
 void i2c_end_transfer(I2CBus *bus);
 void i2c_nack(I2CBus *bus);
 void i2c_ack(I2CBus *bus);
-- 
2.38.1
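
The intended semantics of this fix — a queued master must not be scheduled while another master still owns the bus, and is handed the bus when it goes idle — can be illustrated with a reduced, self-contained model (a toy sketch, not the QEMU I2C API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_PENDING = 4 };

/* Reduced bus model: busy while cur != NULL; pending masters wait in a
 * FIFO and are only scheduled once the bus is idle. */
struct bus {
    const char *cur;                  /* current master, NULL when idle */
    const char *pending[MAX_PENDING]; /* FIFO ring of waiting masters */
    size_t head, tail;
};

/* Like the patched i2c_bus_master(): unconditionally enqueue. */
static void bus_master(struct bus *b, const char *who)
{
    b->pending[b->tail++ % MAX_PENDING] = who;
}

/* Like i2c_schedule_pending_master(): only hand over when idle. */
static void bus_schedule_pending(struct bus *b)
{
    if (b->cur != NULL || b->head == b->tail) {
        return;                       /* busy, or nothing pending */
    }
    b->cur = b->pending[b->head++ % MAX_PENDING];
}

/* Like i2c_bus_release(): drop ownership, then schedule the next one. */
static void bus_release(struct bus *b)
{
    b->cur = NULL;
    bus_schedule_pending(b);
}
```

Before the fix, the hand-over happened in i2c_end_transfer(), i.e. at the end of a transfer even if the current master had not released the bus; the model above makes the hand-over conditional on the bus actually being idle.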




[PATCH RFC 3/3] hw/nvme: add nvme management interface model

2022-11-16 Thread Klaus Jensen
From: Klaus Jensen 

Add the 'nmi-i2c' device that emulates an NVMe Management Interface
controller.

Initial support is very basic (Read NMI DS, Configuration Get).

This is based on previously posted code by Padmakar Kalghatgi, Arun
Kumar Agasar and Saurav Kumar.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/meson.build  |   1 +
 hw/nvme/nmi-i2c.c| 381 +++
 hw/nvme/trace-events |   6 +
 3 files changed, 388 insertions(+)
 create mode 100644 hw/nvme/nmi-i2c.c

diff --git a/hw/nvme/meson.build b/hw/nvme/meson.build
index 3cf40046eea9..b231e3fa12c2 100644
--- a/hw/nvme/meson.build
+++ b/hw/nvme/meson.build
@@ -1 +1,2 @@
softmmu_ss.add(when: 'CONFIG_NVME_PCI', if_true: files('ctrl.c', 'dif.c', 'ns.c', 'subsys.c'))
+softmmu_ss.add(when: 'CONFIG_MCTP_I2C', if_true: files('nmi-i2c.c'))
diff --git a/hw/nvme/nmi-i2c.c b/hw/nvme/nmi-i2c.c
new file mode 100644
index 000000000000..79fd18cdc5cf
--- /dev/null
+++ b/hw/nvme/nmi-i2c.c
@@ -0,0 +1,381 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-only
+ *
+ * SPDX-FileCopyrightText: Copyright (c) 2022 Samsung Electronics Co., Ltd.
+ *
+ * SPDX-FileContributor: Padmakar Kalghatgi 
+ * SPDX-FileContributor: Arun Kumar Agasar 
+ * SPDX-FileContributor: Saurav Kumar 
+ * SPDX-FileContributor: Klaus Jensen 
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/crc32c.h"
+#include "hw/i2c/i2c.h"
+#include "hw/registerfields.h"
+#include "hw/i2c/mctp.h"
+#include "trace.h"
+
+#define NMI_MAX_MESSAGE_LENGTH 4224
+
+#define TYPE_NMI_I2C_DEVICE "nmi-i2c"
+OBJECT_DECLARE_SIMPLE_TYPE(NMIDevice, NMI_I2C_DEVICE)
+
+typedef struct NMIDevice {
+MCTPI2CEndpoint mctp;
+
+uint8_t buffer[NMI_MAX_MESSAGE_LENGTH];
+uint8_t scratch[NMI_MAX_MESSAGE_LENGTH];
+
+size_t  len;
+int64_t pos;
+} NMIDevice;
+
+FIELD(NMI_NMP, ROR, 7, 1)
+FIELD(NMI_NMP, NMIMT, 3, 4)
+
+#define NMI_NMP_NMIMT_NMI_CMD 0x1
+#define NMI_NMP_NMIMT_NM_ADMIN 0x2
+
+typedef struct NMIMessage {
+uint8_t mctpd;
+uint8_t nmp;
+uint8_t rsvd2[2];
+uint8_t payload[]; /* includes the Message Integrity Check */
+} NMIMessage;
+
+typedef struct NMIRequest {
+    uint8_t opc;
+    uint8_t rsvd1[3];
+    uint32_t dw0;
+    uint32_t dw1;
+    uint32_t mic;
+
+typedef struct NMIResponse {
+uint8_t status;
+uint8_t response[3];
+uint8_t payload[]; /* includes the Message Integrity Check */
+} NMIResponse;
+
+typedef enum NMIReadDSType {
+NMI_CMD_READ_NMI_DS_SUBSYSTEM   = 0x0,
+NMI_CMD_READ_NMI_DS_PORTS   = 0x1,
+NMI_CMD_READ_NMI_DS_CTRL_LIST   = 0x2,
+NMI_CMD_READ_NMI_DS_CTRL_INFO   = 0x3,
+NMI_CMD_READ_NMI_DS_CMD_SUPPORT = 0x4,
+NMI_CMD_READ_NMI_DS_MEB_CMD_SUPPORT = 0x5,
+} NMIReadDSType;
+
+static void nmi_handle_mi_read_nmi_ds(NMIDevice *nmi, NMIRequest *request)
+{
+I2CSlave *i2c = I2C_SLAVE(nmi);
+
+uint32_t dw0 = le32_to_cpu(request->dw0);
+uint8_t dtyp = (dw0 >> 24) & 0xf;
+uint8_t *buf;
+size_t len;
+
+trace_nmi_handle_mi_read_nmi_ds(dtyp);
+
+static uint8_t nmi_ds_subsystem[36] = {
+0x00,   /* success */
+0x20,   /* response data length */
+0x00, 0x00, /* reserved */
+0x00,   /* number of ports */
+0x01,   /* major version */
+0x01,   /* minor version */
+};
+
+static uint8_t nmi_ds_ports[36] = {
+0x00,   /* success */
+0x20,   /* response data length */
+0x00, 0x00, /* reserved */
+0x02,   /* port type (smbus) */
+0x00,   /* reserved */
+0x40, 0x00, /* maximum mctp transmission unit size (64 bytes) */
+0x00, 0x00, 0x00, 0x00, /* management endpoint buffer size */
+0x00, 0x00, /* vpd i2c address/freq */
+0x00, 0x01, /* management endpoint i2c address/freq */
+};
+
+static uint8_t nmi_ds_error[4] = {
+0x04,   /* invalid parameter */
+0x00,   /* first invalid bit position */
+0x00, 0x00, /* first invalid byte position */
+};
+
+static uint8_t nmi_ds_empty[8] = {
+0x00,   /* success */
+0x02,   /* response data length */
+0x00, 0x00, /* reserved */
+0x00, 0x00, /* number of controllers */
+0x00, 0x00, /* padding */
+};
+
+switch (dtyp) {
+case NMI_CMD_READ_NMI_DS_SUBSYSTEM:
+len = 36;
+buf = nmi_ds_subsystem;
+
+break;
+
+case NMI_CMD_READ_NMI_DS_PORTS:
+len = 36;
+buf = nmi_ds_ports;
+
+/* patch in the i2c address of the endpoint */
+buf[14] = i2c->address;
+
+break;
+
+case NMI_CMD_READ_NMI_DS_CTRL_INFO:
+len = 4;
+buf = nmi_ds_error;
+
+break;
+
+case NMI_CMD_READ_NMI_DS_CTRL_LIST:
+case NMI_CMD_READ_NMI_DS_CMD_SUPPORT:
+case NMI_CMD_READ_NMI_DS_MEB_CMD_SUPPORT:
+len = 8;
+buf = nmi_ds_empty;
+
+break;
+
+default:
+len = 4;
+buf = nmi_ds_error;
+
+  

[PATCH 0/3] hw/{i2c, nvme}: mctp endpoint, nvme management interface model

2022-11-16 Thread Klaus Jensen
From: Klaus Jensen 

This adds a generic MCTP endpoint model that other devices may derive
from. I'm not 100% happy with the design of the class methods, but it's
a start.

Patch 1 is a bug fix, but since there are currently no in-tree users of
the API, it is not critical. I'd like to have Peter verify the fix with
his netdev code as well.

Also included is a very basic implementation of an NVMe-MI device,
supporting only a small subset of the required commands.

Since this all relies on i2c target mode, this can currently only be
used with an SoC that includes the Aspeed I2C controller.

The easiest way to get up and running with this, is to grab my buildroot
overlay[1]. It includes a modified dts as well as a couple of
required packages.

QEMU can then be launched along these lines:

  qemu-system-arm \
-nographic \
-M ast2600-evb \
-kernel output/images/zImage \
-initrd output/images/rootfs.cpio \
-dtb output/images/aspeed-ast2600-evb-nmi.dtb \
-nic user,hostfwd=tcp::-:22 \
-device nmi-i2c,address=0x3a \
-serial mon:stdio

From within the booted system:

  mctp addr add 8 dev mctpi2c15
  mctp link set mctpi2c15 up
  mctp route add 9 via mctpi2c15
  mctp neigh add 9 dev mctpi2c15 lladdr 0x3a
  mi-mctp 1 9 info

Comments are very welcome!

  [1]: https://github.com/birkelund/buildroots/tree/main/mctp-i2c

Klaus Jensen (3):
  hw/i2c: only schedule pending master when bus is idle
  hw/i2c: add mctp core
  hw/nvme: add nvme management interface model

 hw/arm/Kconfig |   1 +
 hw/i2c/Kconfig |   4 +
 hw/i2c/aspeed_i2c.c|   2 +
 hw/i2c/core.c  |  37 ++--
 hw/i2c/mctp.c  | 365 +++
 hw/i2c/meson.build |   1 +
 hw/i2c/trace-events|  12 ++
 hw/nvme/meson.build|   1 +
 hw/nvme/nmi-i2c.c  | 381 +
 hw/nvme/trace-events   |   6 +
 include/hw/i2c/i2c.h   |   2 +
 include/hw/i2c/mctp.h  |  83 +
 include/hw/misc/mctp.h |  43 +
 13 files changed, 923 insertions(+), 15 deletions(-)
 create mode 100644 hw/i2c/mctp.c
 create mode 100644 hw/nvme/nmi-i2c.c
 create mode 100644 include/hw/i2c/mctp.h
 create mode 100644 include/hw/misc/mctp.h

-- 
2.38.1



