Re: [PATCH] ide: Cap LBA28 capacity announcement to 2^28-1

2021-10-05 Thread Samuel Thibault
Ping?

Samuel Thibault, on Tue, 24 Aug 2021 12:43:44 +0200, wrote:
> The LBA28 capacity (at offsets 60/61 of identification) is supposed to
> express the maximum size supported by LBA28 commands. If the device is
> larger than this, we have to cap it to 2^28-1.
> 
> At least NetBSD happens to be using this value to determine whether to use
> LBA28 or LBA48 for its commands, using LBA28 for sectors that don't need
> LBA48. This commit thus fixes NetBSD access to disks larger than 128GiB.
> 
> Signed-off-by: Samuel Thibault 
> ---
>  hw/ide/core.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ide/core.c b/hw/ide/core.c
> index fd69ca3167..e28f8aad61 100644
> --- a/hw/ide/core.c
> +++ b/hw/ide/core.c
> @@ -98,8 +98,12 @@ static void put_le16(uint16_t *p, unsigned int v)
>  static void ide_identify_size(IDEState *s)
>  {
>  uint16_t *p = (uint16_t *)s->identify_data;
> -put_le16(p + 60, s->nb_sectors);
> -put_le16(p + 61, s->nb_sectors >> 16);
> +int64_t nb_sectors_lba28 = s->nb_sectors;
> +if (nb_sectors_lba28 >= 1 << 28) {
> +nb_sectors_lba28 = (1 << 28) - 1;
> +}
> +put_le16(p + 60, nb_sectors_lba28);
> +put_le16(p + 61, nb_sectors_lba28 >> 16);
>  put_le16(p + 100, s->nb_sectors);
>  put_le16(p + 101, s->nb_sectors >> 16);
>  put_le16(p + 102, s->nb_sectors >> 32);
> -- 
> 2.32.0
> 
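
For reference, a standalone sketch (not QEMU code) of the rule the patch applies: IDENTIFY words 60/61 carry the LBA28-addressable capacity and must be capped at 2^28 - 1, while words 100-103 carry the full LBA48 capacity. With 512-byte sectors, 2^28 sectors is 128 GiB, which is why only disks larger than that were affected.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch in the spirit of ide_identify_size():
     * words 60/61 capped to LBA28, words 100..103 hold the full size. */
    static void identify_size(uint16_t *id, uint64_t nb_sectors)
    {
        uint64_t lba28 = nb_sectors;

        if (lba28 >= 1 << 28) {
            lba28 = (1 << 28) - 1;      /* largest count LBA28 can express */
        }
        id[60] = lba28 & 0xffff;        /* LBA28 capacity, low word */
        id[61] = lba28 >> 16;           /* LBA28 capacity, high word */
        id[100] = nb_sectors & 0xffff;  /* LBA48 capacity, words 100..103 */
        id[101] = nb_sectors >> 16;
        id[102] = nb_sectors >> 32;
        id[103] = nb_sectors >> 48;
    }

    int main(void)
    {
        uint16_t id[256] = { 0 };

        identify_size(id, 1ULL << 29);  /* a 256 GiB disk with 512-byte sectors */
        printf("LBA28 words 61/60: %#x %#x\n", id[61], id[60]);
        printf("LBA48 words 103..100: %#x %#x %#x %#x\n",
               id[103], id[102], id[101], id[100]);
        return 0;
    }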



Re: [PATCH v0 0/2] virtio-blk and vhost-user-blk cross-device migration

2021-10-05 Thread Michael S. Tsirkin
On Tue, Oct 05, 2021 at 12:10:08PM -0400, Eduardo Habkost wrote:
> On Tue, Oct 05, 2021 at 03:01:05PM +0100, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > On Tue, Oct 05, 2021 at 02:18:40AM +0300, Roman Kagan wrote:
> > > > On Mon, Oct 04, 2021 at 11:11:00AM -0400, Michael S. Tsirkin wrote:
> > > > > On Mon, Oct 04, 2021 at 06:07:29PM +0300, Denis Plotnikov wrote:
> > > > > > It might be useful for the cases when a slow block layer should be 
> > > > > > replaced
> > > > > > with a more performant one on running VM without stopping, i.e. 
> > > > > > with very low
> > > > > > downtime comparable with the one on migration.
> > > > > > 
> > > > > > It's possible to achieve that for two reasons:
> > > > > > 
> > > > > > 1. The VMStates of "virtio-blk" and "vhost-user-blk" are almost the
> > > > > >    same. They consist of the identical VMSTATE_VIRTIO_DEVICE and
> > > > > >    differ from each other only in the values of the migration
> > > > > >    service fields.
> > > > > > 2. The device driver used in the guest is the same: virtio-blk
> > > > > > 
> > > > > > In this series cross-migration is achieved by adding a new type.
> > > > > > The new type uses the virtio-blk VMState instead of the
> > > > > > vhost-user-blk specific VMState, and it implements migration
> > > > > > save/load callbacks to be compatible with the migration stream
> > > > > > produced by the "virtio-blk" device.
> > > > > > 
> > > > > > Adding a new type instead of modifying the existing one is
> > > > > > convenient. It makes it easy to distinguish the new
> > > > > > virtio-blk-compatible vhost-user-blk device from the existing
> > > > > > non-compatible one using the qemu machinery, without any other
> > > > > > modifications. That gives all the variety of qemu device-related
> > > > > > constraints out of the box.
> > > > > 
> > > > > Hmm I'm not sure I understand. What is the advantage for the user?
> > > > > What if vhost-user-blk became an alias for vhost-user-virtio-blk?
> > > > > We could add some hacks to make it compatible for old machine types.
> > > > 
> > > > The point is that virtio-blk and vhost-user-blk are not
> > > > migration-compatible ATM.  OTOH they are the same device from the guest
> > > > POV so there's nothing fundamentally preventing the migration between
> > > > the two.  In particular, we see it as a means to switch between the
> > > > storage backend transports via live migration without disrupting the
> > > > guest.
> > > > 
> > > > Migration-wise virtio-blk and vhost-user-blk have in common
> > > > 
> > > > - the content of the VMState -- VMSTATE_VIRTIO_DEVICE
> > > > 
> > > > The two differ in
> > > > 
> > > > - the name and the version of the VMStateDescription
> > > > 
> > > > - virtio-blk has an extra migration section (via .save/.load callbacks
> > > >   on VirtioDeviceClass) containing requests in flight
> > > > 
> > > > It looks like to become migration-compatible with virtio-blk,
> > > > vhost-user-blk has to start using VMStateDescription of virtio-blk and
> > > > provide compatible .save/.load callbacks.  It isn't entirely obvious how
> > > > to make this machine-type-dependent, so we came up with a simpler idea
> > > > of defining a new device that shares most of the implementation with the
> > > > original vhost-user-blk except for the migration stuff.  We're certainly
> > > > open to suggestions on how to reconcile this under a single
> > > > vhost-user-blk device, as this would be more user-friendly indeed.
> > > > 
> > > > We considered using a class property for this and defining the
> > > > respective compat clause, but IIUC the class constructors (where .vmsd
> > > > and .save/.load are defined) are not supposed to depend on class
> > > > properties.
> > > > 
> > > > Thanks,
> > > > Roman.
> > > 
> > > So the question is how to make vmsd depend on machine type.
> > > CC Eduardo who poked at this kind of compat stuff recently,
> > > paolo who looked at qom things most recently and dgilbert
> > > for advice on migration.
> > 
> > I don't think I've seen anyone change a vmsd name depending on machine
> > type; making fields appear/disappear is easy - that just ends up as a
> > property on the device that's checked; I guess if that property is
> > global (rather than per instance) then you can check it in
> > vhost_user_blk_class_init and swing the dc->vmsd pointer?
> 
> class_init can be called very early during QEMU initialization,
> so it's too early to make decisions based on machine type.
> 
> Making a specific vmsd appear/disappear based on machine
> configuration or state is "easy", by implementing
> VMStateDescription.needed.  But this would require registering
> both vmsds (one of them would need to be registered manually
> instead of using DeviceClass.vmsd).
> 
> I don't remember what the consequences of not using
> DeviceClass.vmsd to register a vmsd are, I only remember it was
> subtle.  See commit b170fce3dd06 ("cpu: Register
> 
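
The .needed mechanism mentioned above is the usual subsection gate in migration/vmstate.h; a rough sketch of the shape it could take for this problem follows. The gating property and how it would be wired up per machine type are hypothetical, which is exactly the open question raised above.

    /* Sketch only: gate an optional migration subsection on a (hypothetical)
     * compat flag; how that flag gets set per machine type is the open issue. */
    static bool vhost_user_blk_virtio_compat_needed(void *opaque)
    {
        VHostUserBlk *s = opaque;

        return s->virtio_blk_compat;            /* hypothetical property */
    }

    static const VMStateDescription vmstate_vhost_user_blk_virtio_compat = {
        .name = "vhost-user-blk/virtio-compat", /* hypothetical subsection name */
        .version_id = 1,
        .minimum_version_id = 1,
        .needed = vhost_user_blk_virtio_compat_needed,
        .fields = (VMStateField[]) {
            VMSTATE_END_OF_LIST()
        },
    };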

Re: [PATCH v7 3/8] qmp: add QMP command x-debug-query-virtio

2021-10-05 Thread Eric Blake
On Tue, Oct 05, 2021 at 12:45:48PM -0400, Jonah Palmer wrote:
> From: Laurent Vivier 
> 
> This new command lists all the instances of VirtIODevice with
> their QOM paths and virtio type/name.
> 
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/meson.build  |  2 ++
>  hw/virtio/virtio-stub.c| 14 ++
>  hw/virtio/virtio.c | 27 +++
>  include/hw/virtio/virtio.h |  1 +
>  qapi/meson.build   |  1 +
>  qapi/qapi-schema.json  |  1 +
>  qapi/virtio.json   | 66 
> ++
>  tests/qtest/qmp-cmd-test.c |  1 +
>  8 files changed, 113 insertions(+)
>  create mode 100644 hw/virtio/virtio-stub.c
>  create mode 100644 qapi/virtio.json
> 
>  [Jonah: VirtioInfo member 'type' is now of type string and no longer
>   relies on defining a QAPI list of virtio device type enumerations
>   to match the VirtIODevice name with qapi_enum_parse().]

Hmm; depending on how much information you want to cram in strings, we
may want to rebase this series on top of Dan's work to add the
HumanReadableText QAPI type:
https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg07717.html

> +++ b/qapi/virtio.json
> @@ -0,0 +1,66 @@
> +# -*- Mode: Python -*-
> +# vim: filetype=python
> +#
> +
> +##
> +# = Virtio devices
> +##
> +
> +##
> +# @VirtioInfo:
> +#
> +# Information about a given VirtIODevice
> +#
> +# @path: VirtIO device canonical QOM path.
> +#
> +# @type: VirtIO device name.
> +#
> +# Since: 6.2
> +#
> +##
> +{ 'struct': 'VirtioInfo',
> +'data': {
> +'path': 'str',
> +'type': 'str'
> +}
> +}
> +
> +##
> +# @x-debug-query-virtio:
> +#
> +# Return a list of all initialized VirtIO devices
> +#
> +# Returns: list of gathered @VirtioInfo devices
> +#
> +# Since: 6.2
> +#
> +# Example:
> +#
> +# -> { "execute": "x-debug-query-virtio" }
> +# <- { "return": [
> +#{
> +#"path": "/machine/peripheral-anon/device[4]/virtio-backend",
> +#"type": "virtio-input"
> +#},
> +#{
> +#"path": "/machine/peripheral/crypto0/virtio-backend",
> +#"type": "virtio-crypto"
> +#},
> +#{
> +#"path": "/machine/peripheral-anon/device[2]/virtio-backend",
> +#"type": "virtio-scsi"
> +#},
> +#{
> +#"path": "/machine/peripheral-anon/device[1]/virtio-backend",
> +#"type": "virtio-net"
> +#},
> +#{
> +#"path": "/machine/peripheral-anon/device[0]/virtio-backend",
> +#"type": "virtio-serial"
> +#}
> +#  ]
> +#}
> +#
> +##
> +
> +{ 'command': 'x-debug-query-virtio', 'returns': ['VirtioInfo'] }

But for now, it looks like 'str' is the correct type.


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v7 1/8] virtio: drop name parameter for virtio_init()

2021-10-05 Thread Eric Blake
On Tue, Oct 05, 2021 at 12:45:46PM -0400, Jonah Palmer wrote:
> This patch drops the name parameter for the virtio_init function.
> 
> The pair between the numeric device ID and the string device ID
> (name) of a virtio device already exists, but not in a way that
> let's us map between them.

s/let's/lets/

> 
> This patch will let us do this and removes the need for the name
> parameter in virtio_init().
> 
> Signed-off-by: Jonah Palmer 
> ---

> +++ b/hw/virtio/virtio.c
> @@ -133,6 +133,43 @@ struct VirtQueue
>  QLIST_ENTRY(VirtQueue) node;
>  };
>  
> +const char *virtio_device_names[] = {
> +[VIRTIO_ID_NET] = "virtio-net",
> +[VIRTIO_ID_BLOCK] = "virtio-blk",
> +[VIRTIO_ID_CONSOLE] = "virtio-serial",
> +[VIRTIO_ID_RNG] = "virtio-rng",
> +[VIRTIO_ID_BALLOON] = "virtio-balloon",
> +[VIRTIO_ID_IOMEM] = "virtio-iomem",
> +[VIRTIO_ID_RPMSG] = "virtio-rpmsg",
> +[VIRTIO_ID_SCSI] = "virtio-scsi",
> +[VIRTIO_ID_9P] = "virtio-9p",
> +[VIRTIO_ID_MAC80211_WLAN] = "virtio-mac-wlan",
> +[VIRTIO_ID_RPROC_SERIAL] = "virtio-rproc-serial",
> +[VIRTIO_ID_CAIF] = "virtio-caif",
> +[VIRTIO_ID_MEMORY_BALLOON] = "virtio-mem-balloon",
> +[VIRTIO_ID_GPU] = "virtio-gpu",
> +[VIRTIO_ID_CLOCK] = "virtio-clk",
> +[VIRTIO_ID_INPUT] = "virtio-input",
> +[VIRTIO_ID_VSOCK] = "vhost-vsock",
> +[VIRTIO_ID_CRYPTO] = "virtio-crypto",
> +[VIRTIO_ID_SIGNAL_DIST] = "virtio-signal",
> +[VIRTIO_ID_PSTORE] = "virtio-pstore",
> +[VIRTIO_ID_IOMMU] = "virtio-iommu",
> +[VIRTIO_ID_MEM] = "virtio-mem",
> +[VIRTIO_ID_SOUND] = "virtio-sound",
> +[VIRTIO_ID_FS] = "vhost-user-fs",
> +[VIRTIO_ID_PMEM] = "virtio-pmem",
> +[VIRTIO_ID_MAC80211_HWSIM] = "virtio-mac-hwsim",
> +[VIRTIO_ID_I2C_ADAPTER] = "vhost-user-i2c",
> +[VIRTIO_ID_BT] = "virtio-bluetooth"
> +};

Are these IDs consecutive, or can the array have gaps?

> +
> +static const char *virtio_id_to_name(uint16_t device_id)
> +{
> +assert(device_id < G_N_ELEMENTS(virtio_device_names));
> +return virtio_device_names[device_id];

If the latter, you may also want to assert that you aren't returning NULL.

> +++ b/include/standard-headers/linux/virtio_ids.h
> @@ -55,6 +55,7 @@
>  #define VIRTIO_ID_FS 26 /* virtio filesystem */
>  #define VIRTIO_ID_PMEM   27 /* virtio pmem */
>  #define VIRTIO_ID_MAC80211_HWSIM 29 /* virtio mac80211-hwsim */
> +#define VIRTIO_ID_I2C_ADAPTER   34 /* virtio I2C adapter */
>  #define VIRTIO_ID_BT 40 /* virtio bluetooth */

And it looks like the array has gaps.
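
A sparse designated-initializer array leaves NULL in the holes, so a gap-tolerant variant of the lookup might look like this (a sketch, not part of the patch):

    static const char *virtio_id_to_name(uint16_t device_id)
    {
        const char *name = NULL;

        if (device_id < G_N_ELEMENTS(virtio_device_names)) {
            name = virtio_device_names[device_id];
        }
        /* catch both out-of-range IDs and holes in the table */
        assert(name != NULL);
        return name;
    }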

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




[RFC PATCH 3/4] hw/scsi/scsi-generic: Use automatic AIO context lock

2021-10-05 Thread Philippe Mathieu-Daudé
Use the automatic AIO context acquire/release in
scsi_command_complete().

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/scsi/scsi-generic.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 665baf900e4..08ef623c030 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -114,9 +114,9 @@ static void scsi_command_complete(void *opaque, int ret)
 assert(r->req.aiocb != NULL);
 r->req.aiocb = NULL;
 
-aio_context_acquire(blk_get_aio_context(s->conf.blk));
-scsi_command_complete_noio(r, ret);
-aio_context_release(blk_get_aio_context(s->conf.blk));
+WITH_AIO_CONTEXT_ACQUIRE_GUARD(blk_get_aio_context(s->conf.blk)) {
+scsi_command_complete_noio(r, ret);
+}
 }
 
 static int execute_command(BlockBackend *blk,
-- 
2.31.1




[RFC PATCH 1/4] block/aio: Add automatically released aio_context variants

2021-10-05 Thread Philippe Mathieu-Daudé
Similarly to commit 5626f8c6d46 ("rcu: Add automatically
released rcu_read_lock variants"):

AIO_CONTEXT_ACQUIRE_GUARD() acquires the aio context and then uses
glib's g_auto infrastructure (and thus whatever the compiler's hooks
are) to release it on all exits of the block.

WITH_AIO_CONTEXT_ACQUIRE_GUARD() is similar but is used as a wrapper
for the lock, i.e.:

   WITH_AIO_CONTEXT_ACQUIRE_GUARD() {
   stuff with context acquired
   }

Inspired-by: Dr. David Alan Gilbert 
Signed-off-by: Philippe Mathieu-Daudé 
---
 include/block/aio.h | 24 
 1 file changed, 24 insertions(+)

diff --git a/include/block/aio.h b/include/block/aio.h
index 47fbe9d81f2..4fa5a5c2720 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -294,6 +294,30 @@ void aio_context_acquire(AioContext *ctx);
 /* Relinquish ownership of the AioContext. */
 void aio_context_release(AioContext *ctx);
 
+static inline AioContext *aio_context_auto_acquire(AioContext *ctx)
+{
+aio_context_acquire(ctx);
+return ctx;
+}
+
+static inline void aio_context_auto_release(AioContext *ctx)
+{
+aio_context_release(ctx);
+}
+
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(AioContext, aio_context_auto_release)
+
+#define WITH_AIO_CONTEXT_ACQUIRE_GUARD(ctx) \
+WITH_AIO_CONTEXT_ACQUIRE_GUARD_(glue(_aio_context_auto, __COUNTER__), ctx)
+
+#define WITH_AIO_CONTEXT_ACQUIRE_GUARD_(var, ctx) \
+for (g_autoptr(AioContext) var = aio_context_auto_acquire(ctx); \
+(var); aio_context_auto_release(var), (var) = NULL)
+
+#define AIO_CONTEXT_ACQUIRE_GUARD(ctx) \
+g_autoptr(AioContext) _aio_context_auto __attribute__((unused)) \
+= aio_context_auto_acquire(ctx)
+
 /**
  * aio_bh_schedule_oneshot_full: Allocate a new bottom half structure that will
  * run only once and as soon as possible.
-- 
2.31.1
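
For readers unfamiliar with the g_auto machinery, here is a self-contained toy (plain glib, no QEMU types) showing why the for-loop form releases exactly once even on early exits; the real macro additionally uses __COUNTER__ to make the hidden variable name unique:

    /* Toy illustration of the guard pattern, assuming only glib. */
    #include <glib.h>
    #include <stdio.h>

    typedef struct ToyCtx { GMutex lock; } ToyCtx;

    static inline ToyCtx *toy_ctx_auto_acquire(ToyCtx *ctx)
    {
        g_mutex_lock(&ctx->lock);
        return ctx;
    }

    static inline void toy_ctx_auto_release(ToyCtx *ctx)
    {
        g_mutex_unlock(&ctx->lock);
    }

    G_DEFINE_AUTOPTR_CLEANUP_FUNC(ToyCtx, toy_ctx_auto_release)

    #define WITH_TOY_CTX_GUARD(ctx) \
        for (g_autoptr(ToyCtx) _guard = toy_ctx_auto_acquire(ctx); \
             (_guard); toy_ctx_auto_release(_guard), (_guard) = NULL)

    int main(void)
    {
        ToyCtx ctx;

        g_mutex_init(&ctx.lock);
        WITH_TOY_CTX_GUARD(&ctx) {
            printf("lock held inside the block\n");
            /* an early 'break' or 'return' here still unlocks, because the
             * g_autoptr cleanup runs when _guard leaves scope, and the
             * cleanup function generated by glib skips NULL pointers. */
        }
        printf("lock released after the block\n");
        g_mutex_clear(&ctx.lock);
        return 0;
    }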




[RFC PATCH 4/4] hw/block/virtio-blk: Use automatic AIO context lock

2021-10-05 Thread Philippe Mathieu-Daudé
Use the automatic AIO context acquire/release in virtio_blk_reset().

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/virtio-blk.c | 26 --
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index f139cd7cc9c..2dd6428e7b3 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -896,24 +896,22 @@ static void virtio_blk_dma_restart_cb(void *opaque, bool 
running,
 static void virtio_blk_reset(VirtIODevice *vdev)
 {
 VirtIOBlock *s = VIRTIO_BLK(vdev);
-AioContext *ctx;
 VirtIOBlockReq *req;
 
-ctx = blk_get_aio_context(s->blk);
-aio_context_acquire(ctx);
-blk_drain(s->blk);
-
-/* We drop queued requests after blk_drain() because blk_drain() itself can
- * produce them. */
-while (s->rq) {
-req = s->rq;
-s->rq = req->next;
-virtqueue_detach_element(req->vq, &req->elem, 0);
-virtio_blk_free_request(req);
+WITH_AIO_CONTEXT_ACQUIRE_GUARD(blk_get_aio_context(s->blk)) {
+blk_drain(s->blk);
+/*
+ * We drop queued requests after blk_drain() because
+ * blk_drain() itself can produce them.
+ */
+while (s->rq) {
+req = s->rq;
+s->rq = req->next;
+virtqueue_detach_element(req->vq, &req->elem, 0);
+virtio_blk_free_request(req);
+}
 }
 
-aio_context_release(ctx);
-
 assert(!s->dataplane_started);
 blk_set_enable_write_cache(s->blk, s->original_wce);
 }
-- 
2.31.1




[RFC PATCH 2/4] hw/scsi/scsi-disk: Use automatic AIO context lock

2021-10-05 Thread Philippe Mathieu-Daudé
Use the automatic AIO context acquire/release in scsi_block_realize().

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/scsi/scsi-disk.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index e8a547dbb7d..fa2d8543718 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2605,7 +2605,6 @@ static int get_device_type(SCSIDiskState *s)
 static void scsi_block_realize(SCSIDevice *dev, Error **errp)
 {
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, dev);
-AioContext *ctx;
 int sg_version;
 int rc;
 
@@ -2620,8 +2619,7 @@ static void scsi_block_realize(SCSIDevice *dev, Error 
**errp)
   "be removed in a future version");
 }
 
-ctx = blk_get_aio_context(s->qdev.conf.blk);
-aio_context_acquire(ctx);
+AIO_CONTEXT_ACQUIRE_GUARD(blk_get_aio_context(dev->conf.blk));
 
 /* check we are using a driver managing SG_IO (version 3 and after) */
 rc = blk_ioctl(s->qdev.conf.blk, SG_GET_VERSION_NUM, &sg_version);
@@ -2630,18 +2628,18 @@ static void scsi_block_realize(SCSIDevice *dev, Error 
**errp)
 if (rc != -EPERM) {
 error_append_hint(errp, "Is this a SCSI device?\n");
 }
-goto out;
+return;
 }
 if (sg_version < 3) {
 error_setg(errp, "scsi generic interface too old");
-goto out;
+return;
 }
 
 /* get device type from INQUIRY data */
 rc = get_device_type(s);
 if (rc < 0) {
 error_setg(errp, "INQUIRY failed");
-goto out;
+return;
 }
 
 /* Make a guess for the block size, we'll fix it when the guest sends.
@@ -2661,9 +2659,6 @@ static void scsi_block_realize(SCSIDevice *dev, Error 
**errp)
 
 scsi_realize(&s->qdev, errp);
 scsi_generic_read_device_inquiry(&s->qdev);
-
-out:
-aio_context_release(ctx);
 }
 
 typedef struct SCSIBlockReq {
-- 
2.31.1




[RFC PATCH 0/4] aio: AIO_CONTEXT_ACQUIRE_GUARD() macro experiment

2021-10-05 Thread Philippe Mathieu-Daudé
Experiment to use glib g_autoptr/autofree features with
AIO context.
Since this is an RFC, only a few examples are provided.

TODO: Document the macros in docs/devel/multiple-iothreads.txt

Philippe Mathieu-Daudé (4):
  block/aio: Add automatically released aio_context variants
  hw/scsi/scsi-disk: Use automatic AIO context lock
  hw/scsi/scsi-generic: Use automatic AIO context lock
  hw/block/virtio-blk: Use automatic AIO context lock

 include/block/aio.h| 24 
 hw/block/virtio-blk.c  | 26 --
 hw/scsi/scsi-disk.c| 13 -
 hw/scsi/scsi-generic.c |  6 +++---
 4 files changed, 43 insertions(+), 26 deletions(-)

-- 
2.31.1





Re: [PATCH 09/11] qdev: Avoid QemuOpts in QMP device_add

2021-10-05 Thread Kevin Wolf
On 05.10.2021 at 17:52, Damien Hedde wrote:
> 
> 
> On 10/5/21 16:37, Kevin Wolf wrote:
> > On 27.09.2021 at 13:39, Kevin Wolf wrote:
> > > On 27.09.2021 at 13:06, Damien Hedde wrote:
> > > > On 9/24/21 11:04, Kevin Wolf wrote:
> > > > > Directly call qdev_device_add_from_qdict() for QMP device_add instead 
> > > > > of
> > > > > first going through QemuOpts and converting back to QDict.
> > > > > 
> > > > > Note that this changes the behaviour of device_add, though in ways 
> > > > > that
> > > > > should be considered bug fixes:
> > > > > 
> > > > > QemuOpts ignores differences between data types, so you could
> > > > > successfully pass a string "123" for an integer property, or a string
> > > > > "on" for a boolean property (and vice versa).  After this change, the
> > > > > correct data type for the property must be used in the JSON input.
> > > > > 
> > > > > qemu_opts_from_qdict() also silently ignores any options whose value 
> > > > > is
> > > > > a QDict, QList or QNull.
> > > > > 
> > > > > To illustrate, the following QMP command was accepted before and is 
> > > > > now
> > > > > rejected for both reasons:
> > > > > 
> > > > > { "execute": "device_add",
> > > > > "arguments": { "driver": "scsi-cd",
> > > > >"drive": { "completely": "invalid" },
> > > > >"physical_block_size": "4096" } }
> > > > > 
> > > > > Signed-off-by: Kevin Wolf 
> > > > > ---
> > > > >softmmu/qdev-monitor.c | 18 +++---
> > > > >1 file changed, 11 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
> > > > > index c09b7430eb..8622ccade6 100644
> > > > > --- a/softmmu/qdev-monitor.c
> > > > > +++ b/softmmu/qdev-monitor.c
> > > > > @@ -812,7 +812,8 @@ void hmp_info_qdm(Monitor *mon, const QDict 
> > > > > *qdict)
> > > > >qdev_print_devinfos(true);
> > > > >}
> > > > > -void qmp_device_add(QDict *qdict, QObject **ret_data, Error **errp)
> > > > > +static void monitor_device_add(QDict *qdict, QObject **ret_data,
> > > > > +   bool from_json, Error **errp)
> > > > >{
> > > > >QemuOpts *opts;
> > > > >DeviceState *dev;
> > > > > @@ -825,7 +826,9 @@ void qmp_device_add(QDict *qdict, QObject 
> > > > > **ret_data, Error **errp)
> > > > >qemu_opts_del(opts);
> > > > >return;
> > > > >}
> > > > > -dev = qdev_device_add(opts, errp);
> > > > > +qemu_opts_del(opts);
> > > > > +
> > > > > +dev = qdev_device_add_from_qdict(qdict, from_json, errp);
> > > > 
> > > > Hi Kevin,
> > > > 
> > > > I'm wondering if deleting the opts (which removes it from the "device"
> > > > opts list) is really a no-op?
> > > 
> > > It's not exactly a no-op. Previously, the QemuOpts would only be freed
> > > when the device was destroyed; now we delete it immediately after
> > > creating the device. This could matter in some cases.
> > > 
> > > The one case I was aware of is that QemuOpts used to be responsible for
> > > checking for duplicate IDs. Obviously, it can't do this job any more
> > > when we call qemu_opts_del() right after creating the device. This is
> > > the reason for patch 6.
> > > 
> > > > The opts list is, e.g., traversed in hw/net/virtio-net.c in the function
> > > > failover_find_primary_device_id(), which may be called during
> > > > virtio_net_set_features() (a TYPE_VIRTIO_NET method).
> > > > I do not have the knowledge to tell when this method is called, but if
> > > > this happens after we create the devices, then the list will now be
> > > > empty at that point.
> > > > 
> > > > It seems there are two other call sites of
> > > > "qemu_opts_foreach(qemu_find_opts("device"), [...]" in net/vhost-user.c
> > > > and net/vhost-vdpa.c.
> > > 
> > > Yes, you are right. These callers probably need to be changed. Going
> > > through the command line options rather than looking at the actual
> > > device objects that exist doesn't feel entirely clean anyway.
> > 
> > So I tried to have a look at the virtio-net case, and ended up very
> > confused.
> > 
> > Obviously looking at command line options (even of a different device)
> > from within a device is very unclean. With a non-broken, i.e. type safe,
> > device-add (as well as with the JSON CLI option introduced by this
> > series), we can't have a QemuOpts any more that is by definition unsafe.
> > So this code needs a replacement.
> > 
> > My naive idea was that we just need to look at runtime state instead.
> > Don't search the options for a device with a matching 'failover_pair_id'
> > (which, by the way, would fail as soon as any other device introduces a
> > property with the same name), but search for actual PCIDevices in qdev
> > that have pci_dev->failover_pair_id set accordingly.
> > 
> > However, the logic in failover_add_primary() suggests that we can have a
> > state where QemuOpts for a device 
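
For illustration, one possible shape of the "look at runtime state instead of QemuOpts" idea above -- a rough sketch only, not a patch. object_child_foreach_recursive() and qdev_get_machine() are the existing QOM helpers; the FindFailoverArgs structure, the standby_dev_id variable and the exact matching rule are made up here for illustration.

    /* Sketch: find the primary device by walking the QOM tree for PCI devices
     * whose failover_pair_id matches, instead of scanning -device QemuOpts. */
    typedef struct FindFailoverArgs {
        const char *standby_id;   /* id of the standby virtio-net device */
        PCIDevice *found;
    } FindFailoverArgs;

    static int find_failover_primary(Object *obj, void *opaque)
    {
        FindFailoverArgs *args = opaque;
        Object *pci = object_dynamic_cast(obj, TYPE_PCI_DEVICE);

        if (pci && g_strcmp0(PCI_DEVICE(pci)->failover_pair_id,
                             args->standby_id) == 0) {
            args->found = PCI_DEVICE(pci);
            return 1;   /* non-zero stops the traversal */
        }
        return 0;
    }

    /* usage (sketch):
     *   FindFailoverArgs args = { .standby_id = standby_dev_id };
     *   object_child_foreach_recursive(qdev_get_machine(),
     *                                  find_failover_primary, &args);
     */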

[PATCH v7 3/8] qmp: add QMP command x-debug-query-virtio

2021-10-05 Thread Jonah Palmer
From: Laurent Vivier 

This new command lists all the instances of VirtIODevice with
their QOM paths and virtio type/name.

Signed-off-by: Jonah Palmer 
---
 hw/virtio/meson.build  |  2 ++
 hw/virtio/virtio-stub.c| 14 ++
 hw/virtio/virtio.c | 27 +++
 include/hw/virtio/virtio.h |  1 +
 qapi/meson.build   |  1 +
 qapi/qapi-schema.json  |  1 +
 qapi/virtio.json   | 66 ++
 tests/qtest/qmp-cmd-test.c |  1 +
 8 files changed, 113 insertions(+)
 create mode 100644 hw/virtio/virtio-stub.c
 create mode 100644 qapi/virtio.json

 [Jonah: VirtioInfo member 'type' is now of type string and no longer
  relies on defining a QAPI list of virtio device type enumerations
  to match the VirtIODevice name with qapi_enum_parse().] 
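
The patch below keeps a global list of realized VirtIODevice instances; for readers unfamiliar with the macros, here is a standalone toy of that tracking-list pattern, written against a BSD-style <sys/queue.h> (QEMU's QTAILQ_* macros in include/qemu/queue.h have the same shape):

    #include <sys/queue.h>
    #include <stdio.h>

    struct toy_dev {
        const char *name;
        TAILQ_ENTRY(toy_dev) next;     /* the patch adds such a field to VirtIODevice */
    };

    static TAILQ_HEAD(, toy_dev) toy_list = TAILQ_HEAD_INITIALIZER(toy_list);

    int main(void)
    {
        struct toy_dev a = { .name = "virtio-net" };
        struct toy_dev b = { .name = "virtio-blk" };
        struct toy_dev *d;

        TAILQ_INSERT_TAIL(&toy_list, &a, next);   /* cf. realize */
        TAILQ_INSERT_TAIL(&toy_list, &b, next);
        TAILQ_FOREACH(d, &toy_list, next) {       /* cf. the QMP command */
            printf("%s\n", d->name);
        }
        TAILQ_REMOVE(&toy_list, &a, next);        /* cf. unrealize */
        return 0;
    }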

diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index bc352a6..d409735 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -6,8 +6,10 @@ softmmu_virtio_ss.add(when: 'CONFIG_VHOST', if_false: 
files('vhost-stub.c'))
 
 softmmu_ss.add_all(when: 'CONFIG_VIRTIO', if_true: softmmu_virtio_ss)
 softmmu_ss.add(when: 'CONFIG_VIRTIO', if_false: files('vhost-stub.c'))
+softmmu_ss.add(when: 'CONFIG_VIRTIO', if_false: files('virtio-stub.c'))
 
 softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
+softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('virtio-stub.c'))
 
 virtio_ss = ss.source_set()
 virtio_ss.add(files('virtio.c'))
diff --git a/hw/virtio/virtio-stub.c b/hw/virtio/virtio-stub.c
new file mode 100644
index 000..d4a88f5
--- /dev/null
+++ b/hw/virtio/virtio-stub.c
@@ -0,0 +1,14 @@
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qapi/qapi-commands-virtio.h"
+
+static void *qmp_virtio_unsupported(Error **errp)
+{
+error_setg(errp, "Virtio is disabled");
+return NULL;
+}
+
+VirtioInfoList *qmp_x_debug_query_virtio(Error **errp)
+{
+return qmp_virtio_unsupported(errp);
+}
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 4af20c0..a454e2f 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -13,6 +13,8 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "qapi/qapi-commands-virtio.h"
+#include "qapi/qapi-visit-virtio.h"
 #include "cpu.h"
 #include "trace.h"
 #include "qemu/error-report.h"
@@ -29,6 +31,9 @@
 #include "sysemu/runstate.h"
 #include "standard-headers/linux/virtio_ids.h"
 
+/* QAPI list of VirtIODevices */
+static QTAILQ_HEAD(, VirtIODevice) virtio_list;
+
 /*
  * The alignment to use between consumer and producer parts of vring.
  * x86 pagesize again. This is the default, used by transports like PCI
@@ -3709,6 +3714,7 @@ static void virtio_device_realize(DeviceState *dev, Error 
**errp)
 
 vdev->listener.commit = virtio_memory_listener_commit;
 memory_listener_register(&vdev->listener, vdev->dma_as);
+QTAILQ_INSERT_TAIL(&virtio_list, vdev, next);
 }
 
 static void virtio_device_unrealize(DeviceState *dev)
@@ -3723,6 +3729,7 @@ static void virtio_device_unrealize(DeviceState *dev)
 vdc->unrealize(dev);
 }
 
+QTAILQ_REMOVE(&virtio_list, vdev, next);
 g_free(vdev->bus_name);
 vdev->bus_name = NULL;
 }
@@ -3896,6 +3903,8 @@ static void virtio_device_class_init(ObjectClass *klass, 
void *data)
 vdc->stop_ioeventfd = virtio_device_stop_ioeventfd_impl;
 
 vdc->legacy_features |= VIRTIO_LEGACY_FEATURES;
+
+QTAILQ_INIT(&virtio_list);
 }
 
 bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev)
@@ -3906,6 +3915,24 @@ bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev)
 return virtio_bus_ioeventfd_enabled(vbus);
 }
 
+VirtioInfoList *qmp_x_debug_query_virtio(Error **errp)
+{
+VirtioInfoList *list = NULL;
+VirtioInfoList *node;
+VirtIODevice *vdev;
+
+QTAILQ_FOREACH(vdev, &virtio_list, next) {
+DeviceState *dev = DEVICE(vdev);
+node = g_new0(VirtioInfoList, 1);
+node->value = g_new(VirtioInfo, 1);
+node->value->path = g_strdup(dev->canonical_path);
+node->value->type = g_strdup(vdev->name);
+QAPI_LIST_PREPEND(list, node->value);
+}
+
+return list;
+}
+
 static const TypeInfo virtio_device_info = {
 .name = TYPE_VIRTIO_DEVICE,
 .parent = TYPE_DEVICE,
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 105b98c..eceaafc 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -110,6 +110,7 @@ struct VirtIODevice
 bool use_guest_notifier_mask;
 AddressSpace *dma_as;
 QLIST_HEAD(, VirtQueue) *vector_queues;
+QTAILQ_ENTRY(VirtIODevice) next;
 };
 
 struct VirtioDeviceClass {
diff --git a/qapi/meson.build b/qapi/meson.build
index c356a38..df5662e 100644
--- a/qapi/meson.build
+++ b/qapi/meson.build
@@ -45,6 +45,7 @@ qapi_all_modules = [
   'sockets',
   'trace',
   'transaction',
+  'virtio',
   'yank',
 ]
 if have_system
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 4912b97..1512ada 100644
--- a/qapi/qapi-schema.json
+++ 

[PATCH v7 5/8] qmp: decode feature & status bits in virtio-status

2021-10-05 Thread Jonah Palmer
From: Laurent Vivier 

Display feature names instead of bitmaps for host, guest, and
backend for VirtIODevice.

Display status names instead of bitmaps for VirtIODevice.

Display feature names instead of bitmaps for backend, protocol,
acked, and features (hdev->features) for vhost devices.

Decode features according to device type. Decode status
according to configuration status bitmap (config_status_map).
Decode vhost user protocol features according to vhost user
protocol bitmap (vhost_user_protocol_map).

Transport features are on the first line. Undecoded bits
(if any) are stored in a separate field. The vhost device field
won't be shown if there's no vhost active for a given VirtIODevice.

Signed-off-by: Jonah Palmer 
---
 hw/block/virtio-blk.c  |  28 ++
 hw/char/virtio-serial-bus.c|  11 +
 hw/display/virtio-gpu-base.c   |  18 +-
 hw/input/virtio-input.c|  11 +-
 hw/net/virtio-net.c|  47 
 hw/scsi/virtio-scsi.c  |  17 ++
 hw/virtio/vhost-user-fs.c  |  10 +
 hw/virtio/vhost-vsock-common.c |  10 +
 hw/virtio/virtio-balloon.c |  14 +
 hw/virtio/virtio-crypto.c  |  10 +
 hw/virtio/virtio-iommu.c   |  14 +
 hw/virtio/virtio.c | 273 +++-
 include/hw/virtio/vhost.h  |   3 +
 include/hw/virtio/virtio.h |  17 ++
 qapi/virtio.json   | 574 ++---
 15 files changed, 1012 insertions(+), 45 deletions(-)

 [Jonah: Added vhost feature 'LOG_ALL' to all virtio feature maps for
  virtio devices that can use vhost. This includes virtio/vhost devices
  that previously did not have a feature map defined, such as
  virtio-input, vhost-user-fs, vhost-vsock, and virtio-crypto.

  Defined an enumeration and mapping of vhost user protocol features
  in virtio.c for decoding vhost user protocol features. Support to
  decode vhost user protocol features added.

  Added support to also decode VirtIODevice status bits via a mapping
  of virtio config statuses.

  Needed to define a list of QAPI enumerations of virtio device types
  here because we can't discriminate the virtio device type in
  VirtioDeviceFeatures QAPI union based on a 'str' QAPI type.] 
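
As a standalone illustration of the decode scheme described above (a sentinel-terminated bit-to-name table per device, with anything left over reported as undecoded bits; the bit positions below are only illustrative):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    struct feature_entry {
        uint64_t bit;
        const char *name;
    };

    /* illustrative subset; the real per-device maps are built with FEATURE_ENTRY() */
    static const struct feature_entry example_map[] = {
        { 1ULL << 1,  "size-max"  },
        { 1ULL << 2,  "seg-max"   },
        { 1ULL << 9,  "flush"     },
        { 1ULL << 32, "version-1" },
        { 0, NULL }                 /* sentinel, like { -1, -1 } in the patch */
    };

    static void decode_features(uint64_t features)
    {
        uint64_t undecoded = features;

        for (int i = 0; example_map[i].name; i++) {
            if (features & example_map[i].bit) {
                printf("%s ", example_map[i].name);
                undecoded &= ~example_map[i].bit;
            }
        }
        printf("\nundecoded bits: 0x%" PRIx64 "\n", undecoded);
    }

    int main(void)
    {
        decode_features((1ULL << 1) | (1ULL << 9) | (1ULL << 63));
        return 0;
    }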

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 505e574..c2e901f 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -13,6 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "qapi/qapi-visit-virtio.h"
 #include "qemu/iov.h"
 #include "qemu/module.h"
 #include "qemu/error-report.h"
@@ -32,6 +33,7 @@
 #include "hw/virtio/virtio-bus.h"
 #include "migration/qemu-file-types.h"
 #include "hw/virtio/virtio-access.h"
+#include "standard-headers/linux/vhost_types.h"
 
 /* Config size before the discard support (hide associated config fields) */
 #define VIRTIO_BLK_CFG_SIZE offsetof(struct virtio_blk_config, \
@@ -48,6 +50,32 @@ static const VirtIOFeature feature_sizes[] = {
 {}
 };
 
+qmp_virtio_feature_map_t blk_map[] = {
+#define FEATURE_ENTRY(name) \
+{ VIRTIO_BLK_F_##name, VIRTIO_BLK_FEATURE_##name }
+FEATURE_ENTRY(SIZE_MAX),
+FEATURE_ENTRY(SEG_MAX),
+FEATURE_ENTRY(GEOMETRY),
+FEATURE_ENTRY(RO),
+FEATURE_ENTRY(BLK_SIZE),
+FEATURE_ENTRY(TOPOLOGY),
+FEATURE_ENTRY(MQ),
+FEATURE_ENTRY(DISCARD),
+FEATURE_ENTRY(WRITE_ZEROES),
+#ifndef VIRTIO_BLK_NO_LEGACY
+FEATURE_ENTRY(BARRIER),
+FEATURE_ENTRY(SCSI),
+FEATURE_ENTRY(FLUSH),
+FEATURE_ENTRY(CONFIG_WCE),
+#endif /* !VIRTIO_BLK_NO_LEGACY */
+#undef FEATURE_ENTRY
+#define FEATURE_ENTRY(name) \
+{ VHOST_F_##name, VIRTIO_BLK_FEATURE_##name }
+FEATURE_ENTRY(LOG_ALL),
+#undef FEATURE_ENTRY
+{ -1, -1 }
+};
+
 static void virtio_blk_set_config_size(VirtIOBlock *s, uint64_t host_features)
 {
 s->config_size = MAX(VIRTIO_BLK_CFG_SIZE,
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index 746c92b..f91418b 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -20,6 +20,7 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "qapi/qapi-visit-virtio.h"
 #include "qemu/iov.h"
 #include "qemu/main-loop.h"
 #include "qemu/module.h"
@@ -32,6 +33,16 @@
 #include "hw/virtio/virtio-serial.h"
 #include "hw/virtio/virtio-access.h"
 
+qmp_virtio_feature_map_t serial_map[] = {
+#define FEATURE_ENTRY(name) \
+{ VIRTIO_CONSOLE_F_##name, VIRTIO_SERIAL_FEATURE_##name }
+FEATURE_ENTRY(SIZE),
+FEATURE_ENTRY(MULTIPORT),
+FEATURE_ENTRY(EMERG_WRITE),
+#undef FEATURE_ENTRY
+{ -1, -1 }
+};
+
 static struct VirtIOSerialDevices {
 QLIST_HEAD(, VirtIOSerial) devices;
 } vserdevices;
diff --git a/hw/display/virtio-gpu-base.c b/hw/display/virtio-gpu-base.c
index 5411a7b..a322349 100644
--- a/hw/display/virtio-gpu-base.c
+++ b/hw/display/virtio-gpu-base.c
@@ -12,13 +12,29 @@
  */
 
 #include "qemu/osdep.h"
-
+#include "standard-headers/linux/vhost_types.h"
 #include "hw/virtio/virtio-gpu.h"
 #include "migration/blocker.h"
 #include "qapi/error.h"

Re: [RFC PATCH v2 03/25] block/block-backend.c: assertions for block-backend

2021-10-05 Thread Eric Blake
On Tue, Oct 05, 2021 at 10:31:53AM -0400, Emanuele Giuseppe Esposito wrote:
> All the global state (GS) API functions will check that
> qemu_in_main_thread() returns true. If not, it means
> that the safety of BQL cannot be guaranteed, and
> they need to be moved to I/O.
> 
> Signed-off-by: Emanuele Giuseppe Esposito 
> ---
>  block/block-backend.c  | 89 +-
>  softmmu/qdev-monitor.c |  2 +
>  2 files changed, 90 insertions(+), 1 deletion(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index d31ae16b99..9cd3b27b53 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -227,6 +227,7 @@ static void blk_root_activate(BdrvChild *child, Error 
> **errp)
>  
>  void blk_set_force_allow_inactivate(BlockBackend *blk)
>  {
> +g_assert(qemu_in_main_thread());

Why g_assert()?

> @@ -661,6 +676,7 @@ bool monitor_add_blk(BlockBackend *blk, const char *name, 
> Error **errp)
>  {
>  assert(!blk->name);
>  assert(name && name[0]);
> +g_assert(qemu_in_main_thread());

especially why mixed spellings?

Per osdep.h, we don't support builds with NDEBUG or G_DISABLE_ASSERT
defined to their non-default values, so behavior isn't really
different, but consistency says we use 'assert' more frequently than
'g_assert'.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




[PATCH v7 8/8] hmp: add virtio commands

2021-10-05 Thread Jonah Palmer
From: Laurent Vivier 

This patch implements the HMP versions of the virtio QMP commands.

Signed-off-by: Jonah Palmer 
---
 docs/system/monitor.rst |   2 +
 hmp-commands-virtio.hx  | 250 ++
 hmp-commands.hx |  10 ++
 hw/virtio/virtio.c  | 355 
 include/monitor/hmp.h   |   5 +
 meson.build |   1 +
 monitor/misc.c  |  17 +++
 7 files changed, 640 insertions(+)
 create mode 100644 hmp-commands-virtio.hx

 [Jonah: Added new HMP command for vhost queue status
  (virtio vhost-queue-status) as well as HMP helper functions to dump
  decoded vhost protocol features and virtio device configuration
  statuses.]

diff --git a/docs/system/monitor.rst b/docs/system/monitor.rst
index ff5c434..10418fc 100644
--- a/docs/system/monitor.rst
+++ b/docs/system/monitor.rst
@@ -21,6 +21,8 @@ The following commands are available:
 
 .. hxtool-doc:: hmp-commands.hx
 
+.. hxtool-doc:: hmp-commands-virtio.hx
+
 .. hxtool-doc:: hmp-commands-info.hx
 
 Integer expressions
diff --git a/hmp-commands-virtio.hx b/hmp-commands-virtio.hx
new file mode 100644
index 000..36aab94
--- /dev/null
+++ b/hmp-commands-virtio.hx
@@ -0,0 +1,250 @@
+HXCOMM Use DEFHEADING() to define headings in both help text and rST.
+HXCOMM Text between SRST and ERST is copied to the rST version and
+HXCOMM discarded from C version.
+HXCOMM
+HXCOMM DEF(command, args, callback, arg_string, help) is used to construct
+HXCOMM monitor info commands.
+HXCOMM
+HXCOMM HXCOMM can be used for comments, discarded from both rST and C.
+HXCOMM
+HXCOMM In this file, generally SRST fragments should have two extra
+HXCOMM spaces of indent, so that the documentation list item for "virtio cmd"
+HXCOMM appears inside the documentation list item for the top level
+HXCOMM "virtio" documentation entry. The exception is the first SRST
+HXCOMM fragment that defines that top level entry.
+
+SRST
+  ``virtio`` *subcommand*
+  Show various information about virtio
+
+  Example:
+
+  List all sub-commands::
+
+  (qemu) virtio
+  virtio query  -- List all available virtio devices
+  virtio status path -- Display status of a given virtio device
+  virtio queue-status path queue -- Display status of a given virtio queue
+  virtio vhost-queue-status path queue -- Display status of a given vhost queue
+  virtio queue-element path queue [index] -- Display element of a given virtio 
queue
+
+ERST
+
+  {
+.name   = "query",
+.args_type  = "",
+.params = "",
+.help   = "List all available virtio devices",
+.cmd= hmp_virtio_query,
+.flags  = "p",
+  },
+
+SRST
+  ``virtio query``
+  List all available virtio devices
+
+  Example:
+
+  List all available virtio devices in the machine::
+
+  (qemu) virtio query
+  /machine/peripheral/vsock0/virtio-backend [vhost-vsock]
+  /machine/peripheral/crypto0/virtio-backend [virtio-crypto]
+  /machine/peripheral-anon/device[2]/virtio-backend [virtio-scsi]
+  /machine/peripheral-anon/device[1]/virtio-backend [virtio-net]
+  /machine/peripheral-anon/device[0]/virtio-backend [virtio-serial]
+
+ERST
+
+  {
+.name   = "status",
+.args_type  = "path:s",
+.params = "path",
+.help   = "Display status of a given virtio device",
+.cmd= hmp_virtio_status,
+.flags  = "p",
+  },
+
+SRST
+  ``virtio status`` *path*
+  Display status of a given virtio device
+
+  Example:
+
+  Dump the status of virtio-net (vhost on)::
+
+  (qemu) virtio status /machine/peripheral-anon/device[1]/virtio-backend
+  /machine/peripheral-anon/device[1]/virtio-backend:
+device_name: virtio-net (vhost)
+device_id:   1
+vhost_started:   true
+bus_name:(null)
+broken:  false
+disabled:false
+disable_legacy_check:false
+started: true
+use_started: true
+start_on_kick:   false
+use_guest_notifier_mask: true
+vm_running:  true
+num_vqs: 3
+queue_sel:   2
+isr: 1
+endianness:  little
+status: acknowledge, driver, features-ok, driver-ok
+Guest features:   event-idx, indirect-desc, version-1
+  ctrl-mac-addr, guest-announce, ctrl-vlan, ctrl-rx, 
ctrl-vq, status, mrg-rxbuf,
+  host-ufo, host-ecn, host-tso6, host-tso4, guest-ufo, 
guest-ecn, guest-tso6,
+  guest-tso4, mac, ctrl-guest-offloads, guest-csum, csum
+Host features:protocol-features, event-idx, indirect-desc, version-1, 
any-layout, notify-on-empty
+  gso, ctrl-mac-addr, guest-announce, ctrl-rx-extra, 
ctrl-vlan, ctrl-rx, ctrl-vq,
+  status, mrg-rxbuf, host-ufo, host-ecn, host-tso6, 
host-tso4, guest-ufo, guest-ecn,
+  guest-tso6, guest-tso4, mac, 

[PATCH v7 2/8] virtio: add vhost support for virtio devices

2021-10-05 Thread Jonah Palmer
This patch adds a get_vhost() callback function for VirtIODevices that
returns the device's corresponding vhost_dev structure if the vhost
device is running. This patch also adds a vhost_started flag for VirtIODevices.

Previously, a VirtIODevice wouldn't be able to tell if its corresponding
vhost device was active or not.

Signed-off-by: Jonah Palmer 
---
 hw/block/vhost-user-blk.c  |  7 +++
 hw/display/vhost-user-gpu.c|  7 +++
 hw/input/vhost-user-input.c|  7 +++
 hw/net/virtio-net.c|  9 +
 hw/scsi/vhost-scsi.c   |  8 
 hw/virtio/vhost-user-fs.c  |  7 +++
 hw/virtio/vhost-vsock-common.c |  7 +++
 hw/virtio/vhost.c  |  3 +++
 hw/virtio/virtio-crypto.c  | 10 ++
 hw/virtio/virtio.c |  1 +
 include/hw/virtio/virtio.h |  3 +++
 11 files changed, 69 insertions(+)
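
For context, later patches in this series consume the new callback roughly like this (excerpt-style sketch of the pattern from patch 4/8, not standalone code):

    if (vdev->vhost_started) {
        VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
        struct vhost_dev *hdev = vdc->get_vhost(vdev);

        /* hdev->features, hdev->nvqs, hdev->vq_index, ... can now be reported */
    }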

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index f61f8c1..b059da1 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -568,6 +568,12 @@ static void vhost_user_blk_instance_init(Object *obj)
   "/disk@0,0", DEVICE(obj));
 }
 
+static struct vhost_dev *vhost_user_blk_get_vhost(VirtIODevice *vdev)
+{
+VHostUserBlk *s = VHOST_USER_BLK(vdev);
+return &s->dev;
+}
+
 static const VMStateDescription vmstate_vhost_user_blk = {
 .name = "vhost-user-blk",
 .minimum_version_id = 1,
@@ -602,6 +608,7 @@ static void vhost_user_blk_class_init(ObjectClass *klass, 
void *data)
 vdc->get_features = vhost_user_blk_get_features;
 vdc->set_status = vhost_user_blk_set_status;
 vdc->reset = vhost_user_blk_reset;
+vdc->get_vhost = vhost_user_blk_get_vhost;
 }
 
 static const TypeInfo vhost_user_blk_info = {
diff --git a/hw/display/vhost-user-gpu.c b/hw/display/vhost-user-gpu.c
index 49df56c..6e93b46 100644
--- a/hw/display/vhost-user-gpu.c
+++ b/hw/display/vhost-user-gpu.c
@@ -565,6 +565,12 @@ vhost_user_gpu_device_realize(DeviceState *qdev, Error 
**errp)
 g->vhost_gpu_fd = -1;
 }
 
+static struct vhost_dev *vhost_user_gpu_get_vhost(VirtIODevice *vdev)
+{
+VhostUserGPU *g = VHOST_USER_GPU(vdev);
+return &g->vhost->dev;
+}
+
 static Property vhost_user_gpu_properties[] = {
 VIRTIO_GPU_BASE_PROPERTIES(VhostUserGPU, parent_obj.conf),
 DEFINE_PROP_END_OF_LIST(),
@@ -586,6 +592,7 @@ vhost_user_gpu_class_init(ObjectClass *klass, void *data)
 vdc->guest_notifier_pending = vhost_user_gpu_guest_notifier_pending;
 vdc->get_config = vhost_user_gpu_get_config;
 vdc->set_config = vhost_user_gpu_set_config;
+vdc->get_vhost = vhost_user_gpu_get_vhost;
 
 device_class_set_props(dc, vhost_user_gpu_properties);
 }
diff --git a/hw/input/vhost-user-input.c b/hw/input/vhost-user-input.c
index 273e96a..43d2ff3 100644
--- a/hw/input/vhost-user-input.c
+++ b/hw/input/vhost-user-input.c
@@ -79,6 +79,12 @@ static void vhost_input_set_config(VirtIODevice *vdev,
 virtio_notify_config(vdev);
 }
 
+static struct vhost_dev *vhost_input_get_vhost(VirtIODevice *vdev)
+{
+VHostUserInput *vhi = VHOST_USER_INPUT(vdev);
+return &vhi->vhost->dev;
+}
+
 static const VMStateDescription vmstate_vhost_input = {
 .name = "vhost-user-input",
 .unmigratable = 1,
@@ -93,6 +99,7 @@ static void vhost_input_class_init(ObjectClass *klass, void 
*data)
 dc->vmsd = &vmstate_vhost_input;
 vdc->get_config = vhost_input_get_config;
 vdc->set_config = vhost_input_set_config;
+vdc->get_vhost = vhost_input_get_vhost;
 vic->realize = vhost_input_realize;
 vic->change_active = vhost_input_change_active;
 }
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index bf59f8b..6e54436 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3549,6 +3549,14 @@ static bool dev_unplug_pending(void *opaque)
 return vdc->primary_unplug_pending(dev);
 }
 
+static struct vhost_dev *virtio_net_get_vhost(VirtIODevice *vdev)
+{
+VirtIONet *n = VIRTIO_NET(vdev);
+NetClientState *nc = qemu_get_queue(n->nic);
+struct vhost_net *net = get_vhost_net(nc->peer);
+return &net->dev;
+}
+
 static const VMStateDescription vmstate_virtio_net = {
 .name = "virtio-net",
 .minimum_version_id = VIRTIO_NET_VM_VERSION,
@@ -3651,6 +3659,7 @@ static void virtio_net_class_init(ObjectClass *klass, 
void *data)
 vdc->post_load = virtio_net_post_load_virtio;
 vdc->vmsd = &vmstate_virtio_net_device;
 vdc->primary_unplug_pending = primary_unplug_pending;
+vdc->get_vhost = virtio_net_get_vhost;
 }
 
 static const TypeInfo virtio_net_info = {
diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 039caf2..b0a9c45 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -264,6 +264,13 @@ static void vhost_scsi_unrealize(DeviceState *dev)
 virtio_scsi_common_unrealize(dev);
 }
 
+static struct vhost_dev *vhost_scsi_get_vhost(VirtIODevice *vdev)
+{
+VHostSCSI *s = VHOST_SCSI(vdev);
+VHostSCSICommon *vsc = VHOST_SCSI_COMMON(s);
+  

[PATCH 2/2] block/aio_task: assert `max_busy_tasks` is greater than 0

2021-10-05 Thread Stefano Garzarella
All code in block/aio_task.c expects `max_busy_tasks` to always
be greater than 0.

Assert this condition during the AioTaskPool creation where
`max_busy_tasks` is set.

Signed-off-by: Stefano Garzarella 
---
 block/aio_task.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..9bd17ea2c1 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -98,6 +98,8 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int 
max_busy_tasks)
 {
 AioTaskPool *pool = g_new0(AioTaskPool, 1);
 
+assert(max_busy_tasks > 0);
+
 pool->main_co = qemu_coroutine_self();
 pool->max_busy_tasks = max_busy_tasks;
 
-- 
2.31.1




[PATCH v7 4/8] qmp: add QMP command x-debug-virtio-status

2021-10-05 Thread Jonah Palmer
From: Laurent Vivier 

This new command shows the status of a VirtIODevice, including
its corresponding vhost device status (if active).

The next patch will improve the output by decoding feature bits, including
the vhost device's feature bits (backend, protocol, acked, and features).
It will also decode the status bits of a VirtIODevice.

The next patch will also suppress the vhost device field from being
displayed if no vhost device is active for a given VirtIODevice.

Signed-off-by: Jonah Palmer 
---
 hw/virtio/virtio-stub.c |   5 +
 hw/virtio/virtio.c  |  96 +++
 qapi/virtio.json| 245 
 3 files changed, 346 insertions(+)

 [Jonah: Added more fields of VirtIODevice to display including name,
  status, isr, queue_sel, vm_running, broken, disabled, used_started,
  started, start_on_kick, disable_legacy_check, bus_name, and
  use_guest_notifier_mask.

  Also added vhost support that displays the status of the
  VirtIODevice's corresponding vhost device if it's active.
  Vhost device fields include n_mem_sections, n_tmp_sections, nvqs,
  vq_index, features, acked_features, backend_features,
  protocol_features, max_queues, backend_cap, log_enabled and
  log_size.] 

diff --git a/hw/virtio/virtio-stub.c b/hw/virtio/virtio-stub.c
index d4a88f5..ddb592f 100644
--- a/hw/virtio/virtio-stub.c
+++ b/hw/virtio/virtio-stub.c
@@ -12,3 +12,8 @@ VirtioInfoList *qmp_x_debug_query_virtio(Error **errp)
 {
 return qmp_virtio_unsupported(errp);
 }
+
+VirtioStatus *qmp_x_debug_virtio_status(const char* path, Error **errp)
+{
+return qmp_virtio_unsupported(errp);
+}
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index a454e2f..04a44e8 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3933,6 +3933,102 @@ VirtioInfoList *qmp_x_debug_query_virtio(Error **errp)
 return list;
 }
 
+static VirtIODevice *virtio_device_find(const char *path)
+{
+VirtIODevice *vdev;
+
+QTAILQ_FOREACH(vdev, &virtio_list, next) {
+DeviceState *dev = DEVICE(vdev);
+
+if (strcmp(dev->canonical_path, path) != 0) {
+continue;
+}
+return vdev;
+}
+
+return NULL;
+}
+
+VirtioStatus *qmp_x_debug_virtio_status(const char *path, Error **errp)
+{
+VirtIODevice *vdev;
+VirtioStatus *status;
+
+vdev = virtio_device_find(path);
+if (vdev == NULL) {
+error_setg(errp, "Path %s is not a VirtIO device", path);
+return NULL;
+}
+
+status = g_new0(VirtioStatus, 1);
+status->vhost_dev = g_new0(VhostStatus, 1);
+status->name = g_strdup(vdev->name);
+status->device_id = vdev->device_id;
+status->vhost_started = vdev->vhost_started;
+status->guest_features = vdev->guest_features;
+status->host_features = vdev->host_features;
+status->backend_features = vdev->backend_features;
+
+switch (vdev->device_endian) {
+case VIRTIO_DEVICE_ENDIAN_LITTLE:
+status->device_endian = VIRTIO_STATUS_ENDIANNESS_LITTLE;
+break;
+case VIRTIO_DEVICE_ENDIAN_BIG:
+status->device_endian = VIRTIO_STATUS_ENDIANNESS_BIG;
+break;
+default:
+status->device_endian = VIRTIO_STATUS_ENDIANNESS_UNKNOWN;
+break;
+}
+
+status->num_vqs = virtio_get_num_queues(vdev);
+status->status = vdev->status;
+status->isr = vdev->isr;
+status->queue_sel = vdev->queue_sel;
+status->vm_running = vdev->vm_running;
+status->broken = vdev->broken;
+status->disabled = vdev->disabled;
+status->use_started = vdev->use_started;
+status->started = vdev->started;
+status->start_on_kick = vdev->start_on_kick;
+status->disable_legacy_check = vdev->disable_legacy_check;
+status->bus_name = g_strdup(vdev->bus_name);
+status->use_guest_notifier_mask = vdev->use_guest_notifier_mask;
+
+if (vdev->vhost_started) {
+VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
+struct vhost_dev *hdev = vdc->get_vhost(vdev);
+
+status->vhost_dev->n_mem_sections = hdev->n_mem_sections;
+status->vhost_dev->n_tmp_sections = hdev->n_tmp_sections;
+status->vhost_dev->nvqs = hdev->nvqs;
+status->vhost_dev->vq_index = hdev->vq_index;
+status->vhost_dev->features = hdev->features;
+status->vhost_dev->acked_features = hdev->acked_features;
+status->vhost_dev->backend_features = hdev->backend_features;
+status->vhost_dev->protocol_features = hdev->protocol_features;
+status->vhost_dev->max_queues = hdev->max_queues;
+status->vhost_dev->backend_cap = hdev->backend_cap;
+status->vhost_dev->log_enabled = hdev->log_enabled;
+status->vhost_dev->log_size = hdev->log_size;
+} else {
+status->vhost_dev->n_mem_sections = 0;
+status->vhost_dev->n_tmp_sections = 0;
+status->vhost_dev->nvqs = 0;
+status->vhost_dev->vq_index = 0;
+status->vhost_dev->features = 0;
+

[PATCH v7 6/8] qmp: add QMP commands for virtio/vhost queue-status

2021-10-05 Thread Jonah Palmer
From: Laurent Vivier 

These new commands show the internal status of a VirtIODevice's
VirtQueue and a vhost device's vhost_virtqueue (if active).

Signed-off-by: Jonah Palmer 
---
 hw/virtio/virtio-stub.c |  14 +++
 hw/virtio/virtio.c  | 103 +++
 qapi/virtio.json| 262 
 3 files changed, 379 insertions(+)

 [Jonah: Added vhost support for qmp_x_debug_virtio_queue_status such
  that if a VirtIODevice's vhost device is active, shadow_avail_idx
  is hidden and last_avail_idx is retrieved from the vhost op
  vhost_get_vring_base().

  Also added a new QMP command qmp_x_debug_virtio_vhost_queue_status
  that shows the internal status of a VirtIODevice's vhost device if
  it's active.]

diff --git a/hw/virtio/virtio-stub.c b/hw/virtio/virtio-stub.c
index ddb592f..387803d 100644
--- a/hw/virtio/virtio-stub.c
+++ b/hw/virtio/virtio-stub.c
@@ -17,3 +17,17 @@ VirtioStatus *qmp_x_debug_virtio_status(const char* path, 
Error **errp)
 {
 return qmp_virtio_unsupported(errp);
 }
+
+VirtVhostQueueStatus *qmp_x_debug_virtio_vhost_queue_status(const char *path,
+uint16_t queue,
+Error **errp)
+{
+return qmp_virtio_unsupported(errp);
+}
+
+VirtQueueStatus *qmp_x_debug_virtio_queue_status(const char *path,
+ uint16_t queue,
+ Error **errp)
+{
+return qmp_virtio_unsupported(errp);
+}
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index f0e2b40..8d74dbf 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -4284,6 +4284,109 @@ VirtioStatus *qmp_x_debug_virtio_status(const char 
*path, Error **errp)
 return status;
 }
 
+VirtVhostQueueStatus *qmp_x_debug_virtio_vhost_queue_status(const char *path,
+uint16_t queue,
+Error **errp)
+{
+VirtIODevice *vdev;
+VirtVhostQueueStatus *status;
+
+vdev = virtio_device_find(path);
+if (vdev == NULL) {
+error_setg(errp, "Path %s is not a VirtIODevice", path);
+return NULL;
+}
+
+if (!vdev->vhost_started) {
+error_setg(errp, "Error: vhost device has not started yet");
+return NULL;
+}
+
+VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
+struct vhost_dev *hdev = vdc->get_vhost(vdev);
+
+if (queue < hdev->vq_index || queue >= hdev->vq_index + hdev->nvqs) {
+error_setg(errp, "Invalid vhost virtqueue number %d", queue);
+return NULL;
+}
+
+status = g_new0(VirtVhostQueueStatus, 1);
+status->device_name = g_strdup(vdev->name);
+status->kick = hdev->vqs[queue].kick;
+status->call = hdev->vqs[queue].call;
+status->desc = (uint64_t)(unsigned long)hdev->vqs[queue].desc;
+status->avail = (uint64_t)(unsigned long)hdev->vqs[queue].avail;
+status->used = (uint64_t)(unsigned long)hdev->vqs[queue].used;
+status->num = hdev->vqs[queue].num;
+status->desc_phys = hdev->vqs[queue].desc_phys;
+status->desc_size = hdev->vqs[queue].desc_size;
+status->avail_phys = hdev->vqs[queue].avail_phys;
+status->avail_size = hdev->vqs[queue].avail_size;
+status->used_phys = hdev->vqs[queue].used_phys;
+status->used_size = hdev->vqs[queue].used_size;
+
+return status;
+}
+
+VirtQueueStatus *qmp_x_debug_virtio_queue_status(const char *path,
+ uint16_t queue,
+ Error **errp)
+{
+VirtIODevice *vdev;
+VirtQueueStatus *status;
+
+vdev = virtio_device_find(path);
+if (vdev == NULL) {
+error_setg(errp, "Path %s is not a VirtIODevice", path);
+return NULL;
+}
+
+if (queue >= VIRTIO_QUEUE_MAX || !virtio_queue_get_num(vdev, queue)) {
+error_setg(errp, "Invalid virtqueue number %d", queue);
+return NULL;
+}
+
+status = g_new0(VirtQueueStatus, 1);
+status->device_name = g_strdup(vdev->name);
+status->queue_index = vdev->vq[queue].queue_index;
+status->inuse = vdev->vq[queue].inuse;
+status->vring_num = vdev->vq[queue].vring.num;
+status->vring_num_default = vdev->vq[queue].vring.num_default;
+status->vring_align = vdev->vq[queue].vring.align;
+status->vring_desc = vdev->vq[queue].vring.desc;
+status->vring_avail = vdev->vq[queue].vring.avail;
+status->vring_used = vdev->vq[queue].vring.used;
+status->used_idx = vdev->vq[queue].used_idx;
+status->signalled_used = vdev->vq[queue].signalled_used;
+status->signalled_used_valid = vdev->vq[queue].signalled_used_valid;
+
+if (vdev->vhost_started) {
+VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
+struct vhost_dev *hdev = vdc->get_vhost(vdev);
+
+/* check if vq index 

[PATCH v7 7/8] qmp: add QMP command x-debug-virtio-queue-element

2021-10-05 Thread Jonah Palmer
From: Laurent Vivier 

This new command shows the information of a VirtQueue element.

Signed-off-by: Jonah Palmer 
---
 hw/virtio/virtio-stub.c |   9 +++
 hw/virtio/virtio.c  | 154 ++
 qapi/virtio.json| 191 
 3 files changed, 354 insertions(+)

 [Jonah: Added support to display driver (used vring) and device
  (avail vring) areas, including a new function vring_used_flags()
  to retrieve the used vring flags of a given element.]

diff --git a/hw/virtio/virtio-stub.c b/hw/virtio/virtio-stub.c
index 387803d..6c282b3 100644
--- a/hw/virtio/virtio-stub.c
+++ b/hw/virtio/virtio-stub.c
@@ -31,3 +31,12 @@ VirtQueueStatus *qmp_x_debug_virtio_queue_status(const char 
*path,
 {
 return qmp_virtio_unsupported(errp);
 }
+
+VirtioQueueElement *qmp_x_debug_virtio_queue_element(const char *path,
+ uint16_t queue,
+ bool has_index,
+ uint16_t index,
+ Error **errp)
+{
+return qmp_virtio_unsupported(errp);
+}
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 8d74dbf..0d67a36 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -478,6 +478,19 @@ static inline void vring_used_write(VirtQueue *vq, 
VRingUsedElem *uelem,
 address_space_cache_invalidate(&caches->used, pa, sizeof(VRingUsedElem));
 }
 
+/* Called within rcu_read_lock(). */
+static inline uint16_t vring_used_flags(VirtQueue *vq)
+{
+VRingMemoryRegionCaches *caches = vring_get_region_caches(vq);
+hwaddr pa = offsetof(VRingUsed, flags);
+
+if (!caches) {
+return 0;
+}
+
+return virtio_lduw_phys_cached(vq->vdev, &caches->used, pa);
+}
+
 /* Called within rcu_read_lock().  */
 static uint16_t vring_used_idx(VirtQueue *vq)
 {
@@ -4387,6 +4400,147 @@ VirtQueueStatus *qmp_x_debug_virtio_queue_status(const 
char *path,
 return status;
 }
 
+static VirtioRingDescFlagsList *qmp_decode_vring_desc_flags(uint16_t flags)
+{
+VirtioRingDescFlagsList *list = NULL;
+VirtioRingDescFlagsList *node;
+int i;
+
+struct {
+uint16_t flag;
+VirtioRingDescFlags value;
+} map[] = {
+{ VRING_DESC_F_NEXT, VIRTIO_RING_DESC_FLAGS_NEXT },
+{ VRING_DESC_F_WRITE, VIRTIO_RING_DESC_FLAGS_WRITE },
+{ VRING_DESC_F_INDIRECT, VIRTIO_RING_DESC_FLAGS_INDIRECT },
+{ 1 << VRING_PACKED_DESC_F_AVAIL, VIRTIO_RING_DESC_FLAGS_AVAIL },
+{ 1 << VRING_PACKED_DESC_F_USED, VIRTIO_RING_DESC_FLAGS_USED },
+{ 0, -1 }
+};
+
+for (i = 0; map[i].flag; i++) {
+if ((map[i].flag & flags) == 0) {
+continue;
+}
+node = g_malloc0(sizeof(VirtioRingDescFlagsList));
+node->value = map[i].value;
+node->next = list;
+list = node;
+}
+
+return list;
+}
+
+VirtioQueueElement *qmp_x_debug_virtio_queue_element(const char *path,
+ uint16_t queue,
+ bool has_index,
+ uint16_t index,
+ Error **errp)
+{
+VirtIODevice *vdev;
+VirtQueue *vq;
+VirtioQueueElement *element = NULL;
+
+vdev = virtio_device_find(path);
+if (vdev == NULL) {
+error_setg(errp, "Path %s is not a VirtIO device", path);
+return NULL;
+}
+
+if (queue >= VIRTIO_QUEUE_MAX || !virtio_queue_get_num(vdev, queue)) {
+error_setg(errp, "Invalid virtqueue number %d", queue);
+return NULL;
+}
+vq = &vdev->vq[queue];
+
+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
+error_setg(errp, "Packed ring not supported");
+return NULL;
+} else {
+unsigned int head, i, max;
+VRingMemoryRegionCaches *caches;
+MemoryRegionCache indirect_desc_cache = MEMORY_REGION_CACHE_INVALID;
+MemoryRegionCache *desc_cache;
+VRingDesc desc;
+VirtioRingDescList *list = NULL;
+VirtioRingDescList *node;
+int rc;
+
+RCU_READ_LOCK_GUARD();
+
+max = vq->vring.num;
+
+if (!has_index) {
+head = vring_avail_ring(vq, vq->last_avail_idx % vq->vring.num);
+} else {
+head = vring_avail_ring(vq, index % vq->vring.num);
+}
+i = head;
+
+caches = vring_get_region_caches(vq);
+if (!caches) {
+error_setg(errp, "Region caches not initialized");
+return NULL;
+}
+if (caches->desc.len < max * sizeof(VRingDesc)) {
+error_setg(errp, "Cannot map descriptor ring");
+return NULL;
+}
+
+desc_cache = >desc;
+vring_split_desc_read(vdev, &desc, desc_cache, i);
+if (desc.flags & VRING_DESC_F_INDIRECT) 

[PATCH v7 0/8] hmp,qmp: Add commands to introspect virtio devices

2021-10-05 Thread Jonah Palmer
This series introduces new QMP/HMP commands to dump the status of a
virtio device at different levels.

[Jonah: Rebasing previous patchset from July (v6). Original patches
 are from Laurent Vivier from May 2020.

 Rebase from v6 to v7 includes adding ability to map between the
 numeric device ID and the string device ID (virtio device name), a
 get_vhost() callback function for VirtIODevices, display more fields
 of a VirtIODevice (including its corresponding vhost device),
 support to decode vhost user protocol features, support to decode
 virtio configuration statuses, vhost support for displaying virtio
 queue statuses including a new command to introspect a vhost device's
 queue status, and lastly support to display driver and device areas
 when introspecting a VirtIODevice's virtqueue element.]

1. Main command

HMP Only:

virtio [subcommand]

Example:

List all sub-commands:

(qemu) virtio
virtio query  -- List all available virtio devices
virtio status path -- Display status of a given virtio device
virtio queue-status path queue -- Display status of a given virtio queue
virtio vhost-queue-status path queue -- Display status of a given vhost queue
virtio queue-element path queue [index] -- Display element of a given virtio queue

2. List available virtio devices in the machine

HMP Form:

virtio query

Example:

(qemu) virtio query
/machine/peripheral/vsock0/virtio-backend [vhost-vsock]
/machine/peripheral/crypto0/virtio-backend [virtio-crypto]
/machine/peripheral-anon/device[2]/virtio-backend [virtio-scsi]
/machine/peripheral-anon/device[1]/virtio-backend [virtio-net]
/machine/peripheral-anon/device[0]/virtio-backend [virtio-serial]

QMP Form:

{ 'command': 'x-debug-query-virtio', 'returns': ['VirtioInfo'] }

Example:

-> { "execute": "x-debug-query-virtio" }
<- { "return": [
{
"path": "/machine/peripheral/vsock0/virtio-backend",
"type": "vhost-vsock"
},
{
"path": "/machine/peripheral/crypto0/virtio-backend",
"type": "virtio-crypto"
},
{
"path": "/machine/peripheral-anon/device[2]/virtio-backend",
"type": "virtio-scsi"
},
{
"path": "/machine/peripheral-anon/device[1]/virtio-backend",
"type": "virtio-net"
},
{
"path": "/machine/peripheral-anon/device[0]/virtio-backend",
"type": "virtio-serial"
}
 ]
   }

3. Display status of a given virtio device

HMP Form:

virtio status <path>

Example:

(qemu) virtio status /machine/peripheral/vsock0/virtio-backend
/machine/peripheral/vsock0/virtio-backend:
device_name: vhost-vsock (vhost)
device_id:   19
vhost_started:   true
bus_name:(null)
broken:  false
disabled:false
disable_legacy_check:false
started: true
use_started: true
start_on_kick:   false
use_guest_notifier_mask: true
vm_running:  true
num_vqs: 3
queue_sel:   2
isr: 0
endianness:  little
status: acknowledge, driver, features-ok, driver-ok
Guest features:   event-idx, indirect-desc, version-1
Host features:protocol-features, event-idx, indirect-desc, 
version-1, any-layout,
  notify-on-empty
Backend features: 
VHost:
nvqs:   2
vq_index:   0
max_queues: 0
n_mem_sections: 4
n_tmp_sections: 4
backend_cap:0
log_enabled:false
log_size:   0
Features:  event-idx, indirect-desc, version-1, 
any-layout, notify-on-empty,
   log-all
Acked features:event-idx, indirect-desc, version-1
Backend features:  
Protocol features:

QMP Form:

{ 'command': 'x-debug-virtio-status',
  'data': { 'path': 'str' },
  'returns': 'VirtioStatus'
}

Example:

-> { "execute": "x-debug-virtio-status",
 "arguments": {
"path": "/machine/peripheral/vsock0/virtio-backend"
 }
   }
<- { "return": {
"device-endian": "little",
"bus-name": "",

[PATCH v7 1/8] virtio: drop name parameter for virtio_init()

2021-10-05 Thread Jonah Palmer
This patch drops the name parameter for the virtio_init function.

The pairing between the numeric device ID and the string device ID
(name) of a virtio device already exists, but not in a way that
lets us map between them.

This patch will let us do this and removes the need for the name
parameter in virtio_init().
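
For illustration only -- this is not the code the patch adds, and the
enum values, table entries, and helper below are invented -- the kind of
mapping described above boils down to a table indexed by the numeric
device ID:

    /* Standalone sketch of an ID-to-name mapping; compiles with any C compiler. */
    #include <stdio.h>

    enum { EXAMPLE_ID_NET = 1, EXAMPLE_ID_BLOCK = 2, EXAMPLE_ID_CONSOLE = 3 };

    static const char *const example_virtio_names[] = {
        [EXAMPLE_ID_NET]     = "virtio-net",
        [EXAMPLE_ID_BLOCK]   = "virtio-blk",
        [EXAMPLE_ID_CONSOLE] = "virtio-serial",
    };

    static const char *example_id_to_name(unsigned id)
    {
        size_t n = sizeof(example_virtio_names) / sizeof(example_virtio_names[0]);

        return (id < n && example_virtio_names[id]) ? example_virtio_names[id]
                                                    : "unknown-virtio-device";
    }

    int main(void)
    {
        printf("%u -> %s\n", 2u, example_id_to_name(2));
        return 0;
    }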

Signed-off-by: Jonah Palmer 
---
 hw/9pfs/virtio-9p-device.c  |  2 +-
 hw/block/vhost-user-blk.c   |  2 +-
 hw/block/virtio-blk.c   |  2 +-
 hw/char/virtio-serial-bus.c |  4 +--
 hw/display/virtio-gpu-base.c|  2 +-
 hw/input/virtio-input.c |  3 +-
 hw/net/virtio-net.c |  2 +-
 hw/scsi/virtio-scsi.c   |  3 +-
 hw/virtio/vhost-user-fs.c   |  3 +-
 hw/virtio/vhost-user-i2c.c  |  6 +---
 hw/virtio/vhost-user-vsock.c|  2 +-
 hw/virtio/vhost-vsock-common.c  |  4 +--
 hw/virtio/vhost-vsock.c |  2 +-
 hw/virtio/virtio-balloon.c  |  3 +-
 hw/virtio/virtio-crypto.c   |  2 +-
 hw/virtio/virtio-iommu.c|  3 +-
 hw/virtio/virtio-mem.c  |  3 +-
 hw/virtio/virtio-pmem.c |  3 +-
 hw/virtio/virtio-rng.c  |  2 +-
 hw/virtio/virtio.c  | 43 +++--
 include/hw/virtio/vhost-vsock-common.h  |  2 +-
 include/hw/virtio/virtio-gpu.h  |  3 +-
 include/hw/virtio/virtio.h  |  3 +-
 include/standard-headers/linux/virtio_ids.h |  1 +
 24 files changed, 65 insertions(+), 40 deletions(-)

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 54ee93b..5f522e6 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -216,7 +216,7 @@ static void virtio_9p_device_realize(DeviceState *dev, 
Error **errp)
 }
 
 v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
-virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size);
+virtio_init(vdev, VIRTIO_ID_9P, v->config_size);
 v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
 }
 
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index ba13cb8..f61f8c1 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -490,7 +490,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, 
Error **errp)
 return;
 }
 
-virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
+virtio_init(vdev, VIRTIO_ID_BLOCK,
 sizeof(struct virtio_blk_config));
 
 s->virtqs = g_new(VirtQueue *, s->num_queues);
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index f139cd7..505e574 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1213,7 +1213,7 @@ static void virtio_blk_device_realize(DeviceState *dev, 
Error **errp)
 
 virtio_blk_set_config_size(s, s->host_features);
 
-virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size);
+virtio_init(vdev, VIRTIO_ID_BLOCK, s->config_size);
 
 s->blk = conf->conf.blk;
 s->rq = NULL;
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index dd6bc27..746c92b 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -1044,8 +1044,8 @@ static void virtio_serial_device_realize(DeviceState 
*dev, Error **errp)
 VIRTIO_CONSOLE_F_EMERG_WRITE)) {
 config_size = offsetof(struct virtio_console_config, emerg_wr);
 }
-virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
-config_size);
+
+virtio_init(vdev, VIRTIO_ID_CONSOLE, config_size);
 
 /* Spawn a new virtio-serial bus on which the ports will ride as devices */
 qbus_create_inplace(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
diff --git a/hw/display/virtio-gpu-base.c b/hw/display/virtio-gpu-base.c
index c8da480..5411a7b 100644
--- a/hw/display/virtio-gpu-base.c
+++ b/hw/display/virtio-gpu-base.c
@@ -170,7 +170,7 @@ virtio_gpu_base_device_realize(DeviceState *qdev,
 }
 
 g->virtio_config.num_scanouts = cpu_to_le32(g->conf.max_outputs);
-virtio_init(VIRTIO_DEVICE(g), "virtio-gpu", VIRTIO_ID_GPU,
+virtio_init(VIRTIO_DEVICE(g), VIRTIO_ID_GPU,
 sizeof(struct virtio_gpu_config));
 
 if (virtio_gpu_virgl_enabled(g->conf)) {
diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
index 54bcb46..5b5398b 100644
--- a/hw/input/virtio-input.c
+++ b/hw/input/virtio-input.c
@@ -257,8 +257,7 @@ static void virtio_input_device_realize(DeviceState *dev, 
Error **errp)
 vinput->cfg_size += 8;
 assert(vinput->cfg_size <= sizeof(virtio_input_config));
 
-virtio_init(vdev, "virtio-input", VIRTIO_ID_INPUT,
-vinput->cfg_size);
+virtio_init(vdev, VIRTIO_ID_INPUT, vinput->cfg_size);
 vinput->evt = virtio_add_queue(vdev, 64, 

Re: [PATCH 0/2] block: avoid integer overflow of `max-workers` and assert `max_busy_tasks`

2021-10-05 Thread Vladimir Sementsov-Ogievskiy

10/5/21 19:11, Stefano Garzarella wrote:

This series contains a patch that avoids an integer overflow of
`max-workers` (struct BackupPerf) by adding a check and a patch
that asserts this condition where the problem occurs.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2009310
Signed-off-by: Stefano Garzarella 

Stefano Garzarella (2):
   block/backup: avoid integer overflow of `max-workers`
   block/aio_task: assert `max_busy_tasks` is greater than 0

  block/aio_task.c | 2 ++
  block/backup.c   | 4 ++--
  2 files changed, 4 insertions(+), 2 deletions(-)



Thanks for fixing, I'm applying it to my jobs branch.

--
Best regards,
Vladimir



Re: [PATCH 2/2] block/aio_task: assert `max_busy_tasks` is greater than 0

2021-10-05 Thread Vladimir Sementsov-Ogievskiy

10/5/21 19:11, Stefano Garzarella wrote:

All code in block/aio_task.c expects `max_busy_tasks` to always
be greater than 0.

Assert this condition during the AioTaskPool creation where
`max_busy_tasks` is set.

Signed-off-by: Stefano Garzarella



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable

2021-10-05 Thread Christian Schoenebeck
On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote:
> > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > > variable per virtio user.
> > > 
> > > virtio user == virtio device model?
> > 
> > Yes
> > 
> > > > Reasons:
> > > > 
> > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > > 
> > > > maximum queue size possible. Which is actually the maximum
> > > > queue size allowed by the virtio protocol. The appropriate
> > > > value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > 
> > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > 
> > > > Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > > more or less arbitrary value of 1024 in the past, which
> > > > limits the maximum transfer size with virtio to 4M
> > > > (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > > being 4k).
> > > 
> > > Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
> > > that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> > > etc).
> > 
> > Yes, that's use case dependent. Hence the solution to opt-in if it is
> > desired and feasible.
> > 
> > > > (2) Additionally the current value of 1024 poses a hidden limit,
> > > > 
> > > > invisible to guest, which causes a system hang with the
> > > > following QEMU error if guest tries to exceed it:
> > > > 
> > > > virtio: too many write descriptors in indirect table
> > > 
> > > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table 
says:
> > >   The number of descriptors in the table is defined by the queue size
> > >   for
> > > 
> > > this virtqueue: this is the maximum possible descriptor chain length.
> > > 
> > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > >   A driver MUST NOT create a descriptor chain longer than the Queue Size
> > >   of
> > > 
> > > the device.
> > > 
> > > Do you mean a broken/malicious guest driver that is violating the spec?
> > > That's not a hidden limit, it's defined by the spec.
> > 
> > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> > 
> > You can already go beyond that queue size at runtime with the indirection
> > table. The only actual limit is the currently hard coded value of 1k
> > pages.
> > Hence the suggestion to turn that into a variable.
> 
> Exceeding Queue Size is a VIRTIO spec violation. Drivers that operate
> outside the spec do so at their own risk. They may not be compatible
> with all device implementations.

Yes, I am aware of that. And still, this practice is already in use, and 
apparently it is not limited to 9pfs.

> The limit is not hidden, it's Queue Size as defined by the spec :).
> 
> If you have a driver that is exceeding the limit, then please fix the
> driver.

I absolutely understand your position, but I hope you also understand that 
this violation of the specs is a theoretical issue, it is not a real-life 
problem right now, and due to lack of man power unfortunately I have to 
prioritize real-life problems over theoretical ones ATM. Keep in mind that 
right now I am the only person working on 9pfs actively, I do this voluntarily 
whenever I find a free time slice, and I am not paid for it either.

I don't see any reasonable way with reasonable effort to do what you are 
asking for here in 9pfs, and Greg may correct me here if I am saying anything 
wrong. If you are seeing any specific real-life issue here, then please tell 
me which one, otherwise I have to postpone that "specs violation" issue.

There is still a long list of real problems that I need to hunt down in 9pfs, 
afterwards I can continue with theoretical ones if you want, but right now I 
simply can't, sorry.

> > > > (3) Unfortunately not all virtio users in QEMU would currently
> > > > 
> > > > work correctly with the new value of 32768.
> > > > 
> > > > So let's turn this hard coded global value into a runtime
> > > > variable as a first step in this commit, configurable for each
> > > > virtio user by passing a corresponding value with virtio_init()
> > > > call.
> > > 
> > > virtio_add_queue() already has an int queue_size argument, why isn't
> > > that enough to deal with the maximum queue size? There's probably a good
> > > reason for it, but please include it in the commit description.
> > 
> > [...]
> > 
> > > Can you make this value per-vq instead of per-vdev since virtqueues can
> > > have different queue sizes?
> > > 
> > > The same applies to the rest of this patch. Anything using
> > > vdev->queue_max_size should probably use vq->vring.num 

Re: [PATCH 1/2] block/backup: avoid integer overflow of `max-workers`

2021-10-05 Thread Vladimir Sementsov-Ogievskiy

10/5/21 19:11, Stefano Garzarella wrote:

QAPI generates `struct BackupPerf` where `max-workers` value is stored
in an `int64_t` variable.
But block_copy_async(), and the underlying code, uses an `int` parameter.

At the end that variable is used to initialize `max_busy_tasks` in
block/aio_task.c causing the following assertion failure if a value
greater than INT_MAX(2147483647) is used:

   ../block/aio_task.c:63: aio_task_pool_wait_one: Assertion `pool->busy_tasks 
> 0' failed.

Let's check that `max-workers` doesn't exceed INT_MAX and print an
error in that case.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2009310


I'm glad to see that someone is experimenting with my experimental API :)


Signed-off-by: Stefano Garzarella 


Reviewed-by: Vladimir Sementsov-Ogievskiy 


---
  block/backup.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 687d2882bc..8b072db5d9 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -407,8 +407,8 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
  return NULL;
  }
  
-if (perf->max_workers < 1) {

-error_setg(errp, "max-workers must be greater than zero");
+if (perf->max_workers < 1 || perf->max_workers > INT_MAX) {
+error_setg(errp, "max-workers must be between 1 and %d", INT_MAX);
  return NULL;
  }
  




--
Best regards,
Vladimir



[PATCH 1/2] block/backup: avoid integer overflow of `max-workers`

2021-10-05 Thread Stefano Garzarella
QAPI generates `struct BackupPerf` where `max-workers` value is stored
in an `int64_t` variable.
But block_copy_async(), and the underlying code, uses an `int` parameter.

At the end that variable is used to initialize `max_busy_tasks` in
block/aio_task.c causing the following assertion failure if a value
greater than INT_MAX(2147483647) is used:

  ../block/aio_task.c:63: aio_task_pool_wait_one: Assertion `pool->busy_tasks > 
0' failed.

Let's check that `max-workers` doesn't exceed INT_MAX and print an
error in that case.
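
For readers who want to see the failure mode in isolation, here is a tiny
standalone example (not QEMU code; the names are invented) of the
int64_t-to-int narrowing described above:

    #include <inttypes.h>
    #include <limits.h>
    #include <stdio.h>

    /* The callee models block_copy_async()'s plain "int" parameter. */
    static void take_int(int max_workers)
    {
        printf("callee sees max_workers = %d\n", max_workers);
    }

    int main(void)
    {
        int64_t max_workers = (int64_t)INT_MAX + 1; /* fits in a QAPI int64 */

        take_int(max_workers); /* out of range for int: not what was requested */
        return 0;
    }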

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2009310
Signed-off-by: Stefano Garzarella 
---
 block/backup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 687d2882bc..8b072db5d9 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -407,8 +407,8 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 return NULL;
 }
 
-if (perf->max_workers < 1) {
-error_setg(errp, "max-workers must be greater than zero");
+if (perf->max_workers < 1 || perf->max_workers > INT_MAX) {
+error_setg(errp, "max-workers must be between 1 and %d", INT_MAX);
 return NULL;
 }
 
-- 
2.31.1




[PATCH 0/2] block: avoid integer overflow of `max-workers` and assert `max_busy_tasks`

2021-10-05 Thread Stefano Garzarella
This series contains a patch that avoids an integer overflow of
`max-workers` (struct BackupPerf) by adding a check and a patch
that asserts this condition where the problem occurs.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2009310
Signed-off-by: Stefano Garzarella 

Stefano Garzarella (2):
  block/backup: avoid integer overflow of `max-workers`
  block/aio_task: assert `max_busy_tasks` is greater than 0

 block/aio_task.c | 2 ++
 block/backup.c   | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

-- 
2.31.1




Re: [PATCH v0 0/2] virtio-blk and vhost-user-blk cross-device migration

2021-10-05 Thread Eduardo Habkost
On Tue, Oct 05, 2021 at 03:01:05PM +0100, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (m...@redhat.com) wrote:
> > On Tue, Oct 05, 2021 at 02:18:40AM +0300, Roman Kagan wrote:
> > > On Mon, Oct 04, 2021 at 11:11:00AM -0400, Michael S. Tsirkin wrote:
> > > > On Mon, Oct 04, 2021 at 06:07:29PM +0300, Denis Plotnikov wrote:
> > > > > It might be useful for the cases when a slow block layer should be 
> > > > > replaced
> > > > > with a more performant one on running VM without stopping, i.e. with 
> > > > > very low
> > > > > downtime comparable with the one on migration.
> > > > > 
> > > > > It's possible to achive that for two reasons:
> > > > > 
> > > > > 1.The VMStates of "virtio-blk" and "vhost-user-blk" are almost the 
> > > > > same.
> > > > >   They consist of the identical VMSTATE_VIRTIO_DEVICE and differs from
> > > > >   each other in the values of migration service fields only.
> > > > > 2.The device driver used in the guest is the same: virtio-blk
> > > > > 
> > > > > In the series cross-migration is achieved by adding a new type.
> > > > > The new type uses virtio-blk VMState instead of vhost-user-blk 
> > > > > specific
> > > > > VMstate, also it implements migration save/load callbacks to be 
> > > > > compatible
> > > > > with migration stream produced by "virtio-blk" device.
> > > > > 
> > > > > Adding the new type instead of modifying the existing one is 
> > > > > convenent.
> > > > > It ease to differ the new virtio-blk-compatible vhost-user-blk
> > > > > device from the existing non-compatible one using qemu machinery 
> > > > > without any
> > > > > other modifiactions. That gives all the variety of qemu device related
> > > > > constraints out of box.
> > > > 
> > > > Hmm I'm not sure I understand. What is the advantage for the user?
> > > > What if vhost-user-blk became an alias for vhost-user-virtio-blk?
> > > > We could add some hacks to make it compatible for old machine types.
> > > 
> > > The point is that virtio-blk and vhost-user-blk are not
> > > migration-compatible ATM.  OTOH they are the same device from the guest
> > > POV so there's nothing fundamentally preventing the migration between
> > > the two.  In particular, we see it as a means to switch between the
> > > storage backend transports via live migration without disrupting the
> > > guest.
> > > 
> > > Migration-wise virtio-blk and vhost-user-blk have in common
> > > 
> > > - the content of the VMState -- VMSTATE_VIRTIO_DEVICE
> > > 
> > > The two differ in
> > > 
> > > - the name and the version of the VMStateDescription
> > > 
> > > - virtio-blk has an extra migration section (via .save/.load callbacks
> > >   on VirtioDeviceClass) containing requests in flight
> > > 
> > > It looks like to become migration-compatible with virtio-blk,
> > > vhost-user-blk has to start using VMStateDescription of virtio-blk and
> > > provide compatible .save/.load callbacks.  It isn't entirely obvious how
> > > to make this machine-type-dependent, so we came up with a simpler idea
> > > of defining a new device that shares most of the implementation with the
> > > original vhost-user-blk except for the migration stuff.  We're certainly
> > > open to suggestions on how to reconcile this under a single
> > > vhost-user-blk device, as this would be more user-friendly indeed.
> > > 
> > > We considered using a class property for this and defining the
> > > respective compat clause, but IIUC the class constructors (where .vmsd
> > > and .save/.load are defined) are not supposed to depend on class
> > > properties.
> > > 
> > > Thanks,
> > > Roman.
> > 
> > So the question is how to make vmsd depend on machine type.
> > CC Eduardo who poked at this kind of compat stuff recently,
> > paolo who looked at qom things most recently and dgilbert
> > for advice on migration.
> 
> I don't think I've seen anyone change vmsd name dependent on machine
> type; making fields appear/disappear is easy - that just ends up as a
> property on the device that's checked;  I guess if that property is
> global (rather than per instance) then you can check it in
> vhost_user_blk_class_init and swing the dc->vmsd pointer?

class_init can be called very early during QEMU initialization,
so it's too early to make decisions based on machine type.

Making a specific vmsd appear/disappear based on machine
configuration or state is "easy", by implementing
VMStateDescription.needed.  But this would require registering
both vmsds (one of them would need to be registered manually
instead of using DeviceClass.vmsd).
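
To make that concrete, an untested sketch (the type and field names are
invented; only the .needed mechanism itself is QEMU's) of a vmsd whose
presence in the migration stream is gated on device configuration:

    #include "qemu/osdep.h"
    #include "migration/vmstate.h"

    typedef struct ExampleBlkState {
        uint8_t dummy;
        bool use_compat_vmsd; /* would be driven by a compat/machine-type property */
    } ExampleBlkState;

    static bool example_compat_needed(void *opaque)
    {
        ExampleBlkState *s = opaque;

        return s->use_compat_vmsd;
    }

    static const VMStateDescription vmstate_example_compat = {
        .name = "example-compat",
        .version_id = 1,
        .minimum_version_id = 1,
        .needed = example_compat_needed, /* decides at save time whether it appears */
        .fields = (VMStateField[]) {
            VMSTATE_UINT8(dummy, ExampleBlkState),
            VMSTATE_END_OF_LIST()
        }
    };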

I don't remember what the consequences are of not using
DeviceClass.vmsd to register a vmsd; I only remember it was
subtle.  See commit b170fce3dd06 ("cpu: Register
VMStateDescription through CPUState") and related threads.  CCing
Philippe, who might remember the details here.

If that's an important use case, I would suggest allowing devices
to implement a DeviceClass.get_vmsd method, which would override
DeviceClass.vmsd if necessary.  Is the 

Re: [PATCH 09/11] qdev: Avoid QemuOpts in QMP device_add

2021-10-05 Thread Damien Hedde




On 10/5/21 16:37, Kevin Wolf wrote:

Am 27.09.2021 um 13:39 hat Kevin Wolf geschrieben:

Am 27.09.2021 um 13:06 hat Damien Hedde geschrieben:

On 9/24/21 11:04, Kevin Wolf wrote:

Directly call qdev_device_add_from_qdict() for QMP device_add instead of
first going through QemuOpts and converting back to QDict.

Note that this changes the behaviour of device_add, though in ways that
should be considered bug fixes:

QemuOpts ignores differences between data types, so you could
successfully pass a string "123" for an integer property, or a string
"on" for a boolean property (and vice versa).  After this change, the
correct data type for the property must be used in the JSON input.

qemu_opts_from_qdict() also silently ignores any options whose value is
a QDict, QList or QNull.

To illustrate, the following QMP command was accepted before and is now
rejected for both reasons:

{ "execute": "device_add",
"arguments": { "driver": "scsi-cd",
   "drive": { "completely": "invalid" },
   "physical_block_size": "4096" } }

Signed-off-by: Kevin Wolf 
---
   softmmu/qdev-monitor.c | 18 +++---
   1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index c09b7430eb..8622ccade6 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -812,7 +812,8 @@ void hmp_info_qdm(Monitor *mon, const QDict *qdict)
   qdev_print_devinfos(true);
   }
-void qmp_device_add(QDict *qdict, QObject **ret_data, Error **errp)
+static void monitor_device_add(QDict *qdict, QObject **ret_data,
+   bool from_json, Error **errp)
   {
   QemuOpts *opts;
   DeviceState *dev;
@@ -825,7 +826,9 @@ void qmp_device_add(QDict *qdict, QObject **ret_data, Error 
**errp)
   qemu_opts_del(opts);
   return;
   }
-dev = qdev_device_add(opts, errp);
+qemu_opts_del(opts);
+
+dev = qdev_device_add_from_qdict(qdict, from_json, errp);


Hi Kevin,

I'm wondering if deleting the opts (which removes it from the "device" opts
list) is really a no-op?


It's not exactly a no-op. Previously, the QemuOpts would only be freed
when the device is destroying, now we delete it immediately after
creating the device. This could matter in some cases.

The one case I was aware of is that QemuOpts used to be responsible for
checking for duplicate IDs. Obviously, it can't do this job any more
when we call qemu_opts_del() right after creating the device. This is
the reason for patch 6.


The opts list is, eg, traversed in hw/net/virtio-net.c in the function
failover_find_primary_device_id() which may be called during the
virtio_net_set_features() (a TYPE_VIRTIO_NET method).
I do not have the knowledge to tell when this method is called. But if this
is after we create the devices, then the list will be empty at this point
now.

It seems, there are 2 other calling sites of
"qemu_opts_foreach(qemu_find_opts("device"), [...]" in net/vhost-user.c and
net/vhost-vdpa.c


Yes, you are right. These callers probably need to be changed. Going
through the command line options rather than looking at the actual
device objects that exist doesn't feel entirely clean anyway.


So I tried to have a look at the virtio-net case, and ended up very
confused.

Obviously looking at command line options (even of a differrent device)
from within a device is very unclean. With a non-broken, i.e. type safe,
device-add (as well as with the JSON CLI option introduced by this
series), we can't have a QemuOpts any more that is by definition unsafe.
So this code needs a replacement.

My naive idea was that we just need to look at runtime state instead.
Don't search the options for a device with a matching 'failover_pair_id'
(which, by the way, would fail as soon as any other device introduces a
property with the same name), but search for actual PCIDevices in qdev
that have pci_dev->failover_pair_id set accordingly.

However, the logic in failover_add_primary() suggests that we can have a
state where QemuOpts for a device exist, but the device doesn't, and
then it hotplugs the device from the command line options. How would we
ever get into such an inconsistent state where QemuOpts contains a
device that doesn't exist? Normally devices get their QemuOpts when they
are created and device_finalize() deletes the QemuOpts again.


Just read the following from docs/system/virtio-net-failover.rst

> Usage
> -
>
> The primary device can be hotplugged or be part of the startup
> configuration
>
>   -device virtio-net-pci,netdev=hostnet1,id=net1,
>   mac=52:54:00:6f:55:cc,bus=root2,failover=on
>
> With the parameter failover=on the VIRTIO_NET_F_STANDBY feature
> will be enabled.
>
> -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,
> failover_pair_id=net1
>
> failover_pair_id references the id of the virtio-net standby device.
> This is only for pairing the devices within QEMU. The guest kernel
> module 

Re: [PATCH 12/15] iotests: Disable AQMP logging under non-debug modes

2021-10-05 Thread Hanna Reitz

On 04.10.21 20:32, John Snow wrote:



On Mon, Oct 4, 2021 at 6:12 AM Hanna Reitz wrote:


On 18.09.21 04:14, John Snow wrote:
>
>
> On Fri, Sep 17, 2021 at 8:58 PM John Snow <js...@redhat.com> wrote:
>
>
>
>     On Fri, Sep 17, 2021 at 10:30 AM Hanna Reitz <hre...@redhat.com> wrote:
>
>         On 17.09.21 07:40, John Snow wrote:
>         > Disable the aqmp logger, which likes to (at the moment)
>         print out
>         > intermediate warnings and errors that cause session
>         termination; disable
>         > them so they don't interfere with the job output.
>         >
>         > Leave any "CRITICAL" warnings enabled though, those
are ones
>         that we
>         > should never see, no matter what.
>
>         I mean, looks OK to me, but from what I understand (i.e.
little),
>         qmp_client doesn’t log CRITICAL messages, at least I
can’t see
>         any. Only
>         ERRORs.
>
>
>     There's *one* critical message in protocol.py, used for a
>     circumstance that I *think* should be impossible. I do not
think I
>     currently use any WARNING level statements.
>
>         I guess I’m missing some CRITICAL messages in external
>         functions called
>         from qmp_client.py, but shouldn’t we still keep ERRORs?
>
>
>     ...Mayybe?
>
>     The errors logged by AQMP are *almost always* raised as
Exceptions
>     somewhere else, eventually. Sometimes when we encounter them in
>     one context, we need to save them and then re-raise them in a
>     different execution context. There's one good exception to this:
>     My pal, EOFError.
>
>     If the reader context encounters EOF, it raises EOFError and
this
>     causes a disconnect to be scheduled asynchronously. *Any*
>     Exception that causes a disconnect to be scheduled
asynchronously
>     is dutifully logged as an ERROR. At this point in the code, we
>     don't really know if the user of the library considers this an
>     "error" yet or not. I've waffled a lot on how exactly to treat
>     this circumstance. ...Hm, I guess that's really the only case
>     where I have an error that really ought to be suppressed. I
>     suppose what I will do here is: if the exception happens to
be an
>     EOFError I will drop the severity of the log message down to
INFO.
>     I don't know why it takes being challenged on this stuff to
start
>     thinking clearly about it, but here we are. Thank you for your
>     feedback :~)
>
>     --js
>
>
> Oh, CI testing reminds me of why I am a liar here.
>
> the mirror-top-perms test intentionally expects not to be able to
> connect, but we're treated to these two additional lines of output:
>
> +ERROR:qemu.aqmp.qmp_client.qemub-2536319:Negotiation failed:
EOFError
> +ERROR:qemu.aqmp.qmp_client.qemub-2536319:Failed to establish
session:
> EOFError
>
> Uh. I guess a temporary suppression in mirror-top-perms, then ...?

Sounds right to me, if that’s simple enough.

(By the way, I understand it right that you want to lower the
severity
of EOFErrors to INFO only on disconnect, right?  Which is why they’re
still logged as ERRORs here, because they aren’t occurring on
disconnects?)


More or less, yeah.

When an EOFError causes the reader coroutine to halt (because it can't 
read the next message), I decided (in v2) to drop that one particular 
logging message down to "INFO", because it might -- or might not be -- 
an expected occurrence from the point of view of whoever is managing 
the QMP connection. Maybe it was expected (The test used 
qemu-guest-agent or something else to make the guest shutdown, taking 
QEMU down with it without the knowledge of the QMP library layer) or 
maybe it was unexpected (the QMP remote really just disappeared from 
us on a whim). There's no way to know, so it probably isn't right to 
consider it an error.


In the connection case, I left it as an ERROR because the caller asked 
us to connect to an endpoint and we were unable to, which feels 
unambiguous. It will be ultimately reported via Exceptions as 
qemu.aqmp.ConnectError, with additional information available in 
fields of that exception object. Even though the exception is reported 
to the caller, I decided to log the occurrence anyway, because I felt 
like it should be the job of the library to make a good log and not 
the caller's responsibility to catch the exception and then log it 
themselves.


That does leave us with this atypical case though: the caller is 
intentionally 

Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable

2021-10-05 Thread Stefan Hajnoczi
On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck wrote:
> On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > > variable per virtio user.
> > 
> > virtio user == virtio device model?
> 
> Yes
> 
> > > Reasons:
> > > 
> > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > > 
> > > maximum queue size possible. Which is actually the maximum
> > > queue size allowed by the virtio protocol. The appropriate
> > > value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > 
> > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > 
> > > Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > > more or less arbitrary value of 1024 in the past, which
> > > limits the maximum transfer size with virtio to 4M
> > > (more precise: 1024 * PAGE_SIZE, with the latter typically
> > > being 4k).
> > 
> > Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
> > that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> > etc).
> 
> Yes, that's use case dependent. Hence the solution to opt-in if it is desired 
> and feasible.
> 
> > > (2) Additionally the current value of 1024 poses a hidden limit,
> > > 
> > > invisible to guest, which causes a system hang with the
> > > following QEMU error if guest tries to exceed it:
> > > 
> > > virtio: too many write descriptors in indirect table
> > 
> > I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:
> > 
> >   The number of descriptors in the table is defined by the queue size for
> > this virtqueue: this is the maximum possible descriptor chain length.
> > 
> > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> > 
> >   A driver MUST NOT create a descriptor chain longer than the Queue Size of
> > the device.
> > 
> > Do you mean a broken/malicious guest driver that is violating the spec?
> > That's not a hidden limit, it's defined by the spec.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
> https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html
> 
> You can already go beyond that queue size at runtime with the indirection 
> table. The only actual limit is the currently hard coded value of 1k pages. 
> Hence the suggestion to turn that into a variable.

Exceeding Queue Size is a VIRTIO spec violation. Drivers that operate
outside the spec do so at their own risk. They may not be compatible
with all device implementations.

The limit is not hidden, it's Queue Size as defined by the spec :).

If you have a driver that is exceeding the limit, then please fix the
driver.

> > > (3) Unfortunately not all virtio users in QEMU would currently
> > > 
> > > work correctly with the new value of 32768.
> > > 
> > > So let's turn this hard coded global value into a runtime
> > > variable as a first step in this commit, configurable for each
> > > virtio user by passing a corresponding value with virtio_init()
> > > call.
> > 
> > virtio_add_queue() already has an int queue_size argument, why isn't
> > that enough to deal with the maximum queue size? There's probably a good
> > reason for it, but please include it in the commit description.
> [...]
> > Can you make this value per-vq instead of per-vdev since virtqueues can
> > have different queue sizes?
> > 
> > The same applies to the rest of this patch. Anything using
> > vdev->queue_max_size should probably use vq->vring.num instead.
> 
> I would like to avoid that and keep it per device. The maximum size stored 
> there is the maximum size supported by virtio user (or vortio device model, 
> however you want to call it). So that's really a limit per device, not per 
> queue, as no queue of the device would ever exceed that limit.
>
> Plus a lot more code would need to be refactored, which I think is 
> unnecessary.

I'm against a per-device limit because it's a concept that cannot
accurately describe reality. Some devices have multiple classes of
virtqueues and they are sized differently, so a per-device limit is
insufficient. virtio-net has separate rx_queue_size and tx_queue_size
parameters (plus a control vq hardcoded to 64 descriptors).

The specification already gives us Queue Size (vring.num in QEMU). The
variable exists in QEMU and just needs to be used.

If per-vq limits require a lot of work, please describe why. I think
replacing the variable from this patch with virtio_queue_get_num()
should be fairly straightforward, but maybe I'm missing something? (If
you prefer VirtQueue *vq instead of the index-based
virtio_queue_get_num() API, you can introduce a virtqueue_get_num()
API.)
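
As a rough, self-contained illustration of the per-queue idea (the types
and values below are invented; in QEMU the per-queue size is vq->vring.num,
reachable via virtio_queue_get_num()):

    #include <stdbool.h>
    #include <stdio.h>

    struct example_vq {
        unsigned num; /* Queue Size negotiated for this particular virtqueue */
    };

    /* 2.6.5.3.1: a driver must not create a chain longer than Queue Size. */
    static bool chain_len_ok(const struct example_vq *vq, unsigned chain_len)
    {
        return chain_len <= vq->num;
    }

    int main(void)
    {
        struct example_vq rx = { .num = 256 }, tx = { .num = 1024 };

        printf("chain of 300 on rx (num=256):  %s\n",
               chain_len_ok(&rx, 300) ? "ok" : "too long");
        printf("chain of 300 on tx (num=1024): %s\n",
               chain_len_ok(&tx, 300) ? "ok" : "too long");
        return 0;
    }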

Stefan


signature.asc
Description: PGP signature


Re: [PULL 00/12] jobs: mirror: Handle errors after READY cancel

2021-10-05 Thread Hanna Reitz

On 04.10.21 19:59, Vladimir Sementsov-Ogievskiy wrote:

10/4/21 19:47, Hanna Reitz wrote:

On 24.09.21 00:01, Vladimir Sementsov-Ogievskiy wrote:

22.09.2021 22:19, Vladimir Sementsov-Ogievskiy wrote:

22.09.2021 19:05, Richard Henderson wrote:

On 9/21/21 3:20 AM, Vladimir Sementsov-Ogievskiy wrote:
The following changes since commit 
326ff8dd09556fc2e257196c49f35009700794ac:


   Merge remote-tracking branch 
'remotes/jasowang/tags/net-pull-request' into staging (2021-09-20 
16:17:05 +0100)


are available in the Git repository at:

   https://src.openvz.org/scm/~vsementsov/qemu.git 
tags/pull-jobs-2021-09-21


for you to fetch changes up to 
c9489c04319cac75c76af8fc27c254f46e10214c:


   iotests: Add mirror-ready-cancel-error test (2021-09-21 
11:56:11 +0300)



mirror: Handle errors after READY cancel


Hanna Reitz (12):
   job: Context changes in job_completed_txn_abort()
   mirror: Keep s->synced on error
   mirror: Drop s->synced
   job: Force-cancel jobs in a failed transaction
   job: @force parameter for job_cancel_sync()
   jobs: Give Job.force_cancel more meaning
   job: Add job_cancel_requested()
   mirror: Use job_is_cancelled()
   mirror: Check job_is_cancelled() earlier
   mirror: Stop active mirroring after force-cancel
   mirror: Do not clear .cancelled
   iotests: Add mirror-ready-cancel-error test


This fails testing with errors like so:

Running test test-replication
test-replication: ../job.c:186: job_state_transition: Assertion 
`JobSTT[s0][s1]' failed.

ERROR test-replication - too few tests run (expected 13, got 8)
make: *** [Makefile.mtest:816: run-test-100] Error 1
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 1

https://gitlab.com/qemu-project/qemu/-/pipelines/375324015/failures




Interesting :(

I've reproduced it by starting test-replication in several parallel
loops (it doesn't reproduce for me if I start just one loop). So,
that's some racy bug.


Hmm, and seems it doesn't reproduce so simple on master. I'll try 
to bisect the series tomorrow.




(gdb) bt
#0  0x7f034a3d09d5 in raise () from /lib64/libc.so.6
#1  0x7f034a3b9954 in abort () from /lib64/libc.so.6
#2  0x7f034a3b9789 in __assert_fail_base.cold () from 
/lib64/libc.so.6

#3  0x7f034a3c9026 in __assert_fail () from /lib64/libc.so.6
#4  0x55d3b503d670 in job_state_transition (job=0x55d3b5e67020, 
s1=JOB_STATUS_CONCLUDED) at ../job.c:186
#5  0x55d3b503e7c2 in job_conclude (job=0x55d3b5e67020) at 
../job.c:652
#6  0x55d3b503eaa1 in job_finalize_single (job=0x55d3b5e67020) 
at ../job.c:722
#7  0x55d3b503ecd1 in job_completed_txn_abort 
(job=0x55d3b5e67020) at ../job.c:801
#8  0x55d3b503f2ea in job_cancel (job=0x55d3b5e67020, 
force=false) at ../job.c:973
#9  0x55d3b503f360 in job_cancel_err (job=0x55d3b5e67020, 
errp=0x7fffcc997a80) at ../job.c:992
#10 0x55d3b503f576 in job_finish_sync (job=0x55d3b5e67020, 
finish=0x55d3b503f33f <job_cancel_err>, errp=0x0) at ../job.c:1054
#11 0x55d3b503f3d0 in job_cancel_sync (job=0x55d3b5e67020, 
force=false) at ../job.c:1008
#12 0x55d3b4ff14a3 in replication_close (bs=0x55d3b5e6ef80) at 
../block/replication.c:152
#13 0x55d3b50277fc in bdrv_close (bs=0x55d3b5e6ef80) at 
../block.c:4677
#14 0x55d3b50286cf in bdrv_delete (bs=0x55d3b5e6ef80) at 
../block.c:5100
#15 0x55d3b502ae3a in bdrv_unref (bs=0x55d3b5e6ef80) at 
../block.c:6495
#16 0x55d3b5023a38 in bdrv_root_unref_child 
(child=0x55d3b5e4c690) at ../block.c:3010
#17 0x55d3b5047998 in blk_remove_bs (blk=0x55d3b5e73b40) at 
../block/block-backend.c:845
#18 0x55d3b5046e38 in blk_delete (blk=0x55d3b5e73b40) at 
../block/block-backend.c:461
#19 0x55d3b50470dc in blk_unref (blk=0x55d3b5e73b40) at 
../block/block-backend.c:516
#20 0x55d3b4fdb20a in teardown_secondary () at 
../tests/unit/test-replication.c:367
#21 0x55d3b4fdb632 in test_secondary_continuous_replication () 
at ../tests/unit/test-replication.c:504
#22 0x7f034b26979e in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#23 0x7f034b26959b in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#24 0x7f034b26959b in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#25 0x7f034b269c8a in g_test_run_suite () from 
/lib64/libglib-2.0.so.0

#26 0x7f034b269ca5 in g_test_run () from /lib64/libglib-2.0.so.0
#27 0x55d3b4fdb9c0 in main (argc=1, argv=0x7fffcc998138) at 
../tests/unit/test-replication.c:613

(gdb) fr 4
#4  0x55d3b503d670 in job_state_transition (job=0x55d3b5e67020, 
s1=JOB_STATUS_CONCLUDED) at ../job.c:186

186 assert(JobSTT[s0][s1]);
(gdb) list
181 JobStatus s0 = job->status;
182 assert(s1 >= 0 && s1 < JOB_STATUS__MAX);
183 trace_job_state_transition(job, job->ret,
184 

[RFC PATCH v2 18/25] block/coroutines: I/O API

2021-10-05 Thread Emanuele Giuseppe Esposito
Block coroutine functions run in varying AioContexts and are
not protected by the BQL. Therefore they belong to the I/O API.
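
As a rough illustration only (not part of the patch; bs is assumed to be a
valid BlockDriverState owned by another AioContext), a caller outside the
main loop brackets such an I/O-API call with the AioContext lock:

    AioContext *ctx = bdrv_get_aio_context(bs);

    aio_context_acquire(ctx);
    /* ... invoke one of the I/O-API coroutine functions declared here,
     * e.g. through its generated coroutine wrapper ... */
    aio_context_release(ctx);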

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/coroutines.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/coroutines.h b/block/coroutines.h
index 514d169d23..105e0ce2a9 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -27,6 +27,12 @@
 
 #include "block/block_int.h"
 
+/*
+ * I/O API functions. These functions are thread-safe, and therefore
+ * can run in any thread as long as they have called
+ * aio_context_acquire/release().
+ */
+
 int coroutine_fn bdrv_co_check(BlockDriverState *bs,
BdrvCheckResult *res, BdrvCheckMode fix);
 int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp);
-- 
2.27.0




Re: [PATCH 09/11] qdev: Avoid QemuOpts in QMP device_add

2021-10-05 Thread Kevin Wolf
Am 27.09.2021 um 13:39 hat Kevin Wolf geschrieben:
> Am 27.09.2021 um 13:06 hat Damien Hedde geschrieben:
> > On 9/24/21 11:04, Kevin Wolf wrote:
> > > Directly call qdev_device_add_from_qdict() for QMP device_add instead of
> > > first going through QemuOpts and converting back to QDict.
> > > 
> > > Note that this changes the behaviour of device_add, though in ways that
> > > should be considered bug fixes:
> > > 
> > > QemuOpts ignores differences between data types, so you could
> > > successfully pass a string "123" for an integer property, or a string
> > > "on" for a boolean property (and vice versa).  After this change, the
> > > correct data type for the property must be used in the JSON input.
> > > 
> > > qemu_opts_from_qdict() also silently ignores any options whose value is
> > > a QDict, QList or QNull.
> > > 
> > > To illustrate, the following QMP command was accepted before and is now
> > > rejected for both reasons:
> > > 
> > > { "execute": "device_add",
> > >"arguments": { "driver": "scsi-cd",
> > >   "drive": { "completely": "invalid" },
> > >   "physical_block_size": "4096" } }
> > > 
> > > Signed-off-by: Kevin Wolf 
> > > ---
> > >   softmmu/qdev-monitor.c | 18 +++---
> > >   1 file changed, 11 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
> > > index c09b7430eb..8622ccade6 100644
> > > --- a/softmmu/qdev-monitor.c
> > > +++ b/softmmu/qdev-monitor.c
> > > @@ -812,7 +812,8 @@ void hmp_info_qdm(Monitor *mon, const QDict *qdict)
> > >   qdev_print_devinfos(true);
> > >   }
> > > -void qmp_device_add(QDict *qdict, QObject **ret_data, Error **errp)
> > > +static void monitor_device_add(QDict *qdict, QObject **ret_data,
> > > +   bool from_json, Error **errp)
> > >   {
> > >   QemuOpts *opts;
> > >   DeviceState *dev;
> > > @@ -825,7 +826,9 @@ void qmp_device_add(QDict *qdict, QObject **ret_data, 
> > > Error **errp)
> > >   qemu_opts_del(opts);
> > >   return;
> > >   }
> > > -dev = qdev_device_add(opts, errp);
> > > +qemu_opts_del(opts);
> > > +
> > > +dev = qdev_device_add_from_qdict(qdict, from_json, errp);
> > 
> > Hi Kevin,
> > 
> > I'm wondering if deleting the opts (which removes it from the "device" opts
> > list) is really a no-op?
> 
> It's not exactly a no-op. Previously, the QemuOpts would only be freed
> when the device is destroying, now we delete it immediately after
> creating the device. This could matter in some cases.
> 
> The one case I was aware of is that QemuOpts used to be responsible for
> checking for duplicate IDs. Obviously, it can't do this job any more
> when we call qemu_opts_del() right after creating the device. This is
> the reason for patch 6.
> 
> > The opts list is, eg, traversed in hw/net/virtio-net.c in the function
> > failover_find_primary_device_id() which may be called during the
> > virtio_net_set_features() (a TYPE_VIRTIO_NET method).
> > I do not have the knowledge to tell when this method is called. But if this
> > is after we create the devices, then the list will be empty at this point
> > now.
> > 
> > It seems, there are 2 other calling sites of
> > "qemu_opts_foreach(qemu_find_opts("device"), [...]" in net/vhost-user.c and
> > net/vhost-vdpa.c
> 
> Yes, you are right. These callers probably need to be changed. Going
> through the command line options rather than looking at the actual
> device objects that exist doesn't feel entirely clean anyway.

So I tried to have a look at the virtio-net case, and ended up very
confused.

Obviously looking at command line options (even of a differrent device)
from within a device is very unclean. With a non-broken, i.e. type safe,
device-add (as well as with the JSON CLI option introduced by this
series), we can't have a QemuOpts any more that is by definition unsafe.
So this code needs a replacement.

My naive idea was that we just need to look at runtime state instead.
Don't search the options for a device with a matching 'failover_pair_id'
(which, by the way, would fail as soon as any other device introduces a
property with the same name), but search for actual PCIDevices in qdev
that have pci_dev->failover_pair_id set accordingly.
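
A hedged, untested sketch of that lookup -- the function and struct names
below are invented, and it assumes walking the QOM composition tree from
qdev_get_machine() is acceptable here:

    #include "qemu/osdep.h"
    #include "hw/pci/pci.h"
    #include "hw/qdev-core.h"

    typedef struct {
        const char *standby_id;
        PCIDevice *found;
    } FailoverFindCtx;

    static int failover_match_cb(Object *obj, void *opaque)
    {
        FailoverFindCtx *ctx = opaque;
        PCIDevice *pdev;

        if (!object_dynamic_cast(obj, TYPE_PCI_DEVICE)) {
            return 0;
        }
        pdev = PCI_DEVICE(obj);
        if (pdev->failover_pair_id &&
            g_strcmp0(pdev->failover_pair_id, ctx->standby_id) == 0) {
            ctx->found = pdev;
            return 1; /* non-zero stops the recursive walk */
        }
        return 0;
    }

    static PCIDevice *failover_find_primary_dev(const char *standby_id)
    {
        FailoverFindCtx ctx = { .standby_id = standby_id };

        object_child_foreach_recursive(qdev_get_machine(),
                                       failover_match_cb, &ctx);
        return ctx.found;
    }

This only replaces the option scanning; it does not help with the
hotplug-from-options case described next.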

However, the logic in failover_add_primary() suggests that we can have a
state where QemuOpts for a device exist, but the device doesn't, and
then it hotplugs the device from the command line options. How would we
ever get into such an inconsistent state where QemuOpts contains a
device that doesn't exist? Normally devices get their QemuOpts when they
are created and device_finalize() deletes the QemuOpts again.

Any suggestions how to get rid of the QemuOpts abuse in the failover
code?

If this is a device that we previously managed to rip out without
deleting its QemuOpts, can we store its dev->opts (which is a type safe
QDict after this series) somewhere locally instead of looking at global

[RFC PATCH v2 23/25] block-backend-common.h: split function pointers in BlockDevOps

2021-10-05 Thread Emanuele Giuseppe Esposito
Assertions in the callers of the function pointers were already
added by previous patches.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/sysemu/block-backend-common.h | 42 +++
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/sysemu/block-backend-common.h 
b/include/sysemu/block-backend-common.h
index 52ff6a4d26..25f34917b6 100644
--- a/include/sysemu/block-backend-common.h
+++ b/include/sysemu/block-backend-common.h
@@ -17,6 +17,29 @@
 
 /* Callbacks for block device models */
 typedef struct BlockDevOps {
+
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * If a function modifies the graph, it also uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ * aio_context locking is needed together with BQL because of
+ * the thread-safe I/O API that concurrently runs and accesses
+ * the graph without the BQL.
+ *
+ * It is important to note that not all of these functions are
+ * necessarily limited to running under the BQL, but they would
+ * require additional auditing and maybe small thread-safety changes
+ * to move them into the I/O API. Often it's not worth doing that
+ * work since the APIs are only used with the BQL held at the
+ * moment, so they have been placed in the GS API (for now).
+ *
+ * All bdrv_* callers that use these function pointers must
+ * use this assertion:
+ * g_assert(qemu_in_main_thread());
+ * to catch when they are accidentally called without the BQL.
+ */
+
 /*
  * Runs when virtual media changed (monitor commands eject, change)
  * Argument load is true on load and false on eject.
@@ -34,16 +57,25 @@ typedef struct BlockDevOps {
  * true, even if they do not support eject requests.
  */
 void (*eject_request_cb)(void *opaque, bool force);
-/*
- * Is the virtual tray open?
- * Device models implement this only when the device has a tray.
- */
-bool (*is_tray_open)(void *opaque);
+
 /*
  * Is the virtual medium locked into the device?
  * Device models implement this only when device has such a lock.
  */
 bool (*is_medium_locked)(void *opaque);
+
+/*
+ * I/O API functions. These functions are thread-safe, and therefore
+ * can run in any thread as long as they have called
+ * aio_context_acquire/release().
+ */
+
+/*
+ * Is the virtual tray open?
+ * Device models implement this only when the device has a tray.
+ */
+bool (*is_tray_open)(void *opaque);
+
 /*
  * Runs when the size changed (e.g. monitor command block_resize)
  */
-- 
2.27.0




[RFC PATCH v2 24/25] job.h: split function pointers in JobDriver

2021-10-05 Thread Emanuele Giuseppe Esposito
The job API will be handled separately in another series.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/qemu/job.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 41162ed494..c236c43026 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -169,12 +169,21 @@ typedef struct Job {
  * Callbacks and other information about a Job driver.
  */
 struct JobDriver {
+
+/* Fields initialized in struct definition and never changed. */
+
 /** Derived Job struct size */
 size_t instance_size;
 
 /** Enum describing the operation */
 JobType job_type;
 
+/*
+ * I/O API functions. These functions are thread-safe, and therefore
+ * can run in any thread as long as they have called
+ * aio_context_acquire/release().
+ */
+
 /**
  * Mandatory: Entrypoint for the Coroutine.
  *
@@ -201,6 +210,28 @@ struct JobDriver {
  */
 void coroutine_fn (*resume)(Job *job);
 
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * If a function modifies the graph, it also uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ * aio_context locking is needed together with BQL because of
+ * the thread-safe I/O API that concurrently runs and accesses
+ * the graph without the BQL.
+ *
+ * It is important to note that not all of these functions are
+ * necessarily limited to running under the BQL, but they would
+ * require additional auditing and may small thread-safety changes
+ * to move them into the I/O API. Often it's not worth doing that
+ * work since the APIs are only used with the BQL held at the
+ * moment, so they have been placed in the GS API (for now).
+ *
+ * All callers that use these function pointers must
+ * use this assertion:
+ * g_assert(qemu_in_main_thread());
+ * to catch when they are accidentally called without the BQL.
+ */
+
 /**
  * Called when the job is resumed by the user (i.e. user_paused becomes
  * false). .user_resume is called before .resume.
-- 
2.27.0




[RFC PATCH v2 21/25] block_int-common.h: split function pointers in BdrvChildClass

2021-10-05 Thread Emanuele Giuseppe Esposito
Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/block/block_int-common.h | 65 ++--
 1 file changed, 46 insertions(+), 19 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 184cfab2d6..a6ea824b64 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -798,12 +798,31 @@ struct BdrvChildClass {
  */
 bool parent_is_bds;
 
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * If a function modifies the graph, it also uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ * aio_context locking is needed together with BQL because of
+ * the thread-safe I/O API that concurrently runs and accesses
+ * the graph without the BQL.
+ *
+ * It is important to note that not all of these functions are
+ * necessarily limited to running under the BQL, but they would
+ * require additional auditing and maybe small thread-safety changes
+ * to move them into the I/O API. Often it's not worth doing that
+ * work since the APIs are only used with the BQL held at the
+ * moment, so they have been placed in the GS API (for now).
+ *
+ * All callers that use these function pointers must
+ * use this assertion:
+ * g_assert(qemu_in_main_thread());
+ * to catch when they are accidentally called without the BQL.
+ */
 void (*inherit_options)(BdrvChildRole role, bool parent_is_format,
 int *child_flags, QDict *child_options,
 int parent_flags, QDict *parent_options);
-
 void (*change_media)(BdrvChild *child, bool load);
-void (*resize)(BdrvChild *child);
 
 /*
  * Returns a name that is supposedly more useful for human users than the
@@ -820,6 +839,31 @@ struct BdrvChildClass {
  */
 char *(*get_parent_desc)(BdrvChild *child);
 
+void (*attach)(BdrvChild *child);
+void (*detach)(BdrvChild *child);
+
+/*
+ * Notifies the parent that the filename of its child has changed (e.g.
+ * because the direct child was removed from the backing chain), so that it
+ * can update its reference.
+ */
+int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
+   const char *filename, Error **errp);
+
+bool (*can_set_aio_ctx)(BdrvChild *child, AioContext *ctx,
+GSList **ignore, Error **errp);
+void (*set_aio_ctx)(BdrvChild *child, AioContext *ctx, GSList **ignore);
+
+AioContext *(*get_parent_aio_context)(BdrvChild *child);
+
+/*
+ * I/O API functions. These functions are thread-safe, and therefore
+ * can run in any thread as long as they have called
+ * aio_context_acquire/release().
+ */
+
+void (*resize)(BdrvChild *child);
+
 /*
  * If this pair of functions is implemented, the parent doesn't issue new
  * requests after returning from .drained_begin() until .drained_end() is
@@ -852,23 +896,6 @@ struct BdrvChildClass {
  */
 void (*activate)(BdrvChild *child, Error **errp);
 int (*inactivate)(BdrvChild *child);
-
-void (*attach)(BdrvChild *child);
-void (*detach)(BdrvChild *child);
-
-/*
- * Notifies the parent that the filename of its child has changed (e.g.
- * because the direct child was removed from the backing chain), so that it
- * can update its reference.
- */
-int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
-   const char *filename, Error **errp);
-
-bool (*can_set_aio_ctx)(BdrvChild *child, AioContext *ctx,
-GSList **ignore, Error **errp);
-void (*set_aio_ctx)(BdrvChild *child, AioContext *ctx, GSList **ignore);
-
-AioContext *(*get_parent_aio_context)(BdrvChild *child);
 };
 
 extern const BdrvChildClass child_of_bds;
-- 
2.27.0




[RFC PATCH v2 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers

2021-10-05 Thread Emanuele Giuseppe Esposito
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/block.c b/block.c
index 7d1eb847a4..a921066d4d 100644
--- a/block.c
+++ b/block.c
@@ -1069,6 +1069,7 @@ int refresh_total_sectors(BlockDriverState *bs, int64_t 
hint)
 static void bdrv_join_options(BlockDriverState *bs, QDict *options,
   QDict *old_options)
 {
+g_assert(qemu_in_main_thread());
 if (bs->drv && bs->drv->bdrv_join_options) {
 bs->drv->bdrv_join_options(options, old_options);
 } else {
@@ -1561,6 +1562,7 @@ static int bdrv_open_driver(BlockDriverState *bs, 
BlockDriver *drv,
 {
 Error *local_err = NULL;
 int i, ret;
+g_assert(qemu_in_main_thread());
 
 bdrv_assign_node_name(bs, node_name, &local_err);
 if (local_err) {
@@ -1932,6 +1934,8 @@ static int bdrv_fill_options(QDict **options, const char 
*filename,
 BlockDriver *drv = NULL;
 Error *local_err = NULL;
 
+g_assert(qemu_in_main_thread());
+
 /*
  * Caution: while qdict_get_try_str() is fine, getting non-string
  * types would require more care.  When @options come from
@@ -2125,6 +2129,7 @@ static void bdrv_child_perm(BlockDriverState *bs, 
BlockDriverState *child_bs,
 uint64_t *nperm, uint64_t *nshared)
 {
 assert(bs->drv && bs->drv->bdrv_child_perm);
+g_assert(qemu_in_main_thread());
 bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
  parent_perm, parent_shared,
  nperm, nshared);
@@ -2208,6 +2213,7 @@ static void bdrv_drv_set_perm_commit(void *opaque)
 {
 BlockDriverState *bs = opaque;
 uint64_t cumulative_perms, cumulative_shared_perms;
+g_assert(qemu_in_main_thread());
 
 if (bs->drv->bdrv_set_perm) {
 bdrv_get_cumulative_perm(bs, &cumulative_perms,
@@ -2219,6 +2225,7 @@ static void bdrv_drv_set_perm_commit(void *opaque)
 static void bdrv_drv_set_perm_abort(void *opaque)
 {
 BlockDriverState *bs = opaque;
+g_assert(qemu_in_main_thread());
 
 if (bs->drv->bdrv_abort_perm_update) {
 bs->drv->bdrv_abort_perm_update(bs);
@@ -2234,6 +2241,7 @@ static int bdrv_drv_set_perm(BlockDriverState *bs, 
uint64_t perm,
  uint64_t shared_perm, Transaction *tran,
  Error **errp)
 {
+g_assert(qemu_in_main_thread());
 if (!bs->drv) {
 return 0;
 }
@@ -4198,6 +4206,7 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
 
 assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 assert(bs_queue != NULL);
+g_assert(qemu_in_main_thread());
 
 QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
 ctx = bdrv_get_aio_context(bs_entry->state.bs);
@@ -4461,6 +4470,7 @@ static int bdrv_reopen_prepare(BDRVReopenState 
*reopen_state,
 
 assert(reopen_state != NULL);
 assert(reopen_state->bs->drv != NULL);
+g_assert(qemu_in_main_thread());
 drv = reopen_state->bs->drv;
 
 /* This function and each driver's bdrv_reopen_prepare() remove
@@ -4671,6 +4681,7 @@ static void bdrv_reopen_commit(BDRVReopenState 
*reopen_state)
 bs = reopen_state->bs;
 drv = bs->drv;
 assert(drv != NULL);
+g_assert(qemu_in_main_thread());
 
 /* If there are any driver level actions to take */
 if (drv->bdrv_reopen_commit) {
@@ -4712,6 +4723,7 @@ static void bdrv_reopen_abort(BDRVReopenState 
*reopen_state)
 assert(reopen_state != NULL);
 drv = reopen_state->bs->drv;
 assert(drv != NULL);
+g_assert(qemu_in_main_thread());
 
 if (drv->bdrv_reopen_abort) {
 drv->bdrv_reopen_abort(reopen_state);
@@ -4725,6 +4737,7 @@ static void bdrv_close(BlockDriverState *bs)
 BdrvChild *child, *next;
 
 assert(!bs->refcnt);
+g_assert(qemu_in_main_thread());
 
 bdrv_drained_begin(bs); /* complete I/O */
 bdrv_flush(bs);
@@ -6409,6 +6422,7 @@ static int bdrv_inactivate_recurse(BlockDriverState *bs)
 {
 BdrvChild *child, *parent;
 int ret;
+g_assert(qemu_in_main_thread());
 
 if (!bs->drv) {
 return -ENOMEDIUM;
@@ -6911,6 +6925,7 @@ static void bdrv_detach_aio_context(BlockDriverState *bs)
 BdrvAioNotifier *baf, *baf_tmp;
 
 assert(!bs->walking_aio_notifiers);
+g_assert(qemu_in_main_thread());
 bs->walking_aio_notifiers = true;
 QLIST_FOREACH_SAFE(baf, &bs->aio_notifiers, list, baf_tmp) {
 if (baf->deleted) {
@@ -6938,6 +6953,7 @@ static void bdrv_attach_aio_context(BlockDriverState *bs,
 AioContext *new_context)
 {
 BdrvAioNotifier *ban, *ban_tmp;
+g_assert(qemu_in_main_thread());
 
 if (bs->quiesce_counter) {
 aio_disable_external(new_context);
-- 
2.27.0




[RFC PATCH v2 14/25] assertions for blockdev.h global state API

2021-10-05 Thread Emanuele Giuseppe Esposito
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c |  2 ++
 blockdev.c| 12 
 2 files changed, 14 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9f09245069..18791c4fdc 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -805,6 +805,7 @@ DriveInfo *blk_legacy_dinfo(BlockBackend *blk)
 DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo)
 {
 assert(!blk->legacy_dinfo);
+g_assert(qemu_in_main_thread());
 return blk->legacy_dinfo = dinfo;
 }
 
@@ -815,6 +816,7 @@ DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, 
DriveInfo *dinfo)
 BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo)
 {
 BlockBackend *blk = NULL;
+g_assert(qemu_in_main_thread());
 
 while ((blk = blk_next(blk)) != NULL) {
 if (blk->legacy_dinfo == dinfo) {
diff --git a/blockdev.c b/blockdev.c
index 5608b78f8f..917bcf8cbc 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -114,6 +114,8 @@ void override_max_devs(BlockInterfaceType type, int 
max_devs)
 BlockBackend *blk;
 DriveInfo *dinfo;
 
+g_assert(qemu_in_main_thread());
+
 if (max_devs <= 0) {
 return;
 }
@@ -230,6 +232,8 @@ DriveInfo *drive_get(BlockInterfaceType type, int bus, int 
unit)
 BlockBackend *blk;
 DriveInfo *dinfo;
 
+g_assert(qemu_in_main_thread());
+
 for (blk = blk_next(NULL); blk; blk = blk_next(blk)) {
 dinfo = blk_legacy_dinfo(blk);
 if (dinfo && dinfo->type == type
@@ -252,6 +256,8 @@ void drive_check_orphaned(void)
 Location loc;
 bool orphans = false;
 
+g_assert(qemu_in_main_thread());
+
 for (blk = blk_next(NULL); blk; blk = blk_next(blk)) {
 dinfo = blk_legacy_dinfo(blk);
 /*
@@ -285,6 +291,7 @@ void drive_check_orphaned(void)
 
 DriveInfo *drive_get_by_index(BlockInterfaceType type, int index)
 {
+g_assert(qemu_in_main_thread());
 return drive_get(type,
  drive_index_to_bus_id(type, index),
  drive_index_to_unit_id(type, index));
@@ -296,6 +303,8 @@ int drive_get_max_bus(BlockInterfaceType type)
 BlockBackend *blk;
 DriveInfo *dinfo;
 
+g_assert(qemu_in_main_thread());
+
 max_bus = -1;
 for (blk = blk_next(NULL); blk; blk = blk_next(blk)) {
 dinfo = blk_legacy_dinfo(blk);
@@ -312,6 +321,7 @@ int drive_get_max_bus(BlockInterfaceType type)
 DriveInfo *drive_get_next(BlockInterfaceType type)
 {
 static int next_block_unit[IF_COUNT];
+g_assert(qemu_in_main_thread());
 
 return drive_get(type, 0, next_block_unit[type]++);
 }
@@ -792,6 +802,8 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType 
block_default_type,
 const char *filename;
 int i;
 
+g_assert(qemu_in_main_thread());
+
 /* Change legacy command line options into QMP ones */
 static const struct {
 const char *from;
-- 
2.27.0




[RFC PATCH v2 22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers

2021-10-05 Thread Emanuele Giuseppe Esposito
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/block.c b/block.c
index a921066d4d..e4b3d40094 100644
--- a/block.c
+++ b/block.c
@@ -1457,6 +1457,7 @@ const BdrvChildClass child_of_bds = {
 
 AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c)
 {
+g_assert(qemu_in_main_thread());
 return c->klass->get_parent_aio_context(c);
 }
 
@@ -2062,6 +2063,7 @@ bool bdrv_is_writable(BlockDriverState *bs)
 
 static char *bdrv_child_user_desc(BdrvChild *c)
 {
+g_assert(qemu_in_main_thread());
 return c->klass->get_parent_desc(c);
 }
 
@@ -2695,6 +2697,7 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 int drain_saldo;
 
 assert(!child->frozen);
+g_assert(qemu_in_main_thread());
 
 if (old_bs && new_bs) {
 assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
@@ -2783,6 +2786,8 @@ static void bdrv_attach_child_common_abort(void *opaque)
 BdrvChild *child = *s->child;
 BlockDriverState *bs = child->bs;
 
+g_assert(qemu_in_main_thread());
+
 bdrv_replace_child_noperm(child, NULL);
 
 if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
@@ -3141,6 +3146,7 @@ void bdrv_unref_child(BlockDriverState *parent, BdrvChild 
*child)
 static void bdrv_parent_cb_change_media(BlockDriverState *bs, bool load)
 {
 BdrvChild *c;
+g_assert(qemu_in_main_thread());
 QLIST_FOREACH(c, &bs->parents, next_parent) {
 if (c->klass->change_media) {
 c->klass->change_media(c, load);
@@ -3632,6 +3638,7 @@ static BlockDriverState *bdrv_open_inherit(const char 
*filename,
 
 assert(!child_class || !flags);
 assert(!child_class == !parent);
+g_assert(qemu_in_main_thread());
 
 if (reference) {
 bool options_non_empty = options ? qdict_size(options) : false;
@@ -4018,6 +4025,7 @@ static BlockReopenQueue 
*bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
  * important to avoid graph changes between the recursive queuing here and
  * bdrv_reopen_multiple(). */
 assert(bs->quiesce_counter > 0);
+g_assert(qemu_in_main_thread());
 
 if (bs_queue == NULL) {
 bs_queue = g_new0(BlockReopenQueue, 1);
@@ -7000,6 +7008,7 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
 BdrvChild *child, *parent;
 
 g_assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+g_assert(qemu_in_main_thread());
 
 if (old_context == new_context) {
 return;
@@ -7076,6 +7085,7 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild *c, 
AioContext *ctx,
 return true;
 }
 *ignore = g_slist_prepend(*ignore, c);
+g_assert(qemu_in_main_thread());
 
 /*
  * A BdrvChildClass that doesn't handle AioContext changes cannot
-- 
2.27.0




[RFC PATCH v2 19/25] block_int-common.h: split function pointers in BlockDriver

2021-10-05 Thread Emanuele Giuseppe Esposito
Similarly to the header split, the function pointers in BlockDriver
can also be split into I/O and global state.
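
(Not part of the patch, just to illustrate the calling convention the split
implies: dispatching a GS callback is expected to happen with the BQL held,
while the I/O callbacks may be invoked from any AioContext. The caller below
is a made-up example; bdrv_probe_device is one of the GS callbacks moved by
this patch.)

    /* Hypothetical caller, illustration only. */
    static int example_probe_device(BlockDriverState *bs, const char *filename)
    {
        g_assert(qemu_in_main_thread());    /* GS callback: BQL required */
        if (bs->drv && bs->drv->bdrv_probe_device) {
            return bs->drv->bdrv_probe_device(filename);
        }
        return 0;
    }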

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/block/block_int-common.h | 472 ---
 1 file changed, 251 insertions(+), 221 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 23f0d9c090..184cfab2d6 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -95,6 +95,7 @@ typedef struct BdrvTrackedRequest {
 
 
 struct BlockDriver {
+/* Fields initialized in struct definition and never changed. */
 const char *format_name;
 int instance_size;
 
@@ -120,23 +121,7 @@ struct BlockDriver {
  * on those children.
  */
 bool is_format;
-/*
- * Return true if @to_replace can be replaced by a BDS with the
- * same data as @bs without it affecting @bs's behavior (that is,
- * without it being visible to @bs's parents).
- */
-bool (*bdrv_recurse_can_replace)(BlockDriverState *bs,
- BlockDriverState *to_replace);
 
-int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char *filename);
-int (*bdrv_probe_device)(const char *filename);
-
-/*
- * Any driver implementing this callback is expected to be able to handle
- * NULL file names in its .bdrv_open() implementation.
- */
-void (*bdrv_parse_filename)(const char *filename, QDict *options,
-Error **errp);
 /*
  * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
  * this field set to true, except ones that are defined only by their
@@ -158,7 +143,81 @@ struct BlockDriver {
  */
 bool supports_backing;
 
-/* For handling image reopen for split or non-split files */
+/*
+ * Drivers setting this field must be able to work with just a plain
+ * filename with ':' as a prefix, and no other options.
+ * Options may be extracted from the filename by implementing
+ * bdrv_parse_filename.
+ */
+const char *protocol_name;
+
+/* List of options for creating images, terminated by name == NULL */
+QemuOptsList *create_opts;
+
+/* List of options for image amend */
+QemuOptsList *amend_opts;
+
+/*
+ * If this driver supports reopening images this contains a
+ * NULL-terminated list of the runtime options that can be
+ * modified. If an option in this list is unspecified during
+ * reopen then it _must_ be reset to its default value or return
+ * an error.
+ */
+const char *const *mutable_opts;
+
+/*
+ * Pointer to a NULL-terminated array of names of strong options
+ * that can be specified for bdrv_open(). A strong option is one
+ * that changes the data of a BDS.
+ * If this pointer is NULL, the array is considered empty.
+ * "filename" and "driver" are always considered strong.
+ */
+const char *const *strong_runtime_opts;
+
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * If a function modifies the graph, it also uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ * aio_context locking is needed together with BQL because of
+ * the thread-safe I/O API that concurrently runs and accesses
+ * the graph without the BQL.
+ *
+ * It is important to note that not all of these functions are
+ * necessarily limited to running under the BQL, but they would
+ * require additional auditing and may need small thread-safety changes
+ * to move them into the I/O API. Often it's not worth doing that
+ * work since the APIs are only used with the BQL held at the
+ * moment, so they have been placed in the GS API (for now).
+ *
+ * All bdrv_* callers that use these function pointers must
+ * use this assertion:
+ * g_assert(qemu_in_main_thread());
+ * to catch when they are accidentally called without the BQL.
+ */
+
+/*
+ * Return true if @to_replace can be replaced by a BDS with the
+ * same data as @bs without it affecting @bs's behavior (that is,
+ * without it being visible to @bs's parents).
+ */
+bool (*bdrv_recurse_can_replace)(BlockDriverState *bs,
+ BlockDriverState *to_replace);
+
+int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char *filename);
+int (*bdrv_probe_device)(const char *filename);
+
+/*
+ * Any driver implementing this callback is expected to be able to handle
+ * NULL file names in its .bdrv_open() implementation.
+ */
+void (*bdrv_parse_filename)(const char *filename, QDict *options,
+Error **errp);
+
+/*
+ * For handling image reopen for split or non-split files.
+ */
 int (*bdrv_reopen_prepare)(BDRVReopenState *reopen_state,

[RFC PATCH v2 17/25] include/qemu/transactions.h: global state API + assertions

2021-10-05 Thread Emanuele Giuseppe Esposito
Transactions always run under the BQL lock, so they all belong
to the global state API.
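
(For reviewers, a minimal usage sketch of the API under the BQL; the function
below is made up, and drv/opaque are whatever the caller registers; only
tran_new/tran_add/tran_finalize come from this header.)

    /* Illustration only: build and finalize a transaction in the main thread. */
    static int example_transactional_update(TransactionActionDrv *drv, void *opaque)
    {
        Transaction *tran = tran_new();
        int ret = 0;

        g_assert(qemu_in_main_thread());
        tran_add(tran, drv, opaque);   /* registers abort/commit callbacks */
        tran_finalize(tran, ret);      /* commits on success, aborts on ret < 0 */
        return ret;
    }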

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/qemu/transactions.h | 24 
 util/transactions.c |  4 
 2 files changed, 28 insertions(+)

diff --git a/include/qemu/transactions.h b/include/qemu/transactions.h
index 92c5965235..f4a7c473fa 100644
--- a/include/qemu/transactions.h
+++ b/include/qemu/transactions.h
@@ -37,6 +37,29 @@
 #define QEMU_TRANSACTIONS_H
 
 #include 
+#include "qemu/main-loop.h"
+
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * If a function modifies the graph, it also uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ * aio_context locking is needed together with BQL because of
+ * the thread-safe I/O API that concurrently runs and accesses
+ * the graph without the BQL.
+ *
+ * It is important to note that not all of these functions are
+ * necessarily limited to running under the BQL, but they would
+ * require additional auditing and may need small thread-safety changes
+ * to move them into the I/O API. Often it's not worth doing that
+ * work since the APIs are only used with the BQL held at the
+ * moment, so they have been placed in the GS API (for now).
+ *
+ * All functions and function pointers in this header must use
+ * this assertion:
+ * g_assert(qemu_in_main_thread());
+ * to catch when they are accidentally called without the BQL.
+ */
 
 typedef struct TransactionActionDrv {
 void (*abort)(void *opaque);
@@ -53,6 +76,7 @@ void tran_commit(Transaction *tran);
 
 static inline void tran_finalize(Transaction *tran, int ret)
 {
+g_assert(qemu_in_main_thread());
 if (ret < 0) {
 tran_abort(tran);
 } else {
diff --git a/util/transactions.c b/util/transactions.c
index d0bc9a3e73..20c3dafdb8 100644
--- a/util/transactions.c
+++ b/util/transactions.c
@@ -23,6 +23,7 @@
 #include "qemu/osdep.h"
 
 #include "qemu/transactions.h"
+#include "qemu/main-loop.h"
 #include "qemu/queue.h"
 
 typedef struct TransactionAction {
@@ -47,6 +48,7 @@ Transaction *tran_new(void)
 void tran_add(Transaction *tran, TransactionActionDrv *drv, void *opaque)
 {
 TransactionAction *act;
+g_assert(qemu_in_main_thread());
 
 act = g_new(TransactionAction, 1);
 *act = (TransactionAction) {
@@ -60,6 +62,7 @@ void tran_add(Transaction *tran, TransactionActionDrv *drv, 
void *opaque)
 void tran_abort(Transaction *tran)
 {
 TransactionAction *act, *next;
+g_assert(qemu_in_main_thread());
 
 QSLIST_FOREACH_SAFE(act, >actions, entry, next) {
 if (act->drv->abort) {
@@ -79,6 +82,7 @@ void tran_abort(Transaction *tran)
 void tran_commit(Transaction *tran)
 {
 TransactionAction *act, *next;
+g_assert(qemu_in_main_thread());
 
 QSLIST_FOREACH_SAFE(act, &tran->actions, entry, next) {
 if (act->drv->commit) {
-- 
2.27.0




[RFC PATCH v2 25/25] job.h: assertions in the callers of JobDriver function pointers

2021-10-05 Thread Emanuele Giuseppe Esposito
Signed-off-by: Emanuele Giuseppe Esposito 
---
 job.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/job.c b/job.c
index e7a5d28854..62a13b6982 100644
--- a/job.c
+++ b/job.c
@@ -373,6 +373,8 @@ void job_ref(Job *job)
 
 void job_unref(Job *job)
 {
+g_assert(qemu_in_main_thread());
+
 if (--job->refcnt == 0) {
 assert(job->status == JOB_STATUS_NULL);
 assert(!timer_pending(>sleep_timer));
@@ -594,6 +596,7 @@ bool job_user_paused(Job *job)
 void job_user_resume(Job *job, Error **errp)
 {
 assert(job);
+g_assert(qemu_in_main_thread());
 if (!job->user_paused || job->pause_count <= 0) {
 error_setg(errp, "Can't resume a job that was not paused");
 return;
@@ -664,6 +667,7 @@ static void job_update_rc(Job *job)
 static void job_commit(Job *job)
 {
 assert(!job->ret);
+g_assert(qemu_in_main_thread());
 if (job->driver->commit) {
 job->driver->commit(job);
 }
@@ -672,6 +676,7 @@ static void job_commit(Job *job)
 static void job_abort(Job *job)
 {
 assert(job->ret);
+g_assert(qemu_in_main_thread());
 if (job->driver->abort) {
 job->driver->abort(job);
 }
@@ -679,6 +684,7 @@ static void job_abort(Job *job)
 
 static void job_clean(Job *job)
 {
+g_assert(qemu_in_main_thread());
 if (job->driver->clean) {
 job->driver->clean(job);
 }
@@ -718,6 +724,7 @@ static int job_finalize_single(Job *job)
 
 static void job_cancel_async(Job *job, bool force)
 {
+g_assert(qemu_in_main_thread());
 if (job->driver->cancel) {
 job->driver->cancel(job, force);
 }
@@ -786,6 +793,7 @@ static void job_completed_txn_abort(Job *job)
 
 static int job_prepare(Job *job)
 {
+g_assert(qemu_in_main_thread());
 if (job->ret == 0 && job->driver->prepare) {
 job->ret = job->driver->prepare(job);
 job_update_rc(job);
@@ -991,6 +999,7 @@ void job_complete(Job *job, Error **errp)
 {
 /* Should not be reachable via external interface for internal jobs */
 assert(job->id);
+g_assert(qemu_in_main_thread());
 if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
 return;
 }
-- 
2.27.0




[RFC PATCH v2 16/25] block/backup-top.h: global state API + assertions

2021-10-05 Thread Emanuele Giuseppe Esposito
backup-top functions always run under the BQL lock.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/backup-top.c |  2 ++
 block/backup-top.h | 11 +++
 2 files changed, 13 insertions(+)

diff --git a/block/backup-top.c b/block/backup-top.c
index 425e3778be..8b58a909f7 100644
--- a/block/backup-top.c
+++ b/block/backup-top.c
@@ -182,6 +182,7 @@ BlockDriverState *bdrv_backup_top_append(BlockDriverState 
*source,
 bool appended = false;
 
 assert(source->total_sectors == target->total_sectors);
+g_assert(qemu_in_main_thread());
 
 top = bdrv_new_open_driver(&bdrv_backup_top_filter, filter_node_name,
BDRV_O_RDWR, errp);
@@ -244,6 +245,7 @@ fail:
 void bdrv_backup_top_drop(BlockDriverState *bs)
 {
 BDRVBackupTopState *s = bs->opaque;
+g_assert(qemu_in_main_thread());
 
 bdrv_drop_filter(bs, &error_abort);
 
diff --git a/block/backup-top.h b/block/backup-top.h
index b28b0031c4..8cb6f62869 100644
--- a/block/backup-top.h
+++ b/block/backup-top.h
@@ -29,6 +29,17 @@
 #include "block/block_int.h"
 #include "block/block-copy.h"
 
+/*
+ * Graph API. These functions run under the BQL lock.
+ *
+ * If a function modifies the graph, it uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ *
+ * All functions in this header must use this assertion:
+ * g_assert(qemu_in_main_thread());
+ * to be sure they belong here.
+ */
+
 BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
  BlockDriverState *target,
  const char *filter_node_name,
-- 
2.27.0




[RFC PATCH v2 13/25] include/sysemu/blockdev.h: global state API

2021-10-05 Thread Emanuele Giuseppe Esposito
blockdev functions always run under the BQL lock.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/sysemu/blockdev.h | 35 ++-
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h
index 32c2d6023c..28233f6b63 100644
--- a/include/sysemu/blockdev.h
+++ b/include/sysemu/blockdev.h
@@ -38,24 +38,49 @@ struct DriveInfo {
 QTAILQ_ENTRY(DriveInfo) next;
 };
 
-DriveInfo *blk_legacy_dinfo(BlockBackend *blk);
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * If a function modifies the graph, it also uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ * aio_context locking is needed together with BQL because of
+ * the thread-safe I/O API that concurrently runs and accesses
+ * the graph without the BQL.
+ *
+ * It is important to note that not all of these functions are
+ * necessarily limited to running under the BQL, but they would
+ * require additional auditing and may need small thread-safety changes
+ * to move them into the I/O API. Often it's not worth doing that
+ * work since the APIs are only used with the BQL held at the
+ * moment, so they have been placed in the GS API (for now).
+ *
+ * All functions in this header must use this assertion:
+ * g_assert(qemu_in_main_thread());
+ * to catch when they are accidentally called without the BQL.
+ */
+
 DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo);
 BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo);
 
 void override_max_devs(BlockInterfaceType type, int max_devs);
 
 DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit);
-void drive_mark_claimed_by_board(void);
 void drive_check_orphaned(void);
 DriveInfo *drive_get_by_index(BlockInterfaceType type, int index);
 int drive_get_max_bus(BlockInterfaceType type);
-int drive_get_max_devs(BlockInterfaceType type);
 DriveInfo *drive_get_next(BlockInterfaceType type);
 
+DriveInfo *drive_new(QemuOpts *arg, BlockInterfaceType block_default_type,
+ Error **errp);
+
+/* Common functions that are neither I/O nor Global State */
+
+DriveInfo *blk_legacy_dinfo(BlockBackend *blk);
+int drive_get_max_devs(BlockInterfaceType type);
+
 QemuOpts *drive_def(const char *optstr);
+
 QemuOpts *drive_add(BlockInterfaceType type, int index, const char *file,
 const char *optstr);
-DriveInfo *drive_new(QemuOpts *arg, BlockInterfaceType block_default_type,
- Error **errp);
 
 #endif
-- 
2.27.0




[RFC PATCH v2 12/25] assertions for blockjob.h global state API

2021-10-05 Thread Emanuele Giuseppe Esposito
Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockjob.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/blockjob.c b/blockjob.c
index 9878e255c6..3b224136c0 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -61,6 +61,7 @@ static bool is_block_job(Job *job)
 
 BlockJob *block_job_next(BlockJob *bjob)
 {
+g_assert(qemu_in_main_thread());
 Job *job = bjob ? &bjob->job : NULL;
 
 do {
@@ -72,6 +73,7 @@ BlockJob *block_job_next(BlockJob *bjob)
 
 BlockJob *block_job_get(const char *id)
 {
+g_assert(qemu_in_main_thread());
 Job *job = job_get(id);
 
 if (job && is_block_job(job)) {
@@ -185,6 +187,7 @@ static const BdrvChildClass child_job = {
 
 void block_job_remove_all_bdrv(BlockJob *job)
 {
+g_assert(qemu_in_main_thread());
 /*
  * bdrv_root_unref_child() may reach child_job_[can_]set_aio_ctx(),
  * which will also traverse job->nodes, so consume the list one by
@@ -207,6 +210,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
 bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs)
 {
 GSList *el;
+g_assert(qemu_in_main_thread());
 
 for (el = job->nodes; el; el = el->next) {
 BdrvChild *c = el->data;
@@ -223,6 +227,7 @@ int block_job_add_bdrv(BlockJob *job, const char *name, 
BlockDriverState *bs,
 {
 BdrvChild *c;
 bool need_context_ops;
+g_assert(qemu_in_main_thread());
 
 bdrv_ref(bs);
 
@@ -272,6 +277,8 @@ bool block_job_set_speed(BlockJob *job, int64_t speed, 
Error **errp)
 const BlockJobDriver *drv = block_job_driver(job);
 int64_t old_speed = job->speed;
 
+g_assert(qemu_in_main_thread());
+
 if (job_apply_verb(&job->job, JOB_VERB_SET_SPEED, errp) < 0) {
 return false;
 }
@@ -309,6 +316,8 @@ BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
 BlockJobInfo *info;
 uint64_t progress_current, progress_total;
 
+g_assert(qemu_in_main_thread());
+
 if (block_job_is_internal(job)) {
 error_setg(errp, "Cannot query QEMU internal jobs");
 return NULL;
@@ -498,6 +507,7 @@ void *block_job_create(const char *job_id, const 
BlockJobDriver *driver,
 
 void block_job_iostatus_reset(BlockJob *job)
 {
+g_assert(qemu_in_main_thread());
 if (job->iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
 return;
 }
-- 
2.27.0




[RFC PATCH v2 15/25] include/block/snapshot: global state API + assertions

2021-10-05 Thread Emanuele Giuseppe Esposito
Snapshots also run under the BQL lock, so they all belong to
the global state API. The AioContext lock that they hold is
currently overkill and could be removed in the future.
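
(A rough sketch of the caller pattern this refers to, purely illustrative and
not taken from the tree: the caller already holds the BQL and additionally
takes the AioContext lock around the snapshot call, which is the part that is
arguably redundant.)

    /* Hypothetical caller, illustration only. */
    static int example_create_snapshot(BlockDriverState *bs, QEMUSnapshotInfo *sn)
    {
        AioContext *ctx = bdrv_get_aio_context(bs);
        int ret;

        g_assert(qemu_in_main_thread());
        aio_context_acquire(ctx);      /* currently taken, arguably overkill */
        ret = bdrv_snapshot_create(bs, sn);
        aio_context_release(ctx);
        return ret;
    }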

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/snapshot.c | 28 
 include/block/snapshot.h | 21 +
 migration/savevm.c   |  2 ++
 3 files changed, 51 insertions(+)

diff --git a/block/snapshot.c b/block/snapshot.c
index ccacda8bd5..e8756f7f90 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -57,6 +57,8 @@ int bdrv_snapshot_find(BlockDriverState *bs, QEMUSnapshotInfo 
*sn_info,
 QEMUSnapshotInfo *sn_tab, *sn;
 int nb_sns, i, ret;
 
+g_assert(qemu_in_main_thread());
+
 ret = -ENOENT;
 nb_sns = bdrv_snapshot_list(bs, &sn_tab);
 if (nb_sns < 0) {
@@ -105,6 +107,7 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
 bool ret = false;
 
 assert(id || name);
+g_assert(qemu_in_main_thread());
 
 nb_sns = bdrv_snapshot_list(bs, &sn_tab);
 if (nb_sns < 0) {
@@ -200,6 +203,7 @@ static BlockDriverState 
*bdrv_snapshot_fallback(BlockDriverState *bs)
 int bdrv_can_snapshot(BlockDriverState *bs)
 {
 BlockDriver *drv = bs->drv;
+g_assert(qemu_in_main_thread());
 if (!drv || !bdrv_is_inserted(bs) || bdrv_is_read_only(bs)) {
 return 0;
 }
@@ -220,6 +224,9 @@ int bdrv_snapshot_create(BlockDriverState *bs,
 {
 BlockDriver *drv = bs->drv;
 BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
+
+g_assert(qemu_in_main_thread());
+
 if (!drv) {
 return -ENOMEDIUM;
 }
@@ -240,6 +247,8 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 BdrvChild **fallback_ptr;
 int ret, open_ret;
 
+g_assert(qemu_in_main_thread());
+
 if (!drv) {
 error_setg(errp, "Block driver is closed");
 return -ENOMEDIUM;
@@ -348,6 +357,8 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
 BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
 int ret;
 
+g_assert(qemu_in_main_thread());
+
 if (!drv) {
 error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, bdrv_get_device_name(bs));
 return -ENOMEDIUM;
@@ -380,6 +391,8 @@ int bdrv_snapshot_list(BlockDriverState *bs,
 {
 BlockDriver *drv = bs->drv;
 BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
+
+g_assert(qemu_in_main_thread());
 if (!drv) {
 return -ENOMEDIUM;
 }
@@ -419,6 +432,8 @@ int bdrv_snapshot_load_tmp(BlockDriverState *bs,
 {
 BlockDriver *drv = bs->drv;
 
+g_assert(qemu_in_main_thread());
+
 if (!drv) {
 error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, bdrv_get_device_name(bs));
 return -ENOMEDIUM;
@@ -447,6 +462,8 @@ int bdrv_snapshot_load_tmp_by_id_or_name(BlockDriverState 
*bs,
 int ret;
 Error *local_err = NULL;
 
+g_assert(qemu_in_main_thread());
+
 ret = bdrv_snapshot_load_tmp(bs, id_or_name, NULL, &local_err);
 if (ret == -ENOENT || ret == -EINVAL) {
 error_free(local_err);
@@ -515,6 +532,8 @@ bool bdrv_all_can_snapshot(bool has_devices, strList 
*devices,
 g_autoptr(GList) bdrvs = NULL;
 GList *iterbdrvs;
 
+g_assert(qemu_in_main_thread());
+
 if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
 return false;
 }
@@ -549,6 +568,8 @@ int bdrv_all_delete_snapshot(const char *name,
 g_autoptr(GList) bdrvs = NULL;
 GList *iterbdrvs;
 
+g_assert(qemu_in_main_thread());
+
 if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
 return -1;
 }
@@ -588,6 +609,8 @@ int bdrv_all_goto_snapshot(const char *name,
 g_autoptr(GList) bdrvs = NULL;
 GList *iterbdrvs;
 
+g_assert(qemu_in_main_thread());
+
 if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
 return -1;
 }
@@ -622,6 +645,8 @@ int bdrv_all_has_snapshot(const char *name,
 g_autoptr(GList) bdrvs = NULL;
 GList *iterbdrvs;
 
+g_assert(qemu_in_main_thread());
+
 if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
 return -1;
 }
@@ -663,6 +688,7 @@ int bdrv_all_create_snapshot(QEMUSnapshotInfo *sn,
 {
 g_autoptr(GList) bdrvs = NULL;
 GList *iterbdrvs;
+g_assert(qemu_in_main_thread());
 
 if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
 return -1;
@@ -703,6 +729,8 @@ BlockDriverState *bdrv_all_find_vmstate_bs(const char 
*vmstate_bs,
 g_autoptr(GList) bdrvs = NULL;
 GList *iterbdrvs;
 
+g_assert(qemu_in_main_thread());
+
 if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
 return NULL;
 }
diff --git a/include/block/snapshot.h b/include/block/snapshot.h
index 940345692f..3a84849388 100644
--- a/include/block/snapshot.h
+++ b/include/block/snapshot.h
@@ -45,6 +45,27 @@ typedef struct QEMUSnapshotInfo {
 uint64_t icount; /* record/replay step */
 } QEMUSnapshotInfo;
 
+/*
+ * 

[RFC PATCH v2 11/25] include/block/blockjob.h: global state API

2021-10-05 Thread Emanuele Giuseppe Esposito
blockjob functions always run under the BQL lock.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/block/blockjob.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index d200f33c10..3bf384f8bf 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -77,6 +77,27 @@ typedef struct BlockJob {
 GSList *nodes;
 } BlockJob;
 
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * If a function modifies the graph, it also uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ * aio_context locking is needed together with BQL because of
+ * the thread-safe I/O API that concurrently runs and accesses
+ * the graph without the BQL.
+ *
+ * It is important to note that not all of these functions are
+ * necessarily limited to running under the BQL, but they would
+ * require additional auditing and may need small thread-safety changes
+ * to move them into the I/O API. Often it's not worth doing that
+ * work since the APIs are only used with the BQL held at the
+ * moment, so they have been placed in the GS API (for now).
+ *
+ * All functions below must use this assertion:
+ * g_assert(qemu_in_main_thread());
+ * to catch when they are accidentally called without the BQL.
+ */
+
 /**
  * block_job_next:
  * @job: A block job, or %NULL.
@@ -158,6 +179,8 @@ BlockJobInfo *block_job_query(BlockJob *job, Error **errp);
  */
 void block_job_iostatus_reset(BlockJob *job);
 
+/* Common functions that are neither I/O nor Global State */
+
 /**
  * block_job_is_internal:
  * @job: The job to determine if it is user-visible or not.
-- 
2.27.0




[RFC PATCH v2 07/25] assertions for block_int global state API

2021-10-05 Thread Emanuele Giuseppe Esposito
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c | 17 +
 block/backup.c  |  1 +
 block/block-backend.c   |  3 +++
 block/commit.c  |  2 ++
 block/dirty-bitmap.c|  2 ++
 block/io.c  |  6 ++
 block/mirror.c  |  4 
 block/monitor/bitmap-qmp-cmds.c |  6 ++
 block/stream.c  |  2 ++
 blockdev.c  |  7 +++
 10 files changed, 50 insertions(+)

diff --git a/block.c b/block.c
index 6121af7040..b912f517a4 100644
--- a/block.c
+++ b/block.c
@@ -648,6 +648,8 @@ int coroutine_fn bdrv_co_create_opts_simple(BlockDriver 
*drv,
 Error *local_err = NULL;
 int ret;
 
+g_assert(qemu_in_main_thread());
+
 size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
 buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
 prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
@@ -2405,6 +2407,8 @@ void bdrv_get_cumulative_perm(BlockDriverState *bs, 
uint64_t *perm,
 uint64_t cumulative_perms = 0;
 uint64_t cumulative_shared_perms = BLK_PERM_ALL;
 
+g_assert(qemu_in_main_thread());
+
 QLIST_FOREACH(c, &bs->parents, next_parent) {
 cumulative_perms |= c->perm;
 cumulative_shared_perms &= c->shared_perm;
@@ -2463,6 +2467,8 @@ int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, 
uint64_t shared,
 Transaction *tran = tran_new();
 int ret;
 
+g_assert(qemu_in_main_thread());
+
 bdrv_child_set_perm(c, perm, shared, tran);
 
 ret = bdrv_refresh_perms(c->bs, &local_err);
@@ -2493,6 +2499,8 @@ int bdrv_child_refresh_perms(BlockDriverState *bs, 
BdrvChild *c, Error **errp)
 uint64_t parent_perms, parent_shared;
 uint64_t perms, shared;
 
+g_assert(qemu_in_main_thread());
+
 bdrv_get_cumulative_perm(bs, &parent_perms, &parent_shared);
 bdrv_child_perm(bs, c->bs, c, c->role, NULL,
 parent_perms, parent_shared, &perms, &shared);
@@ -2635,6 +2643,7 @@ void bdrv_default_perms(BlockDriverState *bs, BdrvChild 
*c,
 uint64_t perm, uint64_t shared,
 uint64_t *nperm, uint64_t *nshared)
 {
+g_assert(qemu_in_main_thread());
 if (role & BDRV_CHILD_FILTERED) {
 assert(!(role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
  BDRV_CHILD_COW)));
@@ -2961,6 +2970,8 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState 
*child_bs,
 BdrvChild *child = NULL;
 Transaction *tran = tran_new();
 
+g_assert(qemu_in_main_thread());
+
 ret = bdrv_attach_child_common(child_bs, child_name, child_class,
child_role, perm, shared_perm, opaque,
, tran, errp);
@@ -5939,6 +5950,8 @@ const char *bdrv_get_parent_name(const BlockDriverState 
*bs)
 BdrvChild *c;
 const char *name;
 
+g_assert(qemu_in_main_thread());
+
 /* If multiple parents have a name, just pick the first one. */
 QLIST_FOREACH(c, &bs->parents, next_parent) {
 if (c->klass->get_name) {
@@ -7206,6 +7219,8 @@ bool bdrv_recurse_can_replace(BlockDriverState *bs,
 {
 BlockDriverState *filtered;
 
+g_assert(qemu_in_main_thread());
+
 if (!bs || !bs->drv) {
 return false;
 }
@@ -7377,6 +7392,7 @@ static bool append_strong_runtime_options(QDict *d, 
BlockDriverState *bs)
  * would result in exactly bs->backing. */
 bool bdrv_backing_overridden(BlockDriverState *bs)
 {
+g_assert(qemu_in_main_thread());
 if (bs->backing) {
 return strcmp(bs->auto_backing_file,
   bs->backing->bs->filename);
@@ -7765,6 +7781,7 @@ static BlockDriverState 
*bdrv_do_skip_filters(BlockDriverState *bs,
  */
 BlockDriverState *bdrv_skip_implicit_filters(BlockDriverState *bs)
 {
+g_assert(qemu_in_main_thread());
 return bdrv_do_skip_filters(bs, true);
 }
 
diff --git a/block/backup.c b/block/backup.c
index bd3614ce70..8677dd44dc 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -413,6 +413,7 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 
 assert(bs);
 assert(target);
+g_assert(qemu_in_main_thread());
 
 /* QMP interface protects us from these cases */
 assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL);
diff --git a/block/block-backend.c b/block/block-backend.c
index 9cd3b27b53..9f09245069 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1078,6 +1078,7 @@ static void blk_root_change_media(BdrvChild *child, bool 
load)
  */
 bool blk_dev_has_removable_media(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return !blk->dev || (blk->dev_ops && blk->dev_ops->change_media_cb);
 }
 
@@ -1095,6 +1096,7 @@ bool blk_dev_has_tray(BlockBackend *blk)
  */
 void blk_dev_eject_request(BlockBackend *blk, bool force)
 {
+g_assert(qemu_in_main_thread());
 if (blk->dev_ops && blk->dev_ops->eject_request_cb) {
 blk->dev_ops->eject_request_cb(blk->dev_opaque, force);

[RFC PATCH v2 04/25] include/block/block: split header into I/O and global state API

2021-10-05 Thread Emanuele Giuseppe Esposito
Similarly to the previous patch, split block.h
into block-io.h and block-global-state.h.

block-common.h contains the structures shared between
the two headers, and the functions that can't be categorized as
I/O or global state.

Assertions are added in the next patch.
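
(One way to picture the result, as an assumption about the final layout rather
than something shown explicitly in the diff: block.h stays as an umbrella
header that simply includes the three new headers, so existing #include lines
keep working.)

    /* Hypothetical shape of the umbrella header after the split. */
    #ifndef BLOCK_H
    #define BLOCK_H

    #include "block/block-common.h"
    #include "block/block-global-state.h"
    #include "block/block-io.h"

    #endif /* BLOCK_H */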

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c|   3 +
 block/meson.build  |   7 +-
 include/block/block-common.h   | 389 +
 include/block/block-global-state.h | 263 +
 include/block/block-io.h   | 278 ++
 include/block/block.h  | 859 +
 6 files changed, 942 insertions(+), 857 deletions(-)
 create mode 100644 include/block/block-common.h
 create mode 100644 include/block/block-global-state.h
 create mode 100644 include/block/block-io.h

diff --git a/block.c b/block.c
index e97ce0b1c8..d31543ac37 100644
--- a/block.c
+++ b/block.c
@@ -65,12 +65,15 @@
 
 #define NOT_DONE 0x7fff /* used while emulated sync operation in progress 
*/
 
+/* Protected by BQL */
 static QTAILQ_HEAD(, BlockDriverState) graph_bdrv_states =
 QTAILQ_HEAD_INITIALIZER(graph_bdrv_states);
 
+/* Protected by BQL */
 static QTAILQ_HEAD(, BlockDriverState) all_bdrv_states =
 QTAILQ_HEAD_INITIALIZER(all_bdrv_states);
 
+/* Protected by BQL */
 static QLIST_HEAD(, BlockDriver) bdrv_drivers =
 QLIST_HEAD_INITIALIZER(bdrv_drivers);
 
diff --git a/block/meson.build b/block/meson.build
index 0450914c7a..f5c942c697 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -113,8 +113,11 @@ block_ss.add(module_block_h)
 wrapper_py = find_program('../scripts/block-coroutine-wrapper.py')
 block_gen_c = custom_target('block-gen.c',
 output: 'block-gen.c',
-input: files('../include/block/block.h',
- 'coroutines.h'),
+input: files(
+  '../include/block/block-io.h',
+  '../include/block/block-global-state.h',
+  'coroutines.h'
+  ),
 command: [wrapper_py, '@OUTPUT@', '@INPUT@'])
 block_ss.add(block_gen_c)
 
diff --git a/include/block/block-common.h b/include/block/block-common.h
new file mode 100644
index 00..4f1fd8de21
--- /dev/null
+++ b/include/block/block-common.h
@@ -0,0 +1,389 @@
+#ifndef BLOCK_COMMON_H
+#define BLOCK_COMMON_H
+
+#include "block/aio.h"
+#include "block/aio-wait.h"
+#include "qemu/iov.h"
+#include "qemu/coroutine.h"
+#include "block/accounting.h"
+#include "block/dirty-bitmap.h"
+#include "block/blockjob.h"
+#include "qemu/hbitmap.h"
+#include "qemu/transactions.h"
+
+/*
+ * generated_co_wrapper
+ *
+ * Function specifier, which does nothing but mark functions to be
+ * generated by scripts/block-coroutine-wrapper.py
+ *
+ * Read more in docs/devel/block-coroutine-wrapper.rst
+ */
+#define generated_co_wrapper
+
+#define BLKDBG_EVENT(child, evt) \
+do { \
+if (child) { \
+bdrv_debug_event(child->bs, evt); \
+} \
+} while (0)
+
+/* block.c */
+typedef struct BlockDriver BlockDriver;
+typedef struct BdrvChild BdrvChild;
+typedef struct BdrvChildClass BdrvChildClass;
+
+typedef struct BlockDriverInfo {
+/* in bytes, 0 if irrelevant */
+int cluster_size;
+/* offset at which the VM state can be saved (0 if not possible) */
+int64_t vm_state_offset;
+bool is_dirty;
+/*
+ * True if this block driver only supports compressed writes
+ */
+bool needs_compressed_writes;
+} BlockDriverInfo;
+
+typedef struct BlockFragInfo {
+uint64_t allocated_clusters;
+uint64_t total_clusters;
+uint64_t fragmented_clusters;
+uint64_t compressed_clusters;
+} BlockFragInfo;
+
+typedef enum {
+BDRV_REQ_COPY_ON_READ   = 0x1,
+BDRV_REQ_ZERO_WRITE = 0x2,
+
+/*
+ * The BDRV_REQ_MAY_UNMAP flag is used in write_zeroes requests to indicate
+ * that the block driver should unmap (discard) blocks if it is guaranteed
+ * that the result will read back as zeroes. The flag is only passed to the
+ * driver if the block device is opened with BDRV_O_UNMAP.
+ */
+BDRV_REQ_MAY_UNMAP  = 0x4,
+
+BDRV_REQ_FUA= 0x10,
+BDRV_REQ_WRITE_COMPRESSED   = 0x20,
+
+/*
+ * Signifies that this write request will not change the visible disk
+ * content.
+ */
+BDRV_REQ_WRITE_UNCHANGED= 0x40,
+
+/*
+ * Forces request serialisation. Use only with write requests.
+ */
+BDRV_REQ_SERIALISING= 0x80,
+
+/*
+ * Execute the request only if the operation can be offloaded or otherwise
+ * be executed efficiently, but return an error instead of using a slow
+ * fallback.
+ */
+BDRV_REQ_NO_FALLBACK= 0x100,
+
+/*
+ * BDRV_REQ_PREFETCH makes sense only 

[RFC PATCH v2 10/25] assertions for blockjob_int.h

2021-10-05 Thread Emanuele Giuseppe Esposito
Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockjob.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/blockjob.c b/blockjob.c
index 4bad1408cb..9878e255c6 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -83,6 +83,7 @@ BlockJob *block_job_get(const char *id)
 
 void block_job_free(Job *job)
 {
+g_assert(qemu_in_main_thread());
 BlockJob *bjob = container_of(job, BlockJob, job);
 
 block_job_remove_all_bdrv(bjob);
@@ -436,6 +437,8 @@ void *block_job_create(const char *job_id, const 
BlockJobDriver *driver,
 BlockBackend *blk;
 BlockJob *job;
 
+g_assert(qemu_in_main_thread());
+
 if (job_id == NULL && !(flags & JOB_INTERNAL)) {
 job_id = bdrv_get_device_name(bs);
 }
@@ -504,6 +507,7 @@ void block_job_iostatus_reset(BlockJob *job)
 
 void block_job_user_resume(Job *job)
 {
+g_assert(qemu_in_main_thread());
 BlockJob *bjob = container_of(job, BlockJob, job);
 block_job_iostatus_reset(bjob);
 }
-- 
2.27.0




[RFC PATCH v2 08/25] block: introduce assert_bdrv_graph_writable

2021-10-05 Thread Emanuele Giuseppe Esposito
We want to be sure that the functions that modify the child and
parent lists of a BlockDriverState run either under the BQL or in
a drained section. If this guarantee holds, then the lists can also
be read safely from the I/O APIs.
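
(To spell the rule out, a condensed illustration; both functions are made up,
only assert_bdrv_graph_writable and the list macros match the patch.)

    /* Writer: must run under the BQL or inside a drained section. */
    static void example_add_child(BlockDriverState *parent, BdrvChild *child)
    {
        assert_bdrv_graph_writable(parent);
        QLIST_INSERT_HEAD(&parent->children, child, next);
    }

    /* Reader on the I/O path: safe as long as every writer honours the rule. */
    static int example_count_children(BlockDriverState *parent)
    {
        BdrvChild *c;
        int n = 0;

        QLIST_FOREACH(c, &parent->children, next) {
            n++;
        }
        return n;
    }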

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 5 +
 block/io.c | 5 +
 include/block/block_int-global-state.h | 8 +++-
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index b912f517a4..7d1eb847a4 100644
--- a/block.c
+++ b/block.c
@@ -2711,12 +2711,14 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 if (child->klass->detach) {
 child->klass->detach(child);
 }
+assert_bdrv_graph_writable(old_bs);
 QLIST_REMOVE(child, next_parent);
 }
 
 child->bs = new_bs;
 
 if (new_bs) {
+assert_bdrv_graph_writable(new_bs);
 QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
 
 /*
@@ -2917,6 +2919,7 @@ static int bdrv_attach_child_noperm(BlockDriverState 
*parent_bs,
 return ret;
 }
 
+assert_bdrv_graph_writable(parent_bs);
 QLIST_INSERT_HEAD(&parent_bs->children, *child, next);
 /*
  * child is removed in bdrv_attach_child_common_abort(), so don't care to
@@ -3117,6 +3120,7 @@ static void bdrv_unset_inherits_from(BlockDriverState 
*root, BdrvChild *child,
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child)
 {
 g_assert(qemu_in_main_thread());
+assert_bdrv_graph_writable(parent);
 if (child == NULL) {
 return;
 }
@@ -4878,6 +4882,7 @@ static void bdrv_remove_filter_or_cow_child_abort(void 
*opaque)
 BdrvRemoveFilterOrCowChild *s = opaque;
 BlockDriverState *parent_bs = s->child->opaque;
 
+assert_bdrv_graph_writable(parent_bs);
 QLIST_INSERT_HEAD(&parent_bs->children, s->child, next);
 if (s->is_backing) {
 parent_bs->backing = s->child;
diff --git a/block/io.c b/block/io.c
index 21dcc5d962..d184183b07 100644
--- a/block/io.c
+++ b/block/io.c
@@ -739,6 +739,11 @@ void bdrv_drain_all(void)
 bdrv_drain_all_end();
 }
 
+void assert_bdrv_graph_writable(BlockDriverState *bs)
+{
+g_assert(qatomic_read(&bs->quiesce_counter) > 0 || qemu_in_main_thread());
+}
+
 /**
  * Remove an active request from the tracked requests list
  *
diff --git a/include/block/block_int-global-state.h 
b/include/block/block_int-global-state.h
index aad549cb85..a53ab146fc 100644
--- a/include/block/block_int-global-state.h
+++ b/include/block/block_int-global-state.h
@@ -343,4 +343,10 @@ void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
  */
 void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
 
-#endif /* BLOCK_INT_GLOBAL_STATE*/
+/**
+ * Make sure that the function is either running under
+ * drain, or under BQL.
+ */
+void assert_bdrv_graph_writable(BlockDriverState *bs);
+
+#endif /* BLOCK_INT_GLOBAL_STATE */
-- 
2.27.0




[RFC PATCH v2 06/25] include/block/block_int: split header into I/O and global state API

2021-10-05 Thread Emanuele Giuseppe Esposito
Similarly to the previous patch, split block_int.h
into block_int-io.h and block_int-global-state.h.

block_int-common.h contains the structures shared between
the two headers, and the functions that can't be categorized as
I/O or global state.

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockdev.c |5 +
 include/block/block_int-common.h   | 1122 +++
 include/block/block_int-global-state.h |  346 ++
 include/block/block_int-io.h   |  124 +++
 include/block/block_int.h  | 1412 +---
 5 files changed, 1600 insertions(+), 1409 deletions(-)
 create mode 100644 include/block/block_int-common.h
 create mode 100644 include/block/block_int-global-state.h
 create mode 100644 include/block/block_int-io.h

diff --git a/blockdev.c b/blockdev.c
index 44c419545c..75407cbf67 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -64,6 +64,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/throttle-options.h"
 
+/* Protected by BQL lock */
 QTAILQ_HEAD(, BlockDriverState) monitor_bdrv_states =
 QTAILQ_HEAD_INITIALIZER(monitor_bdrv_states);
 
@@ -1208,6 +1209,8 @@ typedef struct BlkActionState BlkActionState;
  *
  * Only prepare() may fail. In a single transaction, only one of commit() or
  * abort() will be called. clean() will always be called if it is present.
+ *
+ * Always run under BQL.
  */
 typedef struct BlkActionOps {
 size_t instance_size;
@@ -2317,6 +2320,8 @@ static TransactionProperties *get_transaction_properties(
 /*
  * 'Atomic' group operations.  The operations are performed as a set, and if
  * any fail then we roll back all operations in the group.
+ *
+ * Always run under BQL.
  */
 void qmp_transaction(TransactionActionList *dev_list,
  bool has_props,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
new file mode 100644
index 00..23f0d9c090
--- /dev/null
+++ b/include/block/block_int-common.h
@@ -0,0 +1,1122 @@
+/*
+ * QEMU System Emulator block driver
+ *
+ * Copyright (c) 2003 Fabrice Bellard
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#ifndef BLOCK_INT_COMMON_H
+#define BLOCK_INT_COMMON_H
+
+#include "block/accounting.h"
+#include "block/block.h"
+#include "block/aio-wait.h"
+#include "qemu/queue.h"
+#include "qemu/coroutine.h"
+#include "qemu/stats64.h"
+#include "qemu/timer.h"
+#include "qemu/hbitmap.h"
+#include "block/snapshot.h"
+#include "qemu/throttle.h"
+
+#define BLOCK_FLAG_LAZY_REFCOUNTS   8
+
+#define BLOCK_OPT_SIZE  "size"
+#define BLOCK_OPT_ENCRYPT   "encryption"
+#define BLOCK_OPT_ENCRYPT_FORMAT"encrypt.format"
+#define BLOCK_OPT_COMPAT6   "compat6"
+#define BLOCK_OPT_HWVERSION "hwversion"
+#define BLOCK_OPT_BACKING_FILE  "backing_file"
+#define BLOCK_OPT_BACKING_FMT   "backing_fmt"
+#define BLOCK_OPT_CLUSTER_SIZE  "cluster_size"
+#define BLOCK_OPT_TABLE_SIZE"table_size"
+#define BLOCK_OPT_PREALLOC  "preallocation"
+#define BLOCK_OPT_SUBFMT"subformat"
+#define BLOCK_OPT_COMPAT_LEVEL  "compat"
+#define BLOCK_OPT_LAZY_REFCOUNTS"lazy_refcounts"
+#define BLOCK_OPT_ADAPTER_TYPE  "adapter_type"
+#define BLOCK_OPT_REDUNDANCY"redundancy"
+#define BLOCK_OPT_NOCOW "nocow"
+#define BLOCK_OPT_EXTENT_SIZE_HINT  "extent_size_hint"
+#define BLOCK_OPT_OBJECT_SIZE   "object_size"
+#define BLOCK_OPT_REFCOUNT_BITS "refcount_bits"
+#define BLOCK_OPT_DATA_FILE "data_file"
+#define BLOCK_OPT_DATA_FILE_RAW "data_file_raw"
+#define BLOCK_OPT_COMPRESSION_TYPE  "compression_type"
+#define BLOCK_OPT_EXTL2 "extended_l2"
+
+#define BLOCK_PROBE_BUF_SIZE512
+
+enum BdrvTrackedRequestType {
+BDRV_TRACKED_READ,
+BDRV_TRACKED_WRITE,
+BDRV_TRACKED_DISCARD,
+BDRV_TRACKED_TRUNCATE,
+};
+
+/*
+ * That is not quite good that 

[RFC PATCH v2 09/25] include/block/blockjob_int.h: split header into I/O and GS API

2021-10-05 Thread Emanuele Giuseppe Esposito
Since there are not many I/O functions, keep them in a single header.
Also split the function pointers in BlockJobDriver.
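
(As an illustration of the resulting grouping, a hypothetical driver
definition; the field names are the ones split by this patch, the driver
itself does not exist.)

    static const BlockJobDriver example_job_driver = {
        /* I/O API: may be polled from whichever AioContext runs the job */
        .drained_poll = NULL,
        /* GS API: invoked with the BQL held */
        .set_speed    = NULL,
    };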

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/block/blockjob_int.h | 55 
 1 file changed, 55 insertions(+)

diff --git a/include/block/blockjob_int.h b/include/block/blockjob_int.h
index 6633d83da2..bac4e8f46d 100644
--- a/include/block/blockjob_int.h
+++ b/include/block/blockjob_int.h
@@ -38,6 +38,12 @@ struct BlockJobDriver {
 /** Generic JobDriver callbacks and settings */
 JobDriver job_driver;
 
+/*
+ * I/O API functions. These functions are thread-safe, and therefore
+ * can run in any thread as long as they have called
+ * aio_context_acquire/release().
+ */
+
 /*
  * Returns whether the job has pending requests for the child or will
  * submit new requests before the next pause point. This callback is polled
@@ -46,6 +52,28 @@ struct BlockJobDriver {
  */
 bool (*drained_poll)(BlockJob *job);
 
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * If a function modifies the graph, it also uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ * aio_context locking is needed together with BQL because of
+ * the thread-safe I/O API that concurrently runs and accesses
+ * the graph without the BQL.
+ *
+ * It is important to note that not all of these functions are
+ * necessarily limited to running under the BQL, but they would
+ * require additional auditing and may need small thread-safety changes
+ * to move them into the I/O API. Often it's not worth doing that
+ * work since the APIs are only used with the BQL held at the
+ * moment, so they have been placed in the GS API (for now).
+ *
+ * All callers that use these function pointers must
+ * use this assertion:
+ * g_assert(qemu_in_main_thread());
+ * to catch when they are accidentally called without the BQL.
+ */
+
 /*
  * If the callback is not NULL, it will be invoked before the job is
  * resumed in a new AioContext.  This is the place to move any resources
@@ -56,6 +84,27 @@ struct BlockJobDriver {
 void (*set_speed)(BlockJob *job, int64_t speed);
 };
 
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * If a function modifies the graph, it also uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ * aio_context locking is needed together with BQL because of
+ * the thread-safe I/O API that concurrently runs and accesses
+ * the graph without the BQL.
+ *
+ * It is important to note that not all of these functions are
+ * necessarily limited to running under the BQL, but they would
+ * require additional auditing and may need small thread-safety changes
+ * to move them into the I/O API. Often it's not worth doing that
+ * work since the APIs are only used with the BQL held at the
+ * moment, so they have been placed in the GS API (for now).
+ *
+ * All functions below must use this assertion:
+ * g_assert(qemu_in_main_thread());
+ * to catch when they are accidentally called without the BQL.
+ */
+
 /**
  * block_job_create:
  * @job_id: The id of the newly-created job, or %NULL to have one
@@ -98,6 +147,12 @@ void block_job_free(Job *job);
  */
 void block_job_user_resume(Job *job);
 
+/*
+ * I/O API functions. These functions are thread-safe, and therefore
+ * can run in any thread as long as they have called
+ * aio_context_acquire/release().
+ */
+
 /**
  * block_job_ratelimit_get_delay:
  *
-- 
2.27.0




[RFC PATCH v2 05/25] assertions for block global state API

2021-10-05 Thread Emanuele Giuseppe Esposito
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 135 +++--
 block/commit.c |   2 +
 block/io.c |  20 
 blockdev.c |   1 +
 4 files changed, 155 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index d31543ac37..6121af7040 100644
--- a/block.c
+++ b/block.c
@@ -384,6 +384,7 @@ char *bdrv_get_full_backing_filename(BlockDriverState *bs, 
Error **errp)
 void bdrv_register(BlockDriver *bdrv)
 {
 assert(bdrv->format_name);
+g_assert(qemu_in_main_thread());
 QLIST_INSERT_HEAD(&bdrv_drivers, bdrv, list);
 }
 
@@ -392,6 +393,8 @@ BlockDriverState *bdrv_new(void)
 BlockDriverState *bs;
 int i;
 
+g_assert(qemu_in_main_thread());
+
 bs = g_new0(BlockDriverState, 1);
 QLIST_INIT(&bs->dirty_bitmaps);
 for (i = 0; i < BLOCK_OP_TYPE_MAX; i++) {
@@ -416,6 +419,7 @@ BlockDriverState *bdrv_new(void)
 static BlockDriver *bdrv_do_find_format(const char *format_name)
 {
 BlockDriver *drv1;
+g_assert(qemu_in_main_thread());
 
 QLIST_FOREACH(drv1, &bdrv_drivers, list) {
 if (!strcmp(drv1->format_name, format_name)) {
@@ -431,6 +435,8 @@ BlockDriver *bdrv_find_format(const char *format_name)
 BlockDriver *drv1;
 int i;
 
+g_assert(qemu_in_main_thread());
+
 drv1 = bdrv_do_find_format(format_name);
 if (drv1) {
 return drv1;
@@ -480,11 +486,13 @@ static int bdrv_format_is_whitelisted(const char 
*format_name, bool read_only)
 
 int bdrv_is_whitelisted(BlockDriver *drv, bool read_only)
 {
+g_assert(qemu_in_main_thread());
 return bdrv_format_is_whitelisted(drv->format_name, read_only);
 }
 
 bool bdrv_uses_whitelist(void)
 {
+g_assert(qemu_in_main_thread());
 return use_bdrv_whitelist;
 }
 
@@ -515,6 +523,8 @@ int bdrv_create(BlockDriver *drv, const char* filename,
 {
 int ret;
 
+g_assert(qemu_in_main_thread());
+
 Coroutine *co;
 CreateCo cco = {
 .drv = drv,
@@ -690,6 +700,8 @@ int bdrv_create_file(const char *filename, QemuOpts *opts, 
Error **errp)
 QDict *qdict;
 int ret;
 
+g_assert(qemu_in_main_thread());
+
 drv = bdrv_find_protocol(filename, true, errp);
 if (drv == NULL) {
 return -ENOENT;
@@ -787,6 +799,7 @@ int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes 
*bsz)
 {
 BlockDriver *drv = bs->drv;
 BlockDriverState *filtered = bdrv_filter_bs(bs);
+g_assert(qemu_in_main_thread());
 
 if (drv && drv->bdrv_probe_blocksizes) {
 return drv->bdrv_probe_blocksizes(bs, bsz);
@@ -807,6 +820,7 @@ int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry 
*geo)
 {
 BlockDriver *drv = bs->drv;
 BlockDriverState *filtered = bdrv_filter_bs(bs);
+g_assert(qemu_in_main_thread());
 
 if (drv && drv->bdrv_probe_geometry) {
 return drv->bdrv_probe_geometry(bs, geo);
@@ -861,6 +875,7 @@ static BlockDriver *find_hdev_driver(const char *filename)
 {
 int score_max = 0, score;
 BlockDriver *drv = NULL, *d;
+g_assert(qemu_in_main_thread());
 
 QLIST_FOREACH(d, &bdrv_drivers, list) {
 if (d->bdrv_probe_device) {
@@ -878,6 +893,7 @@ static BlockDriver *find_hdev_driver(const char *filename)
 static BlockDriver *bdrv_do_find_protocol(const char *protocol)
 {
 BlockDriver *drv1;
+g_assert(qemu_in_main_thread());
 
 QLIST_FOREACH(drv1, &bdrv_drivers, list) {
 if (drv1->protocol_name && !strcmp(drv1->protocol_name, protocol)) {
@@ -898,6 +914,7 @@ BlockDriver *bdrv_find_protocol(const char *filename,
 const char *p;
 int i;
 
+g_assert(qemu_in_main_thread());
 /* TODO Drivers without bdrv_file_open must be specified explicitly */
 
 /*
@@ -963,6 +980,7 @@ BlockDriver *bdrv_probe_all(const uint8_t *buf, int 
buf_size,
 {
 int score_max = 0, score;
 BlockDriver *drv = NULL, *d;
+g_assert(qemu_in_main_thread());
 
 QLIST_FOREACH(d, &bdrv_drivers, list) {
 if (d->bdrv_probe) {
@@ -1110,6 +1128,7 @@ int bdrv_parse_aio(const char *mode, int *flags)
  */
 int bdrv_parse_discard_flags(const char *mode, int *flags)
 {
+g_assert(qemu_in_main_thread());
 *flags &= ~BDRV_O_UNMAP;
 
 if (!strcmp(mode, "off") || !strcmp(mode, "ignore")) {
@@ -1130,6 +1149,7 @@ int bdrv_parse_discard_flags(const char *mode, int *flags)
  */
 int bdrv_parse_cache_mode(const char *mode, int *flags, bool *writethrough)
 {
+g_assert(qemu_in_main_thread());
 *flags &= ~BDRV_O_CACHE_MASK;
 
 if (!strcmp(mode, "off") || !strcmp(mode, "none")) {
@@ -1494,6 +1514,7 @@ static void bdrv_assign_node_name(BlockDriverState *bs,
   Error **errp)
 {
 char *gen_node_name = NULL;
+g_assert(qemu_in_main_thread());
 
 if (!node_name) {
 node_name = gen_node_name = id_generate(ID_BLOCK);
@@ -1608,6 +1629,8 @@ BlockDriverState *bdrv_new_open_driver(BlockDriver *drv, 
const char *node_name,
 BlockDriverState *bs;
 int ret;
 
+g_assert(qemu_in_main_thread());
+
 bs = 

[RFC PATCH v2 01/25] main-loop.h: introduce qemu_in_main_thread()

2021-10-05 Thread Emanuele Giuseppe Esposito
When invoked from the main loop, this function is the same
as qemu_mutex_iothread_locked, and returns true if the BQL is held.
When invoked from iothreads or tests, it returns true only
if the current AioContext is the Main Loop.

This essentially just extends qemu_mutex_iothread_locked to work
also in unit tests or other users like storage-daemon, that run
in the Main Loop but end up using the implementation in
stubs/iothread-lock.c.

Using qemu_mutex_iothread_locked in unit tests defaults to false
because they use the implementation in stubs/iothread-lock,
making all assertions added in the next patches fail even though
the AioContext is still the main loop.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/qemu/main-loop.h | 13 +
 softmmu/cpus.c   |  5 +
 stubs/iothread-lock.c|  5 +
 3 files changed, 23 insertions(+)

diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
index 8dbc6fcb89..6b8fa57c5d 100644
--- a/include/qemu/main-loop.h
+++ b/include/qemu/main-loop.h
@@ -245,6 +245,19 @@ AioContext *iohandler_get_aio_context(void);
  */
 bool qemu_mutex_iothread_locked(void);
 
+/**
+ * qemu_in_main_thread: same as qemu_mutex_iothread_locked when
+ * softmmu/cpus.c implementation is linked. Otherwise this function
+ * checks that the current AioContext is the global AioContext
+ * (main loop).
+ *
+ * This is useful when checking that the BQL is held, to avoid
+ * returning false when invoked by unit tests or other users like
+ * storage-daemon that end up using stubs/iothread-lock.c
+ * implementation.
+ */
+bool qemu_in_main_thread(void);
+
 /**
  * qemu_mutex_lock_iothread: Lock the main loop mutex.
  *
diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 071085f840..3f61a3c31d 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -481,6 +481,11 @@ bool qemu_mutex_iothread_locked(void)
 return iothread_locked;
 }
 
+bool qemu_in_main_thread(void)
+{
+return qemu_mutex_iothread_locked();
+}
+
 /*
  * The BQL is taken from so many places that it is worth profiling the
  * callers directly, instead of funneling them all through a single function.
diff --git a/stubs/iothread-lock.c b/stubs/iothread-lock.c
index 5b45b7fc8b..ff7386e42c 100644
--- a/stubs/iothread-lock.c
+++ b/stubs/iothread-lock.c
@@ -6,6 +6,11 @@ bool qemu_mutex_iothread_locked(void)
 return false;
 }
 
+bool qemu_in_main_thread(void)
+{
+return qemu_get_current_aio_context() == qemu_get_aio_context();
+}
+
 void qemu_mutex_lock_iothread_impl(const char *file, int line)
 {
 }
-- 
2.27.0




[RFC PATCH v2 02/25] include/sysemu/block-backend: split header into I/O and global state (GS) API

2021-10-05 Thread Emanuele Giuseppe Esposito
block-backend.h currently contains a mix of functions:
some of them run under the BQL and modify the block layer graph,
others are instead thread-safe and perform I/O in iothreads.
It is not easy to understand which function is part of which
group (I/O vs GS), and this patch aims to clarify it.

The "GS" functions need the BQL, and often use
aio_context_acquire/release and/or drain to be sure they
can modify the graph safely.
The I/O functions are instead thread-safe, and can run in
any AioContext.
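
As a purely illustrative sketch (not part of this patch; the example_*
helpers below are hypothetical), the intended calling convention for the
two groups looks roughly like this:

    #include "qemu/osdep.h"
    #include "qemu/main-loop.h"
    #include "sysemu/block-backend.h"

    /* GS API: main loop only, BQL held, may modify the block graph. */
    static void example_gs_caller(BlockBackend *blk)
    {
        assert(qemu_in_main_thread());  /* the rule the GS API will enforce */
        blk_remove_bs(blk);
    }

    /* I/O API: thread-safe; the caller serializes on the AioContext. */
    static int example_io_caller(BlockBackend *blk, void *buf,
                                 int64_t offset, int bytes)
    {
        AioContext *ctx = blk_get_aio_context(blk);
        int ret;

        aio_context_acquire(ctx);
        ret = blk_pread(blk, offset, buf, bytes);
        aio_context_release(ctx);
        return ret;
    }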

By splitting the header into two files, block-backend-io.h
and block-backend-global-state.h, we get a clearer view of what
needs which kind of protection. block-backend-common.h
instead contains common structures shared by both headers.

In addition, remove "block/block.h" include as it seems
it is not necessary anymore, together with "qemu/iov.h"

block-backend.h is left there for legacy and to avoid changing
all includes in all c files that use the block-backend APIs.

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c   |  10 +-
 include/sysemu/block-backend-common.h   |  74 ++
 include/sysemu/block-backend-global-state.h | 136 ++
 include/sysemu/block-backend-io.h   | 130 ++
 include/sysemu/block-backend.h  | 262 +---
 5 files changed, 350 insertions(+), 262 deletions(-)
 create mode 100644 include/sysemu/block-backend-common.h
 create mode 100644 include/sysemu/block-backend-global-state.h
 create mode 100644 include/sysemu/block-backend-io.h

diff --git a/block/block-backend.c b/block/block-backend.c
index deb55c272e..d31ae16b99 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -78,6 +78,7 @@ struct BlockBackend {
 bool allow_aio_context_change;
 bool allow_write_beyond_eof;
 
+/* Protected by BQL lock */
 NotifierList remove_bs_notifiers, insert_bs_notifiers;
 QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers;
 
@@ -110,12 +111,14 @@ static const AIOCBInfo block_backend_aiocb_info = {
 static void drive_info_del(DriveInfo *dinfo);
 static BlockBackend *bdrv_first_blk(BlockDriverState *bs);
 
-/* All BlockBackends */
+/* All BlockBackends. Protected by BQL lock. */
 static QTAILQ_HEAD(, BlockBackend) block_backends =
 QTAILQ_HEAD_INITIALIZER(block_backends);
 
-/* All BlockBackends referenced by the monitor and which are iterated through 
by
- * blk_next() */
+/*
+ * All BlockBackends referenced by the monitor and which are iterated through 
by
+ * blk_next(). Protected by BQL lock.
+ */
 static QTAILQ_HEAD(, BlockBackend) monitor_block_backends =
 QTAILQ_HEAD_INITIALIZER(monitor_block_backends);
 
@@ -985,6 +988,7 @@ BlockBackend *blk_by_dev(void *dev)
 void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops,
  void *opaque)
 {
+g_assert(qemu_in_main_thread());
 blk->dev_ops = ops;
 blk->dev_opaque = opaque;
 
diff --git a/include/sysemu/block-backend-common.h 
b/include/sysemu/block-backend-common.h
new file mode 100644
index 00..52ff6a4d26
--- /dev/null
+++ b/include/sysemu/block-backend-common.h
@@ -0,0 +1,74 @@
+/*
+ * QEMU Block backends
+ *
+ * Copyright (C) 2014-2016 Red Hat, Inc.
+ *
+ * Authors:
+ *  Markus Armbruster ,
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1
+ * or later.  See the COPYING.LIB file in the top-level directory.
+ */
+
+#ifndef BLOCK_BACKEND_COMMON_H
+#define BLOCK_BACKEND_COMMON_H
+
+#include "block/throttle-groups.h"
+
+/* Callbacks for block device models */
+typedef struct BlockDevOps {
+/*
+ * Runs when virtual media changed (monitor commands eject, change)
+ * Argument load is true on load and false on eject.
+ * Beware: doesn't run when a host device's physical media
+ * changes.  Sure would be useful if it did.
+ * Device models with removable media must implement this callback.
+ */
+void (*change_media_cb)(void *opaque, bool load, Error **errp);
+/*
+ * Runs when an eject request is issued from the monitor, the tray
+ * is closed, and the medium is locked.
+ * Device models that do not implement is_medium_locked will not need
+ * this callback.  Device models that can lock the medium or tray might
+ * want to implement the callback and unlock the tray when "force" is
+ * true, even if they do not support eject requests.
+ */
+void (*eject_request_cb)(void *opaque, bool force);
+/*
+ * Is the virtual tray open?
+ * Device models implement this only when the device has a tray.
+ */
+bool (*is_tray_open)(void *opaque);
+/*
+ * Is the virtual medium locked into the device?
+ * Device models implement this only when device has such a lock.
+ */
+bool (*is_medium_locked)(void *opaque);
+/*
+ * Runs when the size changed (e.g. monitor command block_resize)
+ */
+void (*resize_cb)(void *opaque);
+  

[RFC PATCH v2 03/25] block/block-backend.c: assertions for block-backend

2021-10-05 Thread Emanuele Giuseppe Esposito
All the global state (GS) API functions will check that
qemu_in_main_thread() returns true. If not, it means
that BQL protection cannot be guaranteed, and
they need to be moved to the I/O API.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c  | 89 +-
 softmmu/qdev-monitor.c |  2 +
 2 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d31ae16b99..9cd3b27b53 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -227,6 +227,7 @@ static void blk_root_activate(BdrvChild *child, Error 
**errp)
 
 void blk_set_force_allow_inactivate(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 blk->force_allow_inactivate = true;
 }
 
@@ -345,6 +346,8 @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, 
uint64_t shared_perm)
 {
 BlockBackend *blk;
 
+g_assert(qemu_in_main_thread());
+
 blk = g_new0(BlockBackend, 1);
 blk->refcnt = 1;
 blk->ctx = ctx;
@@ -382,6 +385,8 @@ BlockBackend *blk_new_with_bs(BlockDriverState *bs, 
uint64_t perm,
 {
 BlockBackend *blk = blk_new(bdrv_get_aio_context(bs), perm, shared_perm);
 
+g_assert(qemu_in_main_thread());
+
 if (blk_insert_bs(blk, bs, errp) < 0) {
 blk_unref(blk);
 return NULL;
@@ -410,6 +415,8 @@ BlockBackend *blk_new_open(const char *filename, const char 
*reference,
 uint64_t perm = 0;
 uint64_t shared = BLK_PERM_ALL;
 
+g_assert(qemu_in_main_thread());
+
 /*
  * blk_new_open() is mainly used in .bdrv_create implementations and the
  * tools where sharing isn't a major concern because the BDS stays private
@@ -487,6 +494,7 @@ static void drive_info_del(DriveInfo *dinfo)
 
 int blk_get_refcnt(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return blk ? blk->refcnt : 0;
 }
 
@@ -497,6 +505,7 @@ int blk_get_refcnt(BlockBackend *blk)
 void blk_ref(BlockBackend *blk)
 {
 assert(blk->refcnt > 0);
+g_assert(qemu_in_main_thread());
 blk->refcnt++;
 }
 
@@ -507,6 +516,7 @@ void blk_ref(BlockBackend *blk)
  */
 void blk_unref(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 if (blk) {
 assert(blk->refcnt > 0);
 if (blk->refcnt > 1) {
@@ -527,6 +537,7 @@ void blk_unref(BlockBackend *blk)
  */
 BlockBackend *blk_all_next(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return blk ? QTAILQ_NEXT(blk, link)
 : QTAILQ_FIRST(&block_backends);
 }
@@ -535,6 +546,8 @@ void blk_remove_all_bs(void)
 {
 BlockBackend *blk = NULL;
 
+g_assert(qemu_in_main_thread());
+
 while ((blk = blk_all_next(blk)) != NULL) {
 AioContext *ctx = blk_get_aio_context(blk);
 
@@ -558,6 +571,7 @@ void blk_remove_all_bs(void)
  */
 BlockBackend *blk_next(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return blk ? QTAILQ_NEXT(blk, monitor_link)
 : QTAILQ_FIRST(&monitor_block_backends);
 }
@@ -624,6 +638,7 @@ static void bdrv_next_reset(BdrvNextIterator *it)
 
 BlockDriverState *bdrv_first(BdrvNextIterator *it)
 {
+g_assert(qemu_in_main_thread());
 bdrv_next_reset(it);
 return bdrv_next(it);
 }
@@ -661,6 +676,7 @@ bool monitor_add_blk(BlockBackend *blk, const char *name, 
Error **errp)
 {
 assert(!blk->name);
 assert(name && name[0]);
+g_assert(qemu_in_main_thread());
 
 if (!id_wellformed(name)) {
 error_setg(errp, "Invalid device name");
@@ -688,6 +704,8 @@ bool monitor_add_blk(BlockBackend *blk, const char *name, 
Error **errp)
  */
 void monitor_remove_blk(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
+
 if (!blk->name) {
 return;
 }
@@ -703,6 +721,7 @@ void monitor_remove_blk(BlockBackend *blk)
  */
 const char *blk_name(const BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return blk->name ?: "";
 }
 
@@ -714,6 +733,7 @@ BlockBackend *blk_by_name(const char *name)
 {
 BlockBackend *blk = NULL;
 
+g_assert(qemu_in_main_thread());
 assert(name);
 while ((blk = blk_next(blk)) != NULL) {
 if (!strcmp(name, blk->name)) {
@@ -748,6 +768,7 @@ static BlockBackend *bdrv_first_blk(BlockDriverState *bs)
  */
 bool bdrv_has_blk(BlockDriverState *bs)
 {
+g_assert(qemu_in_main_thread());
 return bdrv_first_blk(bs) != NULL;
 }
 
@@ -758,6 +779,7 @@ bool bdrv_is_root_node(BlockDriverState *bs)
 {
 BdrvChild *c;
 
+g_assert(qemu_in_main_thread());
 QLIST_FOREACH(c, &bs->parents, next_parent) {
 if (c->klass != &child_root) {
 return false;
@@ -807,6 +829,7 @@ BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo)
  */
 BlockBackendPublic *blk_get_public(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return &blk->public;
 }
 
@@ -815,6 +838,7 @@ BlockBackendPublic *blk_get_public(BlockBackend *blk)
  */
 BlockBackend *blk_by_public(BlockBackendPublic *public)
 {
+g_assert(qemu_in_main_thread());
 return container_of(public, BlockBackend, 

[RFC PATCH v2 00/25] block layer: split block APIs in global state and I/O

2021-10-05 Thread Emanuele Giuseppe Esposito
Currently, block layer APIs like block-backend.h contain a mix of
functions: some run in the main loop under the BQL, while others are
thread-safe and run in iothreads performing I/O.
The functions running under the BQL also take care of modifying the
block graph, by using drain and/or aio_context_acquire/release.
This makes it very confusing to understand where each function
runs and what guarantees it provides with regard to thread
safety.

We call the functions running under BQL "global state (GS) API", and
distinguish them from the thread-safe "I/O API".

The aim of this series is to split the relevant block headers into
global state and I/O sub-headers. The division will be done in
this way:
header.h will be split into header-global-state.h, header-io.h and
header-common.h. The latter will just contain the data structures
needed by header-global-state and header-io, and common helpers
that belong to neither GS nor I/O. header.h will stay for
legacy reasons and to avoid changing all includes in all QEMU C files,
but will only include the two new headers. No function shall be
added to header.c.
Once we split all relevant headers, it will be much easier to see what
uses the AioContext lock and remove it, which is the overall main
goal of this and other series that I posted/will post.
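
For illustration only (hypothetical file contents, not taken from the
patches), such a legacy header essentially reduces to:

    /*
     * include/sysemu/block-backend.h -- legacy wrapper, kept so that
     * existing #include lines keep compiling; no new declarations here.
     */
    #ifndef BLOCK_BACKEND_H
    #define BLOCK_BACKEND_H

    #include "block-backend-global-state.h"
    #include "block-backend-io.h"

    #endif /* BLOCK_BACKEND_H */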

In addition to splitting the relevant headers shown in this series,
it is also very helpful splitting the function pointers in some
block structures, to understand what runs under AioContext lock and
what doesn't. This is what patches 19-25 do.

Each function in the GS API will have an assertion, checking
that it is always running under the BQL.
I/O functions are instead thread-safe (or should be), meaning
that they *can* run under the BQL, but also in an iothread in another
AioContext. Therefore they do not get any assertion, and
need to be audited manually to verify their correctness.

Adding assertions has already helped find two bugs, as shown in
my series "Migration: fix missing iothread locking". This series
depends on those two fixes, as some assertions will fail because
some iothread locks are missing.

Tested this series by running unit tests, qemu-iotests and qtests
(x86_64)
Some functions in the GS API are used everywhere but not
properly tested. Therefore their assertion is never actually run in
the tests, so despite my very careful auditing, it cannot be ruled out
that some will trigger during actual QEMU use.

Patch 1 introduces qemu_in_main_thread(), the function used in
all assertions. This had to be introduced because otherwise all unit tests
would fail, since they run in the main loop but use the code in
stubs/iothread.c.
Patches 2-14 and 19-25 (with the exception of patch 9, which is an additional
assert) are all structured in the same way: first we split the header,
and in the next (even) patch we add the assertions.
The remaining patches contain either both the split and the assertions,
or no assertions at all.

Next steps once this get reviewed:
1) audit the GS API and replace the AioContext lock with drains,
or remove them when not necessary (requires further discussion).
2) [optional as it should be already the case] audit the I/O API
and check that thread safety is guaranteed

Based-on: <20211005080751.3797161-1-eespo...@redhat.com>

Signed-off-by: Emanuele Giuseppe Esposito 
---
v1 -> v2:
* remove the iothread locking bug fix, and send it as separate patch
* rename graph API -> global state API
* better documented patch 1 (qemu_in_main_thread)
* add and split all other block layer headers
* fix warnings given by checkpatch on multiline comments

Emanuele Giuseppe Esposito (25):
  main-loop.h: introduce qemu_in_main_thread()
  include/sysemu/block-backend: split header into I/O and global state
(GS) API
  block/block-backend.c: assertions for block-backend
  include/block/block: split header into I/O and global state API
  assertions for block global state API
  include/block/block_int: split header into I/O and global state API
  assertions for block_int global state API
  block: introduce assert_bdrv_graph_writable
  include/block/blockjob_int.h: split header into I/O and GS API
  assertions for blockjob_int.h
  include/block/blockjob.h: global state API
  assertions for blockjob.h global state API
  include/sysemu/blockdev.h: global state API
  assertions for blockdev.h global state API
  include/block/snapshot: global state API + assertions
  block/backup-top.h: global state API + assertions
  include/block/transactions: global state API + assertions
  block/coroutines: I/O API
  block_int-common.h: split function pointers in BlockDriver
  block_int-common.h: assertion in the callers of BlockDriver function
pointers
  block_int-common.h: split function pointers in BdrvChildClass
  block_int-common.h: assertions in the callers of BdrvChildClass
function pointers
  block-backend-common.h: split function pointers in BlockDevOps
  job.h: split function pointers in JobDriver
  job.h: assertions 

Re: [PATCH v0 0/2] virtio-blk and vhost-user-blk cross-device migration

2021-10-05 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Tue, Oct 05, 2021 at 02:18:40AM +0300, Roman Kagan wrote:
> > On Mon, Oct 04, 2021 at 11:11:00AM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Oct 04, 2021 at 06:07:29PM +0300, Denis Plotnikov wrote:
> > > > It might be useful for the cases when a slow block layer should be 
> > > > replaced
> > > > with a more performant one on running VM without stopping, i.e. with 
> > > > very low
> > > > downtime comparable with the one on migration.
> > > > 
> > > > It's possible to achive that for two reasons:
> > > > 
> > > > 1.The VMStates of "virtio-blk" and "vhost-user-blk" are almost the same.
> > > >   They consist of the identical VMSTATE_VIRTIO_DEVICE and differs from
> > > >   each other in the values of migration service fields only.
> > > > 2.The device driver used in the guest is the same: virtio-blk
> > > > 
> > > > In the series cross-migration is achieved by adding a new type.
> > > > The new type uses virtio-blk VMState instead of vhost-user-blk specific
> > > > VMstate, also it implements migration save/load callbacks to be 
> > > > compatible
> > > > with migration stream produced by "virtio-blk" device.
> > > > 
> > > > Adding the new type instead of modifying the existing one is convenent.
> > > > It ease to differ the new virtio-blk-compatible vhost-user-blk
> > > > device from the existing non-compatible one using qemu machinery 
> > > > without any
> > > > other modifiactions. That gives all the variety of qemu device related
> > > > constraints out of box.
> > > 
> > > Hmm I'm not sure I understand. What is the advantage for the user?
> > > What if vhost-user-blk became an alias for vhost-user-virtio-blk?
> > > We could add some hacks to make it compatible for old machine types.
> > 
> > The point is that virtio-blk and vhost-user-blk are not
> > migration-compatible ATM.  OTOH they are the same device from the guest
> > POV so there's nothing fundamentally preventing the migration between
> > the two.  In particular, we see it as a means to switch between the
> > storage backend transports via live migration without disrupting the
> > guest.
> > 
> > Migration-wise virtio-blk and vhost-user-blk have in common
> > 
> > - the content of the VMState -- VMSTATE_VIRTIO_DEVICE
> > 
> > The two differ in
> > 
> > - the name and the version of the VMStateDescription
> > 
> > - virtio-blk has an extra migration section (via .save/.load callbacks
> >   on VirtioDeviceClass) containing requests in flight
> > 
> > It looks like to become migration-compatible with virtio-blk,
> > vhost-user-blk has to start using VMStateDescription of virtio-blk and
> > provide compatible .save/.load callbacks.  It isn't entirely obvious how
> > to make this machine-type-dependent, so we came up with a simpler idea
> > of defining a new device that shares most of the implementation with the
> > original vhost-user-blk except for the migration stuff.  We're certainly
> > open to suggestions on how to reconcile this under a single
> > vhost-user-blk device, as this would be more user-friendly indeed.
> > 
> > We considered using a class property for this and defining the
> > respective compat clause, but IIUC the class constructors (where .vmsd
> > and .save/.load are defined) are not supposed to depend on class
> > properties.
> > 
> > Thanks,
> > Roman.
> 
> So the question is how to make vmsd depend on machine type.
> CC Eduardo who poked at this kind of compat stuff recently,
> paolo who looked at qom things most recently and dgilbert
> for advice on migration.

I don't think I've seen anyone change vmsd name dependent on machine
type; making fields appear/disappear is easy - that just ends up as a
property on the device that's checked;  I guess if that property is
global (rather than per instance) then you can check it in
vhost_user_blk_class_init and swing the dc->vmsd pointer?
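
Roughly, as an untested sketch of that idea (the global flag and the
compat VMStateDescription name below are hypothetical):

    /* global compat flag, flipped for old machine types before class use */
    bool vhost_user_blk_vmsd_compat;

    static void vhost_user_blk_class_init(ObjectClass *klass, void *data)
    {
        DeviceClass *dc = DEVICE_CLASS(klass);

        /* ... existing property/realize setup ... */
        if (vhost_user_blk_vmsd_compat) {
            /* a VMSD reusing virtio-blk's section name and version */
            dc->vmsd = &vmstate_vhost_user_blk_virtio_compat;
        } else {
            dc->vmsd = &vmstate_vhost_user_blk;
        }
    }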

Dave


> -- 
> MST
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable

2021-10-05 Thread Christian Schoenebeck
On Tuesday, 5 October 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > variable per virtio user.
> 
> virtio user == virtio device model?

Yes

> > Reasons:
> > 
> > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > 
> > maximum queue size possible. Which is actually the maximum
> > queue size allowed by the virtio protocol. The appropriate
> > value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > 
> > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.h
> > tml#x1-240006
> > 
> > Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > more or less arbitrary value of 1024 in the past, which
> > limits the maximum transfer size with virtio to 4M
> > (more precise: 1024 * PAGE_SIZE, with the latter typically
> > being 4k).
> 
> Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
> that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> etc).

Yes, that's use case dependent. Hence the solution to opt-in if it is desired 
and feasible.

> > (2) Additionally the current value of 1024 poses a hidden limit,
> > 
> > invisible to guest, which causes a system hang with the
> > following QEMU error if guest tries to exceed it:
> > 
> > virtio: too many write descriptors in indirect table
> 
> I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:
> 
>   The number of descriptors in the table is defined by the queue size for
> this virtqueue: this is the maximum possible descriptor chain length.
> 
> and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> 
>   A driver MUST NOT create a descriptor chain longer than the Queue Size of
> the device.
> 
> Do you mean a broken/malicious guest driver that is violating the spec?
> That's not a hidden limit, it's defined by the spec.

https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html

You can already go beyond that queue size at runtime with the indirect
descriptor table. The only actual limit is the currently hard-coded value of
1k pages. Hence the suggestion to turn that into a variable.

> > (3) Unfortunately not all virtio users in QEMU would currently
> > 
> > work correctly with the new value of 32768.
> > 
> > So let's turn this hard coded global value into a runtime
> > variable as a first step in this commit, configurable for each
> > virtio user by passing a corresponding value with virtio_init()
> > call.
> 
> virtio_add_queue() already has an int queue_size argument, why isn't
> that enough to deal with the maximum queue size? There's probably a good
> reason for it, but please include it in the commit description.
[...]
> Can you make this value per-vq instead of per-vdev since virtqueues can
> have different queue sizes?
> 
> The same applies to the rest of this patch. Anything using
> vdev->queue_max_size should probably use vq->vring.num instead.

I would like to avoid that and keep it per device. The maximum size stored 
there is the maximum size supported by the virtio user (or virtio device model, 
however you want to call it). So that's really a per-device limit, not a 
per-queue one, as no queue of the device would ever exceed that limit.

Plus a lot more code would need to be refactored, which I think is 
unnecessary.

Best regards,
Christian Schoenebeck





Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable

2021-10-05 Thread Stefan Hajnoczi
On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> variable per virtio user.

virtio user == virtio device model?

> 
> Reasons:
> 
> (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> maximum queue size possible. Which is actually the maximum
> queue size allowed by the virtio protocol. The appropriate
> value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> 
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> 
> Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> more or less arbitrary value of 1024 in the past, which
> limits the maximum transfer size with virtio to 4M
> (more precise: 1024 * PAGE_SIZE, with the latter typically
> being 4k).

Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
that cannot be passed to host system calls (sendmsg(2), pwritev(2),
etc).

> (2) Additionally the current value of 1024 poses a hidden limit,
> invisible to guest, which causes a system hang with the
> following QEMU error if guest tries to exceed it:
> 
> virtio: too many write descriptors in indirect table

I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:

  The number of descriptors in the table is defined by the queue size for this 
virtqueue: this is the maximum possible descriptor chain length.

and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:

  A driver MUST NOT create a descriptor chain longer than the Queue Size of the 
device.

Do you mean a broken/malicious guest driver that is violating the spec?
That's not a hidden limit, it's defined by the spec.

> (3) Unfortunately not all virtio users in QEMU would currently
> work correctly with the new value of 32768.
> 
> So let's turn this hard coded global value into a runtime
> variable as a first step in this commit, configurable for each
> virtio user by passing a corresponding value with virtio_init()
> call.

virtio_add_queue() already has an int queue_size argument, why isn't
that enough to deal with the maximum queue size? There's probably a good
reason for it, but please include it in the commit description.

> 
> Signed-off-by: Christian Schoenebeck 
> ---
>  hw/9pfs/virtio-9p-device.c |  3 ++-
>  hw/block/vhost-user-blk.c  |  2 +-
>  hw/block/virtio-blk.c  |  3 ++-
>  hw/char/virtio-serial-bus.c|  2 +-
>  hw/display/virtio-gpu-base.c   |  2 +-
>  hw/input/virtio-input.c|  2 +-
>  hw/net/virtio-net.c| 15 ---
>  hw/scsi/virtio-scsi.c  |  2 +-
>  hw/virtio/vhost-user-fs.c  |  2 +-
>  hw/virtio/vhost-user-i2c.c |  3 ++-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c |  4 ++--
>  hw/virtio/virtio-crypto.c  |  3 ++-
>  hw/virtio/virtio-iommu.c   |  2 +-
>  hw/virtio/virtio-mem.c |  2 +-
>  hw/virtio/virtio-pmem.c|  2 +-
>  hw/virtio/virtio-rng.c |  2 +-
>  hw/virtio/virtio.c | 35 +++---
>  include/hw/virtio/virtio.h |  5 -
>  19 files changed, 57 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 54ee93b71f..cd5d95dd51 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -216,7 +216,8 @@ static void virtio_9p_device_realize(DeviceState *dev, 
> Error **errp)
>  }
>  
>  v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
> -virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size);
> +virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> +VIRTQUEUE_MAX_SIZE);
>  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index ba13cb87e5..336f56705c 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  }
>  
>  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -sizeof(struct virtio_blk_config));
> +sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
>  
>  s->virtqs = g_new(VirtQueue *, s->num_queues);
>  for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index f139cd7cc9..9c0f46815c 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1213,7 +1213,8 @@ static void virtio_blk_device_realize(DeviceState *dev, 
> Error **errp)
>  
>  virtio_blk_set_config_size(s, s->host_features);
>  
> -virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size);
> +virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> +VIRTQUEUE_MAX_SIZE);
>  
>  s->blk = conf->conf.blk;
>  s->rq = NULL;
> diff --git 

Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Christian Schoenebeck
On Tuesday, 5 October 2021 13:24:36 CEST Michael S. Tsirkin wrote:
> On Tue, Oct 05, 2021 at 01:17:59PM +0200, Christian Schoenebeck wrote:
> > On Dienstag, 5. Oktober 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> > > On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > > > Raise the maximum possible virtio transfer size to 128M
> > > > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > > > more detailed explanation for the reasons of this change.
> > > > 
> > > > For not breaking any virtio user, all virtio users transition
> > > > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > > > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > > > of 1k with this commit.
> > > > 
> > > > On the long-term, each virtio user should subsequently either
> > > > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > > > after checking that they support the new value of 32k, or
> > > > otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> > > > macro by an appropriate value supported by them.
> > > > 
> > > > Signed-off-by: Christian Schoenebeck 
> > > 
> > > I don't think we need this. Legacy isn't descriptive either.  Just leave
> > > VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
> > 
> > Does this mean you disagree that on the long-term all virtio users should
> > transition either to the new upper limit of 32k max queue size or
> > introduce
> > their own limit at their end?
> 
> depends. if 9pfs is the only one unhappy, we can keep 4k as
> the default. it's sure a safe one.
> 
> > Independent of the name, and I would appreciate for suggestions for an
> > adequate macro name here, I still think this new limit should be placed in
> > the shared virtio.h file. Because this value is not something invented on
> > virtio user side. It rather reflects the theoretical upper limited
> > possible with the virtio protocol, which is and will be common for all
> > virtio users.
> We can add this to the linux uapi headers, sure.

Well, then I'll wait a few days, and if nobody else cares about this issue, 
I'll just hard-code 32k exclusively on the 9pfs side in v3 for now, and that's 
it.

Best regards,
Christian Schoenebeck





Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Christian Schoenebeck
On Tuesday, 5 October 2021 13:19:43 CEST Michael S. Tsirkin wrote:
> On Tue, Oct 05, 2021 at 01:10:56PM +0200, Christian Schoenebeck wrote:
> > On Dienstag, 5. Oktober 2021 09:38:53 CEST David Hildenbrand wrote:
> > > On 04.10.21 21:38, Christian Schoenebeck wrote:
> > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > theoretical possible transfer size of 128M (32k pages) according to
> > > > the
> > > > virtio specs:
> > > > 
> > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.h
> > > > tml#
> > > > x1-240006
> > > 
> > > I'm missing the "why do we care". Can you comment on that?
> > 
> > Primary motivation is the possibility of improved performance, e.g. in
> > case of 9pfs, people can raise the maximum transfer size with the Linux
> > 9p client's 'msize' option on guest side (and only on guest side
> > actually). If guest performs large chunk I/O, e.g. consider something
> > "useful" like this one on> 
> > guest side:
> >   time cat large_file_on_9pfs.dat > /dev/null
> > 
> > Then there is a noticable performance increase with higher transfer size
> > values. That performance gain is continuous with rising transfer size
> > values, but the performance increase obviously shrinks with rising
> > transfer sizes as well, as with similar concepts in general like cache
> > sizes, etc.
> > 
> > Then a secondary motivation is described in reason (2) of patch 2: if the
> > transfer size is configurable on guest side (like it is the case with the
> > 9pfs 'msize' option), then there is the unpleasant side effect that the
> > current virtio limit of 4M is invisible to guest; as this value of 4M is
> > simply an arbitrarily limit set on QEMU side in the past (probably just
> > implementation motivated on QEMU side at that point), i.e. it is not a
> > limit specified by the virtio protocol,
> 
> According to the spec it's specified, sure enough: vq size limits the
> size of indirect descriptors too.

In the virtio specs the only hard limit that I see is the aforementioned 32k:

"Queue Size corresponds to the maximum number of buffers in the virtqueue. 
Queue Size value is always a power of 2. The maximum Queue Size value is 
32768. This value is specified in a bus-specific way."

> However, ever since commit 44ed8089e991a60d614abe0ee4b9057a28b364e4 we
> do not enforce it in the driver ...

Then there is the current queue size (which is probably what you mean), which 
is transmitted to the guest with whatever value virtio was initialized with.

In the case of the 9p client, however, the virtio queue size is first 
initialized with some hard-coded value when the 9p driver is loaded on the 
Linux guest side; when a 9pfs is later mounted by the guest, it may include 
the 'msize' mount option to raise the transfer size, and that's the problem. I 
don't see any way for the guest to see that it cannot go above that 4M 
transfer size now.

> > nor is this limit be made aware to guest via virtio protocol
> > at all. The consequence with 9pfs would be if user tries to go higher than
> > 4M,> 
> > then the system would simply hang with this QEMU error:
> >   virtio: too many write descriptors in indirect table
> > 
> > Now whether this is an issue or not for individual virtio users, depends
> > on
> > whether the individual virtio user already had its own limitation <= 4M
> > enforced on its side.
> > 
> > Best regards,
> > Christian Schoenebeck





Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Michael S. Tsirkin
On Tue, Oct 05, 2021 at 01:17:59PM +0200, Christian Schoenebeck wrote:
> On Dienstag, 5. Oktober 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> > On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > > Raise the maximum possible virtio transfer size to 128M
> > > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > > more detailed explanation for the reasons of this change.
> > > 
> > > For not breaking any virtio user, all virtio users transition
> > > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > > of 1k with this commit.
> > > 
> > > On the long-term, each virtio user should subsequently either
> > > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > > after checking that they support the new value of 32k, or
> > > otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> > > macro by an appropriate value supported by them.
> > > 
> > > Signed-off-by: Christian Schoenebeck 
> > 
> > I don't think we need this. Legacy isn't descriptive either.  Just leave
> > VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
> 
> Does this mean you disagree that on the long-term all virtio users should 
> transition either to the new upper limit of 32k max queue size or introduce 
> their own limit at their end?


depends. if 9pfs is the only one unhappy, we can keep 4k as
the default. it's sure a safe one.

> Independent of the name, and I would appreciate for suggestions for an 
> adequate macro name here, I still think this new limit should be placed in 
> the 
> shared virtio.h file. Because this value is not something invented on virtio 
> user side. It rather reflects the theoretical upper limited possible with the 
> virtio protocol, which is and will be common for all virtio users.


We can add this to the linux uapi headers, sure.

> > > ---
> > > 
> > >  hw/9pfs/virtio-9p-device.c |  2 +-
> > >  hw/block/vhost-user-blk.c  |  6 +++---
> > >  hw/block/virtio-blk.c  |  6 +++---
> > >  hw/char/virtio-serial-bus.c|  2 +-
> > >  hw/input/virtio-input.c|  2 +-
> > >  hw/net/virtio-net.c| 12 ++--
> > >  hw/scsi/virtio-scsi.c  |  2 +-
> > >  hw/virtio/vhost-user-fs.c  |  6 +++---
> > >  hw/virtio/vhost-user-i2c.c |  2 +-
> > >  hw/virtio/vhost-vsock-common.c |  2 +-
> > >  hw/virtio/virtio-balloon.c |  2 +-
> > >  hw/virtio/virtio-crypto.c  |  2 +-
> > >  hw/virtio/virtio-iommu.c   |  2 +-
> > >  hw/virtio/virtio-mem.c |  2 +-
> > >  hw/virtio/virtio-mmio.c|  4 ++--
> > >  hw/virtio/virtio-pmem.c|  2 +-
> > >  hw/virtio/virtio-rng.c |  3 ++-
> > >  include/hw/virtio/virtio.h | 20 +++-
> > >  18 files changed, 49 insertions(+), 30 deletions(-)
> > > 
> > > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > > index cd5d95dd51..9013e7df6e 100644
> > > --- a/hw/9pfs/virtio-9p-device.c
> > > +++ b/hw/9pfs/virtio-9p-device.c
> > > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > > Error **errp)> 
> > >  v->config_size = sizeof(struct virtio_9p_config) +
> > >  strlen(s->fsconf.tag);
> > >  virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > 
> > > -VIRTQUEUE_MAX_SIZE);
> > > +VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > >  
> > >  }
> > > 
> > > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > > index 336f56705c..e5e45262ab 100644
> > > --- a/hw/block/vhost-user-blk.c
> > > +++ b/hw/block/vhost-user-blk.c
> > > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >  error_setg(errp, "queue size must be non-zero");
> > >  return;
> > >  
> > >  }
> > > 
> > > -if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > > +if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > > 
> > >  error_setg(errp, "queue size must not exceed %d",
> > > 
> > > -   VIRTQUEUE_MAX_SIZE);
> > > +   VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >  return;
> > >  
> > >  }
> > > 
> > > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >  }
> > >  
> > >  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > > 
> > > -sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > > +sizeof(struct virtio_blk_config),
> > > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> > >  s->virtqs = g_new(VirtQueue *, s->num_queues);
> > >  for (i = 0; i < s->num_queues; i++) {
> > > 
> > > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > > index 9c0f46815c..5883e3e7db 100644
> > > --- a/hw/block/virtio-blk.c
> > > +++ b/hw/block/virtio-blk.c
> > > @@ -1171,10 +1171,10 @@ static 

Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Christian Schoenebeck
On Tuesday, 5 October 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > Raise the maximum possible virtio transfer size to 128M
> > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > more detailed explanation for the reasons of this change.
> > 
> > For not breaking any virtio user, all virtio users transition
> > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > of 1k with this commit.
> > 
> > On the long-term, each virtio user should subsequently either
> > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > after checking that they support the new value of 32k, or
> > otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> > macro by an appropriate value supported by them.
> > 
> > Signed-off-by: Christian Schoenebeck 
> 
> I don't think we need this. Legacy isn't descriptive either.  Just leave
> VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.

Does this mean you disagree that, in the long term, all virtio users should 
either transition to the new upper limit of 32k max queue size or introduce 
their own limit on their end?

Independent of the name (and I would appreciate suggestions for an adequate 
macro name here), I still think this new limit should be placed in the shared 
virtio.h file, because this value is not something invented on the virtio user 
side. It rather reflects the theoretical upper limit possible with the virtio 
protocol, which is and will be common for all virtio users.

> > ---
> > 
> >  hw/9pfs/virtio-9p-device.c |  2 +-
> >  hw/block/vhost-user-blk.c  |  6 +++---
> >  hw/block/virtio-blk.c  |  6 +++---
> >  hw/char/virtio-serial-bus.c|  2 +-
> >  hw/input/virtio-input.c|  2 +-
> >  hw/net/virtio-net.c| 12 ++--
> >  hw/scsi/virtio-scsi.c  |  2 +-
> >  hw/virtio/vhost-user-fs.c  |  6 +++---
> >  hw/virtio/vhost-user-i2c.c |  2 +-
> >  hw/virtio/vhost-vsock-common.c |  2 +-
> >  hw/virtio/virtio-balloon.c |  2 +-
> >  hw/virtio/virtio-crypto.c  |  2 +-
> >  hw/virtio/virtio-iommu.c   |  2 +-
> >  hw/virtio/virtio-mem.c |  2 +-
> >  hw/virtio/virtio-mmio.c|  4 ++--
> >  hw/virtio/virtio-pmem.c|  2 +-
> >  hw/virtio/virtio-rng.c |  3 ++-
> >  include/hw/virtio/virtio.h | 20 +++-
> >  18 files changed, 49 insertions(+), 30 deletions(-)
> > 
> > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > index cd5d95dd51..9013e7df6e 100644
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > Error **errp)> 
> >  v->config_size = sizeof(struct virtio_9p_config) +
> >  strlen(s->fsconf.tag);
> >  virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > 
> > -VIRTQUEUE_MAX_SIZE);
> > +VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> >  
> >  }
> > 
> > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > index 336f56705c..e5e45262ab 100644
> > --- a/hw/block/vhost-user-blk.c
> > +++ b/hw/block/vhost-user-blk.c
> > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >  error_setg(errp, "queue size must be non-zero");
> >  return;
> >  
> >  }
> > 
> > -if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > 
> >  error_setg(errp, "queue size must not exceed %d",
> > 
> > -   VIRTQUEUE_MAX_SIZE);
> > +   VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >  return;
> >  
> >  }
> > 
> > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >  }
> >  
> >  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > 
> > -sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > +sizeof(struct virtio_blk_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >  s->virtqs = g_new(VirtQueue *, s->num_queues);
> >  for (i = 0; i < s->num_queues; i++) {
> > 
> > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > index 9c0f46815c..5883e3e7db 100644
> > --- a/hw/block/virtio-blk.c
> > +++ b/hw/block/virtio-blk.c
> > @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >  return;
> >  
> >  }
> >  if (!is_power_of_2(conf->queue_size) ||
> > 
> > -conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > 
> >  error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> >  
> > "must be a power of 2 (max %d)",
> > 
> > 

Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Michael S. Tsirkin
On Tue, Oct 05, 2021 at 01:10:56PM +0200, Christian Schoenebeck wrote:
> On Dienstag, 5. Oktober 2021 09:38:53 CEST David Hildenbrand wrote:
> > On 04.10.21 21:38, Christian Schoenebeck wrote:
> > > At the moment the maximum transfer size with virtio is limited to 4M
> > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > theoretical possible transfer size of 128M (32k pages) according to the
> > > virtio specs:
> > > 
> > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > > x1-240006
> > I'm missing the "why do we care". Can you comment on that?
> 
> Primary motivation is the possibility of improved performance, e.g. in case 
> of 
> 9pfs, people can raise the maximum transfer size with the Linux 9p client's 
> 'msize' option on guest side (and only on guest side actually). If guest 
> performs large chunk I/O, e.g. consider something "useful" like this one on 
> guest side:
> 
>   time cat large_file_on_9pfs.dat > /dev/null
> 
> Then there is a noticable performance increase with higher transfer size 
> values. That performance gain is continuous with rising transfer size values, 
> but the performance increase obviously shrinks with rising transfer sizes as 
> well, as with similar concepts in general like cache sizes, etc.
> 
> Then a secondary motivation is described in reason (2) of patch 2: if the 
> transfer size is configurable on guest side (like it is the case with the 
> 9pfs 
> 'msize' option), then there is the unpleasant side effect that the current 
> virtio limit of 4M is invisible to guest; as this value of 4M is simply an 
> arbitrarily limit set on QEMU side in the past (probably just implementation 
> motivated on QEMU side at that point), i.e. it is not a limit specified by 
> the 
> virtio protocol,

According to the spec it's specified, sure enough: vq size limits the
size of indirect descriptors too.
However, ever since commit 44ed8089e991a60d614abe0ee4b9057a28b364e4 we
do not enforce it in the driver ...

> nor is this limit be made aware to guest via virtio protocol 
> at all. The consequence with 9pfs would be if user tries to go higher than 
> 4M, 
> then the system would simply hang with this QEMU error:
> 
>   virtio: too many write descriptors in indirect table
> 
> Now whether this is an issue or not for individual virtio users, depends on 
> whether the individual virtio user already had its own limitation <= 4M 
> enforced on its side.
> 
> Best regards,
> Christian Schoenebeck
> 




Re: [PATCH 06/11] qdev: Add Error parameter to qdev_set_id()

2021-10-05 Thread Kevin Wolf
On 27.09.2021 at 12:33, Damien Hedde wrote:
> Hi Kevin,
> 
> I proposed a very similar patch in our rfc series because we needed some of
> the cleaning you do here.
> https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg05679.html
> I've added a bit of doc for the function, feel free to take it if you want.

Thanks, I'm replacing my patch with yours for v2.

Kevin




Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Christian Schoenebeck
On Tuesday, 5 October 2021 09:38:53 CEST David Hildenbrand wrote:
> On 04.10.21 21:38, Christian Schoenebeck wrote:
> > At the moment the maximum transfer size with virtio is limited to 4M
> > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > theoretical possible transfer size of 128M (32k pages) according to the
> > virtio specs:
> > 
> > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > x1-240006
> I'm missing the "why do we care". Can you comment on that?

Primary motivation is the possibility of improved performance, e.g. in case of 
9pfs, people can raise the maximum transfer size with the Linux 9p client's 
'msize' option on guest side (and only on guest side actually). If guest 
performs large chunk I/O, e.g. consider something "useful" like this one on 
guest side:

  time cat large_file_on_9pfs.dat > /dev/null

Then there is a noticeable performance increase with higher transfer size 
values. That performance gain is continuous with rising transfer size values, 
but the performance increase obviously shrinks with rising transfer sizes as 
well, as with similar concepts in general like cache sizes, etc.

Then a secondary motivation is described in reason (2) of patch 2: if the 
transfer size is configurable on guest side (like it is the case with the 9pfs 
'msize' option), then there is the unpleasant side effect that the current 
virtio limit of 4M is invisible to guest; as this value of 4M is simply an 
arbitrary limit set on the QEMU side in the past (probably just implementation 
motivated on the QEMU side at that point), i.e. it is not a limit specified by 
the virtio protocol, nor is this limit made known to the guest via the virtio 
protocol at all. The consequence with 9pfs is that if the user tries to go 
higher than 4M, the system simply hangs with this QEMU error:

  virtio: too many write descriptors in indirect table

Now whether this is an issue or not for individual virtio users, depends on 
whether the individual virtio user already had its own limitation <= 4M 
enforced on its side.

Best regards,
Christian Schoenebeck





Re: [PATCH v1 2/2] migration: add missing qemu_mutex_lock_iothread in migration_completion

2021-10-05 Thread Dr. David Alan Gilbert
* Emanuele Giuseppe Esposito (eespo...@redhat.com) wrote:
> qemu_savevm_state_complete_postcopy assumes the iothread lock (BQL)
> to be held, but instead it isn't.
> 
> Signed-off-by: Emanuele Giuseppe Esposito 

Interesting, I think you're right - and I think it's been missing it
from the start.

Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/migration.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 041b8451a6..215d5281f2 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3182,7 +3182,10 @@ static void migration_completion(MigrationState *s)
>  } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
>  trace_migration_completion_postcopy_end();
>  
> +qemu_mutex_lock_iothread();
>  qemu_savevm_state_complete_postcopy(s->to_dst_file);
> +qemu_mutex_unlock_iothread();
> +
>  trace_migration_completion_postcopy_end_after_complete();
>  } else if (s->state == MIGRATION_STATUS_CANCELLING) {
>  goto fail;
> -- 
> 2.27.0
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-05 Thread Peter Lieven

On 05.10.21 at 10:36, Ilya Dryomov wrote:

On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven  wrote:

On 05.10.21 at 09:54, Ilya Dryomov wrote:

On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all-zero) areas.

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file cant be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

   block/rbd.c | 126 
   1 file changed, 126 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..3cb24f9981 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,131 @@ static ImageInfoSpecific 
*qemu_rbd_get_specific_info(BlockDriverState *bs,
   return spec_info;
   }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.


+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows interrupting the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exit code and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?


That would happen if you diff from a snapshot; the question is whether it can 
also happen if the image is a clone of a snapshot.



+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t 
offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;

Nit: I would rename ret to status or something like that to make
it clear(er) that it is an actual value and never an error.  Or,
even better, 

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-05 Thread Ilya Dryomov
On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven  wrote:
>
> Am 05.10.21 um 09:54 schrieb Ilya Dryomov:
> > On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:
> >> the qemu rbd driver currently lacks support for bdrv_co_block_status.
> >> This results mainly in incorrect progress during block operations (e.g.
> >> qemu-img convert with an rbd image as source).
> >>
> >> This patch utilizes the rbd_diff_iterate2 call from librbd to detect
> >> allocated and unallocated (all-zero) areas.
> >>
> >> To avoid querying the ceph OSDs for the answer, this is only done if
> >> the image has the fast-diff feature which depends on the object-map and
> >> exclusive-lock features. In this case it is guaranteed that the information
> >> is present in memory in the librbd client and thus very fast.
> >>
> >> If fast-diff is not available all areas are reported to be allocated
> >> which is the current behaviour if bdrv_co_block_status is not implemented.
> >>
> >> Signed-off-by: Peter Lieven 
> >> ---
> >> V2->V3:
> >> - check rbd_flags every time (they can change during runtime) [Ilya]
> >> - also check for fast-diff invalid flag [Ilya]
> >> - *map and *file can't be NULL [Ilya]
> >> - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
> >>unallocated area [Ilya]
> >> - typo: catched -> caught [Ilya]
> >> - changed wording about fast-diff, object-map and exclusive lock in
> >>commit msg [Ilya]
> >>
> >> V1->V2:
> >> - add commit comment [Stefano]
> >> - use failed_post_open [Stefano]
> >> - remove redundant assert [Stefano]
> >> - add macro+comment for the magic -9000 value [Stefano]
> >> - always set *file if it's non-NULL [Stefano]
> >>
> >>   block/rbd.c | 126 
> >>   1 file changed, 126 insertions(+)
> >>
> >> diff --git a/block/rbd.c b/block/rbd.c
> >> index dcf82b15b8..3cb24f9981 100644
> >> --- a/block/rbd.c
> >> +++ b/block/rbd.c
> >> @@ -1259,6 +1259,131 @@ static ImageInfoSpecific 
> >> *qemu_rbd_get_specific_info(BlockDriverState *bs,
> >>   return spec_info;
> >>   }
> >>
> >> +typedef struct rbd_diff_req {
> >> +uint64_t offs;
> >> +uint64_t bytes;
> >> +int exists;
> > Hi Peter,
> >
> > Nit: make exists a bool.  The one in the callback has to be an int
> > because of the callback signature but let's not spread that.
> >
> >> +} rbd_diff_req;
> >> +
> >> +/*
> >> + * rbd_diff_iterate2 allows interrupting the execution by returning a 
> >> negative
> >> + * value in the callback routine. Choose a value that does not conflict 
> >> with
> >> + * an existing exitcode and return it if we want to prematurely stop the
> >> + * execution because we detected a change in the allocation status.
> >> + */
> >> +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
> >> +
> >> +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
> >> +   int exists, void *opaque)
> >> +{
> >> +struct rbd_diff_req *req = opaque;
> >> +
> >> +assert(req->offs + req->bytes <= offs);
> >> +
> >> +if (req->exists && offs > req->offs + req->bytes) {
> >> +/*
> >> + * we started in an allocated area and jumped over an unallocated 
> >> area,
> >> + * req->bytes contains the length of the allocated area before the
> >> + * unallocated area. stop further processing.
> >> + */
> >> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
> >> +}
> >> +if (req->exists && !exists) {
> >> +/*
> >> + * we started in an allocated area and reached a hole. req->bytes
> >> + * contains the length of the allocated area before the hole.
> >> + * stop further processing.
> >> + */
> >> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
> > Do you have a test case for when this branch is taken?
>
>
> That would happen if you diff from a snapshot; the question is whether it can also
> happen if the image is a clone of a snapshot.
>
>
> >
> >> +}
> >> +if (!req->exists && exists && offs > req->offs) {
> >> +/*
> >> + * we started in an unallocated area and hit the first allocated
> >> + * block. req->bytes must be set to the length of the unallocated 
> >> area
> >> + * before the allocated area. stop further processing.
> >> + */
> >> +req->bytes = offs - req->offs;
> >> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
> >> +}
> >> +
> >> +/*
> >> + * assert that we caught all cases above and allocation state has not
> >> + * changed during callbacks.
> >> + */
> >> +assert(exists == req->exists || !req->bytes);
> >> +req->exists = exists;
> >> +
> >> +/*
> >> + * assert that we either return an unallocated block or have got 
> >> callbacks
> >> + * for all allocated blocks present.
> >> + */
> >> +assert(!req->exists || offs == req->offs + req->bytes);
> >> +req->bytes = offs + len - req->offs;
> >> +
> >> +return 0;
> >> +}
> >> +
> >> +static int 

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-05 Thread Peter Lieven

Am 05.10.21 um 09:54 schrieb Ilya Dryomov:

On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all-zero) areas.

To avoid querying the ceph OSDs for the answer, this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file can't be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
   unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
   commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if it's non-NULL [Stefano]

  block/rbd.c | 126 
  1 file changed, 126 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..3cb24f9981 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,131 @@ static ImageInfoSpecific 
*qemu_rbd_get_specific_info(BlockDriverState *bs,
  return spec_info;
  }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.


+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows interrupting the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?



That would happen if you diff from a snapshot; the question is whether it can also
happen if the image is a clone of a snapshot.





+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t 
offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;

Nit: I would rename ret to status or something like that to make
it clear(er) that it is an actual value and never an error.  Or,
even better, drop it entirely and return one of the two bitmasks
directly.


+struct rbd_diff_req req = { .offs 
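
For readers skimming the archive: a minimal sketch of the "drop ret and return
the bitmask directly" idea from Ilya's last nit could look like the helper
below. This is illustrative only, not the actual patch; the BDRV_BLOCK_* flags
are QEMU's block-status bits, and req/offset/bs are the names used in the
quoted code.

/* sketch only -- not part of the posted patch */
static int qemu_rbd_block_status_bits(bool exists)
{
    return (exists ? BDRV_BLOCK_DATA : BDRV_BLOCK_ZERO) |
           BDRV_BLOCK_OFFSET_VALID;
}

qemu_rbd_co_block_status() could then end with something along the lines of
"*pnum = req.bytes; *map = offset; *file = bs;
return qemu_rbd_block_status_bits(req.exists);" -- whether *file should be bs
itself is an assumption here; the final revision may differ.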

[PATCH v1 1/2] migration: block-dirty-bitmap: add missing qemu_mutex_lock_iothread

2021-10-05 Thread Emanuele Giuseppe Esposito
init_dirty_bitmap_migration() assumes that the iothread lock (BQL)
is held, but it isn't.

Instead of adding the lock to qemu_savevm_state_setup(),
follow the same pattern as the other ->save_setup callbacks
and lock+unlock inside dirty_bitmap_save_setup().

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
 migration/block-dirty-bitmap.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 35f5ef688d..9aba7d9c22 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -1215,7 +1215,10 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void 
*opaque)
 {
 DBMSaveState *s = &((DBMState *)opaque)->save;
 SaveBitmapState *dbms = NULL;
+
+qemu_mutex_lock_iothread();
 if (init_dirty_bitmap_migration(s) < 0) {
+qemu_mutex_unlock_iothread();
 return -1;
 }
 
@@ -1223,7 +1226,7 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void 
*opaque)
 send_bitmap_start(f, s, dbms);
 }
 qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS);
-
+qemu_mutex_unlock_iothread();
 return 0;
 }
 
-- 
2.27.0




[PATCH v1 0/2] Migration: fix missing iothread locking

2021-10-05 Thread Emanuele Giuseppe Esposito
Some functions (in this case qemu_savevm_state_complete_postcopy() and
init_dirty_bitmap_migration()) assume and document that
qemu_mutex_lock_iothread() is held.

This seems to have been forgotten in some places, and this series
aims to fix that.

Patch 1 was part of my RFC block layer series "block layer: split
block APIs in graph and I/O" but I decided to do a separate series
for these two bugs, as they are independent from the API split.

Signed-off-by: Emanuele Giuseppe Esposito 

Emanuele Giuseppe Esposito (2):
  migration: block-dirty-bitmap: add missing qemu_mutex_lock_iothread
  migration: add missing qemu_mutex_lock_iothread in
migration_completion

 migration/block-dirty-bitmap.c | 5 -
 migration/migration.c  | 3 +++
 2 files changed, 7 insertions(+), 1 deletion(-)

-- 
2.27.0




[PATCH v1 2/2] migration: add missing qemu_mutex_lock_iothread in migration_completion

2021-10-05 Thread Emanuele Giuseppe Esposito
qemu_savevm_state_complete_postcopy() assumes that the iothread lock (BQL)
is held, but it isn't.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 migration/migration.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 041b8451a6..215d5281f2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3182,7 +3182,10 @@ static void migration_completion(MigrationState *s)
 } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
 trace_migration_completion_postcopy_end();
 
+qemu_mutex_lock_iothread();
 qemu_savevm_state_complete_postcopy(s->to_dst_file);
+qemu_mutex_unlock_iothread();
+
 trace_migration_completion_postcopy_end_after_complete();
 } else if (s->state == MIGRATION_STATUS_CANCELLING) {
 goto fail;
-- 
2.27.0




Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-05 Thread Ilya Dryomov
On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:
>
> the qemu rbd driver currently lacks support for bdrv_co_block_status.
> This results mainly in incorrect progress during block operations (e.g.
> qemu-img convert with an rbd image as source).
>
> This patch utilizes the rbd_diff_iterate2 call from librbd to detect
> allocated and unallocated (all-zero) areas.
>
> To avoid querying the ceph OSDs for the answer, this is only done if
> the image has the fast-diff feature which depends on the object-map and
> exclusive-lock features. In this case it is guaranteed that the information
> is present in memory in the librbd client and thus very fast.
>
> If fast-diff is not available all areas are reported to be allocated
> which is the current behaviour if bdrv_co_block_status is not implemented.
>
> Signed-off-by: Peter Lieven 
> ---
> V2->V3:
> - check rbd_flags every time (they can change during runtime) [Ilya]
> - also check for fast-diff invalid flag [Ilya]
> - *map and *file can't be NULL [Ilya]
> - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
>   unallocated area [Ilya]
> - typo: catched -> caught [Ilya]
> - changed wording about fast-diff, object-map and exclusive lock in
>   commit msg [Ilya]
>
> V1->V2:
> - add commit comment [Stefano]
> - use failed_post_open [Stefano]
> - remove redundant assert [Stefano]
> - add macro+comment for the magic -9000 value [Stefano]
> - always set *file if it's non-NULL [Stefano]
>
>  block/rbd.c | 126 
>  1 file changed, 126 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index dcf82b15b8..3cb24f9981 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1259,6 +1259,131 @@ static ImageInfoSpecific 
> *qemu_rbd_get_specific_info(BlockDriverState *bs,
>  return spec_info;
>  }
>
> +typedef struct rbd_diff_req {
> +uint64_t offs;
> +uint64_t bytes;
> +int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.

> +} rbd_diff_req;
> +
> +/*
> + * rbd_diff_iterate2 allows interrupting the execution by returning a negative
> + * value in the callback routine. Choose a value that does not conflict with
> + * an existing exitcode and return it if we want to prematurely stop the
> + * execution because we detected a change in the allocation status.
> + */
> +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
> +
> +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
> +   int exists, void *opaque)
> +{
> +struct rbd_diff_req *req = opaque;
> +
> +assert(req->offs + req->bytes <= offs);
> +
> +if (req->exists && offs > req->offs + req->bytes) {
> +/*
> + * we started in an allocated area and jumped over an unallocated 
> area,
> + * req->bytes contains the length of the allocated area before the
> + * unallocated area. stop further processing.
> + */
> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
> +}
> +if (req->exists && !exists) {
> +/*
> + * we started in an allocated area and reached a hole. req->bytes
> + * contains the length of the allocated area before the hole.
> + * stop further processing.
> + */
> +return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?

> +}
> +if (!req->exists && exists && offs > req->offs) {
> +/*
> + * we started in an unallocated area and hit the first allocated
> + * block. req->bytes must be set to the length of the unallocated 
> area
> + * before the allocated area. stop further processing.
> + */
> +req->bytes = offs - req->offs;
> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
> +}
> +
> +/*
> + * assert that we caught all cases above and allocation state has not
> + * changed during callbacks.
> + */
> +assert(exists == req->exists || !req->bytes);
> +req->exists = exists;
> +
> +/*
> + * assert that we either return an unallocated block or have got 
> callbacks
> + * for all allocated blocks present.
> + */
> +assert(!req->exists || offs == req->offs + req->bytes);
> +req->bytes = offs + len - req->offs;
> +
> +return 0;
> +}
> +
> +static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
> + bool want_zero, int64_t 
> offset,
> + int64_t bytes, int64_t 
> *pnum,
> + int64_t *map,
> + BlockDriverState **file)
> +{
> +BDRVRBDState *s = bs->opaque;
> +int ret, r;

Nit: I would rename ret to status or something like that to make
it clear(er) that it is an actual value and never an error.  Or,
even better, drop it entirely and return one of the two 
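
The quoted hunk above is cut off before the qemu_rbd_co_block_status() body
that actually drives the iteration. For orientation only, the librbd entry
point named in the commit message has the following shape, and a caller would
use it roughly like this (a hedged sketch based on the public librbd API; the
real call site in the patch may pass different flags):

#include <stdbool.h>
#include <stdint.h>
#include <rbd/librbd.h>

/* prototype from <rbd/librbd.h>:
 * int rbd_diff_iterate2(rbd_image_t image, const char *fromsnapname,
 *                       uint64_t ofs, uint64_t len,
 *                       uint8_t include_parent, uint8_t whole_object,
 *                       int (*cb)(uint64_t, size_t, int, void *),
 *                       void *arg);
 */

/* sketch: ask librbd about [offset, offset + bytes) and let the callback
 * (qemu_rbd_co_block_status_cb from the patch) fill in *req */
static int sketch_query_allocation(rbd_image_t image, uint64_t offset,
                                   uint64_t bytes, struct rbd_diff_req *req)
{
    int r = rbd_diff_iterate2(image, NULL /* no from-snapshot: whole image */,
                              offset, bytes,
                              true /* include_parent */, true /* whole_object */,
                              qemu_rbd_co_block_status_cb, req);
    if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) {
        return r;   /* a real librbd error */
    }
    return 0;       /* req->exists / req->bytes now describe the first extent */
}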

Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread David Hildenbrand

On 04.10.21 21:38, Christian Schoenebeck wrote:

At the moment the maximum transfer size with virtio is limited to 4M
(1024 * PAGE_SIZE). This series raises this limit to its maximum
theoretical possible transfer size of 128M (32k pages) according to the
virtio specs:

https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006



I'm missing the "why do we care". Can you comment on that?


--
Thanks,

David / dhildenb




Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable

2021-10-05 Thread Greg Kurz
On Mon, 4 Oct 2021 21:38:04 +0200
Christian Schoenebeck  wrote:

> Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> variable per virtio user.
> 
> Reasons:
> 
> (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> maximum queue size possible. Which is actually the maximum
> queue size allowed by the virtio protocol. The appropriate
> value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> 
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> 
> Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> more or less arbitrary value of 1024 in the past, which
> limits the maximum transfer size with virtio to 4M
> (more precise: 1024 * PAGE_SIZE, with the latter typically
> being 4k).
> 
> (2) Additionally the current value of 1024 poses a hidden limit,
> invisible to the guest, which causes a system hang with the
> following QEMU error if the guest tries to exceed it:
> 
> virtio: too many write descriptors in indirect table
> 
> (3) Unfortunately not all virtio users in QEMU would currently
> work correctly with the new value of 32768.
> 
> So let's turn this hard-coded global value into a runtime
> variable as a first step in this commit, configurable for each
> virtio user by passing a corresponding value to the virtio_init()
> call.
> 
> Signed-off-by: Christian Schoenebeck 
> ---

Reviewed-by: Greg Kurz 

>  hw/9pfs/virtio-9p-device.c |  3 ++-
>  hw/block/vhost-user-blk.c  |  2 +-
>  hw/block/virtio-blk.c  |  3 ++-
>  hw/char/virtio-serial-bus.c|  2 +-
>  hw/display/virtio-gpu-base.c   |  2 +-
>  hw/input/virtio-input.c|  2 +-
>  hw/net/virtio-net.c| 15 ---
>  hw/scsi/virtio-scsi.c  |  2 +-
>  hw/virtio/vhost-user-fs.c  |  2 +-
>  hw/virtio/vhost-user-i2c.c |  3 ++-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c |  4 ++--
>  hw/virtio/virtio-crypto.c  |  3 ++-
>  hw/virtio/virtio-iommu.c   |  2 +-
>  hw/virtio/virtio-mem.c |  2 +-
>  hw/virtio/virtio-pmem.c|  2 +-
>  hw/virtio/virtio-rng.c |  2 +-
>  hw/virtio/virtio.c | 35 +++---
>  include/hw/virtio/virtio.h |  5 -
>  19 files changed, 57 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 54ee93b71f..cd5d95dd51 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -216,7 +216,8 @@ static void virtio_9p_device_realize(DeviceState *dev, 
> Error **errp)
>  }
>  
>  v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
> -virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size);
> +virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> +VIRTQUEUE_MAX_SIZE);
>  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index ba13cb87e5..336f56705c 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  }
>  
>  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -sizeof(struct virtio_blk_config));
> +sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
>  
>  s->virtqs = g_new(VirtQueue *, s->num_queues);
>  for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index f139cd7cc9..9c0f46815c 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1213,7 +1213,8 @@ static void virtio_blk_device_realize(DeviceState *dev, 
> Error **errp)
>  
>  virtio_blk_set_config_size(s, s->host_features);
>  
> -virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size);
> +virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> +VIRTQUEUE_MAX_SIZE);
>  
>  s->blk = conf->conf.blk;
>  s->rq = NULL;
> diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> index f01ec2137c..9ad915 100644
> --- a/hw/char/virtio-serial-bus.c
> +++ b/hw/char/virtio-serial-bus.c
> @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState 
> *dev, Error **errp)
>  config_size = offsetof(struct virtio_console_config, emerg_wr);
>  }
>  virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> -config_size);
> +config_size, VIRTQUEUE_MAX_SIZE);
>  
>  /* Spawn a new virtio-serial bus on which the ports will ride as devices 
> */
> qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> diff --git a/hw/display/virtio-gpu-base.c b/hw/display/virtio-gpu-base.c
> index c8da4806e0..20b06a7adf 100644
> --- a/hw/display/virtio-gpu-base.c
> +++ b/hw/display/virtio-gpu-base.c
> @@ -171,7 +171,7 @@ 
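
The corresponding hunk for include/hw/virtio/virtio.h is not visible in the
quote above. Judging from the call sites, the prototype change presumably
looks roughly like this (a guess for orientation; the parameter name and type
are assumptions):

-void virtio_init(VirtIODevice *vdev, const char *name,
-                 uint16_t device_id, size_t config_size);
+void virtio_init(VirtIODevice *vdev, const char *name,
+                 uint16_t device_id, size_t config_size,
+                 uint16_t queue_max_size);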

Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Greg Kurz
On Tue, 5 Oct 2021 03:16:07 -0400
"Michael S. Tsirkin"  wrote:

> On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > Raise the maximum possible virtio transfer size to 128M
> > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > more detailed explanation for the reasons of this change.
> > 
> > To avoid breaking any virtio user, all virtio users transition
> > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > of 1k with this commit.
> > 
> > In the long term, each virtio user should subsequently either
> > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > after checking that they support the new value of 32k, or
> > otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> > macro by an appropriate value supported by them.
> > 
> > Signed-off-by: Christian Schoenebeck 
> 
> 
> I don't think we need this. Legacy isn't descriptive either.  Just leave
> VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
> 

Yes I agree. Only virtio-9p is going to benefit from the new
size in the short/medium term, so it looks a bit excessive to
patch all devices. Also in the end, you end up reverting the name
change in the last patch for virtio-9p... which is an indication
that this patch does too much.

Introduce the new macro in virtio-9p and use it only there.
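
A minimal sketch of that suggestion (the macro name is invented here and not
taken from any posted patch):

/* hw/9pfs/virtio-9p-device.c -- sketch only; the identifier is an assumption */
#define VIRTQUEUE_MAX_SIZE_9P 32768   /* queue size limit from the virtio 1.1 spec */

    /* in virtio_9p_device_realize(), instead of the global maximum: */
    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
                VIRTQUEUE_MAX_SIZE_9P);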

> > ---
> >  hw/9pfs/virtio-9p-device.c |  2 +-
> >  hw/block/vhost-user-blk.c  |  6 +++---
> >  hw/block/virtio-blk.c  |  6 +++---
> >  hw/char/virtio-serial-bus.c|  2 +-
> >  hw/input/virtio-input.c|  2 +-
> >  hw/net/virtio-net.c| 12 ++--
> >  hw/scsi/virtio-scsi.c  |  2 +-
> >  hw/virtio/vhost-user-fs.c  |  6 +++---
> >  hw/virtio/vhost-user-i2c.c |  2 +-
> >  hw/virtio/vhost-vsock-common.c |  2 +-
> >  hw/virtio/virtio-balloon.c |  2 +-
> >  hw/virtio/virtio-crypto.c  |  2 +-
> >  hw/virtio/virtio-iommu.c   |  2 +-
> >  hw/virtio/virtio-mem.c |  2 +-
> >  hw/virtio/virtio-mmio.c|  4 ++--
> >  hw/virtio/virtio-pmem.c|  2 +-
> >  hw/virtio/virtio-rng.c |  3 ++-
> >  include/hw/virtio/virtio.h | 20 +++-
> >  18 files changed, 49 insertions(+), 30 deletions(-)
> > 
> > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > index cd5d95dd51..9013e7df6e 100644
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, 
> > Error **errp)
> >  
> >  v->config_size = sizeof(struct virtio_9p_config) + 
> > strlen(s->fsconf.tag);
> >  virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > -VIRTQUEUE_MAX_SIZE);
> > +VIRTQUEUE_LEGACY_MAX_SIZE);
> >  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> >  }
> >  
> > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > index 336f56705c..e5e45262ab 100644
> > --- a/hw/block/vhost-user-blk.c
> > +++ b/hw/block/vhost-user-blk.c
> > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState 
> > *dev, Error **errp)
> >  error_setg(errp, "queue size must be non-zero");
> >  return;
> >  }
> > -if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> >  error_setg(errp, "queue size must not exceed %d",
> > -   VIRTQUEUE_MAX_SIZE);
> > +   VIRTQUEUE_LEGACY_MAX_SIZE);
> >  return;
> >  }
> >  
> > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState 
> > *dev, Error **errp)
> >  }
> >  
> >  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > -sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > +sizeof(struct virtio_blk_config), 
> > VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >  s->virtqs = g_new(VirtQueue *, s->num_queues);
> >  for (i = 0; i < s->num_queues; i++) {
> > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > index 9c0f46815c..5883e3e7db 100644
> > --- a/hw/block/virtio-blk.c
> > +++ b/hw/block/virtio-blk.c
> > @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState 
> > *dev, Error **errp)
> >  return;
> >  }
> >  if (!is_power_of_2(conf->queue_size) ||
> > -conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> >  error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> > "must be a power of 2 (max %d)",
> > -   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> > +   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> >  return;
> >  }
> >  
> > @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState 
> > *dev, Error **errp)
> >  virtio_blk_set_config_size(s, 

Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Michael S. Tsirkin
On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> Raise the maximum possible virtio transfer size to 128M
> (more precisely: 32k * PAGE_SIZE). See previous commit for a
> more detailed explanation for the reasons of this change.
> 
> To avoid breaking any virtio user, all virtio users transition
> to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> of 1k with this commit.
> 
> In the long term, each virtio user should subsequently either
> switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> after checking that they support the new value of 32k, or
> otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> macro by an appropriate value supported by them.
> 
> Signed-off-by: Christian Schoenebeck 


I don't think we need this. Legacy isn't descriptive either.  Just leave
VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
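
For example (identifier invented here, not something proposed in the thread),
keeping the existing macro and adding a separately named spec limit could be
as small as:

/* include/hw/virtio/virtio.h -- sketch only */
#define VIRTQUEUE_MAX_SIZE       1024    /* unchanged historical default */
#define VIRTQUEUE_SPEC_MAX_SIZE  32768   /* maximum queue size per the virtio spec */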

> ---
>  hw/9pfs/virtio-9p-device.c |  2 +-
>  hw/block/vhost-user-blk.c  |  6 +++---
>  hw/block/virtio-blk.c  |  6 +++---
>  hw/char/virtio-serial-bus.c|  2 +-
>  hw/input/virtio-input.c|  2 +-
>  hw/net/virtio-net.c| 12 ++--
>  hw/scsi/virtio-scsi.c  |  2 +-
>  hw/virtio/vhost-user-fs.c  |  6 +++---
>  hw/virtio/vhost-user-i2c.c |  2 +-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c |  2 +-
>  hw/virtio/virtio-crypto.c  |  2 +-
>  hw/virtio/virtio-iommu.c   |  2 +-
>  hw/virtio/virtio-mem.c |  2 +-
>  hw/virtio/virtio-mmio.c|  4 ++--
>  hw/virtio/virtio-pmem.c|  2 +-
>  hw/virtio/virtio-rng.c |  3 ++-
>  include/hw/virtio/virtio.h | 20 +++-
>  18 files changed, 49 insertions(+), 30 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index cd5d95dd51..9013e7df6e 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, 
> Error **errp)
>  
>  v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
>  virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> -VIRTQUEUE_MAX_SIZE);
> +VIRTQUEUE_LEGACY_MAX_SIZE);
>  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 336f56705c..e5e45262ab 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  error_setg(errp, "queue size must be non-zero");
>  return;
>  }
> -if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> +if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
>  error_setg(errp, "queue size must not exceed %d",
> -   VIRTQUEUE_MAX_SIZE);
> +   VIRTQUEUE_LEGACY_MAX_SIZE);
>  return;
>  }
>  
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  }
>  
>  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> +sizeof(struct virtio_blk_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>  s->virtqs = g_new(VirtQueue *, s->num_queues);
>  for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index 9c0f46815c..5883e3e7db 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  return;
>  }
>  if (!is_power_of_2(conf->queue_size) ||
> -conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> +conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
>  error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> "must be a power of 2 (max %d)",
> -   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> +   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
>  return;
>  }
>  
> @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState *dev, 
> Error **errp)
>  virtio_blk_set_config_size(s, s->host_features);
>  
>  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> -VIRTQUEUE_MAX_SIZE);
> +VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>  s->blk = conf->conf.blk;
>  s->rq = NULL;
> diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> index 9ad915..2d4285ab53 100644
> --- a/hw/char/virtio-serial-bus.c
> +++ b/hw/char/virtio-serial-bus.c
> @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState 
> *dev, Error **errp)
>  config_size = offsetof(struct virtio_console_config, emerg_wr);
>  }
>  virtio_init(vdev, 

Re: [PATCH v0 0/2] virtio-blk and vhost-user-blk cross-device migration

2021-10-05 Thread Michael S. Tsirkin
On Tue, Oct 05, 2021 at 02:18:40AM +0300, Roman Kagan wrote:
> On Mon, Oct 04, 2021 at 11:11:00AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Oct 04, 2021 at 06:07:29PM +0300, Denis Plotnikov wrote:
> > > It might be useful for the cases when a slow block layer should be replaced
> > > with a more performant one on a running VM without stopping it, i.e. with
> > > very low downtime comparable with the one on migration.
> > > 
> > > It's possible to achieve that for two reasons:
> > > 
> > > 1.The VMStates of "virtio-blk" and "vhost-user-blk" are almost the same.
> > >   They consist of the identical VMSTATE_VIRTIO_DEVICE and differ from
> > >   each other in the values of migration service fields only.
> > > 2.The device driver used in the guest is the same: virtio-blk
> > > 
> > > In the series cross-migration is achieved by adding a new type.
> > > The new type uses virtio-blk VMState instead of vhost-user-blk specific
> > > VMstate, also it implements migration save/load callbacks to be compatible
> > > with migration stream produced by "virtio-blk" device.
> > > 
> > > Adding the new type instead of modifying the existing one is convenient.
> > > It makes it easy to distinguish the new virtio-blk-compatible vhost-user-blk
> > > device from the existing non-compatible one using qemu machinery without any
> > > other modifications. That gives all the variety of qemu device-related
> > > constraints out of the box.
> > 
> > Hmm I'm not sure I understand. What is the advantage for the user?
> > What if vhost-user-blk became an alias for vhost-user-virtio-blk?
> > We could add some hacks to make it compatible for old machine types.
> 
> The point is that virtio-blk and vhost-user-blk are not
> migration-compatible ATM.  OTOH they are the same device from the guest
> POV so there's nothing fundamentally preventing the migration between
> the two.  In particular, we see it as a means to switch between the
> storage backend transports via live migration without disrupting the
> guest.
> 
> Migration-wise virtio-blk and vhost-user-blk have in common
> 
> - the content of the VMState -- VMSTATE_VIRTIO_DEVICE
> 
> The two differ in
> 
> - the name and the version of the VMStateDescription
> 
> - virtio-blk has an extra migration section (via .save/.load callbacks
>   on VirtioDeviceClass) containing requests in flight
> 
> It looks like to become migration-compatible with virtio-blk,
> vhost-user-blk has to start using VMStateDescription of virtio-blk and
> provide compatible .save/.load callbacks.  It isn't entirely obvious how
> to make this machine-type-dependent, so we came up with a simpler idea
> of defining a new device that shares most of the implementation with the
> original vhost-user-blk except for the migration stuff.  We're certainly
> open to suggestions on how to reconcile this under a single
> vhost-user-blk device, as this would be more user-friendly indeed.
> 
> We considered using a class property for this and defining the
> respective compat clause, but IIUC the class constructors (where .vmsd
> and .save/.load are defined) are not supposed to depend on class
> properties.
> 
> Thanks,
> Roman.

So the question is how to make vmsd depend on machine type.
CC Eduardo who poked at this kind of compat stuff recently,
paolo who looked at qom things most recently and dgilbert
for advice on migration.

-- 
MST
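
For illustration of the "new type" approach described in the cover letter, a
QOM registration along these lines is one way it could be wired up (a rough
sketch, not code from the series; the vmstate symbol and most details are
assumptions -- in particular, real compatibility also needs virtio-blk-style
save/load callbacks, as Roman notes above):

static void vhost_user_virtio_blk_class_init(ObjectClass *klass, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(klass);

    /* reuse virtio-blk's VMStateDescription so the stream name/version match */
    dc->vmsd = &vmstate_virtio_blk_compat;   /* assumed symbol */
}

static const TypeInfo vhost_user_virtio_blk_info = {
    .name       = "vhost-user-virtio-blk",   /* name floated by Michael above */
    .parent     = TYPE_VHOST_USER_BLK,
    .class_init = vhost_user_virtio_blk_class_init,
};

static void vhost_user_virtio_blk_register_types(void)
{
    type_register_static(&vhost_user_virtio_blk_info);
}

type_init(vhost_user_virtio_blk_register_types)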