Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()
Philippe Mathieu-Daudé writes: > On 11/15/21 16:57, Markus Armbruster wrote: >> Philippe Mathieu-Daudé writes: >>> On 11/15/21 13:55, Markus Armbruster wrote: drive_get_next() is basically a bad idea. It returns the "next" block backend of a certain interface type. "Next" means bus=0,unit=N, where subsequent calls count N up from zero, per interface type. This lets you define unit numbers implicitly by execution order. If the order changes, or new calls appear "in the middle", unit numbers change. ABI break. Hard to spot in review. Explicit is better than implicit: use drive_get() directly. Signed-off-by: Markus Armbruster --- > @@ -435,11 +438,13 @@ static void aspeed_machine_init(MachineState *machine) } for (i = 0; i < bmc->soc.sdhci.num_slots; i++) { -sdhci_attach_drive(>soc.sdhci.slots[i], drive_get_next(IF_SD)); +sdhci_attach_drive(>soc.sdhci.slots[i], + drive_get(IF_SD, 0, i)); >>> >>> If we put SD on bus #0, ... >>> } if (bmc->soc.emmc.num_slots) { -sdhci_attach_drive(>soc.emmc.slots[0], drive_get_next(IF_SD)); +sdhci_attach_drive(>soc.emmc.slots[0], + drive_get(IF_SD, 0, bmc->soc.sdhci.num_slots)); >>> >>> ... we'd want to put eMMC on bus #1 >> >> Using separate buses for different kinds of devices would be neater, but >> it also would be an incompatible change. This patch keeps existing >> bus/unit numbers working. drive_get_next() can only use bus 0. >> >>> but I see having eMMC cards on a >>> IF_SD bus as a bug, since these cards are soldered on the board. >> >> IF_SD is not a bus, it's an "block interface type", which is really just >> a user interface thing. > > Why are we discriminating by "block interface type" then? > > What is the difference between "block interfaces"? I see a block drive > as a generic unit, usable on multiple hardware devices. > > I never really understood how this "block interface type" helps > developers and users. I thought BlockInterfaceType and DriveInfo > were legacy / deprecated APIs we want to get rid of; and we would > come up with a replacement API using BlockDeviceInfo or providing > a BlockFrontend state of the art object. > Anyway, I suppose the explanation is buried in the git history > before the last 8 years. I need to keep reading. In the beginning (v0.4.2), there was -hda and -hdb, and life was simple. Then there was -hdc, -hdd, -cdrom (v0.5.1), -fda, -fdb (v0.6.0), -mtdblock, -sd, -pflash (v0.9.1). All these options do two things: they create a block backend, and they request the board to create a certain block frontend for it, similar to other options of this vintage, like -serial, -parallel, and -net. Boards generally ignore requests they don't understand, but that's just sloppiness. For each set of related options, there was a global variable holding the requests: bs_table[] for -hda, -hdb, -hdc, -hdd, -cdrom; fd_table[] -fda, -fdb; mtd_bdrv for -mtd; sd_drv for -ds; pflash_table[] for -pflash. The options replaced prior ones, except for -pflash, which appended to its table. bs_table[]'s index had a peculiar meaning: it's bus * MAX_IDE_DEVS + unit. This ensures that -hda (index 0) goes on IDE bus 0 as unit 0; -hdb on bus 0, unit 1; -hdc on 1, 0; -hdc on 1, 1. Life was now complicated enough for a generalization (v0.9.1), so there was -drive (v0.9.1). All the variables holding requests were fused into drives_table[]. Table elements are identified by (type, bus, unit), where type is an enum whose members correspond to the old global variables: IF_IDE for bs_table[], IF_FLOPPY for fd_table[], and so forth. So: -hda becomes type = IF_IDE, bus = 0, unit = 0 -hdb becomes type = IF_IDE, bus = 0, unit = 1 ... -sd becomes type = IF_SD, bus = 0, unit = 0 1st -pflash becomes type = IF_PFLASH, bus = 0, unit = 0 2nd -pflash becomes type = IF_PFLASH, bus = 0, unit = 1 ... Other mappings from old to new global variables would have been possible. I figure this one was chosen because it comes with a reasonable user interface. Identifying block devices by (interface type, bus, unit) is certainly nicer than by index in bs_table[]. Since bus and/or unit make sense only with some interface types, they are optional. Things calmed down for a couple of years, until -device appeared (v0.12). Now we needed a way to define just a backend, without requesting a frontend from the board. Instead of inventing a new option, this became IF_NONE, with meaningless bus and unit. Over the next years, the block layer outgrew -drive's limited capabilities to define frontends. -blockdev appeard (v1.7.0) and matured over several releases. I don't remember exactly when it became stable, relegating -drive if=none to legacy status. What's *not*
Re: [PATCH for 6.2 v4] nbd/server: Add --selinux-label option
On 15/11/2021 21.29, Eric Blake wrote: From: "Richard W.M. Jones" Under SELinux, Unix domain sockets have two labels. One is on the disk and can be set with commands such as chcon(1). There is a different label stored in memory (called the process label). This can only be set by the process creating the socket. When using SELinux + SVirt and wanting qemu to be able to connect to a qemu-nbd instance, you must set both labels correctly first. For qemu-nbd the options to set the second label are awkward. You can create the socket in a wrapper program and then exec into qemu-nbd. Or you could try something with LD_PRELOAD. This commit adds the ability to set the label straightforwardly on the command line, via the new --selinux-label flag. (The name of the flag is the same as the equivalent nbdkit option.) ... @@ -3430,6 +3437,7 @@ summary_info += {'libdaxctl support': libdaxctl} summary_info += {'libudev': libudev} # Dummy dependency, keep .found() summary_info += {'FUSE lseek':fuse_lseek.found()} +summary_info += {'selinux': selinux.found()} It's nicer if you do it like this (i.e. without the .found()): summary_info += {'selinux': selinux} ... then meson prints out the version of the library, too. Apart from that, patch looks fine to me: Reviewed-by: Thomas Huth
RE: [PATCH 2/2] virtio-blk: support BLKSECDISCARD
> The Linux block layer shares the DISCARD queue limits with SECDISCARD. > That's different from BLKZEROOUT (QEMU's WRITE_ZEROES). This is a Linux > implementation detail but I guess virtio-blk can share the DISCARD limits with > SECDISCARD too... > > > @@ -622,6 +628,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq > *req, MultiReqBuffer *mrb) > > unsigned out_num = req->elem.out_num; > > VirtIOBlock *s = req->dev; > > VirtIODevice *vdev = VIRTIO_DEVICE(s); > > +bool is_secdiscard = false; > > > > if (req->elem.out_num < 1 || req->elem.in_num < 1) { > > virtio_error(vdev, "virtio-blk missing headers"); @@ -722,6 > > +729,9 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, > MultiReqBuffer *mrb) > > * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch > > statement, > > * so we must mask it for these requests, then we will check if it is > > set. > > */ > > +case VIRTIO_BLK_T_SECDISCARD & ~VIRTIO_BLK_T_OUT: > > +is_secdiscard = true; > > +__attribute__((fallthrough)); > > The DISCARD case doesn't use __attribute__((fallthrough)) so this is > inconsistent. > QEMU doesn't use __attribute__((fallthrough)) so I suggest dropping this. Sure, will try to drop the fallthrough case. > > > diff --git a/include/standard-headers/linux/virtio_blk.h > > b/include/standard-headers/linux/virtio_blk.h > > index 2dcc90826a..c55a07840c 100644 > > --- a/include/standard-headers/linux/virtio_blk.h > > +++ b/include/standard-headers/linux/virtio_blk.h > > @@ -40,6 +40,7 @@ > > #define VIRTIO_BLK_F_MQ12 /* support more than one vq > */ > > #define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */ > > #define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is > supported */ > > +#define VIRTIO_BLK_F_SECDISCARD15 /* WRITE ZEROES is supported > */ > > The comment is copy-pasted from WRITE_ZEROES. Will fix it.
RE: [PATCH 1/2] block:hdev: support BLKSECDISCARD
> > Add a new option "secdiscard" for block drive. Parse it and record it > > in bs->open_flags as bit(BDRV_O_SECDISCARD). > > > > Add a new BDRV_REQ_SECDISCARD bit for secure discard request from > > virtual device. > > > > For host_device backend: implement by ioctl(BLKSECDISCARD) on real > > host block device. > > For other backend, no implementation. > > > > E.g.: > > qemu-system-x86_64 \ > > ... \ > > -drive > file=/dev/mmcblk0p2,if=none,format=raw,discard=on,secdiscard=on,id=sd0 \ > > -device virtio-blk-pci,drive=sd0,id=sd0_vblk \ > > ... > > I'm curious why there is explicit control over this feature (-drive > secdiscard=on|off). For example, is there a reason why users would want to > disable secdiscard on a drive that supports it? I guess there is no way to > emulate > it correctly so secdiscard=on on a drive that doesn't support it produces an > error? Yes. AFAIK, there is no way to emulate a "secure" discard. But it seems also hard to detect whether a host drive support secdiscard unless it really perform a real secdiscard. So I choose to add an option for user to enable it for virtual block. There is an assumption that user knows whether host support secdiscard. Best Regard Yadong
RE: [PATCH 1/2] block:hdev: support BLKSECDISCARD
> > Notably absent: qapi/block-core.json. Without changing this, the option can't > be > available in -blockdev, which is the primary option to configure block device > backends. > > This patch seems to contain multiple logical changes that should be split into > separate patches: > > * Adding a flags parameter to .bdrv_co_pdiscard > > * Support for the new feature in the core block layer (should be done > with -blockdev) > > * Convenience magic for -drive (BDRV_O_SECDISCARD). It's not clear that > this should be done at all because the option is really specific to > one single block driver (file-posix). I think in your patch, all > other block drivers silently ignore the option, which is not what we > want. Sorry for the absent for -blockdev. Will try add that. Also I will try to split the patches. And for the BDRV_O_SECDISCARD, it is specific for file-posix.c(host_device). Maybe I need to add the option only for file-posix.c. > > > diff --git a/block.c b/block.c > > index 580cb77a70..4f05e96d12 100644 > > --- a/block.c > > +++ b/block.c > > @@ -1128,6 +1128,32 @@ int bdrv_parse_discard_flags(const char *mode, > int *flags) > > return 0; > > } > > > > +/** > > + * Set open flags for a given secdiscard mode > > + * > > + * Return 0 on success, -1 if the secdiscard mode was invalid. > > + */ > > +int bdrv_parse_secdiscard_flags(const char *mode, int *flags, Error > > +**errp) { > > +*flags &= ~BDRV_O_SECDISCARD; > > + > > +if (!strcmp(mode, "off")) { > > +/* do nothing */ > > +} else if (!strcmp(mode, "on")) { > > +if (!(*flags & BDRV_O_UNMAP)) { > > +error_setg(errp, "cannot enable secdiscard when discard is " > > + "disabled!"); > > +return -1; > > +} > > This check has nothing to do with parsing the option, it's validating its > value. > > You don't even need a new function to parse it, because there is already > qemu_opt_get_bool(). Duplicating it means only that you're inconsistent with > other boolean options, which alternatively accept "yes"/"no", "true"/"false", > "y/n". Sure. Will use qemu_opt_get_bool() instead. > > > + > > +*flags |= BDRV_O_SECDISCARD; > > +} else { > > +return -1; > > +} > > + > > +return 0; > > +} > > + > > /** > > * Set open flags for a given cache mode > > * > > @@ -1695,6 +1721,11 @@ QemuOptsList bdrv_runtime_opts = { > > .type = QEMU_OPT_STRING, > > .help = "discard operation (ignore/off, unmap/on)", > > }, > > +{ > > +.name = BDRV_OPT_SECDISCARD, > > +.type = QEMU_OPT_STRING, > > +.help = "secure discard operation (off, on)", > > +}, > > { > > .name = BDRV_OPT_FORCE_SHARE, > > .type = QEMU_OPT_BOOL, > > @@ -1735,6 +1766,7 @@ static int bdrv_open_common(BlockDriverState *bs, > BlockBackend *file, > > const char *driver_name = NULL; > > const char *node_name = NULL; > > const char *discard; > > +const char *secdiscard; > > QemuOpts *opts; > > BlockDriver *drv; > > Error *local_err = NULL; > > @@ -1829,6 +1861,16 @@ static int bdrv_open_common(BlockDriverState > *bs, BlockBackend *file, > > } > > } > > > > + > > +secdiscard = qemu_opt_get(opts, BDRV_OPT_SECDISCARD); > > +if (secdiscard != NULL) { > > +if (bdrv_parse_secdiscard_flags(secdiscard, >open_flags, > > +errp) != 0) { > > +ret = -EINVAL; > > +goto fail_opts; > > +} > > +} > > + > > bs->detect_zeroes = > > bdrv_parse_detect_zeroes(opts, bs->open_flags, _err); > > if (local_err) { > > @@ -3685,6 +3727,10 @@ static BlockDriverState *bdrv_open_inherit(const > char *filename, > > , options, flags, options); > > } > > > > +if (g_strcmp0(qdict_get_try_str(options, BDRV_OPT_SECDISCARD), "on")) { > > +flags |= BDRV_O_SECDISCARD; > > Indentation is off. Will fix it. > > > --- a/block/file-posix.c > > +++ b/block/file-posix.c > > @@ -160,6 +160,7 @@ typedef struct BDRVRawState { > > bool is_xfs:1; > > #endif > > bool has_discard:1; > > +bool has_secdiscard:1; > > bool has_write_zeroes:1; > > bool discard_zeroes:1; > > bool use_linux_aio:1; > > has_secdiscard is only set to false in raw_open_common() and never changed or > used. Will remove it. > > > @@ -727,6 +728,7 @@ static int raw_open_common(BlockDriverState *bs, > > QDict *options, #endif /* !defined(CONFIG_LINUX_IO_URING) */ > > > > s->has_discard = true; > > +s->has_secdiscard = false; > > s->has_write_zeroes = true; > > if ((bs->open_flags & BDRV_O_NOCACHE) != 0 && !dio_byte_aligned(s->fd)) > { > > s->needs_alignment = true; > > @@ -765,6 +767,7 @@ static int raw_open_common(BlockDriverState *bs, > QDict *options, > >
RE: [PATCH 2/2] virtio-blk: support BLKSECDISCARD
> On Mon, Nov 15, 2021 at 12:52:00PM +0800, yadong...@intel.com wrote: > > From: Yadong Qi > > > > Add new virtio feature: VIRTIO_BLK_F_SECDISCARD. > > Add new virtio command: VIRTIO_BLK_T_SECDISCARD. > > > > This feature is disabled by default, it will check the backend > > bs->open_flags & BDRV_O_SECDISCARD, enable it if BDRV_O_SECDISCARD > > is supported. > > > > Signed-off-by: Yadong Qi > > --- > > hw/block/virtio-blk.c | 26 + > > include/standard-headers/linux/virtio_blk.h | 4 > > > Any changes to standard headers need to go to linux and virtio TC. Sure. I will draft proposal to virtio-comment for review.
RE: [PATCH 2/2] virtio-blk: support BLKSECDISCARD
> >Add new virtio feature: VIRTIO_BLK_F_SECDISCARD. > >Add new virtio command: VIRTIO_BLK_T_SECDISCARD. > > Has a proposal been sent out yet to virtio-comment mailing list for discussing > these specification changes? > Not yet. I will draft a proposal to virtio-comment if no big concern of this patch >From maintainer. > > > >diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index > >dbc4c5a3cd..7bc3484521 100644 > >--- a/hw/block/virtio-blk.c > >+++ b/hw/block/virtio-blk.c > >@@ -536,7 +536,8 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock > >*dev, } > > > > static uint8_t virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req, > >-struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes) > >+struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes, > >+bool is_secdiscard) > > Since the function now handles 3 commands, I'm thinking if it's better to pass > the command directly and then make a switch instead of using 2 booleans. > Make sense. > > { > > VirtIOBlock *s = req->dev; > > VirtIODevice *vdev = VIRTIO_DEVICE(s); @@ -577,8 +578,8 @@ static > >uint8_t virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req, > > goto err; > > } > > > >+int blk_aio_flags = 0; > > Maybe better to move it to the beginning of the function. Sure. > > > > >-blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes, 0, > >+if (is_secdiscard) { > >+blk_aio_flags |= BDRV_REQ_SECDISCARD; > >+} > >+ > >+blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes, > >+ blk_aio_flags, > > virtio_blk_discard_write_zeroes_complete, req); > > } > > > >@@ -622,6 +628,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq > *req, MultiReqBuffer *mrb) > > unsigned out_num = req->elem.out_num; > > VirtIOBlock *s = req->dev; > > VirtIODevice *vdev = VIRTIO_DEVICE(s); > >+bool is_secdiscard = false; > > > > if (req->elem.out_num < 1 || req->elem.in_num < 1) { > > virtio_error(vdev, "virtio-blk missing headers"); @@ -722,6 > >+729,9 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, > MultiReqBuffer *mrb) > > * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch > > statement, > > * so we must mask it for these requests, then we will check if it is > > set. > > */ > >+case VIRTIO_BLK_T_SECDISCARD & ~VIRTIO_BLK_T_OUT: > >+is_secdiscard = true; > >+__attribute__((fallthrough)); > > We can use QEMU_FALLTHROUGH here. Sure. > > > > > err_status = virtio_blk_handle_discard_write_zeroes(req, _hdr, > >- > >is_write_zeroes); > >+is_write_zeroes, > >+ > >+ is_secdiscard); > > > >+if (blk_get_flags(conf->conf.blk) & BDRV_O_SECDISCARD) > >+virtio_add_feature(>host_features, > >VIRTIO_BLK_F_SECDISCARD); > >+else > >+virtio_clear_feature(>host_features, > >+ VIRTIO_BLK_F_SECDISCARD); > >+ > > IIUC here we set or not the feature if BDRV_O_SECDISCARD is set. > > Should we keep it disabled if "secdiscard" is false? (e.g. to avoid migration > problems) Yes, BDRV_O_SECDISCARD=(secdiscard=="on") ? 1 : 0; > > Otherwise what is the purpose of the "secdiscard" property? I cannot find a good method to detect whether host device support BLKSECDISCARD. So I add this "secdiscard" property to explicitly enable this feature. Best Regard Yadong > > Thanks, > Stefano
Re: [PATCH for 6.2] nbd/server: Silence clang sanitizer warning
On 11/15/21 23:39, Eric Blake wrote: > clang's sanitizer is picky: memset(NULL, x, 0) is technically > undefined behavior, even though no sane implementation of memset() > deferences the NULL. Caught by the nbd-qemu-allocation iotest. > > The alternative to checking before each memset is to instead force an > allocation of 1 element instead of g_new0(type, 0)'s behavior of > returning NULL for a 0-length array. > > Reported-by: Peter Maydell > Fixes: 3b1f244c59 (nbd: Allow export of multiple bitmaps for one device) > Signed-off-by: Eric Blake > --- > nbd/server.c | 13 + > 1 file changed, 9 insertions(+), 4 deletions(-) Reviewed-by: Philippe Mathieu-Daudé
[PATCH for 6.2] nbd/server: Silence clang sanitizer warning
clang's sanitizer is picky: memset(NULL, x, 0) is technically undefined behavior, even though no sane implementation of memset() deferences the NULL. Caught by the nbd-qemu-allocation iotest. The alternative to checking before each memset is to instead force an allocation of 1 element instead of g_new0(type, 0)'s behavior of returning NULL for a 0-length array. Reported-by: Peter Maydell Fixes: 3b1f244c59 (nbd: Allow export of multiple bitmaps for one device) Signed-off-by: Eric Blake --- nbd/server.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/nbd/server.c b/nbd/server.c index 6d03e8a4b436..d9164ee6d0da 100644 --- a/nbd/server.c +++ b/nbd/server.c @@ -1,5 +1,5 @@ /* - * Copyright (C) 2016-2020 Red Hat, Inc. + * Copyright (C) 2016-2021 Red Hat, Inc. * Copyright (C) 2005 Anthony Liguori * * Network Block Device Server Side @@ -879,7 +879,9 @@ static bool nbd_meta_qemu_query(NBDClient *client, NBDExportMetaContexts *meta, if (!*query) { if (client->opt == NBD_OPT_LIST_META_CONTEXT) { meta->allocation_depth = meta->exp->allocation_depth; -memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps); +if (meta->exp->nr_export_bitmaps) { +memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps); +} } trace_nbd_negotiate_meta_query_parse("empty"); return true; @@ -894,7 +896,8 @@ static bool nbd_meta_qemu_query(NBDClient *client, NBDExportMetaContexts *meta, if (nbd_strshift(, "dirty-bitmap:")) { trace_nbd_negotiate_meta_query_parse("dirty-bitmap:"); if (!*query) { -if (client->opt == NBD_OPT_LIST_META_CONTEXT) { +if (client->opt == NBD_OPT_LIST_META_CONTEXT && +meta->exp->nr_export_bitmaps) { memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps); } trace_nbd_negotiate_meta_query_parse("empty"); @@ -1024,7 +1027,9 @@ static int nbd_negotiate_meta_queries(NBDClient *client, /* enable all known contexts */ meta->base_allocation = true; meta->allocation_depth = meta->exp->allocation_depth; -memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps); +if (meta->exp->nr_export_bitmaps) { +memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps); +} } else { for (i = 0; i < nb_queries; ++i) { ret = nbd_negotiate_meta_query(client, meta, errp); -- 2.33.1
Re: "make check" fails in a clang sanitizer build on "nbd-qemu-allocation" iotest
On Mon, Nov 15, 2021 at 05:11:54PM +, Peter Maydell wrote: > Hi; running a 'make check' on a clang sanitizer build one of > the iotests falls over due to a NULL pointer being passed to > memset(): > > > TEST iotest-qcow2: nbd-qemu-allocation [fail] > +../../nbd/server.c:1027:16: runtime error: null pointer passed as > argument 1, which is declared to never be null The code in question: if (client->opt == NBD_OPT_LIST_META_CONTEXT && !nb_queries) { /* enable all known contexts */ meta->base_allocation = true; meta->allocation_depth = meta->exp->allocation_depth; memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps); I suspect what is happening is that meta->bitmaps is NULL when meta->exp->nr_export_bitmaps is 0. It's annoying that clang's sanitizer whines even for a 0-length memset, but a strict reading of POSIX says that we really are in the technically undefined behavior when passing NULL (even with 0 length), so such whiny behavior is permitted. So I'll post a patch. > > Does this look familiar ? First I've heard of it; thanks for alerting me. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()
On 11/15/21 16:57, Markus Armbruster wrote: > Philippe Mathieu-Daudé writes: >> On 11/15/21 13:55, Markus Armbruster wrote: >>> drive_get_next() is basically a bad idea. It returns the "next" block >>> backend of a certain interface type. "Next" means bus=0,unit=N, where >>> subsequent calls count N up from zero, per interface type. >>> >>> This lets you define unit numbers implicitly by execution order. If the >>> order changes, or new calls appear "in the middle", unit numbers change. >>> ABI break. Hard to spot in review. >>> >>> Explicit is better than implicit: use drive_get() directly. >>> >>> Signed-off-by: Markus Armbruster >>> --- >>> @@ -435,11 +438,13 @@ static void aspeed_machine_init(MachineState *machine) >>> } >>> >>> for (i = 0; i < bmc->soc.sdhci.num_slots; i++) { >>> -sdhci_attach_drive(>soc.sdhci.slots[i], >>> drive_get_next(IF_SD)); >>> +sdhci_attach_drive(>soc.sdhci.slots[i], >>> + drive_get(IF_SD, 0, i)); >> >> If we put SD on bus #0, ... >> >>> } >>> >>> if (bmc->soc.emmc.num_slots) { >>> -sdhci_attach_drive(>soc.emmc.slots[0], drive_get_next(IF_SD)); >>> +sdhci_attach_drive(>soc.emmc.slots[0], >>> + drive_get(IF_SD, 0, bmc->soc.sdhci.num_slots)); >> >> ... we'd want to put eMMC on bus #1 > > Using separate buses for different kinds of devices would be neater, but > it also would be an incompatible change. This patch keeps existing > bus/unit numbers working. drive_get_next() can only use bus 0. > >> but I see having eMMC cards on a >> IF_SD bus as a bug, since these cards are soldered on the board. > > IF_SD is not a bus, it's an "block interface type", which is really just > a user interface thing. Why are we discriminating by "block interface type" then? What is the difference between "block interfaces"? I see a block drive as a generic unit, usable on multiple hardware devices. I never really understood how this "block interface type" helps developers and users. I thought BlockInterfaceType and DriveInfo were legacy / deprecated APIs we want to get rid of; and we would come up with a replacement API using BlockDeviceInfo or providing a BlockFrontend state of the art object. Anyway, I suppose the explanation is buried in the git history before the last 8 years. I need to keep reading.
Re: [PULL 00/13] Block layer patches
Hi Kevin, On 11/15/21 15:53, Kevin Wolf wrote: > The following changes since commit 42f6c9179be4401974dd3a75ee72defd16b5092d: > > Merge tag 'pull-ppc-2022' of https://github.com/legoater/qemu into > staging (2021-11-12 12:28:25 +0100) > > are available in the Git repository at: > > git://repo.or.cz/qemu/kevin.git tags/for-upstream > > for you to fetch changes up to 7461272c5f6032436ef9032c091c0118539483e4: > > softmmu/qdev-monitor: fix use-after-free in qdev_set_id() (2021-11-15 > 15:49:46 +0100) > > > Block layer patches > > - Fixes to image streaming job and block layer reconfiguration to make > iotest 030 pass again > - docs: Deprecate incorrectly typed device_add arguments > - file-posix: Fix alignment after reopen changing O_DIRECT > > > Hanna Reitz (10): > stream: Traverse graph after modification > block: Manipulate children list in .attach/.detach > block: Unite remove_empty_child and child_free > block: Drop detached child from ignore list > block: Pass BdrvChild ** to replace_child_noperm > block: Restructure remove_file_or_backing_child() > transactions: Invoke clean() after everything else > block: Let replace_child_tran keep indirect pointer > block: Let replace_child_noperm free children > iotests/030: Unthrottle parallel jobs in reverse > > Kevin Wolf (2): > docs: Deprecate incorrectly typed device_add arguments > file-posix: Fix alignment after reopen changing O_DIRECT > > Stefan Hajnoczi (1): > softmmu/qdev-monitor: fix use-after-free in qdev_set_id() > > docs/about/deprecated.rst | 14 +++ > include/qemu/transactions.h | 3 + > block.c | 233 > +--- > block/file-posix.c | 20 +++- > block/stream.c | 7 +- > softmmu/qdev-monitor.c | 2 +- > util/transactions.c | 8 +- > tests/qemu-iotests/030 | 11 ++- > tests/qemu-iotests/142 | 22 + > tests/qemu-iotests/142.out | 15 +++ > 10 files changed, 269 insertions(+), 66 deletions(-) Looking at current /staging I noticed iotest#142 failed, build-tcg-disabled job: +++ 142.out.bad @@ -750,6 +750,7 @@ --- Alignment after changing O_DIRECT --- +qemu-io: Cannot get 'write' permission without 'resize': Image size is not a multiple of request alignment read 42/42 bytes at offset 42 42 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) read 42/42 bytes at offset 42 https://gitlab.com/qemu-project/qemu/-/jobs/1784955950#L2794
Re: [PULL 00/13] Block layer patches
On 11/15/21 3:53 PM, Kevin Wolf wrote: The following changes since commit 42f6c9179be4401974dd3a75ee72defd16b5092d: Merge tag 'pull-ppc-2022' of https://github.com/legoater/qemu into staging (2021-11-12 12:28:25 +0100) are available in the Git repository at: git://repo.or.cz/qemu/kevin.git tags/for-upstream for you to fetch changes up to 7461272c5f6032436ef9032c091c0118539483e4: softmmu/qdev-monitor: fix use-after-free in qdev_set_id() (2021-11-15 15:49:46 +0100) Block layer patches - Fixes to image streaming job and block layer reconfiguration to make iotest 030 pass again - docs: Deprecate incorrectly typed device_add arguments - file-posix: Fix alignment after reopen changing O_DIRECT Hanna Reitz (10): stream: Traverse graph after modification block: Manipulate children list in .attach/.detach block: Unite remove_empty_child and child_free block: Drop detached child from ignore list block: Pass BdrvChild ** to replace_child_noperm block: Restructure remove_file_or_backing_child() transactions: Invoke clean() after everything else block: Let replace_child_tran keep indirect pointer block: Let replace_child_noperm free children iotests/030: Unthrottle parallel jobs in reverse Kevin Wolf (2): docs: Deprecate incorrectly typed device_add arguments file-posix: Fix alignment after reopen changing O_DIRECT Stefan Hajnoczi (1): softmmu/qdev-monitor: fix use-after-free in qdev_set_id() docs/about/deprecated.rst | 14 +++ include/qemu/transactions.h | 3 + block.c | 233 +--- block/file-posix.c | 20 +++- block/stream.c | 7 +- softmmu/qdev-monitor.c | 2 +- util/transactions.c | 8 +- tests/qemu-iotests/030 | 11 ++- tests/qemu-iotests/142 | 22 + tests/qemu-iotests/142.out | 15 +++ 10 files changed, 269 insertions(+), 66 deletions(-) This is failing iotest 142 for build-tcg-disabled. I did retry, in case it was transitory. https://gitlab.com/qemu-project/qemu/-/jobs/1784955950 r~
[PATCH for 6.2 v4] nbd/server: Add --selinux-label option
From: "Richard W.M. Jones" Under SELinux, Unix domain sockets have two labels. One is on the disk and can be set with commands such as chcon(1). There is a different label stored in memory (called the process label). This can only be set by the process creating the socket. When using SELinux + SVirt and wanting qemu to be able to connect to a qemu-nbd instance, you must set both labels correctly first. For qemu-nbd the options to set the second label are awkward. You can create the socket in a wrapper program and then exec into qemu-nbd. Or you could try something with LD_PRELOAD. This commit adds the ability to set the label straightforwardly on the command line, via the new --selinux-label flag. (The name of the flag is the same as the equivalent nbdkit option.) A worked example showing how to use the new option can be found in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1984938 Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1984938 Signed-off-by: Richard W.M. Jones Reviewed-by: Daniel P. Berrangé Message-Id: <20210930084701.3899578-1-rjo...@redhat.com> [eblake: rebase to configure changes, reject --selinux-label if it is not compiled in or not used on a Unix socket] Signed-off-by: Eric Blake --- v3 was here: https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg07677.html since then: another rework of the configure logic (now that meson-buildoptions.sh exists), and a rework of the error reporting (for now, loudly complain on all unsupported attempts at labeling. We may later allow things that currently fail) Candidate for 6.2 in spite of soft freeze because of earlier attempted pull request here: https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg07081.html and my apologies for letting this slip a month without action. meson.build | 10 +++- qemu-nbd.c| 46 +++ meson_options.txt | 3 ++ scripts/meson-buildoptions.sh | 3 ++ tests/docker/dockerfiles/centos8.docker | 1 + .../dockerfiles/fedora-i386-cross.docker | 1 + tests/docker/dockerfiles/fedora.docker| 1 + tests/docker/dockerfiles/opensuse-leap.docker | 1 + tests/docker/dockerfiles/ubuntu1804.docker| 1 + tests/docker/dockerfiles/ubuntu2004.docker| 1 + 10 files changed, 67 insertions(+), 1 deletion(-) diff --git a/meson.build b/meson.build index 2ece4fe0889a..73b7deaa0ef3 100644 --- a/meson.build +++ b/meson.build @@ -1201,6 +1201,11 @@ keyutils = dependency('libkeyutils', required: false, has_gettid = cc.has_function('gettid') +# libselinux +selinux = dependency('libselinux', + required: get_option('selinux'), + method: 'pkg-config', kwargs: static_kwargs) + # Malloc tests malloc = [] @@ -1479,6 +1484,7 @@ config_host_data.set('CONFIG_SPICE_PROTOCOL', spice_protocol.found()) config_host_data.set('CONFIG_SPICE', spice.found()) config_host_data.set('CONFIG_X11', x11.found()) config_host_data.set('CONFIG_CFI', get_option('cfi')) +config_host_data.set('CONFIG_SELINUX', selinux.found()) config_host_data.set('QEMU_VERSION', '"@0@"'.format(meson.project_version())) config_host_data.set('QEMU_VERSION_MAJOR', meson.project_version().split('.')[0]) config_host_data.set('QEMU_VERSION_MINOR', meson.project_version().split('.')[1]) @@ -3054,7 +3060,8 @@ if have_tools qemu_io = executable('qemu-io', files('qemu-io.c'), dependencies: [block, qemuutil], install: true) qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'), - dependencies: [blockdev, qemuutil, gnutls], install: true) + dependencies: [blockdev, qemuutil, gnutls, selinux], + install: true) subdir('storage-daemon') subdir('contrib/rdmacm-mux') @@ -3430,6 +3437,7 @@ summary_info += {'libdaxctl support': libdaxctl} summary_info += {'libudev': libudev} # Dummy dependency, keep .found() summary_info += {'FUSE lseek':fuse_lseek.found()} +summary_info += {'selinux': selinux.found()} summary(summary_info, bool_yn: true, section: 'Dependencies') if not supported_cpus.contains(cpu) diff --git a/qemu-nbd.c b/qemu-nbd.c index 9d895ba24b1e..c6c20df68a4d 100644 --- a/qemu-nbd.c +++ b/qemu-nbd.c @@ -47,6 +47,10 @@ #include "trace/control.h" #include "qemu-version.h" +#ifdef CONFIG_SELINUX +#include +#endif + #ifdef __linux__ #define HAVE_NBD_DEVICE 1 #else @@ -64,6 +68,7 @@ #define QEMU_NBD_OPT_FORK 263 #define QEMU_NBD_OPT_TLSAUTHZ 264 #define QEMU_NBD_OPT_PID_FILE 265 +#define QEMU_NBD_OPT_SELINUX_LABEL 266 #define MBR_SIZE 512 @@ -116,6 +121,9 @@ static void usage(const char *name) " --forkfork off the server process and exit the parent\n" "once the server is running\n" " --pid-file=PATH store the server's process ID in the given file\n" +#ifdef
Re: [PATCH 0/2] hw/nvme: fix namespace attachment on controller hotplug
On Sep 24 09:27, Klaus Jensen wrote: > From: Klaus Jensen > > First patch from Hannes fixes the subsystem registration process such > that shared (but non-detached) namespaces are automatically attached to > hotplugged controllers. > > The second patch changes the default for 'shared' such that namespaces > are shared by default and will thus by default be attached to hotplugged > controllers. This adds a compat property for older machine versions and > updates the documentation to reflect this. > > Hannes Reinecke (1): > hw/nvme: reattach subsystem namespaces on hotplug > > Klaus Jensen (1): > hw/nvme: change nvme-ns 'shared' default > > docs/system/devices/nvme.rst | 24 ++-- > hw/core/machine.c| 4 +++- > hw/nvme/ns.c | 8 +--- > hw/nvme/subsys.c | 10 +- > 4 files changed, 27 insertions(+), 19 deletions(-) > Ping :) signature.asc Description: PGP signature
"make check" fails in a clang sanitizer build on "nbd-qemu-allocation" iotest
Hi; running a 'make check' on a clang sanitizer build one of the iotests falls over due to a NULL pointer being passed to memset(): TEST iotest-qcow2: nbd-qemu-allocation [fail] QEMU -- "/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/qemu-iotests/../../qemu-system-aarch64" -nodefaults -display none -accel qtest -machine virt QEMU_IMG -- "/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/qemu-iotests/../../qemu-img" QEMU_IO -- "/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/qemu-iotests/../../qemu-io" --cache writeback --aio threads -f qcow2 QEMU_NBD -- "/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/qemu-iotests/../../qemu-nbd" IMGFMT-- qcow2 IMGPROTO -- file PLATFORM -- Linux/x86_64 e104462 5.4.0-89-generic TEST_DIR -- /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/qemu-iotests/scratch SOCK_DIR -- /tmp/tmp13ihi_hj GDB_OPTIONS -- VALGRIND_QEMU -- PRINT_QEMU_OUTPUT -- --- /home/petmay01/linaro/qemu-from-laptop/qemu/tests/qemu-iotests/tests/nbd-qemu-allocation.out +++ nbd-qemu-allocation.out.bad @@ -14,6 +14,8 @@ [{ "start": 0, "length": 1048576, "depth": 1, "present": true, "zero": false, "data": true, "offset": 327680}, { "start": 1048576, "length": 2097152, "depth": 0, "present": true, "zero": false, "data": true, "offset": 327680}, { "start": 3145728, "length": 1048576, "depth": 1, "present": false, "zero": true, "data": false}] +../../nbd/server.c:1027:16: runtime error: null pointer passed as argument 1, which is declared to never be null +/usr/include/string.h:61:62: note: nonnull attribute specified here exports available: 1 export: '' size: 4194304 TEST iotest-qcow2: qsd-jobs Not run: 172 186 192 220 287 Failures: nbd-qemu-allocation Failed 1 of 118 iotests Does this look familiar ? -- PMM
Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
On Mon, Nov 15, 2021 at 05:03:28PM +0100, Hanna Reitz wrote: > On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote: > > Currently, block layer APIs like block-backend.h contain a mix of > > functions that are either running in the main loop and under the > > BQL, or are thread-safe functions and run in iothreads performing I/O. > > The functions running under BQL also take care of modifying the > > block graph, by using drain and/or aio_context_acquire/release. > > This makes it very confusing to understand where each function > > runs, and what assumptions it provided with regards to thread > > safety. > > > > We call the functions running under BQL "global state (GS) API", and > > distinguish them from the thread-safe "I/O API". > > > > The aim of this series is to split the relevant block headers in > > global state and I/O sub-headers. > > Despite leaving quite some comments, the series and the split seem > reasonable to me overall. (This is a pretty big series, after all, so those > “some comments” stack up against a majority of changes that seem OK to me. > :)) > > One thing I noticed while reviewing is that it’s really hard to verify that > no I/O function calls a GS function. What would be wonderful is some > function marker like coroutine_fn that marks GS functions (or I/O functions) > and that we could then verify the call paths. But AFAIU we’ve always wanted > precisely that for coroutine_fn and still don’t have it, so this seems like > extremely wishful thinking... :( Even if we don't programmatically verify these rules, it would be a major step forward if we simply adopted a standard naming convention for the APIs that was essentally self-describing. eg block_io_XXX for all I/O stuff and block_state_ for all global state ,and block_common if we have stuff applicable to both use cases. I wouldn't suggest doing it as part of this series, but I think it'd be worthwhile in general for the medium-long term, despite the code churn in updating all usage in the short term. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: does drive_get_next(IF_NONE) make sense?
On 15/11/2021 08.12, Alistair Francis wrote: On Mon, Nov 15, 2021 at 3:32 PM Markus Armbruster wrote: Peter Maydell writes: On Fri, 12 Nov 2021 at 13:34, Markus Armbruster wrote: Thomas Huth writes: On 03/11/2021 09.41, Markus Armbruster wrote: Peter Maydell writes: Does it make sense for a device/board to do drive_get_next(IF_NONE) ? Short answer: hell, no! ;) Would it make sense to add an "assert(type != IF_NONE)" to drive_get() to avoid such mistakes in the future? Worth a try. You need to fix the sifive_u_otp device first :-) And for that, we may want Hao Wu's "[PATCH v4 5/7] blockdev: Add a new IF type IF_OTHER" first. I can fixup sifive_u_otp, just let me know what the prefered solution is What kind of device is that OTP exactly? If it is some kind of non-serial flash device, maybe you could simply use IF_PFLASH instead? Thomas
Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote: Currently, block layer APIs like block-backend.h contain a mix of functions that are either running in the main loop and under the BQL, or are thread-safe functions and run in iothreads performing I/O. The functions running under BQL also take care of modifying the block graph, by using drain and/or aio_context_acquire/release. This makes it very confusing to understand where each function runs, and what assumptions it provided with regards to thread safety. We call the functions running under BQL "global state (GS) API", and distinguish them from the thread-safe "I/O API". The aim of this series is to split the relevant block headers in global state and I/O sub-headers. Despite leaving quite some comments, the series and the split seem reasonable to me overall. (This is a pretty big series, after all, so those “some comments” stack up against a majority of changes that seem OK to me. :)) One thing I noticed while reviewing is that it’s really hard to verify that no I/O function calls a GS function. What would be wonderful is some function marker like coroutine_fn that marks GS functions (or I/O functions) and that we could then verify the call paths. But AFAIU we’ve always wanted precisely that for coroutine_fn and still don’t have it, so this seems like extremely wishful thinking... :( I think most of the issues I found can be fixed (or are even irrelevant), the only thing that really worries me are the two places that are clearly I/O paths that call permission functions: Namely first block_crypto_amend_options_generic_luks() (part of the luks block driver’s .bdrv_co_amend implementation), which calls bdrv_child_refresh_perms(); and second fuse_do_truncate(), which calls blk_set_perm(). In the first case, we need this call so that we don’t permanently hog the WRITE permission for the luks file, which used to be a problem, I believe. We want to unshare the WRITE permission (and apparently also CONSISTENT_READ) during the key update, so we need some way to temporarily update the permissions. I only really see four solutions for this: (1) We somehow make the amend job run in the main context under the BQL and have it prevent all concurrent I/O access (seems bad) (2) We can make the permission functions part of the I/O path (seems wrong and probably impossible?) (3) We can drop the permissions update and permanently require the permissions that we need when updating keys (I think this might break existing use cases) (4) We can acquire the BQL around the permission update call and perhaps that works? I don’t know how (4) would work but it’s basically the only reasonable solution I can come up with. Would this be a way to call a BQL function from an I/O function? As for the second case, the same applies as above, with the differences that we have no jobs, so this code must always run in the block device’s AioContext (I think), which rules out (1); but (3) would become easier (i.e. require the RESIZE permission all the time), although that too might have an impact on existing users (don’t think so, though). In any case, if we could do (4), that would solve the problem here, too. And finally, another notable thing I noticed is that the way how create-related functions are handled is inconsistent. I believe they should all be GS functions; qmp_blockdev_create() seems to agree with me on this, but we currently seem to have some bugs there. It’s possible to invoke blockdev-create on a block device that’s in an I/O thread, and then qemu crashes. Oops. (The comment in qmp_blockdev_create() says that the block drivers’ implementations should prevent this, but apparently they don’t...?) In any case, that’s a pre-existing bug, of course, that doesn’t concern this series (other than that it suggests that “create” functions should be classified as GS). Hanna
Re: [PATCH RFC 0/2] Eliminate drive_get_next()
Peter Maydell writes: > On Mon, 15 Nov 2021 at 12:55, Markus Armbruster wrote: >> >> This is RFC because I'm unsure about the removal of >> >> /* Reason: init() method uses drive_get_next() */ >> dc->user_creatable = false; >> >> in PATCH 1. Both users appear to wire up some GPIO. If that's >> necessary for the thing to work, we should just replace the comment. > > Looking at the code, it sort of is and sort of isn't. The GPIO line > is the chip-select line. If you don't connect it then (because the > ssi-sd device configures cs_polarity to SSI_CS_LOW, requesting an > active-low chip-select) the device will always be selected. If > the machine created an SSI bus with no SSI device attached to it > then in theory the user could create an ssi-sd device and connect > it there and have it work. But in practice it's really unlikely: > machines create SSI buses with specific wired-in devices on them, > and the guest OS knows what it has to do to enable the chip select > for the device it wants to talk to (often some known GPIO pin on > a GPIO controller). > > So I would keep the user_creatable = false, with a reason of > "user should wire up GPIO chip-select line". But see below for I'll do it this way. > a pile of contrary precedent. > > (The chip-select GPIO is created in the parent class, incidentally.) > >> Aside: there may be devices that need manual wiring to work, yet don't >> have user_creatable unset. Bugs if you ask me. I don't have smart >> ideas on how to track them down. > > Me neither. I notice that the TYPE_M25P80 is also an SSI peripheral > with an active-low chipselect but that one doesn't set user_creatable > to false. TYPE_SSD0323 also is user-creatable and that one has an > active-high chipselect, which means the user can create a device but > it will then never be usable because it will ignore all transactions. > (More generally, looks like most subclasses of TYPE_SSI_PERIPHERAL > don't set user_creatable = false.) For sysbus devices, we clear user_creatable by default, because sysbus devices pretty much always[*] need wiring. Is this the case for SSI bus devices, too? [*] The most prominent exception is "dynamic sysbus", which I believe was a mistake.
Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()
Philippe Mathieu-Daudé writes: > On 11/15/21 13:55, Markus Armbruster wrote: >> drive_get_next() is basically a bad idea. It returns the "next" block >> backend of a certain interface type. "Next" means bus=0,unit=N, where >> subsequent calls count N up from zero, per interface type. >> >> This lets you define unit numbers implicitly by execution order. If the >> order changes, or new calls appear "in the middle", unit numbers change. >> ABI break. Hard to spot in review. >> >> Explicit is better than implicit: use drive_get() directly. >> >> Signed-off-by: Markus Armbruster >> --- >> include/sysemu/blockdev.h | 1 - >> blockdev.c | 10 -- >> hw/arm/aspeed.c | 21 + >> hw/arm/cubieboard.c | 2 +- >> hw/arm/imx25_pdk.c | 2 +- >> hw/arm/integratorcp.c | 2 +- >> hw/arm/mcimx6ul-evk.c | 2 +- >> hw/arm/mcimx7d-sabre.c | 2 +- >> hw/arm/msf2-som.c | 2 +- >> hw/arm/npcm7xx_boards.c | 6 +++--- >> hw/arm/orangepi.c | 2 +- >> hw/arm/raspi.c | 2 +- >> hw/arm/realview.c | 2 +- >> hw/arm/sabrelite.c | 2 +- >> hw/arm/versatilepb.c| 4 ++-- >> hw/arm/vexpress.c | 6 +++--- >> hw/arm/xilinx_zynq.c| 16 +--- >> hw/arm/xlnx-versal-virt.c | 3 ++- >> hw/arm/xlnx-zcu102.c| 6 +++--- >> hw/microblaze/petalogix_ml605_mmu.c | 2 +- >> hw/misc/sifive_u_otp.c | 2 +- >> hw/riscv/microchip_pfsoc.c | 2 +- >> hw/sparc64/niagara.c| 2 +- >> 23 files changed, 49 insertions(+), 52 deletions(-) > >> @@ -435,11 +438,13 @@ static void aspeed_machine_init(MachineState *machine) >> } >> >> for (i = 0; i < bmc->soc.sdhci.num_slots; i++) { >> -sdhci_attach_drive(>soc.sdhci.slots[i], drive_get_next(IF_SD)); >> +sdhci_attach_drive(>soc.sdhci.slots[i], >> + drive_get(IF_SD, 0, i)); > > If we put SD on bus #0, ... > >> } >> >> if (bmc->soc.emmc.num_slots) { >> -sdhci_attach_drive(>soc.emmc.slots[0], drive_get_next(IF_SD)); >> +sdhci_attach_drive(>soc.emmc.slots[0], >> + drive_get(IF_SD, 0, bmc->soc.sdhci.num_slots)); > > ... we'd want to put eMMC on bus #1 Using separate buses for different kinds of devices would be neater, but it also would be an incompatible change. This patch keeps existing bus/unit numbers working. drive_get_next() can only use bus 0. > but I see having eMMC cards on a > IF_SD bus as a bug, since these cards are soldered on the board. IF_SD is not a bus, it's an "block interface type", which is really just a user interface thing. >> --- a/hw/arm/vexpress.c >> +++ b/hw/arm/vexpress.c >> @@ -625,7 +625,7 @@ static void vexpress_common_init(MachineState *machine) >>qdev_get_gpio_in(sysctl, >> ARM_SYSCTL_GPIO_MMC_WPROT)); >> qdev_connect_gpio_out_named(dev, "card-inserted", 0, >>qdev_get_gpio_in(sysctl, >> ARM_SYSCTL_GPIO_MMC_CARDIN)); >> -dinfo = drive_get_next(IF_SD); >> +dinfo = drive_get(IF_SD, 0, 0); > > Can we have one interface refactor per patch (IF_SD, IF_PFLASH, IF_MTD...)? Peter asked for one patch per "board/SoC model". I'll do whatever helps reviewers. >> @@ -657,7 +657,7 @@ static void vexpress_common_init(MachineState *machine) >> >> sysbus_create_simple("pl111", map[VE_CLCD], pic[14]); >> >> -dinfo = drive_get_next(IF_PFLASH); >> +dinfo = drive_get(IF_PFLASH, 0, 0); > >> -static inline void zynq_init_spi_flashes(uint32_t base_addr, qemu_irq irq, >> - bool is_qspi) >> +static inline int zynq_init_spi_flashes(uint32_t base_addr, qemu_irq irq, >> +bool is_qspi, int unit0) >> { >> +int unit = unit0; >> DeviceState *dev; >> SysBusDevice *busdev; >> SSIBus *spi; >> @@ -156,7 +157,7 @@ static inline void zynq_init_spi_flashes(uint32_t >> base_addr, qemu_irq irq, >> spi = (SSIBus *)qdev_get_child_bus(dev, bus_name); >> >> for (j = 0; j < num_ss; ++j) { >> -DriveInfo *dinfo = drive_get_next(IF_MTD); >> +DriveInfo *dinfo = drive_get(IF_MTD, 0, unit++); > >> diff --git a/hw/arm/xlnx-zcu102.c b/hw/arm/xlnx-zcu102.c >> index 3dc2b5e8ca..45eb19ab3b 100644 >> --- a/hw/arm/xlnx-zcu102.c >> +++ b/hw/arm/xlnx-zcu102.c >> @@ -190,7 +190,7 @@ static void xlnx_zcu102_init(MachineState *machine) >> BusState *spi_bus; >> DeviceState *flash_dev; >> qemu_irq cs_line; >> -DriveInfo *dinfo = drive_get_next(IF_MTD); >> +DriveInfo *dinfo = drive_get(IF_MTD, 0, i); > > If this is bus #0, ... > >>
On configuring onboard block devices with -blockdev (was: [PATCH v4 6/7] hw/nvram: Update at24c EEPROM init function in NPCM7xx boards)
Kevin Wolf writes: > Am 03.11.2021 um 23:01 hat Hao Wu geschrieben: >> We made 3 changes to the at24c_eeprom_init function in >> npcm7xx_boards.c: >> >> 1. We allow the function to take a I2CBus* as parameter. This allows >>us to attach an EEPROM device behind an I2C mux which is not >>possible with the old method. >> >> 2. We make at24c EEPROMs are backed by drives so that we can >>specify the content of the EEPROMs. >> >> 3. Instead of using i2c address as unit number, This patch assigns >>unique unit numbers for each eeproms in each board. This avoids >>conflict in providing multiple eeprom contents with the same address. >>In the old method if we specify two drives with the same unit number, >>the following error will occur: `Device with id 'none85' exists`. >> >> Signed-off-by: Hao Wu >> --- >> hw/arm/npcm7xx_boards.c | 14 +- >> 1 file changed, 9 insertions(+), 5 deletions(-) >> >> diff --git a/hw/arm/npcm7xx_boards.c b/hw/arm/npcm7xx_boards.c >> index dec7d16ae5..9121e081fa 100644 >> --- a/hw/arm/npcm7xx_boards.c >> +++ b/hw/arm/npcm7xx_boards.c >> @@ -126,13 +126,17 @@ static I2CBus *npcm7xx_i2c_get_bus(NPCM7xxState *soc, >> uint32_t num) >> return I2C_BUS(qdev_get_child_bus(DEVICE(>smbus[num]), "i2c-bus")); >> } >> >> -static void at24c_eeprom_init(NPCM7xxState *soc, int bus, uint8_t addr, >> - uint32_t rsize) >> +static void at24c_eeprom_init(I2CBus *i2c_bus, int bus, uint8_t addr, >> + uint32_t rsize, int unit_number) >> { >> -I2CBus *i2c_bus = npcm7xx_i2c_get_bus(soc, bus); >> I2CSlave *i2c_dev = i2c_slave_new("at24c-eeprom", addr); >> DeviceState *dev = DEVICE(i2c_dev); >> +DriveInfo *dinfo; >> >> +dinfo = drive_get(IF_OTHER, bus, unit_number); >> +if (dinfo) { >> +qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo)); >> +} > > I may be missing the history of this series, but why do we have to use a > legacy DriveInfo here instead of adding a drive property to the board to > make this configurable with -blockdev? > > Adding even more devices that can only be configured with -drive feels > like a step backwards to me. More like sideways. The big unfinished part of -blockdev is configuring onboard devices with it. I took a stab at one instance in commit ebc29e1bea "pc: Support firmware configuration with -blockdev", 2019-03-11. Quoting from the commit message: Mapping -drive if=none,... to -blockdev is a solved problem. With if=T other than if=none, -drive additionally configures a block device frontend. For non-onboard devices, that part maps to -device. Also a solved problem. For onboard devices such as PC flash memory, we have an unsolved problem. This is actually an instance of a wider problem: our general device configuration interface doesn't cover onboard devices. Instead, we have a zoo of ad hoc interfaces that are much more limited. One of them is -drive, which we'd rather deprecate, but can't until we have suitable replacements for all its uses. Sadly, I can't attack the wider problem today. So back to the narrow problem. My first idea was to reduce it to its solved buddy by using pluggable instead of onboard devices for the flash memory. Workable, but it requires some extra smarts in firmware descriptors and libvirt. Paolo had an idea that is simpler for libvirt: keep the devices onboard, and add machine properties for their block backends. The implementation is less than straightforward, I'm afraid. First, block backend properties are *qdev* properties. Machines can't have those, as they're not devices. I could duplicate these qdev properties as QOM properties, but I hate that. More seriously, the properties do not belong to the machine, they belong to the onboard flash devices. Adding them to the machine would then require bad magic to somehow transfer them to the flash devices. Fortunately, QOM provides the means to handle exactly this case: add alias properties to the machine that forward to the onboard devices' properties. Properties need to be created in .instance_init() methods. For PC machines, that's pc_machine_initfn(). To make alias properties work, we need to create the onboard flash devices there, too. Requires several bug fixes, in the previous commits. We also have to realize the devices. More on that below. The need to create onboard devices differently results in a non-trivial refactoring. The need for keeping -drive working complicates things further. The "several bug fixes" took me 26 preparatory patches, plus at least three more to fix regressions caused by one of them. I then did the same for ARM virt in commit e0561e60f1 "hw/arm/virt: Support firmware configuration with -blockdev", 2019-05-07. Only a few preparatory patches, but the patch
Re: does drive_get_next(IF_NONE) make sense?
Peter Maydell writes: > On Mon, 15 Nov 2021 at 13:24, Kevin Wolf wrote: >> Same question as for Hao Wu's series: Wouldn't the proper solution be to >> add a drive property to the machine type? >> >> If you can't use -blockdev, it's not done right. > > Is there an example of "doing it right" for built-in-to-the-machine > devices? > > (My experience with the new-style options is that almost > always they're designed for x86 where the device they're attached > to is also created on the command line, and then handling of boards > where the device is builtin is either an afterthought or forgotten. > See also -netdev, where it took forever for built-in-ethernet to > be usable.) I'm afraid the situation for onboard block devices is far worse than it ever was for NICs. See my reply "On configuring onboard block devices with -blockdev" to Kevin's other message on the topic. To be fair, improving onboard device configuration is *hard*. Our general device configuration interface doesn't cover them, and we've so far failed finding a general solution. Without one, we're drowning in the sheer number of boards and onboard devices. Which is ever growing.
Re: [PATCH v4 03/25] assertions for block global state API
On 15.11.21 13:27, Emanuele Giuseppe Esposito wrote: @@ -586,12 +589,14 @@ void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent) void coroutine_fn bdrv_co_drain(BlockDriverState *bs) { assert(qemu_in_coroutine()); + assert(qemu_in_main_thread()); bdrv_drained_begin(bs); bdrv_drained_end(bs); } void bdrv_drain(BlockDriverState *bs) { + assert(qemu_in_main_thread()); bdrv_drained_begin(bs); bdrv_drained_end(bs); } Why are these GS functions when both bdrv_drained_begin() and bdrv_drained_end() are I/O functions? I can understand making the drain_all functions GS functions, but it seems weird to say it’s an I/O function when a single BDS is drained via bdrv_drained_begin() and bdrv_drained_end(), but not via bdrv_drain(), which just does both. (I can see that there are no I/O path callers, but I still find it strange.) The problem as you saw is on the path callers: bdrv_drain seems to be called only in main thread, while others or similar drains are also coming from I/O. I would say maybe it's a better idea to just put everything I/O, maybe (probably) there are cases not caught by the tests. The only ones in global-state will be: - bdrv_apply_subtree_drain and bdrv_unapply_subtree_drain (called only in .attach and .detach callbacks of BdrvChildClass, run under BQL). - bdrv_drain_all_end_quiesce (called only by bdrv_close thus under BQL). - bdrv_drain_all_begin, bdrv_drain_all_end and bdrv_drain_all (have already BQL assertions) In general, this is the underlying problem with these assertions: if a test doesn't test a specific code path, an unneeded assertion might pass undetected. I believe the I/O vs. GS classification should not only be done based on functional concerns, i.e. who calls this function (is it called by an I/O function?) and whether it can be thread-safe or not (does it call a BQL function?), but also needs to be done semantically. I believe there are some functions that we consider thread-safe but are also only called by BQL functions, for example apparently bdrv_drain(), bdrv_apply_subtree_drain(), and bdrv_unapply_subtree_drain(). Such functions can functionally be considered both GS and I/O functions, so the decision should be done semantically. I believe that it makes sense to say all drain-related functions for a single subtree are I/O functions. OTOH, functions operating on multiple trees in the block graph (i.e. all functions that have some “_all_” in their name) should naturally be considered GS functions, regardless of whether their implementation is thread-safe or not, because they are by nature global functions. That’s why I’m wondering why some drain functions are I/O and others are GS; or why this block-status function is I/O and this one is GS. It may make sense functionally, but semantically it’s strange. I understand it may be difficult for you to know which functions relate to each other and thus know these semantic relationships, but in most of the series the split seems very reasonable to me, semantically. If it didn’t, I said so. :) Hanna
Re: [PATCH v4 24/25] job.h: split function pointers in JobDriver
On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote: The job API will be handled separately in another serie. Signed-off-by: Emanuele Giuseppe Esposito Reviewed-by: Stefan Hajnoczi --- include/qemu/job.h | 16 1 file changed, 16 insertions(+) diff --git a/include/qemu/job.h b/include/qemu/job.h index 6e67b6977f..7e9e59f4b8 100644 --- a/include/qemu/job.h +++ b/include/qemu/job.h @@ -169,12 +169,21 @@ typedef struct Job { * Callbacks and other information about a Job driver. */ struct JobDriver { + +/* Fields initialized in struct definition and never changed. */ Like in patch 19, I’d prefer a slightly more verbose comment that I’d find more easily readable. + /** Derived Job struct size */ size_t instance_size; /** Enum describing the operation */ JobType job_type; +/* + * Functions run without regard to the BQL and may run in any s/and/that/? + * arbitrary thread. These functions do not need to be thread-safe + * because the caller ensures that are invoked from one thread at time. s/that/they/ (or “that they”) I believe .run() must be run in the job’s context, though. Not sure if that’s necessary to note, but it isn’t really an arbitrary thread, and block jobs certainly require this (because they run in the block device’s context). Or is that something that’s going to change with I/O threading? Hanna
Re: does drive_get_next(IF_NONE) make sense?
Am 15.11.2021 um 14:31 hat Peter Maydell geschrieben: > On Mon, 15 Nov 2021 at 13:24, Kevin Wolf wrote: > > Same question as for Hao Wu's series: Wouldn't the proper solution be to > > add a drive property to the machine type? > > > > If you can't use -blockdev, it's not done right. > > Is there an example of "doing it right" for built-in-to-the-machine > devices? > > (My experience with the new-style options is that almost > always they're designed for x86 where the device they're attached > to is also created on the command line, and then handling of boards > where the device is builtin is either an afterthought or forgotten. > See also -netdev, where it took forever for built-in-ethernet to > be usable.) As long as we don't have a separate way to configure built-in devices without creating them, for boards where the device is built in, the properties for the device have to become machine properties. I seem to remember that Markus implemented this for some boards. Just doing without properties for these devices, either by just hard coding things instead of providing options to the user, or by bypassing the property system and accessing global state instead feels wrong. Maybe once we have .instance_config (see my QOM/QAPI integration RFC), forwarding these properties from -machine could actually become trivial, just one call to set the properties for each built-in device. For now, it's probably not as nice because each property needs to be forwarded individually instead of just providing access to all properties of the device. Kevin
[PULL 09/13] block: Let replace_child_noperm free children
From: Hanna Reitz In most of the block layer, especially when traversing down from other BlockDriverStates, we assume that BdrvChild.bs can never be NULL. When it becomes NULL, it is expected that the corresponding BdrvChild pointer also becomes NULL and the BdrvChild object is freed. Therefore, once bdrv_replace_child_noperm() sets the BdrvChild.bs pointer to NULL, it should also immediately set the corresponding BdrvChild pointer (like bs->file or bs->backing) to NULL. In that context, it also makes sense for this function to free the child. Sometimes we cannot do so, though, because it is called in a transactional context where the caller might still want to reinstate the child in the abort branch (and free it only on commit), so this behavior has to remain optional. In bdrv_replace_child_tran()'s abort handler, we now rely on the fact that the BdrvChild passed to bdrv_replace_child_tran() must have had a non-NULL .bs pointer initially. Make a note of that and assert it. Signed-off-by: Hanna Reitz Message-Id: <2021120829.81329-10-hre...@redhat.com> Signed-off-by: Kevin Wolf --- block.c | 102 +++- 1 file changed, 79 insertions(+), 23 deletions(-) diff --git a/block.c b/block.c index a40027161c..0ac5b163d2 100644 --- a/block.c +++ b/block.c @@ -87,8 +87,10 @@ static BlockDriverState *bdrv_open_inherit(const char *filename, static bool bdrv_recurse_has_child(BlockDriverState *bs, BlockDriverState *child); +static void bdrv_child_free(BdrvChild *child); static void bdrv_replace_child_noperm(BdrvChild **child, - BlockDriverState *new_bs); + BlockDriverState *new_bs, + bool free_empty_child); static void bdrv_remove_file_or_backing_child(BlockDriverState *bs, BdrvChild *child, Transaction *tran); @@ -2256,12 +2258,16 @@ typedef struct BdrvReplaceChildState { BdrvChild *child; BdrvChild **childp; BlockDriverState *old_bs; +bool free_empty_child; } BdrvReplaceChildState; static void bdrv_replace_child_commit(void *opaque) { BdrvReplaceChildState *s = opaque; +if (s->free_empty_child && !s->child->bs) { +bdrv_child_free(s->child); +} bdrv_unref(s->old_bs); } @@ -2278,22 +2284,26 @@ static void bdrv_replace_child_abort(void *opaque) * modify the BdrvChild * pointer we indirectly pass to it, i.e. it * will not modify s->child. From that perspective, it does not matter * whether we pass s->childp or >child. - * (TODO: Right now, bdrv_replace_child_noperm() never modifies that - * pointer anyway (though it will in the future), so at this point it - * absolutely does not matter whether we pass s->childp or >child.) * (2) If new_bs is not NULL, s->childp will be NULL. We then cannot use * it here. * (3) If new_bs is NULL, *s->childp will have been NULLed by * bdrv_replace_child_tran()'s bdrv_replace_child_noperm() call, and we * must not pass a NULL *s->childp here. - * (TODO: In its current state, bdrv_replace_child_noperm() will not - * have NULLed *s->childp, so this does not apply yet. It will in the - * future.) * * So whether new_bs was NULL or not, we cannot pass s->childp here; and in * any case, there is no reason to pass it anyway. */ -bdrv_replace_child_noperm(>child, s->old_bs); +bdrv_replace_child_noperm(>child, s->old_bs, true); +/* + * The child was pre-existing, so s->old_bs must be non-NULL, and + * s->child thus must not have been freed + */ +assert(s->child != NULL); +if (!new_bs) { +/* As described above, *s->childp was cleared, so restore it */ +assert(s->childp != NULL); +*s->childp = s->child; +} bdrv_unref(new_bs); } @@ -2310,30 +2320,44 @@ static TransactionActionDrv bdrv_replace_child_drv = { * * The function doesn't update permissions, caller is responsible for this. * + * (*childp)->bs must not be NULL. + * * Note that if new_bs == NULL, @childp is stored in a state object attached * to @tran, so that the old child can be reinstated in the abort handler. * Therefore, if @new_bs can be NULL, @childp must stay valid until the * transaction is committed or aborted. * - * (TODO: The reinstating does not happen yet, but it will once - * bdrv_replace_child_noperm() NULLs *childp when new_bs is NULL.) + * If @free_empty_child is true and @new_bs is NULL, the BdrvChild is + * freed (on commit). @free_empty_child should only be false if the + * caller will free the BDrvChild themselves (which may be important + * if this is in turn called in another transactional context). */ static void
Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
On Montag, 15. November 2021 12:54:32 CET Stefan Hajnoczi wrote: > On Thu, Nov 11, 2021 at 06:54:03PM +0100, Christian Schoenebeck wrote: > > On Donnerstag, 11. November 2021 17:31:52 CET Stefan Hajnoczi wrote: > > > On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote: > > > > On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote: > > > > > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote: > > > > > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote: > > > > > > As you are apparently reluctant for changing the virtio specs, > > > > > > what > > > > > > about > > > > > > introducing those discussed virtio capabalities either as > > > > > > experimental > > > > > > ones > > > > > > without specs changes, or even just as 9p specific device > > > > > > capabilities > > > > > > for > > > > > > now. I mean those could be revoked on both sides at any time > > > > > > anyway. > > > > > > > > > > I would like to understand the root cause before making changes. > > > > > > > > > > "It's faster when I do X" is useful information but it doesn't > > > > > necessarily mean doing X is the solution. The "it's faster when I do > > > > > X > > > > > because Y" part is missing in my mind. Once there is evidence that > > > > > shows > > > > > Y then it will be clearer if X is a good solution, if there's a more > > > > > general solution, or if it was just a side-effect. > > > > > > > > I think I made it clear that the root cause of the observed > > > > performance > > > > gain with rising transmission size is latency (and also that > > > > performance > > > > is not the only reason for addressing this queue size issue). > > > > > > > > Each request roundtrip has a certain minimum latency, the virtio ring > > > > alone > > > > has its latency, plus latency of the controller portion of the file > > > > server > > > > (e.g. permissions, sandbox checks, file IDs) that is executed with > > > > *every* > > > > request, plus latency of dispatching the request handling between > > > > threads > > > > several times back and forth (also for each request). > > > > > > > > Therefore when you split large payloads (e.g. reading a large file) > > > > into > > > > smaller n amount of chunks, then that individual latency per request > > > > accumulates to n times the individual latency, eventually leading to > > > > degraded transmission speed as those requests are serialized. > > > > > > It's easy to increase the blocksize in benchmarks, but real applications > > > offer less control over the I/O pattern. If latency in the device > > > implementation (QEMU) is the root cause then reduce the latency to speed > > > up all applications, even those that cannot send huge requests. > > > > Which I did, still do, and also mentioned before, e.g.: > > > > 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 9pfs: reduce latency of Twalk > > 0c4356ba7dafc8ecb5877a42fc0d68d45ccf5951 9pfs: T_readdir latency > > optimization > > > > Reducing overall latency is a process that is ongoing and will still take > > a > > very long development time. Not because of me, but because of lack of > > reviewers. And even then, it does not make the effort to support higher > > transmission sizes obsolete. > > > > > One idea is request merging on the QEMU side. If the application sends > > > 10 sequential read or write requests, coalesce them together before the > > > main part of request processing begins in the device. Process a single > > > large request to spread the cost of the file server over the 10 > > > requests. (virtio-blk has request merging to help with the cost of lots > > > of small qcow2 I/O requests.) The cool thing about this is that the > > > guest does not need to change its I/O pattern to benefit from the > > > optimization. > > > > > > Stefan > > > > Ok, don't get me wrong: I appreciate that you are suggesting approaches > > that could improve things. But I could already hand you over a huge list > > of mine. The limiting factor here is not the lack of ideas of what could > > be improved, but rather the lack of people helping out actively on 9p > > side: > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06452.html > > > > The situation on kernel side is the same. I already have a huge list of > > what could & should be improved. But there is basically no reviewer for > > 9p patches on Linux kernel side either. > > > > The much I appreciate suggestions of what could be improved, I would > > appreciate much more if there was *anybody* actively assisting as well. In > > the time being I have to work the list down in small patch chunks, > > priority based. > I see request merging as an alternative to this patch series, not as an > additional idea. It is not an alternative. Like I said before, even if it would solve the sequential I/O performance issue (by not simply moving the problem somewhere else), which I doubt, your suggestion would still not solve the semantical
[PULL 07/13] transactions: Invoke clean() after everything else
From: Hanna Reitz Invoke the transaction drivers' .clean() methods only after all .commit() or .abort() handlers are done. This makes it easier to have nested transactions where the top-level transactions pass objects to lower transactions that the latter can still use throughout their commit/abort phases, while the top-level transaction keeps a reference that is released in its .clean() method. (Before this commit, that is also possible, but the top-level transaction would need to take care to invoke tran_add() before the lower-level transaction does. This commit makes the ordering irrelevant, which is just a bit nicer.) Signed-off-by: Hanna Reitz Message-Id: <2021120829.81329-8-hre...@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Kevin Wolf --- include/qemu/transactions.h | 3 +++ util/transactions.c | 8 ++-- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/include/qemu/transactions.h b/include/qemu/transactions.h index 92c5965235..2f2060acd9 100644 --- a/include/qemu/transactions.h +++ b/include/qemu/transactions.h @@ -31,6 +31,9 @@ * tran_create(), call your "prepare" functions on it, and finally call * tran_abort() or tran_commit() to finalize the transaction by corresponding * finalization actions in reverse order. + * + * The clean() functions registered by the drivers in a transaction are called + * last, after all abort() or commit() functions have been called. */ #ifndef QEMU_TRANSACTIONS_H diff --git a/util/transactions.c b/util/transactions.c index d0bc9a3e73..2dbdedce95 100644 --- a/util/transactions.c +++ b/util/transactions.c @@ -61,11 +61,13 @@ void tran_abort(Transaction *tran) { TransactionAction *act, *next; -QSLIST_FOREACH_SAFE(act, >actions, entry, next) { +QSLIST_FOREACH(act, >actions, entry) { if (act->drv->abort) { act->drv->abort(act->opaque); } +} +QSLIST_FOREACH_SAFE(act, >actions, entry, next) { if (act->drv->clean) { act->drv->clean(act->opaque); } @@ -80,11 +82,13 @@ void tran_commit(Transaction *tran) { TransactionAction *act, *next; -QSLIST_FOREACH_SAFE(act, >actions, entry, next) { +QSLIST_FOREACH(act, >actions, entry) { if (act->drv->commit) { act->drv->commit(act->opaque); } +} +QSLIST_FOREACH_SAFE(act, >actions, entry, next) { if (act->drv->clean) { act->drv->clean(act->opaque); } -- 2.31.1
[PULL 12/13] file-posix: Fix alignment after reopen changing O_DIRECT
At the end of a reopen, we already call bdrv_refresh_limits(), which should update bs->request_alignment according to the new file descriptor. However, raw_probe_alignment() relies on s->needs_alignment and just uses 1 if it isn't set. We neglected to update this field, so starting with cache=writeback and then reopening with cache=none means that we get an incorrect bs->request_alignment == 1 and unaligned requests fail instead of being automatically aligned. Fix this by recalculating s->needs_alignment in raw_refresh_limits() before calling raw_probe_alignment(). Signed-off-by: Kevin Wolf Message-Id: <20211104113109.56336-1-kw...@redhat.com> Reviewed-by: Hanna Reitz Reviewed-by: Stefano Garzarella Signed-off-by: Kevin Wolf --- block/file-posix.c | 20 tests/qemu-iotests/142 | 22 ++ tests/qemu-iotests/142.out | 15 +++ 3 files changed, 53 insertions(+), 4 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 7a27c83060..b283093e5b 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -167,6 +167,7 @@ typedef struct BDRVRawState { int page_cache_inconsistent; /* errno from fdatasync failure */ bool has_fallocate; bool needs_alignment; +bool force_alignment; bool drop_cache; bool check_cache_dropped; struct { @@ -351,6 +352,17 @@ static bool dio_byte_aligned(int fd) return false; } +static bool raw_needs_alignment(BlockDriverState *bs) +{ +BDRVRawState *s = bs->opaque; + +if ((bs->open_flags & BDRV_O_NOCACHE) != 0 && !dio_byte_aligned(s->fd)) { +return true; +} + +return s->force_alignment; +} + /* Check if read is allowed with given memory buffer and length. * * This function is used to check O_DIRECT memory buffer and request alignment. @@ -728,9 +740,6 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, s->has_discard = true; s->has_write_zeroes = true; -if ((bs->open_flags & BDRV_O_NOCACHE) != 0 && !dio_byte_aligned(s->fd)) { -s->needs_alignment = true; -} if (fstat(s->fd, ) < 0) { ret = -errno; @@ -784,9 +793,10 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, * so QEMU makes sure all IO operations on the device are aligned * to sector size, or else FreeBSD will reject them with EINVAL. */ -s->needs_alignment = true; +s->force_alignment = true; } #endif +s->needs_alignment = raw_needs_alignment(bs); #ifdef CONFIG_XFS if (platform_test_xfs_fd(s->fd)) { @@ -1251,7 +1261,9 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) BDRVRawState *s = bs->opaque; struct stat st; +s->needs_alignment = raw_needs_alignment(bs); raw_probe_alignment(bs, s->fd, errp); + bs->bl.min_mem_alignment = s->buf_align; bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size); diff --git a/tests/qemu-iotests/142 b/tests/qemu-iotests/142 index 69fd10ef51..07003c597a 100755 --- a/tests/qemu-iotests/142 +++ b/tests/qemu-iotests/142 @@ -350,6 +350,28 @@ info block backing-file" echo "$hmp_cmds" | run_qemu -drive "$files","$ids" | grep "Cache" +echo +echo "--- Alignment after changing O_DIRECT ---" +echo + +# Directly test the protocol level: Can unaligned requests succeed even if +# O_DIRECT was only enabled through a reopen and vice versa? + +$QEMU_IO --cache=writeback -f file $TEST_IMG <
[PULL 03/13] block: Unite remove_empty_child and child_free
From: Hanna Reitz Now that bdrv_remove_empty_child() no longer removes the child from the parent's children list but only checks that it is not in such a list, it is only a wrapper around bdrv_child_free() that checks that the child is empty and unused. That should apply to all children that we free, so put those checks into bdrv_child_free() and drop bdrv_remove_empty_child(). Signed-off-by: Hanna Reitz Reviewed-by: Kevin Wolf Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <2021120829.81329-4-hre...@redhat.com> Signed-off-by: Kevin Wolf --- block.c | 26 +- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/block.c b/block.c index ca024ffced..19bff4f95c 100644 --- a/block.c +++ b/block.c @@ -2740,19 +2740,19 @@ static void bdrv_replace_child_noperm(BdrvChild *child, } } -static void bdrv_child_free(void *opaque) -{ -BdrvChild *c = opaque; - -g_free(c->name); -g_free(c); -} - -static void bdrv_remove_empty_child(BdrvChild *child) +/** + * Free the given @child. + * + * The child must be empty (i.e. `child->bs == NULL`) and it must be + * unused (i.e. not in a children list). + */ +static void bdrv_child_free(BdrvChild *child) { assert(!child->bs); assert(!child->next.le_prev); /* not in children list */ -bdrv_child_free(child); + +g_free(child->name); +g_free(child); } typedef struct BdrvAttachChildCommonState { @@ -2786,7 +2786,7 @@ static void bdrv_attach_child_common_abort(void *opaque) } bdrv_unref(bs); -bdrv_remove_empty_child(child); +bdrv_child_free(child); *s->child = NULL; } @@ -2859,7 +2859,7 @@ static int bdrv_attach_child_common(BlockDriverState *child_bs, if (ret < 0) { error_propagate(errp, local_err); -bdrv_remove_empty_child(new_child); +bdrv_child_free(new_child); return ret; } } @@ -2925,7 +2925,7 @@ static void bdrv_detach_child(BdrvChild *child) BlockDriverState *old_bs = child->bs; bdrv_replace_child_noperm(child, NULL); -bdrv_remove_empty_child(child); +bdrv_child_free(child); if (old_bs) { /* -- 2.31.1
[PULL 11/13] docs: Deprecate incorrectly typed device_add arguments
While introducing a non-QemuOpts code path for device creation for JSON -device, we noticed that QMP device_add doesn't check its input correctly (accepting arguments that should have been rejected), and that users may be relying on this behaviour (libvirt did until it was fixed recently). Let's use a deprecation period before we fix this bug in QEMU to avoid nasty surprises for users. Signed-off-by: Kevin Wolf Message-Id: <2021143530.18985-1-kw...@redhat.com> Reviewed-by: Markus Armbruster Reviewed-by: Daniel P. Berrangé Signed-off-by: Kevin Wolf --- docs/about/deprecated.rst | 14 ++ 1 file changed, 14 insertions(+) diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst index 600031210d..c03fcf951f 100644 --- a/docs/about/deprecated.rst +++ b/docs/about/deprecated.rst @@ -250,6 +250,20 @@ options are removed in favor of using explicit ``blockdev-create`` and ``blockdev-add`` calls. See :doc:`/interop/live-block-operations` for details. +Incorrectly typed ``device_add`` arguments (since 6.2) +'' + +Due to shortcomings in the internal implementation of ``device_add``, QEMU +incorrectly accepts certain invalid arguments: Any object or list arguments are +silently ignored. Other argument types are not checked, but an implicit +conversion happens, so that e.g. string values can be assigned to integer +device properties or vice versa. + +This is a bug in QEMU that will be fixed in the future so that previously +accepted incorrect commands will return an error. Users should make sure that +all arguments passed to ``device_add`` are consistent with the documented +property types. + System accelerators --- -- 2.31.1
[PULL 08/13] block: Let replace_child_tran keep indirect pointer
From: Hanna Reitz As of a future commit, bdrv_replace_child_noperm() will clear the indirect BdrvChild pointer passed to it if the new child BDS is NULL. bdrv_replace_child_tran() will want to let it do that, but revert this change in its abort handler. For that, we need to have it receive a BdrvChild ** pointer, too, and keep it stored in the BdrvReplaceChildState object that we attach to the transaction. Note that we do not need to store it in the BdrvReplaceChildState when new_bs is not NULL, because then there is nothing to revert. This is important so that bdrv_replace_node_noperm() can pass a pointer to a loop-local variable to bdrv_replace_child_tran() without worrying that this pointer will outlive one loop iteration. (Of course, for that to work, bdrv_replace_node_noperm() and in turn bdrv_replace_node() and its relatives may not be called with a NULL @to node. Luckily, they already are not, but now we should assert this.) bdrv_remove_file_or_backing_child() on the other hand needs to ensure that the indirect pointer it passes will stay valid for the duration of the transaction. Ensure this by keeping a strong reference to the BDS whose >backing or >file it passes to bdrv_replace_child_tran(), and giving up that reference only in the transaction .clean() handler. Signed-off-by: Hanna Reitz Message-Id: <2021120829.81329-9-hre...@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Kevin Wolf --- block.c | 83 ++--- 1 file changed, 73 insertions(+), 10 deletions(-) diff --git a/block.c b/block.c index 8da057f800..a40027161c 100644 --- a/block.c +++ b/block.c @@ -2254,6 +2254,7 @@ static int bdrv_drv_set_perm(BlockDriverState *bs, uint64_t perm, typedef struct BdrvReplaceChildState { BdrvChild *child; +BdrvChild **childp; BlockDriverState *old_bs; } BdrvReplaceChildState; @@ -2269,7 +2270,29 @@ static void bdrv_replace_child_abort(void *opaque) BdrvReplaceChildState *s = opaque; BlockDriverState *new_bs = s->child->bs; -/* old_bs reference is transparently moved from @s to @s->child */ +/* + * old_bs reference is transparently moved from @s to s->child. + * + * Pass >child here instead of s->childp, because: + * (1) s->old_bs must be non-NULL, so bdrv_replace_child_noperm() will not + * modify the BdrvChild * pointer we indirectly pass to it, i.e. it + * will not modify s->child. From that perspective, it does not matter + * whether we pass s->childp or >child. + * (TODO: Right now, bdrv_replace_child_noperm() never modifies that + * pointer anyway (though it will in the future), so at this point it + * absolutely does not matter whether we pass s->childp or >child.) + * (2) If new_bs is not NULL, s->childp will be NULL. We then cannot use + * it here. + * (3) If new_bs is NULL, *s->childp will have been NULLed by + * bdrv_replace_child_tran()'s bdrv_replace_child_noperm() call, and we + * must not pass a NULL *s->childp here. + * (TODO: In its current state, bdrv_replace_child_noperm() will not + * have NULLed *s->childp, so this does not apply yet. It will in the + * future.) + * + * So whether new_bs was NULL or not, we cannot pass s->childp here; and in + * any case, there is no reason to pass it anyway. + */ bdrv_replace_child_noperm(>child, s->old_bs); bdrv_unref(new_bs); } @@ -2286,22 +2309,32 @@ static TransactionActionDrv bdrv_replace_child_drv = { * Note: real unref of old_bs is done only on commit. * * The function doesn't update permissions, caller is responsible for this. + * + * Note that if new_bs == NULL, @childp is stored in a state object attached + * to @tran, so that the old child can be reinstated in the abort handler. + * Therefore, if @new_bs can be NULL, @childp must stay valid until the + * transaction is committed or aborted. + * + * (TODO: The reinstating does not happen yet, but it will once + * bdrv_replace_child_noperm() NULLs *childp when new_bs is NULL.) */ -static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs, +static void bdrv_replace_child_tran(BdrvChild **childp, +BlockDriverState *new_bs, Transaction *tran) { BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1); *s = (BdrvReplaceChildState) { -.child = child, -.old_bs = child->bs, +.child = *childp, +.childp = new_bs == NULL ? childp : NULL, +.old_bs = (*childp)->bs, }; tran_add(tran, _replace_child_drv, s); if (new_bs) { bdrv_ref(new_bs); } -bdrv_replace_child_noperm(, new_bs); -/* old_bs reference is transparently moved from @child to @s */ +bdrv_replace_child_noperm(childp, new_bs); +/* old_bs reference is transparently moved from
[PULL 05/13] block: Pass BdrvChild ** to replace_child_noperm
From: Hanna Reitz bdrv_replace_child_noperm() modifies BdrvChild.bs, and can potentially set it to NULL. That is dangerous, because BDS parents generally assume that their children's .bs pointer is never NULL. We therefore want to let bdrv_replace_child_noperm() set the corresponding BdrvChild pointer to NULL, too. This patch lays the foundation for it by passing a BdrvChild ** pointer to bdrv_replace_child_noperm() so that it can later use it to NULL the BdrvChild pointer immediately after setting BdrvChild.bs to NULL. (We will still need to undertake some intermediate steps, though.) Signed-off-by: Hanna Reitz Message-Id: <2021120829.81329-6-hre...@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Kevin Wolf --- block.c | 23 --- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/block.c b/block.c index c7d5aa5254..d668156eca 100644 --- a/block.c +++ b/block.c @@ -87,7 +87,7 @@ static BlockDriverState *bdrv_open_inherit(const char *filename, static bool bdrv_recurse_has_child(BlockDriverState *bs, BlockDriverState *child); -static void bdrv_replace_child_noperm(BdrvChild *child, +static void bdrv_replace_child_noperm(BdrvChild **child, BlockDriverState *new_bs); static void bdrv_remove_file_or_backing_child(BlockDriverState *bs, BdrvChild *child, @@ -2270,7 +2270,7 @@ static void bdrv_replace_child_abort(void *opaque) BlockDriverState *new_bs = s->child->bs; /* old_bs reference is transparently moved from @s to @s->child */ -bdrv_replace_child_noperm(s->child, s->old_bs); +bdrv_replace_child_noperm(>child, s->old_bs); bdrv_unref(new_bs); } @@ -2300,7 +2300,7 @@ static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs, if (new_bs) { bdrv_ref(new_bs); } -bdrv_replace_child_noperm(child, new_bs); +bdrv_replace_child_noperm(, new_bs); /* old_bs reference is transparently moved from @child to @s */ } @@ -2672,9 +2672,10 @@ uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission qapi_perm) return permissions[qapi_perm]; } -static void bdrv_replace_child_noperm(BdrvChild *child, +static void bdrv_replace_child_noperm(BdrvChild **childp, BlockDriverState *new_bs) { +BdrvChild *child = *childp; BlockDriverState *old_bs = child->bs; int new_bs_quiesce_counter; int drain_saldo; @@ -2767,7 +2768,7 @@ static void bdrv_attach_child_common_abort(void *opaque) BdrvChild *child = *s->child; BlockDriverState *bs = child->bs; -bdrv_replace_child_noperm(child, NULL); +bdrv_replace_child_noperm(s->child, NULL); if (bdrv_get_aio_context(bs) != s->old_child_ctx) { bdrv_try_set_aio_context(bs, s->old_child_ctx, _abort); @@ -2867,7 +2868,7 @@ static int bdrv_attach_child_common(BlockDriverState *child_bs, } bdrv_ref(child_bs); -bdrv_replace_child_noperm(new_child, child_bs); +bdrv_replace_child_noperm(_child, child_bs); *child = new_child; @@ -2922,12 +2923,12 @@ static int bdrv_attach_child_noperm(BlockDriverState *parent_bs, return 0; } -static void bdrv_detach_child(BdrvChild *child) +static void bdrv_detach_child(BdrvChild **childp) { -BlockDriverState *old_bs = child->bs; +BlockDriverState *old_bs = (*childp)->bs; -bdrv_replace_child_noperm(child, NULL); -bdrv_child_free(child); +bdrv_replace_child_noperm(childp, NULL); +bdrv_child_free(*childp); if (old_bs) { /* @@ -3033,7 +3034,7 @@ void bdrv_root_unref_child(BdrvChild *child) BlockDriverState *child_bs; child_bs = child->bs; -bdrv_detach_child(child); +bdrv_detach_child(); bdrv_unref(child_bs); } -- 2.31.1
[PULL 13/13] softmmu/qdev-monitor: fix use-after-free in qdev_set_id()
From: Stefan Hajnoczi Reported by Coverity (CID 1465222). Fixes: 4a1d937796de0fecd8b22d7dbebf87f38e8282fd ("softmmu/qdev-monitor: add error handling in qdev_set_id") Cc: Damien Hedde Cc: Kevin Wolf Cc: Michael S. Tsirkin Signed-off-by: Stefan Hajnoczi Message-Id: <20211102163342.31162-1-stefa...@redhat.com> Reviewed-by: Kevin Wolf Reviewed-by: Michael S. Tsirkin Reviewed-by: Philippe Mathieu-Daudé Reviewed-by: Marc-André Lureau Reviewed-by: Damien Hedde Reviewed-by: Markus Armbruster Signed-off-by: Kevin Wolf --- softmmu/qdev-monitor.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c index b5aaae4b8c..01ec420e61 100644 --- a/softmmu/qdev-monitor.c +++ b/softmmu/qdev-monitor.c @@ -593,8 +593,8 @@ const char *qdev_set_id(DeviceState *dev, char *id, Error **errp) if (prop) { dev->id = id; } else { -g_free(id); error_setg(errp, "Duplicate device ID '%s'", id); +g_free(id); return NULL; } } else { -- 2.31.1
[PULL 04/13] block: Drop detached child from ignore list
From: Hanna Reitz bdrv_attach_child_common_abort() restores the parent's AioContext. To do so, the child (which was supposed to be attached, but is now detached again by this abort handler) is added to the ignore list for the AioContext changing functions. However, since we modify a BDS's children list in the BdrvChildClass's .attach and .detach handlers, the child is already effectively detached from the parent by this point. We do not need to put it into the ignore list. Use this opportunity to clean up the empty line structure: Keep setting the ignore list, invoking the AioContext function, and freeing the ignore list in blocks separated by empty lines. Signed-off-by: Hanna Reitz Reviewed-by: Kevin Wolf Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <2021120829.81329-5-hre...@redhat.com> Signed-off-by: Kevin Wolf --- block.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/block.c b/block.c index 19bff4f95c..c7d5aa5254 100644 --- a/block.c +++ b/block.c @@ -2774,14 +2774,16 @@ static void bdrv_attach_child_common_abort(void *opaque) } if (bdrv_child_get_parent_aio_context(child) != s->old_parent_ctx) { -GSList *ignore = g_slist_prepend(NULL, child); +GSList *ignore; +/* No need to ignore `child`, because it has been detached already */ +ignore = NULL; child->klass->can_set_aio_ctx(child, s->old_parent_ctx, , _abort); g_slist_free(ignore); -ignore = g_slist_prepend(NULL, child); -child->klass->set_aio_ctx(child, s->old_parent_ctx, ); +ignore = NULL; +child->klass->set_aio_ctx(child, s->old_parent_ctx, ); g_slist_free(ignore); } -- 2.31.1
[PULL 01/13] stream: Traverse graph after modification
From: Hanna Reitz bdrv_cor_filter_drop() modifies the block graph. That means that other parties can also modify the block graph before it returns. Therefore, we cannot assume that the result of a graph traversal we did before remains valid afterwards. We should thus fetch `base` and `unfiltered_base` afterwards instead of before. Signed-off-by: Hanna Reitz Reviewed-by: Kevin Wolf Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <2021120829.81329-2-hre...@redhat.com> Signed-off-by: Kevin Wolf --- block/stream.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/block/stream.c b/block/stream.c index 97bee482dc..e45113aed6 100644 --- a/block/stream.c +++ b/block/stream.c @@ -54,8 +54,8 @@ static int stream_prepare(Job *job) { StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs); -BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base); -BlockDriverState *unfiltered_base = bdrv_skip_filters(base); +BlockDriverState *base; +BlockDriverState *unfiltered_base; Error *local_err = NULL; int ret = 0; @@ -63,6 +63,9 @@ static int stream_prepare(Job *job) bdrv_cor_filter_drop(s->cor_filter_bs); s->cor_filter_bs = NULL; +base = bdrv_filter_or_cow_bs(s->above_base); +unfiltered_base = bdrv_skip_filters(base); + if (bdrv_cow_child(unfiltered_bs)) { const char *base_id = NULL, *base_fmt = NULL; if (unfiltered_base) { -- 2.31.1
[PULL 06/13] block: Restructure remove_file_or_backing_child()
From: Hanna Reitz As of a future patch, bdrv_replace_child_tran() will take a BdrvChild ** pointer. Prepare for that by getting such a pointer and using it where applicable, and (dereferenced) as a parameter for bdrv_replace_child_tran(). Signed-off-by: Hanna Reitz Message-Id: <2021120829.81329-7-hre...@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Kevin Wolf --- block.c | 21 - 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/block.c b/block.c index d668156eca..8da057f800 100644 --- a/block.c +++ b/block.c @@ -4887,30 +4887,33 @@ static void bdrv_remove_file_or_backing_child(BlockDriverState *bs, BdrvChild *child, Transaction *tran) { +BdrvChild **childp; BdrvRemoveFilterOrCowChild *s; -assert(child == bs->backing || child == bs->file); - if (!child) { return; } +if (child == bs->backing) { +childp = >backing; +} else if (child == bs->file) { +childp = >file; +} else { +g_assert_not_reached(); +} + if (child->bs) { -bdrv_replace_child_tran(child, NULL, tran); +bdrv_replace_child_tran(*childp, NULL, tran); } s = g_new(BdrvRemoveFilterOrCowChild, 1); *s = (BdrvRemoveFilterOrCowChild) { .child = child, -.is_backing = (child == bs->backing), +.is_backing = (childp == >backing), }; tran_add(tran, _remove_filter_or_cow_child_drv, s); -if (s->is_backing) { -bs->backing = NULL; -} else { -bs->file = NULL; -} +*childp = NULL; } /* -- 2.31.1
[PULL 10/13] iotests/030: Unthrottle parallel jobs in reverse
From: Hanna Reitz See the comment for why this is necessary. Signed-off-by: Hanna Reitz Message-Id: <2021120829.81329-11-hre...@redhat.com> Signed-off-by: Kevin Wolf --- tests/qemu-iotests/030 | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030 index 5fb65b4bef..567bf1da67 100755 --- a/tests/qemu-iotests/030 +++ b/tests/qemu-iotests/030 @@ -251,7 +251,16 @@ class TestParallelOps(iotests.QMPTestCase): speed=1024) self.assert_qmp(result, 'return', {}) -for job in pending_jobs: +# Do this in reverse: After unthrottling them, some jobs may finish +# before we have unthrottled all of them. This will drain their +# subgraph, and this will make jobs above them advance (despite those +# jobs on top being throttled). In the worst case, all jobs below the +# top one are finished before we can unthrottle it, and this makes it +# advance so far that it completes before we can unthrottle it - which +# results in an error. +# Starting from the top (i.e. in reverse) does not have this problem: +# When a job finishes, the ones below it are not advanced. +for job in reversed(pending_jobs): result = self.vm.qmp('block-job-set-speed', device=job, speed=0) self.assert_qmp(result, 'return', {}) -- 2.31.1
[PULL 02/13] block: Manipulate children list in .attach/.detach
From: Hanna Reitz The children list is specific to BDS parents. We should not modify it in the general children modification code, but let BDS parents deal with it in their .attach() and .detach() methods. This also has the advantage that a BdrvChild is removed from the children list before its .bs pointer can become NULL. BDS parents generally assume that their children's .bs pointer is never NULL, so this is actually a bug fix. Signed-off-by: Hanna Reitz Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <2021120829.81329-3-hre...@redhat.com> Signed-off-by: Kevin Wolf --- block.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/block.c b/block.c index 580cb77a70..ca024ffced 100644 --- a/block.c +++ b/block.c @@ -1387,6 +1387,8 @@ static void bdrv_child_cb_attach(BdrvChild *child) { BlockDriverState *bs = child->opaque; +QLIST_INSERT_HEAD(>children, child, next); + if (child->role & BDRV_CHILD_COW) { bdrv_backing_attach(child); } @@ -1403,6 +1405,8 @@ static void bdrv_child_cb_detach(BdrvChild *child) } bdrv_unapply_subtree_drain(child, bs); + +QLIST_REMOVE(child, next); } static int bdrv_child_cb_update_filename(BdrvChild *c, BlockDriverState *base, @@ -2747,7 +2751,7 @@ static void bdrv_child_free(void *opaque) static void bdrv_remove_empty_child(BdrvChild *child) { assert(!child->bs); -QLIST_SAFE_REMOVE(child, next); +assert(!child->next.le_prev); /* not in children list */ bdrv_child_free(child); } @@ -2913,12 +2917,6 @@ static int bdrv_attach_child_noperm(BlockDriverState *parent_bs, return ret; } -QLIST_INSERT_HEAD(_bs->children, *child, next); -/* - * child is removed in bdrv_attach_child_common_abort(), so don't care to - * abort this change separately. - */ - return 0; } @@ -4851,7 +4849,6 @@ static void bdrv_remove_filter_or_cow_child_abort(void *opaque) BdrvRemoveFilterOrCowChild *s = opaque; BlockDriverState *parent_bs = s->child->opaque; -QLIST_INSERT_HEAD(_bs->children, s->child, next); if (s->is_backing) { parent_bs->backing = s->child; } else { @@ -4906,7 +4903,6 @@ static void bdrv_remove_file_or_backing_child(BlockDriverState *bs, }; tran_add(tran, _remove_filter_or_cow_child_drv, s); -QLIST_SAFE_REMOVE(child, next); if (s->is_backing) { bs->backing = NULL; } else { -- 2.31.1
[PULL 00/13] Block layer patches
The following changes since commit 42f6c9179be4401974dd3a75ee72defd16b5092d: Merge tag 'pull-ppc-2022' of https://github.com/legoater/qemu into staging (2021-11-12 12:28:25 +0100) are available in the Git repository at: git://repo.or.cz/qemu/kevin.git tags/for-upstream for you to fetch changes up to 7461272c5f6032436ef9032c091c0118539483e4: softmmu/qdev-monitor: fix use-after-free in qdev_set_id() (2021-11-15 15:49:46 +0100) Block layer patches - Fixes to image streaming job and block layer reconfiguration to make iotest 030 pass again - docs: Deprecate incorrectly typed device_add arguments - file-posix: Fix alignment after reopen changing O_DIRECT Hanna Reitz (10): stream: Traverse graph after modification block: Manipulate children list in .attach/.detach block: Unite remove_empty_child and child_free block: Drop detached child from ignore list block: Pass BdrvChild ** to replace_child_noperm block: Restructure remove_file_or_backing_child() transactions: Invoke clean() after everything else block: Let replace_child_tran keep indirect pointer block: Let replace_child_noperm free children iotests/030: Unthrottle parallel jobs in reverse Kevin Wolf (2): docs: Deprecate incorrectly typed device_add arguments file-posix: Fix alignment after reopen changing O_DIRECT Stefan Hajnoczi (1): softmmu/qdev-monitor: fix use-after-free in qdev_set_id() docs/about/deprecated.rst | 14 +++ include/qemu/transactions.h | 3 + block.c | 233 +--- block/file-posix.c | 20 +++- block/stream.c | 7 +- softmmu/qdev-monitor.c | 2 +- util/transactions.c | 8 +- tests/qemu-iotests/030 | 11 ++- tests/qemu-iotests/142 | 22 + tests/qemu-iotests/142.out | 15 +++ 10 files changed, 269 insertions(+), 66 deletions(-)
Re: [PATCH v4 22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers
On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote: Signed-off-by: Emanuele Giuseppe Esposito Reviewed-by: Stefan Hajnoczi --- block.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/block.c b/block.c index 40c4729b8d..da80e89ad4 100644 --- a/block.c +++ b/block.c [...] @@ -7173,6 +7182,7 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx, return true; } *ignore = g_slist_prepend(*ignore, c); +assert(qemu_in_main_thread()); It definitely isn’t wrong to place the assert() here, of course, but it’s an interesting place nonetheless. In other places it seemed like you’d prefer to place it above the first non-declaration statement. Absolutely no need to change it if you don’t want to, just something that I noticed. Hanna
Re: [PATCH v4 21/25] block_int-common.h: split function pointers in BdrvChildClass
On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote: Signed-off-by: Emanuele Giuseppe Esposito Reviewed-by: Stefan Hajnoczi --- include/block/block_int-common.h | 51 1 file changed, 32 insertions(+), 19 deletions(-) diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h index 9857e775fe..ea16099c53 100644 --- a/include/block/block_int-common.h +++ b/include/block/block_int-common.h [...] +/* + * I/O API functions. These functions are thread-safe. + * + * See include/block/block-io.h for more information about + * the I/O API. + */ + +void (*resize)(BdrvChild *child); + /* * If this pair of functions is implemented, the parent doesn't issue new * requests after returning from .drained_begin() until .drained_end() is @@ -859,23 +889,6 @@ struct BdrvChildClass { */ void (*activate)(BdrvChild *child, Error **errp); int (*inactivate)(BdrvChild *child); These are now I/O functions, but I believe they should be GS functions: BlockBackend’s implementation for .activate (blk_root_activate()) invokes blk_set_perm(), which is a GS function. Its .inactivate implementation (blk_root_inactivate()) invokes bdrv_child_try_set_perm(). (I also believe this makes sense in context: .activate is called by bdrv_co_invalidate_cache(), which should be a GS function as I’ve suggested in my reply to patch 20. .inactivate is called by bdrv_inactivate_recurse(), which is a GS function, too.) Hanna
Re: [PATCH 2/2] virtio-blk: support BLKSECDISCARD
On Mon, Nov 15, 2021 at 12:52:00PM +0800, yadong...@intel.com wrote: The Linux block layer shares the DISCARD queue limits with SECDISCARD. That's different from BLKZEROOUT (QEMU's WRITE_ZEROES). This is a Linux implementation detail but I guess virtio-blk can share the DISCARD limits with SECDISCARD too... > @@ -622,6 +628,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, > MultiReqBuffer *mrb) > unsigned out_num = req->elem.out_num; > VirtIOBlock *s = req->dev; > VirtIODevice *vdev = VIRTIO_DEVICE(s); > +bool is_secdiscard = false; > > if (req->elem.out_num < 1 || req->elem.in_num < 1) { > virtio_error(vdev, "virtio-blk missing headers"); > @@ -722,6 +729,9 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, > MultiReqBuffer *mrb) > * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch > statement, > * so we must mask it for these requests, then we will check if it is > set. > */ > +case VIRTIO_BLK_T_SECDISCARD & ~VIRTIO_BLK_T_OUT: > +is_secdiscard = true; > +__attribute__((fallthrough)); The DISCARD case doesn't use __attribute__((fallthrough)) so this is inconsistent. QEMU doesn't use __attribute__((fallthrough)) so I suggest dropping this. > diff --git a/include/standard-headers/linux/virtio_blk.h > b/include/standard-headers/linux/virtio_blk.h > index 2dcc90826a..c55a07840c 100644 > --- a/include/standard-headers/linux/virtio_blk.h > +++ b/include/standard-headers/linux/virtio_blk.h > @@ -40,6 +40,7 @@ > #define VIRTIO_BLK_F_MQ 12 /* support more than one vq */ > #define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */ > #define VIRTIO_BLK_F_WRITE_ZEROES14 /* WRITE ZEROES is supported */ > +#define VIRTIO_BLK_F_SECDISCARD 15 /* WRITE ZEROES is supported */ The comment is copy-pasted from WRITE_ZEROES. signature.asc Description: PGP signature
Re: [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers
On 15.11.21 13:48, Hanna Reitz wrote: On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote: Signed-off-by: Emanuele Giuseppe Esposito Reviewed-by: Stefan Hajnoczi --- block.c | 17 + 1 file changed, 17 insertions(+) diff --git a/block.c b/block.c index 94bff5c757..40c4729b8d 100644 --- a/block.c +++ b/block.c [...] @@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState *bs, BlockDriverState *child_bs, uint64_t *nperm, uint64_t *nshared) { assert(bs->drv && bs->drv->bdrv_child_perm); + assert(qemu_in_main_thread()); bs->drv->bdrv_child_perm(bs, c, role, reopen_queue, parent_perm, parent_shared, nperm, nshared); (Should’ve noticed earlier, but only did now...) First, this function is indirectly called by bdrv_refresh_perms(). I understand that all perm-related functions are classified as GS. However, bdrv_co_invalidate_cache() invokes bdrv_refresh_perms. Being declared in block/coroutine.h, it’s an I/O function, so it mustn’t call such a GS function. BlockDriver.bdrv_co_invalidate_cache(), bdrv_invalidate_cache(), and blk_invalidate_cache() are also classified as I/O functions. Perhaps all of these functions should be classified as GS functions? I believe their callers and their purpose would allow for this. Second, it’s called by bdrv_child_refresh_perms(), which is called by block_crypto_amend_options_generic_luks(). This function is called by block_crypto_co_amend_luks(), which is a BlockDriver.bdrv_co_amend implementation, which is classified as an I/O function. Honestly, I don’t know how to fix that mess. The best would be if we could make the perm functions thread-safe and classify them as I/O, but it seems to me like that’s impossible (I sure hope I’m wrong). On the other hand, .bdrv_co_amend very much strikes me like a GS function, but it isn’t. I’m afraid it must work on nodes that are not in the main context, and it launches a job, so AFAIU we absolutely cannot run it under the BQL. It almost seems to me like we’d need a thread-safe variant of the perm functions that’s allowed to fail when it cannot guarantee thread safety or something. Or perhaps I’m wrong and the perm functions can actually be classified as thread-safe and I/O, that’d be great… Hm. Can we perhaps let block_crypto_amend_options_generic_luks() take the BQL just for the permission adjustment (i.e. the bdrv_child_refresh_perms() call)? Would that work? :/ Hanna
Re: [PATCH 1/2] block:hdev: support BLKSECDISCARD
On Mon, Nov 15, 2021 at 12:51:59PM +0800, yadong...@intel.com wrote: > From: Yadong Qi > > Add a new option "secdiscard" for block drive. Parse it and > record it in bs->open_flags as bit(BDRV_O_SECDISCARD). > > Add a new BDRV_REQ_SECDISCARD bit for secure discard request > from virtual device. > > For host_device backend: implement by ioctl(BLKSECDISCARD) on > real host block device. > For other backend, no implementation. > > E.g.: > qemu-system-x86_64 \ > ... \ > -drive > file=/dev/mmcblk0p2,if=none,format=raw,discard=on,secdiscard=on,id=sd0 \ > -device virtio-blk-pci,drive=sd0,id=sd0_vblk \ > ... I'm curious why there is explicit control over this feature (-drive secdiscard=on|off). For example, is there a reason why users would want to disable secdiscard on a drive that supports it? I guess there is no way to emulate it correctly so secdiscard=on on a drive that doesn't support it produces an error? signature.asc Description: PGP signature
Re: [PATCH RFC 0/2] Eliminate drive_get_next()
On Mon, 15 Nov 2021 at 12:55, Markus Armbruster wrote: > > This is RFC because I'm unsure about the removal of > > /* Reason: init() method uses drive_get_next() */ > dc->user_creatable = false; > > in PATCH 1. Both users appear to wire up some GPIO. If that's > necessary for the thing to work, we should just replace the comment. Looking at the code, it sort of is and sort of isn't. The GPIO line is the chip-select line. If you don't connect it then (because the ssi-sd device configures cs_polarity to SSI_CS_LOW, requesting an active-low chip-select) the device will always be selected. If the machine created an SSI bus with no SSI device attached to it then in theory the user could create an ssi-sd device and connect it there and have it work. But in practice it's really unlikely: machines create SSI buses with specific wired-in devices on them, and the guest OS knows what it has to do to enable the chip select for the device it wants to talk to (often some known GPIO pin on a GPIO controller). So I would keep the user_creatable = false, with a reason of "user should wire up GPIO chip-select line". But see below for a pile of contrary precedent. (The chip-select GPIO is created in the parent class, incidentally.) > Aside: there may be devices that need manual wiring to work, yet don't > have user_creatable unset. Bugs if you ask me. I don't have smart > ideas on how to track them down. Me neither. I notice that the TYPE_M25P80 is also an SSI peripheral with an active-low chipselect but that one doesn't set user_creatable to false. TYPE_SSD0323 also is user-creatable and that one has an active-high chipselect, which means the user can create a device but it will then never be usable because it will ignore all transactions. (More generally, looks like most subclasses of TYPE_SSI_PERIPHERAL don't set user_creatable = false.) -- PMM
Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()
On 11/15/21 13:55, Markus Armbruster wrote: > drive_get_next() is basically a bad idea. It returns the "next" block > backend of a certain interface type. "Next" means bus=0,unit=N, where > subsequent calls count N up from zero, per interface type. > > This lets you define unit numbers implicitly by execution order. If the > order changes, or new calls appear "in the middle", unit numbers change. > ABI break. Hard to spot in review. > > Explicit is better than implicit: use drive_get() directly. > > Signed-off-by: Markus Armbruster > --- > include/sysemu/blockdev.h | 1 - > blockdev.c | 10 -- > hw/arm/aspeed.c | 21 + > hw/arm/cubieboard.c | 2 +- > hw/arm/imx25_pdk.c | 2 +- > hw/arm/integratorcp.c | 2 +- > hw/arm/mcimx6ul-evk.c | 2 +- > hw/arm/mcimx7d-sabre.c | 2 +- > hw/arm/msf2-som.c | 2 +- > hw/arm/npcm7xx_boards.c | 6 +++--- > hw/arm/orangepi.c | 2 +- > hw/arm/raspi.c | 2 +- > hw/arm/realview.c | 2 +- > hw/arm/sabrelite.c | 2 +- > hw/arm/versatilepb.c| 4 ++-- > hw/arm/vexpress.c | 6 +++--- > hw/arm/xilinx_zynq.c| 16 +--- > hw/arm/xlnx-versal-virt.c | 3 ++- > hw/arm/xlnx-zcu102.c| 6 +++--- > hw/microblaze/petalogix_ml605_mmu.c | 2 +- > hw/misc/sifive_u_otp.c | 2 +- > hw/riscv/microchip_pfsoc.c | 2 +- > hw/sparc64/niagara.c| 2 +- > 23 files changed, 49 insertions(+), 52 deletions(-) > @@ -435,11 +438,13 @@ static void aspeed_machine_init(MachineState *machine) > } > > for (i = 0; i < bmc->soc.sdhci.num_slots; i++) { > -sdhci_attach_drive(>soc.sdhci.slots[i], drive_get_next(IF_SD)); > +sdhci_attach_drive(>soc.sdhci.slots[i], > + drive_get(IF_SD, 0, i)); If we put SD on bus #0, ... > } > > if (bmc->soc.emmc.num_slots) { > -sdhci_attach_drive(>soc.emmc.slots[0], drive_get_next(IF_SD)); > +sdhci_attach_drive(>soc.emmc.slots[0], > + drive_get(IF_SD, 0, bmc->soc.sdhci.num_slots)); ... we'd want to put eMMC on bus #1, but I see having eMMC cards on a IF_SD bus as a bug, since these cards are soldered on the board. > --- a/hw/arm/vexpress.c > +++ b/hw/arm/vexpress.c > @@ -625,7 +625,7 @@ static void vexpress_common_init(MachineState *machine) >qdev_get_gpio_in(sysctl, > ARM_SYSCTL_GPIO_MMC_WPROT)); > qdev_connect_gpio_out_named(dev, "card-inserted", 0, >qdev_get_gpio_in(sysctl, > ARM_SYSCTL_GPIO_MMC_CARDIN)); > -dinfo = drive_get_next(IF_SD); > +dinfo = drive_get(IF_SD, 0, 0); Can we have one interface refactor per patch (IF_SD, IF_PFLASH, IF_MTD...)? > @@ -657,7 +657,7 @@ static void vexpress_common_init(MachineState *machine) > > sysbus_create_simple("pl111", map[VE_CLCD], pic[14]); > > -dinfo = drive_get_next(IF_PFLASH); > +dinfo = drive_get(IF_PFLASH, 0, 0); > -static inline void zynq_init_spi_flashes(uint32_t base_addr, qemu_irq irq, > - bool is_qspi) > +static inline int zynq_init_spi_flashes(uint32_t base_addr, qemu_irq irq, > +bool is_qspi, int unit0) > { > +int unit = unit0; > DeviceState *dev; > SysBusDevice *busdev; > SSIBus *spi; > @@ -156,7 +157,7 @@ static inline void zynq_init_spi_flashes(uint32_t > base_addr, qemu_irq irq, > spi = (SSIBus *)qdev_get_child_bus(dev, bus_name); > > for (j = 0; j < num_ss; ++j) { > -DriveInfo *dinfo = drive_get_next(IF_MTD); > +DriveInfo *dinfo = drive_get(IF_MTD, 0, unit++); > diff --git a/hw/arm/xlnx-zcu102.c b/hw/arm/xlnx-zcu102.c > index 3dc2b5e8ca..45eb19ab3b 100644 > --- a/hw/arm/xlnx-zcu102.c > +++ b/hw/arm/xlnx-zcu102.c > @@ -190,7 +190,7 @@ static void xlnx_zcu102_init(MachineState *machine) > BusState *spi_bus; > DeviceState *flash_dev; > qemu_irq cs_line; > -DriveInfo *dinfo = drive_get_next(IF_MTD); > +DriveInfo *dinfo = drive_get(IF_MTD, 0, i); If this is bus #0, ... > gchar *bus_name = g_strdup_printf("spi%d", i); > > spi_bus = qdev_get_child_bus(DEVICE(>soc), bus_name); > @@ -212,7 +212,7 @@ static void xlnx_zcu102_init(MachineState *machine) > BusState *spi_bus; > DeviceState *flash_dev; > qemu_irq cs_line; > -DriveInfo *dinfo = drive_get_next(IF_MTD); > +DriveInfo *dinfo = drive_get(IF_MTD, 0, XLNX_ZYNQMP_NUM_SPIS + i); ... I'd expect we use bus #1 here (different connector on the board). > int bus = i / XLNX_ZYNQMP_NUM_QSPI_BUS_CS; > gchar *bus_name =
Re: [PATCH v2 10/10] iotests/030: Unthrottle parallel jobs in reverse
On 12.11.21 17:25, Vladimir Sementsov-Ogievskiy wrote: 11.11.2021 15:08, Hanna Reitz wrote: See the comment for why this is necessary. Signed-off-by: Hanna Reitz --- tests/qemu-iotests/030 | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030 index 5fb65b4bef..567bf1da67 100755 --- a/tests/qemu-iotests/030 +++ b/tests/qemu-iotests/030 @@ -251,7 +251,16 @@ class TestParallelOps(iotests.QMPTestCase): speed=1024) self.assert_qmp(result, 'return', {}) - for job in pending_jobs: + # Do this in reverse: After unthrottling them, some jobs may finish + # before we have unthrottled all of them. This will drain their + # subgraph, and this will make jobs above them advance (despite those + # jobs on top being throttled). In the worst case, all jobs below the + # top one are finished before we can unthrottle it, and this makes it + # advance so far that it completes before we can unthrottle it - which + # results in an error. + # Starting from the top (i.e. in reverse) does not have this problem: + # When a job finishes, the ones below it are not advanced. Hmm, interesting why only jobs above the finished job may advance in the situation.. Looks like something may change and this workaround will stop working. Isn't it better just handle the error, and don't care if job was just finished? Something like if result['return'] != {}: # Job was finished during drain caused by finish of already unthrottled job self.assert_qmp(result, 'error/class', 'DeviceNotActive') Well. My explanation (excuse) is that I felt like this was the hack-ish solution that I could have gone for from the start without understanding what the issue is (and in fact it was the solution I used while debugging the other problems). I went with `reversed()`, because that really addresses the problem. You’re right in that it only addresses the problem for now and there’s a chance it might reappear. If we want to go with ignoring DeviceNotActive errors, then I think we should at least query all block jobs before the unthrottle loop and see that at least at one point they were all running simultaneously. I don’t really have a strong opinion. We can exchange this patch now (though I’d rather not hold up the rest of the series for it), or have a patch on top later, or, well, just keep it for now. I think the least stressful option would be to just fix it up later. Hanna
Re: [PATCH RFC 1/2] hw/sd/ssi-sd: Do not create SD card within controller's realize
Peter Maydell writes: > On Mon, 15 Nov 2021 at 12:56, Markus Armbruster wrote: >> >> ssi_sd_realize() creates an "sd-card" device. This is inappropriate, >> and marked FIXME. >> >> Move it to the boards that create these devices. Prior art: commit >> eb4f566bbb for device "generic-sdhci", and commit 26c607b86b for >> device "pl181". >> >> Signed-off-by: Markus Armbruster >> --- > >> @@ -670,7 +672,7 @@ static void sifive_u_machine_init(MachineState *machine) >> >> /* Connect an SPI flash to SPI0 */ >> flash_dev = qdev_new("is25wp256"); >> -dinfo = drive_get_next(IF_MTD); >> +dinfo = drive_get(IF_MTD, 0, 0); >> if (dinfo) { >> qdev_prop_set_drive_err(flash_dev, "drive", >> blk_by_legacy_dinfo(dinfo), > > > This part looks like it should have been in the other patch. You're right. Thanks!
Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()
Peter Maydell writes: > On Mon, 15 Nov 2021 at 12:55, Markus Armbruster wrote: >> >> drive_get_next() is basically a bad idea. It returns the "next" block >> backend of a certain interface type. "Next" means bus=0,unit=N, where >> subsequent calls count N up from zero, per interface type. >> >> This lets you define unit numbers implicitly by execution order. If the >> order changes, or new calls appear "in the middle", unit numbers change. >> ABI break. Hard to spot in review. >> >> Explicit is better than implicit: use drive_get() directly. >> >> Signed-off-by: Markus Armbruster >> --- >> include/sysemu/blockdev.h | 1 - >> blockdev.c | 10 -- >> hw/arm/aspeed.c | 21 + >> hw/arm/cubieboard.c | 2 +- >> hw/arm/imx25_pdk.c | 2 +- >> hw/arm/integratorcp.c | 2 +- >> hw/arm/mcimx6ul-evk.c | 2 +- >> hw/arm/mcimx7d-sabre.c | 2 +- >> hw/arm/msf2-som.c | 2 +- >> hw/arm/npcm7xx_boards.c | 6 +++--- >> hw/arm/orangepi.c | 2 +- >> hw/arm/raspi.c | 2 +- >> hw/arm/realview.c | 2 +- >> hw/arm/sabrelite.c | 2 +- >> hw/arm/versatilepb.c| 4 ++-- >> hw/arm/vexpress.c | 6 +++--- >> hw/arm/xilinx_zynq.c| 16 +--- >> hw/arm/xlnx-versal-virt.c | 3 ++- >> hw/arm/xlnx-zcu102.c| 6 +++--- >> hw/microblaze/petalogix_ml605_mmu.c | 2 +- >> hw/misc/sifive_u_otp.c | 2 +- >> hw/riscv/microchip_pfsoc.c | 2 +- >> hw/sparc64/niagara.c| 2 +- >> 23 files changed, 49 insertions(+), 52 deletions(-) > > This would be easier to review if it didn't try to change all of > these board/SoC models at once. Each one of them is entirely > separate review work. Happy to split, but first I'd like to get some advice on the part I'm unsure about; see my cover letter.
Re: [PATCH RFC 1/2] hw/sd/ssi-sd: Do not create SD card within controller's realize
On Mon, 15 Nov 2021 at 12:56, Markus Armbruster wrote: > > ssi_sd_realize() creates an "sd-card" device. This is inappropriate, > and marked FIXME. > > Move it to the boards that create these devices. Prior art: commit > eb4f566bbb for device "generic-sdhci", and commit 26c607b86b for > device "pl181". > > Signed-off-by: Markus Armbruster > --- > @@ -670,7 +672,7 @@ static void sifive_u_machine_init(MachineState *machine) > > /* Connect an SPI flash to SPI0 */ > flash_dev = qdev_new("is25wp256"); > -dinfo = drive_get_next(IF_MTD); > +dinfo = drive_get(IF_MTD, 0, 0); > if (dinfo) { > qdev_prop_set_drive_err(flash_dev, "drive", > blk_by_legacy_dinfo(dinfo), This part looks like it should have been in the other patch. -- PMM
Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()
On Mon, 15 Nov 2021 at 12:55, Markus Armbruster wrote: > > drive_get_next() is basically a bad idea. It returns the "next" block > backend of a certain interface type. "Next" means bus=0,unit=N, where > subsequent calls count N up from zero, per interface type. > > This lets you define unit numbers implicitly by execution order. If the > order changes, or new calls appear "in the middle", unit numbers change. > ABI break. Hard to spot in review. > > Explicit is better than implicit: use drive_get() directly. > > Signed-off-by: Markus Armbruster > --- > include/sysemu/blockdev.h | 1 - > blockdev.c | 10 -- > hw/arm/aspeed.c | 21 + > hw/arm/cubieboard.c | 2 +- > hw/arm/imx25_pdk.c | 2 +- > hw/arm/integratorcp.c | 2 +- > hw/arm/mcimx6ul-evk.c | 2 +- > hw/arm/mcimx7d-sabre.c | 2 +- > hw/arm/msf2-som.c | 2 +- > hw/arm/npcm7xx_boards.c | 6 +++--- > hw/arm/orangepi.c | 2 +- > hw/arm/raspi.c | 2 +- > hw/arm/realview.c | 2 +- > hw/arm/sabrelite.c | 2 +- > hw/arm/versatilepb.c| 4 ++-- > hw/arm/vexpress.c | 6 +++--- > hw/arm/xilinx_zynq.c| 16 +--- > hw/arm/xlnx-versal-virt.c | 3 ++- > hw/arm/xlnx-zcu102.c| 6 +++--- > hw/microblaze/petalogix_ml605_mmu.c | 2 +- > hw/misc/sifive_u_otp.c | 2 +- > hw/riscv/microchip_pfsoc.c | 2 +- > hw/sparc64/niagara.c| 2 +- > 23 files changed, 49 insertions(+), 52 deletions(-) This would be easier to review if it didn't try to change all of these board/SoC models at once. Each one of them is entirely separate review work. -- PMM
Re: does drive_get_next(IF_NONE) make sense?
On Mon, 15 Nov 2021 at 13:24, Kevin Wolf wrote: > Same question as for Hao Wu's series: Wouldn't the proper solution be to > add a drive property to the machine type? > > If you can't use -blockdev, it's not done right. Is there an example of "doing it right" for built-in-to-the-machine devices? (My experience with the new-style options is that almost always they're designed for x86 where the device they're attached to is also created on the command line, and then handling of boards where the device is builtin is either an afterthought or forgotten. See also -netdev, where it took forever for built-in-ethernet to be usable.) -- PMM
Re: does drive_get_next(IF_NONE) make sense?
Am 15.11.2021 um 06:31 hat Markus Armbruster geschrieben: > Peter Maydell writes: > > > On Fri, 12 Nov 2021 at 13:34, Markus Armbruster wrote: > >> > >> Thomas Huth writes: > >> > >> > On 03/11/2021 09.41, Markus Armbruster wrote: > >> >> Peter Maydell writes: > >> >> > >> >>> Does it make sense for a device/board to do drive_get_next(IF_NONE) ? > >> >> Short answer: hell, no! ;) > >> > > >> > Would it make sense to add an "assert(type != IF_NONE)" to drive_get() > >> > to avoid such mistakes in the future? > >> > >> Worth a try. > > > > You need to fix the sifive_u_otp device first :-) > > And for that, we may want Hao Wu's "[PATCH v4 5/7] blockdev: Add a new > IF type IF_OTHER" first. Same question as for Hao Wu's series: Wouldn't the proper solution be to add a drive property to the machine type? If you can't use -blockdev, it's not done right. Kevin
Re: [PATCH v4 6/7] hw/nvram: Update at24c EEPROM init function in NPCM7xx boards
Am 03.11.2021 um 23:01 hat Hao Wu geschrieben: > We made 3 changes to the at24c_eeprom_init function in > npcm7xx_boards.c: > > 1. We allow the function to take a I2CBus* as parameter. This allows >us to attach an EEPROM device behind an I2C mux which is not >possible with the old method. > > 2. We make at24c EEPROMs are backed by drives so that we can >specify the content of the EEPROMs. > > 3. Instead of using i2c address as unit number, This patch assigns >unique unit numbers for each eeproms in each board. This avoids >conflict in providing multiple eeprom contents with the same address. >In the old method if we specify two drives with the same unit number, >the following error will occur: `Device with id 'none85' exists`. > > Signed-off-by: Hao Wu > --- > hw/arm/npcm7xx_boards.c | 14 +- > 1 file changed, 9 insertions(+), 5 deletions(-) > > diff --git a/hw/arm/npcm7xx_boards.c b/hw/arm/npcm7xx_boards.c > index dec7d16ae5..9121e081fa 100644 > --- a/hw/arm/npcm7xx_boards.c > +++ b/hw/arm/npcm7xx_boards.c > @@ -126,13 +126,17 @@ static I2CBus *npcm7xx_i2c_get_bus(NPCM7xxState *soc, > uint32_t num) > return I2C_BUS(qdev_get_child_bus(DEVICE(>smbus[num]), "i2c-bus")); > } > > -static void at24c_eeprom_init(NPCM7xxState *soc, int bus, uint8_t addr, > - uint32_t rsize) > +static void at24c_eeprom_init(I2CBus *i2c_bus, int bus, uint8_t addr, > + uint32_t rsize, int unit_number) > { > -I2CBus *i2c_bus = npcm7xx_i2c_get_bus(soc, bus); > I2CSlave *i2c_dev = i2c_slave_new("at24c-eeprom", addr); > DeviceState *dev = DEVICE(i2c_dev); > +DriveInfo *dinfo; > > +dinfo = drive_get(IF_OTHER, bus, unit_number); > +if (dinfo) { > +qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo)); > +} I may be missing the history of this series, but why do we have to use a legacy DriveInfo here instead of adding a drive property to the board to make this configurable with -blockdev? Adding even more devices that can only be configured with -drive feels like a step backwards to me. Kevin
Re: [PATCH v2 09/10] block: Let replace_child_noperm free children
On 12.11.21 17:10, Vladimir Sementsov-Ogievskiy wrote: 11.11.2021 15:08, Hanna Reitz wrote: In most of the block layer, especially when traversing down from other BlockDriverStates, we assume that BdrvChild.bs can never be NULL. When it becomes NULL, it is expected that the corresponding BdrvChild pointer also becomes NULL and the BdrvChild object is freed. Therefore, once bdrv_replace_child_noperm() sets the BdrvChild.bs pointer to NULL, it should also immediately set the corresponding BdrvChild pointer (like bs->file or bs->backing) to NULL. In that context, it also makes sense for this function to free the child. Sometimes we cannot do so, though, because it is called in a transactional context where the caller might still want to reinstate the child in the abort branch (and free it only on commit), so this behavior has to remain optional. In bdrv_replace_child_tran()'s abort handler, we now rely on the fact that the BdrvChild passed to bdrv_replace_child_tran() must have had a non-NULL .bs pointer initially. Make a note of that and assert it. Signed-off-by: Hanna Reitz --- block.c | 102 +++- 1 file changed, 79 insertions(+), 23 deletions(-) diff --git a/block.c b/block.c index a40027161c..0ac5b163d2 100644 --- a/block.c +++ b/block.c @@ -87,8 +87,10 @@ static BlockDriverState *bdrv_open_inherit(const char *filename, static bool bdrv_recurse_has_child(BlockDriverState *bs, BlockDriverState *child); +static void bdrv_child_free(BdrvChild *child); static void bdrv_replace_child_noperm(BdrvChild **child, - BlockDriverState *new_bs); + BlockDriverState *new_bs, + bool free_empty_child); static void bdrv_remove_file_or_backing_child(BlockDriverState *bs, BdrvChild *child, Transaction *tran); @@ -2256,12 +2258,16 @@ typedef struct BdrvReplaceChildState { BdrvChild *child; BdrvChild **childp; BlockDriverState *old_bs; + bool free_empty_child; } BdrvReplaceChildState; static void bdrv_replace_child_commit(void *opaque) { BdrvReplaceChildState *s = opaque; + if (s->free_empty_child && !s->child->bs) { + bdrv_child_free(s->child); + } bdrv_unref(s->old_bs); } @@ -2278,22 +2284,26 @@ static void bdrv_replace_child_abort(void *opaque) * modify the BdrvChild * pointer we indirectly pass to it, i.e. it * will not modify s->child. From that perspective, it does not matter * whether we pass s->childp or >child. - * (TODO: Right now, bdrv_replace_child_noperm() never modifies that - * pointer anyway (though it will in the future), so at this point it - * absolutely does not matter whether we pass s->childp or >child.) * (2) If new_bs is not NULL, s->childp will be NULL. We then cannot use * it here. * (3) If new_bs is NULL, *s->childp will have been NULLed by * bdrv_replace_child_tran()'s bdrv_replace_child_noperm() call, and we * must not pass a NULL *s->childp here. - * (TODO: In its current state, bdrv_replace_child_noperm() will not - * have NULLed *s->childp, so this does not apply yet. It will in the - * future.) What I don't like about this patch is that it does two different things: zeroing the pointer and clearing the object. And if we look at the latter in separate, it seems that it's not needed: Look: bdrv_replace_child_tran(): new parameter is set to true in two places, in both of them we are sure (and do assertion and comment) that new bs is not NULL and nothing will be freed. Similarly, bdrv_replace_child_noperm() is called with true in two places where we sure that new bs is not NULL. and only one place where new parameter set to true really do something: @@ -2960,8 +3013,7 @@ static void bdrv_detach_child(BdrvChild **childp) { BlockDriverState *old_bs = (*childp)->bs; - bdrv_replace_child_noperm(childp, NULL); - bdrv_child_free(*childp); + bdrv_replace_child_noperm(childp, NULL, true); if (old_bs) { /* And it doesn't worth the whole complexity of new parameters for two functions. In this place we can simply do something like BdrvChild *child = *childp; bdrv_replace_child_noperm(childp, NULL); bdrv_child_free(child); I understand the idea: it seems good and intuitive to do zeroing the pointer and clearing the object in one shot. But this patch itself shows that we just can't do it in 90% of cases. So, I think better is not do it and live with only "zeroing the pointer" part of this patch. I see your point, but I don’t find it too complex. Passing `true` is the default and then calling the function is easy. Passing
[PATCH RFC 1/2] hw/sd/ssi-sd: Do not create SD card within controller's realize
ssi_sd_realize() creates an "sd-card" device. This is inappropriate, and marked FIXME. Move it to the boards that create these devices. Prior art: commit eb4f566bbb for device "generic-sdhci", and commit 26c607b86b for device "pl181". Signed-off-by: Markus Armbruster --- hw/arm/stellaris.c | 15 ++- hw/riscv/sifive_u.c | 15 +-- hw/sd/ssi-sd.c | 29 - 3 files changed, 27 insertions(+), 32 deletions(-) diff --git a/hw/arm/stellaris.c b/hw/arm/stellaris.c index 78827ace6b..b6c8a5d609 100644 --- a/hw/arm/stellaris.c +++ b/hw/arm/stellaris.c @@ -10,6 +10,7 @@ #include "qemu/osdep.h" #include "qapi/error.h" #include "hw/sysbus.h" +#include "hw/sd/sd.h" #include "hw/ssi/ssi.h" #include "hw/arm/boot.h" #include "qemu/timer.h" @@ -1157,6 +1158,9 @@ static void stellaris_init(MachineState *ms, stellaris_board_info *board) void *bus; DeviceState *sddev; DeviceState *ssddev; +DriveInfo *dinfo; +DeviceState *carddev; +BlockBackend *blk; /* * Some boards have both an OLED controller and SD card connected to @@ -1221,8 +1225,17 @@ static void stellaris_init(MachineState *ms, stellaris_board_info *board) * - Make the ssd0323 OLED controller chipselect active-low */ bus = qdev_get_child_bus(dev, "ssi"); - sddev = ssi_create_peripheral(bus, "ssi-sd"); + +dinfo = drive_get(IF_SD, 0, 0); +blk = dinfo ? blk_by_legacy_dinfo(dinfo) : NULL; +carddev = qdev_new(TYPE_SD_CARD); +qdev_prop_set_drive_err(carddev, "drive", blk, _fatal); +qdev_prop_set_bit(carddev, "spi", true); +qdev_realize_and_unref(carddev, + qdev_get_child_bus(sddev, "sd-bus"), + _fatal); + ssddev = ssi_create_peripheral(bus, "ssd0323"); gpio_out[GPIO_D][0] = qemu_irq_split( qdev_get_gpio_in_named(sddev, SSI_GPIO_CS, 0), diff --git a/hw/riscv/sifive_u.c b/hw/riscv/sifive_u.c index 589ae72a59..aa74e67889 100644 --- a/hw/riscv/sifive_u.c +++ b/hw/riscv/sifive_u.c @@ -46,6 +46,7 @@ #include "hw/char/serial.h" #include "hw/cpu/cluster.h" #include "hw/misc/unimp.h" +#include "hw/sd/sd.h" #include "hw/ssi/ssi.h" #include "target/riscv/cpu.h" #include "hw/riscv/riscv_hart.h" @@ -536,7 +537,8 @@ static void sifive_u_machine_init(MachineState *machine) uint32_t fdt_load_addr; uint64_t kernel_entry; DriveInfo *dinfo; -DeviceState *flash_dev, *sd_dev; +BlockBackend *blk; +DeviceState *flash_dev, *sd_dev, *card_dev; qemu_irq flash_cs, sd_cs; /* Initialize SoC */ @@ -670,7 +672,7 @@ static void sifive_u_machine_init(MachineState *machine) /* Connect an SPI flash to SPI0 */ flash_dev = qdev_new("is25wp256"); -dinfo = drive_get_next(IF_MTD); +dinfo = drive_get(IF_MTD, 0, 0); if (dinfo) { qdev_prop_set_drive_err(flash_dev, "drive", blk_by_legacy_dinfo(dinfo), @@ -686,6 +688,15 @@ static void sifive_u_machine_init(MachineState *machine) sd_cs = qdev_get_gpio_in_named(sd_dev, SSI_GPIO_CS, 0); sysbus_connect_irq(SYS_BUS_DEVICE(>soc.spi2), 1, sd_cs); + +dinfo = drive_get(IF_SD, 0, 0); +blk = dinfo ? blk_by_legacy_dinfo(dinfo) : NULL; +card_dev = qdev_new(TYPE_SD_CARD); +qdev_prop_set_drive_err(card_dev, "drive", blk, _fatal); +qdev_prop_set_bit(card_dev, "spi", true); +qdev_realize_and_unref(card_dev, + qdev_get_child_bus(sd_dev, "sd-bus"), + _fatal); } static bool sifive_u_machine_get_start_in_flash(Object *obj, Error **errp) diff --git a/hw/sd/ssi-sd.c b/hw/sd/ssi-sd.c index e60854eeef..558506f6a0 100644 --- a/hw/sd/ssi-sd.c +++ b/hw/sd/ssi-sd.c @@ -368,36 +368,9 @@ static const VMStateDescription vmstate_ssi_sd = { static void ssi_sd_realize(SSIPeripheral *d, Error **errp) { -ERRP_GUARD(); ssi_sd_state *s = SSI_SD(d); -DeviceState *carddev; -DriveInfo *dinfo; qbus_init(>sdbus, sizeof(s->sdbus), TYPE_SD_BUS, DEVICE(d), "sd-bus"); - -/* Create and plug in the sd card */ -/* FIXME use a qdev drive property instead of drive_get_next() */ -dinfo = drive_get_next(IF_SD); -carddev = qdev_new(TYPE_SD_CARD); -if (dinfo) { -if (!qdev_prop_set_drive_err(carddev, "drive", - blk_by_legacy_dinfo(dinfo), errp)) { -goto fail; -} -} - -if (!object_property_set_bool(OBJECT(carddev), "spi", true, errp)) { -goto fail; -} - -if (!qdev_realize_and_unref(carddev, BUS(>sdbus), errp)) { -goto fail; -} - -return; - -fail: -error_prepend(errp, "failed to init SD card: "); } static void ssi_sd_reset(DeviceState *dev) @@ -426,8 +399,6 @@
[PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()
drive_get_next() is basically a bad idea. It returns the "next" block backend of a certain interface type. "Next" means bus=0,unit=N, where subsequent calls count N up from zero, per interface type. This lets you define unit numbers implicitly by execution order. If the order changes, or new calls appear "in the middle", unit numbers change. ABI break. Hard to spot in review. Explicit is better than implicit: use drive_get() directly. Signed-off-by: Markus Armbruster --- include/sysemu/blockdev.h | 1 - blockdev.c | 10 -- hw/arm/aspeed.c | 21 + hw/arm/cubieboard.c | 2 +- hw/arm/imx25_pdk.c | 2 +- hw/arm/integratorcp.c | 2 +- hw/arm/mcimx6ul-evk.c | 2 +- hw/arm/mcimx7d-sabre.c | 2 +- hw/arm/msf2-som.c | 2 +- hw/arm/npcm7xx_boards.c | 6 +++--- hw/arm/orangepi.c | 2 +- hw/arm/raspi.c | 2 +- hw/arm/realview.c | 2 +- hw/arm/sabrelite.c | 2 +- hw/arm/versatilepb.c| 4 ++-- hw/arm/vexpress.c | 6 +++--- hw/arm/xilinx_zynq.c| 16 +--- hw/arm/xlnx-versal-virt.c | 3 ++- hw/arm/xlnx-zcu102.c| 6 +++--- hw/microblaze/petalogix_ml605_mmu.c | 2 +- hw/misc/sifive_u_otp.c | 2 +- hw/riscv/microchip_pfsoc.c | 2 +- hw/sparc64/niagara.c| 2 +- 23 files changed, 49 insertions(+), 52 deletions(-) diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h index 32c2d6023c..a750f99b79 100644 --- a/include/sysemu/blockdev.h +++ b/include/sysemu/blockdev.h @@ -50,7 +50,6 @@ void drive_check_orphaned(void); DriveInfo *drive_get_by_index(BlockInterfaceType type, int index); int drive_get_max_bus(BlockInterfaceType type); int drive_get_max_devs(BlockInterfaceType type); -DriveInfo *drive_get_next(BlockInterfaceType type); QemuOpts *drive_def(const char *optstr); QemuOpts *drive_add(BlockInterfaceType type, int index, const char *file, diff --git a/blockdev.c b/blockdev.c index b35072644e..0eb2823b1b 100644 --- a/blockdev.c +++ b/blockdev.c @@ -303,16 +303,6 @@ int drive_get_max_bus(BlockInterfaceType type) return max_bus; } -/* Get a block device. This should only be used for single-drive devices - (e.g. SD/Floppy/MTD). Multi-disk devices (scsi/ide) should use the - appropriate bus. */ -DriveInfo *drive_get_next(BlockInterfaceType type) -{ -static int next_block_unit[IF_COUNT]; - -return drive_get(type, 0, next_block_unit[type]++); -} - static void bdrv_format_print(void *opaque, const char *name) { qemu_printf(" %s", name); diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c index a77f46b3ad..cf20ae0db5 100644 --- a/hw/arm/aspeed.c +++ b/hw/arm/aspeed.c @@ -284,12 +284,13 @@ static void write_boot_rom(DriveInfo *dinfo, hwaddr addr, size_t rom_size, } static void aspeed_board_init_flashes(AspeedSMCState *s, - const char *flashtype) + const char *flashtype, + int unit0) { int i ; for (i = 0; i < s->num_cs; ++i) { -DriveInfo *dinfo = drive_get_next(IF_MTD); +DriveInfo *dinfo = drive_get(IF_MTD, 0, unit0 + i); qemu_irq cs_line; DeviceState *dev; @@ -382,10 +383,12 @@ static void aspeed_machine_init(MachineState *machine) "max_ram", max_ram_size - machine->ram_size); memory_region_add_subregion(>ram_container, machine->ram_size, >max_ram); -aspeed_board_init_flashes(>soc.fmc, bmc->fmc_model ? - bmc->fmc_model : amc->fmc_model); -aspeed_board_init_flashes(>soc.spi[0], bmc->spi_model ? - bmc->spi_model : amc->spi_model); +aspeed_board_init_flashes(>soc.fmc, + bmc->fmc_model ? bmc->fmc_model : amc->fmc_model, + 0); +aspeed_board_init_flashes(>soc.spi[0], + bmc->spi_model ? bmc->spi_model : amc->spi_model, + bmc->soc.fmc.num_cs); /* Install first FMC flash content as a boot rom. */ if (drive0) { @@ -435,11 +438,13 @@ static void aspeed_machine_init(MachineState *machine) } for (i = 0; i < bmc->soc.sdhci.num_slots; i++) { -sdhci_attach_drive(>soc.sdhci.slots[i], drive_get_next(IF_SD)); +sdhci_attach_drive(>soc.sdhci.slots[i], + drive_get(IF_SD, 0, i)); } if (bmc->soc.emmc.num_slots) { -sdhci_attach_drive(>soc.emmc.slots[0], drive_get_next(IF_SD)); +sdhci_attach_drive(>soc.emmc.slots[0], + drive_get(IF_SD, 0, bmc->soc.sdhci.num_slots)); }
[PATCH RFC 0/2] Eliminate drive_get_next()
This is RFC because I'm unsure about the removal of /* Reason: init() method uses drive_get_next() */ dc->user_creatable = false; in PATCH 1. Both users appear to wire up some GPIO. If that's necessary for the thing to work, we should just replace the comment. Aside: there may be devices that need manual wiring to work, yet don't have user_creatable unset. Bugs if you ask me. I don't have smart ideas on how to track them down. Markus Armbruster (2): hw/sd/ssi-sd: Do not create SD card within controller's realize hw: Replace drive_get_next() by drive_get() include/sysemu/blockdev.h | 1 - blockdev.c | 10 -- hw/arm/aspeed.c | 21 + hw/arm/cubieboard.c | 2 +- hw/arm/imx25_pdk.c | 2 +- hw/arm/integratorcp.c | 2 +- hw/arm/mcimx6ul-evk.c | 2 +- hw/arm/mcimx7d-sabre.c | 2 +- hw/arm/msf2-som.c | 2 +- hw/arm/npcm7xx_boards.c | 6 +++--- hw/arm/orangepi.c | 2 +- hw/arm/raspi.c | 2 +- hw/arm/realview.c | 2 +- hw/arm/sabrelite.c | 2 +- hw/arm/stellaris.c | 15 ++- hw/arm/versatilepb.c| 4 ++-- hw/arm/vexpress.c | 6 +++--- hw/arm/xilinx_zynq.c| 16 +--- hw/arm/xlnx-versal-virt.c | 3 ++- hw/arm/xlnx-zcu102.c| 6 +++--- hw/microblaze/petalogix_ml605_mmu.c | 2 +- hw/misc/sifive_u_otp.c | 2 +- hw/riscv/microchip_pfsoc.c | 2 +- hw/riscv/sifive_u.c | 15 +-- hw/sd/ssi-sd.c | 29 - hw/sparc64/niagara.c| 2 +- 26 files changed, 76 insertions(+), 84 deletions(-) -- 2.31.1
Re: [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers
On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote: Signed-off-by: Emanuele Giuseppe Esposito Reviewed-by: Stefan Hajnoczi --- block.c | 17 + 1 file changed, 17 insertions(+) diff --git a/block.c b/block.c index 94bff5c757..40c4729b8d 100644 --- a/block.c +++ b/block.c [...] @@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState *bs, BlockDriverState *child_bs, uint64_t *nperm, uint64_t *nshared) { assert(bs->drv && bs->drv->bdrv_child_perm); +assert(qemu_in_main_thread()); bs->drv->bdrv_child_perm(bs, c, role, reopen_queue, parent_perm, parent_shared, nperm, nshared); (Should’ve noticed earlier, but only did now...) First, this function is indirectly called by bdrv_refresh_perms(). I understand that all perm-related functions are classified as GS. However, bdrv_co_invalidate_cache() invokes bdrv_refresh_perms. Being declared in block/coroutine.h, it’s an I/O function, so it mustn’t call such a GS function. BlockDriver.bdrv_co_invalidate_cache(), bdrv_invalidate_cache(), and blk_invalidate_cache() are also classified as I/O functions. Perhaps all of these functions should be classified as GS functions? I believe their callers and their purpose would allow for this. Second, it’s called by bdrv_child_refresh_perms(), which is called by block_crypto_amend_options_generic_luks(). This function is called by block_crypto_co_amend_luks(), which is a BlockDriver.bdrv_co_amend implementation, which is classified as an I/O function. Honestly, I don’t know how to fix that mess. The best would be if we could make the perm functions thread-safe and classify them as I/O, but it seems to me like that’s impossible (I sure hope I’m wrong). On the other hand, .bdrv_co_amend very much strikes me like a GS function, but it isn’t. I’m afraid it must work on nodes that are not in the main context, and it launches a job, so AFAIU we absolutely cannot run it under the BQL. It almost seems to me like we’d need a thread-safe variant of the perm functions that’s allowed to fail when it cannot guarantee thread safety or something. Or perhaps I’m wrong and the perm functions can actually be classified as thread-safe and I/O, that’d be great… Hanna
Re: [PATCH 1/2] block:hdev: support BLKSECDISCARD
Am 15.11.2021 um 05:51 hat yadong...@intel.com geschrieben: > From: Yadong Qi > > Add a new option "secdiscard" for block drive. Parse it and > record it in bs->open_flags as bit(BDRV_O_SECDISCARD). > > Add a new BDRV_REQ_SECDISCARD bit for secure discard request > from virtual device. > > For host_device backend: implement by ioctl(BLKSECDISCARD) on > real host block device. > For other backend, no implementation. > > E.g.: > qemu-system-x86_64 \ > ... \ > -drive > file=/dev/mmcblk0p2,if=none,format=raw,discard=on,secdiscard=on,id=sd0 \ > -device virtio-blk-pci,drive=sd0,id=sd0_vblk \ > ... > > Signed-off-by: Yadong Qi > --- > block.c | 46 + > block/blkdebug.c | 5 ++-- > block/blklogwrites.c | 6 ++-- > block/blkreplay.c| 5 ++-- > block/block-backend.c| 15 ++ > block/copy-before-write.c| 5 ++-- > block/copy-on-read.c | 5 ++-- > block/coroutines.h | 6 ++-- > block/file-posix.c | 50 > block/filter-compress.c | 5 ++-- > block/io.c | 5 ++-- > block/mirror.c | 5 ++-- > block/nbd.c | 3 +- > block/nvme.c | 3 +- > block/preallocate.c | 5 ++-- > block/qcow2-refcount.c | 4 +-- > block/qcow2.c| 3 +- > block/raw-format.c | 5 ++-- > block/throttle.c | 5 ++-- > hw/block/virtio-blk.c| 2 +- > hw/ide/core.c| 1 + > hw/nvme/ctrl.c | 3 +- > hw/scsi/scsi-disk.c | 2 +- > include/block/block.h| 13 +++-- > include/block/block_int.h| 2 +- > include/block/raw-aio.h | 4 ++- > include/sysemu/block-backend.h | 1 + > tests/unit/test-block-iothread.c | 9 +++--- > 28 files changed, 171 insertions(+), 52 deletions(-) Notably absent: qapi/block-core.json. Without changing this, the option can't be available in -blockdev, which is the primary option to configure block device backends. This patch seems to contain multiple logical changes that should be split into separate patches: * Adding a flags parameter to .bdrv_co_pdiscard * Support for the new feature in the core block layer (should be done with -blockdev) * Convenience magic for -drive (BDRV_O_SECDISCARD). It's not clear that this should be done at all because the option is really specific to one single block driver (file-posix). I think in your patch, all other block drivers silently ignore the option, which is not what we want. > diff --git a/block.c b/block.c > index 580cb77a70..4f05e96d12 100644 > --- a/block.c > +++ b/block.c > @@ -1128,6 +1128,32 @@ int bdrv_parse_discard_flags(const char *mode, int > *flags) > return 0; > } > > +/** > + * Set open flags for a given secdiscard mode > + * > + * Return 0 on success, -1 if the secdiscard mode was invalid. > + */ > +int bdrv_parse_secdiscard_flags(const char *mode, int *flags, Error **errp) > +{ > +*flags &= ~BDRV_O_SECDISCARD; > + > +if (!strcmp(mode, "off")) { > +/* do nothing */ > +} else if (!strcmp(mode, "on")) { > +if (!(*flags & BDRV_O_UNMAP)) { > +error_setg(errp, "cannot enable secdiscard when discard is " > + "disabled!"); > +return -1; > +} This check has nothing to do with parsing the option, it's validating its value. You don't even need a new function to parse it, because there is already qemu_opt_get_bool(). Duplicating it means only that you're inconsistent with other boolean options, which alternatively accept "yes"/"no", "true"/"false", "y/n". > + > +*flags |= BDRV_O_SECDISCARD; > +} else { > +return -1; > +} > + > +return 0; > +} > + > /** > * Set open flags for a given cache mode > * > @@ -1695,6 +1721,11 @@ QemuOptsList bdrv_runtime_opts = { > .type = QEMU_OPT_STRING, > .help = "discard operation (ignore/off, unmap/on)", > }, > +{ > +.name = BDRV_OPT_SECDISCARD, > +.type = QEMU_OPT_STRING, > +.help = "secure discard operation (off, on)", > +}, > { > .name = BDRV_OPT_FORCE_SHARE, > .type = QEMU_OPT_BOOL, > @@ -1735,6 +1766,7 @@ static int bdrv_open_common(BlockDriverState *bs, > BlockBackend *file, > const char *driver_name = NULL; > const char *node_name = NULL; > const char *discard; > +const char *secdiscard; > QemuOpts *opts; > BlockDriver *drv; > Error *local_err = NULL; > @@ -1829,6 +1861,16 @@ static int bdrv_open_common(BlockDriverState *bs, > BlockBackend *file, > } > } > > + > +secdiscard = qemu_opt_get(opts, BDRV_OPT_SECDISCARD); > +if (secdiscard
Re: [PATCH v4 03/25] assertions for block global state API
@@ -5958,6 +6043,7 @@ const char *bdrv_get_parent_name(const BlockDriverState *bs) /* TODO check what callers really want: bs->node_name or blk_name() */ const char *bdrv_get_device_name(const BlockDriverState *bs) { + assert(qemu_in_main_thread()); return bdrv_get_parent_name(bs) ?: ""; } This function is invoked from qcow2_signal_corruption(), which comes generally from an I/O path. Is it safe to assert that we’re in the main thread here? Well, the question is probably rather whether this needs really be a considered a global-state function, or whether putting it in common or I/O is fine. I believe you’re right given that it invokes bdrv_get_parent_name(), it cannot be thread-safe, but then we’ll have to change qcow2_signal_corruption() so it doesn’t invoke this function. I think that the assertion is wrong here. As long as a single caller is not under BQL, we cannot keep the function in global-state headers. It should go into block-io.h, together with bdrv_get_parent_name and bdrv_get_device_or_node_name (caller of bdrv_get_parent_name). Since bdrv_get_parent_name only scans through the list of bs->parents, as long as we can assert that the parent list is modified only under BQL + drain, it is safe to be read and can be I/O. [...] diff --git a/block/io.c b/block/io.c index bb0a254def..c5d7f8495e 100644 --- a/block/io.c +++ b/block/io.c [...] @@ -544,6 +546,7 @@ void bdrv_drained_end(BlockDriverState *bs) void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter) { + assert(qemu_in_main_thread()); bdrv_do_drained_end(bs, false, NULL, false, drained_end_counter); } Why is bdrv_drained_end an I/O function and this is a GS function, even though it does just a subset? Right this is clearly called in bdrv_drained_end. I don't know how is it possible that no test managed to trigger this assertion... This is actually invoked in both unit and qemu-iothread tests... @@ -586,12 +589,14 @@ void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent) void coroutine_fn bdrv_co_drain(BlockDriverState *bs) { assert(qemu_in_coroutine()); + assert(qemu_in_main_thread()); bdrv_drained_begin(bs); bdrv_drained_end(bs); } void bdrv_drain(BlockDriverState *bs) { + assert(qemu_in_main_thread()); bdrv_drained_begin(bs); bdrv_drained_end(bs); } Why are these GS functions when both bdrv_drained_begin() and bdrv_drained_end() are I/O functions? I can understand making the drain_all functions GS functions, but it seems weird to say it’s an I/O function when a single BDS is drained via bdrv_drained_begin() and bdrv_drained_end(), but not via bdrv_drain(), which just does both. (I can see that there are no I/O path callers, but I still find it strange.) The problem as you saw is on the path callers: bdrv_drain seems to be called only in main thread, while others or similar drains are also coming from I/O. I would say maybe it's a better idea to just put everything I/O, maybe (probably) there are cases not caught by the tests. The only ones in global-state will be: - bdrv_apply_subtree_drain and bdrv_unapply_subtree_drain (called only in .attach and .detach callbacks of BdrvChildClass, run under BQL). - bdrv_drain_all_end_quiesce (called only by bdrv_close thus under BQL). - bdrv_drain_all_begin, bdrv_drain_all_end and bdrv_drain_all (have already BQL assertions) In general, this is the underlying problem with these assertions: if a test doesn't test a specific code path, an unneeded assertion might pass undetected. [...] @@ -2731,6 +2742,7 @@ int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base, int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes, int64_t *pnum, int64_t *map, BlockDriverState **file) { + assert(qemu_in_main_thread()); return bdrv_block_status_above(bs, bdrv_filter_or_cow_bs(bs), offset, bytes, pnum, map, file); } Why is this a GS function as opposed to all other block-status functions? Because of the bdrv_filter_or_cow_bs() call? This function is in block.io but has the assertion. I think it's a proper I/O but I forgot to remove the assertion in my various trial Let me know if you disagree on anything of what I said. Thank you, Emanuele
Re: [PATCH v4 02/25] include/block/block: split header into I/O and global state API
+/* + * I/O API functions. These functions are thread-safe, and therefore + * can run in any thread as long as the thread has called + * aio_context_acquire/release(). + */ + +int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs, + Error **errp); Why is this function here? Naïvely, I would’ve assumed as a graph-modifying function it should be in block-global-state.h. I mean, perhaps it’s thread-safe and then it can fit here, too. Still, it surprises me a bit to find this here. Hanna Agree, I also tested this, it can go in global state. Will fix that. Thank you, Emanuele
Re: [PATCH v4 19/25] block_int-common.h: split function pointers in BlockDriver
On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote: Similar to the header split, also the function pointers in BlockDriver can be split in I/O and global state. Signed-off-by: Emanuele Giuseppe Esposito Reviewed-by: Stefan Hajnoczi --- include/block/block_int-common.h | 458 --- 1 file changed, 237 insertions(+), 221 deletions(-) diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h index 79a3d801d2..9857e775fe 100644 --- a/include/block/block_int-common.h +++ b/include/block/block_int-common.h @@ -96,6 +96,7 @@ typedef struct BdrvTrackedRequest { struct BlockDriver { +/* Fields initialized in struct definition and never changed. */ I find this a bit difficult to understand. Perhaps we could be more verbose to make it easier to read? Like “These fields are initialized when this object is created, and are never changed afterwards” And I’d also add an empty line below to make visually clear that this does not describe the single subsequent field (format_name) but the whole chunk of fields. I also wonder if we could substantiate the claim “never changed afterwards” by making these fields consts. Then again, I suppose technically none of the fields in BlockDriver is supposed to be mutable (except for .list), neither these fields nor the function pointers... Yeah, maybe scratch that idea. const char *format_name; int instance_size; [...] +int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs, + QEMUIOVector *qiov, + int64_t pos); +int coroutine_fn (*bdrv_load_vmstate)(BlockDriverState *bs, + QEMUIOVector *qiov, + int64_t pos); In patch 18, you classified bdrv_co_readv_vmstate() and bdrv_co_writev_vmstate() as I/O functions. They call these methods, though, so something doesn’t seem quite right. Now I also remember you classified the global bdrv_save_vmstate() and bdrv_load_vmstate() functions as GS. I don’t mind either way; AFAIU they are I/O functions that are generally called under the BQL. But the classification should be consistent, of course. + +int (*bdrv_change_backing_file)(BlockDriverState *bs, +const char *backing_file, const char *backing_fmt); + +void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag); +void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked); Above bdrv_is_inserted(), there’s a comment (“removable device specific”) that I think should be duplicated here. + +/* TODO Better pass a option string/QDict/QemuOpts to add any rule? */ +int (*bdrv_debug_breakpoint)(BlockDriverState *bs, const char *event, +const char *tag); +int (*bdrv_debug_remove_breakpoint)(BlockDriverState *bs, +const char *tag); +int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag); +bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag); +void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp); Originally, there was an empty line before bdrv_refresh_limits(), and I’d keep it. [...] +/** + * Try to get @bs's logical and physical block size. + * On success, store them in @bsz and return zero. + * On failure, return negative errno. + */ +int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes *bsz); Originally, there was a comment above this function (“I/O API, even though if it's a filter jumps on parent”) that you added in patch 6 and that I didn’t understand. Given this, I’m not unhappy to see it go again, but now I wonder about its purpose even more... +/** + * Try to get @bs's geometry (cyls, heads, sectors) + * On success, store them in @geo and return 0. + * On failure return -errno. + * Only drivers that want to override guest geometry implement this + * callback; see hd_geometry_guess(). + */ +int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo); (Here, too) [...] +/** + * Register/unregister a buffer for I/O. For example, when the driver is + * interested to know the memory areas that will later be used in iovs, so + * that it can do IOMMU mapping with VFIO etc., in order to get better + * performance. In the case of VFIO drivers, this callback is used to do + * DMA mapping for hot buffers. + */ +void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size); +void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host); +QLIST_ENTRY(BlockDriver) list; This field is interesting because it’s the only mutable field in the whole struct. I think it should be separated from the function pointers above and given a comment like “This field is modified only under the BQL, and is part of the global state”. + +/* + * I/O API functions. These functions
Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k
On Thu, Nov 11, 2021 at 06:54:03PM +0100, Christian Schoenebeck wrote: > On Donnerstag, 11. November 2021 17:31:52 CET Stefan Hajnoczi wrote: > > On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote: > > > On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote: > > > > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote: > > > > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote: > > > > > As you are apparently reluctant for changing the virtio specs, what > > > > > about > > > > > introducing those discussed virtio capabalities either as experimental > > > > > ones > > > > > without specs changes, or even just as 9p specific device capabilities > > > > > for > > > > > now. I mean those could be revoked on both sides at any time anyway. > > > > > > > > I would like to understand the root cause before making changes. > > > > > > > > "It's faster when I do X" is useful information but it doesn't > > > > necessarily mean doing X is the solution. The "it's faster when I do X > > > > because Y" part is missing in my mind. Once there is evidence that shows > > > > Y then it will be clearer if X is a good solution, if there's a more > > > > general solution, or if it was just a side-effect. > > > > > > I think I made it clear that the root cause of the observed performance > > > gain with rising transmission size is latency (and also that performance > > > is not the only reason for addressing this queue size issue). > > > > > > Each request roundtrip has a certain minimum latency, the virtio ring > > > alone > > > has its latency, plus latency of the controller portion of the file server > > > (e.g. permissions, sandbox checks, file IDs) that is executed with *every* > > > request, plus latency of dispatching the request handling between threads > > > several times back and forth (also for each request). > > > > > > Therefore when you split large payloads (e.g. reading a large file) into > > > smaller n amount of chunks, then that individual latency per request > > > accumulates to n times the individual latency, eventually leading to > > > degraded transmission speed as those requests are serialized. > > > > It's easy to increase the blocksize in benchmarks, but real applications > > offer less control over the I/O pattern. If latency in the device > > implementation (QEMU) is the root cause then reduce the latency to speed > > up all applications, even those that cannot send huge requests. > > Which I did, still do, and also mentioned before, e.g.: > > 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 9pfs: reduce latency of Twalk > 0c4356ba7dafc8ecb5877a42fc0d68d45ccf5951 9pfs: T_readdir latency optimization > > Reducing overall latency is a process that is ongoing and will still take a > very long development time. Not because of me, but because of lack of > reviewers. And even then, it does not make the effort to support higher > transmission sizes obsolete. > > > One idea is request merging on the QEMU side. If the application sends > > 10 sequential read or write requests, coalesce them together before the > > main part of request processing begins in the device. Process a single > > large request to spread the cost of the file server over the 10 > > requests. (virtio-blk has request merging to help with the cost of lots > > of small qcow2 I/O requests.) The cool thing about this is that the > > guest does not need to change its I/O pattern to benefit from the > > optimization. > > > > Stefan > > Ok, don't get me wrong: I appreciate that you are suggesting approaches that > could improve things. But I could already hand you over a huge list of mine. > The limiting factor here is not the lack of ideas of what could be improved, > but rather the lack of people helping out actively on 9p side: > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06452.html > > The situation on kernel side is the same. I already have a huge list of what > could & should be improved. But there is basically no reviewer for 9p patches > on Linux kernel side either. > > The much I appreciate suggestions of what could be improved, I would > appreciate much more if there was *anybody* actively assisting as well. In > the > time being I have to work the list down in small patch chunks, priority based. I see request merging as an alternative to this patch series, not as an additional idea. My thoughts behind this is that request merging is less work than this patch series and more broadly applicable. It would be easy to merge (no idea how easy it is to implement though) in QEMU's virtio-9p device implementation, does not require changes across the stack, and benefits applications that can't change their I/O pattern to take advantage of huge requests. There is a risk that request merging won't pan out, it could have worse performance than submitting huge requests. Stefan signature.asc Description: PGP signature
Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support
Am 26.10.21 um 16:53 schrieb Peter Lieven: Am 25.10.21 um 14:58 schrieb Kevin Wolf: Am 25.10.2021 um 13:39 hat Peter Lieven geschrieben: Am 16.09.21 um 14:34 schrieb Peter Lieven: Am 09.07.21 um 12:21 schrieb Kevin Wolf: Am 08.07.2021 um 20:23 hat Peter Lieven geschrieben: Am 08.07.2021 um 14:18 schrieb Kevin Wolf : Am 07.07.2021 um 20:13 hat Peter Lieven geschrieben: Am 06.07.2021 um 17:25 schrieb Kevin Wolf : Am 06.07.2021 um 16:55 hat Peter Lieven geschrieben: I will have a decent look after my vacation. Sounds good, thanks. Enjoy your vacation! As I had to fire up my laptop to look into another issue anyway, I have sent two patches for updating MAINTAINERS and to fix the int vs. bool mix for task->complete. I think you need to reevaluate your definition of vacation. ;-) Lets talk about this when the kids are grown up. Sometimes sending patches can be quite relaxing :-) Heh, fair enough. :-) But thanks anyway. As Paolos fix (5f50be9b5) is relatively new and there are maybe other non obvious problems when removing the BH indirection and we are close to soft freeze I would leave the BH removal change for 6.2. Sure, code cleanups aren't urgent. Isn’t the indirection also a slight performance drop? Yeah, I guess technically it is, though I doubt it's measurable. As promised I was trying to remove the indirection through the BH after Qemu 6.1 release. However, if I remove the BH I run into the following assertion while running some fio tests: qemu-system-x86_64: ../block/block-backend.c:1197: blk_wait_while_drained: Assertion `blk->in_flight > 0' failed. Any idea? This is what I changed: diff --git a/block/rbd.c b/block/rbd.c index 3cb24f9981..bc1dbc20f7 100644 --- a/block/rbd.c +++ b/block/rbd.c @@ -1063,13 +1063,6 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size) return 0; } -static void qemu_rbd_finish_bh(void *opaque) -{ - RBDTask *task = opaque; - task->complete = true; - aio_co_wake(task->co); -} - /* * This is the completion callback function for all rbd aio calls * started from qemu_rbd_start_co(). @@ -1083,8 +1076,8 @@ static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task) { task->ret = rbd_aio_get_return_value(c); rbd_aio_release(c); - aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs), - qemu_rbd_finish_bh, task); + task->complete = true; + aio_co_wake(task->co); } Kevin, Paolo, any idea? Not really, I don't see the connection between both places. Do you have a stack trace for the crash? The crash seems not to be limited to that assertion. I have also seen: qemu-system-x86_64: ../block/block-backend.c:1497: blk_aio_write_entry: Assertion `!qiov || qiov->size == acb->bytes' failed. Altough harder to trigger I catch this backtrace in gdb: qemu-system-x86_64: ../block/block-backend.c:1497: blk_aio_write_entry: Assertion `!qiov || qiov->size == acb->bytes' failed. [Wechseln zu Thread 0x77fa8f40 (LWP 17852)] Thread 1 "qemu-system-x86" hit Breakpoint 1, __GI_abort () at abort.c:49 49 abort.c: Datei oder Verzeichnis nicht gefunden. (gdb) bt #0 0x742567e0 in __GI_abort () at abort.c:49 #1 0x7424648a in __assert_fail_base (fmt=0x743cd750 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x55e638e0 "!qiov || qiov->size == acb->bytes", file=file@entry=0x55e634b2 "../block/block-backend.c", line=line@entry=1497, function=function@entry=0x55e63b20 <__PRETTY_FUNCTION__.32450> "blk_aio_write_entry") at assert.c:92 #2 0x74246502 in __GI___assert_fail (assertion=assertion@entry=0x55e638e0 "!qiov || qiov->size == acb->bytes", file=file@entry=0x55e634b2 "../block/block-backend.c", line=line@entry=1497, function=function@entry=0x55e63b20 <__PRETTY_FUNCTION__.32450> "blk_aio_write_entry") at assert.c:101 #3 0x55becc78 in blk_aio_write_entry (opaque=0x56b534f0) at ../block/block-backend.c:1497 #4 0x55cf0e4c in coroutine_trampoline (i0=, i1=) at ../util/coroutine-ucontext.c:173 #5 0x7426e7b0 in __start_context () at /lib/x86_64-linux-gnu/libc.so.6 #6 0x7fffd5a0 in () #7 0x in () any ideas? Or should we just abandon the idea of removing the BH? Peter
Re: [PATCH 2/2] virtio-blk: support BLKSECDISCARD
On Mon, Nov 15, 2021 at 12:52:00PM +0800, yadong...@intel.com wrote: > From: Yadong Qi > > Add new virtio feature: VIRTIO_BLK_F_SECDISCARD. > Add new virtio command: VIRTIO_BLK_T_SECDISCARD. > > This feature is disabled by default, it will check the backend > bs->open_flags & BDRV_O_SECDISCARD, enable it if BDRV_O_SECDISCARD > is supported. > > Signed-off-by: Yadong Qi > --- > hw/block/virtio-blk.c | 26 + > include/standard-headers/linux/virtio_blk.h | 4 Any changes to standard headers need to go to linux and virtio TC. > 2 files changed, 26 insertions(+), 4 deletions(-) > > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c > index dbc4c5a3cd..7bc3484521 100644 > --- a/hw/block/virtio-blk.c > +++ b/hw/block/virtio-blk.c > @@ -536,7 +536,8 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev, > } > > static uint8_t virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req, > -struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes) > +struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes, > +bool is_secdiscard) > { > VirtIOBlock *s = req->dev; > VirtIODevice *vdev = VIRTIO_DEVICE(s); > @@ -577,8 +578,8 @@ static uint8_t > virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req, > goto err; > } > > +int blk_aio_flags = 0; > if (is_write_zeroes) { /* VIRTIO_BLK_T_WRITE_ZEROES */ > -int blk_aio_flags = 0; > > if (flags & VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP) { > blk_aio_flags |= BDRV_REQ_MAY_UNMAP; > @@ -600,7 +601,12 @@ static uint8_t > virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req, > goto err; > } > > -blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes, 0, > +if (is_secdiscard) { > +blk_aio_flags |= BDRV_REQ_SECDISCARD; > +} > + > +blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes, > + blk_aio_flags, > virtio_blk_discard_write_zeroes_complete, req); > } > > @@ -622,6 +628,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, > MultiReqBuffer *mrb) > unsigned out_num = req->elem.out_num; > VirtIOBlock *s = req->dev; > VirtIODevice *vdev = VIRTIO_DEVICE(s); > +bool is_secdiscard = false; > > if (req->elem.out_num < 1 || req->elem.in_num < 1) { > virtio_error(vdev, "virtio-blk missing headers"); > @@ -722,6 +729,9 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, > MultiReqBuffer *mrb) > * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch > statement, > * so we must mask it for these requests, then we will check if it is > set. > */ > +case VIRTIO_BLK_T_SECDISCARD & ~VIRTIO_BLK_T_OUT: > +is_secdiscard = true; > +__attribute__((fallthrough)); > case VIRTIO_BLK_T_DISCARD & ~VIRTIO_BLK_T_OUT: > case VIRTIO_BLK_T_WRITE_ZEROES & ~VIRTIO_BLK_T_OUT: > { > @@ -752,7 +762,8 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, > MultiReqBuffer *mrb) > } > > err_status = virtio_blk_handle_discard_write_zeroes(req, _hdr, > -is_write_zeroes); > +is_write_zeroes, > +is_secdiscard); > if (err_status != VIRTIO_BLK_S_OK) { > virtio_blk_req_complete(req, err_status); > virtio_blk_free_request(req); > @@ -1201,6 +1212,11 @@ static void virtio_blk_device_realize(DeviceState > *dev, Error **errp) > return; > } > > +if (blk_get_flags(conf->conf.blk) & BDRV_O_SECDISCARD) > +virtio_add_feature(>host_features, VIRTIO_BLK_F_SECDISCARD); > +else > +virtio_clear_feature(>host_features, VIRTIO_BLK_F_SECDISCARD); > + > if (virtio_has_feature(s->host_features, VIRTIO_BLK_F_WRITE_ZEROES) && > (!conf->max_write_zeroes_sectors || > conf->max_write_zeroes_sectors > BDRV_REQUEST_MAX_SECTORS)) { > @@ -1307,6 +1323,8 @@ static Property virtio_blk_properties[] = { > conf.report_discard_granularity, true), > DEFINE_PROP_BIT64("write-zeroes", VirtIOBlock, host_features, >VIRTIO_BLK_F_WRITE_ZEROES, true), > +DEFINE_PROP_BIT64("secdiscard", VirtIOBlock, host_features, > + VIRTIO_BLK_F_SECDISCARD, false), > DEFINE_PROP_UINT32("max-discard-sectors", VirtIOBlock, > conf.max_discard_sectors, BDRV_REQUEST_MAX_SECTORS), > DEFINE_PROP_UINT32("max-write-zeroes-sectors", VirtIOBlock, > diff --git a/include/standard-headers/linux/virtio_blk.h > b/include/standard-headers/linux/virtio_blk.h > index 2dcc90826a..c55a07840c 100644 > --- a/include/standard-headers/linux/virtio_blk.h > +++
Re: [PATCH 2/2] virtio-blk: support BLKSECDISCARD
On Mon, Nov 15, 2021 at 12:52:00PM +0800, yadong...@intel.com wrote: From: Yadong Qi Add new virtio feature: VIRTIO_BLK_F_SECDISCARD. Add new virtio command: VIRTIO_BLK_T_SECDISCARD. Has a proposal been sent out yet to virtio-comment mailing list for discussing these specification changes? This feature is disabled by default, it will check the backend bs->open_flags & BDRV_O_SECDISCARD, enable it if BDRV_O_SECDISCARD is supported. Signed-off-by: Yadong Qi --- hw/block/virtio-blk.c | 26 + include/standard-headers/linux/virtio_blk.h | 4 2 files changed, 26 insertions(+), 4 deletions(-) diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index dbc4c5a3cd..7bc3484521 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -536,7 +536,8 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev, } static uint8_t virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req, -struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes) +struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes, +bool is_secdiscard) Since the function now handles 3 commands, I'm thinking if it's better to pass the command directly and then make a switch instead of using 2 booleans. { VirtIOBlock *s = req->dev; VirtIODevice *vdev = VIRTIO_DEVICE(s); @@ -577,8 +578,8 @@ static uint8_t virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req, goto err; } +int blk_aio_flags = 0; Maybe better to move it to the beginning of the function. if (is_write_zeroes) { /* VIRTIO_BLK_T_WRITE_ZEROES */ -int blk_aio_flags = 0; if (flags & VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP) { blk_aio_flags |= BDRV_REQ_MAY_UNMAP; @@ -600,7 +601,12 @@ static uint8_t virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req, goto err; } -blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes, 0, +if (is_secdiscard) { +blk_aio_flags |= BDRV_REQ_SECDISCARD; +} + +blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes, + blk_aio_flags, virtio_blk_discard_write_zeroes_complete, req); } @@ -622,6 +628,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb) unsigned out_num = req->elem.out_num; VirtIOBlock *s = req->dev; VirtIODevice *vdev = VIRTIO_DEVICE(s); +bool is_secdiscard = false; if (req->elem.out_num < 1 || req->elem.in_num < 1) { virtio_error(vdev, "virtio-blk missing headers"); @@ -722,6 +729,9 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb) * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch statement, * so we must mask it for these requests, then we will check if it is set. */ +case VIRTIO_BLK_T_SECDISCARD & ~VIRTIO_BLK_T_OUT: +is_secdiscard = true; +__attribute__((fallthrough)); We can use QEMU_FALLTHROUGH here. case VIRTIO_BLK_T_DISCARD & ~VIRTIO_BLK_T_OUT: case VIRTIO_BLK_T_WRITE_ZEROES & ~VIRTIO_BLK_T_OUT: { @@ -752,7 +762,8 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb) } err_status = virtio_blk_handle_discard_write_zeroes(req, _hdr, -is_write_zeroes); +is_write_zeroes, +is_secdiscard); if (err_status != VIRTIO_BLK_S_OK) { virtio_blk_req_complete(req, err_status); virtio_blk_free_request(req); @@ -1201,6 +1212,11 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp) return; } +if (blk_get_flags(conf->conf.blk) & BDRV_O_SECDISCARD) +virtio_add_feature(>host_features, VIRTIO_BLK_F_SECDISCARD); +else +virtio_clear_feature(>host_features, VIRTIO_BLK_F_SECDISCARD); + IIUC here we set or not the feature if BDRV_O_SECDISCARD is set. Should we keep it disabled if "secdiscard" is false? (e.g. to avoid migration problems) Otherwise what is the purpose of the "secdiscard" property? Thanks, Stefano