Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()

2021-11-15 Thread Markus Armbruster
Philippe Mathieu-Daudé  writes:

> On 11/15/21 16:57, Markus Armbruster wrote:
>> Philippe Mathieu-Daudé  writes:
>>> On 11/15/21 13:55, Markus Armbruster wrote:
 drive_get_next() is basically a bad idea.  It returns the "next" block
 backend of a certain interface type.  "Next" means bus=0,unit=N, where
 subsequent calls count N up from zero, per interface type.

 This lets you define unit numbers implicitly by execution order.  If the
 order changes, or new calls appear "in the middle", unit numbers change.
 ABI break.  Hard to spot in review.

 Explicit is better than implicit: use drive_get() directly.

 Signed-off-by: Markus Armbruster 
 ---
>
 @@ -435,11 +438,13 @@ static void aspeed_machine_init(MachineState 
 *machine)
  }
  
  for (i = 0; i < bmc->soc.sdhci.num_slots; i++) {
 -sdhci_attach_drive(>soc.sdhci.slots[i], 
 drive_get_next(IF_SD));
 +sdhci_attach_drive(>soc.sdhci.slots[i],
 +   drive_get(IF_SD, 0, i));
>>>
>>> If we put SD on bus #0, ...
>>>
  }
  
  if (bmc->soc.emmc.num_slots) {
 -sdhci_attach_drive(>soc.emmc.slots[0], 
 drive_get_next(IF_SD));
 +sdhci_attach_drive(>soc.emmc.slots[0],
 +   drive_get(IF_SD, 0, bmc->soc.sdhci.num_slots));
>>>
>>> ... we'd want to put eMMC on bus #1
>> 
>> Using separate buses for different kinds of devices would be neater, but
>> it also would be an incompatible change.  This patch keeps existing
>> bus/unit numbers working.  drive_get_next() can only use bus 0.
>> 
>>>  but I see having eMMC cards on a
>>> IF_SD bus as a bug, since these cards are soldered on the board.
>> 
>> IF_SD is not a bus, it's an "block interface type", which is really just
>> a user interface thing.
>
> Why are we discriminating by "block interface type" then?
>
> What is the difference between "block interfaces"? I see a block drive
> as a generic unit, usable on multiple hardware devices.
>
> I never really understood how this "block interface type" helps
> developers and users. I thought BlockInterfaceType and DriveInfo
> were legacy / deprecated APIs we want to get rid of; and we would
> come up with a replacement API using BlockDeviceInfo or providing
> a BlockFrontend state of the art object.
> Anyway, I suppose the explanation is buried in the git history
> before the last 8 years. I need to keep reading.

In the beginning (v0.4.2), there was -hda and -hdb, and life was simple.

Then there was -hdc, -hdd, -cdrom (v0.5.1), -fda, -fdb (v0.6.0),
-mtdblock, -sd, -pflash (v0.9.1).

All these options do two things: they create a block backend, and they
request the board to create a certain block frontend for it, similar to
other options of this vintage, like -serial, -parallel, and -net.
Boards generally ignore requests they don't understand, but that's just
sloppiness.

For each set of related options, there was a global variable holding the
requests: bs_table[] for -hda, -hdb, -hdc, -hdd, -cdrom; fd_table[]
-fda, -fdb; mtd_bdrv for -mtd; sd_drv for -ds; pflash_table[] for
-pflash.

The options replaced prior ones, except for -pflash, which appended to
its table.

bs_table[]'s index had a peculiar meaning: it's bus * MAX_IDE_DEVS +
unit.  This ensures that -hda (index 0) goes on IDE bus 0 as unit 0;
-hdb on bus 0, unit 1; -hdc on 1, 0; -hdc on 1, 1.

Life was now complicated enough for a generalization (v0.9.1), so there
was -drive (v0.9.1).  All the variables holding requests were fused into
drives_table[].  Table elements are identified by (type, bus, unit),
where type is an enum whose members correspond to the old global
variables: IF_IDE for bs_table[], IF_FLOPPY for fd_table[], and so
forth.  So:

-hda becomes type = IF_IDE, bus = 0, unit = 0
-hdb becomes type = IF_IDE, bus = 0, unit = 1
...
-sd becomes type = IF_SD, bus = 0, unit = 0
1st -pflash becomes type = IF_PFLASH, bus = 0, unit = 0
2nd -pflash becomes type = IF_PFLASH, bus = 0, unit = 1
...

Other mappings from old to new global variables would have been
possible.  I figure this one was chosen because it comes with a
reasonable user interface.  Identifying block devices by (interface
type, bus, unit) is certainly nicer than by index in bs_table[].  Since
bus and/or unit make sense only with some interface types, they are
optional.

Things calmed down for a couple of years, until -device appeared
(v0.12).  Now we needed a way to define just a backend, without
requesting a frontend from the board.  Instead of inventing a new
option, this became IF_NONE, with meaningless bus and unit.

Over the next years, the block layer outgrew -drive's limited
capabilities to define frontends.  -blockdev appeard (v1.7.0) and
matured over several releases.  I don't remember exactly when it became
stable, relegating -drive if=none to legacy status.

What's *not* 

Re: [PATCH for 6.2 v4] nbd/server: Add --selinux-label option

2021-11-15 Thread Thomas Huth

On 15/11/2021 21.29, Eric Blake wrote:

From: "Richard W.M. Jones" 

Under SELinux, Unix domain sockets have two labels.  One is on the
disk and can be set with commands such as chcon(1).  There is a
different label stored in memory (called the process label).  This can
only be set by the process creating the socket.  When using SELinux +
SVirt and wanting qemu to be able to connect to a qemu-nbd instance,
you must set both labels correctly first.

For qemu-nbd the options to set the second label are awkward.  You can
create the socket in a wrapper program and then exec into qemu-nbd.
Or you could try something with LD_PRELOAD.

This commit adds the ability to set the label straightforwardly on the
command line, via the new --selinux-label flag.  (The name of the flag
is the same as the equivalent nbdkit option.)

...

@@ -3430,6 +3437,7 @@ summary_info += {'libdaxctl support': libdaxctl}
  summary_info += {'libudev':   libudev}
  # Dummy dependency, keep .found()
  summary_info += {'FUSE lseek':fuse_lseek.found()}
+summary_info += {'selinux':   selinux.found()}


It's nicer if you do it like this (i.e. without the .found()):

summary_info += {'selinux':   selinux}

... then meson prints out the version of the library, too.

Apart from that, patch looks fine to me:

Reviewed-by: Thomas Huth 




RE: [PATCH 2/2] virtio-blk: support BLKSECDISCARD

2021-11-15 Thread Qi, Yadong
> The Linux block layer shares the DISCARD queue limits with SECDISCARD.
> That's different from BLKZEROOUT (QEMU's WRITE_ZEROES). This is a Linux
> implementation detail but I guess virtio-blk can share the DISCARD limits with
> SECDISCARD too...
> 
> > @@ -622,6 +628,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq
> *req, MultiReqBuffer *mrb)
> >  unsigned out_num = req->elem.out_num;
> >  VirtIOBlock *s = req->dev;
> >  VirtIODevice *vdev = VIRTIO_DEVICE(s);
> > +bool is_secdiscard = false;
> >
> >  if (req->elem.out_num < 1 || req->elem.in_num < 1) {
> >  virtio_error(vdev, "virtio-blk missing headers"); @@ -722,6
> > +729,9 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req,
> MultiReqBuffer *mrb)
> >   * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch 
> > statement,
> >   * so we must mask it for these requests, then we will check if it is 
> > set.
> >   */
> > +case VIRTIO_BLK_T_SECDISCARD & ~VIRTIO_BLK_T_OUT:
> > +is_secdiscard = true;
> > +__attribute__((fallthrough));
> 
> The DISCARD case doesn't use __attribute__((fallthrough)) so this is 
> inconsistent.
> QEMU doesn't use __attribute__((fallthrough)) so I suggest dropping this.
Sure, will try to drop the fallthrough case.

> 
> > diff --git a/include/standard-headers/linux/virtio_blk.h
> > b/include/standard-headers/linux/virtio_blk.h
> > index 2dcc90826a..c55a07840c 100644
> > --- a/include/standard-headers/linux/virtio_blk.h
> > +++ b/include/standard-headers/linux/virtio_blk.h
> > @@ -40,6 +40,7 @@
> >  #define VIRTIO_BLK_F_MQ12  /* support more than one vq
> */
> >  #define VIRTIO_BLK_F_DISCARD   13  /* DISCARD is supported */
> >  #define VIRTIO_BLK_F_WRITE_ZEROES  14  /* WRITE ZEROES is
> supported */
> > +#define VIRTIO_BLK_F_SECDISCARD15  /* WRITE ZEROES is supported
> */
> 
> The comment is copy-pasted from WRITE_ZEROES.
Will fix it.



RE: [PATCH 1/2] block:hdev: support BLKSECDISCARD

2021-11-15 Thread Qi, Yadong
> > Add a new option "secdiscard" for block drive. Parse it and record it
> > in bs->open_flags as bit(BDRV_O_SECDISCARD).
> >
> > Add a new BDRV_REQ_SECDISCARD bit for secure discard request from
> > virtual device.
> >
> > For host_device backend: implement by ioctl(BLKSECDISCARD) on real
> > host block device.
> > For other backend, no implementation.
> >
> > E.g.:
> > qemu-system-x86_64 \
> > ... \
> > -drive
> file=/dev/mmcblk0p2,if=none,format=raw,discard=on,secdiscard=on,id=sd0 \
> > -device virtio-blk-pci,drive=sd0,id=sd0_vblk \
> > ...
> 
> I'm curious why there is explicit control over this feature (-drive
> secdiscard=on|off). For example, is there a reason why users would want to
> disable secdiscard on a drive that supports it? I guess there is no way to 
> emulate
> it correctly so secdiscard=on on a drive that doesn't support it produces an 
> error?

Yes. AFAIK, there is no way to emulate a "secure" discard. But it seems also 
hard to
detect whether a host drive support secdiscard unless it really perform a real
secdiscard. So I choose to add an option for user to enable it for virtual 
block.
There is an assumption that user knows whether host support secdiscard.


Best Regard
Yadong



RE: [PATCH 1/2] block:hdev: support BLKSECDISCARD

2021-11-15 Thread Qi, Yadong
> 
> Notably absent: qapi/block-core.json. Without changing this, the option can't 
> be
> available in -blockdev, which is the primary option to configure block device
> backends.
> 
> This patch seems to contain multiple logical changes that should be split into
> separate patches:
> 
> * Adding a flags parameter to .bdrv_co_pdiscard
> 
> * Support for the new feature in the core block layer (should be done
>   with -blockdev)
> 
> * Convenience magic for -drive (BDRV_O_SECDISCARD). It's not clear that
>   this should be done at all because the option is really specific to
>   one single block driver (file-posix). I think in your patch, all
>   other block drivers silently ignore the option, which is not what we
>   want.
Sorry for the absent for -blockdev. Will try add that.
Also I will try to split the patches.
And for the BDRV_O_SECDISCARD, it is specific for file-posix.c(host_device). 
Maybe
I need to add the option only for file-posix.c.

> 
> > diff --git a/block.c b/block.c
> > index 580cb77a70..4f05e96d12 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -1128,6 +1128,32 @@ int bdrv_parse_discard_flags(const char *mode,
> int *flags)
> >  return 0;
> >  }
> >
> > +/**
> > + * Set open flags for a given secdiscard mode
> > + *
> > + * Return 0 on success, -1 if the secdiscard mode was invalid.
> > + */
> > +int bdrv_parse_secdiscard_flags(const char *mode, int *flags, Error
> > +**errp) {
> > +*flags &= ~BDRV_O_SECDISCARD;
> > +
> > +if (!strcmp(mode, "off")) {
> > +/* do nothing */
> > +} else if (!strcmp(mode, "on")) {
> > +if (!(*flags & BDRV_O_UNMAP)) {
> > +error_setg(errp, "cannot enable secdiscard when discard is "
> > + "disabled!");
> > +return -1;
> > +}
> 
> This check has nothing to do with parsing the option, it's validating its 
> value.
> 
> You don't even need a new function to parse it, because there is already
> qemu_opt_get_bool(). Duplicating it means only that you're inconsistent with
> other boolean options, which alternatively accept "yes"/"no", "true"/"false",
> "y/n".
Sure. Will use qemu_opt_get_bool() instead.

> 
> > +
> > +*flags |= BDRV_O_SECDISCARD;
> > +} else {
> > +return -1;
> > +}
> > +
> > +return 0;
> > +}
> > +
> >  /**
> >   * Set open flags for a given cache mode
> >   *
> > @@ -1695,6 +1721,11 @@ QemuOptsList bdrv_runtime_opts = {
> >  .type = QEMU_OPT_STRING,
> >  .help = "discard operation (ignore/off, unmap/on)",
> >  },
> > +{
> > +.name = BDRV_OPT_SECDISCARD,
> > +.type = QEMU_OPT_STRING,
> > +.help = "secure discard operation (off, on)",
> > +},
> >  {
> >  .name = BDRV_OPT_FORCE_SHARE,
> >  .type = QEMU_OPT_BOOL,
> > @@ -1735,6 +1766,7 @@ static int bdrv_open_common(BlockDriverState *bs,
> BlockBackend *file,
> >  const char *driver_name = NULL;
> >  const char *node_name = NULL;
> >  const char *discard;
> > +const char *secdiscard;
> >  QemuOpts *opts;
> >  BlockDriver *drv;
> >  Error *local_err = NULL;
> > @@ -1829,6 +1861,16 @@ static int bdrv_open_common(BlockDriverState
> *bs, BlockBackend *file,
> >  }
> >  }
> >
> > +
> > +secdiscard = qemu_opt_get(opts, BDRV_OPT_SECDISCARD);
> > +if (secdiscard != NULL) {
> > +if (bdrv_parse_secdiscard_flags(secdiscard, >open_flags,
> > +errp) != 0) {
> > +ret = -EINVAL;
> > +goto fail_opts;
> > +}
> > +}
> > +
> >  bs->detect_zeroes =
> >  bdrv_parse_detect_zeroes(opts, bs->open_flags, _err);
> >  if (local_err) {
> > @@ -3685,6 +3727,10 @@ static BlockDriverState *bdrv_open_inherit(const
> char *filename,
> > , options, flags, options);
> >  }
> >
> > +if (g_strcmp0(qdict_get_try_str(options, BDRV_OPT_SECDISCARD), "on")) {
> > +flags |= BDRV_O_SECDISCARD;
> 
> Indentation is off.
Will fix it.

> 
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -160,6 +160,7 @@ typedef struct BDRVRawState {
> >  bool is_xfs:1;
> >  #endif
> >  bool has_discard:1;
> > +bool has_secdiscard:1;
> >  bool has_write_zeroes:1;
> >  bool discard_zeroes:1;
> >  bool use_linux_aio:1;
> 
> has_secdiscard is only set to false in raw_open_common() and never changed or
> used.
Will remove it.

> 
> > @@ -727,6 +728,7 @@ static int raw_open_common(BlockDriverState *bs,
> > QDict *options,  #endif /* !defined(CONFIG_LINUX_IO_URING) */
> >
> >  s->has_discard = true;
> > +s->has_secdiscard = false;
> >  s->has_write_zeroes = true;
> >  if ((bs->open_flags & BDRV_O_NOCACHE) != 0 && !dio_byte_aligned(s->fd))
> {
> >  s->needs_alignment = true;
> > @@ -765,6 +767,7 @@ static int raw_open_common(BlockDriverState *bs,
> QDict *options,
> >  

RE: [PATCH 2/2] virtio-blk: support BLKSECDISCARD

2021-11-15 Thread Qi, Yadong
> On Mon, Nov 15, 2021 at 12:52:00PM +0800, yadong...@intel.com wrote:
> > From: Yadong Qi 
> >
> > Add new virtio feature: VIRTIO_BLK_F_SECDISCARD.
> > Add new virtio command: VIRTIO_BLK_T_SECDISCARD.
> >
> > This feature is disabled by default, it will check the backend
> > bs->open_flags & BDRV_O_SECDISCARD, enable it if BDRV_O_SECDISCARD
> > is supported.
> >
> > Signed-off-by: Yadong Qi 
> > ---
> >  hw/block/virtio-blk.c   | 26 +
> >  include/standard-headers/linux/virtio_blk.h |  4 
> 
> 
> Any changes to standard headers need to go to linux and virtio TC.
Sure. I will draft proposal to virtio-comment for review.





RE: [PATCH 2/2] virtio-blk: support BLKSECDISCARD

2021-11-15 Thread Qi, Yadong
> >Add new virtio feature: VIRTIO_BLK_F_SECDISCARD.
> >Add new virtio command: VIRTIO_BLK_T_SECDISCARD.
> 
> Has a proposal been sent out yet to virtio-comment mailing list for discussing
> these specification changes?
> 
Not yet. I will draft a proposal to virtio-comment if no big concern of this 
patch
>From maintainer.
> >
> >diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index
> >dbc4c5a3cd..7bc3484521 100644
> >--- a/hw/block/virtio-blk.c
> >+++ b/hw/block/virtio-blk.c
> >@@ -536,7 +536,8 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock
> >*dev,  }
> >
> > static uint8_t virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req,
> >-struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes)
> >+struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes,
> >+bool is_secdiscard)
> 
> Since the function now handles 3 commands, I'm thinking if it's better to pass
> the command directly and then make a switch instead of using 2 booleans.
> 
Make sense.

> > {
> > VirtIOBlock *s = req->dev;
> > VirtIODevice *vdev = VIRTIO_DEVICE(s); @@ -577,8 +578,8 @@ static
> >uint8_t virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req,
> > goto err;
> > }
> >
> >+int blk_aio_flags = 0;
> 
> Maybe better to move it to the beginning of the function.
Sure.

> 
> >
> >-blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes, 0,
> >+if (is_secdiscard) {
> >+blk_aio_flags |= BDRV_REQ_SECDISCARD;
> >+}
> >+
> >+blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes,
> >+ blk_aio_flags,
> >  virtio_blk_discard_write_zeroes_complete, req);
> > }
> >
> >@@ -622,6 +628,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq
> *req, MultiReqBuffer *mrb)
> > unsigned out_num = req->elem.out_num;
> > VirtIOBlock *s = req->dev;
> > VirtIODevice *vdev = VIRTIO_DEVICE(s);
> >+bool is_secdiscard = false;
> >
> > if (req->elem.out_num < 1 || req->elem.in_num < 1) {
> > virtio_error(vdev, "virtio-blk missing headers"); @@ -722,6
> >+729,9 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req,
> MultiReqBuffer *mrb)
> >  * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch 
> > statement,
> >  * so we must mask it for these requests, then we will check if it is 
> > set.
> >  */
> >+case VIRTIO_BLK_T_SECDISCARD & ~VIRTIO_BLK_T_OUT:
> >+is_secdiscard = true;
> >+__attribute__((fallthrough));
> 
> We can use QEMU_FALLTHROUGH here.
Sure.

> 
> >
> > err_status = virtio_blk_handle_discard_write_zeroes(req, _hdr,
> >-
> >is_write_zeroes);
> >+is_write_zeroes,
> >+
> >+ is_secdiscard);


> >
> >+if (blk_get_flags(conf->conf.blk) & BDRV_O_SECDISCARD)
> >+virtio_add_feature(>host_features,
> >VIRTIO_BLK_F_SECDISCARD);
> >+else
> >+virtio_clear_feature(>host_features,
> >+ VIRTIO_BLK_F_SECDISCARD);
> >+
> 
> IIUC here we set or not the feature if BDRV_O_SECDISCARD is set.
> 
> Should we keep it disabled if "secdiscard" is false? (e.g. to avoid migration
> problems)
Yes, BDRV_O_SECDISCARD=(secdiscard=="on") ? 1 : 0;

> 
> Otherwise what is the purpose of the "secdiscard" property?
I cannot find a good method to detect whether host device support BLKSECDISCARD.
So I add this "secdiscard" property to explicitly enable this feature.

Best Regard
Yadong
> 
> Thanks,
> Stefano




Re: [PATCH for 6.2] nbd/server: Silence clang sanitizer warning

2021-11-15 Thread Philippe Mathieu-Daudé
On 11/15/21 23:39, Eric Blake wrote:
> clang's sanitizer is picky: memset(NULL, x, 0) is technically
> undefined behavior, even though no sane implementation of memset()
> deferences the NULL.  Caught by the nbd-qemu-allocation iotest.
> 
> The alternative to checking before each memset is to instead force an
> allocation of 1 element instead of g_new0(type, 0)'s behavior of
> returning NULL for a 0-length array.
> 
> Reported-by: Peter Maydell 
> Fixes: 3b1f244c59 (nbd: Allow export of multiple bitmaps for one device)
> Signed-off-by: Eric Blake 
> ---
>  nbd/server.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé 




[PATCH for 6.2] nbd/server: Silence clang sanitizer warning

2021-11-15 Thread Eric Blake
clang's sanitizer is picky: memset(NULL, x, 0) is technically
undefined behavior, even though no sane implementation of memset()
deferences the NULL.  Caught by the nbd-qemu-allocation iotest.

The alternative to checking before each memset is to instead force an
allocation of 1 element instead of g_new0(type, 0)'s behavior of
returning NULL for a 0-length array.

Reported-by: Peter Maydell 
Fixes: 3b1f244c59 (nbd: Allow export of multiple bitmaps for one device)
Signed-off-by: Eric Blake 
---
 nbd/server.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 6d03e8a4b436..d9164ee6d0da 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1,5 +1,5 @@
 /*
- *  Copyright (C) 2016-2020 Red Hat, Inc.
+ *  Copyright (C) 2016-2021 Red Hat, Inc.
  *  Copyright (C) 2005  Anthony Liguori 
  *
  *  Network Block Device Server Side
@@ -879,7 +879,9 @@ static bool nbd_meta_qemu_query(NBDClient *client, 
NBDExportMetaContexts *meta,
 if (!*query) {
 if (client->opt == NBD_OPT_LIST_META_CONTEXT) {
 meta->allocation_depth = meta->exp->allocation_depth;
-memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps);
+if (meta->exp->nr_export_bitmaps) {
+memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps);
+}
 }
 trace_nbd_negotiate_meta_query_parse("empty");
 return true;
@@ -894,7 +896,8 @@ static bool nbd_meta_qemu_query(NBDClient *client, 
NBDExportMetaContexts *meta,
 if (nbd_strshift(, "dirty-bitmap:")) {
 trace_nbd_negotiate_meta_query_parse("dirty-bitmap:");
 if (!*query) {
-if (client->opt == NBD_OPT_LIST_META_CONTEXT) {
+if (client->opt == NBD_OPT_LIST_META_CONTEXT &&
+meta->exp->nr_export_bitmaps) {
 memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps);
 }
 trace_nbd_negotiate_meta_query_parse("empty");
@@ -1024,7 +1027,9 @@ static int nbd_negotiate_meta_queries(NBDClient *client,
 /* enable all known contexts */
 meta->base_allocation = true;
 meta->allocation_depth = meta->exp->allocation_depth;
-memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps);
+if (meta->exp->nr_export_bitmaps) {
+memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps);
+}
 } else {
 for (i = 0; i < nb_queries; ++i) {
 ret = nbd_negotiate_meta_query(client, meta, errp);
-- 
2.33.1




Re: "make check" fails in a clang sanitizer build on "nbd-qemu-allocation" iotest

2021-11-15 Thread Eric Blake
On Mon, Nov 15, 2021 at 05:11:54PM +, Peter Maydell wrote:
> Hi; running a 'make check' on a clang sanitizer build one of
> the iotests falls over due to a NULL pointer being passed to
> memset():
> 
> 
>   TEST   iotest-qcow2: nbd-qemu-allocation [fail]

> +../../nbd/server.c:1027:16: runtime error: null pointer passed as
> argument 1, which is declared to never be null

The code in question:

if (client->opt == NBD_OPT_LIST_META_CONTEXT && !nb_queries) {
/* enable all known contexts */
meta->base_allocation = true;
meta->allocation_depth = meta->exp->allocation_depth;
memset(meta->bitmaps, 1, meta->exp->nr_export_bitmaps);

I suspect what is happening is that meta->bitmaps is NULL when
meta->exp->nr_export_bitmaps is 0. It's annoying that clang's
sanitizer whines even for a 0-length memset, but a strict reading of
POSIX says that we really are in the technically undefined behavior
when passing NULL (even with 0 length), so such whiny behavior is
permitted.  So I'll post a patch.

> 
> Does this look familiar ?

First I've heard of it; thanks for alerting me.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()

2021-11-15 Thread Philippe Mathieu-Daudé
On 11/15/21 16:57, Markus Armbruster wrote:
> Philippe Mathieu-Daudé  writes:
>> On 11/15/21 13:55, Markus Armbruster wrote:
>>> drive_get_next() is basically a bad idea.  It returns the "next" block
>>> backend of a certain interface type.  "Next" means bus=0,unit=N, where
>>> subsequent calls count N up from zero, per interface type.
>>>
>>> This lets you define unit numbers implicitly by execution order.  If the
>>> order changes, or new calls appear "in the middle", unit numbers change.
>>> ABI break.  Hard to spot in review.
>>>
>>> Explicit is better than implicit: use drive_get() directly.
>>>
>>> Signed-off-by: Markus Armbruster 
>>> ---

>>> @@ -435,11 +438,13 @@ static void aspeed_machine_init(MachineState *machine)
>>>  }
>>>  
>>>  for (i = 0; i < bmc->soc.sdhci.num_slots; i++) {
>>> -sdhci_attach_drive(>soc.sdhci.slots[i], 
>>> drive_get_next(IF_SD));
>>> +sdhci_attach_drive(>soc.sdhci.slots[i],
>>> +   drive_get(IF_SD, 0, i));
>>
>> If we put SD on bus #0, ...
>>
>>>  }
>>>  
>>>  if (bmc->soc.emmc.num_slots) {
>>> -sdhci_attach_drive(>soc.emmc.slots[0], drive_get_next(IF_SD));
>>> +sdhci_attach_drive(>soc.emmc.slots[0],
>>> +   drive_get(IF_SD, 0, bmc->soc.sdhci.num_slots));
>>
>> ... we'd want to put eMMC on bus #1
> 
> Using separate buses for different kinds of devices would be neater, but
> it also would be an incompatible change.  This patch keeps existing
> bus/unit numbers working.  drive_get_next() can only use bus 0.
> 
>>  but I see having eMMC cards on a
>> IF_SD bus as a bug, since these cards are soldered on the board.
> 
> IF_SD is not a bus, it's an "block interface type", which is really just
> a user interface thing.

Why are we discriminating by "block interface type" then?

What is the difference between "block interfaces"? I see a block drive
as a generic unit, usable on multiple hardware devices.

I never really understood how this "block interface type" helps
developers and users. I thought BlockInterfaceType and DriveInfo
were legacy / deprecated APIs we want to get rid of; and we would
come up with a replacement API using BlockDeviceInfo or providing
a BlockFrontend state of the art object.
Anyway, I suppose the explanation is buried in the git history
before the last 8 years. I need to keep reading.



Re: [PULL 00/13] Block layer patches

2021-11-15 Thread Philippe Mathieu-Daudé
Hi Kevin,

On 11/15/21 15:53, Kevin Wolf wrote:
> The following changes since commit 42f6c9179be4401974dd3a75ee72defd16b5092d:
> 
>   Merge tag 'pull-ppc-2022' of https://github.com/legoater/qemu into 
> staging (2021-11-12 12:28:25 +0100)
> 
> are available in the Git repository at:
> 
>   git://repo.or.cz/qemu/kevin.git tags/for-upstream
> 
> for you to fetch changes up to 7461272c5f6032436ef9032c091c0118539483e4:
> 
>   softmmu/qdev-monitor: fix use-after-free in qdev_set_id() (2021-11-15 
> 15:49:46 +0100)
> 
> 
> Block layer patches
> 
> - Fixes to image streaming job and block layer reconfiguration to make
>   iotest 030 pass again
> - docs: Deprecate incorrectly typed device_add arguments
> - file-posix: Fix alignment after reopen changing O_DIRECT
> 
> 
> Hanna Reitz (10):
>   stream: Traverse graph after modification
>   block: Manipulate children list in .attach/.detach
>   block: Unite remove_empty_child and child_free
>   block: Drop detached child from ignore list
>   block: Pass BdrvChild ** to replace_child_noperm
>   block: Restructure remove_file_or_backing_child()
>   transactions: Invoke clean() after everything else
>   block: Let replace_child_tran keep indirect pointer
>   block: Let replace_child_noperm free children
>   iotests/030: Unthrottle parallel jobs in reverse
> 
> Kevin Wolf (2):
>   docs: Deprecate incorrectly typed device_add arguments
>   file-posix: Fix alignment after reopen changing O_DIRECT
> 
> Stefan Hajnoczi (1):
>   softmmu/qdev-monitor: fix use-after-free in qdev_set_id()
> 
>  docs/about/deprecated.rst   |  14 +++
>  include/qemu/transactions.h |   3 +
>  block.c | 233 
> +---
>  block/file-posix.c  |  20 +++-
>  block/stream.c  |   7 +-
>  softmmu/qdev-monitor.c  |   2 +-
>  util/transactions.c |   8 +-
>  tests/qemu-iotests/030  |  11 ++-
>  tests/qemu-iotests/142  |  22 +
>  tests/qemu-iotests/142.out  |  15 +++
>  10 files changed, 269 insertions(+), 66 deletions(-)

Looking at current /staging I noticed iotest#142 failed,
build-tcg-disabled job:

+++ 142.out.bad
@@ -750,6 +750,7 @@
 --- Alignment after changing O_DIRECT ---
+qemu-io: Cannot get 'write' permission without 'resize': Image size is
not a multiple of request alignment
 read 42/42 bytes at offset 42
 42 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 42/42 bytes at offset 42

https://gitlab.com/qemu-project/qemu/-/jobs/1784955950#L2794




Re: [PULL 00/13] Block layer patches

2021-11-15 Thread Richard Henderson

On 11/15/21 3:53 PM, Kevin Wolf wrote:

The following changes since commit 42f6c9179be4401974dd3a75ee72defd16b5092d:

   Merge tag 'pull-ppc-2022' of https://github.com/legoater/qemu into 
staging (2021-11-12 12:28:25 +0100)

are available in the Git repository at:

   git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 7461272c5f6032436ef9032c091c0118539483e4:

   softmmu/qdev-monitor: fix use-after-free in qdev_set_id() (2021-11-15 
15:49:46 +0100)


Block layer patches

- Fixes to image streaming job and block layer reconfiguration to make
   iotest 030 pass again
- docs: Deprecate incorrectly typed device_add arguments
- file-posix: Fix alignment after reopen changing O_DIRECT


Hanna Reitz (10):
   stream: Traverse graph after modification
   block: Manipulate children list in .attach/.detach
   block: Unite remove_empty_child and child_free
   block: Drop detached child from ignore list
   block: Pass BdrvChild ** to replace_child_noperm
   block: Restructure remove_file_or_backing_child()
   transactions: Invoke clean() after everything else
   block: Let replace_child_tran keep indirect pointer
   block: Let replace_child_noperm free children
   iotests/030: Unthrottle parallel jobs in reverse

Kevin Wolf (2):
   docs: Deprecate incorrectly typed device_add arguments
   file-posix: Fix alignment after reopen changing O_DIRECT

Stefan Hajnoczi (1):
   softmmu/qdev-monitor: fix use-after-free in qdev_set_id()

  docs/about/deprecated.rst   |  14 +++
  include/qemu/transactions.h |   3 +
  block.c | 233 +---
  block/file-posix.c  |  20 +++-
  block/stream.c  |   7 +-
  softmmu/qdev-monitor.c  |   2 +-
  util/transactions.c |   8 +-
  tests/qemu-iotests/030  |  11 ++-
  tests/qemu-iotests/142  |  22 +
  tests/qemu-iotests/142.out  |  15 +++
  10 files changed, 269 insertions(+), 66 deletions(-)


This is failing iotest 142 for build-tcg-disabled.
I did retry, in case it was transitory.

https://gitlab.com/qemu-project/qemu/-/jobs/1784955950


r~




[PATCH for 6.2 v4] nbd/server: Add --selinux-label option

2021-11-15 Thread Eric Blake
From: "Richard W.M. Jones" 

Under SELinux, Unix domain sockets have two labels.  One is on the
disk and can be set with commands such as chcon(1).  There is a
different label stored in memory (called the process label).  This can
only be set by the process creating the socket.  When using SELinux +
SVirt and wanting qemu to be able to connect to a qemu-nbd instance,
you must set both labels correctly first.

For qemu-nbd the options to set the second label are awkward.  You can
create the socket in a wrapper program and then exec into qemu-nbd.
Or you could try something with LD_PRELOAD.

This commit adds the ability to set the label straightforwardly on the
command line, via the new --selinux-label flag.  (The name of the flag
is the same as the equivalent nbdkit option.)

A worked example showing how to use the new option can be found in
this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1984938

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1984938
Signed-off-by: Richard W.M. Jones 
Reviewed-by: Daniel P. Berrangé 
Message-Id: <20210930084701.3899578-1-rjo...@redhat.com>
[eblake: rebase to configure changes, reject --selinux-label if it is
not compiled in or not used on a Unix socket]
Signed-off-by: Eric Blake 
---

v3 was here:
https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg07677.html

since then: another rework of the configure logic (now that
meson-buildoptions.sh exists), and a rework of the error reporting
(for now, loudly complain on all unsupported attempts at labeling. We
may later allow things that currently fail)

Candidate for 6.2 in spite of soft freeze because of earlier attempted
pull request here:
https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg07081.html

and my apologies for letting this slip a month without action.

 meson.build   | 10 +++-
 qemu-nbd.c| 46 +++
 meson_options.txt |  3 ++
 scripts/meson-buildoptions.sh |  3 ++
 tests/docker/dockerfiles/centos8.docker   |  1 +
 .../dockerfiles/fedora-i386-cross.docker  |  1 +
 tests/docker/dockerfiles/fedora.docker|  1 +
 tests/docker/dockerfiles/opensuse-leap.docker |  1 +
 tests/docker/dockerfiles/ubuntu1804.docker|  1 +
 tests/docker/dockerfiles/ubuntu2004.docker|  1 +
 10 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index 2ece4fe0889a..73b7deaa0ef3 100644
--- a/meson.build
+++ b/meson.build
@@ -1201,6 +1201,11 @@ keyutils = dependency('libkeyutils', required: false,

 has_gettid = cc.has_function('gettid')

+# libselinux
+selinux = dependency('libselinux',
+ required: get_option('selinux'),
+ method: 'pkg-config', kwargs: static_kwargs)
+
 # Malloc tests

 malloc = []
@@ -1479,6 +1484,7 @@ config_host_data.set('CONFIG_SPICE_PROTOCOL', 
spice_protocol.found())
 config_host_data.set('CONFIG_SPICE', spice.found())
 config_host_data.set('CONFIG_X11', x11.found())
 config_host_data.set('CONFIG_CFI', get_option('cfi'))
+config_host_data.set('CONFIG_SELINUX', selinux.found())
 config_host_data.set('QEMU_VERSION', '"@0@"'.format(meson.project_version()))
 config_host_data.set('QEMU_VERSION_MAJOR', 
meson.project_version().split('.')[0])
 config_host_data.set('QEMU_VERSION_MINOR', 
meson.project_version().split('.')[1])
@@ -3054,7 +3060,8 @@ if have_tools
   qemu_io = executable('qemu-io', files('qemu-io.c'),
  dependencies: [block, qemuutil], install: true)
   qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
-   dependencies: [blockdev, qemuutil, gnutls], install: true)
+   dependencies: [blockdev, qemuutil, gnutls, selinux],
+   install: true)

   subdir('storage-daemon')
   subdir('contrib/rdmacm-mux')
@@ -3430,6 +3437,7 @@ summary_info += {'libdaxctl support': libdaxctl}
 summary_info += {'libudev':   libudev}
 # Dummy dependency, keep .found()
 summary_info += {'FUSE lseek':fuse_lseek.found()}
+summary_info += {'selinux':   selinux.found()}
 summary(summary_info, bool_yn: true, section: 'Dependencies')

 if not supported_cpus.contains(cpu)
diff --git a/qemu-nbd.c b/qemu-nbd.c
index 9d895ba24b1e..c6c20df68a4d 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -47,6 +47,10 @@
 #include "trace/control.h"
 #include "qemu-version.h"

+#ifdef CONFIG_SELINUX
+#include 
+#endif
+
 #ifdef __linux__
 #define HAVE_NBD_DEVICE 1
 #else
@@ -64,6 +68,7 @@
 #define QEMU_NBD_OPT_FORK  263
 #define QEMU_NBD_OPT_TLSAUTHZ  264
 #define QEMU_NBD_OPT_PID_FILE  265
+#define QEMU_NBD_OPT_SELINUX_LABEL 266

 #define MBR_SIZE 512

@@ -116,6 +121,9 @@ static void usage(const char *name)
 "  --forkfork off the server process and exit the parent\n"
 "once the server is running\n"
 "  --pid-file=PATH   store the server's process ID in the given file\n"
+#ifdef 

Re: [PATCH 0/2] hw/nvme: fix namespace attachment on controller hotplug

2021-11-15 Thread Klaus Jensen
On Sep 24 09:27, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> First patch from Hannes fixes the subsystem registration process such
> that shared (but non-detached) namespaces are automatically attached to
> hotplugged controllers.
> 
> The second patch changes the default for 'shared' such that namespaces
> are shared by default and will thus by default be attached to hotplugged
> controllers. This adds a compat property for older machine versions and
> updates the documentation to reflect this.
> 
> Hannes Reinecke (1):
>   hw/nvme: reattach subsystem namespaces on hotplug
> 
> Klaus Jensen (1):
>   hw/nvme: change nvme-ns 'shared' default
> 
>  docs/system/devices/nvme.rst | 24 ++--
>  hw/core/machine.c|  4 +++-
>  hw/nvme/ns.c |  8 +---
>  hw/nvme/subsys.c | 10 +-
>  4 files changed, 27 insertions(+), 19 deletions(-)
> 

Ping :)


signature.asc
Description: PGP signature


"make check" fails in a clang sanitizer build on "nbd-qemu-allocation" iotest

2021-11-15 Thread Peter Maydell
Hi; running a 'make check' on a clang sanitizer build one of
the iotests falls over due to a NULL pointer being passed to
memset():


  TEST   iotest-qcow2: nbd-qemu-allocation [fail]
QEMU  --
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/qemu-iotests/../../qemu-system-aarch64"
-nodefaults -display none -accel qtest -machine virt
QEMU_IMG  --
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/qemu-iotests/../../qemu-img"
QEMU_IO   --
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/qemu-iotests/../../qemu-io"
--cache writeback --aio threads -f qcow2
QEMU_NBD  --
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/qemu-iotests/../../qemu-nbd"
IMGFMT-- qcow2
IMGPROTO  -- file
PLATFORM  -- Linux/x86_64 e104462 5.4.0-89-generic
TEST_DIR  --
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/qemu-iotests/scratch
SOCK_DIR  -- /tmp/tmp13ihi_hj
GDB_OPTIONS   --
VALGRIND_QEMU --
PRINT_QEMU_OUTPUT --

--- 
/home/petmay01/linaro/qemu-from-laptop/qemu/tests/qemu-iotests/tests/nbd-qemu-allocation.out
+++ nbd-qemu-allocation.out.bad
@@ -14,6 +14,8 @@
 [{ "start": 0, "length": 1048576, "depth": 1, "present": true,
"zero": false, "data": true, "offset": 327680},
 { "start": 1048576, "length": 2097152, "depth": 0, "present": true,
"zero": false, "data": true, "offset": 327680},
 { "start": 3145728, "length": 1048576, "depth": 1, "present": false,
"zero": true, "data": false}]
+../../nbd/server.c:1027:16: runtime error: null pointer passed as
argument 1, which is declared to never be null
+/usr/include/string.h:61:62: note: nonnull attribute specified here
 exports available: 1
  export: ''
   size:  4194304
  TEST   iotest-qcow2: qsd-jobs
Not run: 172 186 192 220 287
Failures: nbd-qemu-allocation
Failed 1 of 118 iotests


Does this look familiar ?

-- PMM



Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O

2021-11-15 Thread Daniel P . Berrangé
On Mon, Nov 15, 2021 at 05:03:28PM +0100, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> > Currently, block layer APIs like block-backend.h contain a mix of
> > functions that are either running in the main loop and under the
> > BQL, or are thread-safe functions and run in iothreads performing I/O.
> > The functions running under BQL also take care of modifying the
> > block graph, by using drain and/or aio_context_acquire/release.
> > This makes it very confusing to understand where each function
> > runs, and what assumptions it provided with regards to thread
> > safety.
> > 
> > We call the functions running under BQL "global state (GS) API", and
> > distinguish them from the thread-safe "I/O API".
> > 
> > The aim of this series is to split the relevant block headers in
> > global state and I/O sub-headers.
> 
> Despite leaving quite some comments, the series and the split seem
> reasonable to me overall.  (This is a pretty big series, after all, so those
> “some comments” stack up against a majority of changes that seem OK to me.
> :))
> 
> One thing I noticed while reviewing is that it’s really hard to verify that
> no I/O function calls a GS function.  What would be wonderful is some
> function marker like coroutine_fn that marks GS functions (or I/O functions)
> and that we could then verify the call paths.  But AFAIU we’ve always wanted
> precisely that for coroutine_fn and still don’t have it, so this seems like
> extremely wishful thinking... :(

Even if we don't programmatically verify these rules, it would be a major
step forward if we simply adopted a standard naming convention for the
APIs that was essentally self-describing. eg block_io_XXX for all I/O
stuff and block_state_ for all global state ,and block_common if
we have stuff applicable to both use cases.

I wouldn't suggest doing it as part of this series, but I think it'd
be worthwhile in general for the medium-long term, despite the code
churn in updating all usage in the short term.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: does drive_get_next(IF_NONE) make sense?

2021-11-15 Thread Thomas Huth

On 15/11/2021 08.12, Alistair Francis wrote:

On Mon, Nov 15, 2021 at 3:32 PM Markus Armbruster  wrote:


Peter Maydell  writes:


On Fri, 12 Nov 2021 at 13:34, Markus Armbruster  wrote:


Thomas Huth  writes:


On 03/11/2021 09.41, Markus Armbruster wrote:

Peter Maydell  writes:


Does it make sense for a device/board to do drive_get_next(IF_NONE) ?

Short answer: hell, no!  ;)


Would it make sense to add an "assert(type != IF_NONE)" to drive_get()
to avoid such mistakes in the future?


Worth a try.


You need to fix the sifive_u_otp device first :-)


And for that, we may want Hao Wu's "[PATCH v4 5/7] blockdev: Add a new
IF type IF_OTHER" first.


I can fixup sifive_u_otp, just let me know what the prefered solution is


What kind of device is that OTP exactly? If it is some kind of non-serial 
flash device, maybe you could simply use IF_PFLASH instead?


 Thomas




Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O

2021-11-15 Thread Hanna Reitz

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:

Currently, block layer APIs like block-backend.h contain a mix of
functions that are either running in the main loop and under the
BQL, or are thread-safe functions and run in iothreads performing I/O.
The functions running under BQL also take care of modifying the
block graph, by using drain and/or aio_context_acquire/release.
This makes it very confusing to understand where each function
runs, and what assumptions it provided with regards to thread
safety.

We call the functions running under BQL "global state (GS) API", and
distinguish them from the thread-safe "I/O API".

The aim of this series is to split the relevant block headers in
global state and I/O sub-headers.


Despite leaving quite some comments, the series and the split seem 
reasonable to me overall.  (This is a pretty big series, after all, so 
those “some comments” stack up against a majority of changes that seem 
OK to me. :))


One thing I noticed while reviewing is that it’s really hard to verify 
that no I/O function calls a GS function.  What would be wonderful is 
some function marker like coroutine_fn that marks GS functions (or I/O 
functions) and that we could then verify the call paths.  But AFAIU 
we’ve always wanted precisely that for coroutine_fn and still don’t have 
it, so this seems like extremely wishful thinking... :(


I think most of the issues I found can be fixed (or are even 
irrelevant), the only thing that really worries me are the two places 
that are clearly I/O paths that call permission functions: Namely first 
block_crypto_amend_options_generic_luks() (part of the luks block 
driver’s .bdrv_co_amend implementation), which calls 
bdrv_child_refresh_perms(); and second fuse_do_truncate(), which calls 
blk_set_perm().


In the first case, we need this call so that we don’t permanently hog 
the WRITE permission for the luks file, which used to be a problem, I 
believe.  We want to unshare the WRITE permission (and apparently also 
CONSISTENT_READ) during the key update, so we need some way to 
temporarily update the permissions.


I only really see four solutions for this:
(1) We somehow make the amend job run in the main context under the BQL 
and have it prevent all concurrent I/O access (seems bad)
(2) We can make the permission functions part of the I/O path (seems 
wrong and probably impossible?)
(3) We can drop the permissions update and permanently require the 
permissions that we need when updating keys (I think this might break 
existing use cases)
(4) We can acquire the BQL around the permission update call and perhaps 
that works?


I don’t know how (4) would work but it’s basically the only reasonable 
solution I can come up with.  Would this be a way to call a BQL function 
from an I/O function?


As for the second case, the same applies as above, with the differences 
that we have no jobs, so this code must always run in the block device’s 
AioContext (I think), which rules out (1); but (3) would become easier 
(i.e. require the RESIZE permission all the time), although that too 
might have an impact on existing users (don’t think so, though).  In any 
case, if we could do (4), that would solve the problem here, too.



And finally, another notable thing I noticed is that the way how 
create-related functions are handled is inconsistent.  I believe they 
should all be GS functions; qmp_blockdev_create() seems to agree with me 
on this, but we currently seem to have some bugs there.  It’s possible 
to invoke blockdev-create on a block device that’s in an I/O thread, and 
then qemu crashes.  Oops.  (The comment in qmp_blockdev_create() says 
that the block drivers’ implementations should prevent this, but 
apparently they don’t...?) In any case, that’s a pre-existing bug, of 
course, that doesn’t concern this series (other than that it suggests 
that “create” functions should be classified as GS).


Hanna




Re: [PATCH RFC 0/2] Eliminate drive_get_next()

2021-11-15 Thread Markus Armbruster
Peter Maydell  writes:

> On Mon, 15 Nov 2021 at 12:55, Markus Armbruster  wrote:
>>
>> This is RFC because I'm unsure about the removal of
>>
>> /* Reason: init() method uses drive_get_next() */
>> dc->user_creatable = false;
>>
>> in PATCH 1.  Both users appear to wire up some GPIO.  If that's
>> necessary for the thing to work, we should just replace the comment.
>
> Looking at the code, it sort of is and sort of isn't. The GPIO line
> is the chip-select line. If you don't connect it then (because the
> ssi-sd device configures cs_polarity to SSI_CS_LOW, requesting an
> active-low chip-select) the device will always be selected. If
> the machine created an SSI bus with no SSI device attached to it
> then in theory the user could create an ssi-sd device and connect
> it there and have it work. But in practice it's really unlikely:
> machines create SSI buses with specific wired-in devices on them,
> and the guest OS knows what it has to do to enable the chip select
> for the device it wants to talk to (often some known GPIO pin on
> a GPIO controller).
>
> So I would keep the user_creatable = false, with a reason of
> "user should wire up GPIO chip-select line". But see below for

I'll do it this way.

> a pile of contrary precedent.
>
> (The chip-select GPIO is created in the parent class, incidentally.)
>
>> Aside: there may be devices that need manual wiring to work, yet don't
>> have user_creatable unset.  Bugs if you ask me.  I don't have smart
>> ideas on how to track them down.
>
> Me neither. I notice that the TYPE_M25P80 is also an SSI peripheral
> with an active-low chipselect but that one doesn't set user_creatable
> to false. TYPE_SSD0323 also is user-creatable and that one has an
> active-high chipselect, which means the user can create a device but
> it will then never be usable because it will ignore all transactions.
> (More generally, looks like most subclasses of TYPE_SSI_PERIPHERAL
> don't set user_creatable = false.)

For sysbus devices, we clear user_creatable by default, because sysbus
devices pretty much always[*] need wiring.  Is this the case for SSI bus
devices, too?


[*] The most prominent exception is "dynamic sysbus", which I believe
was a mistake.




Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()

2021-11-15 Thread Markus Armbruster
Philippe Mathieu-Daudé  writes:

> On 11/15/21 13:55, Markus Armbruster wrote:
>> drive_get_next() is basically a bad idea.  It returns the "next" block
>> backend of a certain interface type.  "Next" means bus=0,unit=N, where
>> subsequent calls count N up from zero, per interface type.
>> 
>> This lets you define unit numbers implicitly by execution order.  If the
>> order changes, or new calls appear "in the middle", unit numbers change.
>> ABI break.  Hard to spot in review.
>> 
>> Explicit is better than implicit: use drive_get() directly.
>> 
>> Signed-off-by: Markus Armbruster 
>> ---
>>  include/sysemu/blockdev.h   |  1 -
>>  blockdev.c  | 10 --
>>  hw/arm/aspeed.c | 21 +
>>  hw/arm/cubieboard.c |  2 +-
>>  hw/arm/imx25_pdk.c  |  2 +-
>>  hw/arm/integratorcp.c   |  2 +-
>>  hw/arm/mcimx6ul-evk.c   |  2 +-
>>  hw/arm/mcimx7d-sabre.c  |  2 +-
>>  hw/arm/msf2-som.c   |  2 +-
>>  hw/arm/npcm7xx_boards.c |  6 +++---
>>  hw/arm/orangepi.c   |  2 +-
>>  hw/arm/raspi.c  |  2 +-
>>  hw/arm/realview.c   |  2 +-
>>  hw/arm/sabrelite.c  |  2 +-
>>  hw/arm/versatilepb.c|  4 ++--
>>  hw/arm/vexpress.c   |  6 +++---
>>  hw/arm/xilinx_zynq.c| 16 +---
>>  hw/arm/xlnx-versal-virt.c   |  3 ++-
>>  hw/arm/xlnx-zcu102.c|  6 +++---
>>  hw/microblaze/petalogix_ml605_mmu.c |  2 +-
>>  hw/misc/sifive_u_otp.c  |  2 +-
>>  hw/riscv/microchip_pfsoc.c  |  2 +-
>>  hw/sparc64/niagara.c|  2 +-
>>  23 files changed, 49 insertions(+), 52 deletions(-)
>
>> @@ -435,11 +438,13 @@ static void aspeed_machine_init(MachineState *machine)
>>  }
>>  
>>  for (i = 0; i < bmc->soc.sdhci.num_slots; i++) {
>> -sdhci_attach_drive(>soc.sdhci.slots[i], drive_get_next(IF_SD));
>> +sdhci_attach_drive(>soc.sdhci.slots[i],
>> +   drive_get(IF_SD, 0, i));
>
> If we put SD on bus #0, ...
>
>>  }
>>  
>>  if (bmc->soc.emmc.num_slots) {
>> -sdhci_attach_drive(>soc.emmc.slots[0], drive_get_next(IF_SD));
>> +sdhci_attach_drive(>soc.emmc.slots[0],
>> +   drive_get(IF_SD, 0, bmc->soc.sdhci.num_slots));
>
> ... we'd want to put eMMC on bus #1

Using separate buses for different kinds of devices would be neater, but
it also would be an incompatible change.  This patch keeps existing
bus/unit numbers working.  drive_get_next() can only use bus 0.

>  but I see having eMMC cards on a
> IF_SD bus as a bug, since these cards are soldered on the board.

IF_SD is not a bus, it's an "block interface type", which is really just
a user interface thing.

>> --- a/hw/arm/vexpress.c
>> +++ b/hw/arm/vexpress.c
>> @@ -625,7 +625,7 @@ static void vexpress_common_init(MachineState *machine)
>>qdev_get_gpio_in(sysctl, 
>> ARM_SYSCTL_GPIO_MMC_WPROT));
>>  qdev_connect_gpio_out_named(dev, "card-inserted", 0,
>>qdev_get_gpio_in(sysctl, 
>> ARM_SYSCTL_GPIO_MMC_CARDIN));
>> -dinfo = drive_get_next(IF_SD);
>> +dinfo = drive_get(IF_SD, 0, 0);
>
> Can we have one interface refactor per patch (IF_SD, IF_PFLASH, IF_MTD...)?

Peter asked for one patch per "board/SoC model".  I'll do whatever helps
reviewers.

>> @@ -657,7 +657,7 @@ static void vexpress_common_init(MachineState *machine)
>>  
>>  sysbus_create_simple("pl111", map[VE_CLCD], pic[14]);
>>  
>> -dinfo = drive_get_next(IF_PFLASH);
>> +dinfo = drive_get(IF_PFLASH, 0, 0);
>
>> -static inline void zynq_init_spi_flashes(uint32_t base_addr, qemu_irq irq,
>> - bool is_qspi)
>> +static inline int zynq_init_spi_flashes(uint32_t base_addr, qemu_irq irq,
>> +bool is_qspi, int unit0)
>>  {
>> +int unit = unit0;
>>  DeviceState *dev;
>>  SysBusDevice *busdev;
>>  SSIBus *spi;
>> @@ -156,7 +157,7 @@ static inline void zynq_init_spi_flashes(uint32_t 
>> base_addr, qemu_irq irq,
>>  spi = (SSIBus *)qdev_get_child_bus(dev, bus_name);
>>  
>>  for (j = 0; j < num_ss; ++j) {
>> -DriveInfo *dinfo = drive_get_next(IF_MTD);
>> +DriveInfo *dinfo = drive_get(IF_MTD, 0, unit++);
>
>> diff --git a/hw/arm/xlnx-zcu102.c b/hw/arm/xlnx-zcu102.c
>> index 3dc2b5e8ca..45eb19ab3b 100644
>> --- a/hw/arm/xlnx-zcu102.c
>> +++ b/hw/arm/xlnx-zcu102.c
>> @@ -190,7 +190,7 @@ static void xlnx_zcu102_init(MachineState *machine)
>>  BusState *spi_bus;
>>  DeviceState *flash_dev;
>>  qemu_irq cs_line;
>> -DriveInfo *dinfo = drive_get_next(IF_MTD);
>> +DriveInfo *dinfo = drive_get(IF_MTD, 0, i);
>
> If this is bus #0, ...
>
>>   

On configuring onboard block devices with -blockdev (was: [PATCH v4 6/7] hw/nvram: Update at24c EEPROM init function in NPCM7xx boards)

2021-11-15 Thread Markus Armbruster
Kevin Wolf  writes:

> Am 03.11.2021 um 23:01 hat Hao Wu geschrieben:
>> We made 3 changes to the at24c_eeprom_init function in
>> npcm7xx_boards.c:
>> 
>> 1. We allow the function to take a I2CBus* as parameter. This allows
>>us to attach an EEPROM device behind an I2C mux which is not
>>possible with the old method.
>> 
>> 2. We make at24c EEPROMs are backed by drives so that we can
>>specify the content of the EEPROMs.
>> 
>> 3. Instead of using i2c address as unit number, This patch assigns
>>unique unit numbers for each eeproms in each board. This avoids
>>conflict in providing multiple eeprom contents with the same address.
>>In the old method if we specify two drives with the same unit number,
>>the following error will occur: `Device with id 'none85' exists`.
>> 
>> Signed-off-by: Hao Wu 
>> ---
>>  hw/arm/npcm7xx_boards.c | 14 +-
>>  1 file changed, 9 insertions(+), 5 deletions(-)
>> 
>> diff --git a/hw/arm/npcm7xx_boards.c b/hw/arm/npcm7xx_boards.c
>> index dec7d16ae5..9121e081fa 100644
>> --- a/hw/arm/npcm7xx_boards.c
>> +++ b/hw/arm/npcm7xx_boards.c
>> @@ -126,13 +126,17 @@ static I2CBus *npcm7xx_i2c_get_bus(NPCM7xxState *soc, 
>> uint32_t num)
>>  return I2C_BUS(qdev_get_child_bus(DEVICE(>smbus[num]), "i2c-bus"));
>>  }
>>  
>> -static void at24c_eeprom_init(NPCM7xxState *soc, int bus, uint8_t addr,
>> -  uint32_t rsize)
>> +static void at24c_eeprom_init(I2CBus *i2c_bus, int bus, uint8_t addr,
>> +  uint32_t rsize, int unit_number)
>>  {
>> -I2CBus *i2c_bus = npcm7xx_i2c_get_bus(soc, bus);
>>  I2CSlave *i2c_dev = i2c_slave_new("at24c-eeprom", addr);
>>  DeviceState *dev = DEVICE(i2c_dev);
>> +DriveInfo *dinfo;
>>  
>> +dinfo = drive_get(IF_OTHER, bus, unit_number);
>> +if (dinfo) {
>> +qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo));
>> +}
>
> I may be missing the history of this series, but why do we have to use a
> legacy DriveInfo here instead of adding a drive property to the board to
> make this configurable with -blockdev?
>
> Adding even more devices that can only be configured with -drive feels
> like a step backwards to me.

More like sideways.

The big unfinished part of -blockdev is configuring onboard devices with
it.

I took a stab at one instance in commit ebc29e1bea "pc: Support firmware
configuration with -blockdev", 2019-03-11.  Quoting from the commit
message:

Mapping -drive if=none,... to -blockdev is a solved problem.  With
if=T other than if=none, -drive additionally configures a block device
frontend.  For non-onboard devices, that part maps to -device.  Also a
solved problem.  For onboard devices such as PC flash memory, we have
an unsolved problem.

This is actually an instance of a wider problem: our general device
configuration interface doesn't cover onboard devices.  Instead, we have
a zoo of ad hoc interfaces that are much more limited.  One of them is
-drive, which we'd rather deprecate, but can't until we have suitable
replacements for all its uses.

Sadly, I can't attack the wider problem today.  So back to the narrow
problem.

My first idea was to reduce it to its solved buddy by using pluggable
instead of onboard devices for the flash memory.  Workable, but it
requires some extra smarts in firmware descriptors and libvirt.  Paolo
had an idea that is simpler for libvirt: keep the devices onboard, and
add machine properties for their block backends.

The implementation is less than straightforward, I'm afraid.

First, block backend properties are *qdev* properties.  Machines can't
have those, as they're not devices.  I could duplicate these qdev
properties as QOM properties, but I hate that.

More seriously, the properties do not belong to the machine, they
belong to the onboard flash devices.  Adding them to the machine would
then require bad magic to somehow transfer them to the flash devices.
Fortunately, QOM provides the means to handle exactly this case: add
alias properties to the machine that forward to the onboard devices'
properties.

Properties need to be created in .instance_init() methods.  For PC
machines, that's pc_machine_initfn().  To make alias properties work,
we need to create the onboard flash devices there, too.  Requires
several bug fixes, in the previous commits.  We also have to realize
the devices.  More on that below.

The need to create onboard devices differently results in a non-trivial
refactoring.  The need for keeping -drive working complicates things
further.

The "several bug fixes" took me 26 preparatory patches, plus at least
three more to fix regressions caused by one of them.

I then did the same for ARM virt in commit e0561e60f1 "hw/arm/virt:
Support firmware configuration with -blockdev", 2019-05-07.  Only a few
preparatory patches, but the patch 

Re: does drive_get_next(IF_NONE) make sense?

2021-11-15 Thread Markus Armbruster
Peter Maydell  writes:

> On Mon, 15 Nov 2021 at 13:24, Kevin Wolf  wrote:
>> Same question as for Hao Wu's series: Wouldn't the proper solution be to
>> add a drive property to the machine type?
>>
>> If you can't use -blockdev, it's not done right.
>
> Is there an example of "doing it right" for built-in-to-the-machine
> devices?
>
> (My experience with the new-style options is that almost
> always they're designed for x86 where the device they're attached
> to is also created on the command line, and then handling of boards
> where the device is builtin is either an afterthought or forgotten.
> See also -netdev, where it took forever for built-in-ethernet to
> be usable.)

I'm afraid the situation for onboard block devices is far worse than it
ever was for NICs.  See my reply "On configuring onboard block devices
with -blockdev" to Kevin's other message on the topic.

To be fair, improving onboard device configuration is *hard*.  Our
general device configuration interface doesn't cover them, and we've so
far failed finding a general solution.  Without one, we're drowning in
the sheer number of boards and onboard devices.  Which is ever growing.




Re: [PATCH v4 03/25] assertions for block global state API

2021-11-15 Thread Hanna Reitz

On 15.11.21 13:27, Emanuele Giuseppe Esposito wrote:


@@ -586,12 +589,14 @@ void bdrv_unapply_subtree_drain(BdrvChild 
*child, BlockDriverState *old_parent)

  void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
  {
  assert(qemu_in_coroutine());
+    assert(qemu_in_main_thread());
  bdrv_drained_begin(bs);
  bdrv_drained_end(bs);
  }
  void bdrv_drain(BlockDriverState *bs)
  {
+    assert(qemu_in_main_thread());
  bdrv_drained_begin(bs);
  bdrv_drained_end(bs);
  }


Why are these GS functions when both bdrv_drained_begin() and 
bdrv_drained_end() are I/O functions?


I can understand making the drain_all functions GS functions, but it 
seems weird to say it’s an I/O function when a single BDS is drained 
via bdrv_drained_begin() and bdrv_drained_end(), but not via 
bdrv_drain(), which just does both.


(I can see that there are no I/O path callers, but I still find it 
strange.)


The problem as you saw is on the path callers: bdrv_drain seems to be 
called only in main thread, while others or similar drains are also 
coming from I/O. I would say maybe it's a better idea to just put 
everything I/O, maybe (probably) there are cases not caught by the tests.


The only ones in global-state will be:
- bdrv_apply_subtree_drain and bdrv_unapply_subtree_drain (called only 
in .attach and .detach callbacks of BdrvChildClass, run under BQL).

- bdrv_drain_all_end_quiesce (called only by bdrv_close thus under BQL).
- bdrv_drain_all_begin, bdrv_drain_all_end and bdrv_drain_all (have 
already BQL assertions)


In general, this is the underlying problem with these assertions: if a 
test doesn't test a specific code path, an unneeded assertion might 
pass undetected.


I believe the I/O vs. GS classification should not only be done based on 
functional concerns, i.e. who calls this function (is it called by an 
I/O function?) and whether it can be thread-safe or not (does it call a 
BQL function?), but also needs to be done semantically.  I believe there 
are some functions that we consider thread-safe but are also only called 
by BQL functions, for example apparently bdrv_drain(), 
bdrv_apply_subtree_drain(), and bdrv_unapply_subtree_drain().  Such 
functions can functionally be considered both GS and I/O functions, so 
the decision should be done semantically.  I believe that it makes sense 
to say all drain-related functions for a single subtree are I/O functions.


OTOH, functions operating on multiple trees in the block graph (i.e. all 
functions that have some “_all_” in their name) should naturally be 
considered GS functions, regardless of whether their implementation is 
thread-safe or not, because they are by nature global functions.


That’s why I’m wondering why some drain functions are I/O and others are 
GS; or why this block-status function is I/O and this one is GS.  It may 
make sense functionally, but semantically it’s strange.


I understand it may be difficult for you to know which functions relate 
to each other and thus know these semantic relationships, but in most of 
the series the split seems very reasonable to me, semantically.  If it 
didn’t, I said so. :)


Hanna




Re: [PATCH v4 24/25] job.h: split function pointers in JobDriver

2021-11-15 Thread Hanna Reitz

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:

The job API will be handled separately in another serie.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
  include/qemu/job.h | 16 
  1 file changed, 16 insertions(+)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 6e67b6977f..7e9e59f4b8 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -169,12 +169,21 @@ typedef struct Job {
   * Callbacks and other information about a Job driver.
   */
  struct JobDriver {
+
+/* Fields initialized in struct definition and never changed. */


Like in patch 19, I’d prefer a slightly more verbose comment that I’d 
find more easily readable.



+
  /** Derived Job struct size */
  size_t instance_size;
  
  /** Enum describing the operation */

  JobType job_type;
  
+/*

+ * Functions run without regard to the BQL and may run in any


s/and/that/?


+ * arbitrary thread. These functions do not need to be thread-safe
+ * because the caller ensures that are invoked from one thread at time.


s/that/they/ (or “that they”)

I believe .run() must be run in the job’s context, though.  Not sure if 
that’s necessary to note, but it isn’t really an arbitrary thread, and 
block jobs certainly require this (because they run in the block 
device’s context).  Or is that something that’s going to change with I/O 
threading?


Hanna




Re: does drive_get_next(IF_NONE) make sense?

2021-11-15 Thread Kevin Wolf
Am 15.11.2021 um 14:31 hat Peter Maydell geschrieben:
> On Mon, 15 Nov 2021 at 13:24, Kevin Wolf  wrote:
> > Same question as for Hao Wu's series: Wouldn't the proper solution be to
> > add a drive property to the machine type?
> >
> > If you can't use -blockdev, it's not done right.
> 
> Is there an example of "doing it right" for built-in-to-the-machine
> devices?
> 
> (My experience with the new-style options is that almost
> always they're designed for x86 where the device they're attached
> to is also created on the command line, and then handling of boards
> where the device is builtin is either an afterthought or forgotten.
> See also -netdev, where it took forever for built-in-ethernet to
> be usable.)

As long as we don't have a separate way to configure built-in devices
without creating them, for boards where the device is built in, the
properties for the device have to become machine properties.

I seem to remember that Markus implemented this for some boards.

Just doing without properties for these devices, either by just hard
coding things instead of providing options to the user, or by bypassing
the property system and accessing global state instead feels wrong.

Maybe once we have .instance_config (see my QOM/QAPI integration RFC),
forwarding these properties from -machine could actually become trivial,
just one call to set the properties for each built-in device. For now,
it's probably not as nice because each property needs to be forwarded
individually instead of just providing access to all properties of the
device.

Kevin




[PULL 09/13] block: Let replace_child_noperm free children

2021-11-15 Thread Kevin Wolf
From: Hanna Reitz 

In most of the block layer, especially when traversing down from other
BlockDriverStates, we assume that BdrvChild.bs can never be NULL.  When
it becomes NULL, it is expected that the corresponding BdrvChild pointer
also becomes NULL and the BdrvChild object is freed.

Therefore, once bdrv_replace_child_noperm() sets the BdrvChild.bs
pointer to NULL, it should also immediately set the corresponding
BdrvChild pointer (like bs->file or bs->backing) to NULL.

In that context, it also makes sense for this function to free the
child.  Sometimes we cannot do so, though, because it is called in a
transactional context where the caller might still want to reinstate the
child in the abort branch (and free it only on commit), so this behavior
has to remain optional.

In bdrv_replace_child_tran()'s abort handler, we now rely on the fact
that the BdrvChild passed to bdrv_replace_child_tran() must have had a
non-NULL .bs pointer initially.  Make a note of that and assert it.

Signed-off-by: Hanna Reitz 
Message-Id: <2021120829.81329-10-hre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block.c | 102 +++-
 1 file changed, 79 insertions(+), 23 deletions(-)

diff --git a/block.c b/block.c
index a40027161c..0ac5b163d2 100644
--- a/block.c
+++ b/block.c
@@ -87,8 +87,10 @@ static BlockDriverState *bdrv_open_inherit(const char 
*filename,
 static bool bdrv_recurse_has_child(BlockDriverState *bs,
BlockDriverState *child);
 
+static void bdrv_child_free(BdrvChild *child);
 static void bdrv_replace_child_noperm(BdrvChild **child,
-  BlockDriverState *new_bs);
+  BlockDriverState *new_bs,
+  bool free_empty_child);
 static void bdrv_remove_file_or_backing_child(BlockDriverState *bs,
   BdrvChild *child,
   Transaction *tran);
@@ -2256,12 +2258,16 @@ typedef struct BdrvReplaceChildState {
 BdrvChild *child;
 BdrvChild **childp;
 BlockDriverState *old_bs;
+bool free_empty_child;
 } BdrvReplaceChildState;
 
 static void bdrv_replace_child_commit(void *opaque)
 {
 BdrvReplaceChildState *s = opaque;
 
+if (s->free_empty_child && !s->child->bs) {
+bdrv_child_free(s->child);
+}
 bdrv_unref(s->old_bs);
 }
 
@@ -2278,22 +2284,26 @@ static void bdrv_replace_child_abort(void *opaque)
  * modify the BdrvChild * pointer we indirectly pass to it, i.e. it
  * will not modify s->child.  From that perspective, it does not matter
  * whether we pass s->childp or >child.
- * (TODO: Right now, bdrv_replace_child_noperm() never modifies that
- * pointer anyway (though it will in the future), so at this point it
- * absolutely does not matter whether we pass s->childp or >child.)
  * (2) If new_bs is not NULL, s->childp will be NULL.  We then cannot use
  * it here.
  * (3) If new_bs is NULL, *s->childp will have been NULLed by
  * bdrv_replace_child_tran()'s bdrv_replace_child_noperm() call, and we
  * must not pass a NULL *s->childp here.
- * (TODO: In its current state, bdrv_replace_child_noperm() will not
- * have NULLed *s->childp, so this does not apply yet.  It will in the
- * future.)
  *
  * So whether new_bs was NULL or not, we cannot pass s->childp here; and in
  * any case, there is no reason to pass it anyway.
  */
-bdrv_replace_child_noperm(>child, s->old_bs);
+bdrv_replace_child_noperm(>child, s->old_bs, true);
+/*
+ * The child was pre-existing, so s->old_bs must be non-NULL, and
+ * s->child thus must not have been freed
+ */
+assert(s->child != NULL);
+if (!new_bs) {
+/* As described above, *s->childp was cleared, so restore it */
+assert(s->childp != NULL);
+*s->childp = s->child;
+}
 bdrv_unref(new_bs);
 }
 
@@ -2310,30 +2320,44 @@ static TransactionActionDrv bdrv_replace_child_drv = {
  *
  * The function doesn't update permissions, caller is responsible for this.
  *
+ * (*childp)->bs must not be NULL.
+ *
  * Note that if new_bs == NULL, @childp is stored in a state object attached
  * to @tran, so that the old child can be reinstated in the abort handler.
  * Therefore, if @new_bs can be NULL, @childp must stay valid until the
  * transaction is committed or aborted.
  *
- * (TODO: The reinstating does not happen yet, but it will once
- * bdrv_replace_child_noperm() NULLs *childp when new_bs is NULL.)
+ * If @free_empty_child is true and @new_bs is NULL, the BdrvChild is
+ * freed (on commit).  @free_empty_child should only be false if the
+ * caller will free the BDrvChild themselves (which may be important
+ * if this is in turn called in another transactional context).
  */
 static void 

Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-11-15 Thread Christian Schoenebeck
On Montag, 15. November 2021 12:54:32 CET Stefan Hajnoczi wrote:
> On Thu, Nov 11, 2021 at 06:54:03PM +0100, Christian Schoenebeck wrote:
> > On Donnerstag, 11. November 2021 17:31:52 CET Stefan Hajnoczi wrote:
> > > On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> > > > On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > > > > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck 
wrote:
> > > > > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > > > As you are apparently reluctant for changing the virtio specs,
> > > > > > what
> > > > > > about
> > > > > > introducing those discussed virtio capabalities either as
> > > > > > experimental
> > > > > > ones
> > > > > > without specs changes, or even just as 9p specific device
> > > > > > capabilities
> > > > > > for
> > > > > > now. I mean those could be revoked on both sides at any time
> > > > > > anyway.
> > > > > 
> > > > > I would like to understand the root cause before making changes.
> > > > > 
> > > > > "It's faster when I do X" is useful information but it doesn't
> > > > > necessarily mean doing X is the solution. The "it's faster when I do
> > > > > X
> > > > > because Y" part is missing in my mind. Once there is evidence that
> > > > > shows
> > > > > Y then it will be clearer if X is a good solution, if there's a more
> > > > > general solution, or if it was just a side-effect.
> > > > 
> > > > I think I made it clear that the root cause of the observed
> > > > performance
> > > > gain with rising transmission size is latency (and also that
> > > > performance
> > > > is not the only reason for addressing this queue size issue).
> > > > 
> > > > Each request roundtrip has a certain minimum latency, the virtio ring
> > > > alone
> > > > has its latency, plus latency of the controller portion of the file
> > > > server
> > > > (e.g. permissions, sandbox checks, file IDs) that is executed with
> > > > *every*
> > > > request, plus latency of dispatching the request handling between
> > > > threads
> > > > several times back and forth (also for each request).
> > > > 
> > > > Therefore when you split large payloads (e.g. reading a large file)
> > > > into
> > > > smaller n amount of chunks, then that individual latency per request
> > > > accumulates to n times the individual latency, eventually leading to
> > > > degraded transmission speed as those requests are serialized.
> > > 
> > > It's easy to increase the blocksize in benchmarks, but real applications
> > > offer less control over the I/O pattern. If latency in the device
> > > implementation (QEMU) is the root cause then reduce the latency to speed
> > > up all applications, even those that cannot send huge requests.
> > 
> > Which I did, still do, and also mentioned before, e.g.:
> > 
> > 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 9pfs: reduce latency of Twalk
> > 0c4356ba7dafc8ecb5877a42fc0d68d45ccf5951 9pfs: T_readdir latency
> > optimization
> > 
> > Reducing overall latency is a process that is ongoing and will still take
> > a
> > very long development time. Not because of me, but because of lack of
> > reviewers. And even then, it does not make the effort to support higher
> > transmission sizes obsolete.
> > 
> > > One idea is request merging on the QEMU side. If the application sends
> > > 10 sequential read or write requests, coalesce them together before the
> > > main part of request processing begins in the device. Process a single
> > > large request to spread the cost of the file server over the 10
> > > requests. (virtio-blk has request merging to help with the cost of lots
> > > of small qcow2 I/O requests.) The cool thing about this is that the
> > > guest does not need to change its I/O pattern to benefit from the
> > > optimization.
> > > 
> > > Stefan
> > 
> > Ok, don't get me wrong: I appreciate that you are suggesting approaches
> > that could improve things. But I could already hand you over a huge list
> > of mine. The limiting factor here is not the lack of ideas of what could
> > be improved, but rather the lack of people helping out actively on 9p
> > side:
> > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06452.html
> > 
> > The situation on kernel side is the same. I already have a huge list of
> > what could & should be improved. But there is basically no reviewer for
> > 9p patches on Linux kernel side either.
> > 
> > The much I appreciate suggestions of what could be improved, I would
> > appreciate much more if there was *anybody* actively assisting as well. In
> > the time being I have to work the list down in small patch chunks,
> > priority based.
> I see request merging as an alternative to this patch series, not as an
> additional idea.

It is not an alternative. Like I said before, even if it would solve the 
sequential I/O performance issue (by not simply moving the problem somewhere 
else), which I doubt, your suggestion would still not solve the semantical 

[PULL 07/13] transactions: Invoke clean() after everything else

2021-11-15 Thread Kevin Wolf
From: Hanna Reitz 

Invoke the transaction drivers' .clean() methods only after all
.commit() or .abort() handlers are done.

This makes it easier to have nested transactions where the top-level
transactions pass objects to lower transactions that the latter can
still use throughout their commit/abort phases, while the top-level
transaction keeps a reference that is released in its .clean() method.

(Before this commit, that is also possible, but the top-level
transaction would need to take care to invoke tran_add() before the
lower-level transaction does.  This commit makes the ordering
irrelevant, which is just a bit nicer.)

Signed-off-by: Hanna Reitz 
Message-Id: <2021120829.81329-8-hre...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Kevin Wolf 
---
 include/qemu/transactions.h | 3 +++
 util/transactions.c | 8 ++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/qemu/transactions.h b/include/qemu/transactions.h
index 92c5965235..2f2060acd9 100644
--- a/include/qemu/transactions.h
+++ b/include/qemu/transactions.h
@@ -31,6 +31,9 @@
  * tran_create(), call your "prepare" functions on it, and finally call
  * tran_abort() or tran_commit() to finalize the transaction by corresponding
  * finalization actions in reverse order.
+ *
+ * The clean() functions registered by the drivers in a transaction are called
+ * last, after all abort() or commit() functions have been called.
  */
 
 #ifndef QEMU_TRANSACTIONS_H
diff --git a/util/transactions.c b/util/transactions.c
index d0bc9a3e73..2dbdedce95 100644
--- a/util/transactions.c
+++ b/util/transactions.c
@@ -61,11 +61,13 @@ void tran_abort(Transaction *tran)
 {
 TransactionAction *act, *next;
 
-QSLIST_FOREACH_SAFE(act, >actions, entry, next) {
+QSLIST_FOREACH(act, >actions, entry) {
 if (act->drv->abort) {
 act->drv->abort(act->opaque);
 }
+}
 
+QSLIST_FOREACH_SAFE(act, >actions, entry, next) {
 if (act->drv->clean) {
 act->drv->clean(act->opaque);
 }
@@ -80,11 +82,13 @@ void tran_commit(Transaction *tran)
 {
 TransactionAction *act, *next;
 
-QSLIST_FOREACH_SAFE(act, >actions, entry, next) {
+QSLIST_FOREACH(act, >actions, entry) {
 if (act->drv->commit) {
 act->drv->commit(act->opaque);
 }
+}
 
+QSLIST_FOREACH_SAFE(act, >actions, entry, next) {
 if (act->drv->clean) {
 act->drv->clean(act->opaque);
 }
-- 
2.31.1




[PULL 12/13] file-posix: Fix alignment after reopen changing O_DIRECT

2021-11-15 Thread Kevin Wolf
At the end of a reopen, we already call bdrv_refresh_limits(), which
should update bs->request_alignment according to the new file
descriptor. However, raw_probe_alignment() relies on s->needs_alignment
and just uses 1 if it isn't set. We neglected to update this field, so
starting with cache=writeback and then reopening with cache=none means
that we get an incorrect bs->request_alignment == 1 and unaligned
requests fail instead of being automatically aligned.

Fix this by recalculating s->needs_alignment in raw_refresh_limits()
before calling raw_probe_alignment().

Signed-off-by: Kevin Wolf 
Message-Id: <20211104113109.56336-1-kw...@redhat.com>
Reviewed-by: Hanna Reitz 
Reviewed-by: Stefano Garzarella 
Signed-off-by: Kevin Wolf 
---
 block/file-posix.c | 20 
 tests/qemu-iotests/142 | 22 ++
 tests/qemu-iotests/142.out | 15 +++
 3 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 7a27c83060..b283093e5b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -167,6 +167,7 @@ typedef struct BDRVRawState {
 int page_cache_inconsistent; /* errno from fdatasync failure */
 bool has_fallocate;
 bool needs_alignment;
+bool force_alignment;
 bool drop_cache;
 bool check_cache_dropped;
 struct {
@@ -351,6 +352,17 @@ static bool dio_byte_aligned(int fd)
 return false;
 }
 
+static bool raw_needs_alignment(BlockDriverState *bs)
+{
+BDRVRawState *s = bs->opaque;
+
+if ((bs->open_flags & BDRV_O_NOCACHE) != 0 && !dio_byte_aligned(s->fd)) {
+return true;
+}
+
+return s->force_alignment;
+}
+
 /* Check if read is allowed with given memory buffer and length.
  *
  * This function is used to check O_DIRECT memory buffer and request alignment.
@@ -728,9 +740,6 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 
 s->has_discard = true;
 s->has_write_zeroes = true;
-if ((bs->open_flags & BDRV_O_NOCACHE) != 0 && !dio_byte_aligned(s->fd)) {
-s->needs_alignment = true;
-}
 
 if (fstat(s->fd, ) < 0) {
 ret = -errno;
@@ -784,9 +793,10 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
  * so QEMU makes sure all IO operations on the device are aligned
  * to sector size, or else FreeBSD will reject them with EINVAL.
  */
-s->needs_alignment = true;
+s->force_alignment = true;
 }
 #endif
+s->needs_alignment = raw_needs_alignment(bs);
 
 #ifdef CONFIG_XFS
 if (platform_test_xfs_fd(s->fd)) {
@@ -1251,7 +1261,9 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 BDRVRawState *s = bs->opaque;
 struct stat st;
 
+s->needs_alignment = raw_needs_alignment(bs);
 raw_probe_alignment(bs, s->fd, errp);
+
 bs->bl.min_mem_alignment = s->buf_align;
 bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size);
 
diff --git a/tests/qemu-iotests/142 b/tests/qemu-iotests/142
index 69fd10ef51..07003c597a 100755
--- a/tests/qemu-iotests/142
+++ b/tests/qemu-iotests/142
@@ -350,6 +350,28 @@ info block backing-file"
 
 echo "$hmp_cmds" | run_qemu -drive "$files","$ids" | grep "Cache"
 
+echo
+echo "--- Alignment after changing O_DIRECT ---"
+echo
+
+# Directly test the protocol level: Can unaligned requests succeed even if
+# O_DIRECT was only enabled through a reopen and vice versa?
+
+$QEMU_IO --cache=writeback -f file $TEST_IMG <

[PULL 03/13] block: Unite remove_empty_child and child_free

2021-11-15 Thread Kevin Wolf
From: Hanna Reitz 

Now that bdrv_remove_empty_child() no longer removes the child from the
parent's children list but only checks that it is not in such a list, it
is only a wrapper around bdrv_child_free() that checks that the child is
empty and unused.  That should apply to all children that we free, so
put those checks into bdrv_child_free() and drop
bdrv_remove_empty_child().

Signed-off-by: Hanna Reitz 
Reviewed-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <2021120829.81329-4-hre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/block.c b/block.c
index ca024ffced..19bff4f95c 100644
--- a/block.c
+++ b/block.c
@@ -2740,19 +2740,19 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 }
 }
 
-static void bdrv_child_free(void *opaque)
-{
-BdrvChild *c = opaque;
-
-g_free(c->name);
-g_free(c);
-}
-
-static void bdrv_remove_empty_child(BdrvChild *child)
+/**
+ * Free the given @child.
+ *
+ * The child must be empty (i.e. `child->bs == NULL`) and it must be
+ * unused (i.e. not in a children list).
+ */
+static void bdrv_child_free(BdrvChild *child)
 {
 assert(!child->bs);
 assert(!child->next.le_prev); /* not in children list */
-bdrv_child_free(child);
+
+g_free(child->name);
+g_free(child);
 }
 
 typedef struct BdrvAttachChildCommonState {
@@ -2786,7 +2786,7 @@ static void bdrv_attach_child_common_abort(void *opaque)
 }
 
 bdrv_unref(bs);
-bdrv_remove_empty_child(child);
+bdrv_child_free(child);
 *s->child = NULL;
 }
 
@@ -2859,7 +2859,7 @@ static int bdrv_attach_child_common(BlockDriverState 
*child_bs,
 
 if (ret < 0) {
 error_propagate(errp, local_err);
-bdrv_remove_empty_child(new_child);
+bdrv_child_free(new_child);
 return ret;
 }
 }
@@ -2925,7 +2925,7 @@ static void bdrv_detach_child(BdrvChild *child)
 BlockDriverState *old_bs = child->bs;
 
 bdrv_replace_child_noperm(child, NULL);
-bdrv_remove_empty_child(child);
+bdrv_child_free(child);
 
 if (old_bs) {
 /*
-- 
2.31.1




[PULL 11/13] docs: Deprecate incorrectly typed device_add arguments

2021-11-15 Thread Kevin Wolf
While introducing a non-QemuOpts code path for device creation for JSON
-device, we noticed that QMP device_add doesn't check its input
correctly (accepting arguments that should have been rejected), and that
users may be relying on this behaviour (libvirt did until it was fixed
recently).

Let's use a deprecation period before we fix this bug in QEMU to avoid
nasty surprises for users.

Signed-off-by: Kevin Wolf 
Message-Id: <2021143530.18985-1-kw...@redhat.com>
Reviewed-by: Markus Armbruster 
Reviewed-by: Daniel P. Berrangé 
Signed-off-by: Kevin Wolf 
---
 docs/about/deprecated.rst | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 600031210d..c03fcf951f 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -250,6 +250,20 @@ options are removed in favor of using explicit 
``blockdev-create`` and
 ``blockdev-add`` calls. See :doc:`/interop/live-block-operations` for
 details.
 
+Incorrectly typed ``device_add`` arguments (since 6.2)
+''
+
+Due to shortcomings in the internal implementation of ``device_add``, QEMU
+incorrectly accepts certain invalid arguments: Any object or list arguments are
+silently ignored. Other argument types are not checked, but an implicit
+conversion happens, so that e.g. string values can be assigned to integer
+device properties or vice versa.
+
+This is a bug in QEMU that will be fixed in the future so that previously
+accepted incorrect commands will return an error. Users should make sure that
+all arguments passed to ``device_add`` are consistent with the documented
+property types.
+
 System accelerators
 ---
 
-- 
2.31.1




[PULL 08/13] block: Let replace_child_tran keep indirect pointer

2021-11-15 Thread Kevin Wolf
From: Hanna Reitz 

As of a future commit, bdrv_replace_child_noperm() will clear the
indirect BdrvChild pointer passed to it if the new child BDS is NULL.
bdrv_replace_child_tran() will want to let it do that, but revert this
change in its abort handler.  For that, we need to have it receive a
BdrvChild ** pointer, too, and keep it stored in the
BdrvReplaceChildState object that we attach to the transaction.

Note that we do not need to store it in the BdrvReplaceChildState when
new_bs is not NULL, because then there is nothing to revert.  This is
important so that bdrv_replace_node_noperm() can pass a pointer to a
loop-local variable to bdrv_replace_child_tran() without worrying that
this pointer will outlive one loop iteration.

(Of course, for that to work, bdrv_replace_node_noperm() and in turn
bdrv_replace_node() and its relatives may not be called with a NULL @to
node.  Luckily, they already are not, but now we should assert this.)

bdrv_remove_file_or_backing_child() on the other hand needs to ensure
that the indirect pointer it passes will stay valid for the duration of
the transaction.  Ensure this by keeping a strong reference to the BDS
whose >backing or >file it passes to bdrv_replace_child_tran(),
and giving up that reference only in the transaction .clean() handler.

Signed-off-by: Hanna Reitz 
Message-Id: <2021120829.81329-9-hre...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Kevin Wolf 
---
 block.c | 83 ++---
 1 file changed, 73 insertions(+), 10 deletions(-)

diff --git a/block.c b/block.c
index 8da057f800..a40027161c 100644
--- a/block.c
+++ b/block.c
@@ -2254,6 +2254,7 @@ static int bdrv_drv_set_perm(BlockDriverState *bs, 
uint64_t perm,
 
 typedef struct BdrvReplaceChildState {
 BdrvChild *child;
+BdrvChild **childp;
 BlockDriverState *old_bs;
 } BdrvReplaceChildState;
 
@@ -2269,7 +2270,29 @@ static void bdrv_replace_child_abort(void *opaque)
 BdrvReplaceChildState *s = opaque;
 BlockDriverState *new_bs = s->child->bs;
 
-/* old_bs reference is transparently moved from @s to @s->child */
+/*
+ * old_bs reference is transparently moved from @s to s->child.
+ *
+ * Pass >child here instead of s->childp, because:
+ * (1) s->old_bs must be non-NULL, so bdrv_replace_child_noperm() will not
+ * modify the BdrvChild * pointer we indirectly pass to it, i.e. it
+ * will not modify s->child.  From that perspective, it does not matter
+ * whether we pass s->childp or >child.
+ * (TODO: Right now, bdrv_replace_child_noperm() never modifies that
+ * pointer anyway (though it will in the future), so at this point it
+ * absolutely does not matter whether we pass s->childp or >child.)
+ * (2) If new_bs is not NULL, s->childp will be NULL.  We then cannot use
+ * it here.
+ * (3) If new_bs is NULL, *s->childp will have been NULLed by
+ * bdrv_replace_child_tran()'s bdrv_replace_child_noperm() call, and we
+ * must not pass a NULL *s->childp here.
+ * (TODO: In its current state, bdrv_replace_child_noperm() will not
+ * have NULLed *s->childp, so this does not apply yet.  It will in the
+ * future.)
+ *
+ * So whether new_bs was NULL or not, we cannot pass s->childp here; and in
+ * any case, there is no reason to pass it anyway.
+ */
 bdrv_replace_child_noperm(>child, s->old_bs);
 bdrv_unref(new_bs);
 }
@@ -2286,22 +2309,32 @@ static TransactionActionDrv bdrv_replace_child_drv = {
  * Note: real unref of old_bs is done only on commit.
  *
  * The function doesn't update permissions, caller is responsible for this.
+ *
+ * Note that if new_bs == NULL, @childp is stored in a state object attached
+ * to @tran, so that the old child can be reinstated in the abort handler.
+ * Therefore, if @new_bs can be NULL, @childp must stay valid until the
+ * transaction is committed or aborted.
+ *
+ * (TODO: The reinstating does not happen yet, but it will once
+ * bdrv_replace_child_noperm() NULLs *childp when new_bs is NULL.)
  */
-static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
+static void bdrv_replace_child_tran(BdrvChild **childp,
+BlockDriverState *new_bs,
 Transaction *tran)
 {
 BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
 *s = (BdrvReplaceChildState) {
-.child = child,
-.old_bs = child->bs,
+.child = *childp,
+.childp = new_bs == NULL ? childp : NULL,
+.old_bs = (*childp)->bs,
 };
 tran_add(tran, _replace_child_drv, s);
 
 if (new_bs) {
 bdrv_ref(new_bs);
 }
-bdrv_replace_child_noperm(, new_bs);
-/* old_bs reference is transparently moved from @child to @s */
+bdrv_replace_child_noperm(childp, new_bs);
+/* old_bs reference is transparently moved from 

[PULL 05/13] block: Pass BdrvChild ** to replace_child_noperm

2021-11-15 Thread Kevin Wolf
From: Hanna Reitz 

bdrv_replace_child_noperm() modifies BdrvChild.bs, and can potentially
set it to NULL.  That is dangerous, because BDS parents generally assume
that their children's .bs pointer is never NULL.  We therefore want to
let bdrv_replace_child_noperm() set the corresponding BdrvChild pointer
to NULL, too.

This patch lays the foundation for it by passing a BdrvChild ** pointer
to bdrv_replace_child_noperm() so that it can later use it to NULL the
BdrvChild pointer immediately after setting BdrvChild.bs to NULL.

(We will still need to undertake some intermediate steps, though.)

Signed-off-by: Hanna Reitz 
Message-Id: <2021120829.81329-6-hre...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Kevin Wolf 
---
 block.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block.c b/block.c
index c7d5aa5254..d668156eca 100644
--- a/block.c
+++ b/block.c
@@ -87,7 +87,7 @@ static BlockDriverState *bdrv_open_inherit(const char 
*filename,
 static bool bdrv_recurse_has_child(BlockDriverState *bs,
BlockDriverState *child);
 
-static void bdrv_replace_child_noperm(BdrvChild *child,
+static void bdrv_replace_child_noperm(BdrvChild **child,
   BlockDriverState *new_bs);
 static void bdrv_remove_file_or_backing_child(BlockDriverState *bs,
   BdrvChild *child,
@@ -2270,7 +2270,7 @@ static void bdrv_replace_child_abort(void *opaque)
 BlockDriverState *new_bs = s->child->bs;
 
 /* old_bs reference is transparently moved from @s to @s->child */
-bdrv_replace_child_noperm(s->child, s->old_bs);
+bdrv_replace_child_noperm(>child, s->old_bs);
 bdrv_unref(new_bs);
 }
 
@@ -2300,7 +2300,7 @@ static void bdrv_replace_child_tran(BdrvChild *child, 
BlockDriverState *new_bs,
 if (new_bs) {
 bdrv_ref(new_bs);
 }
-bdrv_replace_child_noperm(child, new_bs);
+bdrv_replace_child_noperm(, new_bs);
 /* old_bs reference is transparently moved from @child to @s */
 }
 
@@ -2672,9 +2672,10 @@ uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission 
qapi_perm)
 return permissions[qapi_perm];
 }
 
-static void bdrv_replace_child_noperm(BdrvChild *child,
+static void bdrv_replace_child_noperm(BdrvChild **childp,
   BlockDriverState *new_bs)
 {
+BdrvChild *child = *childp;
 BlockDriverState *old_bs = child->bs;
 int new_bs_quiesce_counter;
 int drain_saldo;
@@ -2767,7 +2768,7 @@ static void bdrv_attach_child_common_abort(void *opaque)
 BdrvChild *child = *s->child;
 BlockDriverState *bs = child->bs;
 
-bdrv_replace_child_noperm(child, NULL);
+bdrv_replace_child_noperm(s->child, NULL);
 
 if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
 bdrv_try_set_aio_context(bs, s->old_child_ctx, _abort);
@@ -2867,7 +2868,7 @@ static int bdrv_attach_child_common(BlockDriverState 
*child_bs,
 }
 
 bdrv_ref(child_bs);
-bdrv_replace_child_noperm(new_child, child_bs);
+bdrv_replace_child_noperm(_child, child_bs);
 
 *child = new_child;
 
@@ -2922,12 +2923,12 @@ static int bdrv_attach_child_noperm(BlockDriverState 
*parent_bs,
 return 0;
 }
 
-static void bdrv_detach_child(BdrvChild *child)
+static void bdrv_detach_child(BdrvChild **childp)
 {
-BlockDriverState *old_bs = child->bs;
+BlockDriverState *old_bs = (*childp)->bs;
 
-bdrv_replace_child_noperm(child, NULL);
-bdrv_child_free(child);
+bdrv_replace_child_noperm(childp, NULL);
+bdrv_child_free(*childp);
 
 if (old_bs) {
 /*
@@ -3033,7 +3034,7 @@ void bdrv_root_unref_child(BdrvChild *child)
 BlockDriverState *child_bs;
 
 child_bs = child->bs;
-bdrv_detach_child(child);
+bdrv_detach_child();
 bdrv_unref(child_bs);
 }
 
-- 
2.31.1




[PULL 13/13] softmmu/qdev-monitor: fix use-after-free in qdev_set_id()

2021-11-15 Thread Kevin Wolf
From: Stefan Hajnoczi 

Reported by Coverity (CID 1465222).

Fixes: 4a1d937796de0fecd8b22d7dbebf87f38e8282fd ("softmmu/qdev-monitor: add 
error handling in qdev_set_id")
Cc: Damien Hedde 
Cc: Kevin Wolf 
Cc: Michael S. Tsirkin 
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20211102163342.31162-1-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Michael S. Tsirkin 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Marc-André Lureau 
Reviewed-by: Damien Hedde 
Reviewed-by: Markus Armbruster 
Signed-off-by: Kevin Wolf 
---
 softmmu/qdev-monitor.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index b5aaae4b8c..01ec420e61 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -593,8 +593,8 @@ const char *qdev_set_id(DeviceState *dev, char *id, Error 
**errp)
 if (prop) {
 dev->id = id;
 } else {
-g_free(id);
 error_setg(errp, "Duplicate device ID '%s'", id);
+g_free(id);
 return NULL;
 }
 } else {
-- 
2.31.1




[PULL 04/13] block: Drop detached child from ignore list

2021-11-15 Thread Kevin Wolf
From: Hanna Reitz 

bdrv_attach_child_common_abort() restores the parent's AioContext.  To
do so, the child (which was supposed to be attached, but is now detached
again by this abort handler) is added to the ignore list for the
AioContext changing functions.

However, since we modify a BDS's children list in the BdrvChildClass's
.attach and .detach handlers, the child is already effectively detached
from the parent by this point.  We do not need to put it into the ignore
list.

Use this opportunity to clean up the empty line structure: Keep setting
the ignore list, invoking the AioContext function, and freeing the
ignore list in blocks separated by empty lines.

Signed-off-by: Hanna Reitz 
Reviewed-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <2021120829.81329-5-hre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 19bff4f95c..c7d5aa5254 100644
--- a/block.c
+++ b/block.c
@@ -2774,14 +2774,16 @@ static void bdrv_attach_child_common_abort(void *opaque)
 }
 
 if (bdrv_child_get_parent_aio_context(child) != s->old_parent_ctx) {
-GSList *ignore = g_slist_prepend(NULL, child);
+GSList *ignore;
 
+/* No need to ignore `child`, because it has been detached already */
+ignore = NULL;
 child->klass->can_set_aio_ctx(child, s->old_parent_ctx, ,
   _abort);
 g_slist_free(ignore);
-ignore = g_slist_prepend(NULL, child);
-child->klass->set_aio_ctx(child, s->old_parent_ctx, );
 
+ignore = NULL;
+child->klass->set_aio_ctx(child, s->old_parent_ctx, );
 g_slist_free(ignore);
 }
 
-- 
2.31.1




[PULL 01/13] stream: Traverse graph after modification

2021-11-15 Thread Kevin Wolf
From: Hanna Reitz 

bdrv_cor_filter_drop() modifies the block graph.  That means that other
parties can also modify the block graph before it returns.  Therefore,
we cannot assume that the result of a graph traversal we did before
remains valid afterwards.

We should thus fetch `base` and `unfiltered_base` afterwards instead of
before.

Signed-off-by: Hanna Reitz 
Reviewed-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <2021120829.81329-2-hre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block/stream.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index 97bee482dc..e45113aed6 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -54,8 +54,8 @@ static int stream_prepare(Job *job)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
-BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base);
-BlockDriverState *unfiltered_base = bdrv_skip_filters(base);
+BlockDriverState *base;
+BlockDriverState *unfiltered_base;
 Error *local_err = NULL;
 int ret = 0;
 
@@ -63,6 +63,9 @@ static int stream_prepare(Job *job)
 bdrv_cor_filter_drop(s->cor_filter_bs);
 s->cor_filter_bs = NULL;
 
+base = bdrv_filter_or_cow_bs(s->above_base);
+unfiltered_base = bdrv_skip_filters(base);
+
 if (bdrv_cow_child(unfiltered_bs)) {
 const char *base_id = NULL, *base_fmt = NULL;
 if (unfiltered_base) {
-- 
2.31.1




[PULL 06/13] block: Restructure remove_file_or_backing_child()

2021-11-15 Thread Kevin Wolf
From: Hanna Reitz 

As of a future patch, bdrv_replace_child_tran() will take a BdrvChild **
pointer.  Prepare for that by getting such a pointer and using it where
applicable, and (dereferenced) as a parameter for
bdrv_replace_child_tran().

Signed-off-by: Hanna Reitz 
Message-Id: <2021120829.81329-7-hre...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Kevin Wolf 
---
 block.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index d668156eca..8da057f800 100644
--- a/block.c
+++ b/block.c
@@ -4887,30 +4887,33 @@ static void 
bdrv_remove_file_or_backing_child(BlockDriverState *bs,
   BdrvChild *child,
   Transaction *tran)
 {
+BdrvChild **childp;
 BdrvRemoveFilterOrCowChild *s;
 
-assert(child == bs->backing || child == bs->file);
-
 if (!child) {
 return;
 }
 
+if (child == bs->backing) {
+childp = >backing;
+} else if (child == bs->file) {
+childp = >file;
+} else {
+g_assert_not_reached();
+}
+
 if (child->bs) {
-bdrv_replace_child_tran(child, NULL, tran);
+bdrv_replace_child_tran(*childp, NULL, tran);
 }
 
 s = g_new(BdrvRemoveFilterOrCowChild, 1);
 *s = (BdrvRemoveFilterOrCowChild) {
 .child = child,
-.is_backing = (child == bs->backing),
+.is_backing = (childp == >backing),
 };
 tran_add(tran, _remove_filter_or_cow_child_drv, s);
 
-if (s->is_backing) {
-bs->backing = NULL;
-} else {
-bs->file = NULL;
-}
+*childp = NULL;
 }
 
 /*
-- 
2.31.1




[PULL 10/13] iotests/030: Unthrottle parallel jobs in reverse

2021-11-15 Thread Kevin Wolf
From: Hanna Reitz 

See the comment for why this is necessary.

Signed-off-by: Hanna Reitz 
Message-Id: <2021120829.81329-11-hre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/030 | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
index 5fb65b4bef..567bf1da67 100755
--- a/tests/qemu-iotests/030
+++ b/tests/qemu-iotests/030
@@ -251,7 +251,16 @@ class TestParallelOps(iotests.QMPTestCase):
  speed=1024)
 self.assert_qmp(result, 'return', {})
 
-for job in pending_jobs:
+# Do this in reverse: After unthrottling them, some jobs may finish
+# before we have unthrottled all of them.  This will drain their
+# subgraph, and this will make jobs above them advance (despite those
+# jobs on top being throttled).  In the worst case, all jobs below the
+# top one are finished before we can unthrottle it, and this makes it
+# advance so far that it completes before we can unthrottle it - which
+# results in an error.
+# Starting from the top (i.e. in reverse) does not have this problem:
+# When a job finishes, the ones below it are not advanced.
+for job in reversed(pending_jobs):
 result = self.vm.qmp('block-job-set-speed', device=job, speed=0)
 self.assert_qmp(result, 'return', {})
 
-- 
2.31.1




[PULL 02/13] block: Manipulate children list in .attach/.detach

2021-11-15 Thread Kevin Wolf
From: Hanna Reitz 

The children list is specific to BDS parents.  We should not modify it
in the general children modification code, but let BDS parents deal with
it in their .attach() and .detach() methods.

This also has the advantage that a BdrvChild is removed from the
children list before its .bs pointer can become NULL.  BDS parents
generally assume that their children's .bs pointer is never NULL, so
this is actually a bug fix.

Signed-off-by: Hanna Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <2021120829.81329-3-hre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index 580cb77a70..ca024ffced 100644
--- a/block.c
+++ b/block.c
@@ -1387,6 +1387,8 @@ static void bdrv_child_cb_attach(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
 
+QLIST_INSERT_HEAD(>children, child, next);
+
 if (child->role & BDRV_CHILD_COW) {
 bdrv_backing_attach(child);
 }
@@ -1403,6 +1405,8 @@ static void bdrv_child_cb_detach(BdrvChild *child)
 }
 
 bdrv_unapply_subtree_drain(child, bs);
+
+QLIST_REMOVE(child, next);
 }
 
 static int bdrv_child_cb_update_filename(BdrvChild *c, BlockDriverState *base,
@@ -2747,7 +2751,7 @@ static void bdrv_child_free(void *opaque)
 static void bdrv_remove_empty_child(BdrvChild *child)
 {
 assert(!child->bs);
-QLIST_SAFE_REMOVE(child, next);
+assert(!child->next.le_prev); /* not in children list */
 bdrv_child_free(child);
 }
 
@@ -2913,12 +2917,6 @@ static int bdrv_attach_child_noperm(BlockDriverState 
*parent_bs,
 return ret;
 }
 
-QLIST_INSERT_HEAD(_bs->children, *child, next);
-/*
- * child is removed in bdrv_attach_child_common_abort(), so don't care to
- * abort this change separately.
- */
-
 return 0;
 }
 
@@ -4851,7 +4849,6 @@ static void bdrv_remove_filter_or_cow_child_abort(void 
*opaque)
 BdrvRemoveFilterOrCowChild *s = opaque;
 BlockDriverState *parent_bs = s->child->opaque;
 
-QLIST_INSERT_HEAD(_bs->children, s->child, next);
 if (s->is_backing) {
 parent_bs->backing = s->child;
 } else {
@@ -4906,7 +4903,6 @@ static void 
bdrv_remove_file_or_backing_child(BlockDriverState *bs,
 };
 tran_add(tran, _remove_filter_or_cow_child_drv, s);
 
-QLIST_SAFE_REMOVE(child, next);
 if (s->is_backing) {
 bs->backing = NULL;
 } else {
-- 
2.31.1




[PULL 00/13] Block layer patches

2021-11-15 Thread Kevin Wolf
The following changes since commit 42f6c9179be4401974dd3a75ee72defd16b5092d:

  Merge tag 'pull-ppc-2022' of https://github.com/legoater/qemu into 
staging (2021-11-12 12:28:25 +0100)

are available in the Git repository at:

  git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 7461272c5f6032436ef9032c091c0118539483e4:

  softmmu/qdev-monitor: fix use-after-free in qdev_set_id() (2021-11-15 
15:49:46 +0100)


Block layer patches

- Fixes to image streaming job and block layer reconfiguration to make
  iotest 030 pass again
- docs: Deprecate incorrectly typed device_add arguments
- file-posix: Fix alignment after reopen changing O_DIRECT


Hanna Reitz (10):
  stream: Traverse graph after modification
  block: Manipulate children list in .attach/.detach
  block: Unite remove_empty_child and child_free
  block: Drop detached child from ignore list
  block: Pass BdrvChild ** to replace_child_noperm
  block: Restructure remove_file_or_backing_child()
  transactions: Invoke clean() after everything else
  block: Let replace_child_tran keep indirect pointer
  block: Let replace_child_noperm free children
  iotests/030: Unthrottle parallel jobs in reverse

Kevin Wolf (2):
  docs: Deprecate incorrectly typed device_add arguments
  file-posix: Fix alignment after reopen changing O_DIRECT

Stefan Hajnoczi (1):
  softmmu/qdev-monitor: fix use-after-free in qdev_set_id()

 docs/about/deprecated.rst   |  14 +++
 include/qemu/transactions.h |   3 +
 block.c | 233 +---
 block/file-posix.c  |  20 +++-
 block/stream.c  |   7 +-
 softmmu/qdev-monitor.c  |   2 +-
 util/transactions.c |   8 +-
 tests/qemu-iotests/030  |  11 ++-
 tests/qemu-iotests/142  |  22 +
 tests/qemu-iotests/142.out  |  15 +++
 10 files changed, 269 insertions(+), 66 deletions(-)




Re: [PATCH v4 22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers

2021-11-15 Thread Hanna Reitz

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
  block.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/block.c b/block.c
index 40c4729b8d..da80e89ad4 100644
--- a/block.c
+++ b/block.c


[...]


@@ -7173,6 +7182,7 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild *c, 
AioContext *ctx,
  return true;
  }
  *ignore = g_slist_prepend(*ignore, c);
+assert(qemu_in_main_thread());


It definitely isn’t wrong to place the assert() here, of course, but 
it’s an interesting place nonetheless.  In other places it seemed like 
you’d prefer to place it above the first non-declaration statement.  
Absolutely no need to change it if you don’t want to, just something 
that I noticed.


Hanna




Re: [PATCH v4 21/25] block_int-common.h: split function pointers in BdrvChildClass

2021-11-15 Thread Hanna Reitz

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
  include/block/block_int-common.h | 51 
  1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 9857e775fe..ea16099c53 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h


[...]


+/*
+ * I/O API functions. These functions are thread-safe.
+ *
+ * See include/block/block-io.h for more information about
+ * the I/O API.
+ */
+
+void (*resize)(BdrvChild *child);
+
  /*
   * If this pair of functions is implemented, the parent doesn't issue new
   * requests after returning from .drained_begin() until .drained_end() is
@@ -859,23 +889,6 @@ struct BdrvChildClass {
   */
  void (*activate)(BdrvChild *child, Error **errp);
  int (*inactivate)(BdrvChild *child);


These are now I/O functions, but I believe they should be GS functions: 
BlockBackend’s implementation for .activate (blk_root_activate()) 
invokes blk_set_perm(), which is a GS function.  Its .inactivate 
implementation (blk_root_inactivate()) invokes bdrv_child_try_set_perm().


(I also believe this makes sense in context: .activate is called by 
bdrv_co_invalidate_cache(), which should be a GS function as I’ve 
suggested in my reply to patch 20.  .inactivate is called by 
bdrv_inactivate_recurse(), which is a GS function, too.)


Hanna




Re: [PATCH 2/2] virtio-blk: support BLKSECDISCARD

2021-11-15 Thread Stefan Hajnoczi
On Mon, Nov 15, 2021 at 12:52:00PM +0800, yadong...@intel.com wrote:

The Linux block layer shares the DISCARD queue limits with SECDISCARD.
That's different from BLKZEROOUT (QEMU's WRITE_ZEROES). This is a Linux
implementation detail but I guess virtio-blk can share the DISCARD
limits with SECDISCARD too...

> @@ -622,6 +628,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, 
> MultiReqBuffer *mrb)
>  unsigned out_num = req->elem.out_num;
>  VirtIOBlock *s = req->dev;
>  VirtIODevice *vdev = VIRTIO_DEVICE(s);
> +bool is_secdiscard = false;
>  
>  if (req->elem.out_num < 1 || req->elem.in_num < 1) {
>  virtio_error(vdev, "virtio-blk missing headers");
> @@ -722,6 +729,9 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, 
> MultiReqBuffer *mrb)
>   * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch 
> statement,
>   * so we must mask it for these requests, then we will check if it is 
> set.
>   */
> +case VIRTIO_BLK_T_SECDISCARD & ~VIRTIO_BLK_T_OUT:
> +is_secdiscard = true;
> +__attribute__((fallthrough));

The DISCARD case doesn't use __attribute__((fallthrough)) so this is
inconsistent. QEMU doesn't use __attribute__((fallthrough)) so I suggest
dropping this.

> diff --git a/include/standard-headers/linux/virtio_blk.h 
> b/include/standard-headers/linux/virtio_blk.h
> index 2dcc90826a..c55a07840c 100644
> --- a/include/standard-headers/linux/virtio_blk.h
> +++ b/include/standard-headers/linux/virtio_blk.h
> @@ -40,6 +40,7 @@
>  #define VIRTIO_BLK_F_MQ  12  /* support more than one vq */
>  #define VIRTIO_BLK_F_DISCARD 13  /* DISCARD is supported */
>  #define VIRTIO_BLK_F_WRITE_ZEROES14  /* WRITE ZEROES is supported */
> +#define VIRTIO_BLK_F_SECDISCARD  15  /* WRITE ZEROES is supported */

The comment is copy-pasted from WRITE_ZEROES.


signature.asc
Description: PGP signature


Re: [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers

2021-11-15 Thread Hanna Reitz

On 15.11.21 13:48, Hanna Reitz wrote:

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
  block.c | 17 +
  1 file changed, 17 insertions(+)

diff --git a/block.c b/block.c
index 94bff5c757..40c4729b8d 100644
--- a/block.c
+++ b/block.c


[...]

@@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState 
*bs, BlockDriverState *child_bs,

  uint64_t *nperm, uint64_t *nshared)
  {
  assert(bs->drv && bs->drv->bdrv_child_perm);
+    assert(qemu_in_main_thread());
  bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
   parent_perm, parent_shared,
   nperm, nshared);


(Should’ve noticed earlier, but only did now...)

First, this function is indirectly called by bdrv_refresh_perms(). I 
understand that all perm-related functions are classified as GS.


However, bdrv_co_invalidate_cache() invokes bdrv_refresh_perms. Being 
declared in block/coroutine.h, it’s an I/O function, so it mustn’t 
call such a GS function. BlockDriver.bdrv_co_invalidate_cache(), 
bdrv_invalidate_cache(), and blk_invalidate_cache() are also 
classified as I/O functions. Perhaps all of these functions should be 
classified as GS functions?  I believe their callers and their purpose 
would allow for this.



Second, it’s called by bdrv_child_refresh_perms(), which is called by 
block_crypto_amend_options_generic_luks().  This function is called by 
block_crypto_co_amend_luks(), which is a BlockDriver.bdrv_co_amend 
implementation, which is classified as an I/O function.


Honestly, I don’t know how to fix that mess.  The best would be if we 
could make the perm functions thread-safe and classify them as I/O, 
but it seems to me like that’s impossible (I sure hope I’m wrong).  On 
the other hand, .bdrv_co_amend very much strikes me like a GS 
function, but it isn’t.  I’m afraid it must work on nodes that are not 
in the main context, and it launches a job, so AFAIU we absolutely 
cannot run it under the BQL.


It almost seems to me like we’d need a thread-safe variant of the perm 
functions that’s allowed to fail when it cannot guarantee thread 
safety or something.  Or perhaps I’m wrong and the perm functions can 
actually be classified as thread-safe and I/O, that’d be great…


Hm.  Can we perhaps let block_crypto_amend_options_generic_luks() take 
the BQL just for the permission adjustment (i.e. the 
bdrv_child_refresh_perms() call)?  Would that work? :/


Hanna




Re: [PATCH 1/2] block:hdev: support BLKSECDISCARD

2021-11-15 Thread Stefan Hajnoczi
On Mon, Nov 15, 2021 at 12:51:59PM +0800, yadong...@intel.com wrote:
> From: Yadong Qi 
> 
> Add a new option "secdiscard" for block drive. Parse it and
> record it in bs->open_flags as bit(BDRV_O_SECDISCARD).
> 
> Add a new BDRV_REQ_SECDISCARD bit for secure discard request
> from virtual device.
> 
> For host_device backend: implement by ioctl(BLKSECDISCARD) on
> real host block device.
> For other backend, no implementation.
> 
> E.g.:
> qemu-system-x86_64 \
> ... \
> -drive 
> file=/dev/mmcblk0p2,if=none,format=raw,discard=on,secdiscard=on,id=sd0 \
> -device virtio-blk-pci,drive=sd0,id=sd0_vblk \
> ...

I'm curious why there is explicit control over this feature (-drive
secdiscard=on|off). For example, is there a reason why users would want
to disable secdiscard on a drive that supports it? I guess there is no
way to emulate it correctly so secdiscard=on on a drive that doesn't
support it produces an error?


signature.asc
Description: PGP signature


Re: [PATCH RFC 0/2] Eliminate drive_get_next()

2021-11-15 Thread Peter Maydell
On Mon, 15 Nov 2021 at 12:55, Markus Armbruster  wrote:
>
> This is RFC because I'm unsure about the removal of
>
> /* Reason: init() method uses drive_get_next() */
> dc->user_creatable = false;
>
> in PATCH 1.  Both users appear to wire up some GPIO.  If that's
> necessary for the thing to work, we should just replace the comment.

Looking at the code, it sort of is and sort of isn't. The GPIO line
is the chip-select line. If you don't connect it then (because the
ssi-sd device configures cs_polarity to SSI_CS_LOW, requesting an
active-low chip-select) the device will always be selected. If
the machine created an SSI bus with no SSI device attached to it
then in theory the user could create an ssi-sd device and connect
it there and have it work. But in practice it's really unlikely:
machines create SSI buses with specific wired-in devices on them,
and the guest OS knows what it has to do to enable the chip select
for the device it wants to talk to (often some known GPIO pin on
a GPIO controller).

So I would keep the user_creatable = false, with a reason of
"user should wire up GPIO chip-select line". But see below for
a pile of contrary precedent.

(The chip-select GPIO is created in the parent class, incidentally.)

> Aside: there may be devices that need manual wiring to work, yet don't
> have user_creatable unset.  Bugs if you ask me.  I don't have smart
> ideas on how to track them down.

Me neither. I notice that the TYPE_M25P80 is also an SSI peripheral
with an active-low chipselect but that one doesn't set user_creatable
to false. TYPE_SSD0323 also is user-creatable and that one has an
active-high chipselect, which means the user can create a device but
it will then never be usable because it will ignore all transactions.
(More generally, looks like most subclasses of TYPE_SSI_PERIPHERAL
don't set user_creatable = false.)

-- PMM



Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()

2021-11-15 Thread Philippe Mathieu-Daudé
On 11/15/21 13:55, Markus Armbruster wrote:
> drive_get_next() is basically a bad idea.  It returns the "next" block
> backend of a certain interface type.  "Next" means bus=0,unit=N, where
> subsequent calls count N up from zero, per interface type.
> 
> This lets you define unit numbers implicitly by execution order.  If the
> order changes, or new calls appear "in the middle", unit numbers change.
> ABI break.  Hard to spot in review.
> 
> Explicit is better than implicit: use drive_get() directly.
> 
> Signed-off-by: Markus Armbruster 
> ---
>  include/sysemu/blockdev.h   |  1 -
>  blockdev.c  | 10 --
>  hw/arm/aspeed.c | 21 +
>  hw/arm/cubieboard.c |  2 +-
>  hw/arm/imx25_pdk.c  |  2 +-
>  hw/arm/integratorcp.c   |  2 +-
>  hw/arm/mcimx6ul-evk.c   |  2 +-
>  hw/arm/mcimx7d-sabre.c  |  2 +-
>  hw/arm/msf2-som.c   |  2 +-
>  hw/arm/npcm7xx_boards.c |  6 +++---
>  hw/arm/orangepi.c   |  2 +-
>  hw/arm/raspi.c  |  2 +-
>  hw/arm/realview.c   |  2 +-
>  hw/arm/sabrelite.c  |  2 +-
>  hw/arm/versatilepb.c|  4 ++--
>  hw/arm/vexpress.c   |  6 +++---
>  hw/arm/xilinx_zynq.c| 16 +---
>  hw/arm/xlnx-versal-virt.c   |  3 ++-
>  hw/arm/xlnx-zcu102.c|  6 +++---
>  hw/microblaze/petalogix_ml605_mmu.c |  2 +-
>  hw/misc/sifive_u_otp.c  |  2 +-
>  hw/riscv/microchip_pfsoc.c  |  2 +-
>  hw/sparc64/niagara.c|  2 +-
>  23 files changed, 49 insertions(+), 52 deletions(-)

> @@ -435,11 +438,13 @@ static void aspeed_machine_init(MachineState *machine)
>  }
>  
>  for (i = 0; i < bmc->soc.sdhci.num_slots; i++) {
> -sdhci_attach_drive(>soc.sdhci.slots[i], drive_get_next(IF_SD));
> +sdhci_attach_drive(>soc.sdhci.slots[i],
> +   drive_get(IF_SD, 0, i));

If we put SD on bus #0, ...

>  }
>  
>  if (bmc->soc.emmc.num_slots) {
> -sdhci_attach_drive(>soc.emmc.slots[0], drive_get_next(IF_SD));
> +sdhci_attach_drive(>soc.emmc.slots[0],
> +   drive_get(IF_SD, 0, bmc->soc.sdhci.num_slots));

... we'd want to put eMMC on bus #1, but I see having eMMC cards on a
IF_SD bus as a bug, since these cards are soldered on the board.

> --- a/hw/arm/vexpress.c
> +++ b/hw/arm/vexpress.c
> @@ -625,7 +625,7 @@ static void vexpress_common_init(MachineState *machine)
>qdev_get_gpio_in(sysctl, 
> ARM_SYSCTL_GPIO_MMC_WPROT));
>  qdev_connect_gpio_out_named(dev, "card-inserted", 0,
>qdev_get_gpio_in(sysctl, 
> ARM_SYSCTL_GPIO_MMC_CARDIN));
> -dinfo = drive_get_next(IF_SD);
> +dinfo = drive_get(IF_SD, 0, 0);

Can we have one interface refactor per patch (IF_SD, IF_PFLASH, IF_MTD...)?

> @@ -657,7 +657,7 @@ static void vexpress_common_init(MachineState *machine)
>  
>  sysbus_create_simple("pl111", map[VE_CLCD], pic[14]);
>  
> -dinfo = drive_get_next(IF_PFLASH);
> +dinfo = drive_get(IF_PFLASH, 0, 0);

> -static inline void zynq_init_spi_flashes(uint32_t base_addr, qemu_irq irq,
> - bool is_qspi)
> +static inline int zynq_init_spi_flashes(uint32_t base_addr, qemu_irq irq,
> +bool is_qspi, int unit0)
>  {
> +int unit = unit0;
>  DeviceState *dev;
>  SysBusDevice *busdev;
>  SSIBus *spi;
> @@ -156,7 +157,7 @@ static inline void zynq_init_spi_flashes(uint32_t 
> base_addr, qemu_irq irq,
>  spi = (SSIBus *)qdev_get_child_bus(dev, bus_name);
>  
>  for (j = 0; j < num_ss; ++j) {
> -DriveInfo *dinfo = drive_get_next(IF_MTD);
> +DriveInfo *dinfo = drive_get(IF_MTD, 0, unit++);

> diff --git a/hw/arm/xlnx-zcu102.c b/hw/arm/xlnx-zcu102.c
> index 3dc2b5e8ca..45eb19ab3b 100644
> --- a/hw/arm/xlnx-zcu102.c
> +++ b/hw/arm/xlnx-zcu102.c
> @@ -190,7 +190,7 @@ static void xlnx_zcu102_init(MachineState *machine)
>  BusState *spi_bus;
>  DeviceState *flash_dev;
>  qemu_irq cs_line;
> -DriveInfo *dinfo = drive_get_next(IF_MTD);
> +DriveInfo *dinfo = drive_get(IF_MTD, 0, i);

If this is bus #0, ...

>  gchar *bus_name = g_strdup_printf("spi%d", i);
>  
>  spi_bus = qdev_get_child_bus(DEVICE(>soc), bus_name);
> @@ -212,7 +212,7 @@ static void xlnx_zcu102_init(MachineState *machine)
>  BusState *spi_bus;
>  DeviceState *flash_dev;
>  qemu_irq cs_line;
> -DriveInfo *dinfo = drive_get_next(IF_MTD);
> +DriveInfo *dinfo = drive_get(IF_MTD, 0, XLNX_ZYNQMP_NUM_SPIS + i);

... I'd expect we use bus #1 here (different connector on the board).

>  int bus = i / XLNX_ZYNQMP_NUM_QSPI_BUS_CS;
>  gchar *bus_name = 

Re: [PATCH v2 10/10] iotests/030: Unthrottle parallel jobs in reverse

2021-11-15 Thread Hanna Reitz

On 12.11.21 17:25, Vladimir Sementsov-Ogievskiy wrote:

11.11.2021 15:08, Hanna Reitz wrote:

See the comment for why this is necessary.

Signed-off-by: Hanna Reitz 
---
  tests/qemu-iotests/030 | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
index 5fb65b4bef..567bf1da67 100755
--- a/tests/qemu-iotests/030
+++ b/tests/qemu-iotests/030
@@ -251,7 +251,16 @@ class TestParallelOps(iotests.QMPTestCase):
   speed=1024)
  self.assert_qmp(result, 'return', {})
  -    for job in pending_jobs:
+    # Do this in reverse: After unthrottling them, some jobs may 
finish
+    # before we have unthrottled all of them.  This will drain 
their
+    # subgraph, and this will make jobs above them advance 
(despite those
+    # jobs on top being throttled).  In the worst case, all jobs 
below the
+    # top one are finished before we can unthrottle it, and this 
makes it
+    # advance so far that it completes before we can unthrottle 
it - which

+    # results in an error.
+    # Starting from the top (i.e. in reverse) does not have this 
problem:

+    # When a job finishes, the ones below it are not advanced.


Hmm, interesting why only jobs above the finished job may advance in 
the situation..


Looks like something may change and this workaround will stop working.

Isn't it better just handle the error, and don't care if job was just 
finished?


Something like

if result['return'] != {}:
   # Job was finished during drain caused by finish of already 
unthrottled job

   self.assert_qmp(result, 'error/class', 'DeviceNotActive')


Well.  My explanation (excuse) is that I felt like this was the hack-ish 
solution that I could have gone for from the start without understanding 
what the issue is (and in fact it was the solution I used while 
debugging the other problems).  I went with `reversed()`, because that 
really addresses the problem.


You’re right in that it only addresses the problem for now and there’s a 
chance it might reappear.  If we want to go with ignoring 
DeviceNotActive errors, then I think we should at least query all block 
jobs before the unthrottle loop and see that at least at one point they 
were all running simultaneously.


I don’t really have a strong opinion.  We can exchange this patch now 
(though I’d rather not hold up the rest of the series for it), or have a 
patch on top later, or, well, just keep it for now.  I think the least 
stressful option would be to just fix it up later.


Hanna




Re: [PATCH RFC 1/2] hw/sd/ssi-sd: Do not create SD card within controller's realize

2021-11-15 Thread Markus Armbruster
Peter Maydell  writes:

> On Mon, 15 Nov 2021 at 12:56, Markus Armbruster  wrote:
>>
>> ssi_sd_realize() creates an "sd-card" device.  This is inappropriate,
>> and marked FIXME.
>>
>> Move it to the boards that create these devices.  Prior art: commit
>> eb4f566bbb for device "generic-sdhci", and commit 26c607b86b for
>> device "pl181".
>>
>> Signed-off-by: Markus Armbruster 
>> ---
>
>> @@ -670,7 +672,7 @@ static void sifive_u_machine_init(MachineState *machine)
>>
>>  /* Connect an SPI flash to SPI0 */
>>  flash_dev = qdev_new("is25wp256");
>> -dinfo = drive_get_next(IF_MTD);
>> +dinfo = drive_get(IF_MTD, 0, 0);
>>  if (dinfo) {
>>  qdev_prop_set_drive_err(flash_dev, "drive",
>>  blk_by_legacy_dinfo(dinfo),
>
>
> This part looks like it should have been in the other patch.

You're right.  Thanks!




Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()

2021-11-15 Thread Markus Armbruster
Peter Maydell  writes:

> On Mon, 15 Nov 2021 at 12:55, Markus Armbruster  wrote:
>>
>> drive_get_next() is basically a bad idea.  It returns the "next" block
>> backend of a certain interface type.  "Next" means bus=0,unit=N, where
>> subsequent calls count N up from zero, per interface type.
>>
>> This lets you define unit numbers implicitly by execution order.  If the
>> order changes, or new calls appear "in the middle", unit numbers change.
>> ABI break.  Hard to spot in review.
>>
>> Explicit is better than implicit: use drive_get() directly.
>>
>> Signed-off-by: Markus Armbruster 
>> ---
>>  include/sysemu/blockdev.h   |  1 -
>>  blockdev.c  | 10 --
>>  hw/arm/aspeed.c | 21 +
>>  hw/arm/cubieboard.c |  2 +-
>>  hw/arm/imx25_pdk.c  |  2 +-
>>  hw/arm/integratorcp.c   |  2 +-
>>  hw/arm/mcimx6ul-evk.c   |  2 +-
>>  hw/arm/mcimx7d-sabre.c  |  2 +-
>>  hw/arm/msf2-som.c   |  2 +-
>>  hw/arm/npcm7xx_boards.c |  6 +++---
>>  hw/arm/orangepi.c   |  2 +-
>>  hw/arm/raspi.c  |  2 +-
>>  hw/arm/realview.c   |  2 +-
>>  hw/arm/sabrelite.c  |  2 +-
>>  hw/arm/versatilepb.c|  4 ++--
>>  hw/arm/vexpress.c   |  6 +++---
>>  hw/arm/xilinx_zynq.c| 16 +---
>>  hw/arm/xlnx-versal-virt.c   |  3 ++-
>>  hw/arm/xlnx-zcu102.c|  6 +++---
>>  hw/microblaze/petalogix_ml605_mmu.c |  2 +-
>>  hw/misc/sifive_u_otp.c  |  2 +-
>>  hw/riscv/microchip_pfsoc.c  |  2 +-
>>  hw/sparc64/niagara.c|  2 +-
>>  23 files changed, 49 insertions(+), 52 deletions(-)
>
> This would be easier to review if it didn't try to change all of
> these board/SoC models at once. Each one of them is entirely
> separate review work.

Happy to split, but first I'd like to get some advice on the part I'm
unsure about; see my cover letter.




Re: [PATCH RFC 1/2] hw/sd/ssi-sd: Do not create SD card within controller's realize

2021-11-15 Thread Peter Maydell
On Mon, 15 Nov 2021 at 12:56, Markus Armbruster  wrote:
>
> ssi_sd_realize() creates an "sd-card" device.  This is inappropriate,
> and marked FIXME.
>
> Move it to the boards that create these devices.  Prior art: commit
> eb4f566bbb for device "generic-sdhci", and commit 26c607b86b for
> device "pl181".
>
> Signed-off-by: Markus Armbruster 
> ---

> @@ -670,7 +672,7 @@ static void sifive_u_machine_init(MachineState *machine)
>
>  /* Connect an SPI flash to SPI0 */
>  flash_dev = qdev_new("is25wp256");
> -dinfo = drive_get_next(IF_MTD);
> +dinfo = drive_get(IF_MTD, 0, 0);
>  if (dinfo) {
>  qdev_prop_set_drive_err(flash_dev, "drive",
>  blk_by_legacy_dinfo(dinfo),


This part looks like it should have been in the other patch.

-- PMM



Re: [PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()

2021-11-15 Thread Peter Maydell
On Mon, 15 Nov 2021 at 12:55, Markus Armbruster  wrote:
>
> drive_get_next() is basically a bad idea.  It returns the "next" block
> backend of a certain interface type.  "Next" means bus=0,unit=N, where
> subsequent calls count N up from zero, per interface type.
>
> This lets you define unit numbers implicitly by execution order.  If the
> order changes, or new calls appear "in the middle", unit numbers change.
> ABI break.  Hard to spot in review.
>
> Explicit is better than implicit: use drive_get() directly.
>
> Signed-off-by: Markus Armbruster 
> ---
>  include/sysemu/blockdev.h   |  1 -
>  blockdev.c  | 10 --
>  hw/arm/aspeed.c | 21 +
>  hw/arm/cubieboard.c |  2 +-
>  hw/arm/imx25_pdk.c  |  2 +-
>  hw/arm/integratorcp.c   |  2 +-
>  hw/arm/mcimx6ul-evk.c   |  2 +-
>  hw/arm/mcimx7d-sabre.c  |  2 +-
>  hw/arm/msf2-som.c   |  2 +-
>  hw/arm/npcm7xx_boards.c |  6 +++---
>  hw/arm/orangepi.c   |  2 +-
>  hw/arm/raspi.c  |  2 +-
>  hw/arm/realview.c   |  2 +-
>  hw/arm/sabrelite.c  |  2 +-
>  hw/arm/versatilepb.c|  4 ++--
>  hw/arm/vexpress.c   |  6 +++---
>  hw/arm/xilinx_zynq.c| 16 +---
>  hw/arm/xlnx-versal-virt.c   |  3 ++-
>  hw/arm/xlnx-zcu102.c|  6 +++---
>  hw/microblaze/petalogix_ml605_mmu.c |  2 +-
>  hw/misc/sifive_u_otp.c  |  2 +-
>  hw/riscv/microchip_pfsoc.c  |  2 +-
>  hw/sparc64/niagara.c|  2 +-
>  23 files changed, 49 insertions(+), 52 deletions(-)

This would be easier to review if it didn't try to change all of
these board/SoC models at once. Each one of them is entirely
separate review work.

-- PMM



Re: does drive_get_next(IF_NONE) make sense?

2021-11-15 Thread Peter Maydell
On Mon, 15 Nov 2021 at 13:24, Kevin Wolf  wrote:
> Same question as for Hao Wu's series: Wouldn't the proper solution be to
> add a drive property to the machine type?
>
> If you can't use -blockdev, it's not done right.

Is there an example of "doing it right" for built-in-to-the-machine
devices?

(My experience with the new-style options is that almost
always they're designed for x86 where the device they're attached
to is also created on the command line, and then handling of boards
where the device is builtin is either an afterthought or forgotten.
See also -netdev, where it took forever for built-in-ethernet to
be usable.)

-- PMM



Re: does drive_get_next(IF_NONE) make sense?

2021-11-15 Thread Kevin Wolf
Am 15.11.2021 um 06:31 hat Markus Armbruster geschrieben:
> Peter Maydell  writes:
> 
> > On Fri, 12 Nov 2021 at 13:34, Markus Armbruster  wrote:
> >>
> >> Thomas Huth  writes:
> >>
> >> > On 03/11/2021 09.41, Markus Armbruster wrote:
> >> >> Peter Maydell  writes:
> >> >>
> >> >>> Does it make sense for a device/board to do drive_get_next(IF_NONE) ?
> >> >> Short answer: hell, no!  ;)
> >> >
> >> > Would it make sense to add an "assert(type != IF_NONE)" to drive_get()
> >> > to avoid such mistakes in the future?
> >>
> >> Worth a try.
> >
> > You need to fix the sifive_u_otp device first :-)
> 
> And for that, we may want Hao Wu's "[PATCH v4 5/7] blockdev: Add a new
> IF type IF_OTHER" first.

Same question as for Hao Wu's series: Wouldn't the proper solution be to
add a drive property to the machine type?

If you can't use -blockdev, it's not done right.

Kevin




Re: [PATCH v4 6/7] hw/nvram: Update at24c EEPROM init function in NPCM7xx boards

2021-11-15 Thread Kevin Wolf
Am 03.11.2021 um 23:01 hat Hao Wu geschrieben:
> We made 3 changes to the at24c_eeprom_init function in
> npcm7xx_boards.c:
> 
> 1. We allow the function to take a I2CBus* as parameter. This allows
>us to attach an EEPROM device behind an I2C mux which is not
>possible with the old method.
> 
> 2. We make at24c EEPROMs are backed by drives so that we can
>specify the content of the EEPROMs.
> 
> 3. Instead of using i2c address as unit number, This patch assigns
>unique unit numbers for each eeproms in each board. This avoids
>conflict in providing multiple eeprom contents with the same address.
>In the old method if we specify two drives with the same unit number,
>the following error will occur: `Device with id 'none85' exists`.
> 
> Signed-off-by: Hao Wu 
> ---
>  hw/arm/npcm7xx_boards.c | 14 +-
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/arm/npcm7xx_boards.c b/hw/arm/npcm7xx_boards.c
> index dec7d16ae5..9121e081fa 100644
> --- a/hw/arm/npcm7xx_boards.c
> +++ b/hw/arm/npcm7xx_boards.c
> @@ -126,13 +126,17 @@ static I2CBus *npcm7xx_i2c_get_bus(NPCM7xxState *soc, 
> uint32_t num)
>  return I2C_BUS(qdev_get_child_bus(DEVICE(>smbus[num]), "i2c-bus"));
>  }
>  
> -static void at24c_eeprom_init(NPCM7xxState *soc, int bus, uint8_t addr,
> -  uint32_t rsize)
> +static void at24c_eeprom_init(I2CBus *i2c_bus, int bus, uint8_t addr,
> +  uint32_t rsize, int unit_number)
>  {
> -I2CBus *i2c_bus = npcm7xx_i2c_get_bus(soc, bus);
>  I2CSlave *i2c_dev = i2c_slave_new("at24c-eeprom", addr);
>  DeviceState *dev = DEVICE(i2c_dev);
> +DriveInfo *dinfo;
>  
> +dinfo = drive_get(IF_OTHER, bus, unit_number);
> +if (dinfo) {
> +qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo));
> +}

I may be missing the history of this series, but why do we have to use a
legacy DriveInfo here instead of adding a drive property to the board to
make this configurable with -blockdev?

Adding even more devices that can only be configured with -drive feels
like a step backwards to me.

Kevin




Re: [PATCH v2 09/10] block: Let replace_child_noperm free children

2021-11-15 Thread Hanna Reitz

On 12.11.21 17:10, Vladimir Sementsov-Ogievskiy wrote:

11.11.2021 15:08, Hanna Reitz wrote:

In most of the block layer, especially when traversing down from other
BlockDriverStates, we assume that BdrvChild.bs can never be NULL.  When
it becomes NULL, it is expected that the corresponding BdrvChild pointer
also becomes NULL and the BdrvChild object is freed.

Therefore, once bdrv_replace_child_noperm() sets the BdrvChild.bs
pointer to NULL, it should also immediately set the corresponding
BdrvChild pointer (like bs->file or bs->backing) to NULL.

In that context, it also makes sense for this function to free the
child.  Sometimes we cannot do so, though, because it is called in a
transactional context where the caller might still want to reinstate the
child in the abort branch (and free it only on commit), so this behavior
has to remain optional.

In bdrv_replace_child_tran()'s abort handler, we now rely on the fact
that the BdrvChild passed to bdrv_replace_child_tran() must have had a
non-NULL .bs pointer initially.  Make a note of that and assert it.

Signed-off-by: Hanna Reitz 
---
  block.c | 102 +++-
  1 file changed, 79 insertions(+), 23 deletions(-)

diff --git a/block.c b/block.c
index a40027161c..0ac5b163d2 100644
--- a/block.c
+++ b/block.c
@@ -87,8 +87,10 @@ static BlockDriverState *bdrv_open_inherit(const 
char *filename,

  static bool bdrv_recurse_has_child(BlockDriverState *bs,
 BlockDriverState *child);
  +static void bdrv_child_free(BdrvChild *child);
  static void bdrv_replace_child_noperm(BdrvChild **child,
-  BlockDriverState *new_bs);
+  BlockDriverState *new_bs,
+  bool free_empty_child);
  static void bdrv_remove_file_or_backing_child(BlockDriverState *bs,
    BdrvChild *child,
    Transaction *tran);
@@ -2256,12 +2258,16 @@ typedef struct BdrvReplaceChildState {
  BdrvChild *child;
  BdrvChild **childp;
  BlockDriverState *old_bs;
+    bool free_empty_child;
  } BdrvReplaceChildState;
    static void bdrv_replace_child_commit(void *opaque)
  {
  BdrvReplaceChildState *s = opaque;
  +    if (s->free_empty_child && !s->child->bs) {
+    bdrv_child_free(s->child);
+    }
  bdrv_unref(s->old_bs);
  }
  @@ -2278,22 +2284,26 @@ static void bdrv_replace_child_abort(void 
*opaque)
   * modify the BdrvChild * pointer we indirectly pass to it, 
i.e. it
   * will not modify s->child.  From that perspective, it 
does not matter

   * whether we pass s->childp or >child.
- * (TODO: Right now, bdrv_replace_child_noperm() never 
modifies that
- * pointer anyway (though it will in the future), so at this 
point it
- * absolutely does not matter whether we pass s->childp or 
>child.)
   * (2) If new_bs is not NULL, s->childp will be NULL. We then 
cannot use

   * it here.
   * (3) If new_bs is NULL, *s->childp will have been NULLed by
   * bdrv_replace_child_tran()'s bdrv_replace_child_noperm() 
call, and we

   * must not pass a NULL *s->childp here.
- * (TODO: In its current state, bdrv_replace_child_noperm() 
will not
- * have NULLed *s->childp, so this does not apply yet.  It 
will in the

- * future.)


What I don't like about this patch is that it does two different 
things: zeroing the pointer and clearing the object. And if we look at 
the latter in separate, it seems that it's not needed:


Look: bdrv_replace_child_tran(): new parameter is set to true in two 
places, in both of them we are sure (and do assertion and comment) 
that new bs is not NULL and nothing will be freed.


Similarly, bdrv_replace_child_noperm() is called with true in two 
places where we sure that new bs is not NULL.


and only one place where new parameter set to true really do something:


@@ -2960,8 +3013,7 @@ static void bdrv_detach_child(BdrvChild **childp)
  {
  BlockDriverState *old_bs = (*childp)->bs;
  -    bdrv_replace_child_noperm(childp, NULL);
-    bdrv_child_free(*childp);
+    bdrv_replace_child_noperm(childp, NULL, true);
    if (old_bs) {
  /*


And it doesn't worth the whole complexity of new parameters for two 
functions.


In this place we can simply do something like

BdrvChild *child = *childp;

bdrv_replace_child_noperm(childp, NULL);

bdrv_child_free(child);


I understand the idea: it seems good and intuitive to do zeroing the 
pointer and clearing the object in one shot. But this patch itself 
shows that we just can't do it in 90% of cases. So, I think better is 
not do it and live with only "zeroing the pointer" part of this patch.


I see your point, but I don’t find it too complex.  Passing `true` is 
the default and then calling the function is easy.  Passing 

[PATCH RFC 1/2] hw/sd/ssi-sd: Do not create SD card within controller's realize

2021-11-15 Thread Markus Armbruster
ssi_sd_realize() creates an "sd-card" device.  This is inappropriate,
and marked FIXME.

Move it to the boards that create these devices.  Prior art: commit
eb4f566bbb for device "generic-sdhci", and commit 26c607b86b for
device "pl181".

Signed-off-by: Markus Armbruster 
---
 hw/arm/stellaris.c  | 15 ++-
 hw/riscv/sifive_u.c | 15 +--
 hw/sd/ssi-sd.c  | 29 -
 3 files changed, 27 insertions(+), 32 deletions(-)

diff --git a/hw/arm/stellaris.c b/hw/arm/stellaris.c
index 78827ace6b..b6c8a5d609 100644
--- a/hw/arm/stellaris.c
+++ b/hw/arm/stellaris.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "hw/sysbus.h"
+#include "hw/sd/sd.h"
 #include "hw/ssi/ssi.h"
 #include "hw/arm/boot.h"
 #include "qemu/timer.h"
@@ -1157,6 +1158,9 @@ static void stellaris_init(MachineState *ms, 
stellaris_board_info *board)
 void *bus;
 DeviceState *sddev;
 DeviceState *ssddev;
+DriveInfo *dinfo;
+DeviceState *carddev;
+BlockBackend *blk;
 
 /*
  * Some boards have both an OLED controller and SD card connected 
to
@@ -1221,8 +1225,17 @@ static void stellaris_init(MachineState *ms, 
stellaris_board_info *board)
  *  - Make the ssd0323 OLED controller chipselect active-low
  */
 bus = qdev_get_child_bus(dev, "ssi");
-
 sddev = ssi_create_peripheral(bus, "ssi-sd");
+
+dinfo = drive_get(IF_SD, 0, 0);
+blk = dinfo ? blk_by_legacy_dinfo(dinfo) : NULL;
+carddev = qdev_new(TYPE_SD_CARD);
+qdev_prop_set_drive_err(carddev, "drive", blk, _fatal);
+qdev_prop_set_bit(carddev, "spi", true);
+qdev_realize_and_unref(carddev,
+   qdev_get_child_bus(sddev, "sd-bus"),
+   _fatal);
+
 ssddev = ssi_create_peripheral(bus, "ssd0323");
 gpio_out[GPIO_D][0] = qemu_irq_split(
 qdev_get_gpio_in_named(sddev, SSI_GPIO_CS, 0),
diff --git a/hw/riscv/sifive_u.c b/hw/riscv/sifive_u.c
index 589ae72a59..aa74e67889 100644
--- a/hw/riscv/sifive_u.c
+++ b/hw/riscv/sifive_u.c
@@ -46,6 +46,7 @@
 #include "hw/char/serial.h"
 #include "hw/cpu/cluster.h"
 #include "hw/misc/unimp.h"
+#include "hw/sd/sd.h"
 #include "hw/ssi/ssi.h"
 #include "target/riscv/cpu.h"
 #include "hw/riscv/riscv_hart.h"
@@ -536,7 +537,8 @@ static void sifive_u_machine_init(MachineState *machine)
 uint32_t fdt_load_addr;
 uint64_t kernel_entry;
 DriveInfo *dinfo;
-DeviceState *flash_dev, *sd_dev;
+BlockBackend *blk;
+DeviceState *flash_dev, *sd_dev, *card_dev;
 qemu_irq flash_cs, sd_cs;
 
 /* Initialize SoC */
@@ -670,7 +672,7 @@ static void sifive_u_machine_init(MachineState *machine)
 
 /* Connect an SPI flash to SPI0 */
 flash_dev = qdev_new("is25wp256");
-dinfo = drive_get_next(IF_MTD);
+dinfo = drive_get(IF_MTD, 0, 0);
 if (dinfo) {
 qdev_prop_set_drive_err(flash_dev, "drive",
 blk_by_legacy_dinfo(dinfo),
@@ -686,6 +688,15 @@ static void sifive_u_machine_init(MachineState *machine)
 
 sd_cs = qdev_get_gpio_in_named(sd_dev, SSI_GPIO_CS, 0);
 sysbus_connect_irq(SYS_BUS_DEVICE(>soc.spi2), 1, sd_cs);
+
+dinfo = drive_get(IF_SD, 0, 0);
+blk = dinfo ? blk_by_legacy_dinfo(dinfo) : NULL;
+card_dev = qdev_new(TYPE_SD_CARD);
+qdev_prop_set_drive_err(card_dev, "drive", blk, _fatal);
+qdev_prop_set_bit(card_dev, "spi", true);
+qdev_realize_and_unref(card_dev,
+   qdev_get_child_bus(sd_dev, "sd-bus"),
+   _fatal);
 }
 
 static bool sifive_u_machine_get_start_in_flash(Object *obj, Error **errp)
diff --git a/hw/sd/ssi-sd.c b/hw/sd/ssi-sd.c
index e60854eeef..558506f6a0 100644
--- a/hw/sd/ssi-sd.c
+++ b/hw/sd/ssi-sd.c
@@ -368,36 +368,9 @@ static const VMStateDescription vmstate_ssi_sd = {
 
 static void ssi_sd_realize(SSIPeripheral *d, Error **errp)
 {
-ERRP_GUARD();
 ssi_sd_state *s = SSI_SD(d);
-DeviceState *carddev;
-DriveInfo *dinfo;
 
 qbus_init(>sdbus, sizeof(s->sdbus), TYPE_SD_BUS, DEVICE(d), "sd-bus");
-
-/* Create and plug in the sd card */
-/* FIXME use a qdev drive property instead of drive_get_next() */
-dinfo = drive_get_next(IF_SD);
-carddev = qdev_new(TYPE_SD_CARD);
-if (dinfo) {
-if (!qdev_prop_set_drive_err(carddev, "drive",
- blk_by_legacy_dinfo(dinfo), errp)) {
-goto fail;
-}
-}
-
-if (!object_property_set_bool(OBJECT(carddev), "spi", true, errp)) {
-goto fail;
-}
-
-if (!qdev_realize_and_unref(carddev, BUS(>sdbus), errp)) {
-goto fail;
-}
-
-return;
-
-fail:
-error_prepend(errp, "failed to init SD card: ");
 }
 
 static void ssi_sd_reset(DeviceState *dev)
@@ -426,8 +399,6 @@ 

[PATCH RFC 2/2] hw: Replace drive_get_next() by drive_get()

2021-11-15 Thread Markus Armbruster
drive_get_next() is basically a bad idea.  It returns the "next" block
backend of a certain interface type.  "Next" means bus=0,unit=N, where
subsequent calls count N up from zero, per interface type.

This lets you define unit numbers implicitly by execution order.  If the
order changes, or new calls appear "in the middle", unit numbers change.
ABI break.  Hard to spot in review.

Explicit is better than implicit: use drive_get() directly.

Signed-off-by: Markus Armbruster 
---
 include/sysemu/blockdev.h   |  1 -
 blockdev.c  | 10 --
 hw/arm/aspeed.c | 21 +
 hw/arm/cubieboard.c |  2 +-
 hw/arm/imx25_pdk.c  |  2 +-
 hw/arm/integratorcp.c   |  2 +-
 hw/arm/mcimx6ul-evk.c   |  2 +-
 hw/arm/mcimx7d-sabre.c  |  2 +-
 hw/arm/msf2-som.c   |  2 +-
 hw/arm/npcm7xx_boards.c |  6 +++---
 hw/arm/orangepi.c   |  2 +-
 hw/arm/raspi.c  |  2 +-
 hw/arm/realview.c   |  2 +-
 hw/arm/sabrelite.c  |  2 +-
 hw/arm/versatilepb.c|  4 ++--
 hw/arm/vexpress.c   |  6 +++---
 hw/arm/xilinx_zynq.c| 16 +---
 hw/arm/xlnx-versal-virt.c   |  3 ++-
 hw/arm/xlnx-zcu102.c|  6 +++---
 hw/microblaze/petalogix_ml605_mmu.c |  2 +-
 hw/misc/sifive_u_otp.c  |  2 +-
 hw/riscv/microchip_pfsoc.c  |  2 +-
 hw/sparc64/niagara.c|  2 +-
 23 files changed, 49 insertions(+), 52 deletions(-)

diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h
index 32c2d6023c..a750f99b79 100644
--- a/include/sysemu/blockdev.h
+++ b/include/sysemu/blockdev.h
@@ -50,7 +50,6 @@ void drive_check_orphaned(void);
 DriveInfo *drive_get_by_index(BlockInterfaceType type, int index);
 int drive_get_max_bus(BlockInterfaceType type);
 int drive_get_max_devs(BlockInterfaceType type);
-DriveInfo *drive_get_next(BlockInterfaceType type);
 
 QemuOpts *drive_def(const char *optstr);
 QemuOpts *drive_add(BlockInterfaceType type, int index, const char *file,
diff --git a/blockdev.c b/blockdev.c
index b35072644e..0eb2823b1b 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -303,16 +303,6 @@ int drive_get_max_bus(BlockInterfaceType type)
 return max_bus;
 }
 
-/* Get a block device.  This should only be used for single-drive devices
-   (e.g. SD/Floppy/MTD).  Multi-disk devices (scsi/ide) should use the
-   appropriate bus.  */
-DriveInfo *drive_get_next(BlockInterfaceType type)
-{
-static int next_block_unit[IF_COUNT];
-
-return drive_get(type, 0, next_block_unit[type]++);
-}
-
 static void bdrv_format_print(void *opaque, const char *name)
 {
 qemu_printf(" %s", name);
diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
index a77f46b3ad..cf20ae0db5 100644
--- a/hw/arm/aspeed.c
+++ b/hw/arm/aspeed.c
@@ -284,12 +284,13 @@ static void write_boot_rom(DriveInfo *dinfo, hwaddr addr, 
size_t rom_size,
 }
 
 static void aspeed_board_init_flashes(AspeedSMCState *s,
-  const char *flashtype)
+  const char *flashtype,
+  int unit0)
 {
 int i ;
 
 for (i = 0; i < s->num_cs; ++i) {
-DriveInfo *dinfo = drive_get_next(IF_MTD);
+DriveInfo *dinfo = drive_get(IF_MTD, 0, unit0 + i);
 qemu_irq cs_line;
 DeviceState *dev;
 
@@ -382,10 +383,12 @@ static void aspeed_machine_init(MachineState *machine)
   "max_ram", max_ram_size  - machine->ram_size);
 memory_region_add_subregion(>ram_container, machine->ram_size, 
>max_ram);
 
-aspeed_board_init_flashes(>soc.fmc, bmc->fmc_model ?
-  bmc->fmc_model : amc->fmc_model);
-aspeed_board_init_flashes(>soc.spi[0], bmc->spi_model ?
-  bmc->spi_model : amc->spi_model);
+aspeed_board_init_flashes(>soc.fmc,
+  bmc->fmc_model ? bmc->fmc_model : amc->fmc_model,
+  0);
+aspeed_board_init_flashes(>soc.spi[0],
+  bmc->spi_model ? bmc->spi_model : amc->spi_model,
+  bmc->soc.fmc.num_cs);
 
 /* Install first FMC flash content as a boot rom. */
 if (drive0) {
@@ -435,11 +438,13 @@ static void aspeed_machine_init(MachineState *machine)
 }
 
 for (i = 0; i < bmc->soc.sdhci.num_slots; i++) {
-sdhci_attach_drive(>soc.sdhci.slots[i], drive_get_next(IF_SD));
+sdhci_attach_drive(>soc.sdhci.slots[i],
+   drive_get(IF_SD, 0, i));
 }
 
 if (bmc->soc.emmc.num_slots) {
-sdhci_attach_drive(>soc.emmc.slots[0], drive_get_next(IF_SD));
+sdhci_attach_drive(>soc.emmc.slots[0],
+   drive_get(IF_SD, 0, bmc->soc.sdhci.num_slots));
 }
 
 

[PATCH RFC 0/2] Eliminate drive_get_next()

2021-11-15 Thread Markus Armbruster
This is RFC because I'm unsure about the removal of

/* Reason: init() method uses drive_get_next() */
dc->user_creatable = false;

in PATCH 1.  Both users appear to wire up some GPIO.  If that's
necessary for the thing to work, we should just replace the comment.

Aside: there may be devices that need manual wiring to work, yet don't
have user_creatable unset.  Bugs if you ask me.  I don't have smart
ideas on how to track them down.

Markus Armbruster (2):
  hw/sd/ssi-sd: Do not create SD card within controller's realize
  hw: Replace drive_get_next() by drive_get()

 include/sysemu/blockdev.h   |  1 -
 blockdev.c  | 10 --
 hw/arm/aspeed.c | 21 +
 hw/arm/cubieboard.c |  2 +-
 hw/arm/imx25_pdk.c  |  2 +-
 hw/arm/integratorcp.c   |  2 +-
 hw/arm/mcimx6ul-evk.c   |  2 +-
 hw/arm/mcimx7d-sabre.c  |  2 +-
 hw/arm/msf2-som.c   |  2 +-
 hw/arm/npcm7xx_boards.c |  6 +++---
 hw/arm/orangepi.c   |  2 +-
 hw/arm/raspi.c  |  2 +-
 hw/arm/realview.c   |  2 +-
 hw/arm/sabrelite.c  |  2 +-
 hw/arm/stellaris.c  | 15 ++-
 hw/arm/versatilepb.c|  4 ++--
 hw/arm/vexpress.c   |  6 +++---
 hw/arm/xilinx_zynq.c| 16 +---
 hw/arm/xlnx-versal-virt.c   |  3 ++-
 hw/arm/xlnx-zcu102.c|  6 +++---
 hw/microblaze/petalogix_ml605_mmu.c |  2 +-
 hw/misc/sifive_u_otp.c  |  2 +-
 hw/riscv/microchip_pfsoc.c  |  2 +-
 hw/riscv/sifive_u.c | 15 +--
 hw/sd/ssi-sd.c  | 29 -
 hw/sparc64/niagara.c|  2 +-
 26 files changed, 76 insertions(+), 84 deletions(-)

-- 
2.31.1




Re: [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers

2021-11-15 Thread Hanna Reitz

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
  block.c | 17 +
  1 file changed, 17 insertions(+)

diff --git a/block.c b/block.c
index 94bff5c757..40c4729b8d 100644
--- a/block.c
+++ b/block.c


[...]


@@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState *bs, 
BlockDriverState *child_bs,
  uint64_t *nperm, uint64_t *nshared)
  {
  assert(bs->drv && bs->drv->bdrv_child_perm);
+assert(qemu_in_main_thread());
  bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
   parent_perm, parent_shared,
   nperm, nshared);


(Should’ve noticed earlier, but only did now...)

First, this function is indirectly called by bdrv_refresh_perms(). I 
understand that all perm-related functions are classified as GS.


However, bdrv_co_invalidate_cache() invokes bdrv_refresh_perms. Being 
declared in block/coroutine.h, it’s an I/O function, so it mustn’t call 
such a GS function. BlockDriver.bdrv_co_invalidate_cache(), 
bdrv_invalidate_cache(), and blk_invalidate_cache() are also classified 
as I/O functions. Perhaps all of these functions should be classified as 
GS functions?  I believe their callers and their purpose would allow for 
this.



Second, it’s called by bdrv_child_refresh_perms(), which is called by 
block_crypto_amend_options_generic_luks().  This function is called by 
block_crypto_co_amend_luks(), which is a BlockDriver.bdrv_co_amend 
implementation, which is classified as an I/O function.


Honestly, I don’t know how to fix that mess.  The best would be if we 
could make the perm functions thread-safe and classify them as I/O, but 
it seems to me like that’s impossible (I sure hope I’m wrong).  On the 
other hand, .bdrv_co_amend very much strikes me like a GS function, but 
it isn’t.  I’m afraid it must work on nodes that are not in the main 
context, and it launches a job, so AFAIU we absolutely cannot run it 
under the BQL.


It almost seems to me like we’d need a thread-safe variant of the perm 
functions that’s allowed to fail when it cannot guarantee thread safety 
or something.  Or perhaps I’m wrong and the perm functions can actually 
be classified as thread-safe and I/O, that’d be great…


Hanna




Re: [PATCH 1/2] block:hdev: support BLKSECDISCARD

2021-11-15 Thread Kevin Wolf
Am 15.11.2021 um 05:51 hat yadong...@intel.com geschrieben:
> From: Yadong Qi 
> 
> Add a new option "secdiscard" for block drive. Parse it and
> record it in bs->open_flags as bit(BDRV_O_SECDISCARD).
> 
> Add a new BDRV_REQ_SECDISCARD bit for secure discard request
> from virtual device.
> 
> For host_device backend: implement by ioctl(BLKSECDISCARD) on
> real host block device.
> For other backend, no implementation.
> 
> E.g.:
> qemu-system-x86_64 \
> ... \
> -drive 
> file=/dev/mmcblk0p2,if=none,format=raw,discard=on,secdiscard=on,id=sd0 \
> -device virtio-blk-pci,drive=sd0,id=sd0_vblk \
> ...
> 
> Signed-off-by: Yadong Qi 
> ---
>  block.c  | 46 +
>  block/blkdebug.c |  5 ++--
>  block/blklogwrites.c |  6 ++--
>  block/blkreplay.c|  5 ++--
>  block/block-backend.c| 15 ++
>  block/copy-before-write.c|  5 ++--
>  block/copy-on-read.c |  5 ++--
>  block/coroutines.h   |  6 ++--
>  block/file-posix.c   | 50 
>  block/filter-compress.c  |  5 ++--
>  block/io.c   |  5 ++--
>  block/mirror.c   |  5 ++--
>  block/nbd.c  |  3 +-
>  block/nvme.c |  3 +-
>  block/preallocate.c  |  5 ++--
>  block/qcow2-refcount.c   |  4 +--
>  block/qcow2.c|  3 +-
>  block/raw-format.c   |  5 ++--
>  block/throttle.c |  5 ++--
>  hw/block/virtio-blk.c|  2 +-
>  hw/ide/core.c|  1 +
>  hw/nvme/ctrl.c   |  3 +-
>  hw/scsi/scsi-disk.c  |  2 +-
>  include/block/block.h| 13 +++--
>  include/block/block_int.h|  2 +-
>  include/block/raw-aio.h  |  4 ++-
>  include/sysemu/block-backend.h   |  1 +
>  tests/unit/test-block-iothread.c |  9 +++---
>  28 files changed, 171 insertions(+), 52 deletions(-)

Notably absent: qapi/block-core.json. Without changing this, the option
can't be available in -blockdev, which is the primary option to configure
block device backends.

This patch seems to contain multiple logical changes that should be
split into separate patches:

* Adding a flags parameter to .bdrv_co_pdiscard

* Support for the new feature in the core block layer (should be done
  with -blockdev)

* Convenience magic for -drive (BDRV_O_SECDISCARD). It's not clear that
  this should be done at all because the option is really specific to
  one single block driver (file-posix). I think in your patch, all
  other block drivers silently ignore the option, which is not what we
  want.

> diff --git a/block.c b/block.c
> index 580cb77a70..4f05e96d12 100644
> --- a/block.c
> +++ b/block.c
> @@ -1128,6 +1128,32 @@ int bdrv_parse_discard_flags(const char *mode, int 
> *flags)
>  return 0;
>  }
>  
> +/**
> + * Set open flags for a given secdiscard mode
> + *
> + * Return 0 on success, -1 if the secdiscard mode was invalid.
> + */
> +int bdrv_parse_secdiscard_flags(const char *mode, int *flags, Error **errp)
> +{
> +*flags &= ~BDRV_O_SECDISCARD;
> +
> +if (!strcmp(mode, "off")) {
> +/* do nothing */
> +} else if (!strcmp(mode, "on")) {
> +if (!(*flags & BDRV_O_UNMAP)) {
> +error_setg(errp, "cannot enable secdiscard when discard is "
> + "disabled!");
> +return -1;
> +}

This check has nothing to do with parsing the option, it's validating
its value.

You don't even need a new function to parse it, because there is already
qemu_opt_get_bool(). Duplicating it means only that you're inconsistent
with other boolean options, which alternatively accept "yes"/"no",
"true"/"false", "y/n".

> +
> +*flags |= BDRV_O_SECDISCARD;
> +} else {
> +return -1;
> +}
> +
> +return 0;
> +}
> +
>  /**
>   * Set open flags for a given cache mode
>   *
> @@ -1695,6 +1721,11 @@ QemuOptsList bdrv_runtime_opts = {
>  .type = QEMU_OPT_STRING,
>  .help = "discard operation (ignore/off, unmap/on)",
>  },
> +{
> +.name = BDRV_OPT_SECDISCARD,
> +.type = QEMU_OPT_STRING,
> +.help = "secure discard operation (off, on)",
> +},
>  {
>  .name = BDRV_OPT_FORCE_SHARE,
>  .type = QEMU_OPT_BOOL,
> @@ -1735,6 +1766,7 @@ static int bdrv_open_common(BlockDriverState *bs, 
> BlockBackend *file,
>  const char *driver_name = NULL;
>  const char *node_name = NULL;
>  const char *discard;
> +const char *secdiscard;
>  QemuOpts *opts;
>  BlockDriver *drv;
>  Error *local_err = NULL;
> @@ -1829,6 +1861,16 @@ static int bdrv_open_common(BlockDriverState *bs, 
> BlockBackend *file,
>  }
>  }
>  
> +
> +secdiscard = qemu_opt_get(opts, BDRV_OPT_SECDISCARD);
> +if (secdiscard 

Re: [PATCH v4 03/25] assertions for block global state API

2021-11-15 Thread Emanuele Giuseppe Esposito


@@ -5958,6 +6043,7 @@ const char *bdrv_get_parent_name(const 
BlockDriverState *bs)

  /* TODO check what callers really want: bs->node_name or blk_name() */
  const char *bdrv_get_device_name(const BlockDriverState *bs)
  {
+    assert(qemu_in_main_thread());
  return bdrv_get_parent_name(bs) ?: "";
  }


This function is invoked from qcow2_signal_corruption(), which comes 
generally from an I/O path.  Is it safe to assert that we’re in the main 
thread here?


Well, the question is probably rather whether this needs really be a 
considered a global-state function, or whether putting it in common or 
I/O is fine.  I believe you’re right given that it invokes 
bdrv_get_parent_name(), it cannot be thread-safe, but then we’ll have to 
change qcow2_signal_corruption() so it doesn’t invoke this function.




I think that the assertion is wrong here. As long as a single caller is 
not under BQL, we cannot keep the function in global-state headers.
It should go into block-io.h, together with bdrv_get_parent_name and 
bdrv_get_device_or_node_name (caller of bdrv_get_parent_name).


Since bdrv_get_parent_name only scans through the list of bs->parents, 
as long as we can assert that the parent list is modified only under BQL 
+ drain, it is safe to be read and can be I/O.



[...]


diff --git a/block/io.c b/block/io.c
index bb0a254def..c5d7f8495e 100644
--- a/block/io.c
+++ b/block/io.c


[...]


@@ -544,6 +546,7 @@ void bdrv_drained_end(BlockDriverState *bs)
  void bdrv_drained_end_no_poll(BlockDriverState *bs, int 
*drained_end_counter)

  {
+    assert(qemu_in_main_thread());
  bdrv_do_drained_end(bs, false, NULL, false, drained_end_counter);
  }


Why is bdrv_drained_end an I/O function and this is a GS function, even 
though it does just a subset?


Right this is clearly called in bdrv_drained_end. I don't know how is it 
possible that no test managed to trigger this assertion... This is 
actually invoked in both unit and qemu-iothread tests...




@@ -586,12 +589,14 @@ void bdrv_unapply_subtree_drain(BdrvChild 
*child, BlockDriverState *old_parent)

  void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
  {
  assert(qemu_in_coroutine());
+    assert(qemu_in_main_thread());
  bdrv_drained_begin(bs);
  bdrv_drained_end(bs);
  }
  void bdrv_drain(BlockDriverState *bs)
  {
+    assert(qemu_in_main_thread());
  bdrv_drained_begin(bs);
  bdrv_drained_end(bs);
  }


Why are these GS functions when both bdrv_drained_begin() and 
bdrv_drained_end() are I/O functions?


I can understand making the drain_all functions GS functions, but it 
seems weird to say it’s an I/O function when a single BDS is drained via 
bdrv_drained_begin() and bdrv_drained_end(), but not via bdrv_drain(), 
which just does both.


(I can see that there are no I/O path callers, but I still find it 
strange.)


The problem as you saw is on the path callers: bdrv_drain seems to be 
called only in main thread, while others or similar drains are also 
coming from I/O. I would say maybe it's a better idea to just put 
everything I/O, maybe (probably) there are cases not caught by the tests.


The only ones in global-state will be:
- bdrv_apply_subtree_drain and bdrv_unapply_subtree_drain (called only 
in .attach and .detach callbacks of BdrvChildClass, run under BQL).

- bdrv_drain_all_end_quiesce (called only by bdrv_close thus under BQL).
- bdrv_drain_all_begin, bdrv_drain_all_end and bdrv_drain_all (have 
already BQL assertions)


In general, this is the underlying problem with these assertions: if a 
test doesn't test a specific code path, an unneeded assertion might pass 
undetected.




[...]

@@ -2731,6 +2742,7 @@ int bdrv_block_status_above(BlockDriverState 
*bs, BlockDriverState *base,
  int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t 
bytes,
    int64_t *pnum, int64_t *map, BlockDriverState 
**file)

  {
+    assert(qemu_in_main_thread());
  return bdrv_block_status_above(bs, bdrv_filter_or_cow_bs(bs),
 offset, bytes, pnum, map, file);
  }


Why is this a GS function as opposed to all other block-status 
functions?  Because of the bdrv_filter_or_cow_bs() call?


This function is in block.io but has the assertion. I think it's a 
proper I/O but I forgot to remove the assertion in my various trial


Let me know if you disagree on anything of what I said.

Thank you,
Emanuele




Re: [PATCH v4 02/25] include/block/block: split header into I/O and global state API

2021-11-15 Thread Emanuele Giuseppe Esposito

+/*
+ * I/O API functions. These functions are thread-safe, and therefore
+ * can run in any thread as long as the thread has called
+ * aio_context_acquire/release().
+ */
+
+int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
+  Error **errp);


Why is this function here?  Naïvely, I would’ve assumed as a 
graph-modifying function it should be in block-global-state.h.


I mean, perhaps it’s thread-safe and then it can fit here, too. Still, 
it surprises me a bit to find this here.


Hanna



Agree, I also tested this, it can go in global state. Will fix that.

Thank you,
Emanuele




Re: [PATCH v4 19/25] block_int-common.h: split function pointers in BlockDriver

2021-11-15 Thread Hanna Reitz

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:

Similar to the header split, also the function pointers in BlockDriver
can be split in I/O and global state.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
  include/block/block_int-common.h | 458 ---
  1 file changed, 237 insertions(+), 221 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 79a3d801d2..9857e775fe 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -96,6 +96,7 @@ typedef struct BdrvTrackedRequest {
  
  
  struct BlockDriver {

+/* Fields initialized in struct definition and never changed. */


I find this a bit difficult to understand.  Perhaps we could be more 
verbose to make it easier to read?  Like


“These fields are initialized when this object is created, and are never 
changed afterwards”


And I’d also add an empty line below to make visually clear that this 
does not describe the single subsequent field (format_name) but the 
whole chunk of fields.


I also wonder if we could substantiate the claim “never changed 
afterwards” by making these fields consts.  Then again, I suppose 
technically none of the fields in BlockDriver is supposed to be mutable 
(except for .list), neither these fields nor the function pointers...  
Yeah, maybe scratch that idea.



  const char *format_name;
  int instance_size;


[...]


+int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
+  QEMUIOVector *qiov,
+  int64_t pos);
+int coroutine_fn (*bdrv_load_vmstate)(BlockDriverState *bs,
+  QEMUIOVector *qiov,
+  int64_t pos);


In patch 18, you classified bdrv_co_readv_vmstate() and 
bdrv_co_writev_vmstate() as I/O functions.  They call these methods, 
though, so something doesn’t seem quite right.


Now I also remember you classified the global bdrv_save_vmstate() and 
bdrv_load_vmstate() functions as GS.  I don’t mind either way; AFAIU 
they are I/O functions that are generally called under the BQL.  But the 
classification should be consistent, of course.



+
+int (*bdrv_change_backing_file)(BlockDriverState *bs,
+const char *backing_file, const char *backing_fmt);
+
+void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
+void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);


Above bdrv_is_inserted(), there’s a comment (“removable device 
specific”) that I think should be duplicated here.



+
+/* TODO Better pass a option string/QDict/QemuOpts to add any rule? */
+int (*bdrv_debug_breakpoint)(BlockDriverState *bs, const char *event,
+const char *tag);
+int (*bdrv_debug_remove_breakpoint)(BlockDriverState *bs,
+const char *tag);
+int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
+bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
+void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);


Originally, there was an empty line before bdrv_refresh_limits(), and 
I’d keep it.


[...]


+/**
+ * Try to get @bs's logical and physical block size.
+ * On success, store them in @bsz and return zero.
+ * On failure, return negative errno.
+ */
+int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes *bsz);


Originally, there was a comment above this function (“I/O API, even 
though if it's a filter jumps on parent”) that you added in patch 6 and 
that I didn’t understand.  Given this, I’m not unhappy to see it go 
again, but now I wonder about its purpose even more...



+/**
+ * Try to get @bs's geometry (cyls, heads, sectors)
+ * On success, store them in @geo and return 0.
+ * On failure return -errno.
+ * Only drivers that want to override guest geometry implement this
+ * callback; see hd_geometry_guess().
+ */
+int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);


(Here, too)

[...]


+/**
+ * Register/unregister a buffer for I/O. For example, when the driver is
+ * interested to know the memory areas that will later be used in iovs, so
+ * that it can do IOMMU mapping with VFIO etc., in order to get better
+ * performance. In the case of VFIO drivers, this callback is used to do
+ * DMA mapping for hot buffers.
+ */
+void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
+void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host);
+QLIST_ENTRY(BlockDriver) list;


This field is interesting because it’s the only mutable field in the 
whole struct.  I think it should be separated from the function pointers 
above and given a comment like “This field is modified only under the 
BQL, and is part of the global state”.



+
+/*
+ * I/O API functions. These functions 

Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-11-15 Thread Stefan Hajnoczi
On Thu, Nov 11, 2021 at 06:54:03PM +0100, Christian Schoenebeck wrote:
> On Donnerstag, 11. November 2021 17:31:52 CET Stefan Hajnoczi wrote:
> > On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> > > On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > > > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > > > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > > As you are apparently reluctant for changing the virtio specs, what
> > > > > about
> > > > > introducing those discussed virtio capabalities either as experimental
> > > > > ones
> > > > > without specs changes, or even just as 9p specific device capabilities
> > > > > for
> > > > > now. I mean those could be revoked on both sides at any time anyway.
> > > > 
> > > > I would like to understand the root cause before making changes.
> > > > 
> > > > "It's faster when I do X" is useful information but it doesn't
> > > > necessarily mean doing X is the solution. The "it's faster when I do X
> > > > because Y" part is missing in my mind. Once there is evidence that shows
> > > > Y then it will be clearer if X is a good solution, if there's a more
> > > > general solution, or if it was just a side-effect.
> > > 
> > > I think I made it clear that the root cause of the observed performance
> > > gain with rising transmission size is latency (and also that performance
> > > is not the only reason for addressing this queue size issue).
> > > 
> > > Each request roundtrip has a certain minimum latency, the virtio ring
> > > alone
> > > has its latency, plus latency of the controller portion of the file server
> > > (e.g. permissions, sandbox checks, file IDs) that is executed with *every*
> > > request, plus latency of dispatching the request handling between threads
> > > several times back and forth (also for each request).
> > > 
> > > Therefore when you split large payloads (e.g. reading a large file) into
> > > smaller n amount of chunks, then that individual latency per request
> > > accumulates to n times the individual latency, eventually leading to
> > > degraded transmission speed as those requests are serialized.
> > 
> > It's easy to increase the blocksize in benchmarks, but real applications
> > offer less control over the I/O pattern. If latency in the device
> > implementation (QEMU) is the root cause then reduce the latency to speed
> > up all applications, even those that cannot send huge requests.
> 
> Which I did, still do, and also mentioned before, e.g.:
> 
> 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 9pfs: reduce latency of Twalk
> 0c4356ba7dafc8ecb5877a42fc0d68d45ccf5951 9pfs: T_readdir latency optimization
> 
> Reducing overall latency is a process that is ongoing and will still take a 
> very long development time. Not because of me, but because of lack of 
> reviewers. And even then, it does not make the effort to support higher 
> transmission sizes obsolete.
> 
> > One idea is request merging on the QEMU side. If the application sends
> > 10 sequential read or write requests, coalesce them together before the
> > main part of request processing begins in the device. Process a single
> > large request to spread the cost of the file server over the 10
> > requests. (virtio-blk has request merging to help with the cost of lots
> > of small qcow2 I/O requests.) The cool thing about this is that the
> > guest does not need to change its I/O pattern to benefit from the
> > optimization.
> > 
> > Stefan
> 
> Ok, don't get me wrong: I appreciate that you are suggesting approaches that 
> could improve things. But I could already hand you over a huge list of mine. 
> The limiting factor here is not the lack of ideas of what could be improved, 
> but rather the lack of people helping out actively on 9p side:
> https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06452.html
> 
> The situation on kernel side is the same. I already have a huge list of what 
> could & should be improved. But there is basically no reviewer for 9p patches 
> on Linux kernel side either.
> 
> The much I appreciate suggestions of what could be improved, I would 
> appreciate much more if there was *anybody* actively assisting as well. In 
> the 
> time being I have to work the list down in small patch chunks, priority based.

I see request merging as an alternative to this patch series, not as an
additional idea.

My thoughts behind this is that request merging is less work than this
patch series and more broadly applicable. It would be easy to merge (no
idea how easy it is to implement though) in QEMU's virtio-9p device
implementation, does not require changes across the stack, and benefits
applications that can't change their I/O pattern to take advantage of
huge requests.

There is a risk that request merging won't pan out, it could have worse
performance than submitting huge requests.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-11-15 Thread Peter Lieven

Am 26.10.21 um 16:53 schrieb Peter Lieven:

Am 25.10.21 um 14:58 schrieb Kevin Wolf:

Am 25.10.2021 um 13:39 hat Peter Lieven geschrieben:

Am 16.09.21 um 14:34 schrieb Peter Lieven:

Am 09.07.21 um 12:21 schrieb Kevin Wolf:

Am 08.07.2021 um 20:23 hat Peter Lieven geschrieben:

Am 08.07.2021 um 14:18 schrieb Kevin Wolf :

Am 07.07.2021 um 20:13 hat Peter Lieven geschrieben:

Am 06.07.2021 um 17:25 schrieb Kevin Wolf :
Am 06.07.2021 um 16:55 hat Peter Lieven geschrieben:

I will have a decent look after my vacation.

Sounds good, thanks. Enjoy your vacation!

As I had to fire up my laptop to look into another issue anyway, I
have sent two patches for updating MAINTAINERS and to fix the int vs.
bool mix for task->complete.

I think you need to reevaluate your definition of vacation. ;-)

Lets talk about this when the kids are grown up. Sometimes sending
patches can be quite relaxing :-)

Heh, fair enough. :-)


But thanks anyway.


As Paolos fix (5f50be9b5) is relatively new and there are maybe other
non obvious problems when removing the BH indirection and we are close
to soft freeze I would leave the BH removal change for 6.2.

Sure, code cleanups aren't urgent.

Isn’t the indirection also a slight performance drop?

Yeah, I guess technically it is, though I doubt it's measurable.

As promised I was trying to remove the indirection through the BH after Qemu 
6.1 release.

However, if I remove the BH I run into the following assertion while running 
some fio tests:


qemu-system-x86_64: ../block/block-backend.c:1197: blk_wait_while_drained: Assertion 
`blk->in_flight > 0' failed.


Any idea?


This is what I changed:


diff --git a/block/rbd.c b/block/rbd.c
index 3cb24f9981..bc1dbc20f7 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1063,13 +1063,6 @@ static int qemu_rbd_resize(BlockDriverState *bs, 
uint64_t size)
  return 0;
  }

-static void qemu_rbd_finish_bh(void *opaque)
-{
-    RBDTask *task = opaque;
-    task->complete = true;
-    aio_co_wake(task->co);
-}
-
  /*
   * This is the completion callback function for all rbd aio calls
   * started from qemu_rbd_start_co().
@@ -1083,8 +1076,8 @@ static void qemu_rbd_completion_cb(rbd_completion_t c, 
RBDTask *task)
  {
  task->ret = rbd_aio_get_return_value(c);
  rbd_aio_release(c);
-    aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
-    qemu_rbd_finish_bh, task);
+    task->complete = true;
+    aio_co_wake(task->co);
  }

Kevin, Paolo, any idea?

Not really, I don't see the connection between both places.

Do you have a stack trace for the crash?


The crash seems not to be limited to that assertion. I have also seen:


qemu-system-x86_64: ../block/block-backend.c:1497: blk_aio_write_entry: Assertion 
`!qiov || qiov->size == acb->bytes' failed.


Altough harder to trigger I catch this backtrace in gdb:


qemu-system-x86_64: ../block/block-backend.c:1497: blk_aio_write_entry: Assertion 
`!qiov || qiov->size == acb->bytes' failed.
[Wechseln zu Thread 0x77fa8f40 (LWP 17852)]

Thread 1 "qemu-system-x86" hit Breakpoint 1, __GI_abort () at abort.c:49
49    abort.c: Datei oder Verzeichnis nicht gefunden.
(gdb) bt
#0  0x742567e0 in __GI_abort () at abort.c:49
#1  0x7424648a in __assert_fail_base (fmt=0x743cd750 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
assertion=assertion@entry=0x55e638e0 "!qiov || qiov->size == acb->bytes", file=file@entry=0x55e634b2 
"../block/block-backend.c", line=line@entry=1497, function=function@entry=0x55e63b20 
<__PRETTY_FUNCTION__.32450> "blk_aio_write_entry") at assert.c:92
#2  0x74246502 in __GI___assert_fail (assertion=assertion@entry=0x55e638e0 "!qiov || qiov->size == 
acb->bytes", file=file@entry=0x55e634b2 "../block/block-backend.c", line=line@entry=1497, 
function=function@entry=0x55e63b20 <__PRETTY_FUNCTION__.32450> "blk_aio_write_entry") at assert.c:101
#3  0x55becc78 in blk_aio_write_entry (opaque=0x56b534f0) at 
../block/block-backend.c:1497
#4  0x55cf0e4c in coroutine_trampoline (i0=, i1=) at ../util/coroutine-ucontext.c:173
#5  0x7426e7b0 in __start_context () at /lib/x86_64-linux-gnu/libc.so.6
#6  0x7fffd5a0 in  ()
#7  0x in  ()




any ideas? Or should we just abandon the idea of removing the BH?


Peter






Re: [PATCH 2/2] virtio-blk: support BLKSECDISCARD

2021-11-15 Thread Michael S. Tsirkin
On Mon, Nov 15, 2021 at 12:52:00PM +0800, yadong...@intel.com wrote:
> From: Yadong Qi 
> 
> Add new virtio feature: VIRTIO_BLK_F_SECDISCARD.
> Add new virtio command: VIRTIO_BLK_T_SECDISCARD.
> 
> This feature is disabled by default, it will check the backend
> bs->open_flags & BDRV_O_SECDISCARD, enable it if BDRV_O_SECDISCARD
> is supported.
> 
> Signed-off-by: Yadong Qi 
> ---
>  hw/block/virtio-blk.c   | 26 +
>  include/standard-headers/linux/virtio_blk.h |  4 


Any changes to standard headers need to go to linux and virtio TC.

>  2 files changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index dbc4c5a3cd..7bc3484521 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -536,7 +536,8 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>  }
>  
>  static uint8_t virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req,
> -struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes)
> +struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes,
> +bool is_secdiscard)
>  {
>  VirtIOBlock *s = req->dev;
>  VirtIODevice *vdev = VIRTIO_DEVICE(s);
> @@ -577,8 +578,8 @@ static uint8_t 
> virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req,
>  goto err;
>  }
>  
> +int blk_aio_flags = 0;
>  if (is_write_zeroes) { /* VIRTIO_BLK_T_WRITE_ZEROES */
> -int blk_aio_flags = 0;
>  
>  if (flags & VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP) {
>  blk_aio_flags |= BDRV_REQ_MAY_UNMAP;
> @@ -600,7 +601,12 @@ static uint8_t 
> virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req,
>  goto err;
>  }
>  
> -blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes, 0,
> +if (is_secdiscard) {
> +blk_aio_flags |= BDRV_REQ_SECDISCARD;
> +}
> +
> +blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes,
> + blk_aio_flags,
>   virtio_blk_discard_write_zeroes_complete, req);
>  }
>  
> @@ -622,6 +628,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, 
> MultiReqBuffer *mrb)
>  unsigned out_num = req->elem.out_num;
>  VirtIOBlock *s = req->dev;
>  VirtIODevice *vdev = VIRTIO_DEVICE(s);
> +bool is_secdiscard = false;
>  
>  if (req->elem.out_num < 1 || req->elem.in_num < 1) {
>  virtio_error(vdev, "virtio-blk missing headers");
> @@ -722,6 +729,9 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, 
> MultiReqBuffer *mrb)
>   * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch 
> statement,
>   * so we must mask it for these requests, then we will check if it is 
> set.
>   */
> +case VIRTIO_BLK_T_SECDISCARD & ~VIRTIO_BLK_T_OUT:
> +is_secdiscard = true;
> +__attribute__((fallthrough));
>  case VIRTIO_BLK_T_DISCARD & ~VIRTIO_BLK_T_OUT:
>  case VIRTIO_BLK_T_WRITE_ZEROES & ~VIRTIO_BLK_T_OUT:
>  {
> @@ -752,7 +762,8 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, 
> MultiReqBuffer *mrb)
>  }
>  
>  err_status = virtio_blk_handle_discard_write_zeroes(req, _hdr,
> -is_write_zeroes);
> +is_write_zeroes,
> +is_secdiscard);
>  if (err_status != VIRTIO_BLK_S_OK) {
>  virtio_blk_req_complete(req, err_status);
>  virtio_blk_free_request(req);
> @@ -1201,6 +1212,11 @@ static void virtio_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  return;
>  }
>  
> +if (blk_get_flags(conf->conf.blk) & BDRV_O_SECDISCARD)
> +virtio_add_feature(>host_features, VIRTIO_BLK_F_SECDISCARD);
> +else
> +virtio_clear_feature(>host_features, VIRTIO_BLK_F_SECDISCARD);
> +
>  if (virtio_has_feature(s->host_features, VIRTIO_BLK_F_WRITE_ZEROES) &&
>  (!conf->max_write_zeroes_sectors ||
>   conf->max_write_zeroes_sectors > BDRV_REQUEST_MAX_SECTORS)) {
> @@ -1307,6 +1323,8 @@ static Property virtio_blk_properties[] = {
>   conf.report_discard_granularity, true),
>  DEFINE_PROP_BIT64("write-zeroes", VirtIOBlock, host_features,
>VIRTIO_BLK_F_WRITE_ZEROES, true),
> +DEFINE_PROP_BIT64("secdiscard", VirtIOBlock, host_features,
> +  VIRTIO_BLK_F_SECDISCARD, false),
>  DEFINE_PROP_UINT32("max-discard-sectors", VirtIOBlock,
> conf.max_discard_sectors, BDRV_REQUEST_MAX_SECTORS),
>  DEFINE_PROP_UINT32("max-write-zeroes-sectors", VirtIOBlock,
> diff --git a/include/standard-headers/linux/virtio_blk.h 
> b/include/standard-headers/linux/virtio_blk.h
> index 2dcc90826a..c55a07840c 100644
> --- a/include/standard-headers/linux/virtio_blk.h
> +++ 

Re: [PATCH 2/2] virtio-blk: support BLKSECDISCARD

2021-11-15 Thread Stefano Garzarella

On Mon, Nov 15, 2021 at 12:52:00PM +0800, yadong...@intel.com wrote:

From: Yadong Qi 

Add new virtio feature: VIRTIO_BLK_F_SECDISCARD.
Add new virtio command: VIRTIO_BLK_T_SECDISCARD.


Has a proposal been sent out yet to virtio-comment mailing list for 
discussing these specification changes?




This feature is disabled by default, it will check the backend
bs->open_flags & BDRV_O_SECDISCARD, enable it if BDRV_O_SECDISCARD
is supported.

Signed-off-by: Yadong Qi 
---
hw/block/virtio-blk.c   | 26 +
include/standard-headers/linux/virtio_blk.h |  4 
2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index dbc4c5a3cd..7bc3484521 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -536,7 +536,8 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
}

static uint8_t virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req,
-struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes)
+struct virtio_blk_discard_write_zeroes *dwz_hdr, bool is_write_zeroes,
+bool is_secdiscard)


Since the function now handles 3 commands, I'm thinking if it's better 
to pass the command directly and then make a switch instead of using 2 
booleans.



{
VirtIOBlock *s = req->dev;
VirtIODevice *vdev = VIRTIO_DEVICE(s);
@@ -577,8 +578,8 @@ static uint8_t 
virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req,
goto err;
}

+int blk_aio_flags = 0;


Maybe better to move it to the beginning of the function.


if (is_write_zeroes) { /* VIRTIO_BLK_T_WRITE_ZEROES */
-int blk_aio_flags = 0;

if (flags & VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP) {
blk_aio_flags |= BDRV_REQ_MAY_UNMAP;
@@ -600,7 +601,12 @@ static uint8_t 
virtio_blk_handle_discard_write_zeroes(VirtIOBlockReq *req,
goto err;
}

-blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes, 0,
+if (is_secdiscard) {
+blk_aio_flags |= BDRV_REQ_SECDISCARD;
+}
+
+blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS, bytes,
+ blk_aio_flags,
 virtio_blk_discard_write_zeroes_complete, req);
}

@@ -622,6 +628,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, 
MultiReqBuffer *mrb)
unsigned out_num = req->elem.out_num;
VirtIOBlock *s = req->dev;
VirtIODevice *vdev = VIRTIO_DEVICE(s);
+bool is_secdiscard = false;

if (req->elem.out_num < 1 || req->elem.in_num < 1) {
virtio_error(vdev, "virtio-blk missing headers");
@@ -722,6 +729,9 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, 
MultiReqBuffer *mrb)
 * VIRTIO_BLK_T_OUT flag set. We masked this flag in the switch statement,
 * so we must mask it for these requests, then we will check if it is set.
 */
+case VIRTIO_BLK_T_SECDISCARD & ~VIRTIO_BLK_T_OUT:
+is_secdiscard = true;
+__attribute__((fallthrough));


We can use QEMU_FALLTHROUGH here.


case VIRTIO_BLK_T_DISCARD & ~VIRTIO_BLK_T_OUT:
case VIRTIO_BLK_T_WRITE_ZEROES & ~VIRTIO_BLK_T_OUT:
{
@@ -752,7 +762,8 @@ static int virtio_blk_handle_request(VirtIOBlockReq 
*req, MultiReqBuffer *mrb)

}

err_status = virtio_blk_handle_discard_write_zeroes(req, _hdr,
-is_write_zeroes);
+is_write_zeroes,
+is_secdiscard);
if (err_status != VIRTIO_BLK_S_OK) {
virtio_blk_req_complete(req, err_status);
virtio_blk_free_request(req);
@@ -1201,6 +1212,11 @@ static void virtio_blk_device_realize(DeviceState *dev, 
Error **errp)
return;
}

+if (blk_get_flags(conf->conf.blk) & BDRV_O_SECDISCARD)
+virtio_add_feature(>host_features, 
VIRTIO_BLK_F_SECDISCARD);

+else
+virtio_clear_feature(>host_features, VIRTIO_BLK_F_SECDISCARD);
+


IIUC here we set or not the feature if BDRV_O_SECDISCARD is set.

Should we keep it disabled if "secdiscard" is false? (e.g. to avoid 
migration problems)


Otherwise what is the purpose of the "secdiscard" property?

Thanks,
Stefano