Re: [PATCH v9 02/12] hw/block/nvme: Generate namespace UUIDs

2020-11-04 Thread Klaus Jensen
On Nov  5 11:53, Dmitry Fomichev wrote:
> In NVMe 1.4, a namespace must report an ID descriptor of UUID type
> if it doesn't support EUI64 or NGUID. Add a new namespace property,
> "uuid", that provides the user the option to either specify the UUID
> explicitly or have a UUID generated automatically every time a
> namespace is initialized.
> 
> Suggested-by: Klaus Jensen 
> Signed-off-by: Dmitry Fomichev 
> Reviewed-by: Klaus Jensen 
> Reviewed-by: Keith Busch 
> Reviewed-by: Niklas Cassel 

Please use the tag that I originally R-b'ed with:

Reviewed-by: Klaus Jensen 


signature.asc
Description: PGP signature


Re: [PATCH 1/5] file-posix: split hdev_refresh_limits from raw_refresh_limits

2020-11-04 Thread Tom Yan
Actually I made a mistake in this. BLKSECTGET (the one in the block
layer) returns the number of "sectors", where a "sector" is defined as
a 512-byte block. So we shouldn't multiply by BLKSSZGET here, but simply
by 512 (1 << 9). See logical_to_sectors() in sd.h of the kernel.
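
For illustration, a helper along those lines might look like this
(a minimal sketch under the BLKSECTGET semantics described above, not
the code as merged; needs <sys/ioctl.h>, <linux/fs.h> and <errno.h>):

    static int get_max_transfer_length(int fd)
    {
    #ifdef BLKSECTGET
        int sect = 0;

        if (ioctl(fd, BLKSECTGET, &sect) == 0) {
            /* the block layer reports "sectors" in fixed 512-byte
             * units, regardless of the logical block size */
            return sect << 9;
        }
        return -errno;
    #else
        return -ENOSYS;
    #endif
    }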

On Thu, 5 Nov 2020 at 01:32, Maxim Levitsky  wrote:
>
> From: Tom Yan 
>
> We can and should get max transfer length and max segments for all host
> devices / cdroms (on Linux).
>
> Also use MIN_NON_ZERO instead when we clamp max transfer length against
> max segments.
>
> Signed-off-by: Tom Yan 
> Signed-off-by: Maxim Levitsky 
> ---
>  block/file-posix.c | 61 ++
>  1 file changed, 45 insertions(+), 16 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index c63926d592..6581f41b2b 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1162,6 +1162,12 @@ static void raw_reopen_abort(BDRVReopenState *state)
>
>  static int sg_get_max_transfer_length(int fd)
>  {
> +/*
> + * BLKSECTGET for /dev/sg* character devices incorrectly returns
> + * the max transfer size in bytes (rather than in blocks).
> + * Also note that /dev/sg* doesn't support BLKSSZGET ioctl.
> + */
> +
>  #ifdef BLKSECTGET
>  int max_bytes = 0;
>
> @@ -1175,7 +1181,24 @@ static int sg_get_max_transfer_length(int fd)
>  #endif
>  }
>
> -static int sg_get_max_segments(int fd)
> +static int get_max_transfer_length(int fd)
> +{
> +#if defined(BLKSECTGET) && defined(BLKSSZGET)
> +int sect = 0;
> +int ssz = 0;
> +
> +if (ioctl(fd, BLKSECTGET, &sect) == 0 &&
> +ioctl(fd, BLKSSZGET, &ssz) == 0) {
> +return sect * ssz;
> +} else {
> +return -errno;
> +}
> +#else
> +return -ENOSYS;
> +#endif
> +}
> +
> +static int get_max_segments(int fd)
>  {
>  #ifdef CONFIG_LINUX
>  char buf[32];
> @@ -1230,23 +1253,29 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>  BDRVRawState *s = bs->opaque;
>
> -if (bs->sg) {
> -int ret = sg_get_max_transfer_length(s->fd);
> +raw_probe_alignment(bs, s->fd, errp);
> +bs->bl.min_mem_alignment = s->buf_align;
> +bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size);
> +}
>
> -if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> -bs->bl.max_transfer = pow2floor(ret);
> -}
> +static void hdev_refresh_limits(BlockDriverState *bs, Error **errp)
> +{
> +BDRVRawState *s = bs->opaque;
>
> -ret = sg_get_max_segments(s->fd);
> -if (ret > 0) {
> -bs->bl.max_transfer = MIN(bs->bl.max_transfer,
> -  ret * qemu_real_host_page_size);
> -}
> +int ret = bs->sg ? sg_get_max_transfer_length(s->fd) :
> +   get_max_transfer_length(s->fd);
> +
> +if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> +bs->bl.max_transfer = pow2floor(ret);
>  }
>
> -raw_probe_alignment(bs, s->fd, errp);
> -bs->bl.min_mem_alignment = s->buf_align;
> -bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size);
> +ret = get_max_segments(s->fd);
> +if (ret > 0) {
> +bs->bl.max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> +   ret * qemu_real_host_page_size);
> +}
> +
> +raw_refresh_limits(bs, errp);
>  }
>
>  static int check_for_dasd(int fd)
> @@ -3600,7 +3629,7 @@ static BlockDriver bdrv_host_device = {
>  .bdrv_co_pdiscard   = hdev_co_pdiscard,
>  .bdrv_co_copy_range_from = raw_co_copy_range_from,
>  .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> -.bdrv_refresh_limits = raw_refresh_limits,
> +.bdrv_refresh_limits = hdev_refresh_limits,
>  .bdrv_io_plug = raw_aio_plug,
>  .bdrv_io_unplug = raw_aio_unplug,
>  .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> @@ -3724,7 +3753,7 @@ static BlockDriver bdrv_host_cdrom = {
>  .bdrv_co_preadv = raw_co_preadv,
>  .bdrv_co_pwritev= raw_co_pwritev,
>  .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
> -.bdrv_refresh_limits = raw_refresh_limits,
> +.bdrv_refresh_limits = hdev_refresh_limits,
>  .bdrv_io_plug = raw_aio_plug,
>  .bdrv_io_unplug = raw_aio_unplug,
>  .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> --
> 2.26.2
>



[PATCH] block: Fix integer promotion error in bdrv_getlength()

2020-11-04 Thread Tuguoyi
As BDRV_SECTOR_SIZE is of type uint64_t, the comparison implicitly
converts @ret to uint64_t. When bdrv_nb_sectors() returns a negative
error code, the promoted @ret becomes a very large number, so the
check matches and -EFBIG is returned instead of the real error code.
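
A minimal standalone demonstration of the promotion pitfall (just an
illustration, not part of the patch):

    #include <stdint.h>
    #include <stdio.h>

    #define BDRV_SECTOR_SIZE 512ULL   /* unsigned 64-bit, as in QEMU */

    int main(void)
    {
        int64_t ret = -5;   /* simulated error from bdrv_nb_sectors() */

        /* usual arithmetic conversions promote ret to uint64_t, so a
         * negative error code compares as a huge positive value */
        if (ret > INT64_MAX / BDRV_SECTOR_SIZE) {
            printf("negative ret misread as too large, -EFBIG returned\n");
        }
        return 0;
    }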

Signed-off-by: Guoyi Tu 
---
 block.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 430edf7..f14254c 100644
--- a/block.c
+++ b/block.c
@@ -5082,7 +5082,7 @@ int64_t bdrv_getlength(BlockDriverState *bs)
 {
 int64_t ret = bdrv_nb_sectors(bs);
 
-ret = ret > INT64_MAX / BDRV_SECTOR_SIZE ? -EFBIG : ret;
+ret = ret > INT64_MAX / (int64_t)BDRV_SECTOR_SIZE ? -EFBIG : ret;
 return ret < 0 ? ret : ret * BDRV_SECTOR_SIZE;
 }
 
-- 
2.7.4

Please ignore the [PATCH] block: Return the real error code in bdrv_getlength

--
Best regards,
Guoyi


RE: [PATCH] block: Return the real error code in bdrv_getlength

2020-11-04 Thread Tuguoyi
Sorry, please ignore this patch, it's not the right fix.

--
Best regards,
Guoyi


> -Original Message-
> From: tuguoyi (Cloud)
> Sent: Thursday, November 05, 2020 11:11 AM
> To: 'Kevin Wolf' ; 'Max Reitz' ;
> 'qemu-block@nongnu.org' 
> Cc: 'qemu-de...@nongnu.org' 
> Subject: [PATCH] block: Return the real error code in bdrv_getlength
> 
> The return code from bdrv_nb_sectors() should be checked before doing
> the following sanity check.
> 
> Signed-off-by: Guoyi Tu 
> ---
>  block.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 430edf7..19ebbc0 100644
> --- a/block.c
> +++ b/block.c
> @@ -5082,6 +5082,10 @@ int64_t bdrv_getlength(BlockDriverState *bs)
>  {
>  int64_t ret = bdrv_nb_sectors(bs);
> 
> +if (ret < 0) {
> +return ret;
> +}
> +
>  ret = ret > INT64_MAX / BDRV_SECTOR_SIZE ? -EFBIG : ret;
>  return ret < 0 ? ret : ret * BDRV_SECTOR_SIZE;
>  }
> --
> 2.7.4
> 
> 
> --
> Best regards,
> Guoyi
> 



[PATCH] block: Return the real error code in bdrv_getlength

2020-11-04 Thread Tuguoyi
The return code from bdrv_nb_sectors() should be checked before doing
the following sanity check.

Signed-off-by: Guoyi Tu 
---
 block.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/block.c b/block.c
index 430edf7..19ebbc0 100644
--- a/block.c
+++ b/block.c
@@ -5082,6 +5082,10 @@ int64_t bdrv_getlength(BlockDriverState *bs)
 {
 int64_t ret = bdrv_nb_sectors(bs);
 
+if (ret < 0) {
+return ret;
+}
+
 ret = ret > INT64_MAX / BDRV_SECTOR_SIZE ? -EFBIG : ret;
 return ret < 0 ? ret : ret * BDRV_SECTOR_SIZE;
 }
-- 
2.7.4


--
Best regards,
Guoyi




[PATCH v9 07/12] block/nvme: Make ZNS-related definitions

2020-11-04 Thread Dmitry Fomichev
Define values and structures that are needed to support Zoned
Namespace Command Set (NVMe TP 4053).

Signed-off-by: Dmitry Fomichev 
---
 include/block/nvme.h | 114 ++-
 1 file changed, 113 insertions(+), 1 deletion(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 394db19022..752623b4f9 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -489,6 +489,9 @@ enum NvmeIoCommands {
 NVME_CMD_COMPARE= 0x05,
 NVME_CMD_WRITE_ZEROES   = 0x08,
 NVME_CMD_DSM= 0x09,
+NVME_CMD_ZONE_MGMT_SEND = 0x79,
+NVME_CMD_ZONE_MGMT_RECV = 0x7a,
+NVME_CMD_ZONE_APPEND= 0x7d,
 };
 
 typedef struct QEMU_PACKED NvmeDeleteQ {
@@ -648,9 +651,13 @@ typedef struct QEMU_PACKED NvmeAerResult {
 uint8_t resv;
 } NvmeAerResult;
 
+typedef struct QEMU_PACKED NvmeZonedResult {
+uint64_t slba;
+} NvmeZonedResult;
+
 typedef struct QEMU_PACKED NvmeCqe {
 uint32_tresult;
-uint32_trsvd;
+uint32_tdw1;
 uint16_tsq_head;
 uint16_tsq_id;
 uint16_tcid;
@@ -679,6 +686,7 @@ enum NvmeStatusCodes {
 NVME_INVALID_USE_OF_CMB = 0x0012,
 NVME_INVALID_PRP_OFFSET = 0x0013,
 NVME_CMD_SET_CMB_REJECTED   = 0x002b,
+NVME_INVALID_CMD_SET= 0x002c,
 NVME_LBA_RANGE  = 0x0080,
 NVME_CAP_EXCEEDED   = 0x0081,
 NVME_NS_NOT_READY   = 0x0082,
@@ -703,6 +711,14 @@ enum NvmeStatusCodes {
 NVME_CONFLICTING_ATTRS  = 0x0180,
 NVME_INVALID_PROT_INFO  = 0x0181,
 NVME_WRITE_TO_RO= 0x0182,
+NVME_ZONE_BOUNDARY_ERROR= 0x01b8,
+NVME_ZONE_FULL  = 0x01b9,
+NVME_ZONE_READ_ONLY = 0x01ba,
+NVME_ZONE_OFFLINE   = 0x01bb,
+NVME_ZONE_INVALID_WRITE = 0x01bc,
+NVME_ZONE_TOO_MANY_ACTIVE   = 0x01bd,
+NVME_ZONE_TOO_MANY_OPEN = 0x01be,
+NVME_ZONE_INVAL_TRANSITION  = 0x01bf,
 NVME_WRITE_FAULT= 0x0280,
 NVME_UNRECOVERED_READ   = 0x0281,
 NVME_E2E_GUARD_ERROR= 0x0282,
@@ -887,6 +903,11 @@ typedef struct QEMU_PACKED NvmeIdCtrl {
 uint8_t vs[1024];
 } NvmeIdCtrl;
 
+typedef struct NvmeIdCtrlZoned {
+uint8_t zasl;
+uint8_t rsvd1[4095];
+} NvmeIdCtrlZoned;
+
 enum NvmeIdCtrlOacs {
 NVME_OACS_SECURITY  = 1 << 0,
 NVME_OACS_FORMAT= 1 << 1,
@@ -1012,6 +1033,12 @@ typedef struct QEMU_PACKED NvmeLBAF {
 uint8_t rp;
 } NvmeLBAF;
 
+typedef struct QEMU_PACKED NvmeLBAFE {
+uint64_tzsze;
+uint8_t zdes;
+uint8_t rsvd9[7];
+} NvmeLBAFE;
+
#define NVME_NSID_BROADCAST 0xffffffff
 
 typedef struct QEMU_PACKED NvmeIdNs {
@@ -1066,10 +1093,24 @@ enum NvmeNsIdentifierType {
 
 enum NvmeCsi {
 NVME_CSI_NVM= 0x00,
+NVME_CSI_ZONED  = 0x02,
 };
 
 #define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
 
+typedef struct QEMU_PACKED NvmeIdNsZoned {
+uint16_tzoc;
+uint16_tozcs;
+uint32_tmar;
+uint32_tmor;
+uint32_trrl;
+uint32_tfrl;
+uint8_t rsvd20[2796];
+NvmeLBAFE   lbafe[16];
+uint8_t rsvd3072[768];
+uint8_t vs[256];
+} NvmeIdNsZoned;
+
 /*Deallocate Logical Block Features*/
 #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)   ((dlfeat) & 0x10)
 #define NVME_ID_NS_DLFEAT_WRITE_ZEROES(dlfeat)((dlfeat) & 0x08)
@@ -1101,10 +1142,76 @@ enum NvmeIdNsDps {
 DPS_FIRST_EIGHT = 8,
 };
 
+enum NvmeZoneAttr {
+NVME_ZA_FINISHED_BY_CTLR = 1 << 0,
+NVME_ZA_FINISH_RECOMMENDED   = 1 << 1,
+NVME_ZA_RESET_RECOMMENDED= 1 << 2,
+NVME_ZA_ZD_EXT_VALID = 1 << 7,
+};
+
+typedef struct QEMU_PACKED NvmeZoneReportHeader {
+uint64_tnr_zones;
+uint8_t rsvd[56];
+} NvmeZoneReportHeader;
+
+enum NvmeZoneReceiveAction {
+NVME_ZONE_REPORT = 0,
+NVME_ZONE_REPORT_EXTENDED= 1,
+};
+
+enum NvmeZoneReportType {
+NVME_ZONE_REPORT_ALL = 0,
+NVME_ZONE_REPORT_EMPTY   = 1,
+NVME_ZONE_REPORT_IMPLICITLY_OPEN = 2,
+NVME_ZONE_REPORT_EXPLICITLY_OPEN = 3,
+NVME_ZONE_REPORT_CLOSED  = 4,
+NVME_ZONE_REPORT_FULL= 5,
+NVME_ZONE_REPORT_READ_ONLY   = 6,
+NVME_ZONE_REPORT_OFFLINE = 7,
+};
+
+enum NvmeZoneType {
+NVME_ZONE_TYPE_RESERVED  = 0x00,
+NVME_ZONE_TYPE_SEQ_WRITE = 0x02,
+};
+
+enum NvmeZoneSendAction {
+NVME_ZONE_ACTION_RSD = 0x00,
+NVME_ZONE_ACTION_CLOSE   = 0x01,
+NVME_ZONE_ACTION_FINISH  = 0x02,
+NVME_ZONE_ACTION_OPEN= 0x03,
+NVME_ZONE_ACTION_RESET   = 0x04,
+NVME_ZONE_ACTION_OFFLINE = 0x05,
+NVME_ZONE_ACTION_SET_ZD_EXT  = 0x10,
+};
+
+typedef struct QEMU_PACKED NvmeZoneDescr {
+uint8_t zt;
+uint8_t zs;
+uint8_t za;
+uint8_t rsvd3[5];
+uint64_tzcap;
+

[PATCH v9 11/12] hw/block/nvme: Add injection of Offline/Read-Only zones

2020-11-04 Thread Dmitry Fomichev
The ZNS specification defines two zone conditions for zones that can
no longer function properly, possibly because of flash wear or other
internal faults. It is useful to be able to "inject" a small number of
such zones for testing purposes.

This commit defines two optional device properties, "offline_zones"
and "rdonly_zones". Users can assign non-zero values to these variables
to specify the number of zones to be initialized as Offline or
Read-Only. The actual number of injected zones may be smaller than the
requested amount - Read-Only and Offline counts are expected to be much
smaller than the total number of zones on a drive.
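
As an example, a zoned namespace with a few injected bad zones could be
configured as follows (illustrative command line; the drive id is
arbitrary):

    -device nvme-ns,drive=nvme0,zoned=true,zoned.offline_zones=2,zoned.rdonly_zones=2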

Signed-off-by: Dmitry Fomichev 
Reviewed-by: Niklas Cassel 
---
 hw/block/nvme-ns.h |  2 ++
 hw/block/nvme-ns.c | 52 ++
 2 files changed, 54 insertions(+)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 50a6a0e1ac..b30478e5d7 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -36,6 +36,8 @@ typedef struct NvmeNamespaceParams {
 uint32_t max_active_zones;
 uint32_t max_open_zones;
 uint32_t zd_extension_size;
+uint32_t nr_offline_zones;
+uint32_t nr_rdonly_zones;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 85dc73cf06..5e4a6705cd 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -21,6 +21,7 @@
 #include "sysemu/sysemu.h"
 #include "sysemu/block-backend.h"
 #include "qapi/error.h"
+#include "crypto/random.h"
 
 #include "hw/qdev-properties.h"
 #include "hw/qdev-core.h"
@@ -145,6 +146,20 @@ static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
 }
 }
 
+if (ns->params.max_open_zones < nz) {
+if (ns->params.nr_offline_zones > nz - ns->params.max_open_zones) {
+error_setg(errp, "offline_zones value %u is too large",
+ns->params.nr_offline_zones);
+return -1;
+}
+if (ns->params.nr_rdonly_zones >
+nz - ns->params.max_open_zones - ns->params.nr_offline_zones) {
+error_setg(errp, "rdonly_zones value %u is too large",
+ns->params.nr_rdonly_zones);
+return -1;
+}
+}
+
 return 0;
 }
 
@@ -153,7 +168,9 @@ static void nvme_init_zone_state(NvmeNamespace *ns)
 uint64_t start = 0, zone_size = ns->zone_size;
 uint64_t capacity = ns->num_zones * zone_size;
 NvmeZone *zone;
+uint32_t rnd;
 int i;
+uint16_t zs;
 
 ns->zone_array = g_malloc0(ns->zone_array_size);
 if (ns->params.zd_extension_size) {
@@ -180,6 +197,37 @@ static void nvme_init_zone_state(NvmeNamespace *ns)
 zone->w_ptr = start;
 start += zone_size;
 }
+
+/* If required, make some zones Offline or Read Only */
+
+for (i = 0; i < ns->params.nr_offline_zones; i++) {
+do {
+qcrypto_random_bytes(&rnd, sizeof(rnd), NULL);
+rnd %= ns->num_zones;
+} while (rnd < ns->params.max_open_zones);
+zone = &ns->zone_array[rnd];
+zs = nvme_get_zone_state(zone);
+if (zs != NVME_ZONE_STATE_OFFLINE) {
+nvme_set_zone_state(zone, NVME_ZONE_STATE_OFFLINE);
+} else {
+i--;
+}
+}
+
+for (i = 0; i < ns->params.nr_rdonly_zones; i++) {
+do {
+qcrypto_random_bytes(&rnd, sizeof(rnd), NULL);
+rnd %= ns->num_zones;
+} while (rnd < ns->params.max_open_zones);
+zone = &ns->zone_array[rnd];
+zs = nvme_get_zone_state(zone);
+if (zs != NVME_ZONE_STATE_OFFLINE &&
+zs != NVME_ZONE_STATE_READ_ONLY) {
+nvme_set_zone_state(zone, NVME_ZONE_STATE_READ_ONLY);
+} else {
+i--;
+}
+}
 }
 
 static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
@@ -353,6 +401,10 @@ static Property nvme_ns_props[] = {
params.max_open_zones, 0),
 DEFINE_PROP_UINT32("zoned.descr_ext_size", NvmeNamespace,
params.zd_extension_size, 0),
+DEFINE_PROP_UINT32("zoned.offline_zones", NvmeNamespace,
+   params.nr_offline_zones, 0),
+DEFINE_PROP_UINT32("zoned.rdonly_zones", NvmeNamespace,
+   params.nr_rdonly_zones, 0),
 DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
2.21.0




[PATCH v9 10/12] hw/block/nvme: Support Zone Descriptor Extensions

2020-11-04 Thread Dmitry Fomichev
A Zone Descriptor Extension is a label that can be assigned to a zone.
It can be set on an Empty zone and stays assigned until the zone
is reset.

This commit adds a new optional module property,
"zoned.descr_ext_size". Its value must be a multiple of 64 bytes.
If this value is non-zero, it becomes possible to assign extensions
of that size to any Empty zones. The default value for this property
is 0, therefore setting extensions is disabled by default.
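
For example (illustrative command line), the following enables 64-byte
extensions, i.e. ZDES = 1 in units of 64 bytes:

    -device nvme-ns,drive=nvme0,zoned=true,zoned.descr_ext_size=64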

Signed-off-by: Hans Holmberg 
Signed-off-by: Dmitry Fomichev 
Reviewed-by: Klaus Jensen 
Reviewed-by: Niklas Cassel 
---
 hw/block/nvme-ns.h|  8 +++
 hw/block/nvme-ns.c| 25 ++--
 hw/block/nvme.c   | 54 +--
 hw/block/trace-events |  2 ++
 4 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 421bab0a57..50a6a0e1ac 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -35,6 +35,7 @@ typedef struct NvmeNamespaceParams {
 uint64_t zone_cap_bs;
 uint32_t max_active_zones;
 uint32_t max_open_zones;
+uint32_t zd_extension_size;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -58,6 +59,7 @@ typedef struct NvmeNamespace {
 uint64_tzone_capacity;
 uint64_tzone_array_size;
 uint32_tzone_size_log2;
+uint8_t *zd_extensions;
 int32_t nr_open_zones;
 int32_t nr_active_zones;
 
@@ -127,6 +129,12 @@ static inline bool nvme_wp_is_valid(NvmeZone *zone)
st != NVME_ZONE_STATE_OFFLINE;
 }
 
+static inline uint8_t *nvme_get_zd_extension(NvmeNamespace *ns,
+ uint32_t zone_idx)
+{
+return &ns->zd_extensions[zone_idx * ns->params.zd_extension_size];
+}
+
 static inline void nvme_aor_inc_open(NvmeNamespace *ns)
 {
 assert(ns->nr_open_zones >= 0);
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 2e45838c15..85dc73cf06 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -133,6 +133,18 @@ static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
 return -1;
 }
 
+if (ns->params.zd_extension_size) {
+if (ns->params.zd_extension_size & 0x3f) {
+error_setg(errp,
+"zone descriptor extension size must be a multiple of 64B");
+return -1;
+}
+if ((ns->params.zd_extension_size >> 6) > 0xff) {
+error_setg(errp, "zone descriptor extension size is too large");
+return -1;
+}
+}
+
 return 0;
 }
 
@@ -144,6 +156,10 @@ static void nvme_init_zone_state(NvmeNamespace *ns)
 int i;
 
 ns->zone_array = g_malloc0(ns->zone_array_size);
+if (ns->params.zd_extension_size) {
+ns->zd_extensions = g_malloc0(ns->params.zd_extension_size *
+  ns->num_zones);
+}
 
QTAILQ_INIT(&ns->exp_open_zones);
QTAILQ_INIT(&ns->imp_open_zones);
@@ -186,7 +202,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
 id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
 
 id_ns_z->lbafe[lba_index].zsze = cpu_to_le64(ns->zone_size);
-id_ns_z->lbafe[lba_index].zdes = 0;
+id_ns_z->lbafe[lba_index].zdes =
+ns->params.zd_extension_size >> 6; /* Units of 64B */
 
 ns->csi = NVME_CSI_ZONED;
 ns->id_ns.nsze = cpu_to_le64(ns->zone_size * ns->num_zones);
@@ -204,7 +221,8 @@ static void nvme_clear_zone(NvmeNamespace *ns, NvmeZone *zone)
 
 zone->w_ptr = zone->d.wp;
 state = nvme_get_zone_state(zone);
-if (zone->d.wp != zone->d.zslba) {
+if (zone->d.wp != zone->d.zslba ||
+(zone->d.za & NVME_ZA_ZD_EXT_VALID)) {
 if (state != NVME_ZONE_STATE_CLOSED) {
 trace_pci_nvme_clear_ns_close(state, zone->d.zslba);
 nvme_set_zone_state(zone, NVME_ZONE_STATE_CLOSED);
@@ -301,6 +319,7 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 if (ns->params.zoned) {
 g_free(ns->id_ns_zoned);
 g_free(ns->zone_array);
+g_free(ns->zd_extensions);
 }
 }
 
@@ -332,6 +351,8 @@ static Property nvme_ns_props[] = {
params.max_active_zones, 0),
 DEFINE_PROP_UINT32("zoned.max_open", NvmeNamespace,
params.max_open_zones, 0),
+DEFINE_PROP_UINT32("zoned.descr_ext_size", NvmeNamespace,
+   params.zd_extension_size, 0),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index cbfd58b7c1..0db51995cc 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1703,6 +1703,26 @@ static uint16_t nvme_offline_zone(NvmeNamespace *ns, NvmeZone *zone,
 return NVME_ZONE_INVAL_TRANSITION;
 }
 
+static uint16_t nvme_set_zd_ext(NvmeNamespace *ns, NvmeZone *zone)
+{
+uint16_t status;
+uint8_t state = nvme_get_zone_state(zone);
+
+if (state == NVME_ZONE_STATE_EMPTY) {
+nvme_auto_transition_zone(ns, false, true);
+

[PATCH v9 12/12] hw/block/nvme: Document zoned parameters in usage text

2020-11-04 Thread Dmitry Fomichev
Added brief descriptions of the new device properties that are
now available to users to configure features of Zoned Namespace
Command Set in the emulator.

This patch is for documentation only, no functionality change.

Signed-off-by: Dmitry Fomichev 
Reviewed-by: Niklas Cassel 
---
 hw/block/nvme.c | 47 ++-
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 0db51995cc..8901321317 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,7 +9,7 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.4, 1.3, 1.2, 1.1, 1.0e
  *
  *  https://nvmexpress.org/developers/nvme-specification/
  */
@@ -22,8 +22,9 @@
 *  [pmrdev=<mem_backend_file_id>,] \
 *  max_ioqpairs=<N[optional]>, \
 *  aerl=<N[optional]>, aer_max_queued=<N[optional]>, \
- *  mdts=<N[optional]>
- *  -device nvme-ns,drive=<drive_id>,bus=bus_name,nsid=<nsid>
+ *  mdts=<N[optional]>,zoned.append_size_limit=<N[optional]> \
+ *  -device nvme-ns,drive=<drive_id>,bus=<bus_name>,nsid=<nsid>,\
+ *  zoned=<true|false[optional]>
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -41,14 +42,50 @@
  * ~~
  * - `aerl`
  *   The Asynchronous Event Request Limit (AERL). Indicates the maximum number
- *   of concurrently outstanding Asynchronous Event Request commands suppoert
+ *   of concurrently outstanding Asynchronous Event Request commands support
  *   by the controller. This is a 0's based value.
  *
  * - `aer_max_queued`
  *   This is the maximum number of events that the device will enqueue for
- *   completion when there are no oustanding AERs. When the maximum number of
+ *   completion when there are no outstanding AERs. When the maximum number of
  *   enqueued events are reached, subsequent events will be dropped.
  *
+ * - `zoned.append_size_limit`
+ *   The maximum I/O size in bytes that is allowed in Zone Append command.
+ *   The default is 128KiB. Since internally this value is maintained as
+ *   ZASL = log2(<maximum append size> / <page size>), some values assigned
+ *   to this property may be rounded down and result in a lower maximum ZA
+ *   data size being in effect. By setting this property to 0, users can make
+ *   ZASL equal to MDTS. This property only affects zoned namespaces.
+ *
+ * Setting `zoned` to true selects Zoned Command Set at the namespace.
+ * In this case, the following namespace properties are available to configure
+ * zoned operation:
+ * zoned.zsze=<zone size in bytes, default: 128MiB>
+ * The number may be followed by K, M, G as in kilo-, mega- or giga-.
+ *
+ * zoned.zcap=<zone capacity in bytes, default: 0>
+ * The value 0 (default) forces zone capacity to be the same as zone
+ * size. The value of this property may not exceed zone size.
+ *
+ * zoned.descr_ext_size=<zone descriptor extension size, default: 0>
+ * This value needs to be specified in 64B units. If it is zero,
+ * namespace(s) will not support zone descriptor extensions.
+ *
+ * zoned.max_active=<max number of active zones, default: 0>
+ * The default value means there is no limit to the number of
+ * concurrently active zones.
+ *
+ * zoned.max_open=<max number of open zones, default: 0>
+ * The default value means there is no limit to the number of
+ * concurrently open zones.
+ *
+ * zoned.offline_zones=<number of zones to be injected as Offline, default: 0>
+ *
+ * zoned.rdonly_zones=<number of zones to be injected as Read-Only, default: 0>
+ *
+ * zoned.cross_zone_read=<true|false, default: false>
+ * Setting this property to true enables Read Across Zone Boundaries.
  */
 
 #include "qemu/osdep.h"
-- 
2.21.0
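
As a rough sketch of the ZASL arithmetic described in the
zoned.append_size_limit text above (an illustration only; the helper
name is made up, and the limit is assumed to be a power-of-two
multiple of the controller page size):

    #include <stdint.h>

    static uint8_t nvme_zasl_from_bytes(uint32_t zasl_bs, uint32_t page_size)
    {
        uint32_t units = zasl_bs / page_size;
        uint8_t zasl = 0;

        while (units > 1) {     /* integer log2 */
            units >>= 1;
            zasl++;
        }
        return zasl;            /* 128KiB / 4KiB = 32 -> ZASL = 5 */
    }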




[PATCH v9 09/12] hw/block/nvme: Introduce max active and open zone limits

2020-11-04 Thread Dmitry Fomichev
Add two module properties, "zoned.max_active" and "zoned.max_open"
to control the maximum number of zones that can be active or open.
Once these variables are set to non-default values, these limits are
checked during I/O and Too Many Active or Too Many Open command status
is returned if they are exceeded.
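
For instance (illustrative command line using the properties added
here), limits of 16 open and 32 active zones can be set with:

    -device nvme-ns,drive=nvme0,zoned=true,zoned.max_open=16,zoned.max_active=32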

Signed-off-by: Hans Holmberg 
Signed-off-by: Dmitry Fomichev 
Reviewed-by: Niklas Cassel 
---
 hw/block/nvme-ns.h| 41 +++
 hw/block/nvme-ns.c| 30 +-
 hw/block/nvme.c   | 94 +++
 hw/block/trace-events |  2 +
 4 files changed, 165 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index d2631ff5a3..421bab0a57 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -33,6 +33,8 @@ typedef struct NvmeNamespaceParams {
 bool cross_zone_read;
 uint64_t zone_size_bs;
 uint64_t zone_cap_bs;
+uint32_t max_active_zones;
+uint32_t max_open_zones;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -56,6 +58,8 @@ typedef struct NvmeNamespace {
 uint64_tzone_capacity;
 uint64_tzone_array_size;
 uint32_tzone_size_log2;
+int32_t nr_open_zones;
+int32_t nr_active_zones;
 
 NvmeNamespaceParams params;
 } NvmeNamespace;
@@ -123,6 +127,43 @@ static inline bool nvme_wp_is_valid(NvmeZone *zone)
st != NVME_ZONE_STATE_OFFLINE;
 }
 
+static inline void nvme_aor_inc_open(NvmeNamespace *ns)
+{
+assert(ns->nr_open_zones >= 0);
+if (ns->params.max_open_zones) {
+ns->nr_open_zones++;
+assert(ns->nr_open_zones <= ns->params.max_open_zones);
+}
+}
+
+static inline void nvme_aor_dec_open(NvmeNamespace *ns)
+{
+if (ns->params.max_open_zones) {
+assert(ns->nr_open_zones > 0);
+ns->nr_open_zones--;
+}
+assert(ns->nr_open_zones >= 0);
+}
+
+static inline void nvme_aor_inc_active(NvmeNamespace *ns)
+{
+assert(ns->nr_active_zones >= 0);
+if (ns->params.max_active_zones) {
+ns->nr_active_zones++;
+assert(ns->nr_active_zones <= ns->params.max_active_zones);
+}
+}
+
+static inline void nvme_aor_dec_active(NvmeNamespace *ns)
+{
+if (ns->params.max_active_zones) {
+assert(ns->nr_active_zones > 0);
+ns->nr_active_zones--;
+assert(ns->nr_active_zones >= ns->nr_open_zones);
+}
+assert(ns->nr_active_zones >= 0);
+}
+
 int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_flush(NvmeNamespace *ns);
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index e6db7f7d3b..2e45838c15 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -119,6 +119,20 @@ static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
 ns->zone_size_log2 = 63 - clz64(ns->zone_size);
 }
 
+/* Make sure that the values of all ZNS properties are sane */
+if (ns->params.max_open_zones > nz) {
+error_setg(errp,
+   "max_open_zones value %u exceeds the number of zones %u",
+   ns->params.max_open_zones, nz);
+return -1;
+}
+if (ns->params.max_active_zones > nz) {
+error_setg(errp,
+   "max_active_zones value %u exceeds the number of zones %u",
+   ns->params.max_active_zones, nz);
+return -1;
+}
+
 return 0;
 }
 
@@ -166,8 +180,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
 id_ns_z = g_malloc0(sizeof(NvmeIdNsZoned));
 
 /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
-id_ns_z->mar = 0x;
-id_ns_z->mor = 0x;
+id_ns_z->mar = cpu_to_le32(ns->params.max_active_zones - 1);
+id_ns_z->mor = cpu_to_le32(ns->params.max_open_zones - 1);
 id_ns_z->zoc = 0;
 id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
 
@@ -195,6 +209,7 @@ static void nvme_clear_zone(NvmeNamespace *ns, NvmeZone *zone)
 trace_pci_nvme_clear_ns_close(state, zone->d.zslba);
 nvme_set_zone_state(zone, NVME_ZONE_STATE_CLOSED);
 }
+nvme_aor_inc_active(ns);
QTAILQ_INSERT_HEAD(&ns->closed_zones, zone, entry);
 } else {
 trace_pci_nvme_clear_ns_reset(state, zone->d.zslba);
@@ -211,16 +226,23 @@ static void nvme_zoned_ns_shutdown(NvmeNamespace *ns)
 
QTAILQ_FOREACH_SAFE(zone, &ns->closed_zones, entry, next) {
QTAILQ_REMOVE(&ns->closed_zones, zone, entry);
+nvme_aor_dec_active(ns);
nvme_clear_zone(ns, zone);
}
QTAILQ_FOREACH_SAFE(zone, &ns->imp_open_zones, entry, next) {
QTAILQ_REMOVE(&ns->imp_open_zones, zone, entry);
+nvme_aor_dec_open(ns);
+nvme_aor_dec_active(ns);
nvme_clear_zone(ns, zone);
}
QTAILQ_FOREACH_SAFE(zone, &ns->exp_open_zones, entry, next) {
QTAILQ_REMOVE(&ns->exp_open_zones, zone, entry);
+nvme_aor_dec_open(ns);
+

[PATCH v9 06/12] hw/block/nvme: Support allocated CNS command variants

2020-11-04 Thread Dmitry Fomichev
From: Niklas Cassel 

Many CNS commands have "allocated" command variants. These include
a namespace as long as it is allocated, that is, a namespace is
included regardless of whether it is active (attached) or not.

While these commands are optional (they are mandatory for controllers
supporting the namespace attachment command), our QEMU implementation
is made more complete by actually providing support for these CNS
values.

However, since our QEMU model currently does not support the namespace
attachment command, these new allocated CNS commands will return the
same result as the active CNS command variants.

In NVMe, a namespace is active if it exists and is attached to the
controller.

Add a new Boolean namespace flag, "attached", to provide the most
basic namespace attachment support. The default value for this new
flag is true. Also, implement the logic in the new CNS values to
include/exclude namespaces based on this new property. The only thing
missing is hooking up the actual Namespace Attachment command opcode,
which will allow a user to toggle the "attached" flag per namespace.

The reason for not hooking up this command completely is because the
NVMe specification requires the namespace management command to be
supported if the namespace attachment command is supported.

Signed-off-by: Niklas Cassel 
Signed-off-by: Dmitry Fomichev 
Reviewed-by: Keith Busch 
---
 hw/block/nvme-ns.h   |  1 +
 include/block/nvme.h | 20 +++
 hw/block/nvme-ns.c   |  1 +
 hw/block/nvme.c  | 46 +---
 4 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index d795e44bab..2d9cd29d07 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -31,6 +31,7 @@ typedef struct NvmeNamespace {
 int64_t  size;
 NvmeIdNs id_ns;
 const uint32_t *iocs;
+bool attached;
 uint8_t  csi;
 
 NvmeNamespaceParams params;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index af23514713..394db19022 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -806,14 +806,18 @@ typedef struct QEMU_PACKED NvmePSD {
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
 enum NvmeIdCns {
-NVME_ID_CNS_NS= 0x00,
-NVME_ID_CNS_CTRL  = 0x01,
-NVME_ID_CNS_NS_ACTIVE_LIST= 0x02,
-NVME_ID_CNS_NS_DESCR_LIST = 0x03,
-NVME_ID_CNS_CS_NS = 0x05,
-NVME_ID_CNS_CS_CTRL   = 0x06,
-NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
-NVME_ID_CNS_IO_COMMAND_SET= 0x1c,
+NVME_ID_CNS_NS= 0x00,
+NVME_ID_CNS_CTRL  = 0x01,
+NVME_ID_CNS_NS_ACTIVE_LIST= 0x02,
+NVME_ID_CNS_NS_DESCR_LIST = 0x03,
+NVME_ID_CNS_CS_NS = 0x05,
+NVME_ID_CNS_CS_CTRL   = 0x06,
+NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
+NVME_ID_CNS_NS_PRESENT_LIST   = 0x10,
+NVME_ID_CNS_NS_PRESENT= 0x11,
+NVME_ID_CNS_CS_NS_PRESENT_LIST= 0x1a,
+NVME_ID_CNS_CS_NS_PRESENT = 0x1b,
+NVME_ID_CNS_IO_COMMAND_SET= 0x1c,
 };
 
 typedef struct QEMU_PACKED NvmeIdCtrl {
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index c0362426cc..e191ef9be0 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -42,6 +42,7 @@ static void nvme_ns_init(NvmeNamespace *ns)
 id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
 
 ns->csi = NVME_CSI_NVM;
+ns->attached = true;
 
 /* no thin provisioning */
 id_ns->ncap = id_ns->nsze;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index bb82dd9975..7495cdb5ef 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1236,6 +1236,7 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
 uint32_t trans_len;
 NvmeNamespace *ns;
 time_t current_ms;
+int i;
 
 if (off >= sizeof(smart)) {
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -1246,10 +1247,7 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
 if (!ns) {
 return NVME_INVALID_NSID | NVME_DNR;
 }
-nvme_set_blk_stats(ns, &smart);
 } else {
-int i;
-
 for (i = 1; i <= n->num_namespaces; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
@@ -1552,7 +1550,8 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
+ bool only_active)
 {
 NvmeNamespace *ns;
NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1569,11 +1568,16 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
 return nvme_rpt_empty_id_struct(n, req);
 }
 
+if (only_active && !ns->attached) {
+return nvme_rpt_empty_id_struct(n, req);
+}
+
return nvme_dma(n, (uint8_t *)&ns->id_ns, 

[PATCH v9 04/12] hw/block/nvme: Merge nvme_write_zeroes() with nvme_write()

2020-11-04 Thread Dmitry Fomichev
nvme_write() now handles WRITE, WRITE ZEROES and ZONE_APPEND.

Signed-off-by: Dmitry Fomichev 
Reviewed-by: Niklas Cassel 
Acked-by: Klaus Jensen 
---
 hw/block/nvme.c   | 72 +--
 hw/block/trace-events |  1 -
 2 files changed, 28 insertions(+), 45 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 770e42a066..48adbe84f5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1001,32 +1001,7 @@ invalid:
 return status | NVME_DNR;
 }
 
-static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
-{
-NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
-NvmeNamespace *ns = req->ns;
-uint64_t slba = le64_to_cpu(rw->slba);
-uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
-uint64_t offset = nvme_l2b(ns, slba);
-uint32_t count = nvme_l2b(ns, nlb);
-uint16_t status;
-
-trace_pci_nvme_write_zeroes(nvme_cid(req), nvme_nsid(ns), slba, nlb);
-
-status = nvme_check_bounds(n, ns, slba, nlb);
-if (status) {
-trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
-return status;
-}
-
-block_acct_start(blk_get_stats(req->ns->blkconf.blk), &req->acct, 0,
- BLOCK_ACCT_WRITE);
-req->aiocb = blk_aio_pwrite_zeroes(req->ns->blkconf.blk, offset, count,
-   BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);
-return NVME_NO_COMPLETE;
-}
-
-static uint16_t nvme_write(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_write(NvmeCtrl *n, NvmeRequest *req, bool wrz)
 {
NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
 NvmeNamespace *ns = req->ns;
@@ -1040,10 +1015,12 @@ static uint16_t nvme_write(NvmeCtrl *n, NvmeRequest *req)
 trace_pci_nvme_write(nvme_cid(req), nvme_io_opc_str(rw->opcode),
  nvme_nsid(ns), nlb, data_size, slba);
 
-status = nvme_check_mdts(n, data_size);
-if (status) {
-trace_pci_nvme_err_mdts(nvme_cid(req), data_size);
-goto invalid;
+if (!wrz) {
+status = nvme_check_mdts(n, data_size);
+if (status) {
+trace_pci_nvme_err_mdts(nvme_cid(req), data_size);
+goto invalid;
+}
 }
 
 status = nvme_check_bounds(n, ns, slba, nlb);
@@ -1052,21 +1029,28 @@ static uint16_t nvme_write(NvmeCtrl *n, NvmeRequest *req)
 goto invalid;
 }
 
-status = nvme_map_dptr(n, data_size, req);
-if (status) {
-goto invalid;
-}
-
 data_offset = nvme_l2b(ns, slba);
 
-block_acct_start(blk_get_stats(blk), &req->acct, data_size,
- BLOCK_ACCT_WRITE);
-if (req->qsg.sg) {
-req->aiocb = dma_blk_write(blk, &req->qsg, data_offset,
-   BDRV_SECTOR_SIZE, nvme_rw_cb, req);
+if (!wrz) {
+status = nvme_map_dptr(n, data_size, req);
+if (status) {
+goto invalid;
+}
+
+block_acct_start(blk_get_stats(blk), &req->acct, data_size,
+ BLOCK_ACCT_WRITE);
+if (req->qsg.sg) {
+req->aiocb = dma_blk_write(blk, &req->qsg, data_offset,
+   BDRV_SECTOR_SIZE, nvme_rw_cb, req);
+} else {
+req->aiocb = blk_aio_pwritev(blk, data_offset, &req->iov, 0,
+ nvme_rw_cb, req);
+}
} else {
-req->aiocb = blk_aio_pwritev(blk, data_offset, &req->iov, 0,
- nvme_rw_cb, req);
+block_acct_start(blk_get_stats(blk), &req->acct, 0, BLOCK_ACCT_WRITE);
+req->aiocb = blk_aio_pwrite_zeroes(blk, data_offset, data_size,
+   BDRV_REQ_MAY_UNMAP, nvme_rw_cb,
+   req);
 }
 return NVME_NO_COMPLETE;
 
@@ -1100,9 +1084,9 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 case NVME_CMD_FLUSH:
 return nvme_flush(n, req);
 case NVME_CMD_WRITE_ZEROES:
-return nvme_write_zeroes(n, req);
+return nvme_write(n, req, true);
 case NVME_CMD_WRITE:
-return nvme_write(n, req);
+return nvme_write(n, req, false);
 case NVME_CMD_READ:
 return nvme_read(n, req);
 default:
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 540c600931..e67e96c2b5 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -43,7 +43,6 @@ pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opna
 pci_nvme_read(uint16_t cid, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_write(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
-pci_nvme_write_zeroes(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" 

[PATCH v9 08/12] hw/block/nvme: Support Zoned Namespace Command Set

2020-11-04 Thread Dmitry Fomichev
The emulation code has been changed to advertise the NVM Command Set
when the "zoned" device property is not set (the default), and the
Zoned Namespace Command Set otherwise.

Define values and structures that are needed to support Zoned
Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
Define trace events where needed in newly introduced code.

In order to improve scalability, all open, closed and full zones
are organized in separate linked lists. Consequently, almost all
zone operations don't require scanning of the entire zone array
(which potentially can be quite large) - it is only necessary to
enumerate one or more zone lists.
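
To make the bookkeeping concrete, here is how such per-state lists are
typically walked (an illustrative fragment only; list heads and helpers
follow the declarations added in this patch):

    NvmeZone *zone, *next;

    /* close every implicitly open zone; only the imp_open_zones list
     * is traversed, not the whole zone array */
    QTAILQ_FOREACH_SAFE(zone, &ns->imp_open_zones, entry, next) {
        QTAILQ_REMOVE(&ns->imp_open_zones, zone, entry);
        nvme_set_zone_state(zone, NVME_ZONE_STATE_CLOSED);
        QTAILQ_INSERT_TAIL(&ns->closed_zones, zone, entry);
    }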

Handlers for three new NVMe commands introduced in Zoned Namespace
Command Set specification are added, namely for Zone Management
Receive, Zone Management Send and Zone Append.

Device initialization code has been extended to create a proper
configuration for zoned operation using device properties.

Read/Write command handler is modified to only allow writes at the
write pointer if the namespace is zoned. For Zone Append command,
writes implicitly happen at the write pointer and the starting write
pointer value is returned as the result of the command. Write Zeroes
handler is modified to add zoned checks that are identical to those
done as a part of Write flow.

Subsequent commits in this series add ZDE support and checks for
active and open zone limits.

Signed-off-by: Niklas Cassel 
Signed-off-by: Hans Holmberg 
Signed-off-by: Ajay Joshi 
Signed-off-by: Chaitanya Kulkarni 
Signed-off-by: Matias Bjorling 
Signed-off-by: Aravind Ramesh 
Signed-off-by: Shin'ichiro Kawasaki 
Signed-off-by: Adam Manzanares 
Signed-off-by: Dmitry Fomichev 
Reviewed-by: Niklas Cassel 
---
 hw/block/nvme-ns.h|  54 +++
 hw/block/nvme.h   |   8 +
 hw/block/nvme-ns.c| 173 
 hw/block/nvme.c   | 971 +-
 hw/block/trace-events |  18 +-
 5 files changed, 1209 insertions(+), 15 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 2d9cd29d07..d2631ff5a3 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -19,9 +19,20 @@
 #define NVME_NS(obj) \
 OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
 
+typedef struct NvmeZone {
+NvmeZoneDescr   d;
+uint64_tw_ptr;
+QTAILQ_ENTRY(NvmeZone) entry;
+} NvmeZone;
+
 typedef struct NvmeNamespaceParams {
 uint32_t nsid;
 QemuUUID uuid;
+
+bool zoned;
+bool cross_zone_read;
+uint64_t zone_size_bs;
+uint64_t zone_cap_bs;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -34,6 +45,18 @@ typedef struct NvmeNamespace {
 bool attached;
 uint8_t  csi;
 
+NvmeIdNsZoned   *id_ns_zoned;
+NvmeZone*zone_array;
+QTAILQ_HEAD(, NvmeZone) exp_open_zones;
+QTAILQ_HEAD(, NvmeZone) imp_open_zones;
+QTAILQ_HEAD(, NvmeZone) closed_zones;
+QTAILQ_HEAD(, NvmeZone) full_zones;
+uint32_tnum_zones;
+uint64_tzone_size;
+uint64_tzone_capacity;
+uint64_tzone_array_size;
+uint32_tzone_size_log2;
+
 NvmeNamespaceParams params;
 } NvmeNamespace;
 
@@ -71,8 +94,39 @@ static inline size_t nvme_l2b(NvmeNamespace *ns, uint64_t lba)
 
 typedef struct NvmeCtrl NvmeCtrl;
 
+static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
+{
+return zone->d.zs >> 4;
+}
+
+static inline void nvme_set_zone_state(NvmeZone *zone, enum NvmeZoneState state)
+{
+zone->d.zs = state << 4;
+}
+
+static inline uint64_t nvme_zone_rd_boundary(NvmeNamespace *ns, NvmeZone *zone)
+{
+return zone->d.zslba + ns->zone_size;
+}
+
+static inline uint64_t nvme_zone_wr_boundary(NvmeZone *zone)
+{
+return zone->d.zslba + zone->d.zcap;
+}
+
+static inline bool nvme_wp_is_valid(NvmeZone *zone)
+{
+uint8_t st = nvme_get_zone_state(zone);
+
+return st != NVME_ZONE_STATE_FULL &&
+   st != NVME_ZONE_STATE_READ_ONLY &&
+   st != NVME_ZONE_STATE_OFFLINE;
+}
+
 int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_flush(NvmeNamespace *ns);
+void nvme_ns_shutdown(NvmeNamespace *ns);
+void nvme_ns_cleanup(NvmeNamespace *ns);
 
 #endif /* NVME_NS_H */
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a..4cb0615128 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -6,6 +6,9 @@
 
 #define NVME_MAX_NAMESPACES 256
 
+#define NVME_DEFAULT_ZONE_SIZE   (128 * MiB)
+#define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
+
 typedef struct NvmeParams {
 char *serial;
 uint32_t num_queues; /* deprecated since 5.1 */
@@ -16,6 +19,7 @@ typedef struct NvmeParams {
 uint32_t aer_max_queued;
 uint8_t  mdts;
 bool use_intel_id;
+uint32_t zasl_bs;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -28,6 +32,8 @@ typedef struct NvmeRequest {
 struct NvmeNamespace*ns;
 BlockAIOCB  *aiocb;
 uint16_tstatus;
+

[PATCH v9 02/12] hw/block/nvme: Generate namespace UUIDs

2020-11-04 Thread Dmitry Fomichev
In NVMe 1.4, a namespace must report an ID descriptor of UUID type
if it doesn't support EUI64 or NGUID. Add a new namespace property,
"uuid", that provides the user the option to either specify the UUID
explicitly or have a UUID generated automatically every time a
namespace is initialized.
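
For example (illustrative command line; the UUID value is arbitrary):

    -device nvme-ns,drive=nvme0,nsid=1,uuid=89f394f5-6a25-4996-a769-4e84e32c38fb

Omitting the property keeps the auto-generation behavior.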

Suggested-by: Klaus Jensen 
Signed-off-by: Dmitry Fomichev 
Reviewed-by: Klaus Jensen 
Reviewed-by: Keith Busch 
Reviewed-by: Niklas Cassel 
---
 hw/block/nvme-ns.h | 1 +
 hw/block/nvme-ns.c | 1 +
 hw/block/nvme.c| 9 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index ea8c2f785d..a38071884a 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -21,6 +21,7 @@
 
 typedef struct NvmeNamespaceParams {
 uint32_t nsid;
+QemuUUID uuid;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index b69cdaf27e..de735eb9f3 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -129,6 +129,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 static Property nvme_ns_props[] = {
 DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
 DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
+DEFINE_PROP_UUID("uuid", NvmeNamespace, params.uuid),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 702f7cc2e3..ed3f38f01d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1564,6 +1564,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
 
 static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
 {
+NvmeNamespace *ns;
NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
 uint32_t nsid = le32_to_cpu(c->nsid);
 uint8_t list[NVME_IDENTIFY_DATA_SIZE];
@@ -1583,7 +1584,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
 return NVME_INVALID_NSID | NVME_DNR;
 }
 
-if (unlikely(!nvme_ns(n, nsid))) {
+ns = nvme_ns(n, nsid);
+if (unlikely(!ns)) {
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
@@ -1592,12 +1594,11 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
 /*
  * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
  * structure, a Namespace UUID (nidt = 0x3) must be reported in the
- * Namespace Identification Descriptor. Add a very basic Namespace UUID
- * here.
+ * Namespace Identification Descriptor. Add the namespace UUID here.
  */
 ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
 ns_descrs->uuid.hdr.nidl = NVME_NIDT_UUID_LEN;
-stl_be_p(&ns_descrs->uuid.v, nsid);
+memcpy(&ns_descrs->uuid.v, ns->params.uuid.data, NVME_NIDT_UUID_LEN);
 
 return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
 DMA_DIRECTION_FROM_DEVICE, req);
-- 
2.21.0




[PATCH v9 05/12] hw/block/nvme: Add support for Namespace Types

2020-11-04 Thread Dmitry Fomichev
From: Niklas Cassel 

Define the structures and constants required to implement
Namespace Types support.

Namespace Types introduce a new command set, "I/O Command Sets",
that allows the host to retrieve the command sets associated with
a namespace. Introduce support for the command set and enable
detection for the NVM Command Set.

The new workflows for identify commands rely heavily on zero-filled
identify structs. E.g., certain CNS commands are defined to return
a zero-filled identify struct when an inactive namespace NSID
is supplied.

Add a helper function in order to avoid code duplication when
reporting zero-filled identify structures.
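
Such a helper might look like the following sketch (using nvme_dma()
and NVME_IDENTIFY_DATA_SIZE as they appear elsewhere in this series):

    static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest *req)
    {
        uint8_t id[NVME_IDENTIFY_DATA_SIZE] = {};

        /* DMA a zero-filled 4KiB identify structure to the host */
        return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE, req);
    }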

Signed-off-by: Niklas Cassel 
Signed-off-by: Dmitry Fomichev 
Reviewed-by: Keith Busch 
---
 hw/block/nvme-ns.h|   1 +
 include/block/nvme.h  |  66 +++
 hw/block/nvme-ns.c|   2 +
 hw/block/nvme.c   | 188 +++---
 hw/block/trace-events |   7 ++
 5 files changed, 219 insertions(+), 45 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index a38071884a..d795e44bab 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -31,6 +31,7 @@ typedef struct NvmeNamespace {
 int64_t  size;
 NvmeIdNs id_ns;
 const uint32_t *iocs;
+uint8_t  csi;
 
 NvmeNamespaceParams params;
 } NvmeNamespace;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index f62cc90d49..af23514713 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -84,6 +84,7 @@ enum NvmeCapMask {
 
 enum NvmeCapCss {
 NVME_CAP_CSS_NVM= 1 << 0,
+NVME_CAP_CSS_CSI_SUPP   = 1 << 6,
 NVME_CAP_CSS_ADMIN_ONLY = 1 << 7,
 };
 
@@ -117,9 +118,25 @@ enum NvmeCcMask {
 
 enum NvmeCcCss {
 NVME_CC_CSS_NVM= 0x0,
+NVME_CC_CSS_CSI= 0x6,
 NVME_CC_CSS_ADMIN_ONLY = 0x7,
 };
 
+#define NVME_SET_CC_EN(cc, val) \
+(cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
+#define NVME_SET_CC_CSS(cc, val)\
+(cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
+#define NVME_SET_CC_MPS(cc, val)\
+(cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
+#define NVME_SET_CC_AMS(cc, val)\
+(cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
+#define NVME_SET_CC_SHN(cc, val)\
+(cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
+#define NVME_SET_CC_IOSQES(cc, val) \
+(cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
+#define NVME_SET_CC_IOCQES(cc, val) \
+(cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
+
 enum NvmeCstsShift {
 CSTS_RDY_SHIFT  = 0,
 CSTS_CFS_SHIFT  = 1,
@@ -534,8 +551,13 @@ typedef struct QEMU_PACKED NvmeIdentify {
 uint64_trsvd2[2];
 uint64_tprp1;
 uint64_tprp2;
-uint32_tcns;
-uint32_trsvd11[5];
+uint8_t cns;
+uint8_t rsvd10;
+uint16_tctrlid;
+uint16_tnvmsetid;
+uint8_t rsvd11;
+uint8_t csi;
+uint32_trsvd12[4];
 } NvmeIdentify;
 
 typedef struct QEMU_PACKED NvmeRwCmd {
@@ -656,6 +678,7 @@ enum NvmeStatusCodes {
 NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
 NVME_INVALID_USE_OF_CMB = 0x0012,
 NVME_INVALID_PRP_OFFSET = 0x0013,
+NVME_CMD_SET_CMB_REJECTED   = 0x002b,
 NVME_LBA_RANGE  = 0x0080,
 NVME_CAP_EXCEEDED   = 0x0081,
 NVME_NS_NOT_READY   = 0x0082,
@@ -782,11 +805,15 @@ typedef struct QEMU_PACKED NvmePSD {
 
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
-enum {
-NVME_ID_CNS_NS = 0x0,
-NVME_ID_CNS_CTRL   = 0x1,
-NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
-NVME_ID_CNS_NS_DESCR_LIST  = 0x3,
+enum NvmeIdCns {
+NVME_ID_CNS_NS= 0x00,
+NVME_ID_CNS_CTRL  = 0x01,
+NVME_ID_CNS_NS_ACTIVE_LIST= 0x02,
+NVME_ID_CNS_NS_DESCR_LIST = 0x03,
+NVME_ID_CNS_CS_NS = 0x05,
+NVME_ID_CNS_CS_CTRL   = 0x06,
+NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
+NVME_ID_CNS_IO_COMMAND_SET= 0x1c,
 };
 
 typedef struct QEMU_PACKED NvmeIdCtrl {
@@ -934,6 +961,7 @@ enum NvmeFeatureIds {
 NVME_WRITE_ATOMICITY= 0xa,
 NVME_ASYNCHRONOUS_EVENT_CONF= 0xb,
 NVME_TIMESTAMP  = 0xe,
+NVME_COMMAND_SET_PROFILE= 0x19,
 NVME_SOFTWARE_PROGRESS_MARKER   = 0x80,
 NVME_FID_MAX= 0x100,
 };
@@ -1018,18 +1046,26 @@ typedef struct QEMU_PACKED NvmeIdNsDescr {
 uint8_t rsvd2[2];
 } NvmeIdNsDescr;
 
-enum {
-NVME_NIDT_EUI64_LEN =  8,
-NVME_NIDT_NGUID_LEN = 16,
-NVME_NIDT_UUID_LEN  = 16,
+enum NvmeNsIdentifierLength {
+NVME_NIDL_EUI64 = 8,
+NVME_NIDL_NGUID = 16,
+NVME_NIDL_UUID  = 16,
+NVME_NIDL_CSI   = 1,
 };
 
 enum NvmeNsIdentifierType {
-NVME_NIDT_EUI64 = 0x1,
-NVME_NIDT_NGUID = 0x2,
-NVME_NIDT_UUID  = 0x3,
+NVME_NIDT_EUI64 = 0x01,
+NVME_NIDT_NGUID = 0x02,
+

[PATCH v9 01/12] hw/block/nvme: Add Commands Supported and Effects log

2020-11-04 Thread Dmitry Fomichev
Implementing this log page becomes necessary to allow checking for
Zone Append command support in the Zoned Namespace Command Set.

This commit adds the code to report this log page for NVM Command
Set only. The parts that are specific to zoned operation will be
added later in the series.

All incoming admin and i/o commands are now only processed if their
corresponding support bits are set in this log. This provides an
easy way to control which commands to support and which not to,
depending on the selected CC.CSS.
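
With this in place, the log can be inspected from a Linux guest, e.g.
with nvme-cli (illustrative invocation; 0x05 is the Commands Supported
and Effects log page):

    nvme get-log /dev/nvme0 --log-id=0x05 --log-len=4096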

Signed-off-by: Dmitry Fomichev 
Reviewed-by: Niklas Cassel 
---
 hw/block/nvme-ns.h|  1 +
 include/block/nvme.h  | 19 +
 hw/block/nvme.c   | 96 +++
 hw/block/trace-events |  1 +
 4 files changed, 108 insertions(+), 9 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 83734f4606..ea8c2f785d 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -29,6 +29,7 @@ typedef struct NvmeNamespace {
 int32_t  bootindex;
 int64_t  size;
 NvmeIdNs id_ns;
+const uint32_t *iocs;
 
 NvmeNamespaceParams params;
 } NvmeNamespace;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 8a46d9cf01..f62cc90d49 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -745,10 +745,27 @@ enum NvmeSmartWarn {
 NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
 };
 
+typedef struct NvmeEffectsLog {
+uint32_tacs[256];
+uint32_tiocs[256];
+uint8_t resv[2048];
+} NvmeEffectsLog;
+
+enum {
+NVME_CMD_EFF_CSUPP  = 1 << 0,
+NVME_CMD_EFF_LBCC   = 1 << 1,
+NVME_CMD_EFF_NCC= 1 << 2,
+NVME_CMD_EFF_NIC= 1 << 3,
+NVME_CMD_EFF_CCC= 1 << 4,
+NVME_CMD_EFF_CSE_MASK   = 3 << 16,
+NVME_CMD_EFF_UUID_SEL   = 1 << 19,
+};
+
 enum NvmeLogIdentifier {
 NVME_LOG_ERROR_INFO = 0x01,
 NVME_LOG_SMART_INFO = 0x02,
 NVME_LOG_FW_SLOT_INFO   = 0x03,
+NVME_LOG_CMD_EFFECTS= 0x05,
 };
 
 typedef struct QEMU_PACKED NvmePSD {
@@ -861,6 +878,7 @@ enum NvmeIdCtrlFrmw {
 
 enum NvmeIdCtrlLpa {
 NVME_LPA_NS_SMART = 1 << 0,
+NVME_LPA_CSE  = 1 << 1,
 NVME_LPA_EXTENDED = 1 << 2,
 };
 
@@ -1060,6 +1078,7 @@ static inline void _nvme_check_size(void)
 QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
 QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
+QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
 QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
 QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
 QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 065e763e4f..702f7cc2e3 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -111,6 +111,28 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
 [NVME_TIMESTAMP]= NVME_FEAT_CAP_CHANGE,
 };
 
+static const uint32_t nvme_cse_acs[256] = {
+[NVME_ADM_CMD_DELETE_SQ]= NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_CREATE_SQ]= NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_DELETE_CQ]= NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_CREATE_CQ]= NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_IDENTIFY] = NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_ABORT]= NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
+};
+
+static const uint32_t nvme_cse_iocs_none[256];
+
+static const uint32_t nvme_cse_iocs_nvm[256] = {
+[NVME_CMD_FLUSH]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+[NVME_CMD_WRITE]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+[NVME_CMD_READ] = NVME_CMD_EFF_CSUPP,
+};
+
 static void nvme_process_sq(void *opaque);
 
 static uint16_t nvme_cid(NvmeRequest *req)
@@ -1022,10 +1044,6 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),
   req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));
 
-if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {
-return NVME_INVALID_OPCODE | NVME_DNR;
-}
-
 if (!nvme_nsid_valid(n, nsid)) {
 return NVME_INVALID_NSID | NVME_DNR;
 }
@@ -1035,6 +1053,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
+if (!(req->ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
+trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
+return NVME_INVALID_OPCODE | NVME_DNR;
+}
+
 switch (req->cmd.opcode) {
 case NVME_CMD_FLUSH:
 return nvme_flush(n, req);
@@ -1044,8 +1067,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 case 

[PATCH v9 03/12] hw/block/nvme: Separate read and write handlers

2020-11-04 Thread Dmitry Fomichev
With ZNS support in place, the majority of code in nvme_rw() has
become read- or write-specific. Move these parts to two separate
handlers, nvme_read() and nvme_write(), to make the code more
readable and to remove the multiple is_write checks that so far
existed in the i/o path.

This is a refactoring patch, no change in functionality.

Signed-off-by: Dmitry Fomichev 
Reviewed-by: Niklas Cassel 
Acked-by: Klaus Jensen 
---
 hw/block/nvme.c   | 91 ++-
 hw/block/trace-events |  3 +-
 2 files changed, 67 insertions(+), 27 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ed3f38f01d..770e42a066 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -953,6 +953,54 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
 return NVME_NO_COMPLETE;
 }
 
+static uint16_t nvme_read(NvmeCtrl *n, NvmeRequest *req)
+{
+NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+NvmeNamespace *ns = req->ns;
+uint64_t slba = le64_to_cpu(rw->slba);
+uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
+uint64_t data_size = nvme_l2b(ns, nlb);
+uint64_t data_offset;
+BlockBackend *blk = ns->blkconf.blk;
+uint16_t status;
+
+trace_pci_nvme_read(nvme_cid(req), nvme_nsid(ns), nlb, data_size, slba);
+
+status = nvme_check_mdts(n, data_size);
+if (status) {
+trace_pci_nvme_err_mdts(nvme_cid(req), data_size);
+goto invalid;
+}
+
+status = nvme_check_bounds(n, ns, slba, nlb);
+if (status) {
+trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
+goto invalid;
+}
+
+status = nvme_map_dptr(n, data_size, req);
+if (status) {
+goto invalid;
+}
+
+data_offset = nvme_l2b(ns, slba);
+
+block_acct_start(blk_get_stats(blk), &req->acct, data_size,
+ BLOCK_ACCT_READ);
+if (req->qsg.sg) {
+req->aiocb = dma_blk_read(blk, &req->qsg, data_offset,
+  BDRV_SECTOR_SIZE, nvme_rw_cb, req);
+} else {
+req->aiocb = blk_aio_preadv(blk, data_offset, &req->iov, 0,
+nvme_rw_cb, req);
+}
+return NVME_NO_COMPLETE;
+
+invalid:
+block_acct_invalid(blk_get_stats(blk), BLOCK_ACCT_READ);
+return status | NVME_DNR;
+}
+
 static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
 {
NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
@@ -978,22 +1026,19 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
 return NVME_NO_COMPLETE;
 }
 
-static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_write(NvmeCtrl *n, NvmeRequest *req)
 {
NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
 NvmeNamespace *ns = req->ns;
-uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
 uint64_t slba = le64_to_cpu(rw->slba);
-
+uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
 uint64_t data_size = nvme_l2b(ns, nlb);
-uint64_t data_offset = nvme_l2b(ns, slba);
-enum BlockAcctType acct = req->cmd.opcode == NVME_CMD_WRITE ?
-BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
+uint64_t data_offset;
 BlockBackend *blk = ns->blkconf.blk;
 uint16_t status;
 
-trace_pci_nvme_rw(nvme_cid(req), nvme_io_opc_str(rw->opcode),
-  nvme_nsid(ns), nlb, data_size, slba);
+trace_pci_nvme_write(nvme_cid(req), nvme_io_opc_str(rw->opcode),
+ nvme_nsid(ns), nlb, data_size, slba);
 
 status = nvme_check_mdts(n, data_size);
 if (status) {
@@ -1012,29 +1057,22 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
 goto invalid;
 }
 
-block_acct_start(blk_get_stats(blk), >acct, data_size, acct);
+data_offset = nvme_l2b(ns, slba);
+
+block_acct_start(blk_get_stats(blk), >acct, data_size,
+ BLOCK_ACCT_WRITE);
 if (req->qsg.sg) {
-if (acct == BLOCK_ACCT_WRITE) {
-req->aiocb = dma_blk_write(blk, >qsg, data_offset,
-   BDRV_SECTOR_SIZE, nvme_rw_cb, req);
-} else {
-req->aiocb = dma_blk_read(blk, >qsg, data_offset,
-  BDRV_SECTOR_SIZE, nvme_rw_cb, req);
-}
+req->aiocb = dma_blk_write(blk, >qsg, data_offset,
+   BDRV_SECTOR_SIZE, nvme_rw_cb, req);
 } else {
-if (acct == BLOCK_ACCT_WRITE) {
-req->aiocb = blk_aio_pwritev(blk, data_offset, >iov, 0,
- nvme_rw_cb, req);
-} else {
-req->aiocb = blk_aio_preadv(blk, data_offset, &req->iov, 0,
-nvme_rw_cb, req);
-}
+req->aiocb = blk_aio_pwritev(blk, data_offset, &req->iov, 0,
+ nvme_rw_cb, req);
 }
 return NVME_NO_COMPLETE;
 
 invalid:
-block_acct_invalid(blk_get_stats(ns->blkconf.blk), acct);
-return status;
+block_acct_invalid(blk_get_stats(blk), BLOCK_ACCT_WRITE);
+return status | NVME_DNR;

[PATCH v9 00/12] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-11-04 Thread Dmitry Fomichev
v8 -> v9:

 - Move the modifications to "include/block/nvme.h" made to
   introduce ZNS-related definitions into a separate patch.

 - Add a new struct, NvmeZonedResult, along the same lines as the
   existing NvmeAerResult, to carry Zone Append LBA returned to
   the host. Now, there is no need to modify NvmeCqe struct except
   renaming DW1 field from "rsvd" to "dw1".

 - Add check for MDTS in Zone Management Receive handler.

 - Remove checks for ns->attached since the value of this flag
   is always true for now.

 - Rebase to the current qemu-nvme/nvme-next branch.

v7 -> v8:

 - Move refactoring commits to the front of the series.

 - Remove "attached" and "fill_pattern" device properties.

 - Only close open zones upon subsystem shutdown, not when CC.EN flag
   is set to 0. Avoid looping through all zones by iterating through
   lists of open and closed zones.

 - Improve bulk processing of zones aka zoned operations with "all"
   flag set. Avoid looping through the entire zone array for all zone
   operations except Offline Zone.

 - Prefix ZNS-related property names with "zoned.". The "zoned" Boolean
   property is retained to turn on zoned command set as it is much more
   intuitive and user-friendly compared to setting a magic number value
   to csi property.

 - Address review comments.

 - Remove unused trace events.

v6 -> v7:

 - Introduce ns->iocs initialization function earlier in the series,
   in CSE Log patch.

 - Set NVM iocs for zoned namespaces when CC.CSS is set to
   NVME_CC_CSS_NVM.

 - Clean up code in CSE log handler.
 
v5 -> v6:

 - Remove zoned state persistence code. Replace position-independent
   zone lists with QTAILQs.

 - Close all open zones upon clearing of the controller. This is
   a similar procedure to the one previously performed upon powering
   up with zone persistence. 

 - Squash NS Types and ZNS triplets of commits to keep definitions
   and trace event definitions together with the implementation code.

 - Move namespace UUID generation to a separate patch. Add the new
   "uuid" property as suggested by Klaus.

 - Rework Commands and Effects patch to make sure that the log is
   always in sync with the actual set of commands supported.

 - Add two refactoring commits at the end of the series to
   optimize read and write i/o path.

- Incorporate feedback from Keith, Klaus and Niklas:

  * fix rebase errors in nvme_identify_ns_descr_list()
  * remove unnecessary code from nvme_write_bar()
  * move csi to NvmeNamespace and use it from the beginning in NSTypes
patch
  * change zone read processing to cover all corner cases with RAZB=1
  * sync w_ptr and d.wp in case of a i/o error at the preceding zone
  * reword the commit message in active/inactive patch with the new
text from Niklas
  * correct dlfeat reporting depending on the fill pattern set
  * add more checks for "attached" n/s parameter to prevent i/o and
get/set features on inactive namespaces
  * Use DEFINE_PROP_SIZE and DEFINE_PROP_SIZE32 for zone size/capacity
and ZASL respectively
  * Improve zone size and capacity validation
  * Correctly report NSZE

v4 -> v5:

 - Rebase to the current qemu-nvme.

 - Use HostMemoryBackendFile as the backing storage for persistent
   zone metadata.

 - Fix the issue with filling the valid data in the next zone if RAZB
   is enabled.

v3 -> v4:

 - Fix bugs introduced in v2/v3 for QD > 1 operation. Now, all writes
   to a zone happen at the new write pointer variable, zone->w_ptr,
   that is advanced right after submitting the backend i/o. The existing
   zone->d.wp variable is updated upon the successful write completion
   and it is used for zone reporting. Some code has been split from
   nvme_finalize_zoned_write() function to a new function,
   nvme_advance_zone_wp().

 - Make the code compile under mingw. Switch to using QEMU API for
   mmap/msync, i.e. memory_region...(). Since mmap is not available in
   mingw (even though there is mman-win32 library available on Github),
   conditional compilation is added around these calls to avoid
   undefined symbols under mingw. A better fix would be to add stub
   functions to softmmu/memory.c for the case when CONFIG_POSIX is not
   defined, but such change is beyond the scope of this patchset and it
   can be made in a separate patch.

 - Correct permission mask used to open zone metadata file.

 - Fold "Define 64 bit cqe.result" patch into ZNS commit.

 - Use clz64/clz32 instead of defining nvme_ilog2() function.

 - Simplify rpt_empty_id_struct() code, move nvme_fill_data() back
   to ZNS patch.

 - Fix a power-on processing bug.

 - Rename NVME_CMD_ZONE_APND to NVME_CMD_ZONE_APPEND.

 - Make the list of review comments addressed in v2 of the series
   (see below).

v2 -> v3:

 - Moved nvme_fill_data() function to the NSTypes patch as it is
   now used there to output empty namespace identify structs.
 - Fixed typo in Maxim's email address.

v1 -> v2:

 - Rebased on top of qemu-nvme/next 

Re: [PULL 00/33] Block patches

2020-11-04 Thread Peter Maydell
On Wed, 4 Nov 2020 at 15:18, Stefan Hajnoczi  wrote:
>
> The following changes since commit 8507c9d5c9a62de2a0e281b640f995e26eac46af:
>
>   Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging 
> (2020-11-03 15:59:44 +)
>
> are available in the Git repository at:
>
>   https://gitlab.com/stefanha/qemu.git tags/block-pull-request
>
> for you to fetch changes up to fc107d86840b3364e922c26cf7631b7fd38ce523:
>
>   util/vfio-helpers: Assert offset is aligned to page size (2020-11-03 
> 19:06:23 +)
>
> 
> Pull request for 5.2
>
> NVMe fixes to solve IOMMU issues on non-x86 and error message/tracing
> improvements. Elena Afanasova's ioeventfd fixes are also included.
>
> Signed-off-by: Stefan Hajnoczi 
>


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/5.2
for any user-visible changes.

-- PMM



Re: [PATCH v2 2/2] iotests: rewrite iotest 240 in python

2020-11-04 Thread Maxim Levitsky
On Tue, 2020-11-03 at 13:53 +0100, Max Reitz wrote:
> On 01.11.20 17:15, Maxim Levitsky wrote:
> > The recent changes that brought RCU-delayed device deletion
> > broke a few tests, and this test breakage went unnoticed.
> > 
> > Fix this test by rewriting it in Python
> > (which allows waiting for DEVICE_DELETED events before continuing).
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  tests/qemu-iotests/240 | 228 -
> >  tests/qemu-iotests/240.out |  76 -
> >  2 files changed, 143 insertions(+), 161 deletions(-)
> > 
> > diff --git a/tests/qemu-iotests/240 b/tests/qemu-iotests/240
> > index 8b4337b58d..bfc9b72f36 100755
> > --- a/tests/qemu-iotests/240
> > +++ b/tests/qemu-iotests/240
> 
> [...]
> 
> > +class TestCase(iotests.QMPTestCase):
> > +test_driver = "null-co"
> > +
> > +def required_drivers(self):
> > +return [self.test_driver]
> > +
> > +@iotests.skip_if_unsupported(required_drivers)
> > +def setUp(self):
> > +self.vm = iotests.VM()
> > +self.vm.launch()
> 
> It seems to me like all tests create a null-co block device.  The only
> difference is that test1() creates an R/W node, whereas all others
> create an RO node.  I don’t think that matters though, so maybe we can
> replace code duplication by creating the (RO) null-co node here.
> 
> Furthermore, we could also create two I/O threads and two accompanying
> virtio-scsi devices here (tests that don’t use them shouldn’t care)...
I agree with that, the test can be improved a lot. I deliberately didn't do
this, since I wanted a 1:1 translation of the bash test.
After that, a refactoring or a rewrite of this test would be very welcome.

> 
> > +
> > +def tearDown(self):
> 
> ...and clean all of those up here.
> 
> > +self.vm.shutdown()
> > +
> > +def test1(self):
> > +iotests.log('==Unplug a SCSI disk and then plug it again==')
> > +self.vm.qmp_log('blockdev-add', driver='null-co', 
> > read_zeroes=True, node_name='hd0')
> > +self.vm.qmp_log('object-add', qom_type='iothread', id="iothread0")
> > +self.vm.qmp_log('device_add', id='scsi0', 
> > driver=iotests.get_virtio_scsi_device(), iothread='iothread0', filters = 
> > [iotests.filter_qmp_virtio_scsi])
> 
> A bit weird that you change your coding style for the @filters parameter
> (i.e., putting spaces around '='), and pylint thinks so, too.
It is a late change which I copied, so no wonder it is a bit different.
Fixed!

I would also like to move that code to a helper function later.

> 
> > +self.vm.qmp_log('device_add', id='scsi-hd0', driver='scsi-hd', 
> > drive='hd0')
> > +self.vm.qmp_log('device_del', id='scsi-hd0')
> > +self.vm.event_wait('DEVICE_DELETED')
> > +self.vm.qmp_log('device_add', id='scsi-hd0', driver='scsi-hd', 
> > drive='hd0')
> > +self.vm.qmp_log('device_del', id='scsi-hd0')
> > +self.vm.event_wait('DEVICE_DELETED')
> > +self.vm.qmp_log('device_del', id='scsi0')
> 
> I don’t think it makes a difference for this test, but perhaps it would
> be cleaner to wait for the DEVICE_DELETED event even for the virtio-scsi
> devices?

Actually this doesn't work, since the whole thing is started with the 'qtest'
accelerator and there is no real guest there.

In 'production', PCIe device removal only happens when you place the device
on a PCIe root port or on a PCI hotplug-enabled bridge. The port/bridge
signals the removal to the guest, which has to ACK it, and only then does
QEMU actually remove the device and send the DEVICE_DELETED event.

So I think this device_del is pretty much pointless, and the device will
never really be removed.
Again, I tried to just translate the test without making further changes,
but I will remove this.

> 
> > +self.vm.qmp_log('blockdev-del', node_name='hd0')
> > +
> > +def test2(self):
> > +iotests.log('==Attach two SCSI disks using the same block device 
> > and the same iothread==')
> > +self.vm.qmp_log('blockdev-add', driver='null-co', 
> > read_zeroes=True, node_name='hd0', read_only=True)
> > +self.vm.qmp_log('object-add', qom_type='iothread', id="iothread0")
> > +self.vm.qmp_log('device_add', id='scsi0', 
> > driver=iotests.get_virtio_scsi_device(), iothread='iothread0', filters = 
> > [iotests.filter_qmp_virtio_scsi])
> > +
> > +self.vm.qmp_log('device_add', id='scsi-hd0', driver='scsi-hd', 
> > drive='hd0')
> > +self.vm.qmp_log('device_add', id='scsi-hd1', driver='scsi-hd', 
> > drive='hd0')
> > +self.vm.qmp_log('device_del', id='scsi-hd1')
> > +self.vm.event_wait('DEVICE_DELETED')
> > +self.vm.qmp_log('device_del', id='scsi-hd0')
> > +self.vm.event_wait('DEVICE_DELETED')
> 
> The bash version deleted them the other way around, but I suppose it
> doesn’t matter.  (It just made me wonder why you changed the order.)
I just didn't notice it since it didn't seem to be 

[PULL v3 29/31] block/export: make vhost-user-blk config space little-endian

2020-11-04 Thread Michael S. Tsirkin
From: Stefan Hajnoczi 

VIRTIO 1.0 devices have little-endian configuration space. The
vhost-user-blk-server.c code already uses little-endian for virtqueue
processing but not for the configuration space fields. Fix this so the
vhost-user-blk export works on big-endian hosts.
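
To make the pitfall concrete, here is a small stand-alone sketch (not QEMU
code; the helper below stands in for QEMU's cpu_to_le32()):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Stand-in for cpu_to_le32(): always produce little-endian bytes. */
  static uint32_t my_cpu_to_le32(uint32_t v)
  {
      uint8_t b[4] = { (uint8_t)v, (uint8_t)(v >> 8),
                       (uint8_t)(v >> 16), (uint8_t)(v >> 24) };
      uint32_t le;

      memcpy(&le, b, sizeof(le));
      return le;
  }

  int main(void)
  {
      uint32_t blk_size = 4096;                 /* host-endian value */
      uint32_t wire = my_cpu_to_le32(blk_size); /* what VIRTIO 1.0 expects */
      uint8_t raw[4];

      memcpy(raw, &wire, sizeof(raw));
      /* Prints "00 10 00 00" on any host.  Storing the plain host value
       * instead would yield "00 00 10 00" on a big-endian host, which a
       * little-endian reader interprets as 0x00100000 (1 MiB), not 4096. */
      printf("%02x %02x %02x %02x\n", raw[0], raw[1], raw[2], raw[3]);
      return 0;
  }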

Signed-off-by: Stefan Hajnoczi 
Message-Id: <20201027173528.213464-4-stefa...@redhat.com>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 block/export/vhost-user-blk-server.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index 41f4933d6e..33cc0818b8 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -264,7 +264,6 @@ static uint64_t vu_blk_get_protocol_features(VuDev *dev)
 static int
 vu_blk_get_config(VuDev *vu_dev, uint8_t *config, uint32_t len)
 {
-/* TODO blkcfg must be little-endian for VIRTIO 1.0 */
 VuServer *server = container_of(vu_dev, VuServer, vu_dev);
 VuBlkExport *vexp = container_of(server, VuBlkExport, vu_server);
 memcpy(config, &vexp->blkcfg, len);
@@ -343,18 +342,18 @@ vu_blk_initialize_config(BlockDriverState *bs,
  uint32_t blk_size,
  uint16_t num_queues)
 {
-config->capacity = bdrv_getlength(bs) >> BDRV_SECTOR_BITS;
-config->blk_size = blk_size;
-config->size_max = 0;
-config->seg_max = 128 - 2;
-config->min_io_size = 1;
-config->opt_io_size = 1;
-config->num_queues = num_queues;
-config->max_discard_sectors = 32768;
-config->max_discard_seg = 1;
-config->discard_sector_alignment = config->blk_size >> 9;
-config->max_write_zeroes_sectors = 32768;
-config->max_write_zeroes_seg = 1;
+config->capacity = cpu_to_le64(bdrv_getlength(bs) >> BDRV_SECTOR_BITS);
+config->blk_size = cpu_to_le32(blk_size);
+config->size_max = cpu_to_le32(0);
+config->seg_max = cpu_to_le32(128 - 2);
+config->min_io_size = cpu_to_le16(1);
+config->opt_io_size = cpu_to_le32(1);
+config->num_queues = cpu_to_le16(num_queues);
+config->max_discard_sectors = cpu_to_le32(32768);
+config->max_discard_seg = cpu_to_le32(1);
+config->discard_sector_alignment = cpu_to_le32(config->blk_size >> 9);
+config->max_write_zeroes_sectors = cpu_to_le32(32768);
+config->max_write_zeroes_seg = cpu_to_le32(1);
 }
 
 static void vu_blk_exp_request_shutdown(BlockExport *exp)
-- 
MST




[PULL v3 28/31] configure: introduce --enable-vhost-user-blk-server

2020-11-04 Thread Michael S. Tsirkin
From: Stefan Hajnoczi 

Make it possible to compile out the vhost-user-blk server. It is enabled
by default on Linux.

Note that vhost-user-server.c depends on libvhost-user, which requires
CONFIG_LINUX. The CONFIG_VHOST_USER dependency was erroneous since that
option controls vhost-user frontends (previously known as "master") and
not device backends (previously known as "slave").

Signed-off-by: Stefan Hajnoczi 
Message-Id: <20201027173528.213464-3-stefa...@redhat.com>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 configure| 15 +++
 block/export/export.c|  4 ++--
 block/export/meson.build |  2 +-
 util/meson.build |  2 +-
 4 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/configure b/configure
index 2c3c69f118..b5e8f5f72c 100755
--- a/configure
+++ b/configure
@@ -329,6 +329,7 @@ vhost_crypto=""
 vhost_scsi=""
 vhost_vsock=""
 vhost_user=""
+vhost_user_blk_server=""
 vhost_user_fs=""
 kvm="auto"
 hax="auto"
@@ -1246,6 +1247,10 @@ for opt do
   ;;
   --enable-vhost-vsock) vhost_vsock="yes"
   ;;
+  --disable-vhost-user-blk-server) vhost_user_blk_server="no"
+  ;;
+  --enable-vhost-user-blk-server) vhost_user_blk_server="yes"
+  ;;
   --disable-vhost-user-fs) vhost_user_fs="no"
   ;;
   --enable-vhost-user-fs) vhost_user_fs="yes"
@@ -1791,6 +1796,7 @@ disabled with --disable-FEATURE, default is enabled if available:
   vhost-crypto    vhost-user-crypto backend support
   vhost-kernel    vhost kernel backend support
   vhost-user      vhost-user backend support
+  vhost-user-blk-server    vhost-user-blk server support
   vhost-vdpa      vhost-vdpa kernel backend support
   spice           spice
   rbd             rados block device (rbd)
@@ -2382,6 +2388,12 @@ if test "$vhost_net" = ""; then
   test "$vhost_kernel" = "yes" && vhost_net=yes
 fi
 
+# libvhost-user is Linux-only
+test "$vhost_user_blk_server" = "" && vhost_user_blk_server=$linux
+if test "$vhost_user_blk_server" = "yes" && test "$linux" = "no"; then
+  error_exit "--enable-vhost-user-blk-server is only available on Linux"
+fi
+
 ##
 # pkg-config probe
 
@@ -6275,6 +6287,9 @@ fi
 if test "$vhost_vdpa" = "yes" ; then
   echo "CONFIG_VHOST_VDPA=y" >> $config_host_mak
 fi
+if test "$vhost_user_blk_server" = "yes" ; then
+  echo "CONFIG_VHOST_USER_BLK_SERVER=y" >> $config_host_mak
+fi
 if test "$vhost_user_fs" = "yes" ; then
   echo "CONFIG_VHOST_USER_FS=y" >> $config_host_mak
 fi
diff --git a/block/export/export.c b/block/export/export.c
index c3478c6c97..bad6f21b1c 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -22,13 +22,13 @@
 #include "qapi/qapi-commands-block-export.h"
 #include "qapi/qapi-events-block-export.h"
 #include "qemu/id.h"
-#if defined(CONFIG_LINUX) && defined(CONFIG_VHOST_USER)
+#ifdef CONFIG_VHOST_USER_BLK_SERVER
 #include "vhost-user-blk-server.h"
 #endif
 
 static const BlockExportDriver *blk_exp_drivers[] = {
 &blk_exp_nbd,
-#if defined(CONFIG_LINUX) && defined(CONFIG_VHOST_USER)
+#ifdef CONFIG_VHOST_USER_BLK_SERVER
 &blk_exp_vhost_user_blk,
 #endif
 };
diff --git a/block/export/meson.build b/block/export/meson.build
index 9fb4fbf81d..19526435d8 100644
--- a/block/export/meson.build
+++ b/block/export/meson.build
@@ -1,2 +1,2 @@
 blockdev_ss.add(files('export.c'))
-blockdev_ss.add(when: ['CONFIG_LINUX', 'CONFIG_VHOST_USER'], if_true: files('vhost-user-blk-server.c'))
+blockdev_ss.add(when: 'CONFIG_VHOST_USER_BLK_SERVER', if_true: files('vhost-user-blk-server.c'))
diff --git a/util/meson.build b/util/meson.build
index c5159ad79d..f359af0d46 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -66,7 +66,7 @@ if have_block
   util_ss.add(files('main-loop.c'))
   util_ss.add(files('nvdimm-utils.c'))
  util_ss.add(files('qemu-coroutine.c', 'qemu-coroutine-lock.c', 'qemu-coroutine-io.c'))
-  util_ss.add(when: ['CONFIG_LINUX', 'CONFIG_VHOST_USER'], if_true: [
+  util_ss.add(when: 'CONFIG_LINUX', if_true: [
 files('vhost-user-server.c'), vhost_user
   ])
   util_ss.add(files('block-helpers.c'))
-- 
MST




[PATCH v3 2/2] iotests: rewrite iotest 240 in python

2020-11-04 Thread Maxim Levitsky
The recent changes that brought RCU-delayed device deletion
broke a few tests, and this test breakage went unnoticed.

Fix this test by rewriting it in Python
(which allows waiting for DEVICE_DELETED events before continuing).

Signed-off-by: Maxim Levitsky 
Tested-by: Christian Borntraeger 
Reviewed-by: Paolo Bonzini 
---
 tests/qemu-iotests/240 | 219 +++--
 tests/qemu-iotests/240.out |  76 +++--
 2 files changed, 130 insertions(+), 165 deletions(-)

diff --git a/tests/qemu-iotests/240 b/tests/qemu-iotests/240
index 8b4337b58d..c0f71f0461 100755
--- a/tests/qemu-iotests/240
+++ b/tests/qemu-iotests/240
@@ -1,5 +1,5 @@
-#!/usr/bin/env bash
-#
+#!/usr/bin/env python3
+
 # Test hot plugging and unplugging with iothreads
 #
 # Copyright (C) 2019 Igalia, S.L.
@@ -17,133 +17,90 @@
 #
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
-#
 
-# creator
-owner=be...@igalia.com
-
-seq=`basename $0`
-echo "QA output created by $seq"
-
-status=1   # failure is the default!
-
-_cleanup()
-{
-rm -f "$SOCK_DIR/nbd"
-}
-trap "_cleanup; exit \$status" 0 1 2 3 15
-
-# get standard environment, filters and checks
-. ./common.rc
-. ./common.filter
-
-_supported_fmt generic
-_supported_proto generic
-
-do_run_qemu()
-{
-echo Testing: "$@"
-$QEMU -nographic -qmp stdio -serial none "$@"
-echo
-}
-
-# Remove QMP events from (pretty-printed) output. Doesn't handle
-# nested dicts correctly, but we don't get any of those in this test.
-_filter_qmp_events()
-{
-tr '\n' '\t' | sed -e \
-   's/{\s*"timestamp":\s*{[^}]*},\s*"event":[^,}]*\(,\s*"data":\s*{[^}]*}\)\?\s*}\s*//g' \
-   | tr '\t' '\n'
-}
-
-run_qemu()
-{
-do_run_qemu "$@" 2>&1 | _filter_qmp | _filter_qmp_events
-}
-
-case "$QEMU_DEFAULT_MACHINE" in
-  s390-ccw-virtio)
-  virtio_scsi=virtio-scsi-ccw
-  ;;
-  *)
-  virtio_scsi=virtio-scsi-pci
-  ;;
-esac
-
-echo
-echo === Unplug a SCSI disk and then plug it again ===
-echo
-
-run_qemu <

[PATCH v3 1/2] iotests: add filter_qmp_virtio_scsi function

2020-11-04 Thread Maxim Levitsky
filter_qmp_virtio_scsi can be used to filter virtio-scsi-pci/ccw differences.
Note that this patch was only tested on x86.

Suggested-by: Paolo Bonzini 
Signed-off-by: Maxim Levitsky 
Tested-by: Christian Borntraeger 
Reviewed-by: Paolo Bonzini 
---
 tests/qemu-iotests/iotests.py | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 814804a4c6..bcd4fe5b6f 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -392,6 +392,16 @@ def filter_qmp_testfiles(qmsg):
 return value
 return filter_qmp(qmsg, _filter)
 
+def filter_virtio_scsi(output: str) -> str:
+return re.sub(r'(virtio-scsi)-(ccw|pci)', r'\1', output)
+
+def filter_qmp_virtio_scsi(qmsg):
+def _filter(_key, value):
+if is_str(value):
+return filter_virtio_scsi(value)
+return value
+return filter_qmp(qmsg, _filter)
+
 def filter_generated_node_ids(msg):
 return re.sub("#block[0-9]+", "NODE_NAME", msg)
 
-- 
2.26.2




[PULL v3 30/31] block/export: fix vhost-user-blk get_config() information leak

2020-11-04 Thread Michael S. Tsirkin
From: Stefan Hajnoczi 

Refuse get_config() requests in excess of sizeof(struct virtio_blk_config).

Signed-off-by: Stefan Hajnoczi 
Message-Id: <20201027173528.213464-5-stefa...@redhat.com>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 block/export/vhost-user-blk-server.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index 33cc0818b8..62672d1cb9 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -266,6 +266,9 @@ vu_blk_get_config(VuDev *vu_dev, uint8_t *config, uint32_t len)
 {
 VuServer *server = container_of(vu_dev, VuServer, vu_dev);
 VuBlkExport *vexp = container_of(server, VuBlkExport, vu_server);
+
+g_return_val_if_fail(len <= sizeof(struct virtio_blk_config), -1);
+
 memcpy(config, &vexp->blkcfg, len);
 return 0;
 }
-- 
MST




[PATCH v3 0/2] Assorted fixes to tests that were broken by recent scsi changes

2020-11-04 Thread Maxim Levitsky
While most of the patches in V1 of this series are already merged upstream,
the patch that fixes iotest 240 was broken on s390 and was not accepted.

This is an updated version of this patch, based on Paolo's suggestion,
that hopefully makes this iotest work on both x86 and s390.

V3: addressed review feedback

Best regards,
Maxim Levitsky

Maxim Levitsky (2):
  iotests: add filter_qmp_virtio_scsi function
  iotests: rewrite iotest 240 in python

 tests/qemu-iotests/240| 219 ++
 tests/qemu-iotests/240.out|  76 ++--
 tests/qemu-iotests/iotests.py |  10 ++
 3 files changed, 140 insertions(+), 165 deletions(-)

-- 
2.26.2





[PULL v3 25/31] Revert "vhost-blk: set features before setting inflight feature"

2020-11-04 Thread Michael S. Tsirkin
From: Stefan Hajnoczi 

This reverts commit adb29c027341ba095a3ef4beef6aaef86d3a520e.

The commit broke -device vhost-user-blk-pci because the
vhost_dev_prepare_inflight() function it introduced segfaults in
vhost_dev_set_features() when attempting to access struct vhost_dev's
vdev pointer before it has been assigned.

To reproduce the segfault simply launch a vhost-user-blk device with the
contrib vhost-user-blk device backend:

  $ build/contrib/vhost-user-blk/vhost-user-blk -s /tmp/vhost-user-blk.sock -r -b /var/tmp/foo.img
  $ build/qemu-system-x86_64 \
-device vhost-user-blk-pci,id=drv0,chardev=char1,addr=4.0 \
-object memory-backend-memfd,id=mem,size=1G,share=on \
-M memory-backend=mem,accel=kvm \
-chardev socket,id=char1,path=/tmp/vhost-user-blk.sock
  Segmentation fault (core dumped)

Cc: Jin Yu 
Cc: Raphael Norwitz 
Cc: Michael S. Tsirkin 
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20201102165709.232180-1-stefa...@redhat.com>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 include/hw/virtio/vhost.h |  1 -
 hw/block/vhost-user-blk.c |  6 --
 hw/virtio/vhost.c | 18 --
 3 files changed, 25 deletions(-)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 839bfb153c..94585067f7 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -141,7 +141,6 @@ void vhost_dev_reset_inflight(struct vhost_inflight *inflight);
 void vhost_dev_free_inflight(struct vhost_inflight *inflight);
 void vhost_dev_save_inflight(struct vhost_inflight *inflight, QEMUFile *f);
 int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f);
-int vhost_dev_prepare_inflight(struct vhost_dev *hdev);
 int vhost_dev_set_inflight(struct vhost_dev *dev,
struct vhost_inflight *inflight);
 int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index f67b29bbf3..a076b1e54d 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -131,12 +131,6 @@ static int vhost_user_blk_start(VirtIODevice *vdev)
 
 s->dev.acked_features = vdev->guest_features;
 
-ret = vhost_dev_prepare_inflight(&s->dev);
-if (ret < 0) {
-error_report("Error set inflight format: %d", -ret);
-goto err_guest_notifiers;
-}
-
 if (!s->inflight->addr) {
 ret = vhost_dev_get_inflight(&s->dev, s->queue_size, s->inflight);
 if (ret < 0) {
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index f2482378c6..79b2be20df 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1645,24 +1645,6 @@ int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f)
 return 0;
 }
 
-int vhost_dev_prepare_inflight(struct vhost_dev *hdev)
-{
-int r;
- 
-if (hdev->vhost_ops->vhost_get_inflight_fd == NULL ||
-hdev->vhost_ops->vhost_set_inflight_fd == NULL) {
-return 0;
-}
- 
-r = vhost_dev_set_features(hdev, hdev->log_enabled);
-if (r < 0) {
-VHOST_OPS_DEBUG("vhost_dev_prepare_inflight failed");
-return r;
-}
-
-return 0;
-}
-
 int vhost_dev_set_inflight(struct vhost_dev *dev,
struct vhost_inflight *inflight)
 {
-- 
MST




[PULL v3 26/31] vhost-blk: set features before setting inflight feature

2020-11-04 Thread Michael S. Tsirkin
From: Jin Yu 

A virtqueue can use either the split or the packed format, so before
setting inflight you need to inform the back-end of the virtqueue format.

Signed-off-by: Jin Yu 
Acked-by: Raphael Norwitz 
Message-Id: <20201103123617.28256-1-jin...@intel.com>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 include/hw/virtio/vhost.h |  1 +
 hw/block/vhost-user-blk.c |  6 ++
 hw/virtio/vhost.c | 20 
 3 files changed, 27 insertions(+)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 94585067f7..4a8bc75415 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -141,6 +141,7 @@ void vhost_dev_reset_inflight(struct vhost_inflight *inflight);
 void vhost_dev_free_inflight(struct vhost_inflight *inflight);
 void vhost_dev_save_inflight(struct vhost_inflight *inflight, QEMUFile *f);
 int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f);
+int vhost_dev_prepare_inflight(struct vhost_dev *hdev, VirtIODevice *vdev);
 int vhost_dev_set_inflight(struct vhost_dev *dev,
struct vhost_inflight *inflight);
 int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index a076b1e54d..2dd3d93ca0 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -131,6 +131,12 @@ static int vhost_user_blk_start(VirtIODevice *vdev)
 
 s->dev.acked_features = vdev->guest_features;
 
+ret = vhost_dev_prepare_inflight(&s->dev, vdev);
+if (ret < 0) {
+error_report("Error set inflight format: %d", -ret);
+goto err_guest_notifiers;
+}
+
 if (!s->inflight->addr) {
 ret = vhost_dev_get_inflight(&s->dev, s->queue_size, s->inflight);
 if (ret < 0) {
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 79b2be20df..614ccc2bcb 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1645,6 +1645,26 @@ int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f)
 return 0;
 }
 
+int vhost_dev_prepare_inflight(struct vhost_dev *hdev, VirtIODevice *vdev)
+{
+int r;
+
+if (hdev->vhost_ops->vhost_get_inflight_fd == NULL ||
+hdev->vhost_ops->vhost_set_inflight_fd == NULL) {
+return 0;
+}
+
+hdev->vdev = vdev;
+
+r = vhost_dev_set_features(hdev, hdev->log_enabled);
+if (r < 0) {
+VHOST_OPS_DEBUG("vhost_dev_prepare_inflight failed");
+return r;
+}
+
+return 0;
+}
+
 int vhost_dev_set_inflight(struct vhost_dev *dev,
struct vhost_inflight *inflight)
 {
-- 
MST




Re: [PATCH 0/5] SCSI: fix transfer limits for SCSI passthrough

2020-11-04 Thread Maxim Levitsky
On Wed, 2020-11-04 at 09:48 -0800, no-re...@patchew.org wrote:
> Patchew URL: 
> https://patchew.org/QEMU/20201104173217.417538-1-mlevi...@redhat.com/
> 
> 
> 
> Hi,
> 
> This series seems to have some coding style problems. See output below for
> more information:
> 
> Type: series
> Message-id: 20201104173217.417538-1-mlevi...@redhat.com
> Subject: [PATCH 0/5] SCSI: fix transfer limits for SCSI passthrough
> 
> === TEST SCRIPT BEGIN ===
> #!/bin/bash
> git rev-parse base > /dev/null || exit 0
> git config --local diff.renamelimit 0
> git config --local diff.renames True
> git config --local diff.algorithm histogram
> ./scripts/checkpatch.pl --mailback base..
> === TEST SCRIPT END ===
> 
> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> From https://github.com/patchew-project/qemu
>b1266b6..3c8c36c  master -> master
>  - [tag update]  patchew/20201104160021.2342108-1-ehabk...@redhat.com -> 
> patchew/20201104160021.2342108-1-ehabk...@redhat.com
>  * [new tag] patchew/20201104173217.417538-1-mlevi...@redhat.com -> 
> patchew/20201104173217.417538-1-mlevi...@redhat.com
> Switched to a new branch 'test'
> bde6371 block/scsi: correctly emulate the VPD block limits page
> c4180d6 block: use blk_get_max_ioctl_transfer for SCSI passthrough
> 9ff7edc block: add max_ioctl_transfer to BlockLimits
> dd2f1f7 file-posix: add sg_get_max_segments that actually works with sg
> f9ad940 file-posix: split hdev_refresh_limits from raw_refresh_limits
> 
> === OUTPUT BEGIN ===
> 1/5 Checking commit f9ad9400e011 (file-posix: split hdev_refresh_limits from 
> raw_refresh_limits)
> 2/5 Checking commit dd2f1f77a5d2 (file-posix: add sg_get_max_segments that 
> actually works with sg)
> 3/5 Checking commit 9ff7edc31002 (block: add max_ioctl_transfer to 
> BlockLimits)
> 4/5 Checking commit c4180d6accff (block: use blk_get_max_ioctl_transfer for 
> SCSI passthrough)
> 5/5 Checking commit bde637139536 (block/scsi: correctly emulate the VPD block 
> limits page)
> ERROR: braces {} are necessary for all arms of this statement
> #51: FILE: hw/scsi/scsi-generic.c:196:
> +if (page_idx >= r->buflen)
Sorry about that. I triple-checked this code for correctness,
but didn't run checkpatch on the last revision :-(
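
For the record, the fix checkpatch asks for is just braces around the
single-statement arm, i.e. something like:

  if (page_idx >= r->buflen) {
      return len;
  }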

Best regards,
Maxim Levitsky

> [...]
> 
> total: 1 errors, 0 warnings, 53 lines checked
> 
> Patch 5/5 has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
> 
> === OUTPUT END ===
> 
> Test command exited with code: 1
> 
> 
> The full log is available at
> http://patchew.org/logs/20201104173217.417538-1-mlevi...@redhat.com/testing.checkpatch/?type=message.
> ---
> Email generated automatically by Patchew [https://patchew.org/].
> Please send your feedback to patchew-de...@redhat.com





[PATCH] quorum: Implement bdrv_co_block_status()

2020-11-04 Thread Alberto Garcia
The quorum driver does not implement bdrv_co_block_status() and
because of that it always reports to contain data even if all its
children are known to be empty.

One consequence of this is that if we, for example, create a quorum with
a size of 10 GB and mirror it to a new image, the operation will
write 10 GB of actual zeroes to the destination image, wasting a lot of
time and disk space.

Since a quorum has an arbitrary number of children of potentially
different formats there is no way to report all possible allocation
status flags in a way that makes sense, so this implementation only
reports when a given region is known to contain zeroes
(BDRV_BLOCK_ZERO) or not (BDRV_BLOCK_DATA).

If all children agree that a region contains zeroes then we can return
BDRV_BLOCK_ZERO using the smallest size reported by the children
(because all agree that a region of at least that size contains
zeroes).

If at least one child disagrees we have to return BDRV_BLOCK_DATA.
In this case we use the largest of the sizes reported by the children
that didn't return BDRV_BLOCK_ZERO (because we know that there won't
be an agreement for at least that size).
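
As a stand-alone illustration of the aggregation rule (the child reports
below are hypothetical, and the MIN/MAX helpers are hand-rolled rather than
taken from QEMU):

  #include <stdint.h>
  #include <stdio.h>

  #define MIN(a, b) ((a) < (b) ? (a) : (b))
  #define MAX(a, b) ((a) > (b) ? (a) : (b))

  /* Per-child report: does the region read as zeroes, and for how
   * many bytes does the child vouch for that answer? */
  struct report { int zero; int64_t bytes; };

  int main(void)
  {
      /* Three children disagreeing about a 1 MiB region. */
      struct report r[] = {
          { 1, 1048576 },  /* zeroes for the whole megabyte */
          { 1,  524288 },  /* zeroes, but only for 512 KiB  */
          { 0,  131072 },  /* data for the first 128 KiB    */
      };
      int64_t count = 1048576;
      int64_t pnum_zero = count, pnum_data = 0;
      int i;

      for (i = 0; i < 3; i++) {
          if (r[i].zero) {
              pnum_zero = MIN(pnum_zero, r[i].bytes);
          } else {
              pnum_data = MAX(pnum_data, r[i].bytes);
          }
      }
      /* One child reported data, so the quorum must report DATA, and
       * it can do so for the largest non-zero extent: 131072 bytes. */
      if (pnum_data) {
          printf("BDRV_BLOCK_DATA for %lld bytes\n", (long long)pnum_data);
      } else {
          printf("BDRV_BLOCK_ZERO for %lld bytes\n", (long long)pnum_zero);
      }
      return 0;
  }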

Signed-off-by: Alberto Garcia 
---
 block/quorum.c |  49 
 tests/qemu-iotests/312 | 148 +
 tests/qemu-iotests/312.out |  67 +
 tests/qemu-iotests/group   |   1 +
 4 files changed, 265 insertions(+)
 create mode 100755 tests/qemu-iotests/312
 create mode 100644 tests/qemu-iotests/312.out

diff --git a/block/quorum.c b/block/quorum.c
index e846a7e892..29cee42705 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -18,6 +18,7 @@
 #include "qemu/module.h"
 #include "qemu/option.h"
 #include "block/block_int.h"
+#include "block/coroutines.h"
 #include "block/qdict.h"
 #include "qapi/error.h"
 #include "qapi/qapi-events-block.h"
@@ -1174,6 +1175,53 @@ static void quorum_child_perm(BlockDriverState *bs, BdrvChild *c,
  | DEFAULT_PERM_UNCHANGED;
 }
 
+/*
+ * Each one of the children can report different status flags even
+ * when they contain the same data, so what this function does is
+ * return BDRV_BLOCK_ZERO if *all* children agree that a certain
+ * region contains zeroes, and BDRV_BLOCK_DATA otherwise.
+ */
+static int coroutine_fn quorum_co_block_status(BlockDriverState *bs,
+   bool want_zero,
+   int64_t offset, int64_t count,
+   int64_t *pnum, int64_t *map,
+   BlockDriverState **file)
+{
+BDRVQuorumState *s = bs->opaque;
+int i, ret;
+int64_t pnum_zero = count;
+int64_t pnum_data = 0;
+
+for (i = 0; i < s->num_children; i++) {
+int64_t bytes;
+ret = bdrv_co_common_block_status_above(s->children[i]->bs, NULL, false,
+want_zero, offset, count,
+&bytes, NULL, NULL, NULL);
+if (ret < 0) {
+return ret;
+}
+/*
+ * Even if all children agree about whether there are zeroes
+ * or not at @offset they might disagree on the size, so use
+ * the smallest when reporting BDRV_BLOCK_ZERO and the largest
+ * when reporting BDRV_BLOCK_DATA.
+ */
+if (ret & BDRV_BLOCK_ZERO) {
+pnum_zero = MIN(pnum_zero, bytes);
+} else {
+pnum_data = MAX(pnum_data, bytes);
+}
+}
+
+if (pnum_data) {
+*pnum = pnum_data;
+return BDRV_BLOCK_DATA;
+} else {
+*pnum = pnum_zero;
+return BDRV_BLOCK_ZERO;
+}
+}
+
 static const char *const quorum_strong_runtime_opts[] = {
 QUORUM_OPT_VOTE_THRESHOLD,
 QUORUM_OPT_BLKVERIFY,
@@ -1192,6 +1240,7 @@ static BlockDriver bdrv_quorum = {
 .bdrv_close = quorum_close,
 .bdrv_gather_child_options  = quorum_gather_child_options,
 .bdrv_dirname   = quorum_dirname,
+.bdrv_co_block_status   = quorum_co_block_status,
 
 .bdrv_co_flush_to_disk  = quorum_co_flush,
 
diff --git a/tests/qemu-iotests/312 b/tests/qemu-iotests/312
new file mode 100755
index 00..1b08f1552f
--- /dev/null
+++ b/tests/qemu-iotests/312
@@ -0,0 +1,148 @@
+#!/usr/bin/env bash
+#
+# Test drive-mirror with quorum
+#
+# The goal of this test is to check how the quorum driver reports
+# regions that are known to read as zeroes (BDRV_BLOCK_ZERO). The idea
+# is that drive-mirror will try the efficient representation of zeroes
+# in the destination image instead of writing actual zeroes.
+#
+# Copyright (C) 2020 Igalia, S.L.
+# Author: Alberto Garcia 
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the 

Re: Libvirt driver iothread property for virtio-scsi disks

2020-11-04 Thread Nir Soffer
On Wed, Nov 4, 2020 at 6:54 PM Daniel P. Berrangé  wrote:
>
> On Wed, Nov 04, 2020 at 05:48:40PM +0200, Nir Soffer wrote:
> > The docs[1] say:
> >
> > - The optional iothread attribute assigns the disk to an IOThread as 
> > defined by
> >   the range for the domain iothreads value. Multiple disks may be assigned 
> > to
> >   the same IOThread and are numbered from 1 to the domain iothreads value.
> >   Available for a disk device target configured to use "virtio" bus and 
> > "pci"
> >   or "ccw" address types. Since 1.2.8 (QEMU 2.1)
> >
> > Does it mean that virtio-scsi disks do not use iothreads?
> >
> > I'm experiencing a horrible performance using nested vms (up to 2 levels of
> > nesting) when accessing NFS storage running on one of the VMs. The NFS
> > server is using scsi disk.
>
> When you say  2 levels of nesting do you definitely have KVM enabled at
> all levels, or are you ending up using TCG emulation, because the latter
> would certainly explain terrible performance.

Good point, I'll check that out, thanks.

> > My theory is:
> > - Writing to NFS server is very slow (too much nesting, slow disk)
> > - Not using iothreads (because we don't use virtio?)
> > - Guest CPU is blocked by slow I/O
>
> Regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>




Re: Libvirt driver iothread property for virtio-scsi disks

2020-11-04 Thread Nir Soffer
On Wed, Nov 4, 2020 at 6:42 PM Sergio Lopez  wrote:
>
> On Wed, Nov 04, 2020 at 05:48:40PM +0200, Nir Soffer wrote:
> > The docs[1] say:
> >
> > - The optional iothread attribute assigns the disk to an IOThread as 
> > defined by
> >   the range for the domain iothreads value. Multiple disks may be assigned 
> > to
> >   the same IOThread and are numbered from 1 to the domain iothreads value.
> >   Available for a disk device target configured to use "virtio" bus and 
> > "pci"
> >   or "ccw" address types. Since 1.2.8 (QEMU 2.1)
> >
> > Does it mean that virtio-scsi disks do not use iothreads?
>
> virtio-scsi disks can use iothreads, but they are configured in the
> scsi controller, not in the disk itself. All disks attached to the
> same controller will share the same iothread, but you can also attach
> multiple controllers.

Thanks, I found that we do configure this in oVirt: the iothread is set on
the virtio-scsi controller element of the domain XML.

However the VMs in this setup are not created by oVirt, but manually using
libvirt. I'll make sure we configure the controller in the same way.

> > I'm experiencing a horrible performance using nested vms (up to 2 levels of
> > nesting) when accessing NFS storage running on one of the VMs. The NFS
> > server is using scsi disk.
> >
> > My theory is:
> > - Writing to NFS server is very slow (too much nesting, slow disk)
> > - Not using iothreads (because we don't use virtio?)
> > - Guest CPU is blocked by slow I/O
>
> I would discard the lack of iothreads as the culprit. They do improve
> the performance, but without them the performance should be quite
> decent anyway. Probably something else is causing the trouble.
>
> I would do a step by step analysis, testing the NFS performance from
> outside the VM first, and then elaborating upwards from that.

Makes sense, thanks.




Re: [PATCH 0/5] SCSI: fix transfer limits for SCSI passthrough

2020-11-04 Thread no-reply
Patchew URL: 
https://patchew.org/QEMU/20201104173217.417538-1-mlevi...@redhat.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20201104173217.417538-1-mlevi...@redhat.com
Subject: [PATCH 0/5] SCSI: fix transfer limits for SCSI passthrough

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
   b1266b6..3c8c36c  master -> master
 - [tag update]  patchew/20201104160021.2342108-1-ehabk...@redhat.com -> 
patchew/20201104160021.2342108-1-ehabk...@redhat.com
 * [new tag] patchew/20201104173217.417538-1-mlevi...@redhat.com -> 
patchew/20201104173217.417538-1-mlevi...@redhat.com
Switched to a new branch 'test'
bde6371 block/scsi: correctly emulate the VPD block limits page
c4180d6 block: use blk_get_max_ioctl_transfer for SCSI passthrough
9ff7edc block: add max_ioctl_transfer to BlockLimits
dd2f1f7 file-posix: add sg_get_max_segments that actually works with sg
f9ad940 file-posix: split hdev_refresh_limits from raw_refresh_limits

=== OUTPUT BEGIN ===
1/5 Checking commit f9ad9400e011 (file-posix: split hdev_refresh_limits from 
raw_refresh_limits)
2/5 Checking commit dd2f1f77a5d2 (file-posix: add sg_get_max_segments that 
actually works with sg)
3/5 Checking commit 9ff7edc31002 (block: add max_ioctl_transfer to BlockLimits)
4/5 Checking commit c4180d6accff (block: use blk_get_max_ioctl_transfer for 
SCSI passthrough)
5/5 Checking commit bde637139536 (block/scsi: correctly emulate the VPD block 
limits page)
ERROR: braces {} are necessary for all arms of this statement
#51: FILE: hw/scsi/scsi-generic.c:196:
+if (page_idx >= r->buflen)
[...]

total: 1 errors, 0 warnings, 53 lines checked

Patch 5/5 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20201104173217.417538-1-mlevi...@redhat.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

Re: [PATCH v2 17/20] backup: move to block-copy

2020-11-04 Thread Max Reitz
On 26.10.20 16:18, Vladimir Sementsov-Ogievskiy wrote:
> 23.07.2020 12:47, Max Reitz wrote:
>>> +static void coroutine_fn backup_set_speed(BlockJob *job, int64_t speed)
>>> +{
>>> +    BackupBlockJob *s = container_of(job, BackupBlockJob, common);
>>> +
>>> +    if (s->bcs) {
>>> +    /* In block_job_create we yet don't have bcs */
>> Shouldn’t hurt to make it conditional, but how can we get here in
>> block_job_create()?
>>
> 
> block_job_set_speed is called from block_job_create.

Ah, right.

Max




[PATCH 4/5] block: use blk_get_max_ioctl_transfer for SCSI passthrough

2020-11-04 Thread Maxim Levitsky
Switch file-posix to expose only the max_ioctl_transfer limit.

Let the iscsi driver work as it did before, since it is bound by the transfer
limit both for regular read/write and in the SCSI passthrough case.

Switch the scsi-disk and scsi-block drivers to read the SG max transfer limits
using the new blk_get_max_ioctl_transfer interface.


Fixes: 867eccfed8 ("file-posix: Use max transfer length/segment count only for SCSI passthrough")
Signed-off-by: Maxim Levitsky 
---
 block/file-posix.c | 4 ++--
 block/iscsi.c  | 1 +
 hw/scsi/scsi-generic.c | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index c4df757504..edba8fc86d 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1286,12 +1286,12 @@ static void hdev_refresh_limits(BlockDriverState *bs, Error **errp)
get_max_transfer_length(s->fd);
 
 if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
-bs->bl.max_transfer = pow2floor(ret);
+bs->bl.max_ioctl_transfer = pow2floor(ret);
 }
 
 ret = bs->sg ? sg_get_max_segments(s->fd) : get_max_segments(s->fd);
 if (ret > 0) {
-bs->bl.max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+bs->bl.max_ioctl_transfer = MIN_NON_ZERO(bs->bl.max_ioctl_transfer,
ret * qemu_real_host_page_size);
 }
 
diff --git a/block/iscsi.c b/block/iscsi.c
index e30a7e3606..3685da2971 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -2065,6 +2065,7 @@ static void iscsi_refresh_limits(BlockDriverState *bs, Error **errp)
 
 if (max_xfer_len * block_size < INT_MAX) {
 bs->bl.max_transfer = max_xfer_len * iscsilun->block_size;
+bs->bl.max_ioctl_transfer = bs->bl.max_transfer;
 }
 
 if (iscsilun->lbp.lbpu) {
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 2cb23ca891..6df67bf889 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -167,7 +167,7 @@ static void scsi_handle_inquiry_reply(SCSIGenericReq *r, SCSIDevice *s)
 page = r->req.cmd.buf[2];
 if (page == 0xb0) {
 uint32_t max_transfer =
-blk_get_max_transfer(s->conf.blk) / s->blocksize;
+blk_get_max_ioctl_transfer(s->conf.blk) / s->blocksize;
 
 assert(max_transfer);
 stl_be_p(&r->buf[8], max_transfer);
@@ -210,7 +210,7 @@ static int scsi_generic_emulate_block_limits(SCSIGenericReq *r, SCSIDevice *s)
 uint8_t buf[64];
 
 SCSIBlockLimits bl = {
-.max_io_sectors = blk_get_max_transfer(s->conf.blk) / s->blocksize
+.max_io_sectors = blk_get_max_ioctl_transfer(s->conf.blk) / s->blocksize
 };
 
 memset(r->buf, 0, r->buflen);
-- 
2.26.2




[PATCH 1/5] file-posix: split hdev_refresh_limits from raw_refresh_limits

2020-11-04 Thread Maxim Levitsky
From: Tom Yan 

We can and should get max transfer length and max segments for all host
devices / cdroms (on Linux).

Also use MIN_NON_ZERO instead of MIN when we clamp the max transfer length
against the max segments limit.
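
For reference, a stand-alone sketch of the MIN_NON_ZERO() semantics (the
macro is hand-rolled here; QEMU has its own definition): in BlockLimits,
0 means "unlimited", so an unset limit must never win over a real one,
which a plain MIN() would get wrong.

  #include <stdio.h>

  #define MIN_NON_ZERO(a, b) \
      ((a) == 0 ? (b) : ((b) == 0 ? (a) : ((a) < (b) ? (a) : (b))))

  int main(void)
  {
      printf("%d\n", MIN_NON_ZERO(0, 512));    /* 512: unset limit loses */
      printf("%d\n", MIN_NON_ZERO(1024, 0));   /* 1024                   */
      printf("%d\n", MIN_NON_ZERO(1024, 512)); /* 512: ordinary minimum  */
      /* A plain MIN(0, 512) would wrongly yield 0, i.e. "no transfer". */
      return 0;
  }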

Signed-off-by: Tom Yan 
Signed-off-by: Maxim Levitsky 
---
 block/file-posix.c | 61 ++
 1 file changed, 45 insertions(+), 16 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index c63926d592..6581f41b2b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1162,6 +1162,12 @@ static void raw_reopen_abort(BDRVReopenState *state)
 
 static int sg_get_max_transfer_length(int fd)
 {
+/*
+ * BLKSECTGET for /dev/sg* character devices incorrectly returns
+ * the max transfer size in bytes (rather than in blocks).
+ * Also note that /dev/sg* doesn't support BLKSSZGET ioctl.
+ */
+
 #ifdef BLKSECTGET
 int max_bytes = 0;
 
@@ -1175,7 +1181,24 @@ static int sg_get_max_transfer_length(int fd)
 #endif
 }
 
-static int sg_get_max_segments(int fd)
+static int get_max_transfer_length(int fd)
+{
+#if defined(BLKSECTGET) && defined(BLKSSZGET)
+int sect = 0;
+int ssz = 0;
+
+if (ioctl(fd, BLKSECTGET, &sect) == 0 &&
+ioctl(fd, BLKSSZGET, &ssz) == 0) {
+return sect * ssz;
+} else {
+return -errno;
+}
+#else
+return -ENOSYS;
+#endif
+}
+
+static int get_max_segments(int fd)
 {
 #ifdef CONFIG_LINUX
 char buf[32];
@@ -1230,23 +1253,29 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
 
-if (bs->sg) {
-int ret = sg_get_max_transfer_length(s->fd);
+raw_probe_alignment(bs, s->fd, errp);
+bs->bl.min_mem_alignment = s->buf_align;
+bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size);
+}
 
-if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
-bs->bl.max_transfer = pow2floor(ret);
-}
+static void hdev_refresh_limits(BlockDriverState *bs, Error **errp)
+{
+BDRVRawState *s = bs->opaque;
 
-ret = sg_get_max_segments(s->fd);
-if (ret > 0) {
-bs->bl.max_transfer = MIN(bs->bl.max_transfer,
-  ret * qemu_real_host_page_size);
-}
+int ret = bs->sg ? sg_get_max_transfer_length(s->fd) :
+   get_max_transfer_length(s->fd);
+
+if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
+bs->bl.max_transfer = pow2floor(ret);
 }
 
-raw_probe_alignment(bs, s->fd, errp);
-bs->bl.min_mem_alignment = s->buf_align;
-bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size);
+ret = get_max_segments(s->fd);
+if (ret > 0) {
+bs->bl.max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+   ret * qemu_real_host_page_size);
+}
+
+raw_refresh_limits(bs, errp);
 }
 
 static int check_for_dasd(int fd)
@@ -3600,7 +3629,7 @@ static BlockDriver bdrv_host_device = {
 .bdrv_co_pdiscard   = hdev_co_pdiscard,
 .bdrv_co_copy_range_from = raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = raw_co_copy_range_to,
-.bdrv_refresh_limits = raw_refresh_limits,
+.bdrv_refresh_limits = hdev_refresh_limits,
 .bdrv_io_plug = raw_aio_plug,
 .bdrv_io_unplug = raw_aio_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
@@ -3724,7 +3753,7 @@ static BlockDriver bdrv_host_cdrom = {
 .bdrv_co_preadv = raw_co_preadv,
 .bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
-.bdrv_refresh_limits = raw_refresh_limits,
+.bdrv_refresh_limits = hdev_refresh_limits,
 .bdrv_io_plug = raw_aio_plug,
 .bdrv_io_unplug = raw_aio_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
-- 
2.26.2




[PATCH 5/5] block/scsi: correctly emulate the VPD block limits page

2020-11-04 Thread Maxim Levitsky
When the device doesn't support the VPD block limits page, we emulate it even
for SCSI passthrough.

As a part of the emulation we need to add it to the 'Supported VPD Pages'
page.

The code that does this adds it to the page, but it doesn't increase the length
of the data to be copied to the guest; thus the guest never sees the VPD block
limits page as supported.

Bump the transfer size by 1 in this case.
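
To make the bug concrete, here is a stand-alone sketch of the in-place
insert plus the missing length bump (the buffer contents are hypothetical,
and the byte helpers stand in for QEMU's lduw_be_p()/stw_be_p()):

  #include <stdint.h>
  #include <stdio.h>

  static uint16_t lduw_be(const uint8_t *p) { return (p[0] << 8) | p[1]; }
  static void stw_be(uint8_t *p, uint16_t v) { p[0] = v >> 8; p[1] = v & 0xff; }

  int main(void)
  {
      /* Hypothetical "Supported VPD Pages" response: a 4-byte header
       * (page length at offset 2..3), then page codes in ascending
       * order. */
      uint8_t buf[64] = { 0x00, 0x00, 0x00, 0x03, 0x00, 0x80, 0x83 };
      int len = 7;                              /* bytes from the device */
      uint16_t page_idx = lduw_be(buf + 2) + 4; /* one past the list     */

      /* Shift any codes >= 0xb0 up one slot, drop 0xb0 into the gap. */
      while (page_idx > 4 && buf[page_idx - 1] >= 0xb0) {
          buf[page_idx] = buf[page_idx - 1];
          page_idx--;
      }
      buf[page_idx] = 0xb0;

      stw_be(buf + 2, lduw_be(buf + 2) + 1); /* grow the page length */
      len++; /* ...and, crucially, the byte count copied to the guest */

      printf("page length now %u, transfer length now %d\n",
             lduw_be(buf + 2), len);
      return 0;
  }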

I also refactored the code a bit; hopefully I didn't introduce
another buffer overflow in this code...

Signed-off-by: Maxim Levitsky 
---
 hw/scsi/scsi-generic.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 6df67bf889..387d885aee 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -134,9 +134,9 @@ static int execute_command(BlockBackend *blk,
 return 0;
 }
 
-static void scsi_handle_inquiry_reply(SCSIGenericReq *r, SCSIDevice *s)
+static int scsi_handle_inquiry_reply(SCSIGenericReq *r, SCSIDevice *s, int len)
 {
-uint8_t page, page_idx;
+uint8_t page;
 
 /*
  *  EVPD set to zero returns the standard INQUIRY data.
@@ -188,20 +188,26 @@ static void scsi_handle_inquiry_reply(SCSIGenericReq *r, SCSIDevice *s)
  * right place with an in-place insert.  When the while loop
  * begins the device response is at r[0] to r[page_idx - 1].
  */
-page_idx = lduw_be_p(r->buf + 2) + 4;
-page_idx = MIN(page_idx, r->buflen);
+uint16_t page_len = lduw_be_p(r->buf + 2) + 4;
+
+/* index of the first byte after the page data that the device gave us */
+uint16_t page_idx = page_len;
+
+if (page_idx >= r->buflen)
+return len;
+
 while (page_idx > 4 && r->buf[page_idx - 1] >= 0xb0) {
-if (page_idx < r->buflen) {
-r->buf[page_idx] = r->buf[page_idx - 1];
-}
+r->buf[page_idx] = r->buf[page_idx - 1];
 page_idx--;
 }
-if (page_idx < r->buflen) {
-r->buf[page_idx] = 0xb0;
-}
+r->buf[page_idx] = 0xb0;
+
+/* increase the page len field */
 stw_be_p(r->buf + 2, lduw_be_p(r->buf + 2) + 1);
+len++;
 }
 }
+return len;
 }
 
 static int scsi_generic_emulate_block_limits(SCSIGenericReq *r, SCSIDevice *s)
@@ -316,7 +322,7 @@ static void scsi_read_complete(void * opaque, int ret)
 }
 }
 if (r->req.cmd.buf[0] == INQUIRY) {
-scsi_handle_inquiry_reply(r, s);
+len = scsi_handle_inquiry_reply(r, s, len);
 }
 
 req_complete:
-- 
2.26.2




Re: [PATCH v2 13/20] iotests: 129: prepare for backup over block-copy

2020-11-04 Thread Max Reitz
On 22.10.20 23:10, Vladimir Sementsov-Ogievskiy wrote:
> 23.07.2020 11:03, Max Reitz wrote:
>> On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:
>>> After introducing parallel async copy requests instead of plain
>>> cluster-by-cluster copying loop, backup job may finish earlier than
>>> final assertion in do_test_stop. Let's require slow backup explicitly
>>> by specifying speed parameter.
>>
>> Isn’t the problem really that block_set_io_throttle does absolutely
>> nothing?  (Which is a long-standing problem with 129.  I personally just
>> never run it, honestly.)
> 
> Hmm.. is it better to drop test_drive_backup() from here ?

I think the best would be to revisit this:

https://lists.nongnu.org/archive/html/qemu-block/2019-06/msg00499.html

Max




[PATCH 2/5] file-posix: add sg_get_max_segments that actually works with sg

2020-11-04 Thread Maxim Levitsky
From: Tom Yan 

sg devices have different major/minor numbers than their corresponding
block devices, so using sysfs to get max segments never really worked
for them.

Fortunately the sg driver provides an ioctl to get sg_tablesize,
which is apparently equivalent to max segments.

Signed-off-by: Tom Yan 
Signed-off-by: Maxim Levitsky 
---
 block/file-posix.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 6581f41b2b..c4df757504 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1181,6 +1181,26 @@ static int sg_get_max_transfer_length(int fd)
 #endif
 }
 
+static int sg_get_max_segments(int fd)
+{
+/*
+ * /dev/sg* character devices report 'max_segments' via
+ * SG_GET_SG_TABLESIZE ioctl
+ */
+
+#ifdef SG_GET_SG_TABLESIZE
+long max_segments = 0;
+
+if (ioctl(fd, SG_GET_SG_TABLESIZE, &max_segments) == 0) {
+return max_segments;
+} else {
+return -errno;
+}
+#else
+return -ENOSYS;
+#endif
+}
+
 static int get_max_transfer_length(int fd)
 {
 #if defined(BLKSECTGET) && defined(BLKSSZGET)
@@ -1269,7 +1289,7 @@ static void hdev_refresh_limits(BlockDriverState *bs, Error **errp)
 bs->bl.max_transfer = pow2floor(ret);
 }
 
-ret = get_max_segments(s->fd);
+ret = bs->sg ? sg_get_max_segments(s->fd) : get_max_segments(s->fd);
 if (ret > 0) {
 bs->bl.max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
ret * qemu_real_host_page_size);
-- 
2.26.2




[PATCH 3/5] block: add max_ioctl_transfer to BlockLimits

2020-11-04 Thread Maxim Levitsky
The maximum transfer size when accessing a kernel block device is only relevant
when using SCSI passthrough (the SG_IO ioctl), since only in this case are the
requests passed directly to the underlying hardware with no pre-processing.
The same is true when using /dev/sg* character devices (which only support
SG_IO).

Therefore, split the block driver's advertised max transfer size into
the regular max transfer size and the max transfer size for SCSI passthrough
(the new max_ioctl_transfer field).

In the next patch, the qemu block drivers that support SCSI passthrough
will set the max_ioctl_transfer field, and simultaneously, the block devices
that implement SCSI passthrough will switch to 'blk_get_max_ioctl_transfer' to
query it and pass it to the guest.

Signed-off-by: Maxim Levitsky 
---
 block/block-backend.c  | 12 
 block/io.c |  2 ++
 include/block/block_int.h  |  4 
 include/sysemu/block-backend.h |  1 +
 4 files changed, 19 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index ce78d30794..c1d149a755 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1938,6 +1938,18 @@ uint32_t blk_get_max_transfer(BlockBackend *blk)
 return MIN_NON_ZERO(max, INT_MAX);
 }
 
+/* Returns the maximum transfer length, for SCSI passthrough */
+uint32_t blk_get_max_ioctl_transfer(BlockBackend *blk)
+{
+BlockDriverState *bs = blk_bs(blk);
+uint32_t max = 0;
+
+if (bs) {
+max = bs->bl.max_ioctl_transfer;
+}
+return MIN_NON_ZERO(max, INT_MAX);
+}
+
 int blk_get_max_iov(BlockBackend *blk)
 {
 return blk->root->bs->bl.max_iov;
diff --git a/block/io.c b/block/io.c
index ec5e152bb7..3eae176992 100644
--- a/block/io.c
+++ b/block/io.c
@@ -126,6 +126,8 @@ static void bdrv_merge_limits(BlockLimits *dst, const BlockLimits *src)
 {
 dst->opt_transfer = MAX(dst->opt_transfer, src->opt_transfer);
 dst->max_transfer = MIN_NON_ZERO(dst->max_transfer, src->max_transfer);
+dst->max_ioctl_transfer = MIN_NON_ZERO(dst->max_ioctl_transfer,
+src->max_ioctl_transfer);
 dst->opt_mem_alignment = MAX(dst->opt_mem_alignment,
  src->opt_mem_alignment);
 dst->min_mem_alignment = MAX(dst->min_mem_alignment,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 38cad9d15c..b198165114 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -678,6 +678,10 @@ typedef struct BlockLimits {
  * clamped down. */
 uint32_t max_transfer;
 
+/* Maximal transfer length for SCSI passthrough (ioctl interface) */
+uint32_t max_ioctl_transfer;
+
+
 /* memory alignment, in bytes so that no bounce buffer is needed */
 size_t min_mem_alignment;
 
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 8203d7f6f9..b019a37b7a 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -203,6 +203,7 @@ void blk_eject(BlockBackend *blk, bool eject_flag);
 int blk_get_flags(BlockBackend *blk);
 uint32_t blk_get_request_alignment(BlockBackend *blk);
 uint32_t blk_get_max_transfer(BlockBackend *blk);
+uint32_t blk_get_max_ioctl_transfer(BlockBackend *blk);
 int blk_get_max_iov(BlockBackend *blk);
 void blk_set_guest_block_size(BlockBackend *blk, int align);
 void *blk_try_blockalign(BlockBackend *blk, size_t size);
-- 
2.26.2




[PATCH 0/5] SCSI: fix transfer limits for SCSI passthrough

2020-11-04 Thread Maxim Levitsky
This patch series attempts to provide a solution to the problem of the transfer
limits of the raw file driver (host_device/file-posix), some of which I
already tried to fix in the past.

I included two patches from Tom Yan which fix two issues with reading the limits
correctly from the /dev/sg* character devices in the first place.

The only change to these patches is that I tweaked the comments in the
source a bit to better document the /dev/sg quirks.

The other two patches in this series split the max transfer limit that QEMU
block devices expose in two:
one limit for regular IO, and another for SG_IO (aka bdrv_*_ioctl).
The two device drivers (scsi-block and scsi-generic) that use the latter
are switched to the new interface.

This ensures that the raw driver can still advertise an unlimited transfer
length (which yields the highest performance), except when it is used for
SG_IO.
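
In API terms the split looks roughly like this (a sketch;
blk_get_max_ioctl_transfer() is the one introduced by the patches below):

    /* regular I/O path: can stay effectively unlimited */
    uint32_t max_io    = blk_get_max_transfer(blk);

    /* SG_IO path: bounded by what the host device really accepts */
    uint32_t max_ioctl = blk_get_max_ioctl_transfer(blk);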

I also include a somewhat unrelated fix for a bug I found in QEMU's
SCSI passthrough while testing this:
when QEMU emulates the VPD Block Limits page for a SCSI device that doesn't
implement it, it doesn't actually advertise the emulated page to the guest.
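
(As far as I can tell, the root cause is that the emulated Block Limits
page 0xb0 is never added to the "Supported VPD Pages" page 0x00 that the
device returns, so the guest never asks for it. A rough sketch of the kind
of fix involved; the page layout is per SPC and page_listed() is a name I
made up:)

    if (page == 0x00 && !page_listed(buf, 0xb0)) {
        buf[4 + buf[3]] = 0xb0;   /* append 0xb0 to the page code list */
        buf[3] += 1;              /* the list length lives in byte 3   */
    }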

I tested this by doing both regular and SG_IO passthrough of my
USB SD card reader.

That device turned out to be perfect for the task, since it has a max
transfer size of 1024 blocks (512K) and it enforces it.

Also, it doesn't implement the VPD Block Limits page (the transfer size
limit probably comes from something USB-related), which triggered the
unrelated bug.

Without the patches I was able to see IO errors and a wrong max transfer
size in the guest, and with the patches both issues were gone.

I also found an unrelated issue in /dev/sg passthrough in the kernel.
It turns out that the in-kernel driver has a limit of 16 requests in flight,
regardless of what the underlying device supports.

With a large multi-threaded fio job and a debug print in QEMU, it is easy to
see this, although the errors don't do much harm to the guest, as it retries
the IO and eventually succeeds.
It is an open question whether this should be solved.

Maxim Levitsky (3):
  block: add max_ioctl_transfer to BlockLimits
  block: use blk_get_max_ioctl_transfer for SCSI passthrough
  block/scsi: correctly emulate the VPD block limits page

Tom Yan (2):
  file-posix: split hdev_refresh_limits from raw_refresh_limits
  file-posix: add sg_get_max_segments that actually works with sg

 block/block-backend.c  | 12 ++
 block/file-posix.c | 79 +++---
 block/io.c |  2 +
 block/iscsi.c  |  1 +
 hw/scsi/scsi-generic.c | 32 --
 include/block/block_int.h  |  4 ++
 include/sysemu/block-backend.h |  1 +
 7 files changed, 103 insertions(+), 28 deletions(-)

-- 
2.26.2





[PATCH 6/7] qom: Add FIELD_PTR, a type-safe wrapper for object_field_prop_ptr()

2020-11-04 Thread Eduardo Habkost
Introduce a FIELD_PTR macro that will ensure that the struct field we
are accessing has the correct size, and will return a pointer of the
correct type.
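
A minimal usage sketch (this assumes object_field_prop_ptr() asserts that
@expected_size matches the size recorded in the Property definition):

    /* prop was defined for a uint32_t field: OK */
    uint32_t *ok = FIELD_PTR(obj, prop, uint32_t);

    /* size mismatch: trips the assertion instead of silently
     * reinterpreting the field */
    uint8_t *bad = FIELD_PTR(obj, prop, uint8_t);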

Signed-off-by: Eduardo Habkost 
---
Cc: Stefan Berger 
Cc: Stefano Stabellini 
Cc: Anthony Perard 
Cc: Paul Durrant 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Paolo Bonzini 
Cc: "Daniel P. Berrangé" 
Cc: Eduardo Habkost 
Cc: Cornelia Huck 
Cc: Thomas Huth 
Cc: Halil Pasic 
Cc: Christian Borntraeger 
Cc: Richard Henderson 
Cc: David Hildenbrand 
Cc: Matthew Rosato 
Cc: Alex Williamson 
Cc: qemu-de...@nongnu.org
Cc: xen-de...@lists.xenproject.org
Cc: qemu-block@nongnu.org
Cc: qemu-s3...@nongnu.org
---
 include/qom/field-property.h | 21 ++-
 backends/tpm/tpm_util.c  |  6 ++--
 hw/block/xen-block.c |  4 +--
 hw/core/qdev-properties-system.c | 50 +-
 hw/s390x/css.c   |  4 +--
 hw/s390x/s390-pci-bus.c  |  4 +--
 hw/vfio/pci-quirks.c |  4 +--
 qom/field-property.c |  3 +-
 qom/property-types.c | 60 +---
 9 files changed, 89 insertions(+), 67 deletions(-)

diff --git a/include/qom/field-property.h b/include/qom/field-property.h
index 1d3bf9699b..58baaca160 100644
--- a/include/qom/field-property.h
+++ b/include/qom/field-property.h
@@ -125,6 +125,25 @@ object_class_property_add_field(ObjectClass *oc, const 
char *name,
 Property *prop,
 ObjectPropertyAllowSet allow_set);
 
-void *object_field_prop_ptr(Object *obj, Property *prop);
+/**
+ * object_field_prop_ptr: Get pointer to property field
+ * @obj: the object instance
+ * @prop: field property definition
+ * @expected_size: expected size of struct field
+ *
+ * Don't use this function directly, use the FIELD_PTR() macro instead.
+ */
+void *object_field_prop_ptr(Object *obj, Property *prop, size_t expected_size);
+
+/**
+ * FIELD_PTR: Get pointer to struct field for property
+ *
+ * This returns a pointer to type @type, pointing to the struct
+ * field containing the property value.
+ *
+ * @type must match the expected type for the property.
+ */
+#define FIELD_PTR(obj, prop, type) \
+((type *)object_field_prop_ptr((obj), (prop), sizeof(type)))
 
 #endif
diff --git a/backends/tpm/tpm_util.c b/backends/tpm/tpm_util.c
index 556e21388c..da80379404 100644
--- a/backends/tpm/tpm_util.c
+++ b/backends/tpm/tpm_util.c
@@ -35,7 +35,7 @@
 static void get_tpm(Object *obj, Visitor *v, const char *name,
 Property *prop, Error **errp)
 {
-TPMBackend **be = object_field_prop_ptr(obj, prop);
+TPMBackend **be = FIELD_PTR(obj, prop, TPMBackend *);
 char *p;
 
 p = g_strdup(*be ? (*be)->id : "");
@@ -46,7 +46,7 @@ static void get_tpm(Object *obj, Visitor *v, const char *name,
 static void set_tpm(Object *obj, Visitor *v, const char *name,
 Property *prop, Error **errp)
 {
-TPMBackend *s, **be = object_field_prop_ptr(obj, prop);
+TPMBackend *s, **be = FIELD_PTR(obj, prop, TPMBackend *);
 char *str;
 
    if (!visit_type_str(v, name, &str, errp)) {
@@ -65,7 +65,7 @@ static void set_tpm(Object *obj, Visitor *v, const char *name,
 
 static void release_tpm(Object *obj, const char *name, Property *prop)
 {
-TPMBackend **be = object_field_prop_ptr(obj, prop);
+TPMBackend **be = FIELD_PTR(obj, prop, TPMBackend *);
 
 if (*be) {
 tpm_backend_reset(*be);
diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
index c1ee634639..390bf417ab 100644
--- a/hw/block/xen-block.c
+++ b/hw/block/xen-block.c
@@ -335,7 +335,7 @@ static char *disk_to_vbd_name(unsigned int disk)
 static void xen_block_get_vdev(Object *obj, Visitor *v, const char *name,
 Property *prop, Error **errp)
 {
-XenBlockVdev *vdev = object_field_prop_ptr(obj, prop);
+XenBlockVdev *vdev = FIELD_PTR(obj, prop, XenBlockVdev);
 char *str;
 
 switch (vdev->type) {
@@ -394,7 +394,7 @@ static int vbd_name_to_disk(const char *name, const char 
**endp,
 static void xen_block_set_vdev(Object *obj, Visitor *v, const char *name,
 Property *prop, Error **errp)
 {
-XenBlockVdev *vdev = object_field_prop_ptr(obj, prop);
+XenBlockVdev *vdev = FIELD_PTR(obj, prop, XenBlockVdev);
 char *str, *p;
 const char *end;
 
diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index 2fdd5863bb..1ec64514b9 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -61,7 +61,7 @@ static bool check_prop_still_unset(Object *obj, const char 
*name,
 static void get_drive(Object *obj, Visitor *v, const char *name,
   Property *prop, Error **errp)
 {
-void **ptr = object_field_prop_ptr(obj, prop);
+void **ptr = FIELD_PTR(obj, prop, void *);
 const char *value;
 char *p;
 
@@ -87,7 +87,7 @@ static void set_drive_helper(Object *obj, Visitor 

Re: [PATCH v2 22/44] qdev: Move dev->realized check to qdev_property_set()

2020-11-04 Thread Stefan Berger

On 11/4/20 10:59 AM, Eduardo Habkost wrote:

Every single qdev property setter function manually checks
dev->realized.  We can just check dev->realized inside
qdev_property_set() instead.

The check is being added as a separate function
(qdev_prop_allow_set()) because it will become a callback later.

Signed-off-by: Eduardo Habkost 


Reviewed-by: Stefan Berger 



---
Changes v1 -> v2:
* Removed unused variable at xen_block_set_vdev()
* Redone patch after changes in the previous patches in the
   series
---
Cc: Stefan Berger 
Cc: Stefano Stabellini 
Cc: Anthony Perard 
Cc: Paul Durrant 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Paolo Bonzini 
Cc: "Daniel P. Berrangé" 
Cc: Eduardo Habkost 
Cc: Cornelia Huck 
Cc: Halil Pasic 
Cc: Christian Borntraeger 
Cc: Richard Henderson 
Cc: David Hildenbrand 
Cc: Thomas Huth 
Cc: Matthew Rosato 
Cc: Alex Williamson 
Cc: Mark Cave-Ayland 
Cc: Artyom Tarasenko 
Cc: qemu-de...@nongnu.org
Cc: xen-de...@lists.xenproject.org
Cc: qemu-block@nongnu.org
Cc: qemu-s3...@nongnu.org
---
  backends/tpm/tpm_util.c  |   6 --
  hw/block/xen-block.c |   6 --
  hw/core/qdev-properties-system.c |  70 --
  hw/core/qdev-properties.c| 100 ++-
  hw/s390x/css.c   |   6 --
  hw/s390x/s390-pci-bus.c  |   6 --
  hw/vfio/pci-quirks.c |   6 --
  target/sparc/cpu.c   |   6 --
  8 files changed, 18 insertions(+), 188 deletions(-)

diff --git a/backends/tpm/tpm_util.c b/backends/tpm/tpm_util.c
index dba2f6b04a..0b07cf55ea 100644
--- a/backends/tpm/tpm_util.c
+++ b/backends/tpm/tpm_util.c
@@ -46,16 +46,10 @@ static void get_tpm(Object *obj, Visitor *v, const char 
*name, void *opaque,
  static void set_tpm(Object *obj, Visitor *v, const char *name, void *opaque,
  Error **errp)
  {
-DeviceState *dev = DEVICE(obj);
  Property *prop = opaque;
  TPMBackend *s, **be = qdev_get_prop_ptr(obj, prop);
  char *str;
  
-if (dev->realized) {

-qdev_prop_set_after_realize(dev, name, errp);
-return;
-}
-
  if (!visit_type_str(v, name, &str, errp)) {
  return;
  }
diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
index 905e4acd97..bd1aef63a7 100644
--- a/hw/block/xen-block.c
+++ b/hw/block/xen-block.c
@@ -395,17 +395,11 @@ static int vbd_name_to_disk(const char *name, const char 
**endp,
  static void xen_block_set_vdev(Object *obj, Visitor *v, const char *name,
 void *opaque, Error **errp)
  {
-DeviceState *dev = DEVICE(obj);
  Property *prop = opaque;
  XenBlockVdev *vdev = qdev_get_prop_ptr(obj, prop);
  char *str, *p;
  const char *end;
  
-if (dev->realized) {

-qdev_prop_set_after_realize(dev, name, errp);
-return;
-}
-
  if (!visit_type_str(v, name, &str, errp)) {
  return;
  }
diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index 202abd0e4b..0d3e57bba0 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -94,11 +94,6 @@ static void set_drive_helper(Object *obj, Visitor *v, const 
char *name,
  bool blk_created = false;
  int ret;
  
-if (dev->realized) {

-qdev_prop_set_after_realize(dev, name, errp);
-return;
-}
-
  if (!visit_type_str(v, name, &str, errp)) {
  return;
  }
@@ -230,17 +225,11 @@ static void get_chr(Object *obj, Visitor *v, const char 
*name, void *opaque,
  static void set_chr(Object *obj, Visitor *v, const char *name, void *opaque,
  Error **errp)
  {
-DeviceState *dev = DEVICE(obj);
  Property *prop = opaque;
  CharBackend *be = qdev_get_prop_ptr(obj, prop);
  Chardev *s;
  char *str;
  
-if (dev->realized) {

-qdev_prop_set_after_realize(dev, name, errp);
-return;
-}
-
  if (!visit_type_str(v, name, &str, errp)) {
  return;
  }
@@ -311,18 +300,12 @@ static void get_mac(Object *obj, Visitor *v, const char 
*name, void *opaque,
  static void set_mac(Object *obj, Visitor *v, const char *name, void *opaque,
  Error **errp)
  {
-DeviceState *dev = DEVICE(obj);
  Property *prop = opaque;
  MACAddr *mac = qdev_get_prop_ptr(obj, prop);
  int i, pos;
  char *str;
  const char *p;
  
-if (dev->realized) {

-qdev_prop_set_after_realize(dev, name, errp);
-return;
-}
-
  if (!visit_type_str(v, name, &str, errp)) {
  return;
  }
@@ -390,7 +373,6 @@ static void get_netdev(Object *obj, Visitor *v, const char 
*name,
  static void set_netdev(Object *obj, Visitor *v, const char *name,
 void *opaque, Error **errp)
  {
-DeviceState *dev = DEVICE(obj);
  Property *prop = opaque;
  NICPeers *peers_ptr = qdev_get_prop_ptr(obj, prop);
  NetClientState **ncs = peers_ptr->ncs;
@@ -398,11 +380,6 @@ static void set_netdev(Object 

Re: [PATCH v2 08/20] block/block-copy: add block_copy_cancel

2020-11-04 Thread Max Reitz
On 22.10.20 22:50, Vladimir Sementsov-Ogievskiy wrote:
> 22.07.2020 14:28, Max Reitz wrote:
>> On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:
>>> Add function to cancel running async block-copy call. It will be used
>>> in backup.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>>> ---
>>>   include/block/block-copy.h |  7 +++
>>>   block/block-copy.c | 22 +++---
>>>   2 files changed, 26 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/block/block-copy.h b/include/block/block-copy.h
>>> index d40e691123..370a194d3c 100644
>>> --- a/include/block/block-copy.h
>>> +++ b/include/block/block-copy.h
>>> @@ -67,6 +67,13 @@ BlockCopyCallState
>>> *block_copy_async(BlockCopyState *s,
>>>   void block_copy_set_speed(BlockCopyState *s, BlockCopyCallState
>>> *call_state,
>>>     uint64_t speed);
>>>   +/*
>>> + * Cancel running block-copy call.
>>> + * Cancel leaves block-copy state valid: dirty bits are correct and
>>> you may use
>>> + * cancel +  to emulate
>>> pause/resume.
>>> + */
>>> +void block_copy_cancel(BlockCopyCallState *call_state);
>>> +
>>>   BdrvDirtyBitmap *block_copy_dirty_bitmap(BlockCopyState *s);
>>>   void block_copy_set_skip_unallocated(BlockCopyState *s, bool skip);
>>>   diff --git a/block/block-copy.c b/block/block-copy.c
>>> index 851d9c8aaf..b551feb6c2 100644
>>> --- a/block/block-copy.c
>>> +++ b/block/block-copy.c
>>> @@ -44,6 +44,8 @@ typedef struct BlockCopyCallState {
>>>   bool failed;
>>>   bool finished;
>>>   QemuCoSleepState *sleep_state;
>>> +    bool cancelled;
>>> +    Coroutine *canceller;
>>>     /* OUT parameters */
>>>   bool error_is_read;
>>> @@ -582,7 +584,7 @@ block_copy_dirty_clusters(BlockCopyCallState
>>> *call_state)
>>>   assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
>>>   assert(QEMU_IS_ALIGNED(bytes, s->cluster_size));
>>>   -    while (bytes && aio_task_pool_status(aio) == 0) {
>>> +    while (bytes && aio_task_pool_status(aio) == 0 &&
>>> !call_state->cancelled) {
>>>   BlockCopyTask *task;
>>>   int64_t status_bytes;
>>>   @@ -693,7 +695,7 @@ static int coroutine_fn
>>> block_copy_common(BlockCopyCallState *call_state)
>>>   do {
>>>   ret = block_copy_dirty_clusters(call_state);
>>>   -    if (ret == 0) {
>>> +    if (ret == 0 && !call_state->cancelled) {
>>>   ret = block_copy_wait_one(call_state->s,
>>> call_state->offset,
>>>     call_state->bytes);
>>>   }
>>> @@ -707,13 +709,18 @@ static int coroutine_fn
>>> block_copy_common(BlockCopyCallState *call_state)
>>>    * 2. We have waited for some intersecting block-copy request
>>>    *    It may have failed and produced new dirty bits.
>>>    */
>>> -    } while (ret > 0);
>>> +    } while (ret > 0 && !call_state->cancelled);
>>
>> Would it be cleaner if block_copy_dirty_cluster() just returned
>> -ECANCELED?  Or would that pose a problem for its callers or the async
>> callback?
>>
> 
> I'd prefer not to merge the io ret with the block-copy logic: who knows what
> underlying operations may return... Can't it be _another_ ECANCELED?
> And it would be just a sugar for block_copy_dirty_clusters() call, I'll
> have to check ->cancelled after block_copy_wait_one() anyway.
> Also, for the next version I'll try to make it more obvious that a finished
> block-copy call is in one of three states:
>  - success
>  - failed
>  - cancelled

OK.

Max




[PATCH 4/7] qom: Replace void* parameter with Property* on field getters/setters

2020-11-04 Thread Eduardo Habkost
All field property getters and setters must interpret the fourth
argument as Property*.  Change the function signature of field
property getters and setters to indicate that.
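
In other words, the conversion changes the accessor signatures like this
(get_foo is a placeholder; the real conversions are in the diff below):

    /* before: the Property* arrives hidden behind a void* */
    static void get_foo(Object *obj, Visitor *v, const char *name,
                        void *opaque, Error **errp);

    /* after: the signature states what the argument really is */
    static void get_foo(Object *obj, Visitor *v, const char *name,
                        Property *prop, Error **errp);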

Signed-off-by: Eduardo Habkost 
---
Cc: Stefan Berger 
Cc: Stefano Stabellini 
Cc: Anthony Perard 
Cc: Paul Durrant 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Paolo Bonzini 
Cc: "Daniel P. Berrangé" 
Cc: Eduardo Habkost 
Cc: Richard Henderson 
Cc: David Hildenbrand 
Cc: Cornelia Huck 
Cc: Thomas Huth 
Cc: Halil Pasic 
Cc: Christian Borntraeger 
Cc: Matthew Rosato 
Cc: Alex Williamson 
Cc: Mark Cave-Ayland 
Cc: Artyom Tarasenko 
Cc: qemu-de...@nongnu.org
Cc: xen-de...@lists.xenproject.org
Cc: qemu-block@nongnu.org
Cc: qemu-s3...@nongnu.org
---
 include/qom/field-property-internal.h |   8 +-
 include/qom/field-property.h  |  26 ---
 backends/tpm/tpm_util.c   |  11 ++-
 hw/block/xen-block.c  |   6 +-
 hw/core/qdev-properties-system.c  |  86 +-
 hw/s390x/css.c|   6 +-
 hw/s390x/s390-pci-bus.c   |   6 +-
 hw/vfio/pci-quirks.c  |  10 +--
 qom/property-types.c  | 102 +-
 target/sparc/cpu.c|   4 +-
 10 files changed, 105 insertions(+), 160 deletions(-)

diff --git a/include/qom/field-property-internal.h 
b/include/qom/field-property-internal.h
index 7aa27ce836..bc7d25033d 100644
--- a/include/qom/field-property-internal.h
+++ b/include/qom/field-property-internal.h
@@ -9,9 +9,9 @@
 #define QOM_STATIC_PROPERTY_INTERNAL_H
 
 void field_prop_get_enum(Object *obj, Visitor *v, const char *name,
- void *opaque, Error **errp);
+ Property *prop, Error **errp);
 void field_prop_set_enum(Object *obj, Visitor *v, const char *name,
- void *opaque, Error **errp);
+ Property *prop, Error **errp);
 
 void field_prop_set_default_value_enum(ObjectProperty *op,
const Property *prop);
@@ -21,9 +21,9 @@ void field_prop_set_default_value_uint(ObjectProperty *op,
const Property *prop);
 
 void field_prop_get_int32(Object *obj, Visitor *v, const char *name,
-  void *opaque, Error **errp);
+  Property *prop, Error **errp);
 void field_prop_get_size32(Object *obj, Visitor *v, const char *name,
-   void *opaque, Error **errp);
+   Property *prop, Error **errp);
 
 /**
  * object_property_add_field: Add a field property to an object instance
diff --git a/include/qom/field-property.h b/include/qom/field-property.h
index e64a2b3c07..438bb25896 100644
--- a/include/qom/field-property.h
+++ b/include/qom/field-property.h
@@ -54,6 +54,18 @@ struct Property {
 const char   *link_type;
 };
 
+/**
+ * typedef FieldAccessor: a field property getter or setter function
+ * @obj: the object instance
+ * @v: the visitor that contains the property data
+ * @name: the name of the property
+ * @prop: Field property definition
+ * @errp: pointer to error information
+ */
+typedef void FieldAccessor(Object *obj, Visitor *v,
+   const char *name, Property *prop,
+   Error **errp);
+
 /**
  * struct PropertyInfo: information on a specific QOM property type
  */
@@ -71,16 +83,10 @@ struct PropertyInfo {
 /** @create: Optional callback for creation of property */
 ObjectProperty *(*create)(ObjectClass *oc, const char *name,
   Property *prop);
-/**
- * @get: Property getter.  The opaque parameter will point to
- *the  struct for the property.
- */
-ObjectPropertyAccessor *get;
-/**
- * @set: Property setter.  The opaque parameter will point to
- *the  struct for the property.
- */
-ObjectPropertyAccessor *set;
+/** @get: Property getter */
+FieldAccessor *get;
+/** @set: Property setter */
+FieldAccessor *set;
 /**
  * @release: Optional release function, called when the object
  * is destroyed
diff --git a/backends/tpm/tpm_util.c b/backends/tpm/tpm_util.c
index bb1ab34a75..e8837938e5 100644
--- a/backends/tpm/tpm_util.c
+++ b/backends/tpm/tpm_util.c
@@ -32,10 +32,10 @@
 
 /* tpm backend property */
 
-static void get_tpm(Object *obj, Visitor *v, const char *name, void *opaque,
-Error **errp)
+static void get_tpm(Object *obj, Visitor *v, const char *name,
+Property *prop, Error **errp)
 {
-TPMBackend **be = object_field_prop_ptr(obj, opaque);
+TPMBackend **be = object_field_prop_ptr(obj, prop);
 char *p;
 
 p = g_strdup(*be ? (*be)->id : "");
@@ -43,10 +43,9 @@ static void get_tpm(Object *obj, Visitor *v, const char 
*name, void *opaque,
 g_free(p);
 }
 
-static void set_tpm(Object *obj, Visitor *v, const char *name, void 

Re: [PATCH v2] block: Remove unused BlockDeviceMapEntry

2020-11-04 Thread Max Reitz
On 30.10.20 07:24, Markus Armbruster wrote:
> BlockDeviceMapEntry has never been used.  It was added in commit
> facd6e2 "so that it is published through the introspection mechanism."
> What exactly introspecting types that aren't used for anything could
> accomplish isn't clear.  What "introspection mechanism" to use is also
> nebulous.  To the best of my knowledge, there has never been one that
> covered this type.  Certainly not query-qmp-schema, which includes
> only types that are actually used in QMP.
> 
> Not being able to introspect BlockDeviceMapEntry hasn't bothered
> anyone enough to complain in almost four years.  Get rid of it.
> 
> Cc: Paolo Bonzini 
> Cc: Eric Blake 
> Reviewed-by: Eric Blake 
> Signed-off-by: Markus Armbruster 
> ---
> I found an old patch I neglected to merge.
> 
> Max replied to a remark in Eric's review of v1:
> 
> Max Reitz  writes:
> 
> > On 2017-07-28 20:10, Eric Blake wrote:
> >> This type is the schema for 'qemu-img map --output=json'.  And I had a
> >> patch once (that I need to revive) that added a JSON Output visitor; at
> >> which point I fixed qemu-img to convert from QAPI to JSON instead of
> >> open-coding its construction of its output string, at which point the
> >> QAPI generated code for this type is useful.
> > (Very late reply, I know, I just stumbled over *MapEntry when looking
> > over block-core.json what we might want to deprecate in 3.0)
> >
> > We already use MapEntry there -- why don't we output just that instead?
> > The only difference seems to be an additional @filename parameter which
> > would probably be actually nice to include in the output.
> >
> > Except that BlockDeviceMapEntry's documentation is better, so we should
> > merge that into MapEntry before removing the former.
> >
> > Max
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2017-12/msg02933.html
> 
> Me doing the doc update Max suggested could take more than one
> iteration, as I know nothing about this stuff.  Max, could you give it
> a try?  Feel free to take over my patch.

Thanks, done :)

https://lists.nongnu.org/archive/html/qemu-block/2020-11/msg00143.html

Max




Re: [PATCH v2 11/20] qapi: backup: add x-max-chunk and x-max-workers parameters

2020-11-04 Thread Max Reitz
On 22.10.20 22:35, Vladimir Sementsov-Ogievskiy wrote:
> 22.07.2020 15:22, Max Reitz wrote:
>> On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:
>>> Add new parameters to configure future backup features. The patch
>>> doesn't introduce aio backup requests (so we actually have only one
>>> worker) nor requests larger than one cluster. Still, formally we
>>> satisfy these maximums anyway, so add the parameters now, to facilitate
>>> a further patch which will really change backup job behavior.
>>>
>>> Options are added with x- prefixes, as the only use for them is some
>>> very conservative iotests which will be updated soon.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>>> ---
>>>   qapi/block-core.json  |  9 -
>>>   include/block/block_int.h |  7 +++
>>>   block/backup.c    | 21 +
>>>   block/replication.c   |  2 +-
>>>   blockdev.c    |  5 +
>>>   5 files changed, 42 insertions(+), 2 deletions(-)
>>>
> 
> [..]
> 
>>> @@ -422,6 +436,11 @@ BlockJob *backup_job_create(const char *job_id,
>>> BlockDriverState *bs,
>>>   if (cluster_size < 0) {
>>>   goto error;
>>>   }
>>> +    if (max_chunk && max_chunk < cluster_size) {
>>> +    error_setg(errp, "Required max-chunk (%" PRIi64") is less
>>> than backup "
>>
>> (missing a space after PRIi64)
>>
>>> +   "cluster size (%" PRIi64 ")", max_chunk,
>>> cluster_size);
>>
>> Should this be noted in the QAPI documentation?
> 
> Hmm.. It makes sense, but I don't know what to write: should be >= job
> cluster_size? But then I'll have to describe the logic of
> backup_calculate_cluster_size()...

Isn’t the logic basically just “cluster size of the target image file
(min 64k)”?  The other branches just cover error cases, and they always
return 64k, which would effectively be documented by stating that 64k is
the minimum.

But in the common cases (qcow2 or raw), we’ll never get an error anyway.
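
A sketch of backup_calculate_cluster_size() as I read it (my paraphrase,
not the literal code):

    if (bdrv_get_info(target, &bdi) < 0 || bdi.cluster_size == 0) {
        return BACKUP_CLUSTER_SIZE_DEFAULT;   /* i.e. 64k */
    }
    return MAX(BACKUP_CLUSTER_SIZE_DEFAULT, bdi.cluster_size);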

I think it’d be good to describe the cluster size somewhere, because
otherwise I find it a bit malicious to throw an error at the user that
they could not have anticipated because they have no idea what valid
values for the parameter are (until they try).

>>  (And perhaps the fact
>> that without copy offloading, we’ll never copy anything bigger than one
>> cluster at a time anyway?)
> 
> This is a parameter for background copying. Look at
> block_copy_task_create(): if call_state has its own max_chunk, it's used
> instead of the common copy_size (derived from cluster_size). But as of
> this patch, the background process through async block-copy is not yet
> implemented, and the parameter doesn't work, which is described in the
> commit message.

Ah, OK.  Right.

Max




[PATCH 1/2] qapi/block-core: Improve MapEntry documentation

2020-11-04 Thread Max Reitz
MapEntry and BlockDeviceMapEntry are kind of the same thing, and the
latter is not used, so we want to remove it.  However, the documentation
it provides for some fields is better than that of MapEntry, so steal
some of it for MapEntry.

(And adjust them a bit in the process, because I feel like we can make
them even clearer.)
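
For reference, this is the kind of output these fields end up describing
(illustrative values for a hypothetical qcow2 image with one backing file,
not taken from a real run):

    $ qemu-img map --output=json top.qcow2
    [{ "start": 0, "length": 65536, "depth": 0, "zero": false,
       "data": true, "offset": 327680},
    { "start": 65536, "length": 65536, "depth": 1, "zero": false,
       "data": true, "offset": 393216}]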

Signed-off-by: Max Reitz 
---
 qapi/block-core.json | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1b8b4156b4..3f86675357 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -244,17 +244,25 @@
 #
 # Mapping information from a virtual block range to a host file range
 #
-# @start: the start byte of the mapped virtual range
+# @start: virtual (guest) offset of the first byte described by this
+# entry
 #
 # @length: the number of bytes of the mapped virtual range
 #
-# @data: whether the mapped range has data
+# @data: reading the image will actually read data from a file (in
+#particular, if @offset is present this means that the sectors
+#are not simply preallocated, but contain actual data in raw
+#format)
 #
-# @zero: whether the virtual blocks are zeroed
+# @zero: whether the virtual blocks read as zeroes
 #
-# @depth: the depth of the mapping
+# @depth: number of layers (0 = top image, 1 = top image's backing
+# file, ..., n - 1 = bottom image (where n is the number of
+# images in the chain)) before reaching one for which the
+# range is allocated
 #
-# @offset: the offset in file that the virtual sectors are mapped to
+# @offset: if present, the image file stores the data for this range
+#  in raw format at the given (host) offset
 #
 # @filename: filename that is referred to by @offset
 #
-- 
2.28.0




[PATCH 2/2] block: Remove unused BlockDeviceMapEntry

2020-11-04 Thread Max Reitz
From: Markus Armbruster 

BlockDeviceMapEntry has never been used.  It was added in commit
facd6e2 "so that it is published through the introspection mechanism."
What exactly introspecting types that aren't used for anything could
accomplish isn't clear.  What "introspection mechanism" to use is also
nebulous.  To the best of my knowledge, there has never been one that
covered this type.  Certainly not query-qmp-schema, which includes
only types that are actually used in QMP.

Not being able to introspect BlockDeviceMapEntry hasn't bothered
anyone enough to complain in almost four years.  Get rid of it.

Cc: Paolo Bonzini 
Cc: Eric Blake 
Reviewed-by: Eric Blake 
Signed-off-by: Markus Armbruster 
---
 qapi/block-core.json | 29 -
 1 file changed, 29 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 3f86675357..04ad80bc1e 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -426,35 +426,6 @@
 ##
 { 'enum': 'BlockDeviceIoStatus', 'data': [ 'ok', 'failed', 'nospace' ] }
 
-##
-# @BlockDeviceMapEntry:
-#
-# Entry in the metadata map of the device (returned by "qemu-img map")
-#
-# @start: Offset in the image of the first byte described by this entry
-# (in bytes)
-#
-# @length: Length of the range described by this entry (in bytes)
-#
-# @depth: Number of layers (0 = top image, 1 = top image's backing file, etc.)
-# before reaching one for which the range is allocated.  The value is
-# in the range 0 to the depth of the image chain - 1.
-#
-# @zero: the sectors in this range read as zeros
-#
-# @data: reading the image will actually read data from a file (in particular,
-#if @offset is present this means that the sectors are not simply
-#preallocated, but contain actual data in raw format)
-#
-# @offset: if present, the image file stores the data for this range in
-#  raw format at the given offset.
-#
-# Since: 1.7
-##
-{ 'struct': 'BlockDeviceMapEntry',
-  'data': { 'start': 'int', 'length': 'int', 'depth': 'int', 'zero': 'bool',
-'data': 'bool', '*offset': 'int' } }
-
 ##
 # @DirtyBitmapStatus:
 #
-- 
2.28.0




[PATCH 0/2] block: Remove unused BlockDeviceMapEntry

2020-11-04 Thread Max Reitz
Hi,

Markus has revived a rather old patch to remove an unused QAPI
structure:

https://lists.nongnu.org/archive/html/qemu-block/2020-10/msg01902.html

He quoted a response of mine to the original patch, where I noted that
removing this structure is OK because it is superseded by another
structure (MapEntry), but that other structure’s documentation is worse.
So we should merge some of BlockDeviceMapEntry’s documentation into
MapEntry before removing it.

This series combines said merge with Markus’s patch to drop
BlockDeviceMapEntry.


Markus Armbruster (1):
  block: Remove unused BlockDeviceMapEntry

Max Reitz (1):
  qapi/block-core: Improve MapEntry documentation

 qapi/block-core.json | 47 
 1 file changed, 13 insertions(+), 34 deletions(-)

-- 
2.28.0




Re: Libvirt driver iothread property for virtio-scsi disks

2020-11-04 Thread Daniel P . Berrangé
On Wed, Nov 04, 2020 at 05:48:40PM +0200, Nir Soffer wrote:
> The docs[1] say:
> 
> - The optional iothread attribute assigns the disk to an IOThread as defined 
> by
>   the range for the domain iothreads value. Multiple disks may be assigned to
>   the same IOThread and are numbered from 1 to the domain iothreads value.
>   Available for a disk device target configured to use "virtio" bus and "pci"
>   or "ccw" address types. Since 1.2.8 (QEMU 2.1)
> 
> Does it mean that virtio-scsi disks do not use iothreads?
> 
> I'm experiencing horrible performance using nested VMs (up to 2 levels of
> nesting) when accessing NFS storage running on one of the VMs. The NFS
> server is using a scsi disk.

When you say 2 levels of nesting, do you definitely have KVM enabled at
all levels, or are you ending up using TCG emulation? The latter would
certainly explain terrible performance.

> 
> My theory is:
> - Writing to NFS server is very slow (too much nesting, slow disk)
> - Not using iothreads (because we don't use virtio?)
> - Guest CPU is blocked by slow I/O

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




[PATCH v2] Prefer 'on' | 'off' over 'yes' | 'no' for bool options

2020-11-04 Thread Daniel P . Berrangé
Update some docs and test cases to use 'on' | 'off' as the preferred
value for bool options.

Reviewed-by: Thomas Huth 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Daniel P. Berrangé 
---
 docs/system/vnc-security.rst | 10 +-
 include/authz/listfile.h |  2 +-
 qemu-options.hx  |  4 ++--
 tests/qemu-iotests/233   |  4 ++--
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/system/vnc-security.rst b/docs/system/vnc-security.rst
index b237b07330..3574bdb86c 100644
--- a/docs/system/vnc-security.rst
+++ b/docs/system/vnc-security.rst
@@ -65,7 +65,7 @@ encrypted session.
 .. parsed-literal::
 
|qemu_system| [...OPTIONS...] \
- -object 
tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=no \
+ -object 
tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=off \
  -vnc :1,tls-creds=tls0 -monitor stdio
 
 In the above example ``/etc/pki/qemu`` should contain at least three
@@ -84,12 +84,12 @@ connecting. The server will request that the client provide 
a
 certificate, which it will then validate against the CA certificate.
 This is a good choice if deploying in an environment with a private
 internal certificate authority. It uses the same syntax as previously,
-but with ``verify-peer`` set to ``yes`` instead.
+but with ``verify-peer`` set to ``on`` instead.
 
 .. parsed-literal::
 
|qemu_system| [...OPTIONS...] \
- -object 
tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=yes \
+ -object 
tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
  -vnc :1,tls-creds=tls0 -monitor stdio
 
 .. _vnc_005fsec_005fcertificate_005fpw:
@@ -103,7 +103,7 @@ authentication to provide two layers of authentication for 
clients.
 .. parsed-literal::
 
|qemu_system| [...OPTIONS...] \
- -object 
tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=yes \
+ -object 
tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
  -vnc :1,tls-creds=tls0,password -monitor stdio
(qemu) change vnc password
Password: 
@@ -145,7 +145,7 @@ x509 options:
 .. parsed-literal::
 
|qemu_system| [...OPTIONS...] \
- -object 
tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=yes \
+ -object 
tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
  -vnc :1,tls-creds=tls0,sasl -monitor stdio
 
 .. _vnc_005fsetup_005fsasl:
diff --git a/include/authz/listfile.h b/include/authz/listfile.h
index 0a1e5bddd3..0b7fe72198 100644
--- a/include/authz/listfile.h
+++ b/include/authz/listfile.h
@@ -73,7 +73,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(QAuthZListFile,
  * The object can be created on the command line using
  *
  *   -object authz-list-file,id=authz0,\
- *   filename=/etc/qemu/myvm-vnc.acl,refresh=yes
+ *   filename=/etc/qemu/myvm-vnc.acl,refresh=on
  *
  */
 struct QAuthZListFile {
diff --git a/qemu-options.hx b/qemu-options.hx
index 2c83390504..0bdc07bc47 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -5002,7 +5002,7 @@ SRST
 Note the use of quotes due to the x509 distinguished name
 containing whitespace, and escaping of ','.
 
-``-object authz-listfile,id=id,filename=path,refresh=yes|no``
+``-object authz-listfile,id=id,filename=path,refresh=on|off``
 Create an authorization object that will control access to
 network services.
 
@@ -5047,7 +5047,7 @@ SRST
 
  # |qemu_system| \\
  ... \\
- -object 
authz-simple,id=auth0,filename=/etc/qemu/vnc-sasl.acl,refresh=yes \\
+ -object 
authz-simple,id=auth0,filename=/etc/qemu/vnc-sasl.acl,refresh=on \\
  ...
 
 ``-object authz-pam,id=id,service=string``
diff --git a/tests/qemu-iotests/233 b/tests/qemu-iotests/233
index a5c17c3963..0b99530f7f 100755
--- a/tests/qemu-iotests/233
+++ b/tests/qemu-iotests/233
@@ -83,7 +83,7 @@ echo
 echo "== check plain client to TLS server fails =="
 
 nbd_server_start_tcp_socket \
---object 
tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=yes \
+--object 
tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=on \
 --tls-creds tls0 \
 -f $IMGFMT "$TEST_IMG" 2>> "$TEST_DIR/server.log"
 
@@ -128,7 +128,7 @@ echo "== check TLS with authorization =="
 nbd_server_stop
 
 nbd_server_start_tcp_socket \
---object 
tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=yes \
+--object 
tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=on \
 --object "authz-simple,id=authz0,identity=CN=localhost,, \
   O=Cthulu Dark Lord Enterprises client1,,L=R'lyeh,,C=South Pacific" \
 --tls-authz authz0 \
-- 
2.28.0




Re: Libvirt driver iothread property for virtio-scsi disks

2020-11-04 Thread Sergio Lopez
On Wed, Nov 04, 2020 at 05:48:40PM +0200, Nir Soffer wrote:
> The docs[1] say:
> 
> - The optional iothread attribute assigns the disk to an IOThread as defined 
> by
>   the range for the domain iothreads value. Multiple disks may be assigned to
>   the same IOThread and are numbered from 1 to the domain iothreads value.
>   Available for a disk device target configured to use "virtio" bus and "pci"
>   or "ccw" address types. Since 1.2.8 (QEMU 2.1)
> 
> Does it mean that virtio-scsi disks do not use iothreads?

virtio-scsi disks can use iothreads, but they are configured in the
scsi controller, not in the disk itself. All disks attached to the
same controller will share the same iothread, but you can also attach
multiple controllers.
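
For illustration, that looks roughly like this in the domain XML (a sketch
based on the libvirt docs; exact attribute support may vary by version):

    <iothreads>2</iothreads>
    ...
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver iothread='1'/>
    </controller>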

> I'm experiencing horrible performance using nested VMs (up to 2 levels of
> nesting) when accessing NFS storage running on one of the VMs. The NFS
> server is using a scsi disk.
> 
> My theory is:
> - Writing to NFS server is very slow (too much nesting, slow disk)
> - Not using iothreads (because we don't use virtio?)
> - Guest CPU is blocked by slow I/O

I would rule out the lack of iothreads as the culprit. They do improve
performance, but without them the performance should be quite decent
anyway. Probably something else is causing the trouble.

I would do a step-by-step analysis, testing the NFS performance from
outside the VM first, and then working upwards from that.

Sergio.

> Does this make sense?
> 
> [1] https://libvirt.org/formatdomain.html#hard-drives-floppy-disks-cdroms
> 
> Nir
> 


signature.asc
Description: PGP signature


[PATCH v2 36/44] qdev: Rename qdev_get_prop_ptr() to object_field_prop_ptr()

2020-11-04 Thread Eduardo Habkost
The function will be moved to common QOM code, as it is not
specific to TYPE_DEVICE anymore.

Signed-off-by: Eduardo Habkost 
---
Changes v1 -> v2:
* Rename to object_field_prop_ptr() instead of object_static_prop_ptr()
---
Cc: Stefan Berger 
Cc: Stefano Stabellini 
Cc: Anthony Perard 
Cc: Paul Durrant 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Paolo Bonzini 
Cc: "Daniel P. Berrangé" 
Cc: Eduardo Habkost 
Cc: Cornelia Huck 
Cc: Halil Pasic 
Cc: Christian Borntraeger 
Cc: Richard Henderson 
Cc: David Hildenbrand 
Cc: Thomas Huth 
Cc: Matthew Rosato 
Cc: Alex Williamson 
Cc: qemu-de...@nongnu.org
Cc: xen-de...@lists.xenproject.org
Cc: qemu-block@nongnu.org
Cc: qemu-s3...@nongnu.org
---
 include/hw/qdev-properties.h |  2 +-
 backends/tpm/tpm_util.c  |  6 ++--
 hw/block/xen-block.c |  4 +--
 hw/core/qdev-properties-system.c | 50 +-
 hw/core/qdev-properties.c| 60 
 hw/s390x/css.c   |  4 +--
 hw/s390x/s390-pci-bus.c  |  4 +--
 hw/vfio/pci-quirks.c |  4 +--
 8 files changed, 67 insertions(+), 67 deletions(-)

diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
index 7f8d5fc206..2bec65c8e5 100644
--- a/include/hw/qdev-properties.h
+++ b/include/hw/qdev-properties.h
@@ -223,7 +223,7 @@ void qdev_prop_set_macaddr(DeviceState *dev, const char 
*name,
const uint8_t *value);
 void qdev_prop_set_enum(DeviceState *dev, const char *name, int value);
 
-void *qdev_get_prop_ptr(Object *obj, Property *prop);
+void *object_field_prop_ptr(Object *obj, Property *prop);
 
 void qdev_prop_register_global(GlobalProperty *prop);
 const GlobalProperty *qdev_find_global_prop(Object *obj,
diff --git a/backends/tpm/tpm_util.c b/backends/tpm/tpm_util.c
index 0b07cf55ea..bb1ab34a75 100644
--- a/backends/tpm/tpm_util.c
+++ b/backends/tpm/tpm_util.c
@@ -35,7 +35,7 @@
 static void get_tpm(Object *obj, Visitor *v, const char *name, void *opaque,
 Error **errp)
 {
-TPMBackend **be = qdev_get_prop_ptr(obj, opaque);
+TPMBackend **be = object_field_prop_ptr(obj, opaque);
 char *p;
 
 p = g_strdup(*be ? (*be)->id : "");
@@ -47,7 +47,7 @@ static void set_tpm(Object *obj, Visitor *v, const char 
*name, void *opaque,
 Error **errp)
 {
 Property *prop = opaque;
-TPMBackend *s, **be = qdev_get_prop_ptr(obj, prop);
+TPMBackend *s, **be = object_field_prop_ptr(obj, prop);
 char *str;
 
    if (!visit_type_str(v, name, &str, errp)) {
@@ -67,7 +67,7 @@ static void set_tpm(Object *obj, Visitor *v, const char 
*name, void *opaque,
 static void release_tpm(Object *obj, const char *name, void *opaque)
 {
 Property *prop = opaque;
-TPMBackend **be = qdev_get_prop_ptr(obj, prop);
+TPMBackend **be = object_field_prop_ptr(obj, prop);
 
 if (*be) {
 tpm_backend_reset(*be);
diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
index bd1aef63a7..718d886e5c 100644
--- a/hw/block/xen-block.c
+++ b/hw/block/xen-block.c
@@ -336,7 +336,7 @@ static void xen_block_get_vdev(Object *obj, Visitor *v, 
const char *name,
void *opaque, Error **errp)
 {
 Property *prop = opaque;
-XenBlockVdev *vdev = qdev_get_prop_ptr(obj, prop);
+XenBlockVdev *vdev = object_field_prop_ptr(obj, prop);
 char *str;
 
 switch (vdev->type) {
@@ -396,7 +396,7 @@ static void xen_block_set_vdev(Object *obj, Visitor *v, 
const char *name,
void *opaque, Error **errp)
 {
 Property *prop = opaque;
-XenBlockVdev *vdev = qdev_get_prop_ptr(obj, prop);
+XenBlockVdev *vdev = object_field_prop_ptr(obj, prop);
 char *str, *p;
 const char *end;
 
diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index 96a0bc5109..8781b856d3 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -62,7 +62,7 @@ static void get_drive(Object *obj, Visitor *v, const char 
*name, void *opaque,
   Error **errp)
 {
 Property *prop = opaque;
-void **ptr = qdev_get_prop_ptr(obj, prop);
+void **ptr = object_field_prop_ptr(obj, prop);
 const char *value;
 char *p;
 
@@ -88,7 +88,7 @@ static void set_drive_helper(Object *obj, Visitor *v, const 
char *name,
 {
 DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-void **ptr = qdev_get_prop_ptr(obj, prop);
+void **ptr = object_field_prop_ptr(obj, prop);
 char *str;
 BlockBackend *blk;
 bool blk_created = false;
@@ -181,7 +181,7 @@ static void release_drive(Object *obj, const char *name, 
void *opaque)
 {
 DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-BlockBackend **ptr = qdev_get_prop_ptr(obj, prop);
+BlockBackend **ptr = object_field_prop_ptr(obj, prop);
 
 if (*ptr) {
 AioContext *ctx = blk_get_aio_context(*ptr);
@@ -214,7 +214,7 @@ const PropertyInfo 

[PATCH v2 22/44] qdev: Move dev->realized check to qdev_property_set()

2020-11-04 Thread Eduardo Habkost
Every single qdev property setter function manually checks
dev->realized.  We can just check dev->realized inside
qdev_property_set() instead.

The check is being added as a separate function
(qdev_prop_allow_set()) because it will become a callback later.
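
A minimal sketch of the new check (the exact signature here is my guess;
see the patch below for the real code):

    static bool qdev_prop_allow_set(Object *obj, const char *name,
                                    Error **errp)
    {
        DeviceState *dev = DEVICE(obj);

        if (dev->realized) {
            qdev_prop_set_after_realize(dev, name, errp);
            return false;
        }
        return true;
    }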

Signed-off-by: Eduardo Habkost 
---
Changes v1 -> v2:
* Removed unused variable at xen_block_set_vdev()
* Redone patch after changes in the previous patches in the
  series
---
Cc: Stefan Berger 
Cc: Stefano Stabellini 
Cc: Anthony Perard 
Cc: Paul Durrant 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Paolo Bonzini 
Cc: "Daniel P. Berrangé" 
Cc: Eduardo Habkost 
Cc: Cornelia Huck 
Cc: Halil Pasic 
Cc: Christian Borntraeger 
Cc: Richard Henderson 
Cc: David Hildenbrand 
Cc: Thomas Huth 
Cc: Matthew Rosato 
Cc: Alex Williamson 
Cc: Mark Cave-Ayland 
Cc: Artyom Tarasenko 
Cc: qemu-de...@nongnu.org
Cc: xen-de...@lists.xenproject.org
Cc: qemu-block@nongnu.org
Cc: qemu-s3...@nongnu.org
---
 backends/tpm/tpm_util.c  |   6 --
 hw/block/xen-block.c |   6 --
 hw/core/qdev-properties-system.c |  70 --
 hw/core/qdev-properties.c| 100 ++-
 hw/s390x/css.c   |   6 --
 hw/s390x/s390-pci-bus.c  |   6 --
 hw/vfio/pci-quirks.c |   6 --
 target/sparc/cpu.c   |   6 --
 8 files changed, 18 insertions(+), 188 deletions(-)

diff --git a/backends/tpm/tpm_util.c b/backends/tpm/tpm_util.c
index dba2f6b04a..0b07cf55ea 100644
--- a/backends/tpm/tpm_util.c
+++ b/backends/tpm/tpm_util.c
@@ -46,16 +46,10 @@ static void get_tpm(Object *obj, Visitor *v, const char 
*name, void *opaque,
 static void set_tpm(Object *obj, Visitor *v, const char *name, void *opaque,
 Error **errp)
 {
-DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
 TPMBackend *s, **be = qdev_get_prop_ptr(obj, prop);
 char *str;
 
-if (dev->realized) {
-qdev_prop_set_after_realize(dev, name, errp);
-return;
-}
-
    if (!visit_type_str(v, name, &str, errp)) {
 return;
 }
diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
index 905e4acd97..bd1aef63a7 100644
--- a/hw/block/xen-block.c
+++ b/hw/block/xen-block.c
@@ -395,17 +395,11 @@ static int vbd_name_to_disk(const char *name, const char 
**endp,
 static void xen_block_set_vdev(Object *obj, Visitor *v, const char *name,
void *opaque, Error **errp)
 {
-DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
 XenBlockVdev *vdev = qdev_get_prop_ptr(obj, prop);
 char *str, *p;
 const char *end;
 
-if (dev->realized) {
-qdev_prop_set_after_realize(dev, name, errp);
-return;
-}
-
    if (!visit_type_str(v, name, &str, errp)) {
 return;
 }
diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index 202abd0e4b..0d3e57bba0 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -94,11 +94,6 @@ static void set_drive_helper(Object *obj, Visitor *v, const 
char *name,
 bool blk_created = false;
 int ret;
 
-if (dev->realized) {
-qdev_prop_set_after_realize(dev, name, errp);
-return;
-}
-
    if (!visit_type_str(v, name, &str, errp)) {
 return;
 }
@@ -230,17 +225,11 @@ static void get_chr(Object *obj, Visitor *v, const char 
*name, void *opaque,
 static void set_chr(Object *obj, Visitor *v, const char *name, void *opaque,
 Error **errp)
 {
-DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
 CharBackend *be = qdev_get_prop_ptr(obj, prop);
 Chardev *s;
 char *str;
 
-if (dev->realized) {
-qdev_prop_set_after_realize(dev, name, errp);
-return;
-}
-
    if (!visit_type_str(v, name, &str, errp)) {
 return;
 }
@@ -311,18 +300,12 @@ static void get_mac(Object *obj, Visitor *v, const char 
*name, void *opaque,
 static void set_mac(Object *obj, Visitor *v, const char *name, void *opaque,
 Error **errp)
 {
-DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
 MACAddr *mac = qdev_get_prop_ptr(obj, prop);
 int i, pos;
 char *str;
 const char *p;
 
-if (dev->realized) {
-qdev_prop_set_after_realize(dev, name, errp);
-return;
-}
-
    if (!visit_type_str(v, name, &str, errp)) {
 return;
 }
@@ -390,7 +373,6 @@ static void get_netdev(Object *obj, Visitor *v, const char 
*name,
 static void set_netdev(Object *obj, Visitor *v, const char *name,
void *opaque, Error **errp)
 {
-DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
 NICPeers *peers_ptr = qdev_get_prop_ptr(obj, prop);
 NetClientState **ncs = peers_ptr->ncs;
@@ -398,11 +380,6 @@ static void set_netdev(Object *obj, Visitor *v, const char 
*name,
 int queues, err = 0, i = 0;
 char *str;
 
-if (dev->realized) {
-

[PATCH v2 09/44] qdev: Make qdev_get_prop_ptr() get Object* arg

2020-11-04 Thread Eduardo Habkost
Make the code more generic and not specific to TYPE_DEVICE.

Reviewed-by: Marc-André Lureau 
Signed-off-by: Eduardo Habkost 
---
Changes v1 -> v2:
- Fix build error with CONFIG_XEN
  I took the liberty of keeping the Reviewed-by line from
  Marc-André as the build fix is a trivial one line change
---
Cc: Stefan Berger 
Cc: Stefano Stabellini 
Cc: Anthony Perard 
Cc: Paul Durrant 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Paolo Bonzini 
Cc: "Daniel P. Berrangé" 
Cc: Eduardo Habkost 
Cc: Cornelia Huck 
Cc: Thomas Huth 
Cc: Richard Henderson 
Cc: David Hildenbrand 
Cc: Halil Pasic 
Cc: Christian Borntraeger 
Cc: Matthew Rosato 
Cc: Alex Williamson 
Cc: qemu-de...@nongnu.org
Cc: xen-de...@lists.xenproject.org
Cc: qemu-block@nongnu.org
Cc: qemu-s3...@nongnu.org
---
 include/hw/qdev-properties.h |  2 +-
 backends/tpm/tpm_util.c  |  8 ++--
 hw/block/xen-block.c |  5 +-
 hw/core/qdev-properties-system.c | 57 +-
 hw/core/qdev-properties.c| 82 +---
 hw/s390x/css.c   |  5 +-
 hw/s390x/s390-pci-bus.c  |  4 +-
 hw/vfio/pci-quirks.c |  5 +-
 8 files changed, 68 insertions(+), 100 deletions(-)

diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
index 0ea822e6a7..0b92cfc761 100644
--- a/include/hw/qdev-properties.h
+++ b/include/hw/qdev-properties.h
@@ -302,7 +302,7 @@ void qdev_prop_set_macaddr(DeviceState *dev, const char 
*name,
const uint8_t *value);
 void qdev_prop_set_enum(DeviceState *dev, const char *name, int value);
 
-void *qdev_get_prop_ptr(DeviceState *dev, Property *prop);
+void *qdev_get_prop_ptr(Object *obj, Property *prop);
 
 void qdev_prop_register_global(GlobalProperty *prop);
 const GlobalProperty *qdev_find_global_prop(DeviceState *dev,
diff --git a/backends/tpm/tpm_util.c b/backends/tpm/tpm_util.c
index b58d298c1a..e91c21dd4a 100644
--- a/backends/tpm/tpm_util.c
+++ b/backends/tpm/tpm_util.c
@@ -35,8 +35,7 @@
 static void get_tpm(Object *obj, Visitor *v, const char *name, void *opaque,
 Error **errp)
 {
-DeviceState *dev = DEVICE(obj);
-TPMBackend **be = qdev_get_prop_ptr(dev, opaque);
+TPMBackend **be = qdev_get_prop_ptr(obj, opaque);
 char *p;
 
 p = g_strdup(*be ? (*be)->id : "");
@@ -49,7 +48,7 @@ static void set_tpm(Object *obj, Visitor *v, const char 
*name, void *opaque,
 {
 DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-TPMBackend *s, **be = qdev_get_prop_ptr(dev, prop);
+TPMBackend *s, **be = qdev_get_prop_ptr(obj, prop);
 char *str;
 
 if (dev->realized) {
@@ -73,9 +72,8 @@ static void set_tpm(Object *obj, Visitor *v, const char 
*name, void *opaque,
 
 static void release_tpm(Object *obj, const char *name, void *opaque)
 {
-DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-TPMBackend **be = qdev_get_prop_ptr(dev, prop);
+TPMBackend **be = qdev_get_prop_ptr(obj, prop);
 
 if (*be) {
 tpm_backend_reset(*be);
diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
index 8a7a3f5452..905e4acd97 100644
--- a/hw/block/xen-block.c
+++ b/hw/block/xen-block.c
@@ -335,9 +335,8 @@ static char *disk_to_vbd_name(unsigned int disk)
 static void xen_block_get_vdev(Object *obj, Visitor *v, const char *name,
void *opaque, Error **errp)
 {
-DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-XenBlockVdev *vdev = qdev_get_prop_ptr(dev, prop);
+XenBlockVdev *vdev = qdev_get_prop_ptr(obj, prop);
 char *str;
 
 switch (vdev->type) {
@@ -398,7 +397,7 @@ static void xen_block_set_vdev(Object *obj, Visitor *v, 
const char *name,
 {
 DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-XenBlockVdev *vdev = qdev_get_prop_ptr(dev, prop);
+XenBlockVdev *vdev = qdev_get_prop_ptr(obj, prop);
 char *str, *p;
 const char *end;
 
diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index d0fb063a49..c8c73c371b 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -59,9 +59,8 @@ static bool check_prop_still_unset(DeviceState *dev, const 
char *name,
 static void get_drive(Object *obj, Visitor *v, const char *name, void *opaque,
   Error **errp)
 {
-DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-void **ptr = qdev_get_prop_ptr(dev, prop);
+void **ptr = qdev_get_prop_ptr(obj, prop);
 const char *value;
 char *p;
 
@@ -87,7 +86,7 @@ static void set_drive_helper(Object *obj, Visitor *v, const 
char *name,
 {
 DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-void **ptr = qdev_get_prop_ptr(dev, prop);
+void **ptr = qdev_get_prop_ptr(obj, prop);
 char *str;
 BlockBackend *blk;
 bool blk_created = false;
@@ -185,7 +184,7 @@ static void release_drive(Object *obj, const char *name, 
void *opaque)
 {
 DeviceState *dev 

Libvirt driver iothread property for virtio-scsi disks

2020-11-04 Thread Nir Soffer
The docs[1] say:

- The optional iothread attribute assigns the disk to an IOThread as defined by
  the range for the domain iothreads value. Multiple disks may be assigned to
  the same IOThread and are numbered from 1 to the domain iothreads value.
  Available for a disk device target configured to use "virtio" bus and "pci"
  or "ccw" address types. Since 1.2.8 (QEMU 2.1)

Does it mean that virtio-scsi disks do not use iothreads?

I'm experiencing horrible performance using nested VMs (up to 2 levels of
nesting) when accessing NFS storage running on one of the VMs. The NFS
server is using a scsi disk.

My theory is:
- Writing to NFS server is very slow (too much nesting, slow disk)
- Not using iothreads (because we don't use virtio?)
- Guest CPU is blocked by slow I/O

Does this make sense?

[1] https://libvirt.org/formatdomain.html#hard-drives-floppy-disks-cdroms

Nir




Re: [PATCH] Prefer 'on' | 'off' over 'yes' | 'no' for bool options

2020-11-04 Thread Kevin Wolf
On 04.11.2020 at 15:05, Daniel P. Berrangé wrote:
> Update some docs and test cases to use 'on' | 'off' as the preferred
> value for bool options.
> 
> Signed-off-by: Daniel P. Berrangé 
> ---
>  docs/system/vnc-security.rst | 6 +++---
>  include/authz/listfile.h | 2 +-
>  qemu-options.hx  | 4 ++--
>  tests/qemu-iotests/233   | 4 ++--
>  4 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/docs/system/vnc-security.rst b/docs/system/vnc-security.rst
> index b237b07330..e97b42dfdc 100644
> --- a/docs/system/vnc-security.rst
> +++ b/docs/system/vnc-security.rst
> @@ -89,7 +89,7 @@ but with ``verify-peer`` set to ``yes`` instead.

Looks like it's not only in the examples, but the text has a mention of
"yes", too. It seems to be the only one, but we'll probably want to
replace it there as well.

There is also an example of verify-peer=no in the file which this patch
misses.

Kevin




[PULL 33/33] util/vfio-helpers: Assert offset is aligned to page size

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

mmap(2) states:

  'offset' must be a multiple of the page size as returned
   by sysconf(_SC_PAGE_SIZE).

Add an assertion to be sure we don't break this contract.

Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201103020733.2303148-8-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 util/vfio-helpers.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 73f7bfa754..804768d5c6 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -162,6 +162,7 @@ void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
 Error **errp)
 {
 void *p;
+assert(QEMU_IS_ALIGNED(offset, qemu_real_host_page_size));
 assert_bar_index_valid(s, index);
 p = mmap(NULL, MIN(size, s->bar_region_info[index].size - offset),
  prot, MAP_SHARED,
-- 
2.28.0



[PULL 32/33] util/vfio-helpers: Convert vfio_dump_mapping to trace events

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

The QEMU_VFIO_DEBUG definition is only modifiable at build-time.
Trace events can be enabled at run-time. As we prefer the latter,
convert qemu_vfio_dump_mappings() to use trace events instead
of fprintf().
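
The new event can then be turned on at run time without a rebuild, for
example via the -trace option:

    $ qemu-system-x86_64 -trace qemu_vfio_dump_mapping ...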

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201103020733.2303148-7-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 util/vfio-helpers.c | 19 ---
 util/trace-events   |  1 +
 2 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index c24a510df8..73f7bfa754 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -521,23 +521,12 @@ QEMUVFIOState *qemu_vfio_open_pci(const char *device, 
Error **errp)
 return s;
 }
 
-static void qemu_vfio_dump_mapping(IOVAMapping *m)
-{
-if (QEMU_VFIO_DEBUG) {
-printf("  vfio mapping %p %" PRIx64 " to %" PRIx64 "\n", m->host,
-   (uint64_t)m->size, (uint64_t)m->iova);
-}
-}
-
 static void qemu_vfio_dump_mappings(QEMUVFIOState *s)
 {
-int i;
-
-if (QEMU_VFIO_DEBUG) {
-printf("vfio mappings\n");
-for (i = 0; i < s->nr_mappings; ++i) {
-qemu_vfio_dump_mapping(&s->mappings[i]);
-}
+for (int i = 0; i < s->nr_mappings; ++i) {
+trace_qemu_vfio_dump_mapping(s->mappings[i].host,
+ s->mappings[i].iova,
+ s->mappings[i].size);
 }
 }
 
diff --git a/util/trace-events b/util/trace-events
index 08239941cb..61e0d4bcdf 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -80,6 +80,7 @@ qemu_mutex_unlock(void *mutex, const char *file, const int 
line) "released mutex
 qemu_vfio_dma_reset_temporary(void *s) "s %p"
 qemu_vfio_ram_block_added(void *s, void *p, size_t size) "s %p host %p size 
0x%zx"
 qemu_vfio_ram_block_removed(void *s, void *p, size_t size) "s %p host %p size 
0x%zx"
+qemu_vfio_dump_mapping(void *host, uint64_t iova, size_t size) "vfio mapping 
%p to iova 0x%08" PRIx64 " size 0x%zx"
 qemu_vfio_find_mapping(void *s, void *p) "s %p host %p"
 qemu_vfio_new_mapping(void *s, void *host, size_t size, int index, uint64_t 
iova) "s %p host %p size 0x%zx index %d iova 0x%"PRIx64
 qemu_vfio_do_mapping(void *s, void *host, uint64_t iova, size_t size) "s %p 
host %p <-> iova 0x%"PRIx64 " size 0x%zx"
-- 
2.28.0



[PULL 30/33] util/vfio-helpers: Trace where BARs are mapped

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

For debugging purposes, trace where a BAR is mapped.

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201103020733.2303148-5-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 util/vfio-helpers.c | 2 ++
 util/trace-events   | 1 +
 2 files changed, 3 insertions(+)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index cd6287c3a9..278c54902e 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -166,6 +166,8 @@ void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
 p = mmap(NULL, MIN(size, s->bar_region_info[index].size - offset),
  prot, MAP_SHARED,
  s->device, s->bar_region_info[index].offset + offset);
+trace_qemu_vfio_pci_map_bar(index, s->bar_region_info[index].offset ,
+size, offset, p);
 if (p == MAP_FAILED) {
 error_setg_errno(errp, errno, "Failed to map BAR region");
 p = NULL;
diff --git a/util/trace-events b/util/trace-events
index 0753e4a0ed..52d43cda3f 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -88,3 +88,4 @@ qemu_vfio_dma_unmap(void *s, void *host) "s %p host %p"
 qemu_vfio_pci_read_config(void *buf, int ofs, int size, uint64_t region_ofs, 
uint64_t region_size) "read cfg ptr %p ofs 0x%x size 0x%x (region addr 
0x%"PRIx64" size 0x%"PRIx64")"
 qemu_vfio_pci_write_config(void *buf, int ofs, int size, uint64_t region_ofs, 
uint64_t region_size) "write cfg ptr %p ofs 0x%x size 0x%x (region addr 
0x%"PRIx64" size 0x%"PRIx64")"
 qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t 
region_size, uint32_t cap_offset) "region '%s' addr 0x%"PRIx64" size 
0x%"PRIx64" cap_ofs 0x%"PRIx32
+qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, 
int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 
0x%x host %p"
-- 
2.28.0



[PULL 31/33] util/vfio-helpers: Improve DMA trace events

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

For debugging purpose, trace where DMA regions are mapped.

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201103020733.2303148-6-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 util/vfio-helpers.c | 3 ++-
 util/trace-events   | 5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 278c54902e..c24a510df8 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -627,7 +627,7 @@ static int qemu_vfio_do_mapping(QEMUVFIOState *s, void 
*host, size_t size,
 .vaddr = (uintptr_t)host,
 .size = size,
 };
-trace_qemu_vfio_do_mapping(s, host, size, iova);
+trace_qemu_vfio_do_mapping(s, host, iova, size);
 
 if (ioctl(s->container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
 error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
@@ -783,6 +783,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t 
size,
 }
 }
 }
+trace_qemu_vfio_dma_mapped(s, host, iova0, size);
 if (iova) {
 *iova = iova0;
 }
diff --git a/util/trace-events b/util/trace-events
index 52d43cda3f..08239941cb 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -82,8 +82,9 @@ qemu_vfio_ram_block_added(void *s, void *p, size_t size) "s 
%p host %p size 0x%z
 qemu_vfio_ram_block_removed(void *s, void *p, size_t size) "s %p host %p size 
0x%zx"
 qemu_vfio_find_mapping(void *s, void *p) "s %p host %p"
 qemu_vfio_new_mapping(void *s, void *host, size_t size, int index, uint64_t 
iova) "s %p host %p size 0x%zx index %d iova 0x%"PRIx64
-qemu_vfio_do_mapping(void *s, void *host, size_t size, uint64_t iova) "s %p 
host %p size 0x%zx iova 0x%"PRIx64
-qemu_vfio_dma_map(void *s, void *host, size_t size, bool temporary, uint64_t 
*iova) "s %p host %p size 0x%zx temporary %d iova %p"
+qemu_vfio_do_mapping(void *s, void *host, uint64_t iova, size_t size) "s %p 
host %p <-> iova 0x%"PRIx64 " size 0x%zx"
+qemu_vfio_dma_map(void *s, void *host, size_t size, bool temporary, uint64_t 
*iova) "s %p host %p size 0x%zx temporary %d  %p"
+qemu_vfio_dma_mapped(void *s, void *host, uint64_t iova, size_t size) "s %p 
host %p <-> iova 0x%"PRIx64" size 0x%zx"
 qemu_vfio_dma_unmap(void *s, void *host) "s %p host %p"
 qemu_vfio_pci_read_config(void *buf, int ofs, int size, uint64_t region_ofs, 
uint64_t region_size) "read cfg ptr %p ofs 0x%x size 0x%x (region addr 
0x%"PRIx64" size 0x%"PRIx64")"
 qemu_vfio_pci_write_config(void *buf, int ofs, int size, uint64_t region_ofs, 
uint64_t region_size) "write cfg ptr %p ofs 0x%x size 0x%x (region addr 
0x%"PRIx64" size 0x%"PRIx64")"
-- 
2.28.0



[PULL 29/33] util/vfio-helpers: Trace PCI BAR region info

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

For debug purpose, trace BAR regions info.

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201103020733.2303148-4-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 util/vfio-helpers.c | 8 
 util/trace-events   | 1 +
 2 files changed, 9 insertions(+)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 1d4efafcaa..cd6287c3a9 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -136,6 +136,7 @@ static inline void assert_bar_index_valid(QEMUVFIOState *s, 
int index)
 
 static int qemu_vfio_pci_init_bar(QEMUVFIOState *s, int index, Error **errp)
 {
+g_autofree char *barname = NULL;
 assert_bar_index_valid(s, index);
 s->bar_region_info[index] = (struct vfio_region_info) {
 .index = VFIO_PCI_BAR0_REGION_INDEX + index,
@@ -145,6 +146,10 @@ static int qemu_vfio_pci_init_bar(QEMUVFIOState *s, int 
index, Error **errp)
 error_setg_errno(errp, errno, "Failed to get BAR region info");
 return -errno;
 }
+barname = g_strdup_printf("bar[%d]", index);
+trace_qemu_vfio_region_info(barname, s->bar_region_info[index].offset,
+s->bar_region_info[index].size,
+s->bar_region_info[index].cap_offset);
 
 return 0;
 }
@@ -416,6 +421,9 @@ static int qemu_vfio_init_pci(QEMUVFIOState *s, const char 
*device,
 ret = -errno;
 goto fail;
 }
+trace_qemu_vfio_region_info("config", s->config_region_info.offset,
+s->config_region_info.size,
+s->config_region_info.cap_offset);
 
 for (i = 0; i < ARRAY_SIZE(s->bar_region_info); i++) {
 ret = qemu_vfio_pci_init_bar(s, i, errp);
diff --git a/util/trace-events b/util/trace-events
index 8d3615e717..0753e4a0ed 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -87,3 +87,4 @@ qemu_vfio_dma_map(void *s, void *host, size_t size, bool 
temporary, uint64_t *io
 qemu_vfio_dma_unmap(void *s, void *host) "s %p host %p"
 qemu_vfio_pci_read_config(void *buf, int ofs, int size, uint64_t region_ofs, 
uint64_t region_size) "read cfg ptr %p ofs 0x%x size 0x%x (region addr 
0x%"PRIx64" size 0x%"PRIx64")"
 qemu_vfio_pci_write_config(void *buf, int ofs, int size, uint64_t region_ofs, 
uint64_t region_size) "write cfg ptr %p ofs 0x%x size 0x%x (region addr 
0x%"PRIx64" size 0x%"PRIx64")"
+qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t 
region_size, uint32_t cap_offset) "region '%s' addr 0x%"PRIx64" size 
0x%"PRIx64" cap_ofs 0x%"PRIx32
-- 
2.28.0



[PULL 27/33] util/vfio-helpers: Improve reporting unsupported IOMMU type

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Change the confusing "VFIO IOMMU check failed" error message to
the explicit "VFIO IOMMU Type1 is not supported" one.

Example on POWER:

 $ qemu-system-ppc64 -drive 
if=none,id=nvme0,file=nvme://0001:01:00.0/1,format=raw
 qemu-system-ppc64: -drive 
if=none,id=nvme0,file=nvme://0001:01:00.0/1,format=raw: VFIO IOMMU Type1 is not 
supported
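
The check itself is a one-liner; a sketch assuming <linux/vfio.h> (note
that VFIO_CHECK_EXTENSION reports lack of support by returning 0 rather
than failing with an errno, which is why the message text has to carry
the information):

    #include <stdbool.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static bool vfio_type1_iommu_supported(int container_fd)
    {
        /* Returns > 0 when the Type1 IOMMU backend is available. */
        return ioctl(container_fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU) > 0;
    }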

Suggested-by: Alex Williamson 
Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201103020733.2303148-2-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 util/vfio-helpers.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index c469beb061..14a549510f 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -300,7 +300,7 @@ static int qemu_vfio_init_pci(QEMUVFIOState *s, const char 
*device,
 }
 
 if (!ioctl(s->container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
-error_setg_errno(errp, errno, "VFIO IOMMU check failed");
+error_setg_errno(errp, errno, "VFIO IOMMU Type1 is not supported");
 ret = -EINVAL;
 goto fail_container;
 }
-- 
2.28.0



[PULL 24/33] block/nvme: Align iov's va and size on host page size

2020-11-04 Thread Stefan Hajnoczi
From: Eric Auger 

Make sure each iov's VA and size are properly aligned on the
host page size.
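
The rounding itself is the usual power-of-two round-up idiom; a standalone
sketch (ALIGN_UP mirrors QEMU's QEMU_ALIGN_UP, and sysconf() stands in for
qemu_real_host_page_size):

    #include <stdio.h>
    #include <unistd.h>

    /* Round x up to the next multiple of a (a must be a power of two). */
    #define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

    int main(void)
    {
        size_t pg = (size_t)sysconf(_SC_PAGE_SIZE);
        size_t iov_len = 5000;   /* an unaligned I/O vector length */

        printf("len 0x%zx -> DMA-mapped length 0x%zx (page 0x%zx)\n",
               iov_len, ALIGN_UP(iov_len, pg), pg);
        return 0;
    }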

Signed-off-by: Eric Auger 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-23-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index e807dd56df..f1e2fd34cd 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1015,11 +1015,12 @@ static coroutine_fn int 
nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd,
 for (i = 0; i < qiov->niov; ++i) {
 bool retry = true;
 uint64_t iova;
+size_t len = QEMU_ALIGN_UP(qiov->iov[i].iov_len,
+   qemu_real_host_page_size);
 try_map:
 r = qemu_vfio_dma_map(s->vfio,
   qiov->iov[i].iov_base,
-  qiov->iov[i].iov_len,
-  true, &iova);
+  len, true, &iova);
 if (r == -ENOMEM && retry) {
 retry = false;
 trace_nvme_dma_flush_queue_wait(s);
@@ -1163,8 +1164,9 @@ static inline bool nvme_qiov_aligned(BlockDriverState *bs,
 BDRVNVMeState *s = bs->opaque;
 
 for (i = 0; i < qiov->niov; ++i) {
-if (!QEMU_PTR_IS_ALIGNED(qiov->iov[i].iov_base, s->page_size) ||
-!QEMU_IS_ALIGNED(qiov->iov[i].iov_len, s->page_size)) {
+if (!QEMU_PTR_IS_ALIGNED(qiov->iov[i].iov_base,
+ qemu_real_host_page_size) ||
+!QEMU_IS_ALIGNED(qiov->iov[i].iov_len, qemu_real_host_page_size)) {
 trace_nvme_qiov_unaligned(qiov, i, qiov->iov[i].iov_base,
   qiov->iov[i].iov_len, s->page_size);
 return false;
@@ -1180,7 +1182,7 @@ static int nvme_co_prw(BlockDriverState *bs, uint64_t 
offset, uint64_t bytes,
 int r;
 uint8_t *buf = NULL;
 QEMUIOVector local_qiov;
-
+size_t len = QEMU_ALIGN_UP(bytes, qemu_real_host_page_size);
 assert(QEMU_IS_ALIGNED(offset, s->page_size));
 assert(QEMU_IS_ALIGNED(bytes, s->page_size));
 assert(bytes <= s->max_transfer);
@@ -1190,7 +1192,7 @@ static int nvme_co_prw(BlockDriverState *bs, uint64_t 
offset, uint64_t bytes,
 }
 s->stats.unaligned_accesses++;
 trace_nvme_prw_buffered(s, offset, bytes, qiov->niov, is_write);
-buf = qemu_try_memalign(s->page_size, bytes);
+buf = qemu_try_memalign(qemu_real_host_page_size, len);
 
 if (!buf) {
 return -ENOMEM;
-- 
2.28.0



[PULL 26/33] block/nvme: Fix nvme_submit_command() on big-endian host

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

The Completion Queue Command Identifier is a 16-bit value,
so nvme_submit_command() is unlikely to work on big-endian
hosts, as the relevant bits are truncated.
Fix by using the correct byte-swap function.
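
A standalone sketch of the failure mode (assuming glibc's <endian.h>; on a
little-endian host both variants happen to print the same value, which is
why the bug went unnoticed):

    #include <endian.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t cid = 0x1234;
        uint16_t wrong, right;

        /* On big-endian, htole32(0x1234) == 0x34120000; assigning that to
         * a 16-bit field keeps only the low half, 0x0000: the CID is lost. */
        wrong = (uint16_t)htole32(cid);
        /* htole16 swaps within 16 bits: 0x3412 on BE, 0x1234 on LE. */
        right = htole16(cid);

        printf("wrong 0x%04x, right 0x%04x\n", wrong, right);
        return 0;
    }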

Fixes: bdd6a90a9e5 ("block: Add VFIO based NVMe driver")
Reported-by: Keith Busch 
Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Message-id: 20201029093306.1063879-25-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index c8ef69cbb2..a06a188d53 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -469,7 +469,7 @@ static void nvme_submit_command(NVMeQueuePair *q, 
NVMeRequest *req,
 assert(!req->cb);
 req->cb = cb;
 req->opaque = opaque;
-cmd->cid = cpu_to_le32(req->cid);
+cmd->cid = cpu_to_le16(req->cid);
 
 trace_nvme_submit_command(q->s, q->index, req->cid);
 nvme_trace_command(cmd);
-- 
2.28.0



[PULL 25/33] block/nvme: Fix use of write-only doorbells page on Aarch64 arch

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

qemu_vfio_pci_map_bar() calls mmap(), and mmap(2) states:

  'offset' must be a multiple of the page size as returned
   by sysconf(_SC_PAGE_SIZE).

In commit f68453237b9 we started to use an offset of 4K, which
broke this contract on the Aarch64 architecture.

Fix by mapping at offset 0 and accessing the doorbells at offset 4K.
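
A minimal standalone sketch of the underlying constraint (plain mmap(2)
semantics, nothing QEMU-specific):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long pg = sysconf(_SC_PAGE_SIZE);   /* 4K on x86, often 64K on Aarch64 */
        long ofs = 4096;                    /* sizeof(NvmeBar): doorbells start here */

        /* mmap(2) requires 'offset' to be page-aligned, so on a 64K-page host
         * mapping at offset 4096 fails; map at 0 and offset the pointer instead. */
        printf("page size %ld: offset %ld is %saligned\n",
               pg, ofs, (ofs % pg == 0) ? "" : "NOT ");
        return 0;
    }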

Fixes: f68453237b9 ("block/nvme: Map doorbells pages write-only")
Reported-by: Eric Auger 
Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-24-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index f1e2fd34cd..c8ef69cbb2 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -94,6 +94,7 @@ typedef struct {
 struct BDRVNVMeState {
 AioContext *aio_context;
 QEMUVFIOState *vfio;
+void *bar0_wo_map;
 /* Memory mapped registers */
 volatile struct {
 uint32_t sq_tail;
@@ -777,8 +778,10 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 }
 }
 
-s->doorbells = qemu_vfio_pci_map_bar(s->vfio, 0, sizeof(NvmeBar),
- NVME_DOORBELL_SIZE, PROT_WRITE, errp);
+s->bar0_wo_map = qemu_vfio_pci_map_bar(s->vfio, 0, 0,
+   sizeof(NvmeBar) + 
NVME_DOORBELL_SIZE,
+   PROT_WRITE, errp);
+s->doorbells = (void *)((uintptr_t)s->bar0_wo_map + sizeof(NvmeBar));
 if (!s->doorbells) {
 ret = -EINVAL;
 goto out;
@@ -910,8 +913,8 @@ static void nvme_close(BlockDriverState *bs)
&s->irq_notifier[MSIX_SHARED_IRQ_IDX],
false, NULL, NULL);
 event_notifier_cleanup(&s->irq_notifier[MSIX_SHARED_IRQ_IDX]);
-qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->doorbells,
-sizeof(NvmeBar), NVME_DOORBELL_SIZE);
+qemu_vfio_pci_unmap_bar(s->vfio, 0, s->bar0_wo_map,
+0, sizeof(NvmeBar) + NVME_DOORBELL_SIZE);
 qemu_vfio_close(s->vfio);
 
 g_free(s->device);
-- 
2.28.0



[PULL 19/33] block/nvme: Set request_alignment at initialization

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Commit bdd6a90a9e5 ("block: Add VFIO based NVMe driver")
sets the request_alignment in nvme_refresh_limits().
For consistency, also set it during initialization.

Reported-by: Stefan Hajnoczi 
Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-18-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/nvme.c b/block/nvme.c
index cd87ca..bb75448a09 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -758,6 +758,7 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 s->page_size = MAX(4096, 1 << NVME_CAP_MPSMIN(cap));
 s->doorbell_scale = (4 << NVME_CAP_DSTRD(cap)) / sizeof(uint32_t);
 bs->bl.opt_mem_alignment = s->page_size;
+bs->bl.request_alignment = s->page_size;
 timeout_ms = MIN(500 * NVME_CAP_TO(cap), 3);
 
 /* Reset device to get a clean state. */
-- 
2.28.0



[PULL 28/33] util/vfio-helpers: Trace PCI I/O config accesses

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

We sometimes get kernel panics with some devices on Aarch64
hosts. Alex Williamson suggests it might be a broken PCIe
root complex. Add trace events to record the latest I/O
access before crashing. Just in case, also assert that our
accesses are aligned.
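
The instrumented access pattern looks roughly like this standalone sketch
(cfg_read() is a hypothetical name; the trace calls and the VFIO region
bookkeeping from the patch are elided):

    #include <assert.h>
    #include <errno.h>
    #include <unistd.h>

    static ssize_t cfg_read(int fd, void *buf, size_t size,
                            off_t region_ofs, off_t ofs)
    {
        ssize_t ret;

        /* Catch unaligned config accesses before they reach the device. */
        assert((region_ofs + ofs) % size == 0);
        do {
            ret = pread(fd, buf, size, region_ofs + ofs);
        } while (ret == -1 && errno == EINTR);
        return ret;
    }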

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201103020733.2303148-3-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 util/vfio-helpers.c | 8 
 util/trace-events   | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 14a549510f..1d4efafcaa 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -227,6 +227,10 @@ static int qemu_vfio_pci_read_config(QEMUVFIOState *s, 
void *buf,
 {
 int ret;
 
+trace_qemu_vfio_pci_read_config(buf, ofs, size,
+s->config_region_info.offset,
+s->config_region_info.size);
+assert(QEMU_IS_ALIGNED(s->config_region_info.offset + ofs, size));
 do {
 ret = pread(s->device, buf, size, s->config_region_info.offset + ofs);
 } while (ret == -1 && errno == EINTR);
@@ -237,6 +241,10 @@ static int qemu_vfio_pci_write_config(QEMUVFIOState *s, 
void *buf, int size, int
 {
 int ret;
 
+trace_qemu_vfio_pci_write_config(buf, ofs, size,
+ s->config_region_info.offset,
+ s->config_region_info.size);
+assert(QEMU_IS_ALIGNED(s->config_region_info.offset + ofs, size));
 do {
 ret = pwrite(s->device, buf, size, s->config_region_info.offset + ofs);
 } while (ret == -1 && errno == EINTR);
diff --git a/util/trace-events b/util/trace-events
index 24c31803b0..8d3615e717 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -85,3 +85,5 @@ qemu_vfio_new_mapping(void *s, void *host, size_t size, int 
index, uint64_t iova
 qemu_vfio_do_mapping(void *s, void *host, size_t size, uint64_t iova) "s %p 
host %p size 0x%zx iova 0x%"PRIx64
 qemu_vfio_dma_map(void *s, void *host, size_t size, bool temporary, uint64_t 
*iova) "s %p host %p size 0x%zx temporary %d iova %p"
 qemu_vfio_dma_unmap(void *s, void *host) "s %p host %p"
+qemu_vfio_pci_read_config(void *buf, int ofs, int size, uint64_t region_ofs, 
uint64_t region_size) "read cfg ptr %p ofs 0x%x size 0x%x (region addr 
0x%"PRIx64" size 0x%"PRIx64")"
+qemu_vfio_pci_write_config(void *buf, int ofs, int size, uint64_t region_ofs, 
uint64_t region_size) "write cfg ptr %p ofs 0x%x size 0x%x (region addr 
0x%"PRIx64" size 0x%"PRIx64")"
-- 
2.28.0



[PULL 21/33] block/nvme: Change size and alignment of IDENTIFY response buffer

2020-11-04 Thread Stefan Hajnoczi
From: Eric Auger 

In preparation for 64kB host page support, let's change the size
and alignment of the IDENTIFY command response buffer so that
the VFIO DMA MAP succeeds. We align on the host page size.

Signed-off-by: Eric Auger 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-20-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index bd3860ac4e..7628623c05 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -522,19 +522,20 @@ static bool nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 .opcode = NVME_ADM_CMD_IDENTIFY,
 .cdw10 = cpu_to_le32(0x1),
 };
+size_t id_size = QEMU_ALIGN_UP(sizeof(*id), qemu_real_host_page_size);
 
-id = qemu_try_memalign(s->page_size, sizeof(*id));
+id = qemu_try_memalign(qemu_real_host_page_size, id_size);
 if (!id) {
 error_setg(errp, "Cannot allocate buffer for identify response");
 goto out;
 }
-r = qemu_vfio_dma_map(s->vfio, id, sizeof(*id), true, &iova);
+r = qemu_vfio_dma_map(s->vfio, id, id_size, true, &iova);
 if (r) {
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
 }
 
-memset(id, 0, sizeof(*id));
+memset(id, 0, id_size);
 cmd.dptr.prp1 = cpu_to_le64(iova);
 if (nvme_admin_cmd_sync(bs, &cmd)) {
 error_setg(errp, "Failed to identify controller");
@@ -556,7 +557,7 @@ static bool nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 s->supports_write_zeroes = !!(oncs & NVME_ONCS_WRITE_ZEROES);
 s->supports_discard = !!(oncs & NVME_ONCS_DSM);
 
-memset(id, 0, sizeof(*id));
+memset(id, 0, id_size);
 cmd.cdw10 = 0;
 cmd.nsid = cpu_to_le32(namespace);
 if (nvme_admin_cmd_sync(bs, &cmd)) {
-- 
2.28.0



[PULL 22/33] block/nvme: Change size and alignment of queue

2020-11-04 Thread Stefan Hajnoczi
From: Eric Auger 

In preparation for 64kB host page support, let's change the size
and alignment of the queue so that the VFIO DMA MAP succeeds.
We align on the host page size.

Signed-off-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-21-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 7628623c05..4a8589d2d2 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -167,9 +167,9 @@ static bool nvme_init_queue(BDRVNVMeState *s, NVMeQueue *q,
 size_t bytes;
 int r;
 
-bytes = ROUND_UP(nentries * entry_bytes, s->page_size);
+bytes = ROUND_UP(nentries * entry_bytes, qemu_real_host_page_size);
 q->head = q->tail = 0;
-q->queue = qemu_try_memalign(s->page_size, bytes);
+q->queue = qemu_try_memalign(qemu_real_host_page_size, bytes);
 if (!q->queue) {
 error_setg(errp, "Cannot allocate queue");
 return false;
-- 
2.28.0



[PULL 20/33] block/nvme: Correct minimum device page size

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

While trying to simplify the code using a macro, we forgot
the 12-bit shift. Correct that.
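
CAP.MPSMIN is an exponent, not a byte count, which is what made the bug
easy to miss; a quick standalone sketch of buggy vs. fixed:

    #include <stdio.h>

    int main(void)
    {
        /* CAP.MPSMIN encodes the minimum page size as 2^(12 + MPSMIN). */
        for (unsigned mpsmin = 0; mpsmin <= 4; mpsmin++) {
            unsigned buggy = 1u << mpsmin;          /* forgot the 12-bit shift */
            unsigned fixed = 1u << (12 + mpsmin);
            /* the old MAX(4096, buggy) masked the error as a constant 4096 */
            printf("MPSMIN=%u: buggy=%-4u fixed=%u\n", mpsmin, buggy, fixed);
        }
        return 0;
    }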

Fixes: fad1eb68862 ("block/nvme: Use register definitions from 'block/nvme.h'")
Reported-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Eric Auger 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-19-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index bb75448a09..bd3860ac4e 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -755,7 +755,7 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 goto out;
 }
 
-s->page_size = MAX(4096, 1 << NVME_CAP_MPSMIN(cap));
+s->page_size = 1u << (12 + NVME_CAP_MPSMIN(cap));
 s->doorbell_scale = (4 << NVME_CAP_DSTRD(cap)) / sizeof(uint32_t);
 bs->bl.opt_mem_alignment = s->page_size;
 bs->bl.request_alignment = s->page_size;
-- 
2.28.0



[PULL 15/33] block/nvme: Use definitions instead of magic values in add_io_queue()

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Replace magic values with definitions, and simplify, since the
number of queues will never reach 64K.

Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-14-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 6eaba4e703..7285bd2e27 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -652,6 +652,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 NvmeCmd cmd;
 unsigned queue_size = NVME_QUEUE_SIZE;
 
+assert(n <= UINT16_MAX);
 q = nvme_create_queue_pair(s, bdrv_get_aio_context(bs),
n, queue_size, errp);
 if (!q) {
@@ -660,8 +661,8 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_CQ,
 .dptr.prp1 = cpu_to_le64(q->cq.iova),
-.cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0x)),
-.cdw11 = cpu_to_le32(0x3),
+.cdw10 = cpu_to_le32(((queue_size - 1) << 16) | n),
+.cdw11 = cpu_to_le32(NVME_CQ_IEN | NVME_CQ_PC),
 };
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
 error_setg(errp, "Failed to create CQ io queue [%u]", n);
@@ -670,8 +671,8 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_SQ,
 .dptr.prp1 = cpu_to_le64(q->sq.iova),
-.cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0x)),
-.cdw11 = cpu_to_le32(0x1 | (n << 16)),
+.cdw10 = cpu_to_le32(((queue_size - 1) << 16) | n),
+.cdw11 = cpu_to_le32(NVME_SQ_PC | (n << 16)),
 };
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
 error_setg(errp, "Failed to create SQ io queue [%u]", n);
-- 
2.28.0



[PULL 14/33] block/nvme: Introduce Completion Queue definitions

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Rename Submission Queue flags with 'Sq' to differentiate
submission queue flags from command queue flags, and introduce
Completion Queue flag definitions.

Reviewed-by: Eric Auger 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Message-id: 20201029093306.1063879-13-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 include/block/nvme.h | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 8a46d9cf01..3e02d9ca98 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -501,6 +501,11 @@ typedef struct QEMU_PACKED NvmeCreateCq {
 #define NVME_CQ_FLAGS_PC(cq_flags)  (cq_flags & 0x1)
 #define NVME_CQ_FLAGS_IEN(cq_flags) ((cq_flags >> 1) & 0x1)
 
+enum NvmeFlagsCq {
+NVME_CQ_PC  = 1,
+NVME_CQ_IEN = 2,
+};
+
 typedef struct QEMU_PACKED NvmeCreateSq {
 uint8_t opcode;
 uint8_t flags;
@@ -518,12 +523,13 @@ typedef struct QEMU_PACKED NvmeCreateSq {
 #define NVME_SQ_FLAGS_PC(sq_flags)  (sq_flags & 0x1)
 #define NVME_SQ_FLAGS_QPRIO(sq_flags)   ((sq_flags >> 1) & 0x3)
 
-enum NvmeQueueFlags {
-NVME_Q_PC   = 1,
-NVME_Q_PRIO_URGENT  = 0,
-NVME_Q_PRIO_HIGH= 1,
-NVME_Q_PRIO_NORMAL  = 2,
-NVME_Q_PRIO_LOW = 3,
+enum NvmeFlagsSq {
+NVME_SQ_PC  = 1,
+
+NVME_SQ_PRIO_URGENT = 0,
+NVME_SQ_PRIO_HIGH   = 1,
+NVME_SQ_PRIO_NORMAL = 2,
+NVME_SQ_PRIO_LOW= 3,
 };
 
 typedef struct QEMU_PACKED NvmeIdentify {
-- 
2.28.0



[PULL 16/33] block/nvme: Correctly initialize Admin Queue Attributes

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

From the specification chapter 3.1.8 "AQA - Admin Queue Attributes"
the Admin Submission Queue Size field is a 0's based value:

  Admin Submission Queue Size (ASQS):

Defines the size of the Admin Submission Queue in entries.
Enabling a controller while this field is cleared to 00h
produces undefined results. The minimum size of the Admin
Submission Queue is two entries. The maximum size of the
Admin Submission Queue is 4096 entries.
This is a 0's based value.

This bug has never been hit because the device initialization
uses a single command synchronously :)
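
In other words, a queue of N entries must be programmed as N - 1. A
standalone sketch (the shift values mirror the AQA_*_SHIFT names used in
the patch and are assumed here):

    #include <stdint.h>
    #include <stdio.h>

    #define AQA_ASQS_SHIFT  0
    #define AQA_ACQS_SHIFT  16
    #define NVME_QUEUE_SIZE 128

    int main(void)
    {
        /* ASQS/ACQS are 0's based: encode a 128-entry queue as 127. */
        uint32_t aqa = ((NVME_QUEUE_SIZE - 1) << AQA_ACQS_SHIFT) |
                       ((NVME_QUEUE_SIZE - 1) << AQA_ASQS_SHIFT);

        printf("AQA = 0x%08x\n", aqa);   /* 0x007f007f */
        return 0;
    }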

Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-15-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 7285bd2e27..0902aa5542 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -789,9 +789,9 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 goto out;
 }
 s->queue_count = 1;
-QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
-regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << AQA_ACQS_SHIFT) |
-(NVME_QUEUE_SIZE << AQA_ASQS_SHIFT));
+QEMU_BUILD_BUG_ON((NVME_QUEUE_SIZE - 1) & 0xF000);
+regs->aqa = cpu_to_le32(((NVME_QUEUE_SIZE - 1) << AQA_ACQS_SHIFT) |
+((NVME_QUEUE_SIZE - 1) << AQA_ASQS_SHIFT));
 regs->asq = cpu_to_le64(s->queues[INDEX_ADMIN]->sq.iova);
 regs->acq = cpu_to_le64(s->queues[INDEX_ADMIN]->cq.iova);
 
-- 
2.28.0



[PULL 18/33] block/nvme: Simplify nvme_cmd_sync()

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

As all commands use the ADMIN queue, it is pointless to pass
it as an argument each time. Remove the argument, and rename
the function to nvme_admin_cmd_sync() to make this new
behavior clearer.

Reviewed-by: Eric Auger 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Message-id: 20201029093306.1063879-17-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index eed12f4933..cd87ca 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -481,16 +481,17 @@ static void nvme_submit_command(NVMeQueuePair *q, 
NVMeRequest *req,
 qemu_mutex_unlock(&q->lock);
 }
 
-static void nvme_cmd_sync_cb(void *opaque, int ret)
+static void nvme_admin_cmd_sync_cb(void *opaque, int ret)
 {
 int *pret = opaque;
 *pret = ret;
 aio_wait_kick();
 }
 
-static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q,
- NvmeCmd *cmd)
+static int nvme_admin_cmd_sync(BlockDriverState *bs, NvmeCmd *cmd)
 {
+BDRVNVMeState *s = bs->opaque;
+NVMeQueuePair *q = s->queues[INDEX_ADMIN];
 AioContext *aio_context = bdrv_get_aio_context(bs);
 NVMeRequest *req;
 int ret = -EINPROGRESS;
@@ -498,7 +499,7 @@ static int nvme_cmd_sync(BlockDriverState *bs, 
NVMeQueuePair *q,
 if (!req) {
 return -EBUSY;
 }
-nvme_submit_command(q, req, cmd, nvme_cmd_sync_cb, &ret);
+nvme_submit_command(q, req, cmd, nvme_admin_cmd_sync_cb, &ret);
 
 AIO_WAIT_WHILE(aio_context, ret == -EINPROGRESS);
 return ret;
@@ -535,7 +536,7 @@ static bool nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 
 memset(id, 0, sizeof(*id));
 cmd.dptr.prp1 = cpu_to_le64(iova);
-if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
+if (nvme_admin_cmd_sync(bs, &cmd)) {
 error_setg(errp, "Failed to identify controller");
 goto out;
 }
@@ -558,7 +559,7 @@ static bool nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 memset(id, 0, sizeof(*id));
 cmd.cdw10 = 0;
 cmd.nsid = cpu_to_le32(namespace);
-if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
+if (nvme_admin_cmd_sync(bs, &cmd)) {
 error_setg(errp, "Failed to identify namespace");
 goto out;
 }
@@ -664,7 +665,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | n),
 .cdw11 = cpu_to_le32(NVME_CQ_IEN | NVME_CQ_PC),
 };
-if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
+if (nvme_admin_cmd_sync(bs, &cmd)) {
 error_setg(errp, "Failed to create CQ io queue [%u]", n);
 goto out_error;
 }
@@ -674,7 +675,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | n),
 .cdw11 = cpu_to_le32(NVME_SQ_PC | (n << 16)),
 };
-if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
+if (nvme_admin_cmd_sync(bs, &cmd)) {
 error_setg(errp, "Failed to create SQ io queue [%u]", n);
 goto out_error;
 }
@@ -887,7 +888,7 @@ static int nvme_enable_disable_write_cache(BlockDriverState 
*bs, bool enable,
 .cdw11 = cpu_to_le32(enable ? 0x01 : 0x00),
 };
 
-ret = nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd);
+ret = nvme_admin_cmd_sync(bs, &cmd);
 if (ret) {
 error_setg(errp, "Failed to configure NVMe write cache");
 }
-- 
2.28.0



[PULL 17/33] block/nvme: Simplify ADMIN queue access

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

We don't need to dereference BDRVNVMeState each time.
Use an NVMeQueuePair pointer to the admin queue.
nvme_init() becomes easier to review, matching the style
of nvme_add_io_queue().

Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-16-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 0902aa5542..eed12f4933 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -700,6 +700,7 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
  Error **errp)
 {
 BDRVNVMeState *s = bs->opaque;
+NVMeQueuePair *q;
 AioContext *aio_context = bdrv_get_aio_context(bs);
 int ret;
 uint64_t cap;
@@ -781,19 +782,18 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 
 /* Set up admin queue. */
 s->queues = g_new(NVMeQueuePair *, 1);
-s->queues[INDEX_ADMIN] = nvme_create_queue_pair(s, aio_context, 0,
-  NVME_QUEUE_SIZE,
-  errp);
-if (!s->queues[INDEX_ADMIN]) {
+q = nvme_create_queue_pair(s, aio_context, 0, NVME_QUEUE_SIZE, errp);
+if (!q) {
 ret = -EINVAL;
 goto out;
 }
+s->queues[INDEX_ADMIN] = q;
 s->queue_count = 1;
 QEMU_BUILD_BUG_ON((NVME_QUEUE_SIZE - 1) & 0xF000);
 regs->aqa = cpu_to_le32(((NVME_QUEUE_SIZE - 1) << AQA_ACQS_SHIFT) |
 ((NVME_QUEUE_SIZE - 1) << AQA_ASQS_SHIFT));
-regs->asq = cpu_to_le64(s->queues[INDEX_ADMIN]->sq.iova);
-regs->acq = cpu_to_le64(s->queues[INDEX_ADMIN]->cq.iova);
+regs->asq = cpu_to_le64(q->sq.iova);
+regs->acq = cpu_to_le64(q->cq.iova);
 
 /* After setting up all control registers we can enable device now. */
 regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << CC_IOCQES_SHIFT) |
-- 
2.28.0



[PULL 12/33] block/nvme: Make nvme_identify() return boolean indicating error

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Just for consistency, following the example documented since
commit e3fe3988d7 ("error: Document Error API usage rules"),
return a boolean value indicating whether an error is set.
Directly pass errp, as local_err is not needed in our case.
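
A sketch of the convention, for QEMU-tree code (something_failed is a
placeholder for the real check; in QEMU, "qemu/osdep.h" would be included
first):

    #include "qapi/error.h"   /* QEMU's Error API */

    /* Callee: set *errp and return false on failure, true on success. */
    static bool nvme_do_identify(bool something_failed, Error **errp)
    {
        if (something_failed) {
            error_setg(errp, "Failed to identify controller");
            return false;
        }
        return true;
    }

    /* Caller: no local_err/error_propagate() dance is needed anymore. */
    static int open_device(Error **errp)
    {
        if (!nvme_do_identify(false, errp)) {
            return -EIO;
        }
        return 0;
    }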

Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Message-id: 20201029093306.1063879-11-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index c450499111..9833501245 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -506,9 +506,11 @@ static int nvme_cmd_sync(BlockDriverState *bs, 
NVMeQueuePair *q,
 return ret;
 }
 
-static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
+/* Returns true on success, false on failure. */
+static bool nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
 {
 BDRVNVMeState *s = bs->opaque;
+bool ret = false;
 union {
 NvmeIdCtrl ctrl;
 NvmeIdNs ns;
@@ -585,10 +587,13 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 goto out;
 }
 
+ret = true;
 s->blkshift = lbaf->ds;
 out:
 qemu_vfio_dma_unmap(s->vfio, id);
 qemu_vfree(id);
+
+return ret;
 }
 
 static bool nvme_poll_queue(NVMeQueuePair *q)
@@ -701,7 +706,6 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 uint64_t cap;
 uint64_t timeout_ms;
 uint64_t deadline, now;
-Error *local_err = NULL;
 volatile NvmeBar *regs = NULL;
 
 qemu_co_mutex_init(>dma_map_lock);
@@ -818,9 +822,7 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
&s->irq_notifier[MSIX_SHARED_IRQ_IDX],
false, nvme_handle_event, nvme_poll_cb);
 
-nvme_identify(bs, namespace, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
+if (!nvme_identify(bs, namespace, errp)) {
 ret = -EIO;
 goto out;
 }
-- 
2.28.0



[PULL 13/33] block/nvme: Make nvme_init_queue() return boolean indicating error

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Just for consistency, following the example documented since
commit e3fe3988d7 ("error: Document Error API usage rules"),
return a boolean value indicating whether an error is set.
Directly pass errp, as local_err is not needed in our case.
This also simplifies nvme_create_queue_pair() a bit.

Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-12-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 9833501245..6eaba4e703 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -160,7 +160,8 @@ static QemuOptsList runtime_opts = {
 },
 };
 
-static void nvme_init_queue(BDRVNVMeState *s, NVMeQueue *q,
+/* Returns true on success, false on failure. */
+static bool nvme_init_queue(BDRVNVMeState *s, NVMeQueue *q,
 unsigned nentries, size_t entry_bytes, Error 
**errp)
 {
 size_t bytes;
@@ -171,13 +172,15 @@ static void nvme_init_queue(BDRVNVMeState *s, NVMeQueue 
*q,
 q->queue = qemu_try_memalign(s->page_size, bytes);
 if (!q->queue) {
 error_setg(errp, "Cannot allocate queue");
-return;
+return false;
 }
 memset(q->queue, 0, bytes);
 r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, false, &q->iova);
 if (r) {
 error_setg(errp, "Cannot map queue");
+return false;
 }
+return true;
 }
 
 static void nvme_free_queue_pair(NVMeQueuePair *q)
@@ -210,7 +213,6 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState 
*s,
  Error **errp)
 {
 int i, r;
-Error *local_err = NULL;
 NVMeQueuePair *q;
 uint64_t prp_list_iova;
 
@@ -247,16 +249,12 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BDRVNVMeState *s,
 req->prp_list_iova = prp_list_iova + i * s->page_size;
 }
 
-nvme_init_queue(s, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
+if (!nvme_init_queue(s, &q->sq, size, NVME_SQ_ENTRY_BYTES, errp)) {
 goto fail;
 }
 q->sq.doorbell = &s->doorbells[idx * s->doorbell_scale].sq_tail;
 
-nvme_init_queue(s, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
+if (!nvme_init_queue(s, &q->cq, size, NVME_CQ_ENTRY_BYTES, errp)) {
 goto fail;
 }
 q->cq.doorbell = &s->doorbells[idx * s->doorbell_scale].cq_head;
-- 
2.28.0



[PULL 06/33] block/nvme: Trace controller capabilities

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Controllers have different capabilities and report them in the
CAP register. We are particularly interested in the page size
limits.
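
The traced values are simple bit-field extractions from the 64-bit CAP
register; a standalone sketch (bit positions as recalled from the NVMe
spec: MQES 15:0, MPSMIN 51:48, MPSMAX 55:52; double-check against the
spec):

    #include <stdint.h>
    #include <stdio.h>

    #define CAP_MQES(cap)    ((unsigned)((cap) & 0xffff))
    #define CAP_MPSMIN(cap)  ((unsigned)(((cap) >> 48) & 0xf))
    #define CAP_MPSMAX(cap)  ((unsigned)(((cap) >> 52) & 0xf))

    int main(void)
    {
        uint64_t cap = 0x00440000000003ffULL;   /* made-up example value */

        printf("Maximum Queue Entries Supported: %u\n", 1 + CAP_MQES(cap));
        printf("Memory Page Size Minimum: %u\n", 1u << (12 + CAP_MPSMIN(cap)));
        printf("Memory Page Size Maximum: %u\n", 1u << (12 + CAP_MPSMAX(cap)));
        return 0;
    }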

Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Eric Auger 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-5-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c   | 13 +
 block/trace-events |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index 6f1d7f9b2a..361b5772b7 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -727,6 +727,19 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
  * Initialization". */
 
 cap = le64_to_cpu(regs->cap);
+trace_nvme_controller_capability_raw(cap);
+trace_nvme_controller_capability("Maximum Queue Entries Supported",
+ 1 + NVME_CAP_MQES(cap));
+trace_nvme_controller_capability("Contiguous Queues Required",
+ NVME_CAP_CQR(cap));
+trace_nvme_controller_capability("Doorbell Stride",
+ 2 << (2 + NVME_CAP_DSTRD(cap)));
+trace_nvme_controller_capability("Subsystem Reset Supported",
+ NVME_CAP_NSSRS(cap));
+trace_nvme_controller_capability("Memory Page Size Minimum",
+ 1 << (12 + NVME_CAP_MPSMIN(cap)));
+trace_nvme_controller_capability("Memory Page Size Maximum",
+ 1 << (12 + NVME_CAP_MPSMAX(cap)));
 if (!NVME_CAP_CSS(cap)) {
 error_setg(errp, "Device doesn't support NVMe command set");
 ret = -EINVAL;
diff --git a/block/trace-events b/block/trace-events
index 0955c85c78..b90b07b15f 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -134,6 +134,8 @@ qed_aio_write_postfill(void *s, void *acb, uint64_t start, 
size_t len, uint64_t
 qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) 
"s %p acb %p ret %d offset %"PRIu64" len %zu"
 
 # nvme.c
+nvme_controller_capability_raw(uint64_t value) "0x%08"PRIx64
+nvme_controller_capability(const char *desc, uint64_t value) "%s: %"PRIu64
 nvme_kick(void *s, int queue) "s %p queue %d"
 nvme_dma_flush_queue_wait(void *s) "s %p"
 nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) 
"cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
-- 
2.28.0



[PULL 10/33] block/nvme: Move definitions before structure declarations

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

To be able to use some definitions in structure declarations,
move them earlier. No logical change.

Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-9-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index e95d59d312..b0629f5de8 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -41,6 +41,16 @@
 
 typedef struct BDRVNVMeState BDRVNVMeState;
 
+/* Same index is used for queues and IRQs */
+#define INDEX_ADMIN 0
+#define INDEX_IO(n) (1 + n)
+
+/* This driver shares a single MSIX IRQ for the admin and I/O queues */
+enum {
+MSIX_SHARED_IRQ_IDX = 0,
+MSIX_IRQ_COUNT = 1
+};
+
 typedef struct {
 int32_t  head, tail;
 uint8_t  *queue;
@@ -81,15 +91,6 @@ typedef struct {
 QEMUBH  *completion_bh;
 } NVMeQueuePair;
 
-#define INDEX_ADMIN 0
-#define INDEX_IO(n) (1 + n)
-
-/* This driver shares a single MSIX IRQ for the admin and I/O queues */
-enum {
-MSIX_SHARED_IRQ_IDX = 0,
-MSIX_IRQ_COUNT = 1
-};
-
 struct BDRVNVMeState {
 AioContext *aio_context;
 QEMUVFIOState *vfio;
-- 
2.28.0



[PULL 11/33] block/nvme: Use unsigned integer for queue counter/size

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

We cannot have a negative queue count/size/index, so use an unsigned type.
Rename 'nr_queues' to 'queue_count' to match the spec naming.

Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-10-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c   | 38 ++
 block/trace-events | 10 +-
 2 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index b0629f5de8..c450499111 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -104,7 +104,7 @@ struct BDRVNVMeState {
  * [1..]: io queues.
  */
 NVMeQueuePair **queues;
-int nr_queues;
+unsigned queue_count;
 size_t page_size;
 /* How many uint32_t elements does each doorbell entry take. */
 size_t doorbell_scale;
@@ -161,7 +161,7 @@ static QemuOptsList runtime_opts = {
 };
 
 static void nvme_init_queue(BDRVNVMeState *s, NVMeQueue *q,
-int nentries, int entry_bytes, Error **errp)
+unsigned nentries, size_t entry_bytes, Error 
**errp)
 {
 size_t bytes;
 int r;
@@ -206,7 +206,7 @@ static void nvme_free_req_queue_cb(void *opaque)
 
 static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState *s,
  AioContext *aio_context,
- int idx, int size,
+ unsigned idx, size_t size,
  Error **errp)
 {
 int i, r;
@@ -623,7 +623,7 @@ static bool nvme_poll_queues(BDRVNVMeState *s)
 bool progress = false;
 int i;
 
-for (i = 0; i < s->nr_queues; i++) {
+for (i = 0; i < s->queue_count; i++) {
 if (nvme_poll_queue(s->queues[i])) {
 progress = true;
 }
@@ -644,10 +644,10 @@ static void nvme_handle_event(EventNotifier *n)
 static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
 {
 BDRVNVMeState *s = bs->opaque;
-int n = s->nr_queues;
+unsigned n = s->queue_count;
 NVMeQueuePair *q;
 NvmeCmd cmd;
-int queue_size = NVME_QUEUE_SIZE;
+unsigned queue_size = NVME_QUEUE_SIZE;
 
 q = nvme_create_queue_pair(s, bdrv_get_aio_context(bs),
n, queue_size, errp);
@@ -661,7 +661,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 .cdw11 = cpu_to_le32(0x3),
 };
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
-error_setg(errp, "Failed to create CQ io queue [%d]", n);
+error_setg(errp, "Failed to create CQ io queue [%u]", n);
 goto out_error;
 }
 cmd = (NvmeCmd) {
@@ -671,12 +671,12 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 .cdw11 = cpu_to_le32(0x1 | (n << 16)),
 };
 if (nvme_cmd_sync(bs, s->queues[INDEX_ADMIN], &cmd)) {
-error_setg(errp, "Failed to create SQ io queue [%d]", n);
+error_setg(errp, "Failed to create SQ io queue [%u]", n);
 goto out_error;
 }
 s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1);
 s->queues[n] = q;
-s->nr_queues++;
+s->queue_count++;
 return true;
 out_error:
 nvme_free_queue_pair(q);
@@ -785,7 +785,7 @@ static int nvme_init(BlockDriverState *bs, const char 
*device, int namespace,
 ret = -EINVAL;
 goto out;
 }
-s->nr_queues = 1;
+s->queue_count = 1;
 QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
 regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << AQA_ACQS_SHIFT) |
 (NVME_QUEUE_SIZE << AQA_ASQS_SHIFT));
@@ -895,10 +895,9 @@ static int 
nvme_enable_disable_write_cache(BlockDriverState *bs, bool enable,
 
 static void nvme_close(BlockDriverState *bs)
 {
-int i;
 BDRVNVMeState *s = bs->opaque;
 
-for (i = 0; i < s->nr_queues; ++i) {
+for (unsigned i = 0; i < s->queue_count; ++i) {
 nvme_free_queue_pair(s->queues[i]);
 }
 g_free(s->queues);
@@ -1123,7 +1122,7 @@ static coroutine_fn int 
nvme_co_prw_aligned(BlockDriverState *bs,
 };
 
 trace_nvme_prw_aligned(s, is_write, offset, bytes, flags, qiov->niov);
-assert(s->nr_queues > 1);
+assert(s->queue_count > 1);
 req = nvme_get_free_req(ioq);
 assert(req);
 
@@ -1233,7 +1232,7 @@ static coroutine_fn int nvme_co_flush(BlockDriverState 
*bs)
 .ret = -EINPROGRESS,
 };
 
-assert(s->nr_queues > 1);
+assert(s->queue_count > 1);
 req = nvme_get_free_req(ioq);
 assert(req);
 nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
@@ -1285,7 +1284,7 @@ static coroutine_fn int 
nvme_co_pwrite_zeroes(BlockDriverState *bs,
 cmd.cdw12 = cpu_to_le32(cdw12);
 
 trace_nvme_write_zeroes(s, offset, bytes, flags);
-assert(s->nr_queues > 1);
+assert(s->queue_count > 1);
 req = nvme_get_free_req(ioq);
 

[PULL 07/33] block/nvme: Trace nvme_poll_queue() per queue

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

As we want to enable multiple queues, report the event
in each nvme_poll_queue() call, rather than once in
the callback calling nvme_poll_queues().

Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-6-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c   | 2 +-
 block/trace-events | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 361b5772b7..8d74401ae7 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -594,6 +594,7 @@ static bool nvme_poll_queue(NVMeQueuePair *q)
 const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
 NvmeCqe *cqe = (NvmeCqe *)>cq.queue[cqe_offset];
 
+trace_nvme_poll_queue(q->s, q->index);
 /*
  * Do an early check for completions. q->lock isn't needed because
  * nvme_process_completion() only runs in the event loop thread and
@@ -684,7 +685,6 @@ static bool nvme_poll_cb(void *opaque)
 BDRVNVMeState *s = container_of(e, BDRVNVMeState,
 irq_notifier[MSIX_SHARED_IRQ_IDX]);
 
-trace_nvme_poll_cb(s);
 return nvme_poll_queues(s);
 }
 
diff --git a/block/trace-events b/block/trace-events
index b90b07b15f..86292f3312 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -145,7 +145,7 @@ nvme_complete_command(void *s, int index, int cid) "s %p 
queue %d cid %d"
 nvme_submit_command(void *s, int index, int cid) "s %p queue %d cid %d"
 nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int 
c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x"
 nvme_handle_event(void *s) "s %p"
-nvme_poll_cb(void *s) "s %p"
+nvme_poll_queue(void *s, unsigned q_index) "s %p q #%u"
 nvme_prw_aligned(void *s, int is_write, uint64_t offset, uint64_t bytes, int 
flags, int niov) "s %p is_write %d offset 0x%"PRIx64" bytes %"PRId64" flags %d 
niov %d"
 nvme_write_zeroes(void *s, uint64_t offset, uint64_t bytes, int flags) "s %p 
offset 0x%"PRIx64" bytes %"PRId64" flags %d"
 nvme_qiov_unaligned(const void *qiov, int n, void *base, size_t size, int 
align) "qiov %p n %d base %p size 0x%zx align 0x%x"
-- 
2.28.0



[PULL 09/33] block/nvme: Trace queue pair creation/deletion

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-8-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c   | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index 29d2541b91..e95d59d312 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -181,6 +181,7 @@ static void nvme_init_queue(BDRVNVMeState *s, NVMeQueue *q,
 
 static void nvme_free_queue_pair(NVMeQueuePair *q)
 {
+trace_nvme_free_queue_pair(q->index, q);
 if (q->completion_bh) {
 qemu_bh_delete(q->completion_bh);
 }
@@ -216,6 +217,8 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState 
*s,
 if (!q) {
 return NULL;
 }
+trace_nvme_create_queue_pair(idx, q, size, aio_context,
+ event_notifier_get_fd(s->irq_notifier));
 q->prp_list_pages = qemu_try_memalign(s->page_size,
   s->page_size * NVME_NUM_REQS);
 if (!q->prp_list_pages) {
diff --git a/block/trace-events b/block/trace-events
index cc5e2b55cb..f6a0f99df1 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -155,6 +155,8 @@ nvme_dsm(void *s, uint64_t offset, uint64_t bytes) "s %p 
offset 0x%"PRIx64" byte
 nvme_dsm_done(void *s, uint64_t offset, uint64_t bytes, int ret) "s %p offset 
0x%"PRIx64" bytes %"PRId64" ret %d"
 nvme_dma_map_flush(void *s) "s %p"
 nvme_free_req_queue_wait(void *s, unsigned q_index) "s %p q #%u"
+nvme_create_queue_pair(unsigned q_index, void *q, unsigned size, void 
*aio_context, int fd) "index %u q %p size %u aioctx %p fd %d"
+nvme_free_queue_pair(unsigned q_index, void *q) "index %u q %p"
 nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s 
%p cmd %p req %p qiov %p entries %d"
 nvme_cmd_map_qiov_pages(void *s, int i, uint64_t page) "s %p page[%d] 
0x%"PRIx64
 nvme_cmd_map_qiov_iov(void *s, int i, void *page, int pages) "s %p iov[%d] %p 
pages %d"
-- 
2.28.0



[PULL 04/33] block/nvme: Use hex format to display offset in trace events

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Use the same format used for the hw/vfio/ trace events.

Suggested-by: Eric Auger 
Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-3-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/trace-events | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/trace-events b/block/trace-events
index 0e351c3fa3..0955c85c78 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -144,13 +144,13 @@ nvme_submit_command(void *s, int index, int cid) "s %p 
queue %d cid %d"
 nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int 
c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x"
 nvme_handle_event(void *s) "s %p"
 nvme_poll_cb(void *s) "s %p"
-nvme_prw_aligned(void *s, int is_write, uint64_t offset, uint64_t bytes, int 
flags, int niov) "s %p is_write %d offset %"PRId64" bytes %"PRId64" flags %d 
niov %d"
-nvme_write_zeroes(void *s, uint64_t offset, uint64_t bytes, int flags) "s %p 
offset %"PRId64" bytes %"PRId64" flags %d"
+nvme_prw_aligned(void *s, int is_write, uint64_t offset, uint64_t bytes, int 
flags, int niov) "s %p is_write %d offset 0x%"PRIx64" bytes %"PRId64" flags %d 
niov %d"
+nvme_write_zeroes(void *s, uint64_t offset, uint64_t bytes, int flags) "s %p 
offset 0x%"PRIx64" bytes %"PRId64" flags %d"
 nvme_qiov_unaligned(const void *qiov, int n, void *base, size_t size, int 
align) "qiov %p n %d base %p size 0x%zx align 0x%x"
-nvme_prw_buffered(void *s, uint64_t offset, uint64_t bytes, int niov, int 
is_write) "s %p offset %"PRId64" bytes %"PRId64" niov %d is_write %d"
-nvme_rw_done(void *s, int is_write, uint64_t offset, uint64_t bytes, int ret) 
"s %p is_write %d offset %"PRId64" bytes %"PRId64" ret %d"
-nvme_dsm(void *s, uint64_t offset, uint64_t bytes) "s %p offset %"PRId64" 
bytes %"PRId64""
-nvme_dsm_done(void *s, uint64_t offset, uint64_t bytes, int ret) "s %p offset 
%"PRId64" bytes %"PRId64" ret %d"
+nvme_prw_buffered(void *s, uint64_t offset, uint64_t bytes, int niov, int 
is_write) "s %p offset 0x%"PRIx64" bytes %"PRId64" niov %d is_write %d"
+nvme_rw_done(void *s, int is_write, uint64_t offset, uint64_t bytes, int ret) 
"s %p is_write %d offset 0x%"PRIx64" bytes %"PRId64" ret %d"
+nvme_dsm(void *s, uint64_t offset, uint64_t bytes) "s %p offset 0x%"PRIx64" 
bytes %"PRId64""
+nvme_dsm_done(void *s, uint64_t offset, uint64_t bytes, int ret) "s %p offset 
0x%"PRIx64" bytes %"PRId64" ret %d"
 nvme_dma_map_flush(void *s) "s %p"
 nvme_free_req_queue_wait(void *q) "q %p"
 nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s 
%p cmd %p req %p qiov %p entries %d"
-- 
2.28.0



[PULL 05/33] block/nvme: Report warning with warn_report()

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Instead of displaying the warning on stderr, use warn_report(),
which also displays it on the monitor.

Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-4-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 739a0a700c..6f1d7f9b2a 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -399,8 +399,8 @@ static bool nvme_process_completion(NVMeQueuePair *q)
 }
 cid = le16_to_cpu(c->cid);
 if (cid == 0 || cid > NVME_QUEUE_SIZE) {
-fprintf(stderr, "Unexpected CID in completion queue: %" PRIu32 
"\n",
-cid);
+warn_report("NVMe: Unexpected CID in completion queue: %"PRIu32", "
+"queue size: %u", cid, NVME_QUEUE_SIZE);
 continue;
 }
 trace_nvme_complete_command(s, q->index, cid);
-- 
2.28.0



[PULL 08/33] block/nvme: Improve nvme_free_req_queue_wait() trace information

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

What we want to trace is the block driver state and the queue index.

Suggested-by: Stefan Hajnoczi 
Reviewed-by: Eric Auger 
Reviewed-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20201029093306.1063879-7-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
Tested-by: Eric Auger 
---
 block/nvme.c   | 2 +-
 block/trace-events | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 8d74401ae7..29d2541b91 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -292,7 +292,7 @@ static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
 
 while (q->free_req_head == -1) {
 if (qemu_in_coroutine()) {
-trace_nvme_free_req_queue_wait(q);
+trace_nvme_free_req_queue_wait(q->s, q->index);
 qemu_co_queue_wait(&q->free_req_queue, &q->lock);
 } else {
 qemu_mutex_unlock(&q->lock);
diff --git a/block/trace-events b/block/trace-events
index 86292f3312..cc5e2b55cb 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -154,7 +154,7 @@ nvme_rw_done(void *s, int is_write, uint64_t offset, 
uint64_t bytes, int ret) "s
 nvme_dsm(void *s, uint64_t offset, uint64_t bytes) "s %p offset 0x%"PRIx64" 
bytes %"PRId64""
 nvme_dsm_done(void *s, uint64_t offset, uint64_t bytes, int ret) "s %p offset 
0x%"PRIx64" bytes %"PRId64" ret %d"
 nvme_dma_map_flush(void *s) "s %p"
-nvme_free_req_queue_wait(void *q) "q %p"
+nvme_free_req_queue_wait(void *s, unsigned q_index) "s %p q #%u"
 nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s 
%p cmd %p req %p qiov %p entries %d"
 nvme_cmd_map_qiov_pages(void *s, int i, uint64_t page) "s %p page[%d] 
0x%"PRIx64
 nvme_cmd_map_qiov_iov(void *s, int i, void *page, int pages) "s %p iov[%d] %p 
pages %d"
-- 
2.28.0



[PULL 03/33] MAINTAINERS: Cover "block/nvme.h" file

2020-11-04 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

The "block/nvme.h" header is shared by both the NVMe block
driver and the NVMe emulated device. Add the 'F:' entry to
both sections, so all maintainers/reviewers are notified
when it is changed.

Signed-off-by: Philippe Mathieu-Daudé 
Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Klaus Jensen 
Message-Id: <20200701140634.25994-1-phi...@redhat.com>
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index c1d16026ba..ea47b9e889 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1878,6 +1878,7 @@ M: Klaus Jensen 
 L: qemu-block@nongnu.org
 S: Supported
 F: hw/block/nvme*
+F: include/block/nvme.h
 F: tests/qtest/nvme-test.c
 F: docs/specs/nvme.txt
 T: git git://git.infradead.org/qemu-nvme.git nvme-next
@@ -2962,6 +2963,7 @@ R: Fam Zheng 
 L: qemu-block@nongnu.org
 S: Supported
 F: block/nvme*
+F: include/block/nvme.h
 T: git https://github.com/stefanha/qemu.git block
 
 Bootdevice
-- 
2.28.0



[PULL 01/33] accel/kvm: add PIO ioeventfds only in case kvm_eventfds_allowed is true

2020-11-04 Thread Stefan Hajnoczi
From: Elena Afanasova 

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Elena Afanasova 
Message-Id: <20201017210102.26036-1-eafanas...@gmail.com>
Signed-off-by: Stefan Hajnoczi 
---
 accel/kvm/kvm-all.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 9ef5daf4c5..baaa54249d 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2239,8 +2239,10 @@ static int kvm_init(MachineState *ms)
 
 kvm_memory_listener_register(s, &s->memory_listener,
  &address_space_memory, 0);
-memory_listener_register(&kvm_io_listener,
- &address_space_io);
+if (kvm_eventfds_allowed) {
+memory_listener_register(&kvm_io_listener,
+ &address_space_io);
+}
 memory_listener_register(&kvm_coalesced_pio_listener,
  &address_space_io);
 
-- 
2.28.0



[PULL 02/33] softmmu/memory: fix memory_region_ioeventfd_equal()

2020-11-04 Thread Stefan Hajnoczi
From: Elena Afanasova 

An eventfd can be registered with a zero length when fast_mmio is true.
Handle this case properly when dispatching through QEMU.
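
Modulo Int128 arithmetic, the fixed predicate reduces to the following
standalone model (plain uint64_t stands in for Int128, and the struct only
loosely mirrors MemoryRegionIoeventfd):

    #include <stdbool.h>
    #include <stdint.h>

    struct ioeventfd {
        uint64_t start, size, data;
        bool match_data;
        int fd;
    };

    static bool ioeventfd_equal(const struct ioeventfd *a,
                                const struct ioeventfd *b)
    {
        /* A zero size (fast_mmio) matches any size at the same address. */
        return a->start == b->start &&
               (!a->size || !b->size ||
                (a->size == b->size &&
                 a->match_data == b->match_data &&
                 (!a->match_data || a->data == b->data) &&
                 a->fd == b->fd));
    }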

Signed-off-by: Elena Afanasova 
Message-id: cf71a62eb04e61932ff8ffdd02e0b2aab4f495a0.ca...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 softmmu/memory.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 21d533d8ed..8aba4114cf 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -205,8 +205,15 @@ static bool memory_region_ioeventfd_before(MemoryRegionIoeventfd *a,
 static bool memory_region_ioeventfd_equal(MemoryRegionIoeventfd *a,
   MemoryRegionIoeventfd *b)
 {
-return !memory_region_ioeventfd_before(a, b)
-&& !memory_region_ioeventfd_before(b, a);
+if (int128_eq(a->addr.start, b->addr.start) &&
+(!int128_nz(a->addr.size) || !int128_nz(b->addr.size) ||
+ (int128_eq(a->addr.size, b->addr.size) &&
+  (a->match_data == b->match_data) &&
+  ((a->match_data && (a->data == b->data)) || !a->match_data) &&
+  (a->e == b->e))))
+return true;
+
+return false;
 }
 
 /* Range of memory in the global map.  Addresses are absolute. */
-- 
2.28.0
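
The new predicate is easier to see with concrete values: a zero-length
registration (the fast_mmio case) matches any registration at the same
address, while two sized registrations must also agree on size,
match_data/data and the notifier. The simplified model below uses plain
uint64_t in place of Int128 and an int in place of the EventNotifier
pointer; illustrative only:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for MemoryRegionIoeventfd. */
typedef struct {
    uint64_t start, size;
    bool match_data;
    uint64_t data;
    int e;                       /* stands in for the EventNotifier */
} Ioeventfd;

static bool ioeventfd_equal(const Ioeventfd *a, const Ioeventfd *b)
{
    return a->start == b->start &&
           (a->size == 0 || b->size == 0 ||      /* zero size: fast_mmio */
            (a->size == b->size &&
             a->match_data == b->match_data &&
             (!a->match_data || a->data == b->data) &&
             a->e == b->e));
}

int main(void)
{
    Ioeventfd fast  = { .start = 0x1000, .size = 0 };
    Ioeventfd sized = { .start = 0x1000, .size = 4, .match_data = false, .e = 7 };

    printf("fast vs sized at same address: %d\n", ioeventfd_equal(&fast, &sized));
    sized.start = 0x2000;
    printf("different address:             %d\n", ioeventfd_equal(&fast, &sized));
    return 0;
}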



[PULL 00/33] Block patches

2020-11-04 Thread Stefan Hajnoczi
The following changes since commit 8507c9d5c9a62de2a0e281b640f995e26eac46af:

  Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging (2020-11-03 15:59:44 +0000)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to fc107d86840b3364e922c26cf7631b7fd38ce523:

  util/vfio-helpers: Assert offset is aligned to page size (2020-11-03 19:06:23 +0000)


Pull request for 5.2

NVMe fixes to solve IOMMU issues on non-x86 and error message/tracing
improvements. Elena Afanasova's ioeventfd fixes are also included.

Signed-off-by: Stefan Hajnoczi 



Elena Afanasova (2):
  accel/kvm: add PIO ioeventfds only in case kvm_eventfds_allowed is true
  softmmu/memory: fix memory_region_ioeventfd_equal()

Eric Auger (4):
  block/nvme: Change size and alignment of IDENTIFY response buffer
  block/nvme: Change size and alignment of queue
  block/nvme: Change size and alignment of prp_list_pages
  block/nvme: Align iov's va and size on host page size

Philippe Mathieu-Daudé (27):
  MAINTAINERS: Cover "block/nvme.h" file
  block/nvme: Use hex format to display offset in trace events
  block/nvme: Report warning with warn_report()
  block/nvme: Trace controller capabilities
  block/nvme: Trace nvme_poll_queue() per queue
  block/nvme: Improve nvme_free_req_queue_wait() trace information
  block/nvme: Trace queue pair creation/deletion
  block/nvme: Move definitions before structure declarations
  block/nvme: Use unsigned integer for queue counter/size
  block/nvme: Make nvme_identify() return boolean indicating error
  block/nvme: Make nvme_init_queue() return boolean indicating error
  block/nvme: Introduce Completion Queue definitions
  block/nvme: Use definitions instead of magic values in add_io_queue()
  block/nvme: Correctly initialize Admin Queue Attributes
  block/nvme: Simplify ADMIN queue access
  block/nvme: Simplify nvme_cmd_sync()
  block/nvme: Set request_alignment at initialization
  block/nvme: Correct minimum device page size
  block/nvme: Fix use of write-only doorbells page on Aarch64 arch
  block/nvme: Fix nvme_submit_command() on big-endian host
  util/vfio-helpers: Improve reporting unsupported IOMMU type
  util/vfio-helpers: Trace PCI I/O config accesses
  util/vfio-helpers: Trace PCI BAR region info
  util/vfio-helpers: Trace where BARs are mapped
  util/vfio-helpers: Improve DMA trace events
  util/vfio-helpers: Convert vfio_dump_mapping to trace events
  util/vfio-helpers: Assert offset is aligned to page size

 MAINTAINERS  |   2 +
 include/block/nvme.h |  18 ++--
 accel/kvm/kvm-all.c  |   6 +-
 block/nvme.c | 209 ---
 softmmu/memory.c |  11 ++-
 util/vfio-helpers.c  |  43 +
 block/trace-events   |  30 ---
 util/trace-events|  10 ++-
 8 files changed, 195 insertions(+), 134 deletions(-)

-- 
2.28.0



Re: [PATCH] Prefer 'on' | 'off' over 'yes' | 'no' for bool options

2020-11-04 Thread Philippe Mathieu-Daudé
On 11/4/20 3:05 PM, Daniel P. Berrangé wrote:
> Update some docs and test cases to use 'on' | 'off' as the preferred
> value for bool options.
> 
> Signed-off-by: Daniel P. Berrangé 
> ---
>  docs/system/vnc-security.rst | 6 +++---
>  include/authz/listfile.h | 2 +-
>  qemu-options.hx  | 4 ++--
>  tests/qemu-iotests/233   | 4 ++--
>  4 files changed, 8 insertions(+), 8 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH] Prefer 'on' | 'off' over 'yes' | 'no' for bool options

2020-11-04 Thread Thomas Huth
On 04/11/2020 15.05, Daniel P. Berrangé wrote:
> Update some docs and test cases to use 'on' | 'off' as the preferred
> value for bool options.
> 
> Signed-off-by: Daniel P. Berrangé 
> ---
>  docs/system/vnc-security.rst | 6 +++---
>  include/authz/listfile.h | 2 +-
>  qemu-options.hx  | 4 ++--
>  tests/qemu-iotests/233   | 4 ++--
>  4 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/docs/system/vnc-security.rst b/docs/system/vnc-security.rst
> index b237b07330..e97b42dfdc 100644
> --- a/docs/system/vnc-security.rst
> +++ b/docs/system/vnc-security.rst
> @@ -89,7 +89,7 @@ but with ``verify-peer`` set to ``yes`` instead.
>  .. parsed-literal::
>  
> |qemu_system| [...OPTIONS...] \
> - -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=yes \
> + -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
>   -vnc :1,tls-creds=tls0 -monitor stdio
>  
>  .. _vnc_005fsec_005fcertificate_005fpw:
> @@ -103,7 +103,7 @@ authentication to provide two layers of authentication for clients.
>  .. parsed-literal::
>  
> |qemu_system| [...OPTIONS...] \
> - -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=yes \
> + -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
>   -vnc :1,tls-creds=tls0,password -monitor stdio
> (qemu) change vnc password
> Password: 
> @@ -145,7 +145,7 @@ x509 options:
>  .. parsed-literal::
>  
> |qemu_system| [...OPTIONS...] \
> - -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=yes \
> + -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
>   -vnc :1,tls-creds=tls0,sasl -monitor stdio
>  
>  .. _vnc_005fsetup_005fsasl:
> diff --git a/include/authz/listfile.h b/include/authz/listfile.h
> index 0a1e5bddd3..0b7fe72198 100644
> --- a/include/authz/listfile.h
> +++ b/include/authz/listfile.h
> @@ -73,7 +73,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(QAuthZListFile,
>   * The object can be created on the command line using
>   *
>   *   -object authz-list-file,id=authz0,\
> - *   filename=/etc/qemu/myvm-vnc.acl,refresh=yes
> + *   filename=/etc/qemu/myvm-vnc.acl,refresh=on
>   *
>   */
>  struct QAuthZListFile {
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 2c83390504..0bdc07bc47 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -5002,7 +5002,7 @@ SRST
>  Note the use of quotes due to the x509 distinguished name
>  containing whitespace, and escaping of ','.
>  
> -``-object authz-listfile,id=id,filename=path,refresh=yes|no``
> +``-object authz-listfile,id=id,filename=path,refresh=on|off``
>  Create an authorization object that will control access to
>  network services.
>  
> @@ -5047,7 +5047,7 @@ SRST
>  
>   # |qemu_system| \\
>   ... \\
> - -object authz-simple,id=auth0,filename=/etc/qemu/vnc-sasl.acl,refresh=yes \\
> + -object authz-simple,id=auth0,filename=/etc/qemu/vnc-sasl.acl,refresh=on \\
>   ...
>  
>  ``-object authz-pam,id=id,service=string``
> diff --git a/tests/qemu-iotests/233 b/tests/qemu-iotests/233
> index a5c17c3963..0b99530f7f 100755
> --- a/tests/qemu-iotests/233
> +++ b/tests/qemu-iotests/233
> @@ -83,7 +83,7 @@ echo
>  echo "== check plain client to TLS server fails =="
>  
>  nbd_server_start_tcp_socket \
> ---object tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=yes \
> +--object tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=on \
>  --tls-creds tls0 \
>  -f $IMGFMT "$TEST_IMG" 2>> "$TEST_DIR/server.log"
>  
> @@ -128,7 +128,7 @@ echo "== check TLS with authorization =="
>  nbd_server_stop
>  
>  nbd_server_start_tcp_socket \
> ---object tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=yes \
> +--object tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=on \
>  --object "authz-simple,id=authz0,identity=CN=localhost,, \
>    O=Cthulu Dark Lord Enterprises client1,,L=R'lyeh,,C=South Pacific" \
>  --tls-authz authz0 \
> 

Reviewed-by: Thomas Huth 




[PATCH] Prefer 'on' | 'off' over 'yes' | 'no' for bool options

2020-11-04 Thread Daniel P . Berrangé
Update some docs and test cases to use 'on' | 'off' as the preferred
value for bool options.

Signed-off-by: Daniel P. Berrangé 
---
 docs/system/vnc-security.rst | 6 +++---
 include/authz/listfile.h | 2 +-
 qemu-options.hx  | 4 ++--
 tests/qemu-iotests/233   | 4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/system/vnc-security.rst b/docs/system/vnc-security.rst
index b237b07330..e97b42dfdc 100644
--- a/docs/system/vnc-security.rst
+++ b/docs/system/vnc-security.rst
@@ -89,7 +89,7 @@ but with ``verify-peer`` set to ``yes`` instead.
 .. parsed-literal::
 
|qemu_system| [...OPTIONS...] \
- -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=yes \
+ -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
  -vnc :1,tls-creds=tls0 -monitor stdio
 
 .. _vnc_005fsec_005fcertificate_005fpw:
@@ -103,7 +103,7 @@ authentication to provide two layers of authentication for clients.
 .. parsed-literal::
 
|qemu_system| [...OPTIONS...] \
- -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=yes \
+ -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
  -vnc :1,tls-creds=tls0,password -monitor stdio
(qemu) change vnc password
Password: 
@@ -145,7 +145,7 @@ x509 options:
 .. parsed-literal::
 
|qemu_system| [...OPTIONS...] \
- -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=yes \
+ -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
  -vnc :1,tls-creds=tls0,sasl -monitor stdio
 
 .. _vnc_005fsetup_005fsasl:
diff --git a/include/authz/listfile.h b/include/authz/listfile.h
index 0a1e5bddd3..0b7fe72198 100644
--- a/include/authz/listfile.h
+++ b/include/authz/listfile.h
@@ -73,7 +73,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(QAuthZListFile,
  * The object can be created on the command line using
  *
  *   -object authz-list-file,id=authz0,\
- *   filename=/etc/qemu/myvm-vnc.acl,refresh=yes
+ *   filename=/etc/qemu/myvm-vnc.acl,refresh=on
  *
  */
 struct QAuthZListFile {
diff --git a/qemu-options.hx b/qemu-options.hx
index 2c83390504..0bdc07bc47 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -5002,7 +5002,7 @@ SRST
 Note the use of quotes due to the x509 distinguished name
 containing whitespace, and escaping of ','.
 
-``-object authz-listfile,id=id,filename=path,refresh=yes|no``
+``-object authz-listfile,id=id,filename=path,refresh=on|off``
 Create an authorization object that will control access to
 network services.
 
@@ -5047,7 +5047,7 @@ SRST
 
  # |qemu_system| \\
  ... \\
- -object authz-simple,id=auth0,filename=/etc/qemu/vnc-sasl.acl,refresh=yes \\
+ -object authz-simple,id=auth0,filename=/etc/qemu/vnc-sasl.acl,refresh=on \\
  ...
 
 ``-object authz-pam,id=id,service=string``
diff --git a/tests/qemu-iotests/233 b/tests/qemu-iotests/233
index a5c17c3963..0b99530f7f 100755
--- a/tests/qemu-iotests/233
+++ b/tests/qemu-iotests/233
@@ -83,7 +83,7 @@ echo
 echo "== check plain client to TLS server fails =="
 
 nbd_server_start_tcp_socket \
---object tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=yes \
+--object tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=on \
 --tls-creds tls0 \
 -f $IMGFMT "$TEST_IMG" 2>> "$TEST_DIR/server.log"
 
@@ -128,7 +128,7 @@ echo "== check TLS with authorization =="
 nbd_server_stop
 
 nbd_server_start_tcp_socket \
---object tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=yes \
+--object tls-creds-x509,dir=${tls_dir}/server1,endpoint=server,id=tls0,verify-peer=on \
 --object "authz-simple,id=authz0,identity=CN=localhost,, \
   O=Cthulu Dark Lord Enterprises client1,,L=R'lyeh,,C=South Pacific" \
 --tls-authz authz0 \
-- 
2.28.0
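
Both spellings keep working after this patch: QEMU's option parsing has
long been lenient about boolean values, and the change only makes the
documentation and tests showcase the preferred 'on'/'off' pair. Below
is a toy parser sketching that leniency; an illustration of the
behaviour, not QEMU's actual implementation:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Accept the boolean spellings commonly seen in QEMU option strings;
 * "on"/"off" is the documented, preferred pair. */
static bool parse_bool(const char *value, bool *out)
{
    static const char *const truthy[] = { "on", "yes", "true" };
    static const char *const falsy[]  = { "off", "no", "false" };

    for (size_t i = 0; i < 3; i++) {
        if (!strcmp(value, truthy[i])) { *out = true;  return true; }
        if (!strcmp(value, falsy[i]))  { *out = false; return true; }
    }
    return false;   /* unrecognized spelling */
}

int main(void)
{
    const char *samples[] = { "on", "yes", "off", "maybe" };

    for (size_t i = 0; i < 4; i++) {
        bool v;
        if (parse_bool(samples[i], &v)) {
            printf("%-5s -> %d\n", samples[i], v);
        } else {
            printf("%-5s -> parse error\n", samples[i]);
        }
    }
    return 0;
}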




Re: [PULL 0/6] Mips fixes patches

2020-11-04 Thread Peter Maydell
On Tue, 3 Nov 2020 at 17:33, Philippe Mathieu-Daudé  wrote:
>
> The following changes since commit 83851c7c60c90e9fb6a23ff48076387a77bc33cd:
>
>   Merge remote-tracking branch 'remotes/mdroth/tags/qga-pull-2020-10-27-v3-tag' into staging (2020-11-03 12:47:58 +0000)
>
> are available in the Git repository at:
>
>   https://gitlab.com/philmd/qemu.git tags/mips-fixes-20201103
>
> for you to fetch changes up to 8a805609d126ff2be9ad9ec118185dfc52633d6f:
>
>   target/mips: Add unaligned access support for MIPS64R6 and Loongson-3 (2020-11-03 16:51:13 +0100)
>
> 
> MIPS patches queue
>
> - Removal of the 'r4k' machine (deprecated before 5.0)
> - Fix LGPL license text (Chetan Pant)
> - Support unaligned accesses on Loongson-3 (Huacai Chen)
> - Fix out-of-bound access in Loongson-3 embedded I/O interrupt
>   controller (Alex Chen)
>
> CI jobs results:
> . https://cirrus-ci.com/build/6324890389184512
> . https://gitlab.com/philmd/qemu/-/pipelines/211275262
> . https://travis-ci.org/github/philmd/qemu/builds/741188958


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/5.2
for any user-visible changes.

-- PMM



Re: [PATCH for-5.2 0/3] hw/block/nvme: coverity fixes

2020-11-04 Thread Max Reitz
On 04.11.20 11:22, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Fix three issues reported by coverity (CIDs 1436128, 1436129 and
> 1436131).
> 
> Klaus Jensen (3):
>   hw/block/nvme: fix null ns in register namespace
>   hw/block/nvme: fix uint16_t use of uint32_t sgls member
>   hw/block/nvme: fix free of array-typed value
> 
>  hw/block/nvme.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)

Thanks again, applied to my block branch:

https://git.xanclic.moe/XanClic/qemu/commits/branch/block

Max




Re: [PATCH for-5.2 3/3] hw/block/nvme: fix free of array-typed value

2020-11-04 Thread Max Reitz
On 04.11.20 12:04, Klaus Jensen wrote:
> On Nov  4 11:59, Max Reitz wrote:
>> On 04.11.20 11:22, Klaus Jensen wrote:
>>> From: Klaus Jensen 
>>>
>>> Since 7f0f1acedf15 ("hw/block/nvme: support multiple namespaces"), the
>>> namespaces member of NvmeCtrl is no longer a dynamically allocated
>>> array. Remove the free.
>>>
>>> Fixes: 7f0f1acedf15 ("hw/block/nvme: support multiple namespaces")
>>> Reported-by: Coverity (CID 1436131)
>>> Signed-off-by: Klaus Jensen 
>>> ---
>>>  hw/block/nvme.c | 1 -
>>>  1 file changed, 1 deletion(-)
>>
>> Thanks! :)
>>
>> Reviewed-by: Max Reitz 
>>
> 
> Will Peter pick up fixes like this directly so we don't have to go
> through a pull request from nvme-next?

AFAIA, Peter only picks up build fixes.  Since the build wasn’t broken,
I think someone™ will have to send a pull request...

I understand you don’t necessarily want to be that someone, so I suppose
I might as well.

> Did I correctly annotate with "for-5.2"? :)

Yes!

Max




Re: [PATCH for-5.2 2/3] hw/block/nvme: fix uint16_t use of uint32_t sgls member

2020-11-04 Thread Philippe Mathieu-Daudé
On 11/4/20 11:22 AM, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> nvme_map_sgl_data erroneously uses the sgls member of NvmeIdNs as a
> uint16_t.
> 
> Reported-by: Coverity (CID 1436129)
> Fixes: cba0a8a344fe ("hw/block/nvme: add support for scatter gather lists")
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 080d782f1c2b..2bdc50eb6fce 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -452,7 +452,7 @@ static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
>   * segments and/or descriptors. The controller might accept
>   * ignoring the rest of the SGL.
>   */
> -uint16_t sgls = le16_to_cpu(n->id_ctrl.sgls);
> +uint32_t sgls = le32_to_cpu(n->id_ctrl.sgls);
>  if (sgls & NVME_CTRL_SGLS_EXCESS_LENGTH) {

I'm surprised the compiler doesn't warn here.

Reviewed-by: Philippe Mathieu-Daudé 

>  break;
>  }
> 
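
The effect of the narrowing is easy to demonstrate: reading the 32-bit
SGLS field through a uint16_t silently drops any flag above bit 15. A
standalone illustration; the bit position used here is hypothetical,
not the spec-defined NVME_CTRL_SGLS_EXCESS_LENGTH value:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t sgls = (UINT32_C(1) << 18) | 0x1;  /* flag in the upper half */
    uint16_t narrowed = (uint16_t)sgls;         /* what the buggy read did */

    printf("sgls     = 0x%08x\n", sgls);        /* 0x00040001 */
    printf("narrowed = 0x%04x\n", narrowed);    /* 0x0001: bit 18 is gone */
    return 0;
}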




Re: [PATCH for-5.2 1/3] hw/block/nvme: fix null ns in register namespace

2020-11-04 Thread Philippe Mathieu-Daudé
On 11/4/20 11:22 AM, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Fix dereference after NULL check.
> 
> Reported-by: Coverity (CID 1436128)
> Fixes: b20804946bce ("hw/block/nvme: update nsid when registered")
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index fa2cba744b57..080d782f1c2b 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -2562,8 +2562,7 @@ int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>  
>  if (!nsid) {
>  for (int i = 1; i <= n->num_namespaces; i++) {
> -NvmeNamespace *ns = nvme_ns(n, i);
> -if (!ns) {
> +if (!nvme_ns(n, i)) {
>  nsid = ns->params.nsid = i;

Uh.

Reviewed-by: Philippe Mathieu-Daudé 

>  break;
>  }
> 
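
The underlying bug class is variable shadowing: the loop declared a new
ns that hid the function parameter, so the NULL check tested the inner
(NULL) pointer while the assignment that followed needed the caller's
namespace. A minimal standalone reproduction of the pattern; the names
echo the patch, but the program is purely illustrative:

#include <stdio.h>

struct ns { int nsid; };

/* Pretend every slot is free. */
static struct ns *lookup(int i) { (void)i; return NULL; }

static void register_ns(struct ns *ns)   /* caller-provided namespace */
{
    for (int i = 1; i <= 4; i++) {
        struct ns *ns = lookup(i);   /* BUG: shadows the parameter */
        if (!ns) {
            /* Here 'ns' is the inner NULL pointer, not the caller's
             * namespace; writing ns->nsid = i would crash. Dropping
             * the inner declaration makes 'ns' the parameter again. */
            printf("slot %d free; inner ns is %p\n", i, (void *)ns);
            break;
        }
    }
}

int main(void)
{
    struct ns mine = { 0 };
    register_ns(&mine);
    return 0;
}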




Re: [PATCH for-5.2 3/3] hw/block/nvme: fix free of array-typed value

2020-11-04 Thread Klaus Jensen
On Nov  4 11:59, Max Reitz wrote:
> On 04.11.20 11:22, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Since 7f0f1acedf15 ("hw/block/nvme: support multiple namespaces"), the
> > namespaces member of NvmeCtrl is no longer a dynamically allocated
> > array. Remove the free.
> > 
> > Fixes: 7f0f1acedf15 ("hw/block/nvme: support multiple namespaces")
> > Reported-by: Coverity (CID 1436131)
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 1 -
> >  1 file changed, 1 deletion(-)
> 
> Thanks! :)
> 
> Reviewed-by: Max Reitz 
> 

Will Peter pick up fixes like this directly so we don't have to go
through a pull request from nvme-next?

Did I correctly annotate with "for-5.2"? :)


signature.asc
Description: PGP signature


  1   2   >