Re: [PATCH 1/7] qemu-common: Briefly document qemu_timedate_diff() unit

2020-06-17 Thread Markus Armbruster
Philippe Mathieu-Daudé  writes:

> It is not obvious that the qemu_timedate_diff() and
> qemu_ref_timedate() functions return seconds. Briefly
> document it.
>
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu-common.h | 1 +
>  softmmu/vl.c  | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index d0142f29ac..e97644710c 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -27,6 +27,7 @@ int qemu_main(int argc, char **argv, char **envp);
>  #endif
>  
>  void qemu_get_timedate(struct tm *tm, int offset);
> +/* Returns difference with RTC reference time (in seconds) */
>  int qemu_timedate_diff(struct tm *tm);

Not this patch's problem: use of int here smells; is it wide enough?

>  
>  void *qemu_oom_check(void *ptr);
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index f669c06ede..215459c7b5 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -737,7 +737,7 @@ void qemu_system_vmstop_request(RunState state)
>  }
>  
>  /***/
> -/* RTC reference time/date access */
> +/* RTC reference time/date access (in seconds) */
>  static time_t qemu_ref_timedate(QEMUClockType clock)
>  {
>  time_t value = qemu_clock_get_ms(clock) / 1000;

time_t is seconds on all systems we support.  Using it for something
other than seconds would be wrong.  The comment feels redundant to me.
But if it helps someone else...
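
For illustration only (standalone C, not QEMU code), here is the width
concern in numbers: a seconds difference spanning a few decades no
longer fits in a 32-bit int, while time_t stays wide enough on hosts
with 64-bit time_t:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* A difference of ~70 years, expressed in seconds, no longer
     * fits in a 32-bit int (the Y2038 class of problem). */
    int64_t seventy_years = 70LL * 365 * 24 * 3600;

    printf("70 years  = %" PRId64 " s\n", seventy_years);
    printf("INT32_MAX = %" PRId32 " s\n", INT32_MAX);
    printf("fits in int32? %s\n",
           seventy_years <= INT32_MAX ? "yes" : "no");
    return 0;
}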




[PATCH v2 16/18] hw/block/nvme: Add injection of Offline/Read-Only zones

2020-06-17 Thread Dmitry Fomichev
The ZNS specification defines two zone conditions for zones that can
no longer function properly, possibly because of flash wear or another
internal fault. It is useful to be able to "inject" a small number of
such zones for testing purposes.

This commit defines two optional driver properties, "offline_zones"
and "rdonly_zones". Users can assign non-zero values to these
properties to specify the number of zones to be initialized as Offline
or Read-Only. The actual number of injected zones may be smaller than
the requested amount, since the Read-Only and Offline counts are
expected to be much smaller than the total number of drive zones.
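
As a rough standalone sketch (a simplified model using rand() instead
of qcrypto_random_bytes(); the constants are arbitrary), the selection
loops in the diff below amount to rejection sampling over the zone
array:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    enum { NUM_ZONES = 32, MAX_OPEN = 4, NR_OFFLINE = 3 };
    int offline[NUM_ZONES] = { 0 };

    srand(42);
    for (int i = 0; i < NR_OFFLINE; i++) {
        uint32_t rnd;
        do {
            /* reject indices below max_open_zones and repeats */
            rnd = (uint32_t)rand() % NUM_ZONES;
        } while (rnd < MAX_OPEN || offline[rnd]);
        offline[rnd] = 1;
        printf("zone %u -> Offline\n", (unsigned)rnd);
    }
    return 0;
}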

Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c | 46 ++
 hw/block/nvme.h |  2 ++
 2 files changed, 48 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index eb41081627..14d5f1d155 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -2980,8 +2980,11 @@ static int nvme_init_zone_meta(NvmeCtrl *n, 
NvmeNamespace *ns,
 uint64_t capacity)
 {
 NvmeZone *zone;
+Error *err;
 uint64_t start = 0, zone_size = n->params.zone_size;
+uint32_t rnd;
 int i;
+uint16_t zs;
 
 ns->zone_array = g_malloc0(n->zone_array_size);
 ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
@@ -3011,6 +3014,37 @@ static int nvme_init_zone_meta(NvmeCtrl *n, 
NvmeNamespace *ns,
 start += zone_size;
 }
 
+/* If required, make some zones Offline or Read Only */
+
+for (i = 0; i < n->params.nr_offline_zones; i++) {
+do {
+qcrypto_random_bytes(&rnd, sizeof(rnd), &err);
+rnd %= n->num_zones;
+} while (rnd < n->params.max_open_zones);
+zone = &ns->zone_array[rnd];
+zs = nvme_get_zone_state(zone);
+if (zs != NVME_ZONE_STATE_OFFLINE) {
+nvme_set_zone_state(zone, NVME_ZONE_STATE_OFFLINE);
+} else {
+i--;
+}
+}
+
+for (i = 0; i < n->params.nr_rdonly_zones; i++) {
+do {
+qcrypto_random_bytes(&rnd, sizeof(rnd), &err);
+rnd %= n->num_zones;
+} while (rnd < n->params.max_open_zones);
+zone = &ns->zone_array[rnd];
+zs = nvme_get_zone_state(zone);
+if (zs != NVME_ZONE_STATE_OFFLINE &&
+zs != NVME_ZONE_STATE_READ_ONLY) {
+nvme_set_zone_state(zone, NVME_ZONE_STATE_READ_ONLY);
+} else {
+i--;
+}
+}
+
 return 0;
 }
 
@@ -3063,6 +3097,16 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error 
**errp)
 if (n->params.max_active_zones > nz) {
 n->params.max_active_zones = nz;
 }
+if (n->params.max_open_zones < nz) {
+if (n->params.nr_offline_zones > nz - n->params.max_open_zones) {
+n->params.nr_offline_zones = nz - n->params.max_open_zones;
+}
+if (n->params.nr_rdonly_zones >
+nz - n->params.max_open_zones - n->params.nr_offline_zones) {
+n->params.nr_rdonly_zones =
+nz - n->params.max_open_zones - n->params.nr_offline_zones;
+}
+}
 if (n->params.zd_extension_size) {
 if (n->params.zd_extension_size & 0x3f) {
 error_setg(errp,
@@ -3471,6 +3515,8 @@ static Property nvme_props[] = {
 DEFINE_PROP_UINT64("finish_rcmnd_delay", NvmeCtrl,
params.fzr_delay_usec, 0),
 DEFINE_PROP_UINT64("finish_rcmnd_limit", NvmeCtrl, params.frl_usec, 0),
+DEFINE_PROP_UINT32("offline_zones", NvmeCtrl, params.nr_offline_zones, 0),
+DEFINE_PROP_UINT32("rdonly_zones", NvmeCtrl, params.nr_rdonly_zones, 0),
 DEFINE_PROP_BOOL("zone_async_events", NvmeCtrl, params.zone_async_events,
  true),
 DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, 
true),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 4251295917..900fc54809 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -24,6 +24,8 @@ typedef struct NvmeParams {
 uint64_tzone_capacity;
 int32_t max_active_zones;
 int32_t max_open_zones;
+uint32_tnr_offline_zones;
+uint32_tnr_rdonly_zones;
 uint32_tzd_extension_size;
 uint64_trzr_delay_usec;
 uint64_trrl_usec;
-- 
2.21.0




[PATCH v2 08/18] hw/block/nvme: Make Zoned NS Command Set definitions

2020-06-17 Thread Dmitry Fomichev
Define values and structures that are needed to support Zoned
Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.

All new protocol definitions are located in include/block/nvme.h
and everything added that is specific to this implementation is kept
in hw/block/nvme.h.

In order to improve scalability, all open, closed and full zones
are organized in separate linked lists. Consequently, almost all
zone operations do not require scanning the entire zone array
(which can potentially be quite large); it is only necessary to
enumerate one or more zone lists. Zone lists are designed to be
position-independent so that they can be persisted to the backing
file as a part of zone metadata. The NvmeZoneList struct defined in
this patch serves as the head of every zone list.
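
A minimal standalone model of such a position-independent list, with
names simplified from the patch (links are array indices instead of
pointers, so the whole array can be persisted to a file as-is):

#include <stdint.h>
#include <stdio.h>

#define NIL UINT32_MAX

struct zone { uint32_t next, prev; };
struct zone_list { uint32_t head, tail, size; };

/* Append zone 'idx' to the tail of the list, as nvme_add_zone_tail()
 * does in the series. */
static void push_tail(struct zone *za, struct zone_list *zl, uint32_t idx)
{
    if (zl->size == 0) {
        zl->head = zl->tail = idx;
        za[idx].prev = za[idx].next = NIL;
    } else {
        za[zl->tail].next = idx;
        za[idx].prev = zl->tail;
        za[idx].next = NIL;
        zl->tail = idx;
    }
    zl->size++;
}

int main(void)
{
    struct zone zones[8];
    struct zone_list open = { NIL, NIL, 0 };

    push_tail(zones, &open, 3);
    push_tail(zones, &open, 5);
    /* traversal touches only the listed zones, not the whole array */
    for (uint32_t i = open.head; i != NIL; i = zones[i].next) {
        printf("zone %u\n", (unsigned)i);
    }
    return 0;
}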

NvmeZone structure encapsulates NvmeZoneDescriptor defined in Zoned
Command Set specification and adds a few more fields that are
internal to this implementation.

Signed-off-by: Niklas Cassel 
Signed-off-by: Hans Holmberg 
Signed-off-by: Ajay Joshi 
Signed-off-by: Matias Bjorling 
Signed-off-by: Shin'ichiro Kawasaki 
Signed-off-by: Alexey Bogoslavsky 
Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.h  | 130 +++
 include/block/nvme.h | 119 ++-
 2 files changed, 248 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 0d29f75475..2c932b5e29 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -3,12 +3,22 @@
 
 #include "block/nvme.h"
 
+#define NVME_DEFAULT_ZONE_SIZE   128 /* MiB */
+#define NVME_DEFAULT_MAX_ZA_SIZE 128 /* KiB */
+
 typedef struct NvmeParams {
 char *serial;
 uint32_t num_queues; /* deprecated since 5.1 */
 uint32_t max_ioqpairs;
 uint16_t msix_qsize;
 uint32_t cmb_size_mb;
+
+boolzoned;
+boolcross_zone_read;
+uint8_t fill_pattern;
+uint32_tzamds_bs;
+uint64_tzone_size;
+uint64_tzone_capacity;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -17,6 +27,8 @@ typedef struct NvmeAsyncEvent {
 
 enum NvmeRequestFlags {
 NVME_REQ_FLG_HAS_SG   = 1 << 0,
+NVME_REQ_FLG_FILL = 1 << 1,
+NVME_REQ_FLG_APPEND   = 1 << 2,
 };
 
 typedef struct NvmeRequest {
@@ -24,6 +36,7 @@ typedef struct NvmeRequest {
 BlockAIOCB  *aiocb;
 uint16_tstatus;
 uint16_tflags;
+uint64_tfill_ofs;
 NvmeCqe cqe;
 BlockAcctCookie acct;
 QEMUSGList  qsg;
@@ -61,11 +74,35 @@ typedef struct NvmeCQueue {
 QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
 
+typedef struct NvmeZone {
+NvmeZoneDescr   d;
+uint64_ttstamp;
+uint32_tnext;
+uint32_tprev;
+uint8_t rsvd80[8];
+} NvmeZone;
+
+#define NVME_ZONE_LIST_NILUINT_MAX
+
+typedef struct NvmeZoneList {
+uint32_thead;
+uint32_ttail;
+uint32_tsize;
+uint8_t rsvd12[4];
+} NvmeZoneList;
+
 typedef struct NvmeNamespace {
 NvmeIdNsid_ns;
 uint32_tnsid;
 uint8_t csi;
 QemuUUIDuuid;
+
+NvmeIdNsZoned   *id_ns_zoned;
+NvmeZone*zone_array;
+NvmeZoneList*exp_open_zones;
+NvmeZoneList*imp_open_zones;
+NvmeZoneList*closed_zones;
+NvmeZoneList*full_zones;
 } NvmeNamespace;
 
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
@@ -100,6 +137,7 @@ typedef struct NvmeCtrl {
 uint32_tnum_namespaces;
 uint32_tmax_q_ents;
 uint64_tns_size;
+
 uint8_t *cmbuf;
 uint32_tirq_status;
 uint64_thost_timestamp; /* Timestamp sent by the host 
*/
@@ -107,6 +145,12 @@ typedef struct NvmeCtrl {
 
 HostMemoryBackend *pmrdev;
 
+int zone_file_fd;
+uint32_tnum_zones;
+uint64_tzone_size_bs;
+uint64_tzone_array_size;
+uint8_t zamds;
+
 NvmeNamespace   *namespaces;
 NvmeSQueue  **sq;
 NvmeCQueue  **cq;
@@ -121,6 +165,86 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, 
NvmeNamespace *ns)
 return n->ns_size >> nvme_ns_lbads(ns);
 }
 
+static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
+{
+return zone->d.zs >> 4;
+}
+
+static inline void nvme_set_zone_state(NvmeZone *zone, enum NvmeZoneState 
state)
+{
+zone->d.zs = state << 4;
+}
+
+static inline uint64_t nvme_zone_rd_boundary(NvmeCtrl *n, NvmeZone *zone)
+{
+return zone->d.zslba + n->params.zone_size;
+}
+
+static inline uint64_t nvme_zone_wr_boundary(NvmeZone *zone)
+{
+return zone->d.zslba + zone->d.zcap;
+}
+
+static inline bool nvme_wp_is_valid(NvmeZone *zone)
+{
+uint8_t st = nvme_get_zone_state(zone);
+
+return st != NVME_ZONE_STATE_FULL &&
+   st != NVME_ZONE_STATE_READ_ONLY &&
+   st != NVME_ZONE_STATE_OFFLINE;
+}
+
+/*
+ * Initialize a zone list head.
+ 

[PATCH] block: Raise an error when backing file parameter is an empty string

2020-06-17 Thread Connor Kuehl
Providing an empty string for the backing file parameter like so:

qemu-img create -f qcow2 -b '' /tmp/foo

allows the flow of control to reach and subsequently fail an assert
statement because passing an empty string to

bdrv_get_full_backing_filename_from_filename()

simply results in NULL being returned without an error being raised.

To fix this, let's check for an empty string when getting the value from
the opts list.

Reported-by: Attila Fazekas 
Fixes: https://bugzilla.redhat.com/1809553
Signed-off-by: Connor Kuehl 
---
 block.c|  4 
 tests/qemu-iotests/298 | 47 ++
 tests/qemu-iotests/298.out |  5 
 tests/qemu-iotests/group   |  1 +
 4 files changed, 57 insertions(+)
 create mode 100755 tests/qemu-iotests/298
 create mode 100644 tests/qemu-iotests/298.out

diff --git a/block.c b/block.c
index 6dbcb7e083..b335d6bcb2 100644
--- a/block.c
+++ b/block.c
@@ -6116,6 +6116,10 @@ void bdrv_img_create(const char *filename, const char 
*fmt,
  "same filename as the backing file");
 goto out;
 }
+if (backing_file[0] == '\0') {
+error_setg(errp, "Expected backing file name, got empty string");
+goto out;
+}
 }
 
 backing_fmt = qemu_opt_get(opts, BLOCK_OPT_BACKING_FMT);
diff --git a/tests/qemu-iotests/298 b/tests/qemu-iotests/298
new file mode 100755
index 00..1e30caebec
--- /dev/null
+++ b/tests/qemu-iotests/298
@@ -0,0 +1,47 @@
+#!/usr/bin/env python3
+#
+# Copyright (C) 2020 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+
+
+
+# Regression test for avoiding an assertion when the 'backing file'
+# parameter (-b) is set to an empty string for qemu-img create
+#
+#   qemu-img create -f qcow2 -b '' /tmp/foo
+#
+# This ensures the invalid parameter is handled with a user-
+# friendly message instead of a failed assertion.
+
+import iotests
+
+class TestEmptyBackingFilename(iotests.QMPTestCase):
+
+
+def test_empty_backing_file_name(self):
+actual = iotests.qemu_img_pipe(
+'create',
+'-f', 'qcow2',
+'-b', '',
+'/tmp/foo'
+)
+expected = 'qemu-img: /tmp/foo: Expected backing file name,' \
+   ' got empty string'
+
+self.assertEqual(actual.strip(), expected.strip())
+
+
+if __name__ == '__main__':
+iotests.main(supported_fmts=['raw', 'qcow2'])
diff --git a/tests/qemu-iotests/298.out b/tests/qemu-iotests/298.out
new file mode 100644
index 00..ae1213e6f8
--- /dev/null
+++ b/tests/qemu-iotests/298.out
@@ -0,0 +1,5 @@
+.
+--
+Ran 1 tests
+
+OK
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index d886fa0cb3..4bca2d9e05 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -302,3 +302,4 @@
 291 rw quick
 292 rw auto quick
 297 meta
+298 img auto quick
-- 
2.25.4




[PATCH v2 11/18] hw/block/nvme: Introduce max active and open zone limits

2020-06-17 Thread Dmitry Fomichev
Added two module properties, "max_active" and "max_open", to control
the maximum number of zones that can be active or open. Once these
properties are set to non-default values, the driver checks the
limits during I/O and returns Too Many Active or Too Many Open
command status if they are exceeded.
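
In outline, the limit check works as below (a simplified standalone
model of the nvme_aor_check() logic in the diff; a zero limit means
unlimited):

#include <stdio.h>

struct ns { int nr_active, nr_open; };
struct params { int max_active, max_open; }; /* 0 == no limit */

/* Would making 'act' more zones active and 'opn' more zones open
 * exceed a configured limit? */
static const char *aor_check(const struct params *p, const struct ns *ns,
                             int act, int opn)
{
    if (p->max_active && ns->nr_active + act > p->max_active) {
        return "Too Many Active Zones";
    }
    if (p->max_open && ns->nr_open + opn > p->max_open) {
        return "Too Many Open Zones";
    }
    return "OK";
}

int main(void)
{
    struct params p = { .max_active = 4, .max_open = 2 };
    struct ns ns = { .nr_active = 3, .nr_open = 2 };

    printf("%s\n", aor_check(&p, &ns, 1, 1)); /* opening a 3rd zone */
    printf("%s\n", aor_check(&p, &ns, 1, 0)); /* activating only */
    return 0;
}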

Signed-off-by: Hans Holmberg 
Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c | 183 +++-
 hw/block/nvme.h |   4 ++
 2 files changed, 185 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 2e03b0b6ed..05a7cbcfcc 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -120,6 +120,87 @@ static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace 
*ns, NvmeZoneList *zl,
 zone->prev = zone->next = 0;
 }
 
+/*
+ * Take the first zone out from a list, return NULL if the list is empty.
+ */
+static NvmeZone *nvme_remove_zone_head(NvmeCtrl *n, NvmeNamespace *ns,
+NvmeZoneList *zl)
+{
+NvmeZone *zone = nvme_peek_zone_head(ns, zl);
+
+if (zone) {
+--zl->size;
+if (zl->size == 0) {
+zl->head = NVME_ZONE_LIST_NIL;
+zl->tail = NVME_ZONE_LIST_NIL;
+} else {
+zl->head = zone->next;
+ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+}
+zone->prev = zone->next = 0;
+}
+
+return zone;
+}
+
+/*
+ * Check if we can open a zone without exceeding open/active limits.
+ * AOR stands for "Active and Open Resources" (see TP 4053 section 2.5).
+ */
+static int nvme_aor_check(NvmeCtrl *n, NvmeNamespace *ns,
+ uint32_t act, uint32_t opn)
+{
+if (n->params.max_active_zones != 0 &&
+ns->nr_active_zones + act > n->params.max_active_zones) {
+trace_pci_nvme_err_insuff_active_res(n->params.max_active_zones);
+return NVME_ZONE_TOO_MANY_ACTIVE | NVME_DNR;
+}
+if (n->params.max_open_zones != 0 &&
+ns->nr_open_zones + opn > n->params.max_open_zones) {
+trace_pci_nvme_err_insuff_open_res(n->params.max_open_zones);
+return NVME_ZONE_TOO_MANY_OPEN | NVME_DNR;
+}
+
+return NVME_SUCCESS;
+}
+
+static inline void nvme_aor_inc_open(NvmeCtrl *n, NvmeNamespace *ns)
+{
+assert(ns->nr_open_zones >= 0);
+if (n->params.max_open_zones) {
+ns->nr_open_zones++;
+assert(ns->nr_open_zones <= n->params.max_open_zones);
+}
+}
+
+static inline void nvme_aor_dec_open(NvmeCtrl *n, NvmeNamespace *ns)
+{
+if (n->params.max_open_zones) {
+assert(ns->nr_open_zones > 0);
+ns->nr_open_zones--;
+}
+assert(ns->nr_open_zones >= 0);
+}
+
+static inline void nvme_aor_inc_active(NvmeCtrl *n, NvmeNamespace *ns)
+{
+assert(ns->nr_active_zones >= 0);
+if (n->params.max_active_zones) {
+ns->nr_active_zones++;
+assert(ns->nr_active_zones <= n->params.max_active_zones);
+}
+}
+
+static inline void nvme_aor_dec_active(NvmeCtrl *n, NvmeNamespace *ns)
+{
+if (n->params.max_active_zones) {
+assert(ns->nr_active_zones > 0);
+ns->nr_active_zones--;
+assert(ns->nr_active_zones >= ns->nr_open_zones);
+}
+assert(ns->nr_active_zones >= 0);
+}
+
 static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
 NvmeZone *zone, uint8_t state)
 {
@@ -454,6 +535,24 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
+bool implicit, bool adding_active)
+{
+NvmeZone *zone;
+
+if (implicit && n->params.max_open_zones &&
+ns->nr_open_zones == n->params.max_open_zones) {
+zone = nvme_remove_zone_head(n, ns, ns->imp_open_zones);
+if (zone) {
+/*
+ * Automatically close this implicitly open zone.
+ */
+nvme_aor_dec_open(n, ns);
+nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
+}
+}
+}
+
 static uint16_t nvme_check_zone_write(NvmeZone *zone, uint64_t slba,
 uint32_t nlb)
 {
@@ -531,6 +630,23 @@ static uint16_t nvme_check_zone_read(NvmeCtrl *n, NvmeZone 
*zone, uint64_t slba,
 return status;
 }
 
+static uint16_t nvme_auto_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
+NvmeZone *zone)
+{
+uint16_t status = NVME_SUCCESS;
+uint8_t zs = nvme_get_zone_state(zone);
+
+if (zs == NVME_ZONE_STATE_EMPTY) {
+nvme_auto_transition_zone(n, ns, true, true);
+status = nvme_aor_check(n, ns, 1, 1);
+} else if (zs == NVME_ZONE_STATE_CLOSED) {
+nvme_auto_transition_zone(n, ns, true, false);
+status = nvme_aor_check(n, ns, 0, 1);
+}
+
+return status;
+}
+
 static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
 NvmeZone *zone, uint32_t nlb)
 {
@@ -543,7 +659,11 @@ static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, 
NvmeNamespace *ns,
 switch (zs) {
 case NVM

[PATCH v2 12/18] hw/block/nvme: Simulate Zone Active excursions

2020-06-17 Thread Dmitry Fomichev
Added a Boolean flag to turn on simulation of Zone Active Excursions.
If the flag, "active_excursions", is set to true, the driver will try
to finish one of the currently open zones when the max active zones
limit is about to be exceeded.
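
A simplified standalone model of the excursion (assumed behavior per
the diff below: at the active limit, the head of the closed-zones list
is finished by the controller to free an active-resource slot):

#include <stdio.h>

int main(void)
{
    int max_active = 4, nr_active = 4;
    int closed_head = 7; /* index of the oldest Closed zone, -1 if none */

    if (nr_active == max_active && closed_head >= 0) {
        /* finish the zone instead of failing the new request */
        printf("zone %d: Closed -> Full (Finished By Controller)\n",
               closed_head);
        nr_active--; /* a Full zone no longer consumes an active slot */
    }
    printf("active zones now %d of %d\n", nr_active, max_active);
    return 0;
}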

Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c | 24 +++-
 hw/block/nvme.h |  1 +
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 05a7cbcfcc..a29cbfcc96 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -540,6 +540,26 @@ static void nvme_auto_transition_zone(NvmeCtrl *n, 
NvmeNamespace *ns,
 {
 NvmeZone *zone;
 
+if (n->params.active_excursions && adding_active &&
+n->params.max_active_zones &&
+ns->nr_active_zones == n->params.max_active_zones) {
+zone = nvme_peek_zone_head(ns, ns->closed_zones);
+if (zone) {
+/*
+ * The namespace is at the limit of active zones.
+ * Try to finish one of the currently active zones
+ * to make the needed active zone resource available.
+ */
+nvme_aor_dec_active(n, ns);
+nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
+zone->d.za &= ~(NVME_ZA_FINISH_RECOMMENDED |
+NVME_ZA_RESET_RECOMMENDED);
+zone->d.za |= NVME_ZA_FINISHED_BY_CTLR;
+zone->tstamp = 0;
+trace_pci_nvme_zone_finished_by_controller(zone->d.zslba);
+}
+}
+
 if (implicit && n->params.max_open_zones &&
 ns->nr_open_zones == n->params.max_open_zones) {
 zone = nvme_remove_zone_head(n, ns, ns->imp_open_zones);
@@ -2631,7 +2651,7 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace 
*ns, int lba_index,
 /* MAR/MOR are zeroes-based, 0x means no limit */
 ns->id_ns_zoned->mar = cpu_to_le32(n->params.max_active_zones - 1);
 ns->id_ns_zoned->mor = cpu_to_le32(n->params.max_open_zones - 1);
-ns->id_ns_zoned->zoc = 0;
+ns->id_ns_zoned->zoc = cpu_to_le16(n->params.active_excursions ? 0x2 : 0);
 ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
 
 ns->id_ns_zoned->lbafe[lba_index].zsze = cpu_to_le64(n->params.zone_size);
@@ -2993,6 +3013,8 @@ static Property nvme_props[] = {
 DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
 DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),
 DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, 
true),
+DEFINE_PROP_BOOL("active_excursions", NvmeCtrl, params.active_excursions,
+ false),
 DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
 DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index f5a4679702..8a0aaeb09a 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -15,6 +15,7 @@ typedef struct NvmeParams {
 
 boolzoned;
 boolcross_zone_read;
+boolactive_excursions;
 uint8_t fill_pattern;
 uint32_tzamds_bs;
 uint64_tzone_size;
-- 
2.21.0




[PATCH v2 15/18] hw/block/nvme: Support Zone Descriptor Extensions

2020-06-17 Thread Dmitry Fomichev
A Zone Descriptor Extension is a label that can be assigned to a
zone. It can be set on an Empty zone and it stays assigned until the
zone is reset.

This commit adds a new optional property, "zone_descr_ext_size", to
the driver. Its value must be a multiple of 64 bytes. If this value
is non-zero, it becomes possible to assign extensions of that size
to any Empty zone. The default value for this property is 0, so
setting extensions is disabled by default.
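
A minimal sketch of the state rule this implements (as inferred from
the diff: assigning an extension is only valid for an Empty zone and
implicitly closes it):

#include <stdio.h>

enum zstate { EMPTY, IMP_OPEN, EXP_OPEN, CLOSED, FULL, READ_ONLY, OFFLINE };

struct zone { enum zstate state; int zd_ext_valid; };

/* Simplified model of nvme_set_zd_ext(): succeed only for an Empty
 * zone, mark the extension valid and move the zone to Closed. */
static int set_zd_ext(struct zone *z)
{
    if (z->state != EMPTY) {
        return -1; /* Invalid Zone State Transition */
    }
    z->zd_ext_valid = 1;
    z->state = CLOSED;
    return 0;
}

int main(void)
{
    struct zone z = { EMPTY, 0 };

    printf("set on Empty zone: %d (state now %d)\n", set_zd_ext(&z), z.state);
    printf("set again:         %d\n", set_zd_ext(&z));
    return 0;
}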

Signed-off-by: Hans Holmberg 
Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c | 76 ++---
 hw/block/nvme.h |  8 ++
 2 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b9135a6b1f..eb41081627 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1360,6 +1360,26 @@ static bool nvme_cond_offline_all(uint8_t state)
 return state == NVME_ZONE_STATE_READ_ONLY;
 }
 
+static uint16_t nvme_set_zd_ext(NvmeCtrl *n, NvmeNamespace *ns,
+NvmeZone *zone, uint8_t state)
+{
+uint16_t status;
+
+if (state == NVME_ZONE_STATE_EMPTY) {
+nvme_auto_transition_zone(n, ns, false, true);
+status = nvme_aor_check(n, ns, 1, 0);
+if (status != NVME_SUCCESS) {
+return status;
+}
+nvme_aor_inc_active(n, ns);
+zone->d.za |= NVME_ZA_ZD_EXT_VALID;
+nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
+return NVME_SUCCESS;
+}
+
+return NVME_ZONE_INVAL_TRANSITION;
+}
+
 static uint16_t name_do_zone_op(NvmeCtrl *n, NvmeNamespace *ns,
 NvmeZone *zone, uint8_t state, bool all,
 uint16_t (*op_hndlr)(NvmeCtrl *, NvmeNamespace *, NvmeZone *,
@@ -1388,13 +1408,16 @@ static uint16_t name_do_zone_op(NvmeCtrl *n, 
NvmeNamespace *ns,
 static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeNamespace *ns,
 NvmeCmd *cmd, NvmeRequest *req)
 {
+NvmeRwCmd *rw;
 uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+uint64_t prp1, prp2;
 uint64_t slba = 0;
 uint64_t zone_idx = 0;
 uint16_t status;
 uint8_t action, state;
 bool all;
 NvmeZone *zone;
+uint8_t *zd_ext;
 
 action = dw13 & 0xff;
 all = dw13 & 0x100;
@@ -1449,7 +1472,25 @@ static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, 
NvmeNamespace *ns,
 
 case NVME_ZONE_ACTION_SET_ZD_EXT:
 trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
-return NVME_INVALID_FIELD | NVME_DNR;
+if (all || !n->params.zd_extension_size) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+zd_ext = nvme_get_zd_extension(n, ns, zone_idx);
+rw = (NvmeRwCmd *)cmd;
+prp1 = le64_to_cpu(rw->prp1);
+prp2 = le64_to_cpu(rw->prp2);
+status = nvme_dma_write_prp(n, zd_ext, n->params.zd_extension_size,
+prp1, prp2);
+if (status) {
+trace_pci_nvme_err_zd_extension_map_error(zone_idx);
+return status;
+}
+
+status = nvme_set_zd_ext(n, ns, zone, state);
+if (status == NVME_SUCCESS) {
+trace_pci_nvme_zd_extension_set(zone_idx);
+return status;
+}
 break;
 
 default:
@@ -1528,7 +1569,7 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, 
NvmeNamespace *ns,
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-if (zra == NVME_ZONE_REPORT_EXTENDED) {
+if (zra == NVME_ZONE_REPORT_EXTENDED && !n->params.zd_extension_size) {
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
@@ -1540,6 +1581,9 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, 
NvmeNamespace *ns,
 partial = (dw13 >> 16) & 0x01;
 
 zone_entry_sz = sizeof(NvmeZoneDescr);
+if (zra == NVME_ZONE_REPORT_EXTENDED) {
+zone_entry_sz += n->params.zd_extension_size;
+}
 
 max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
 buf = g_malloc0(len);
@@ -1571,6 +1615,14 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, 
NvmeNamespace *ns,
 z->wp = cpu_to_le64(~0ULL);
 }
 
+if (zra == NVME_ZONE_REPORT_EXTENDED) {
+if (zs->d.za & NVME_ZA_ZD_EXT_VALID) {
+memcpy(buf_p, nvme_get_zd_extension(n, ns, zone_index),
+   n->params.zd_extension_size);
+}
+buf_p += n->params.zd_extension_size;
+}
+
 zone_index++;
 }
 
@@ -2337,7 +2389,7 @@ static uint16_t nvme_handle_changed_zone_log(NvmeCtrl *n, 
NvmeCmd *cmd,
 continue;
 }
 num_aen_zones++;
-if (zone->d.za) {
+if (zone->d.za & ~NVME_ZA_ZD_EXT_VALID) {
 trace_pci_nvme_reporting_changed_zone(zone->d.zslba, zone->d.za);
 *zid_ptr++ = cpu_to_le64(zone->d.zslba);
 nids++;
@@ -2936,6 +2988,7 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace 
*ns,
 ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
 ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
 n

[PATCH v2 13/18] hw/block/nvme: Set Finish/Reset Zone Recommended attributes

2020-06-17 Thread Dmitry Fomichev
Added logic to set and reset the FZR and RZR zone attributes. Four
new driver properties are added to control the timing of setting and
resetting these attributes. The FZR/RZR delay is the time from the
triggering zone operation until the corresponding zone attribute is
set. The FZR/RZR limit is the time period between setting the FZR or
RZR attribute and resetting it, simulating the internal controller
action on that zone.
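
Read concretely (hypothetical microsecond values; the property names
match the diff below), the timeline for RZR works out as:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t t_op = 0;                 /* zone enters Full state */
    uint64_t rzr_delay_usec = 5000000; /* "reset_rcmnd_delay" */
    uint64_t rrl_usec = 10000000;      /* "reset_rcmnd_limit" */

    /* attribute is set 'delay' after the operation and cleared
     * 'limit' after it was set */
    printf("RZR set at     t = %" PRIu64 " us\n", t_op + rzr_delay_usec);
    printf("RZR cleared at t = %" PRIu64 " us\n",
           t_op + rzr_delay_usec + rrl_usec);
    return 0;
}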

Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c | 99 +
 hw/block/nvme.h | 13 ++-
 2 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a29cbfcc96..c3898448c7 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -201,6 +201,84 @@ static inline void nvme_aor_dec_active(NvmeCtrl *n, 
NvmeNamespace *ns)
 assert(ns->nr_active_zones >= 0);
 }
 
+static void nvme_set_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+assert(zone->flags & NVME_ZFLAGS_SET_RZR);
+zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
+zone->d.za |= NVME_ZA_RESET_RECOMMENDED;
+zone->flags &= ~NVME_ZFLAGS_SET_RZR;
+trace_pci_nvme_zone_reset_recommended(zone->d.zslba);
+}
+
+static void nvme_clear_rzr(NvmeCtrl *n, NvmeNamespace *ns,
+NvmeZone *zone, bool notify)
+{
+if (n->params.rrl_usec) {
+zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY);
+notify = notify && (zone->d.za & NVME_ZA_RESET_RECOMMENDED);
+zone->d.za &= ~NVME_ZA_RESET_RECOMMENDED;
+zone->tstamp = 0;
+}
+}
+
+static void nvme_set_fzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+assert(zone->flags & NVME_ZFLAGS_SET_FZR);
+zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
+zone->d.za |= NVME_ZA_FINISH_RECOMMENDED;
+zone->flags &= ~NVME_ZFLAGS_SET_FZR;
+trace_pci_nvme_zone_finish_recommended(zone->d.zslba);
+}
+
+static void nvme_clear_fzr(NvmeCtrl *n, NvmeNamespace *ns,
+NvmeZone *zone, bool notify)
+{
+if (n->params.frl_usec) {
+zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY);
+notify = notify && (zone->d.za & NVME_ZA_FINISH_RECOMMENDED);
+zone->d.za &= ~NVME_ZA_FINISH_RECOMMENDED;
+zone->tstamp = 0;
+}
+}
+
+static void nvme_schedule_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+if (n->params.frl_usec) {
+zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY);
+zone->d.za &= ~NVME_ZA_FINISH_RECOMMENDED;
+zone->tstamp = 0;
+}
+if (n->params.rrl_usec) {
+zone->flags |= NVME_ZFLAGS_SET_RZR;
+if (n->params.rzr_delay_usec) {
+zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+zone->flags |= NVME_ZFLAGS_TS_DELAY;
+} else {
+nvme_set_rzr(n, ns, zone);
+}
+}
+}
+
+static void nvme_schedule_fzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+if (n->params.rrl_usec) {
+zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY);
+zone->d.za &= ~NVME_ZA_RESET_RECOMMENDED;
+zone->tstamp = 0;
+}
+if (n->params.frl_usec) {
+zone->flags |= NVME_ZFLAGS_SET_FZR;
+if (n->params.fzr_delay_usec) {
+zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+zone->flags |= NVME_ZFLAGS_TS_DELAY;
+} else {
+nvme_set_fzr(n, ns, zone);
+}
+}
+}
+
 static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
 NvmeZone *zone, uint8_t state)
 {
@@ -208,15 +286,19 @@ static void nvme_assign_zone_state(NvmeCtrl *n, 
NvmeNamespace *ns,
 switch (nvme_get_zone_state(zone)) {
 case NVME_ZONE_STATE_EXPLICITLY_OPEN:
 nvme_remove_zone(n, ns, ns->exp_open_zones, zone);
+nvme_clear_fzr(n, ns, zone, false);
 break;
 case NVME_ZONE_STATE_IMPLICITLY_OPEN:
 nvme_remove_zone(n, ns, ns->imp_open_zones, zone);
+nvme_clear_fzr(n, ns, zone, false);
 break;
 case NVME_ZONE_STATE_CLOSED:
 nvme_remove_zone(n, ns, ns->closed_zones, zone);
+nvme_clear_fzr(n, ns, zone, false);
 break;
 case NVME_ZONE_STATE_FULL:
 nvme_remove_zone(n, ns, ns->full_zones, zone);
+nvme_clear_rzr(n, ns, zone, false);
 }
}
 
@@ -225,15 +307,19 @@ static void nvme_assign_zone_state(NvmeCtrl *n, 
NvmeNamespace *ns,
 switch (state) {
 case NVME_ZONE_STATE_EXPLICITLY_OPEN:
 nvme_add_zone_tail(n, ns, ns->exp_open_zones, zone);
+nvme_schedule_fzr(n, ns, zone);
 break;
 case NVME_ZONE_STATE_IMPLICITLY_OPEN:
 nvme_add_zone_tail(n, ns, ns->imp_open_zones, zone);
+nvme_schedule_fzr(n, ns, zone);
 break;
 case NVME_ZONE_STATE_CLOSED:
 nvme_add_zone_tail(n, ns, ns->closed_zones, zone);
+nvme_schedu

[PATCH v2 14/18] hw/block/nvme: Generate zone AENs

2020-06-17 Thread Dmitry Fomichev
Added an optional Boolean "zone_async_events" property to the driver.
Once it is turned on, the namespace will send "Zone Descriptor
Changed" asynchronous events to the host in particular situations
defined by the protocol. In order to clear these AENs, the host needs
to read the newly added Changed Zones Log.
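
A standalone sketch of how the AEN completion dword is packed in this
series (bit layout per the diff below: bits 2:0 event type, 15:8 event
info, 23:16 log page; the Zone Descriptor Changed info value 0xef and
Changed Zone List log page 0xbf are the ZNS values this series
assumes):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t type = 0x2;      /* Notice */
    uint32_t info = 0xef;     /* Zone Descriptor Changed */
    uint32_t log_page = 0xbf; /* Changed Zone List log */

    uint32_t res = type | (info << 8) | (log_page << 16);
    printf("AEN result dword = 0x%08" PRIx32 "\n", res);
    return 0;
}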

Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c  | 300 ++-
 hw/block/nvme.h  |  13 +-
 include/block/nvme.h |  23 +++-
 3 files changed, 328 insertions(+), 8 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index c3898448c7..b9135a6b1f 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -201,12 +201,66 @@ static inline void nvme_aor_dec_active(NvmeCtrl *n, 
NvmeNamespace *ns)
 assert(ns->nr_active_zones >= 0);
 }
 
+static bool nvme_complete_async_req(NvmeCtrl *n, NvmeNamespace *ns,
+enum NvmeAsyncEventType type, uint8_t info)
+{
+NvmeAsyncEvent *ae;
+uint32_t nsid = 0;
+uint8_t log_page = 0;
+
+switch (type) {
+case NVME_AER_TYPE_ERROR:
+case NVME_AER_TYPE_SMART:
+break;
+case NVME_AER_TYPE_NOTICE:
+switch (info) {
+case NVME_AER_NOTICE_ZONE_DESCR_CHANGED:
+log_page = NVME_LOG_ZONE_CHANGED_LIST;
+nsid = ns->nsid;
+if (!(n->ae_cfg & NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES)) {
+trace_pci_nvme_zone_ae_not_enabled(info, log_page, nsid);
+return false;
+}
+if (ns->aen_pending) {
+trace_pci_nvme_zone_ae_not_cleared(info, log_page, nsid);
+return false;
+}
+ns->aen_pending = true;
+}
+break;
+case NVME_AER_TYPE_CMDSET_SPECIFIC:
+case NVME_AER_TYPE_VENDOR_SPECIFIC:
+break;
+}
+
+ae = g_malloc0(sizeof(*ae));
+ae->res = type;
+ae->res |= (info << 8) & 0xff00;
+ae->res |= (log_page << 16) & 0xff;
+ae->nsid = nsid;
+
+QTAILQ_INSERT_TAIL(&n->async_reqs, ae, entry);
+timer_mod(n->admin_cq.timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+return true;
+}
+
+static inline void nvme_notify_zone_changed(NvmeCtrl *n, NvmeNamespace *ns,
+NvmeZone *zone)
+{
+if (n->ae_cfg) {
+zone->flags |= NVME_ZFLAGS_AEN_PEND;
+nvme_complete_async_req(n, ns, NVME_AER_TYPE_NOTICE,
+NVME_AER_NOTICE_ZONE_DESCR_CHANGED);
+}
+}
+
 static void nvme_set_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
 {
 assert(zone->flags & NVME_ZFLAGS_SET_RZR);
 zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
 zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
 zone->d.za |= NVME_ZA_RESET_RECOMMENDED;
+nvme_notify_zone_changed(n, ns, zone);
 zone->flags &= ~NVME_ZFLAGS_SET_RZR;
 trace_pci_nvme_zone_reset_recommended(zone->d.zslba);
 }
@@ -215,10 +269,14 @@ static void nvme_clear_rzr(NvmeCtrl *n, NvmeNamespace *ns,
 NvmeZone *zone, bool notify)
 {
 if (n->params.rrl_usec) {
-zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY);
+zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY |
+ NVME_ZFLAGS_AEN_PEND);
 notify = notify && (zone->d.za & NVME_ZA_RESET_RECOMMENDED);
 zone->d.za &= ~NVME_ZA_RESET_RECOMMENDED;
 zone->tstamp = 0;
+if (notify) {
+nvme_notify_zone_changed(n, ns, zone);
+}
 }
 }
 
@@ -228,6 +286,7 @@ static void nvme_set_fzr(NvmeCtrl *n, NvmeNamespace *ns, 
NvmeZone *zone)
 zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
 zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
 zone->d.za |= NVME_ZA_FINISH_RECOMMENDED;
+nvme_notify_zone_changed(n, ns, zone);
 zone->flags &= ~NVME_ZFLAGS_SET_FZR;
 trace_pci_nvme_zone_finish_recommended(zone->d.zslba);
 }
@@ -236,13 +295,61 @@ static void nvme_clear_fzr(NvmeCtrl *n, NvmeNamespace *ns,
 NvmeZone *zone, bool notify)
 {
 if (n->params.frl_usec) {
-zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY);
+zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY |
+ NVME_ZFLAGS_AEN_PEND);
 notify = notify && (zone->d.za & NVME_ZA_FINISH_RECOMMENDED);
 zone->d.za &= ~NVME_ZA_FINISH_RECOMMENDED;
 zone->tstamp = 0;
+if (notify) {
+nvme_notify_zone_changed(n, ns, zone);
+}
 }
 }
 
+static bool nvme_process_rrl(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+if (zone->flags & NVME_ZFLAGS_SET_RZR) {
+if (zone->flags & NVME_ZFLAGS_TS_DELAY) {
+assert(!(zone->d.za & NVME_ZA_RESET_RECOMMENDED));
+if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - zone->tstamp >=
+n->params.rzr_delay_usec) {
+nvme_set_rzr(n, ns, zone);
+return true;
+}
+} else if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - zone->tstamp >=
+   n->params.rrl_us

[PATCH v2 17/18] hw/block/nvme: Use zone metadata file for persistence

2020-06-17 Thread Dmitry Fomichev
A ZNS drive that is emulated by this driver is currently initialized
with all zones Empty upon startup. However, actual ZNS SSDs save the
state and condition of all zones in their internal NVRAM in the event
of power loss. When such a drive is powered up again, it closes or
finishes all zones that were open at the moment of shutdown. Besides
that, the write pointer position as well as the state and condition
of all zones is preserved across power-downs.

This commit adds the capability to keep persistent zone metadata
in the driver. The new optional driver property, "zone_file",
specifies the name of the file that stores the zone metadata. If
"zone_file" is omitted, the driver will initialize with all zones
empty, the same as before.

If zone metadata is configured to be persistent, then zone descriptor
extensions also persist across controller shutdowns.
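
In outline, validating such a metadata file looks like the sketch
below (a simplified standalone model of the nvme_validate_zone_file()
checks in the diff; header fields, magic value and version are
illustrative):

#include <stdint.h>
#include <stdio.h>

struct zone_meta_hdr {
    uint32_t magic;
    uint32_t version;
    uint64_t zone_size;
    uint64_t zone_capacity;
};

#define META_MAGIC 0x4e5a4d45u /* hypothetical magic */
#define META_VER   1

/* Reject the file unless magic, version and geometry all match the
 * current device configuration; distinct return codes identify the
 * first mismatch, as in the patch. */
static int validate(const struct zone_meta_hdr *h,
                    uint64_t zsze, uint64_t zcap)
{
    if (h->magic != META_MAGIC)   return 1;
    if (h->version != META_VER)   return 2;
    if (h->zone_size != zsze)     return 3;
    if (h->zone_capacity != zcap) return 4;
    return 0;
}

int main(void)
{
    struct zone_meta_hdr h = { META_MAGIC, META_VER, 512, 448 };

    printf("validate -> %d\n", validate(&h, 512, 448)); /* 0 == usable */
    return 0;
}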

Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c | 371 +---
 hw/block/nvme.h |  38 +
 2 files changed, 388 insertions(+), 21 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 14d5f1d155..63e7a6352e 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -69,6 +69,8 @@
 } while (0)
 
 static void nvme_process_sq(void *opaque);
+static void nvme_sync_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
+NvmeZone *zone, int len);
 
 /*
  * Add a zone to the tail of a zone list.
@@ -90,6 +92,7 @@ static void nvme_add_zone_tail(NvmeCtrl *n, NvmeNamespace 
*ns, NvmeZoneList *zl,
 zl->tail = idx;
 }
 zl->size++;
+nvme_set_zone_meta_dirty(n, ns, true);
 }
 
 /*
@@ -106,12 +109,15 @@ static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace 
*ns, NvmeZoneList *zl,
 if (zl->size == 0) {
 zl->head = NVME_ZONE_LIST_NIL;
 zl->tail = NVME_ZONE_LIST_NIL;
+nvme_set_zone_meta_dirty(n, ns, true);
 } else if (idx == zl->head) {
 zl->head = zone->next;
 ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+nvme_set_zone_meta_dirty(n, ns, true);
 } else if (idx == zl->tail) {
 zl->tail = zone->prev;
 ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
+nvme_set_zone_meta_dirty(n, ns, true);
 } else {
 ns->zone_array[zone->next].prev = zone->prev;
 ns->zone_array[zone->prev].next = zone->next;
@@ -138,6 +144,7 @@ static NvmeZone *nvme_remove_zone_head(NvmeCtrl *n, 
NvmeNamespace *ns,
 ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
 }
 zone->prev = zone->next = 0;
+nvme_set_zone_meta_dirty(n, ns, true);
 }
 
 return zone;
@@ -476,6 +483,7 @@ static void nvme_assign_zone_state(NvmeCtrl *n, 
NvmeNamespace *ns,
 case NVME_ZONE_STATE_READ_ONLY:
 zone->tstamp = 0;
 }
+nvme_sync_zone_file(n, ns, zone, sizeof(NvmeZone));
 }
 
 static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
@@ -2976,9 +2984,114 @@ static const MemoryRegionOps nvme_cmb_ops = {
 },
 };
 
-static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
+static int nvme_validate_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
 uint64_t capacity)
 {
+NvmeZoneMeta *meta = ns->zone_meta;
+NvmeZone *zone = ns->zone_array;
+uint64_t start = 0, zone_size = n->params.zone_size;
+int i, n_imp_open = 0, n_exp_open = 0, n_closed = 0, n_full = 0;
+
+if (meta->magic != NVME_ZONE_META_MAGIC) {
+return 1;
+}
+if (meta->version != NVME_ZONE_META_VER) {
+return 2;
+}
+if (meta->zone_size != zone_size) {
+return 3;
+}
+if (meta->zone_capacity != n->params.zone_capacity) {
+return 4;
+}
+if (meta->nr_offline_zones != n->params.nr_offline_zones) {
+return 5;
+}
+if (meta->nr_rdonly_zones != n->params.nr_rdonly_zones) {
+return 6;
+}
+if (meta->lba_size != n->conf.logical_block_size) {
+return 7;
+}
+if (meta->zd_extension_size != n->params.zd_extension_size) {
+return 8;
+}
+
+for (i = 0; i < n->num_zones; i++, zone++) {
+if (start + zone_size > capacity) {
+zone_size = capacity - start;
+}
+if (zone->d.zt != NVME_ZONE_TYPE_SEQ_WRITE) {
+return 9;
+}
+if (zone->d.zcap != n->params.zone_capacity) {
+return 10;
+}
+if (zone->d.zslba != start) {
+return 11;
+}
+switch (nvme_get_zone_state(zone)) {
+case NVME_ZONE_STATE_EMPTY:
+case NVME_ZONE_STATE_OFFLINE:
+case NVME_ZONE_STATE_READ_ONLY:
+if (zone->d.wp != start) {
+return 12;
+}
+break;
+case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+if (zone->d.wp < start ||
+zone->d.wp >= zone->d.zslba + zone->d.zcap) {
+return 13;
+}
+n_imp_open++;
+break;
+case N

[PATCH v2 18/18] hw/block/nvme: Document zoned parameters in usage text

2020-06-17 Thread Dmitry Fomichev
Added brief descriptions of the new driver properties that users can
now set to configure Zoned Namespace Command Set features in the
driver.

This patch is for documentation only; there is no functional change.

Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c | 62 +++--
 1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 63e7a6352e..90b1ae24b5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,7 +9,7 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.4, 1.3, 1.2, 1.1, 1.0e
  *
  *  http://www.nvmexpress.org/resources/
  */
@@ -20,7 +20,8 @@
  *  -device nvme,drive=,serial=,id=, \
  *  cmb_size_mb=, \
  *  [pmrdev=,] \
- *  max_ioqpairs=
+ *  max_ioqpairs= \
+ *  zoned=
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -32,6 +33,63 @@
  * For example:
  * -object memory-backend-file,id=,share=on,mem-path=, \
  *  size=  -device nvme,...,pmrdev=
+ *
+ * Setting "zoned" to true makes the driver emulate zoned namespaces.
+ * In this case, the following options are available to configure zoned
+ * operation:
+ * zone_size=
+ *
+ * zone_capacity=
+ *
+ * zone_file=
+ * Zone metadata file, if specified, allows zone information
+ * to be persistent across shutdowns and restarts.
+ *
+ * zone_descr_ext_size=
+ * This value needs to be specified in 64B units. If it is zero,
+ * namespace(s) will not support zone descriptor extensions.
+ *
+ * max_active=
+ *
+ * max_open=
+ *
+ * reset_rcmnd_delay=
+ * The amount of time that passes between the moment when a zone
+ * enters Full state and when Reset Zone Recommended attribute
+ * is set for that zone.
+ *
+ * reset_rcmnd_limit=
+ * If this value is zero (default), RZR attribute is not set for
+ *  any zones.
+ *
+ * finish_rcmnd_delay=
+ * The amount of time that passes between the moment when a zone
+ * enters an Open or Closed state and when Finish Zone Recommended
+ * attribute is set for that zone.
+ *
+ * finish_rcmnd_limit=
+ * If this value is zero (default), FZR attribute is not set for
+ * any zones.
+ *
+ * zamds=
+ * The maximum I/O size that can be supported by Zone Append
+ * command. Since internally this value is maintained as
+ * ZAMDS = log2( / ), some
+ * values assigned to this property may be rounded down and
+ * result in a lower maximum ZA data size being in effect.
+ *
+ * zone_async_events=
+ * Enable sending Zone Descriptor Changed AENs to the host.
+ *
+ * offline_zones=
+ *
+ * rdonly_zones=
+ *
+ * cross_zone_read=
+ *
+ * fill_pattern=
+ * Byte pattern to return for any portions of unwritten data
+ * during read.
  */
 
 #include "qemu/osdep.h"
-- 
2.21.0




[PATCH v2 10/18] hw/block/nvme: Support Zoned Namespace Command Set

2020-06-17 Thread Dmitry Fomichev
The driver has been changed to advertise NVM Command Set when "zoned"
driver property is not set (default) and Zoned Namespace Command Set
otherwise.

Handlers for three new NVMe commands introduced in Zoned Namespace
Command Set specification are added, namely for Zone Management
Receive, Zone Management Send and Zone Append.

Driver initialization code has been extended to create a proper
configuration for zoned operation using driver properties.

The Read/Write command handler is modified to only allow writes at
the write pointer if the namespace is zoned. For the Zone Append
command, writes implicitly happen at the write pointer and the
starting write pointer value is returned as the result of the
command. The Write Zeroes handler is modified to add zoned checks
that are identical to those done as a part of the Write flow.
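
A standalone model of the Zone Append semantics described here
(simplified; ignores zone state transitions and NVMe status codes):

#include <stdint.h>
#include <stdio.h>

struct zone { uint64_t zslba, zcap, wp; };

/* The write lands at the current write pointer, the command returns
 * the LBA it was written at, and the write pointer advances. */
static int zone_append(struct zone *z, uint32_t nlb, uint64_t *appended_lba)
{
    if (z->wp + nlb > z->zslba + z->zcap) {
        return -1; /* would cross the writable zone boundary */
    }
    *appended_lba = z->wp; /* starting LBA returned to the host */
    z->wp += nlb;
    return 0;
}

int main(void)
{
    struct zone z = { .zslba = 0, .zcap = 1024, .wp = 0 };
    uint64_t lba;

    if (zone_append(&z, 8, &lba) == 0) {
        printf("appended at LBA %llu, wp now %llu\n",
               (unsigned long long)lba, (unsigned long long)z.wp);
    }
    return 0;
}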

The code to support Zone Descriptor Extensions is not included in
this commit and the driver always reports ZDES 0. A later commit in
this series will add ZDE support.

This commit doesn't yet include checks for active and open zone
limits. It is assumed that there are no limits on either active or
open zones.

Signed-off-by: Niklas Cassel 
Signed-off-by: Hans Holmberg 
Signed-off-by: Ajay Joshi 
Signed-off-by: Chaitanya Kulkarni 
Signed-off-by: Matias Bjorling 
Signed-off-by: Aravind Ramesh 
Signed-off-by: Shin'ichiro Kawasaki 
Signed-off-by: Adam Manzanares 
Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c | 963 ++--
 1 file changed, 933 insertions(+), 30 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 453f4747a5..2e03b0b6ed 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -37,6 +37,7 @@
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "qemu/error-report.h"
+#include "crypto/random.h"
 #include "hw/block/block.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci.h"
@@ -69,6 +70,98 @@
 
 static void nvme_process_sq(void *opaque);
 
+/*
+ * Add a zone to the tail of a zone list.
+ */
+static void nvme_add_zone_tail(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList 
*zl,
+NvmeZone *zone)
+{
+uint32_t idx = (uint32_t)(zone - ns->zone_array);
+
+assert(nvme_zone_not_in_list(zone));
+
+if (!zl->size) {
+zl->head = zl->tail = idx;
+zone->next = zone->prev = NVME_ZONE_LIST_NIL;
+} else {
+ns->zone_array[zl->tail].next = idx;
+zone->prev = zl->tail;
+zone->next = NVME_ZONE_LIST_NIL;
+zl->tail = idx;
+}
+zl->size++;
+}
+
+/*
+ * Remove a zone from a zone list. The zone must be linked in the list.
+ */
+static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
+NvmeZone *zone)
+{
+uint32_t idx = (uint32_t)(zone - ns->zone_array);
+
+assert(!nvme_zone_not_in_list(zone));
+
+--zl->size;
+if (zl->size == 0) {
+zl->head = NVME_ZONE_LIST_NIL;
+zl->tail = NVME_ZONE_LIST_NIL;
+} else if (idx == zl->head) {
+zl->head = zone->next;
+ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+} else if (idx == zl->tail) {
+zl->tail = zone->prev;
+ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
+} else {
+ns->zone_array[zone->next].prev = zone->prev;
+ns->zone_array[zone->prev].next = zone->next;
+}
+
+zone->prev = zone->next = 0;
+}
+
+static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
+NvmeZone *zone, uint8_t state)
+{
+if (!nvme_zone_not_in_list(zone)) {
+switch (nvme_get_zone_state(zone)) {
+case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+nvme_remove_zone(n, ns, ns->exp_open_zones, zone);
+break;
+case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+nvme_remove_zone(n, ns, ns->imp_open_zones, zone);
+break;
+case NVME_ZONE_STATE_CLOSED:
+nvme_remove_zone(n, ns, ns->closed_zones, zone);
+break;
+case NVME_ZONE_STATE_FULL:
+nvme_remove_zone(n, ns, ns->full_zones, zone);
+}
+   }
+
+nvme_set_zone_state(zone, state);
+
+switch (state) {
+case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+nvme_add_zone_tail(n, ns, ns->exp_open_zones, zone);
+break;
+case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+nvme_add_zone_tail(n, ns, ns->imp_open_zones, zone);
+break;
+case NVME_ZONE_STATE_CLOSED:
+nvme_add_zone_tail(n, ns, ns->closed_zones, zone);
+break;
+case NVME_ZONE_STATE_FULL:
+nvme_add_zone_tail(n, ns, ns->full_zones, zone);
+break;
+default:
+zone->d.za = 0;
+/* fall through */
+case NVME_ZONE_STATE_READ_ONLY:
+zone->tstamp = 0;
+}
+}
+
 static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 {
 hwaddr low = n->ctrl_mem.addr;
@@ -314,6 +407,7 @@ static void nvme_post_cqes(void *opaque)
 
 QTAILQ_REMOVE(&cq->req_list, req, entry);
 sq = req->sq;
+
 req->cqe.status = cpu_to_le16((req->status <

[PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions

2020-06-17 Thread Dmitry Fomichev
From: Niklas Cassel 

Define the structures and constants required to implement
Namespace Types support.

Signed-off-by: Niklas Cassel 
Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.h  |  3 ++
 include/block/nvme.h | 75 +---
 2 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 4f0dac39ae..4fd155c409 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -63,6 +63,9 @@ typedef struct NvmeCQueue {
 
 typedef struct NvmeNamespace {
 NvmeIdNsid_ns;
+uint32_tnsid;
+uint8_t csi;
+QemuUUIDuuid;
 } NvmeNamespace;
 
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 6a58bac0c2..5a1e5e137c 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -50,6 +50,11 @@ enum NvmeCapMask {
 CAP_PMR_MASK   = 0x1,
 };
 
+enum NvmeCapCssBits {
+CAP_CSS_NVM= 0x01,
+CAP_CSS_CSI_SUPP   = 0x40,
+};
+
 #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
 #define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)& CAP_CQR_MASK)
 #define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)& CAP_AMS_MASK)
@@ -101,6 +106,12 @@ enum NvmeCcMask {
 CC_IOCQES_MASK  = 0xf,
 };
 
+enum NvmeCcCss {
+CSS_NVM_ONLY= 0,
+CSS_ALL_NSTYPES = 6,
+CSS_ADMIN_ONLY  = 7,
+};
+
 #define NVME_CC_EN(cc) ((cc >> CC_EN_SHIFT) & CC_EN_MASK)
 #define NVME_CC_CSS(cc)((cc >> CC_CSS_SHIFT)& CC_CSS_MASK)
 #define NVME_CC_MPS(cc)((cc >> CC_MPS_SHIFT)& CC_MPS_MASK)
@@ -109,6 +120,21 @@ enum NvmeCcMask {
 #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
 #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
 
+#define NVME_SET_CC_EN(cc, val) \
+(cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
+#define NVME_SET_CC_CSS(cc, val)\
+(cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
+#define NVME_SET_CC_MPS(cc, val)\
+(cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
+#define NVME_SET_CC_AMS(cc, val)\
+(cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
+#define NVME_SET_CC_SHN(cc, val)\
+(cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
+#define NVME_SET_CC_IOSQES(cc, val) \
+(cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
+#define NVME_SET_CC_IOCQES(cc, val) \
+(cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
+
 enum NvmeCstsShift {
 CSTS_RDY_SHIFT  = 0,
 CSTS_CFS_SHIFT  = 1,
@@ -482,10 +508,41 @@ typedef struct NvmeIdentify {
 uint64_trsvd2[2];
 uint64_tprp1;
 uint64_tprp2;
-uint32_tcns;
-uint32_trsvd11[5];
+uint8_t cns;
+uint8_t rsvd4;
+uint16_tctrlid;
+uint16_tnvmsetid;
+uint8_t rsvd3;
+uint8_t csi;
+uint32_trsvd12[4];
 } NvmeIdentify;
 
+typedef struct NvmeNsIdDesc {
+uint8_t nidt;
+uint8_t nidl;
+uint16_trsvd2;
+} NvmeNsIdDesc;
+
+enum NvmeNidType {
+NVME_NIDT_EUI64 = 0x01,
+NVME_NIDT_NGUID = 0x02,
+NVME_NIDT_UUID  = 0x03,
+NVME_NIDT_CSI   = 0x04,
+};
+
+enum NvmeNidLength {
+NVME_NIDL_EUI64 = 8,
+NVME_NIDL_NGUID = 16,
+NVME_NIDL_UUID  = 16,
+NVME_NIDL_CSI   = 1,
+};
+
+enum NvmeCsi {
+NVME_CSI_NVM= 0x00,
+};
+
+#define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
+
 typedef struct NvmeRwCmd {
 uint8_t opcode;
 uint8_t flags;
@@ -603,6 +660,7 @@ enum NvmeStatusCodes {
 NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
 NVME_INVALID_NSID   = 0x000b,
 NVME_CMD_SEQ_ERROR  = 0x000c,
+NVME_CMD_SET_CMB_REJECTED   = 0x002b,
 NVME_LBA_RANGE  = 0x0080,
 NVME_CAP_EXCEEDED   = 0x0081,
 NVME_NS_NOT_READY   = 0x0082,
@@ -729,9 +787,14 @@ typedef struct NvmePSD {
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
 enum {
-NVME_ID_CNS_NS = 0x0,
-NVME_ID_CNS_CTRL   = 0x1,
-NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
+NVME_ID_CNS_NS= 0x0,
+NVME_ID_CNS_CTRL  = 0x1,
+NVME_ID_CNS_NS_ACTIVE_LIST= 0x2,
+NVME_ID_CNS_NS_DESC_LIST  = 0x03,
+NVME_ID_CNS_CS_NS = 0x05,
+NVME_ID_CNS_CS_CTRL   = 0x06,
+NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
+NVME_ID_CNS_IO_COMMAND_SET= 0x1c,
 };
 
 typedef struct NvmeIdCtrl {
@@ -825,6 +888,7 @@ enum NvmeFeatureIds {
 NVME_WRITE_ATOMICITY= 0xa,
 NVME_ASYNCHRONOUS_EVENT_CONF= 0xb,
 NVME_TIMESTAMP  = 0xe,
+NVME_COMMAND_SET_PROFILE= 0x19,
 NVME_SOFTWARE_PROGRESS_MARKER   = 0x80
 };
 
@@ -914,6 +978,7 @@ static inline void _nvme_check_size(void)
 QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) !=

[PATCH v2 09/18] hw/block/nvme: Define Zoned NS Command Set trace events

2020-06-17 Thread Dmitry Fomichev
The Zoned Namespace Command Set / Namespace Types implementation that
is being introduced in this series adds a good number of trace events.
Combine all tracepoint definitions into a separate patch to make
reviewing more convenient.

Signed-off-by: Dmitry Fomichev 
---
 hw/block/trace-events | 41 +
 1 file changed, 41 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 3f3323fe38..984db8a20c 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -66,6 +66,31 @@ pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects 
log read"
 pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected 
by host, bar.cc=0x%"PRIx32""
 pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets 
selected by host, bar.cc=0x%"PRIx32""
+pci_nvme_open_zone(uint64_t slba, uint32_t zone_idx, int all) "open zone, 
slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_close_zone(uint64_t slba, uint32_t zone_idx, int all) "close zone, 
slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_finish_zone(uint64_t slba, uint32_t zone_idx, int all) "finish zone, 
slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_reset_zone(uint64_t slba, uint32_t zone_idx, int all) "reset zone, 
slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_offline_zone(uint64_t slba, uint32_t zone_idx, int all) "offline 
zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_set_descriptor_extension(uint64_t slba, uint32_t zone_idx) "set zone 
descriptor extension, slba=%"PRIu64", idx=%"PRIu32""
+pci_nvme_zone_reset_recommended(uint64_t slba) "slba=%"PRIu64""
+pci_nvme_zone_reset_internal_op(uint64_t slba) "slba=%"PRIu64""
+pci_nvme_zone_finish_recommended(uint64_t slba) "slba=%"PRIu64""
+pci_nvme_zone_finish_internal_op(uint64_t slba) "slba=%"PRIu64""
+pci_nvme_zone_finished_by_controller(uint64_t slba) "slba=%"PRIu64""
+pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for 
zone_idx=%"PRIu32""
+pci_nvme_power_on_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", 
slba=%"PRIu64" transitioned to Closed state"
+pci_nvme_power_on_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", 
slba=%"PRIu64" transitioned to Empty state"
+pci_nvme_power_on_full(uint32_t state, uint64_t slba) "zone state=%"PRIu32", 
slba=%"PRIu64" transitioned to Full state"
+pci_nvme_zone_ae_not_enabled(int info, int log_page, int nsid) "zone async 
event not enabled, info=0x%"PRIx32", lp=0x%"PRIx32", nsid=%"PRIu32""
+pci_nvme_zone_ae_not_cleared(int info, int log_page, int nsid) "zoned async 
event not cleared, info=0x%"PRIx32", lp=0x%"PRIx32", nsid=%"PRIu32""
+pci_nvme_zone_aen_not_requested(uint32_t oaes) "zone descriptor AEN are not 
requested by host, oaes=0x%"PRIx32""
+pci_nvme_getfeat_aen_cfg(uint64_t res) "reporting async event config 
res=%"PRIu64""
+pci_nvme_setfeat_zone_info_aer_on(void) "zone info change notices enabled"
+pci_nvme_setfeat_zone_info_aer_off(void) "zone info change notices disabled"
+pci_nvme_changed_zone_log_read(uint16_t nsid) "changed zone list log of ns 
%"PRIu16""
+pci_nvme_reporting_changed_zone(uint64_t zslba, uint8_t za) "zslba=%"PRIu64", 
attr=0x%"PRIx8""
+pci_nvme_empty_changed_zone_list(void) "no changes zones to report"
+pci_nvme_mapped_zone_file(char *zfile_name, int ret) "mapped zone file %s, 
error %d"
 
 # nvme traces for error conditions
 pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
@@ -77,10 +102,25 @@ pci_nvme_err_invalid_ns(uint32_t ns, uint32_t limit) 
"invalid namespace %u not w
 pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) 
"Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+pci_nvme_err_capacity_exceeded(uint64_t zone_id, uint64_t nr_zones) "zone 
capacity exceeded, zone_id=%"PRIu64", nr_zones=%"PRIu64""
+pci_nvme_err_unaligned_zone_cmd(uint8_t action, uint64_t slba, uint64_t zslba) 
"unaligned zone op 0x%"PRIx32", got slba=%"PRIu64", zslba=%"PRIu64""
+pci_nvme_err_invalid_zone_state_transition(uint8_t state, uint8_t action, 
uint64_t slba, uint8_t attrs) "0x%"PRIx32"->0x%"PRIx32", slba=%"PRIu64", 
attrs=0x%"PRIx32""
+pci_nvme_err_write_not_at_wp(uint64_t slba, uint64_t zone, uint64_t wp) 
"writing at slba=%"PRIu64", zone=%"PRIu64", but wp=%"PRIu64""
+pci_nvme_err_append_not_at_start(uint64_t slba, uint64_t zone) "appending at 
slba=%"PRIu64", but zone=%"PRIu64""
+pci_nvme_err_zone_write_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) 
"slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx16""
+pci_nvme_err_zone_read_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) 
"slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx16""
+pci_nvme_err_append_too_large(uint64_t slba, uint32_t nlb, uint8_t za

[PATCH v2 07/18] hw/block/nvme: Add support for Namespace Types

2020-06-17 Thread Dmitry Fomichev
From: Niklas Cassel 

Namespace Types introduce a new command set, "I/O Command Sets",
that allows the host to retrieve the command sets associated with
a namespace. Introduce support for the command set, and enable
detection for the NVM Command Set.
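
A standalone sketch of the Namespace Identification Descriptor list
that the new Identify handler builds (TLV layout per the diff below;
the UUID bytes are placeholders):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct nid_desc { uint8_t nidt, nidl; uint16_t rsvd; };

enum { NIDT_UUID = 0x03, NIDT_CSI = 0x04,
       NIDL_UUID = 16,   NIDL_CSI = 1, CSI_NVM = 0x00 };

int main(void)
{
    uint8_t buf[4096] = { 0 };          /* zero-padded 4 KiB buffer */
    uint8_t uuid[NIDL_UUID] = { 0xaa }; /* placeholder UUID bytes */
    uint8_t *p = buf;
    struct nid_desc d;

    /* UUID descriptor, then its payload */
    d = (struct nid_desc){ NIDT_UUID, NIDL_UUID, 0 };
    memcpy(p, &d, sizeof(d));
    p += sizeof(d);
    memcpy(p, uuid, NIDL_UUID);
    p += NIDL_UUID;

    /* CSI descriptor: one byte naming the command set */
    d = (struct nid_desc){ NIDT_CSI, NIDL_CSI, 0 };
    memcpy(p, &d, sizeof(d));
    p += sizeof(d);
    *p++ = CSI_NVM;

    printf("descriptor list: %zu bytes used of %zu\n",
           (size_t)(p - buf), sizeof(buf));
    return 0;
}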

Signed-off-by: Niklas Cassel 
Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c | 210 ++--
 hw/block/nvme.h |  11 +++
 2 files changed, 216 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 03b8deee85..453f4747a5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -686,6 +686,26 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, 
NvmeIdentify *c)
 prp1, prp2);
 }
 
+static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeIdentify *c)
+{
+uint64_t prp1 = le64_to_cpu(c->prp1);
+uint64_t prp2 = le64_to_cpu(c->prp2);
+static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+uint32_t *list;
+uint16_t ret;
+
+trace_pci_nvme_identify_ctrl_csi(c->csi);
+
+if (c->csi == NVME_CSI_NVM) {
+list = g_malloc0(data_len);
+ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
+g_free(list);
+return ret;
+} else {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+}
+
 static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
 {
 NvmeNamespace *ns;
@@ -701,11 +721,42 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
NvmeIdentify *c)
 }
 
 ns = &n->namespaces[nsid - 1];
+assert(nsid == ns->nsid);
 
 return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
 prp1, prp2);
 }
 
+static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeIdentify *c)
+{
+NvmeNamespace *ns;
+uint32_t nsid = le32_to_cpu(c->nsid);
+uint64_t prp1 = le64_to_cpu(c->prp1);
+uint64_t prp2 = le64_to_cpu(c->prp2);
+static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+uint32_t *list;
+uint16_t ret;
+
+trace_pci_nvme_identify_ns_csi(nsid, c->csi);
+
+if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces);
+return NVME_INVALID_NSID | NVME_DNR;
+}
+
+ns = &n->namespaces[nsid - 1];
+assert(nsid == ns->nsid);
+
+if (c->csi == NVME_CSI_NVM) {
+list = g_malloc0(data_len);
+ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
+g_free(list);
+return ret;
+} else {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+}
+
 static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
 {
 static const int data_len = NVME_IDENTIFY_DATA_SIZE;
@@ -733,6 +784,99 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
NvmeIdentify *c)
 return ret;
 }
 
+static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeIdentify *c)
+{
+static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+uint32_t min_nsid = le32_to_cpu(c->nsid);
+uint64_t prp1 = le64_to_cpu(c->prp1);
+uint64_t prp2 = le64_to_cpu(c->prp2);
+uint32_t *list;
+uint16_t ret;
+int i, j = 0;
+
+trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
+
+if (c->csi != NVME_CSI_NVM) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+list = g_malloc0(data_len);
+for (i = 0; i < n->num_namespaces; i++) {
+if (i < min_nsid) {
+continue;
+}
+list[j++] = cpu_to_le32(i + 1);
+if (j == data_len / sizeof(uint32_t)) {
+break;
+}
+}
+ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
+g_free(list);
+return ret;
+}
+
+static uint16_t nvme_list_ns_descriptors(NvmeCtrl *n, NvmeIdentify *c)
+{
+NvmeNamespace *ns;
+uint32_t nsid = le32_to_cpu(c->nsid);
+uint64_t prp1 = le64_to_cpu(c->prp1);
+uint64_t prp2 = le64_to_cpu(c->prp2);
+void *buf_ptr;
+NvmeNsIdDesc *desc;
+static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+uint8_t *buf;
+uint16_t status;
+
+trace_pci_nvme_list_ns_descriptors();
+
+if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces);
+return NVME_INVALID_NSID | NVME_DNR;
+}
+
+ns = &n->namespaces[nsid - 1];
+assert(nsid == ns->nsid);
+
+buf = g_malloc0(data_len);
+buf_ptr = buf;
+
+desc = buf_ptr;
+desc->nidt = NVME_NIDT_UUID;
+desc->nidl = NVME_NIDL_UUID;
+buf_ptr += sizeof(*desc);
+memcpy(buf_ptr, ns->uuid.data, NVME_NIDL_UUID);
+buf_ptr += NVME_NIDL_UUID;
+
+desc = buf_ptr;
+desc->nidt = NVME_NIDT_CSI;
+desc->nidl = NVME_NIDL_CSI;
+buf_ptr += sizeof(*desc);
+*(uint8_t *)buf_ptr = NVME_CSI_NVM;
+
+status = nvme_dma_read_prp(n, buf, data_len, prp1, prp2);
+g_free(buf);
+return status;
+}
+
+static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeIdentify *c)
+{
+uint64_t prp1 = le64_to_cpu(c->prp1);
+uint64_t prp2 = le64_to_cpu(c->prp2);
+static const int data_len = NVME_IDENTIFY_DATA_SIZE;

[PATCH v2 03/18] hw/block/nvme: Clean up unused AER definitions

2020-06-17 Thread Dmitry Fomichev
Remove the unused struct NvmeAerResult and the SMART-related async
event codes. All remaining event codes are now categorized by their
type. This avoids having to lump all these values into a single enum,
NvmeAsyncEventRequest, which is now removed.

Later commits in this series will define additional values in some
of these enums. No functional change.
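
For reference, these pieces still combine into the same 32-bit AER
completion result; a minimal sketch of the packing, assuming the
Dword 0 layout from the NVMe spec (event type in bits 2:0, event
info in bits 15:8, log page identifier in bits 23:16):

    static inline uint32_t nvme_aer_result(uint8_t ev_type, uint8_t ev_info,
                                           uint8_t log_page)
    {
        return ev_type | (ev_info << 8) | (log_page << 16);
    }

    /* e.g. a namespace-changed notice; 0x04 is the Changed Namespace
     * List log page identifier per the spec */
    uint32_t result = nvme_aer_result(NVME_AER_TYPE_NOTICE,
                                      NVME_AER_NOTICE_NS_CHANGED, 0x04);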

Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.h  |  1 -
 include/block/nvme.h | 43 ++-
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 0460cc0e62..4f0dac39ae 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -13,7 +13,6 @@ typedef struct NvmeParams {
 
 typedef struct NvmeAsyncEvent {
 QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
-NvmeAerResult result;
 } NvmeAsyncEvent;
 
 enum NvmeRequestFlags {
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 9c3a04dcd7..3099df99eb 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -553,28 +553,30 @@ typedef struct NvmeDsmRange {
uint64_t    slba;
 } NvmeDsmRange;
 
-enum NvmeAsyncEventRequest {
-NVME_AER_TYPE_ERROR = 0,
-NVME_AER_TYPE_SMART = 1,
-NVME_AER_TYPE_IO_SPECIFIC   = 6,
-NVME_AER_TYPE_VENDOR_SPECIFIC   = 7,
-NVME_AER_INFO_ERR_INVALID_SQ= 0,
-NVME_AER_INFO_ERR_INVALID_DB= 1,
-NVME_AER_INFO_ERR_DIAG_FAIL = 2,
-NVME_AER_INFO_ERR_PERS_INTERNAL_ERR = 3,
-NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR= 4,
-NVME_AER_INFO_ERR_FW_IMG_LOAD_ERR   = 5,
-NVME_AER_INFO_SMART_RELIABILITY = 0,
-NVME_AER_INFO_SMART_TEMP_THRESH = 1,
-NVME_AER_INFO_SMART_SPARE_THRESH= 2,
+enum NvmeAsyncEventType {
+NVME_AER_TYPE_ERROR = 0x00,
+NVME_AER_TYPE_SMART = 0x01,
+NVME_AER_TYPE_NOTICE= 0x02,
+NVME_AER_TYPE_CMDSET_SPECIFIC   = 0x06,
+NVME_AER_TYPE_VENDOR_SPECIFIC   = 0x07,
 };
 
-typedef struct NvmeAerResult {
-uint8_t event_type;
-uint8_t event_info;
-uint8_t log_page;
-uint8_t resv;
-} NvmeAerResult;
+enum NvmeAsyncErrorInfo {
+NVME_AER_ERR_INVALID_SQ = 0x00,
+NVME_AER_ERR_INVALID_DB = 0x01,
+NVME_AER_ERR_DIAG_FAIL  = 0x02,
+NVME_AER_ERR_PERS_INTERNAL_ERR  = 0x03,
+NVME_AER_ERR_TRANS_INTERNAL_ERR = 0x04,
+NVME_AER_ERR_FW_IMG_LOAD_ERR= 0x05,
+};
+
+enum NvmeAsyncNoticeInfo {
+NVME_AER_NOTICE_NS_CHANGED  = 0x00,
+};
+
+enum NvmeAsyncEventCfg {
+NVME_AEN_CFG_NS_ATTR= 1 << 8,
+};
 
 typedef struct NvmeCqe {
 union {
@@ -881,7 +883,6 @@ enum NvmeIdNsDps {
 
 static inline void _nvme_check_size(void)
 {
-QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 4);
 QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16);
 QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) != 16);
 QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) != 64);
-- 
2.21.0




[PATCH v2 01/18] hw/block/nvme: Move NvmeRequest has_sg field to a bit flag

2020-06-17 Thread Dmitry Fomichev
In addition to the existing has_sg flag, a few more Boolean
NvmeRequest flags are going to be introduced in subsequent patches.
Convert "has_sg" variable to "flags" and define NvmeRequestFlags
enum for individual flag values.
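
The resulting idiom is the usual set/test/clear bit-flag pattern,
which scales to the flags added later in the series (the destroy call
mirrors the hunk below):

    req->flags |= NVME_REQ_FLG_HAS_SG;          /* set */
    if (req->flags & NVME_REQ_FLG_HAS_SG) {     /* test */
        qemu_sglist_destroy(&req->qsg);
    }
    req->flags &= ~NVME_REQ_FLG_HAS_SG;         /* clear */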

Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c | 8 +++-
 hw/block/nvme.h | 6 +-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 1aee042d4c..3ed9f3d321 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -350,7 +350,7 @@ static void nvme_rw_cb(void *opaque, int ret)
 block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
 req->status = NVME_INTERNAL_DEV_ERROR;
 }
-if (req->has_sg) {
+if (req->flags & NVME_REQ_FLG_HAS_SG) {
 qemu_sglist_destroy(&req->qsg);
 }
 nvme_enqueue_req_completion(cq, req);
@@ -359,7 +359,6 @@ static void nvme_rw_cb(void *opaque, int ret)
 static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
 NvmeRequest *req)
 {
-req->has_sg = false;
 block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
  BLOCK_ACCT_FLUSH);
 req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
@@ -383,7 +382,6 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace 
*ns, NvmeCmd *cmd,
 return NVME_LBA_RANGE | NVME_DNR;
 }
 
-req->has_sg = false;
 block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
  BLOCK_ACCT_WRITE);
 req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
@@ -422,14 +420,13 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, 
NvmeCmd *cmd,
 
 dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
 if (req->qsg.nsg > 0) {
-req->has_sg = true;
+req->flags |= NVME_REQ_FLG_HAS_SG;
 req->aiocb = is_write ?
 dma_blk_write(n->conf.blk, &req->qsg, data_offset, 
BDRV_SECTOR_SIZE,
   nvme_rw_cb, req) :
 dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
  nvme_rw_cb, req);
 } else {
-req->has_sg = false;
 req->aiocb = is_write ?
 blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
 req) :
@@ -917,6 +914,7 @@ static void nvme_process_sq(void *opaque)
 QTAILQ_REMOVE(&sq->req_list, req, entry);
 QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
 memset(&req->cqe, 0, sizeof(req->cqe));
+req->flags = 0;
 req->cqe.cid = cmd.cid;
 
 status = sq->sqid ? nvme_io_cmd(n, &cmd, req) :
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 1d30c0bca2..0460cc0e62 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -16,11 +16,15 @@ typedef struct NvmeAsyncEvent {
 NvmeAerResult result;
 } NvmeAsyncEvent;
 
+enum NvmeRequestFlags {
+NVME_REQ_FLG_HAS_SG   = 1 << 0,
+};
+
 typedef struct NvmeRequest {
 struct NvmeSQueue   *sq;
 BlockAIOCB  *aiocb;
uint16_t    status;
-bool    has_sg;
+uint16_t    flags;
 NvmeCqe cqe;
 BlockAcctCookie acct;
 QEMUSGList  qsg;
-- 
2.21.0




[PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-06-17 Thread Dmitry Fomichev
v2: rebased on top of block-next/block branch

Zoned Namespace (ZNS) Command Set is a newly introduced command set
published by the NVM Express, Inc. organization as TP 4053. The main
design goals of ZNS are to provide hardware designers the means to
reduce NVMe controller complexity and to achieve better I/O
latency and throughput. SSDs that implement this interface are
commonly known as ZNS SSDs.

This command set implements a zoned storage model, similar to
ZAC/ZBC. As such, there is already support in Linux, allowing one to
perform the majority of tasks needed for managing ZNS SSDs.

The Zoned Namespace Command Set relies on another TP, known as
Namespace Types (NVMe TP 4056), which introduces support for having
multiple command sets per namespace.

Both ZNS and Namespace Types specifications can be downloaded by
visiting the following link -

https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs.zip

This patch series adds Namespace Types support and zoned namespace
emulation capability to the existing NVMe PCI driver.

The patchset is organized as follows -

The first several patches are preparatory and are added to allow for
an easier review of the subsequent commits. The group of patches that
follows adds NS Types support with only NVM Command Set being
available. Finally, the last group of commits makes definitions and
adds new code to support Zoned Namespace Command Set.

Based-on: <20200609205944.3549240-1-ebl...@redhat.com>

Ajay Joshi (1):
  hw/block/nvme: Define 64 bit cqe.result

Dmitry Fomichev (15):
  hw/block/nvme: Move NvmeRequest has_sg field to a bit flag
  hw/block/nvme: Clean up unused AER definitions
  hw/block/nvme: Add Commands Supported and Effects log
  hw/block/nvme: Define trace events related to NS Types
  hw/block/nvme: Make Zoned NS Command Set definitions
  hw/block/nvme: Define Zoned NS Command Set trace events
  hw/block/nvme: Support Zoned Namespace Command Set
  hw/block/nvme: Introduce max active and open zone limits
  hw/block/nvme: Simulate Zone Active excursions
  hw/block/nvme: Set Finish/Reset Zone Recommended attributes
  hw/block/nvme: Generate zone AENs
  hw/block/nvme: Support Zone Descriptor Extensions
  hw/block/nvme: Add injection of Offline/Read-Only zones
  hw/block/nvme: Use zone metadata file for persistence
  hw/block/nvme: Document zoned parameters in usage text

Niklas Cassel (2):
  hw/block/nvme: Introduce the Namespace Types definitions
  hw/block/nvme: Add support for Namespace Types

 block/nvme.c  |2 +-
 block/trace-events|2 +-
 hw/block/nvme.c   | 2316 -
 hw/block/nvme.h   |  228 +++-
 hw/block/trace-events |   56 +
 include/block/nvme.h  |  282 -
 6 files changed, 2820 insertions(+), 66 deletions(-)

-- 
2.21.0




[PATCH v2 02/18] hw/block/nvme: Define 64 bit cqe.result

2020-06-17 Thread Dmitry Fomichev
From: Ajay Joshi 

A new write command, Zone Append, is added as part of the Zoned
Namespace Command Set. Upon successful completion of this command,
the controller returns the start LBA of the performed write operation
in the cqe.result field. Therefore, this field needs to grow from 32
to 64 bits, consuming the reserved 32-bit field that follows it in the
CQE struct. Since the existing commands are expected to return a
32-bit LE value, two separate variables, result32 and result64, are
now kept in a union.
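
For illustration (the Zone Append handler itself only arrives later
in the series), completions would be filled in roughly like this,
with byte order handled per field width:

    /* existing commands keep returning a 32-bit LE value ... */
    req->cqe.result32 = cpu_to_le32(result);

    /* ... while a Zone Append completion would return the start LBA
     * of the write in the full 64-bit field (sketch, not this patch): */
    req->cqe.result64 = cpu_to_le64(slba);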

Signed-off-by: Ajay Joshi 
Signed-off-by: Dmitry Fomichev 
---
 block/nvme.c | 2 +-
 block/trace-events   | 2 +-
 hw/block/nvme.c  | 6 +++---
 include/block/nvme.h | 6 --
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index eb2f54dd9d..ca245ec574 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -287,7 +287,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
 {
 uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
 if (status) {
-trace_nvme_error(le32_to_cpu(c->result),
+trace_nvme_error(le64_to_cpu(c->result64),
  le16_to_cpu(c->sq_head),
  le16_to_cpu(c->sq_id),
  le16_to_cpu(c->cid),
diff --git a/block/trace-events b/block/trace-events
index 29dff8881c..05c1393943 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -156,7 +156,7 @@ vxhs_get_creds(const char *cacert, const char *client_key, 
const char *client_ce
 # nvme.c
 nvme_kick(void *s, int queue) "s %p queue %d"
 nvme_dma_flush_queue_wait(void *s) "s %p"
-nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) 
"cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
+nvme_error(uint64_t cmd_specific, int sq_head, int sqid, int cid, int status) 
"cmd_specific %"PRIu64" sq_head %d sqid %d cid %d status 0x%x"
 nvme_process_completion(void *s, int index, int inflight) "s %p queue %d 
inflight %d"
 nvme_process_completion_queue_busy(void *s, int index) "s %p queue %d"
 nvme_complete_command(void *s, int index, int cid) "s %p queue %d cid %d"
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3ed9f3d321..a1bbc9acde 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -823,7 +823,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-req->cqe.result = result;
+req->cqe.result32 = result;
 return NVME_SUCCESS;
 }
 
@@ -859,8 +859,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
((dw11 >> 16) & 0xffff) + 1,
 n->params.max_ioqpairs,
 n->params.max_ioqpairs);
-req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
-  ((n->params.max_ioqpairs - 1) << 16));
+req->cqe.result32 = cpu_to_le32((n->params.max_ioqpairs - 1) |
+((n->params.max_ioqpairs - 1) << 16));
 break;
 case NVME_TIMESTAMP:
 return nvme_set_feature_timestamp(n, cmd);
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 1720ee1d51..9c3a04dcd7 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -577,8 +577,10 @@ typedef struct NvmeAerResult {
 } NvmeAerResult;
 
 typedef struct NvmeCqe {
-uint32_t    result;
-uint32_t    rsvd;
+union {
+uint64_t result64;
+uint32_t result32;
+};
uint16_t    sq_head;
uint16_t    sq_id;
uint16_t    cid;
-- 
2.21.0




[PATCH v2 04/18] hw/block/nvme: Add Commands Supported and Effects log

2020-06-17 Thread Dmitry Fomichev
Implementing this log page becomes necessary to allow the host to
check for Zone Append command support in the Zoned Namespace Command
Set.

This commit adds the code to report this log page for NVM Command
Set only. The parts that are specific to zoned operation will be
added later in the series.
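
From the host's point of view, consuming the log page boils down to
fetching the 4 KiB structure with Get Log Page and testing the CSUPP
bit for an opcode of interest; a sketch (the transport that fills the
buffer is elided, and 0x7d as the Zone Append opcode is an assumption
from the ZNS spec, not something this patch defines):

    NvmeEffectsLog log;
    /* ... read 4096 bytes of log page 0x05 at offset 0 into 'log' ... */

    if (le32_to_cpu(log.iocs[0x7d]) & NVME_CMD_EFFECTS_CSUPP) {
        /* Zone Append is supported by this controller */
    }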

Signed-off-by: Dmitry Fomichev 
---
 hw/block/nvme.c   | 62 +++
 hw/block/trace-events |  4 +++
 include/block/nvme.h  | 18 +
 3 files changed, 84 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a1bbc9acde..03b8deee85 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -871,6 +871,66 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 return NVME_SUCCESS;
 }
 
+static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
+uint64_t prp1, uint64_t prp2, uint64_t ofs, uint32_t len)
+{
+   NvmeEffectsLog cmd_eff_log = {};
+   uint32_t *iocs = cmd_eff_log.iocs;
+
+trace_pci_nvme_cmd_supp_and_effects_log_read();
+
+if (ofs != 0) {
+trace_pci_nvme_err_invalid_effects_log_offset(ofs);
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+if (len != sizeof(cmd_eff_log)) {
+trace_pci_nvme_err_invalid_effects_log_len(len);
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+iocs[NVME_ADM_CMD_DELETE_SQ] = NVME_CMD_EFFECTS_CSUPP;
+iocs[NVME_ADM_CMD_CREATE_SQ] = NVME_CMD_EFFECTS_CSUPP;
+iocs[NVME_ADM_CMD_DELETE_CQ] = NVME_CMD_EFFECTS_CSUPP;
+iocs[NVME_ADM_CMD_CREATE_CQ] = NVME_CMD_EFFECTS_CSUPP;
+iocs[NVME_ADM_CMD_IDENTIFY] = NVME_CMD_EFFECTS_CSUPP;
+iocs[NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
+iocs[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
+iocs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
+
+iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+iocs[NVME_CMD_WRITE_ZEROS] = NVME_CMD_EFFECTS_CSUPP |
+ NVME_CMD_EFFECTS_LBCC;
+iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
+
+return nvme_dma_read_prp(n, (uint8_t *)&cmd_eff_log, len, prp1, prp2);
+}
+
+static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
+{
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+uint32_t dw10 = le32_to_cpu(cmd->cdw10);
+uint32_t dw11 = le32_to_cpu(cmd->cdw11);
+uint64_t dw12 = le32_to_cpu(cmd->cdw12);
+uint64_t dw13 = le32_to_cpu(cmd->cdw13);
+uint64_t ofs = (dw13 << 32) | dw12;
+uint32_t numdl, numdu, len;
+uint16_t lid = dw10 & 0xff;
+
+numdl = dw10 >> 16;
numdu = dw11 & 0xffff;
+len = (((numdu << 16) | numdl) + 1) << 2;
+
+switch (lid) {
+case NVME_LOG_CMD_EFFECTS:
+return nvme_handle_cmd_effects(n, cmd, prp1, prp2, ofs, len);
+}
+
+trace_pci_nvme_unsupported_log_page(lid);
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 switch (cmd->opcode) {
@@ -888,6 +948,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return nvme_set_feature(n, cmd, req);
 case NVME_ADM_CMD_GET_FEATURES:
 return nvme_get_feature(n, cmd, req);
+case NVME_ADM_CMD_GET_LOG_PAGE:
+return nvme_get_log_page(n, cmd);
 default:
 trace_pci_nvme_err_invalid_admin_opc(cmd->opcode);
 return NVME_INVALID_OPCODE | NVME_DNR;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 958fcc5508..423d491e27 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -58,6 +58,7 @@ pci_nvme_mmio_start_success(void) "setting controller enable 
bit succeeded"
 pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
+pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects 
log read"
 
 # nvme traces for error conditions
 pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
@@ -69,6 +70,8 @@ pci_nvme_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid 
namespace %u not w
 pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) 
"Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and 
effects log offset must be 0, got %"PRIu64""
+pci_nvme_err_invalid_effects_log_len(uint32_t len) "commands supported and 
effects log size is 4096, got %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, 
sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission 
queue, invalid cqid=%"PRIu16""
pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission 
queue, invalid sqid=%"PRIu16""

[PATCH v2 06/18] hw/block/nvme: Define trace events related to NS Types

2020-06-17 Thread Dmitry Fomichev
Define a few trace events that are relevant to implementing
Namespace Types (NVMe TP 4056).
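
(For anyone new to QEMU tracing: each trace-events line below expands
into a generated trace_*() helper, so the namespace identify event
added here is invoked from the device model as, e.g.:

    trace_pci_nvme_identify_ns_csi(nsid, c->csi);

which is exactly how the next patch in the series uses it.)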

Signed-off-by: Dmitry Fomichev 
---
 hw/block/trace-events | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 423d491e27..3f3323fe38 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -39,8 +39,13 @@ pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t 
vector, uint16_t size,
 pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
 pci_nvme_del_cq(uint16_t cqid) "deleted completion queue, cqid=%"PRIu16""
 pci_nvme_identify_ctrl(void) "identify controller"
+pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
+pci_nvme_identify_ns_csi(uint16_t ns, uint8_t csi) "identify namespace, 
nsid=%"PRIu16", csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
+pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "identify namespace 
list, nsid=%"PRIu16", csi=0x%"PRIx8""
+pci_nvme_list_ns_descriptors(void) "identify namespace descriptors"
+pci_nvme_identify_cmd_set(void) "identify i/o command set"
 pci_nvme_getfeat_vwcache(const char* result) "get feature volatile write 
cache, result=%s"
 pci_nvme_getfeat_numq(int result) "get feature number of queues, result=%d"
 pci_nvme_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested 
cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
@@ -59,6 +64,8 @@ pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects 
log read"
+pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected 
by host, bar.cc=0x%"PRIx32""
+pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets 
selected by host, bar.cc=0x%"PRIx32""
 
 # nvme traces for error conditions
 pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
@@ -72,6 +79,9 @@ pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin 
opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) 
"Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
 pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and 
effects log offset must be 0, got %"PRIu64""
 pci_nvme_err_invalid_effects_log_len(uint32_t len) "commands supported and 
effects log size is 4096, got %"PRIu32""
+pci_nvme_err_change_css_when_enabled(void) "changing CC.CSS while controller 
is enabled"
+pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM 
command set is enabled"
+pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination 
index %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, 
sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission 
queue, invalid cqid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission 
queue, invalid sqid=%"PRIu16""
@@ -127,6 +137,7 @@ pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t 
new_head) "completion qu
 pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write 
for nonexistent queue, sqid=%"PRIu32", ignoring"
 pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission 
queue doorbell write value beyond queue size, sqid=%"PRIu32", 
new_head=%"PRIu16", ignoring"
 pci_nvme_unsupported_log_page(uint16_t lid) "unsupported log page 0x%"PRIx16""
+pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
 
 # xen-block.c
 xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s 
d%up%u"
-- 
2.21.0




Re: [PATCH 0/7] python: create installable package

2020-06-17 Thread John Snow



On 6/17/20 3:52 PM, Cleber Rosa wrote:
> On Tue, Jun 02, 2020 at 08:15:16PM -0400, John Snow wrote:
>> Based-on: 20200602214528.12107-1-js...@redhat.com
>>
>> This series factors the python/qemu directory as an installable
>> module. As a developer, you can install this to your virtual environment
>> and then always have access to the classes contained within without
>> needing to wrangle python import path problems.
>>
> 
> First of all, major kudos for picking up this task.  It's so high in
> importance to so many users (myself included) that I feel like I owe
> you many truck loads of beers now. :)
> 

Mostly I just wanted to formalize mypy, pylint, flake8 et al across the
most important python bits in our tree so that when making changes for
testing it's easier to verify that I didn't break something else.

Easiest way to get the right structure that these tools expect is to
make a real package ...

So here we are. And also Philippe asked nicely.

>> When developing, you could go to qemu/python/ and invoke `pipenv shell`
>> to activate a virtual environment within which you could type `pip
>> install -e .` to install a special development version of this package
>> to your virtual environment. This package will always reflect the most
>> recent version of the source files in the tree.
>>
>> When not developing, you could install a version of this package to your
>> environment outright to gain access to the QMP and QEMUMachine classes
>> for lightweight scripting and testing.
>>
>> This package is formatted in such a way that it COULD be uploaded to
>> https://pypi.org/project/qemu and installed independently of qemu.git
>> with `pip install qemu`, but of course that button remains unpushed.
>>
>> There are a few major questions to answer first:
>>
>> - What versioning scheme should we use? See patch 2.
>>
>> - Should we use a namespace package or not?
>>   - Namespaced: 'qemu.machine', 'qemu.monitor' etc may be separately
>> versioned, packaged and distributed packages. Third party authors
>> may register 'qemu.xxx' to create plugins within the namespace as
>> they see fit.
>>
>>   - Non-namespaced: 'qemu' is one giant glob package, packaged and
>> versioned in unison. We control this package exclusively.
>>
> 
> For simplicity's sake, I'd suggest starting with the non-namespaced
> approach.  It should be easier to move to a namespaced package if the
> need arises.  Also, there are many ways to extend Python code without
> necessarily requiring third party authors to register their packages
> according to a namespace.
> 
> In the Avocado project, we have been using setuptools entrypoints with
> a reasonable level of success.  Anyone can have code under any
> namespace whatsoever extending Avocado, as long as it registers its
> entrypoints.
> 

It's not (from my POV) very complex to do a namespace. I have some plans
to move e.g. qapi into qemu.qapi, and some of our other tools into
qemu.tools.

Some of these packages can be published externally, some can remain in
the tree.

but -- maybe namespaces ARE complicating matters in ways I don't
understand yet. I'll be open about it. The thought was mostly about
keeping flexibility with just installing the bits and pieces that you
want/need.

>> - How do we eliminate sys.path hacks from the rest of the QEMU tree?
>>   (Background: sys.path hacks generally impede the function of static
>>   code quality analysis tools like mypy and pylint.)
>>
>>   - Simplest: parent scripts (or developer) needs to set PYTHONPATH.
>>
>>   - Harder: Python scripts should all be written to assume package form,
>> all tests and CI that use Python should execute within a VENV.
>>
> 
> Having a venv is desirable, but it's not really necessary.  As long as
> "python setup.py develop --user" is called, that user can access this
> code without sys.path hacks.  And if the user chooses to use a venv,
> it's just an extra step.
> 

whether a venv or a user installation, it's the same thing, really: the
user needs to set up and be in that environment to use the python tools
in the tree.

Once we're there, we may as well formalize the VENV to make it easier to
set up and use.

> In the Avocado project, we have a `make develop` rule that does that
> for the main setup.py file, and for all plugins we carry on the same
> tree, which is similar in some regards to the "not at the project root
> directory" situation here with "qemu/python/setup.py".
> 

Ah, yeah. If we're going this far, I'd prefer using a VENV over
modifying the user's environment. That way you can blast it all away
with a `make distclean`.

Maybe the "make develop" target could even use the presence of a .venv
directory to know when it needs to make the environment or not ...

>>   In either case, we lose the ability (for many scripts) to "just run" a
>>   script out of the source tree if it depends on other QEMU Python
>>   files. This is annoying, but as the complexity of the Python lib
>>   grows, it is unavoidable.

Re: [PATCH 0/7] python: create installable package

2020-06-17 Thread Cleber Rosa
On Tue, Jun 02, 2020 at 08:15:16PM -0400, John Snow wrote:
> Based-on: 20200602214528.12107-1-js...@redhat.com
> 
> This series factors the python/qemu directory as an installable
> module. As a developer, you can install this to your virtual environment
> and then always have access to the classes contained within without
> needing to wrangle python import path problems.
>

First of all, major kudos for picking up this task.  It's so high in
importance to so many users (myself included) that I feel like I owe
you many truck loads of beers now. :)

> When developing, you could go to qemu/python/ and invoke `pipenv shell`
> to activate a virtual environment within which you could type `pip
> install -e .` to install a special development version of this package
> to your virtual environment. This package will always reflect the most
> recent version of the source files in the tree.
> 
> When not developing, you could install a version of this package to your
> environment outright to gain access to the QMP and QEMUMachine classes
> for lightweight scripting and testing.
> 
> This package is formatted in such a way that it COULD be uploaded to
> https://pypi.org/project/qemu and installed independently of qemu.git
> with `pip install qemu`, but of course that button remains unpushed.
> 
> There are a few major questions to answer first:
> 
> - What versioning scheme should we use? See patch 2.
> 
> - Should we use a namespace package or not?
>   - Namespaced: 'qemu.machine', 'qemu.monitor' etc may be separately
> versioned, packaged and distributed packages. Third party authors
> may register 'qemu.xxx' to create plugins within the namespace as
> they see fit.
> 
>   - Non-namespaced: 'qemu' is one giant glob package, packaged and
> versioned in unison. We control this package exclusively.
>

For simplicity's sake, I'd suggest starting with the non-namespaced
approach.  It should be easier to move to a namespaced package if the
need arises.  Also, there are many ways to extend Python code without
necessarily requiring third party authors to register their packages
according to a namespace.

In the Avocado project, we have been using setuptools entrypoints with
a reasonable level of success.  Anyone can have code under any
namespace whatsoever extending Avocado, as long as it registers its
entrypoints.

> - How do we eliminate sys.path hacks from the rest of the QEMU tree?
>   (Background: sys.path hacks generally impede the function of static
>   code quality analysis tools like mypy and pylint.)
> 
>   - Simplest: parent scripts (or developer) needs to set PYTHONPATH.
> 
>   - Harder: Python scripts should all be written to assume package form,
> all tests and CI that use Python should execute within a VENV.
>

Having a venv is desirable, but it's not really necessary.  As long as
"python setup.py develop --user" is called, that user can access this
code without sys.path hacks.  And if the user chooses to use a venv,
it's just an extra step.

In the Avocado project, we have a `make develop` rule that does that
for the main setup.py file, and for all plugins we carry on the same
tree, which is similar in some regards to the "not at the project root
directory" situation here with "qemu/python/setup.py".

>   In either case, we lose the ability (for many scripts) to "just run" a
>   script out of the source tree if it depends on other QEMU Python
>   files. This is annoying, but as the complexity of the Python lib
>   grows, it is unavoidable.
>

Like I said before, we may introduce a "make develop"-like
requirement, but after that, I don't think we'll lose anything.
Also, I think this is just a sign of maturity.  We should be using
Python as it's intended to be used, and sys.path hacks are not part
of that.

>   In the VENV case, we at least establish a simple paradigm: the scripts
>   should work in their "installed" forms; and the rest of the build and
>   test infrastructure should use this VENV to automatically handle
>   dependencies and path requirements. This should allow us to move many
>   of our existing python scripts with "hidden" dependencies into a
>   proper python module hierarchy and test for regressions with mypy,
>   flake8, pylint, etc.
> 
>   (We could even establish e.g. Sphinx versions as a dependency for our
>   build kit here and make sure it's installed to the VENV.)
> 
>   Pros: Almost all scripts can be moved under python/qemu/* and checked
>   with CQA tools. imports are written the same no matter where you are
>   (Use the fully qualified names, e.g. qemu.core.qmp.QMPMessage).
>   Regressions in scripts are caught *much* faster.
> 
>   Downsides: Kind of annoying; most scripts now require you to install a
>   devkit forwarder (pip3 install --user .) or be inside of an activated
>   venv. Not too bad if you know python at all, but it's certainly less
>   plug-n-play.
> 
> - What's our backwards compatibility policy if we start shipping this?
> 
>   Proposed: Attempt t

Re: [PATCH] block: Raise an error when backing file parameter is an empty string

2020-06-17 Thread Eric Blake

On 6/17/20 1:27 PM, Connor Kuehl wrote:

Providing an empty string for the backing file parameter like so:

qemu-img create -f qcow2 -b '' /tmp/foo

allows the flow of control to reach and subsequently fail an assert
statement because passing an empty string to

bdrv_get_full_backing_filename_from_filename()

simply results in NULL being returned without an error being raised.

To fix this, let's check for an empty string when getting the value from
the opts list.
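
(The hunk isn't quoted here, but conceptually the fix amounts to
something like the following sketch, not the literal patch text:

    backing_file = qemu_opt_get(opts, BLOCK_OPT_BACKING_FILE);
    if (backing_file && backing_file[0] == '\0') {
        error_setg(errp, "Expected backing file name, got empty string");
        goto out;
    }

rejecting the empty string up front instead of letting it reach the
assertion.)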

Reported-by: Attila Fazekas 
Fixes: https://bugzilla.redhat.com/1809553
Signed-off-by: Connor Kuehl 
---
  block.c|  4 
  tests/qemu-iotests/298 | 47 ++
  tests/qemu-iotests/298.out |  5 
  tests/qemu-iotests/group   |  1 +
  4 files changed, 57 insertions(+)
  create mode 100755 tests/qemu-iotests/298
  create mode 100644 tests/qemu-iotests/298.out


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH] block: Raise an error when backing file parameter is an empty string

2020-06-17 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20200617182725.951119-1-cku...@redhat.com/



Hi,

This series failed the asan build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC  qga/commands.o
  CC  qga/guest-agent-command-state.o
  CC  qga/main.o
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  CC  qga/commands-posix.o
  CC  qga/channel-posix.o
  CC  qga/qapi-generated/qga-qapi-types.o
---
  CC  qemu-img.o
  AR  libvhost-user.a
  GEN docs/interop/qemu-ga-ref.html
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  GEN docs/interop/qemu-ga-ref.txt
  GEN docs/interop/qemu-ga-ref.7
  LINKqemu-keymap
  LINKivshmem-client
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKivshmem-server
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-nbd
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-storage-daemon
  AS  pc-bios/optionrom/multiboot.o
  AS  pc-bios/optionrom/linuxboot.o
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-img
  CC  pc-bios/optionrom/linuxboot_dma.o
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  AS  pc-bios/optionrom/kvmvapic.o
  AS  pc-bios/optionrom/pvh.o
  LINKqemu-io
  CC  pc-bios/optionrom/pvh_main.o
  BUILD   pc-bios/optionrom/multiboot.img
  LINKqemu-edid
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  BUILD   pc-bios/optionrom/linuxboot.img
  LINKfsdev/virtfs-proxy-helper
  BUILD   pc-bios/optionrom/linuxboot_dma.img
  BUILD   pc-bios/optionrom/kvmvapic.img
  BUILD   pc-bios/optionrom/multiboot.raw
  BUILD   pc-bios/optionrom/linuxboot.raw
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKscsi/qemu-pr-helper
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-bridge-helper
  BUILD   pc-bios/optionrom/linuxboot_dma.raw
  LINKvirtiofsd
---
  BUILD   pc-bios/optionrom/pvh.img
  SIGNpc-bios/optionrom/linuxboot_dma.bin
  SIGNpc-bios/optionrom/kvmvapic.bin

Re: [PATCH v3 00/16] python: add mypy support to python/qemu

2020-06-17 Thread John Snow



On 6/17/20 1:18 PM, Philippe Mathieu-Daudé wrote:
> On 6/4/20 10:22 PM, John Snow wrote:
>> Based-on: 20200604195252.20739-1-js...@redhat.com
>>
>> This series is extracted from my larger series that attempted to bundle
>> our python module as an installable module. These fixes don't require that,
>> so they are being sent first as I think there's less up for debate in here.
>>
> [...]
>>
>> John Snow (16):
>>   python/qmp.py: Define common types
>>   iotests.py: use qemu.qmp type aliases
>>   python/qmp.py: re-absorb MonitorResponseError
>>   python/qmp.py: Do not return None from cmd_obj
>>   python/qmp.py: add casts to JSON deserialization
>>   python/qmp.py: add QMPProtocolError
>>   python/machine.py: Fix monitor address typing
>>   python/machine.py: reorder __init__
>>   python/machine.py: Don't modify state in _base_args()
>>   python/machine.py: Handle None events in events_wait
>>   python/machine.py: use qmp.command
>>   python/machine.py: Add _qmp access shim
>>   python/machine.py: fix _popen access
>>   python/qemu: make 'args' style arguments immutable
>>   iotests.py: Adjust HMP kwargs typing
>>   python/qemu: Add mypy type annotations
>>
>>  python/qemu/accel.py  |   8 +-
>>  python/qemu/machine.py| 286 --
>>  python/qemu/qmp.py| 111 +
>>  python/qemu/qtest.py  |  53 ---
>>  scripts/render_block_graph.py |   7 +-
>>  tests/qemu-iotests/iotests.py |  11 +-
>>  6 files changed, 298 insertions(+), 178 deletions(-)
>>
> 
> Thanks, applied to my python-next tree:
> https://gitlab.com/philmd/qemu/commits/python-next
> 

Awesome, thanks!




Re: [PATCH v3 00/16] python: add mypy support to python/qemu

2020-06-17 Thread Philippe Mathieu-Daudé
On 6/4/20 10:22 PM, John Snow wrote:
> Based-on: 20200604195252.20739-1-js...@redhat.com
> 
> This series is extracted from my larger series that attempted to bundle
> our python module as an installable module. These fixes don't require that,
> so they are being sent first as I think there's less up for debate in here.
> 
[...]
> 
> John Snow (16):
>   python/qmp.py: Define common types
>   iotests.py: use qemu.qmp type aliases
>   python/qmp.py: re-absorb MonitorResponseError
>   python/qmp.py: Do not return None from cmd_obj
>   python/qmp.py: add casts to JSON deserialization
>   python/qmp.py: add QMPProtocolError
>   python/machine.py: Fix monitor address typing
>   python/machine.py: reorder __init__
>   python/machine.py: Don't modify state in _base_args()
>   python/machine.py: Handle None events in events_wait
>   python/machine.py: use qmp.command
>   python/machine.py: Add _qmp access shim
>   python/machine.py: fix _popen access
>   python/qemu: make 'args' style arguments immutable
>   iotests.py: Adjust HMP kwargs typing
>   python/qemu: Add mypy type annotations
> 
>  python/qemu/accel.py  |   8 +-
>  python/qemu/machine.py| 286 --
>  python/qemu/qmp.py| 111 +
>  python/qemu/qtest.py  |  53 ---
>  scripts/render_block_graph.py |   7 +-
>  tests/qemu-iotests/iotests.py |  11 +-
>  6 files changed, 298 insertions(+), 178 deletions(-)
> 

Thanks, applied to my python-next tree:
https://gitlab.com/philmd/qemu/commits/python-next




Re: [PULL 00/43] Block layer patches

2020-06-17 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20200617144909.192176-1-kw...@redhat.com/



Hi,

This series failed the asan build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
=== TEST SCRIPT END ===

  LINKtests/qemu-iotests/socket_scm_helper
  GEN docs/interop/qemu-qmp-ref.html
  GEN docs/interop/qemu-qmp-ref.txt
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  GEN docs/interop/qemu-qmp-ref.7
  CC  qga/commands.o
  CC  qga/guest-agent-command-state.o
---
  CC  qemu-img.o
  AS  pc-bios/optionrom/pvh.o
  CC  pc-bios/optionrom/pvh_main.o
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  BUILD   pc-bios/optionrom/multiboot.img
  BUILD   pc-bios/optionrom/linuxboot.img
  BUILD   pc-bios/optionrom/kvmvapic.img
---
  BUILD   pc-bios/optionrom/pvh.raw
  SIGNpc-bios/optionrom/pvh.bin
  LINKfsdev/virtfs-proxy-helper
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKscsi/qemu-pr-helper
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  AR  libvhost-user.a
  LINKqemu-bridge-helper
  GEN docs/interop/qemu-ga-ref.txt
  GEN docs/interop/qemu-ga-ref.7
  GEN docs/interop/qemu-ga-ref.html
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-ga
  LINKqemu-keymap
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKivshmem-client
  LINKivshmem-server
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-nbd
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-storage-daemon
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-img
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 

Re: [PATCH v3 00/16] python: add mypy support to python/qemu

2020-06-17 Thread John Snow



On 6/16/20 11:33 PM, Philippe Mathieu-Daudé wrote:
> Does that mean you want to respin this series?
> Else you can consider it applied on python-next.

No no - you're all set. Just taking the opportunity to discuss tooling
that I find cumbersome at the moment.

Thank you Philippe :)

--js




[PULL 43/43] iotests: Add copyright line in qcow2.py

2020-06-17 Thread Kevin Wolf
From: Eric Blake 

The file qcow2.py was originally contributed in 2012 by Kevin Wolf,
but was not given traditional boilerplate headers at the time.  The
missing license was just rectified (commit 16306a7b39) using the
project-default GPLv2+, but as Vladimir is not at Red Hat, he did not
add a Copyright line.  All earlier contributions have come from CC'd
authors, where all but Stefan used a Red Hat address at the time of
the contribution, and that copyright carries over to the split to
qcow2_format.py (d5262c7124).

CC: Kevin Wolf 
CC: Stefan Hajnoczi 
CC: Eduardo Habkost 
CC: Max Reitz 
CC: Philippe Mathieu-Daudé 
CC: Paolo Bonzini 
Signed-off-by: Eric Blake 
Message-Id: <20200609205944.3549240-1-ebl...@redhat.com>
Acked-by: Stefan Hajnoczi 
Acked-by: Philippe Mathieu-Daudé 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/qcow2.py| 2 ++
 tests/qemu-iotests/qcow2_format.py | 1 +
 2 files changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/qcow2.py b/tests/qemu-iotests/qcow2.py
index 8c187e9a72..0910e6ac07 100755
--- a/tests/qemu-iotests/qcow2.py
+++ b/tests/qemu-iotests/qcow2.py
@@ -2,6 +2,8 @@
 #
 # Manipulations with qcow2 image
 #
+# Copyright (C) 2012 Red Hat, Inc.
+#
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
 # the Free Software Foundation; either version 2 of the License, or
diff --git a/tests/qemu-iotests/qcow2_format.py 
b/tests/qemu-iotests/qcow2_format.py
index 0f65fd161d..cc432e7ae0 100644
--- a/tests/qemu-iotests/qcow2_format.py
+++ b/tests/qemu-iotests/qcow2_format.py
@@ -1,6 +1,7 @@
 # Library for manipulations with qcow2 image
 #
 # Copyright (c) 2020 Virtuozzo International GmbH.
+# Copyright (C) 2012 Red Hat, Inc.
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
-- 
2.25.4




[PULL 42/43] iotests/{190,291}: compat=0.10 is unsupported

2020-06-17 Thread Kevin Wolf
From: Max Reitz 

Fixes: 5d72c68b49769c927e90b78af6d90f6a384b26ac
Fixes: cf2d1203dcfc2bf964453d83a2302231ce77f2dc
Signed-off-by: Max Reitz 
Message-Id: <20200617104822.27525-6-mre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/190 | 2 ++
 tests/qemu-iotests/291 | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/tests/qemu-iotests/190 b/tests/qemu-iotests/190
index fe630918e9..c22d8d64f9 100755
--- a/tests/qemu-iotests/190
+++ b/tests/qemu-iotests/190
@@ -41,6 +41,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # See 178 for more extensive tests across more formats
 _supported_fmt qcow2
 _supported_proto file
+# compat=0.10 does not support bitmaps
+_unsupported_imgopts 'compat=0.10'
 
 echo "== Huge file without bitmaps =="
 echo
diff --git a/tests/qemu-iotests/291 b/tests/qemu-iotests/291
index 404f8521f7..28e4fb9b4d 100755
--- a/tests/qemu-iotests/291
+++ b/tests/qemu-iotests/291
@@ -39,6 +39,8 @@ _supported_fmt qcow2
 _supported_proto file
 _supported_os Linux
 _require_command QEMU_NBD
+# compat=0.10 does not support bitmaps
+_unsupported_imgopts 'compat=0.10'
 
 echo
 echo "=== Initial image setup ==="
-- 
2.25.4




[PULL 37/43] block: lift blocksize property limit to 2 MiB

2020-06-17 Thread Kevin Wolf
From: Roman Kagan 

Logical and physical block sizes in QEMU are limited to 32 KiB.

This appears unnecessarily tight, and we've found bigger block sizes
handy at times.

Lift the limit to 2 MiB, which appears to be enough for everybody and
matches the qcow2 cluster size limit.
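
(For context, the setter below validates against these macros with a
power-of-two range check; roughly:

    if (!is_power_of_2(value) ||
        value < MIN_BLOCK_SIZE || value > MAX_BLOCK_SIZE) {
        error_setg(errp, "Property %s.%s doesn't take value %" PRIu64,
                   dev->id ? : "", name, value);
        return;
    }

so lifting MAX_BLOCK_SIZE is the only change needed here.)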

Signed-off-by: Roman Kagan 
Reviewed-by: Eric Blake 
Message-Id: <20200528225516.1676602-9-rvka...@yandex-team.ru>
Signed-off-by: Kevin Wolf 
---
 hw/core/qdev-properties.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index 63d48db70c..ead35d7ffd 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -784,9 +784,12 @@ const PropertyInfo qdev_prop_size32 = {
 /* lower limit is sector size */
 #define MIN_BLOCK_SIZE  512
 #define MIN_BLOCK_SIZE_STR  "512 B"
-/* upper limit is the max power of 2 that fits in uint16_t */
-#define MAX_BLOCK_SIZE  (32 * KiB)
-#define MAX_BLOCK_SIZE_STR  "32 KiB"
+/*
+ * upper limit is arbitrary, 2 MiB looks sufficient for all sensible uses, and
+ * matches qcow2 cluster size limit
+ */
+#define MAX_BLOCK_SIZE  (2 * MiB)
+#define MAX_BLOCK_SIZE_STR  "2 MiB"
 
 static void set_blocksize(Object *obj, Visitor *v, const char *name,
   void *opaque, Error **errp)
-- 
2.25.4




[PULL 35/43] block: make BlockConf size props 32bit and accept size suffixes

2020-06-17 Thread Kevin Wolf
From: Roman Kagan 

Convert all size-related properties in BlockConf to 32bit.  This will
accommodate bigger block sizes (in a followup patch).  This also allows
to make them all accept size suffixes, either via DEFINE_PROP_BLOCKSIZE
or via DEFINE_PROP_SIZE32.

Also, since min_io_size is exposed to the guest by scsi and virtio-blk
devices as an uint16_t in units of logical blocks, introduce an
additional check in blkconf_blocksizes to prevent its silent truncation.
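
To make the truncation hazard concrete: with 512-byte logical blocks,
the largest min_io_size the guest-visible uint16_t can express is
65535 * 512 = 33553920 bytes (just under 32 MiB); anything bigger
would silently wrap without the new check in blkconf_blocksizes().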

Signed-off-by: Roman Kagan 
Message-Id: <20200528225516.1676602-7-rvka...@yandex-team.ru>
Signed-off-by: Kevin Wolf 
---
 include/hw/block/block.h | 12 ++--
 include/hw/qdev-properties.h |  2 +-
 hw/block/block.c | 10 ++
 hw/core/qdev-properties.c|  4 ++--
 4 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/hw/block/block.h b/include/hw/block/block.h
index 784953a237..1e8b6253dd 100644
--- a/include/hw/block/block.h
+++ b/include/hw/block/block.h
@@ -18,9 +18,9 @@
 
 typedef struct BlockConf {
 BlockBackend *blk;
-uint16_t physical_block_size;
-uint16_t logical_block_size;
-uint16_t min_io_size;
+uint32_t physical_block_size;
+uint32_t logical_block_size;
+uint32_t min_io_size;
 uint32_t opt_io_size;
 int32_t bootindex;
 uint32_t discard_granularity;
@@ -51,9 +51,9 @@ static inline unsigned int get_physical_block_exp(BlockConf 
*conf)
   _conf.logical_block_size),\
 DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,\
   _conf.physical_block_size),   \
-DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),\
-DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),\
-DEFINE_PROP_UINT32("discard_granularity", _state,   \
+DEFINE_PROP_SIZE32("min_io_size", _state, _conf.min_io_size, 0),\
+DEFINE_PROP_SIZE32("opt_io_size", _state, _conf.opt_io_size, 0),\
+DEFINE_PROP_SIZE32("discard_granularity", _state,   \
_conf.discard_granularity, -1),  \
 DEFINE_PROP_ON_OFF_AUTO("write-cache", _state, _conf.wce,   \
 ON_OFF_AUTO_AUTO),  \
diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
index c03eadfad6..5252bb6b1a 100644
--- a/include/hw/qdev-properties.h
+++ b/include/hw/qdev-properties.h
@@ -200,7 +200,7 @@ extern const PropertyInfo qdev_prop_pcie_link_width;
 #define DEFINE_PROP_SIZE32(_n, _s, _f, _d)   \
 DEFINE_PROP_UNSIGNED(_n, _s, _f, _d, qdev_prop_size32, uint32_t)
 #define DEFINE_PROP_BLOCKSIZE(_n, _s, _f) \
-DEFINE_PROP_UNSIGNED(_n, _s, _f, 0, qdev_prop_blocksize, uint16_t)
+DEFINE_PROP_UNSIGNED(_n, _s, _f, 0, qdev_prop_blocksize, uint32_t)
 #define DEFINE_PROP_PCI_HOST_DEVADDR(_n, _s, _f) \
 DEFINE_PROP(_n, _s, _f, qdev_prop_pci_host_devaddr, PCIHostDeviceAddress)
 #define DEFINE_PROP_OFF_AUTO_PCIBAR(_n, _s, _f, _d) \
diff --git a/hw/block/block.c b/hw/block/block.c
index b22207c921..1e34573da7 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -96,6 +96,16 @@ bool blkconf_blocksizes(BlockConf *conf, Error **errp)
 return false;
 }
 
+/*
+ * all devices which support min_io_size (scsi and virtio-blk) expose it to
+ * the guest as a uint16_t in units of logical blocks
+ */
+if (conf->min_io_size / conf->logical_block_size > UINT16_MAX) {
+error_setg(errp, "min_io_size must not exceed %u logical blocks",
+   UINT16_MAX);
+return false;
+}
+
 if (!QEMU_IS_ALIGNED(conf->opt_io_size, conf->logical_block_size)) {
 error_setg(errp,
"opt_io_size must be a multiple of logical_block_size");
diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index c9af6a1341..bd4abdc1d1 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -782,7 +782,7 @@ static void set_blocksize(Object *obj, Visitor *v, const 
char *name,
 {
 DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-uint16_t *ptr = qdev_get_prop_ptr(dev, prop);
+uint32_t *ptr = qdev_get_prop_ptr(dev, prop);
 uint64_t value;
 Error *local_err = NULL;
 
@@ -821,7 +821,7 @@ const PropertyInfo qdev_prop_blocksize = {
 .name  = "size",
 .description = "A power of two between " MIN_BLOCK_SIZE_STR
" and " MAX_BLOCK_SIZE_STR,
-.get   = get_uint16,
+.get   = get_uint32,
 .set   = set_blocksize,
 .set_default_value = set_default_value_uint,
 };
-- 
2.25.4




[PULL 33/43] qdev-properties: add size32 property type

2020-06-17 Thread Kevin Wolf
From: Roman Kagan 

Introduce size32 property type which handles size suffixes (k, m, g)
just like size property, but is uint32_t rather than uint64_t.  It's
going to be useful for properties that are byte sizes but are inherently
32bit, like BlkConf.opt_io_size or .discard_granularity (they are
switched to this new property type in a followup commit).

The getter for size32 is left out for a separate patch as its benefit is
less obvious, and it affects test output; for now the regular uint32
getter is used.
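
A minimal sketch of how a device might declare such a property (the device,
struct and field names below are invented for illustration; KiB comes from
qemu/units.h):

    typedef struct DemoState {
        DeviceState parent_obj;
        uint32_t buf_size;
    } DemoState;

    static Property demo_props[] = {
        /* e.g. "-device demo,buf-size=64k"; the suffix is parsed by
         * visit_type_size() in the setter below */
        DEFINE_PROP_SIZE32("buf-size", DemoState, buf_size, 64 * KiB),
        DEFINE_PROP_END_OF_LIST(),
    };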

Signed-off-by: Roman Kagan 
Message-Id: <20200528225516.1676602-5-rvka...@yandex-team.ru>
Signed-off-by: Kevin Wolf 
---
 include/hw/qdev-properties.h |  3 +++
 hw/core/qdev-properties.c| 40 
 2 files changed, 43 insertions(+)

diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
index f161604fb6..c03eadfad6 100644
--- a/include/hw/qdev-properties.h
+++ b/include/hw/qdev-properties.h
@@ -29,6 +29,7 @@ extern const PropertyInfo qdev_prop_drive;
 extern const PropertyInfo qdev_prop_drive_iothread;
 extern const PropertyInfo qdev_prop_netdev;
 extern const PropertyInfo qdev_prop_pci_devfn;
+extern const PropertyInfo qdev_prop_size32;
 extern const PropertyInfo qdev_prop_blocksize;
 extern const PropertyInfo qdev_prop_pci_host_devaddr;
 extern const PropertyInfo qdev_prop_uuid;
@@ -196,6 +197,8 @@ extern const PropertyInfo qdev_prop_pcie_link_width;
 BlockdevOnError)
 #define DEFINE_PROP_BIOS_CHS_TRANS(_n, _s, _f, _d) \
 DEFINE_PROP_SIGNED(_n, _s, _f, _d, qdev_prop_bios_chs_trans, int)
+#define DEFINE_PROP_SIZE32(_n, _s, _f, _d)   \
+DEFINE_PROP_UNSIGNED(_n, _s, _f, _d, qdev_prop_size32, uint32_t)
 #define DEFINE_PROP_BLOCKSIZE(_n, _s, _f) \
 DEFINE_PROP_UNSIGNED(_n, _s, _f, 0, qdev_prop_blocksize, uint16_t)
 #define DEFINE_PROP_PCI_HOST_DEVADDR(_n, _s, _f) \
diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index 249dc69bd8..40c13f6ebe 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -727,6 +727,46 @@ const PropertyInfo qdev_prop_pci_devfn = {
 .set_default_value = set_default_value_int,
 };
 
+/* --- 32bit unsigned int 'size' type --- */
+
+static void set_size32(Object *obj, Visitor *v, const char *name, void *opaque,
+   Error **errp)
+{
+DeviceState *dev = DEVICE(obj);
+Property *prop = opaque;
+uint32_t *ptr = qdev_get_prop_ptr(dev, prop);
+uint64_t value;
+Error *local_err = NULL;
+
+if (dev->realized) {
+qdev_prop_set_after_realize(dev, name, errp);
+return;
+}
+
+visit_type_size(v, name, &value, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+
+if (value > UINT32_MAX) {
+error_setg(errp,
+   "Property %s.%s doesn't take value %" PRIu64
+   " (maximum: %u)",
+   dev->id ? : "", name, value, UINT32_MAX);
+return;
+}
+
+*ptr = value;
+}
+
+const PropertyInfo qdev_prop_size32 = {
+.name  = "size",
+.get = get_uint32,
+.set = set_size32,
+.set_default_value = set_default_value_uint,
+};
+
 /* --- blocksize --- */
 
 /* lower limit is sector size */
-- 
2.25.4




Re: [PATCH v4 1/4] Introduce yank feature

2020-06-17 Thread Stefan Hajnoczi
On Mon, May 25, 2020 at 05:44:23PM +0200, Lukas Straub wrote:
> +static struct YankInstance *yank_find_instance(char *name)

There are const char * -> char * casts in later patches. Please use
const char * where possible. Callers shouldn't need to cast away const.




Re: applying mailing list review tags (was: Re: [PATCH v3 00/16] python: add mypy support to python/qemu)

2020-06-17 Thread Paolo Bonzini
On 16/06/20 19:58, John Snow wrote:
> 1. Correlating a mailing list patch from e.g. patchew to a commit in my
> history, even if it's changed a little bit?

Use "git am --message-id"?

> (git-backport-diff uses patch names, that might be sufficient... Could
> use that as a starting point, at least.)
> 
> 2. Obtaining the commit message of that patch?
> `git show -s --format=%B $SHA` ought to do it...
> 
> 3. Editing that commit message? This I'm not sure about. I'd need to
> understand the tags on the upstream and downstream versions, merge them,
> and then re-write the message. Some magic with `git rebase -i` ?

"patchew apply" actually uses "git filter-branch --msg-filter" to add the
tags after a successful "git am", so you can do something similar yourself.
(Actually I have pending patches to patchew that switch it to server-side
application of tags using the "mbox" URL that Philippe mentioned earlier, but
they've been pending for quite some time now).

To get the upstream tags you can use the Patchew REST API:

   $ MSGID=20200521153616.307100-1-stefa...@redhat.com
   $ curl -L https://patchew.org/api/v1/projects/by-name/QEMU/messages/$MSGID/ | jq -r '.tags[]'
   Reported-by: Philippe Mathieu-Daudé 
   Reviewed-by: Richard Henderson 

So you'd have to take a commit, look for a Message-Id header, fetch the
tags from above mentioned Patchew API URL and add the tags to the commit
message.

The commit message can be either emitted to stdout (and the script
used with "git filter-branch)" or, for the faint of heart, the script
could do a "git commit --amend" and you can use "git rebase -i --exec"
to execute the script on all commits in a range.

This script is for the latter option:

   #! /bin/bash
   BODY=$(git show -s --format=%B)
   MSGID=$(git interpret-trailers --parse <<< $BODY | sed -n 's/^Message-Id: <\(.*\)>/\1/ip')
   USER="$(git config user.name) <$(git config user.email)>"
   BODY=$(curl -L https://patchew.org/api/v1/projects/by-name/QEMU/messages/$MSGID/ | \
     jq -r '.tags[]' | ( \
       args=()
       while read x; do
         args+=(--trailer "$x")
       done
       git interpret-trailers \
         --if-exists doNothing "${args[@]}" \
         --if-exists replace --if-missing doNothing --trailer "Signed-off-by: $USER" <<< $BODY
   ))
   git commit --amend -m"$BODY"

The script will also move your Signed-off-by line at the end of the commit
message, this might be a problem if there is more than one such line and
you want to keep them all.

Paolo



Re: [PATCH v2] qcow2: Fix preallocation on images with unaligned sizes

2020-06-17 Thread Eric Blake

On 6/17/20 9:00 AM, Alberto Garcia wrote:

When resizing an image with qcow2_co_truncate() using the falloc or
full preallocation modes the code assumes that both the old and new
sizes are cluster-aligned.

There are two problems with this:

   1) The calculation of how many clusters are involved does not always
  get the right result.

  Example: creating a 60KB image and resizing it (with
  preallocation=full) to 80KB won't allocate the second cluster.

   2) No copy-on-write is performed, so in the previous example if
  there is a backing file then the first 60KB of the first cluster
  won't be filled with data from the backing file.

This patch fixes both issues.

Signed-off-by: Alberto Garcia 
---
v2: iotests: don't check the image size if data_file is set [Max]



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v4 2/4] block/nbd.c: Add yank feature

2020-06-17 Thread Stefan Hajnoczi
On Mon, May 25, 2020 at 05:44:26PM +0200, Lukas Straub wrote:
> @@ -1395,6 +1407,15 @@ static int nbd_client_reopen_prepare(BDRVReopenState *state,
>  return 0;
>  }
> 
> +static void nbd_yank(void *opaque)
> +{
> +BlockDriverState *bs = opaque;
> +BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> +
> +qio_channel_shutdown(QIO_CHANNEL(s->sioc), QIO_CHANNEL_SHUTDOWN_BOTH, NULL);

qio_channel_shutdown() is not guaranteed to be thread-safe. Please
document new assumptions that are being introduced.

Today we can more or less get away with it (although TLS sockets are a
little iffy) because it boils down the a shutdown(2) system call. I
think it would be okay to update the qio_channel_shutdown() and
.io_shutdown() documentation to clarify that this is thread-safe.

> +atomic_set(&s->state, NBD_CLIENT_QUIT);

docs/devel/atomics.rst says:

  No barriers are implied by ``atomic_read`` and ``atomic_set`` in either Linux
  or QEMU.

Other threads might not see the latest value of s->state because this is
a weakly ordered memory access.

I haven't audited the NBD code in detail, but if you want the other
threads to always see NBD_CLIENT_QUIT then s->state should be set before
calling qio_channel_shutdown() using a stronger atomics API like
atomic_load_acquire()/atomic_store_release().
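
A minimal sketch of that ordering, reusing the names from the quoted hunk
(an illustration of the suggestion, not the actual patch):

    /* yank side: publish the state with release semantics, then shut down */
    atomic_store_release(&s->state, NBD_CLIENT_QUIT);
    qio_channel_shutdown(QIO_CHANNEL(s->sioc), QIO_CHANNEL_SHUTDOWN_BOTH, NULL);

    /* reader side, e.g. before queuing a request; pairs with the release */
    if (atomic_load_acquire(&s->state) == NBD_CLIENT_QUIT) {
        return -EIO;
    }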




[PULL 31/43] block: consolidate blocksize properties consistency checks

2020-06-17 Thread Kevin Wolf
From: Roman Kagan 

Several block device properties related to blocksize configuration must
be in certain relationship WRT each other: physical block must be no
smaller than logical block; min_io_size, opt_io_size, and
discard_granularity must be a multiple of a logical block.

To ensure these requirements are met, add corresponding consistency
checks to blkconf_blocksizes, adjusting its signature to communicate
possible error to the caller.  Also remove the now redundant consistency
checks from the specific devices.
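
A few invented values run through the new rules (QEMU_IS_ALIGNED() in
qemu/osdep.h is a plain modulo test):

    assert(512 <= 4096);                 /* logical <= physical: accepted */
    assert(QEMU_IS_ALIGNED(1536, 512));  /* min_io_size = 3 blocks: accepted */
    assert(!QEMU_IS_ALIGNED(1000, 512)); /* opt_io_size = 1000: rejected */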

Signed-off-by: Roman Kagan 
Reviewed-by: Eric Blake 
Reviewed-by: Paul Durrant 
Message-Id: <20200528225516.1676602-3-rvka...@yandex-team.ru>
Signed-off-by: Kevin Wolf 
---
 include/hw/block/block.h   |  2 +-
 hw/block/block.c   | 30 +-
 hw/block/fdc.c |  5 -
 hw/block/nvme.c|  4 +++-
 hw/block/swim.c|  5 -
 hw/block/virtio-blk.c  |  7 +--
 hw/block/xen-block.c   |  6 +-
 hw/ide/qdev.c  |  5 -
 hw/scsi/scsi-disk.c| 12 +---
 hw/usb/dev-storage.c   |  5 -
 tests/qemu-iotests/172.out |  2 +-
 11 files changed, 57 insertions(+), 26 deletions(-)

diff --git a/include/hw/block/block.h b/include/hw/block/block.h
index d7246f3862..784953a237 100644
--- a/include/hw/block/block.h
+++ b/include/hw/block/block.h
@@ -87,7 +87,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
 bool blkconf_geometry(BlockConf *conf, int *trans,
   unsigned cyls_max, unsigned heads_max, unsigned secs_max,
   Error **errp);
-void blkconf_blocksizes(BlockConf *conf);
+bool blkconf_blocksizes(BlockConf *conf, Error **errp);
 bool blkconf_apply_backend_options(BlockConf *conf, bool readonly,
bool resizable, Error **errp);
 
diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c7612b..b22207c921 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -61,7 +61,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
 return true;
 }
 
-void blkconf_blocksizes(BlockConf *conf)
+bool blkconf_blocksizes(BlockConf *conf, Error **errp)
 {
 BlockBackend *blk = conf->blk;
 BlockSizes blocksizes;
@@ -83,6 +83,34 @@ void blkconf_blocksizes(BlockConf *conf)
 conf->logical_block_size = BDRV_SECTOR_SIZE;
 }
 }
+
+if (conf->logical_block_size > conf->physical_block_size) {
+error_setg(errp,
+   "logical_block_size > physical_block_size not supported");
+return false;
+}
+
+if (!QEMU_IS_ALIGNED(conf->min_io_size, conf->logical_block_size)) {
+error_setg(errp,
+   "min_io_size must be a multiple of logical_block_size");
+return false;
+}
+
+if (!QEMU_IS_ALIGNED(conf->opt_io_size, conf->logical_block_size)) {
+error_setg(errp,
+   "opt_io_size must be a multiple of logical_block_size");
+return false;
+}
+
+if (conf->discard_granularity != -1 &&
+!QEMU_IS_ALIGNED(conf->discard_granularity,
+ conf->logical_block_size)) {
+error_setg(errp, "discard_granularity must be "
+   "a multiple of logical_block_size");
+return false;
+}
+
+return true;
 }
 
 bool blkconf_apply_backend_options(BlockConf *conf, bool readonly,
diff --git a/hw/block/fdc.c b/hw/block/fdc.c
index 8528b9a3c7..be0674e4aa 100644
--- a/hw/block/fdc.c
+++ b/hw/block/fdc.c
@@ -554,7 +554,10 @@ static void floppy_drive_realize(DeviceState *qdev, Error **errp)
 read_only = !blk_bs(dev->conf.blk) || blk_is_read_only(dev->conf.blk);
 }
 
-blkconf_blocksizes(&dev->conf);
+if (!blkconf_blocksizes(&dev->conf, errp)) {
+return;
+}
+
 if (dev->conf.logical_block_size != 512 ||
 dev->conf.physical_block_size != 512)
 {
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 2a2e43f681..1aee042d4c 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1425,7 +1425,9 @@ static void nvme_init_state(NvmeCtrl *n)
 
 static void nvme_init_blk(NvmeCtrl *n, Error **errp)
 {
-blkconf_blocksizes(&n->conf);
+if (!blkconf_blocksizes(&n->conf, errp)) {
+return;
+}
 blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
   false, errp);
 }
diff --git a/hw/block/swim.c b/hw/block/swim.c
index 8f124782f4..74f56e8f46 100644
--- a/hw/block/swim.c
+++ b/hw/block/swim.c
@@ -189,7 +189,10 @@ static void swim_drive_realize(DeviceState *qdev, Error **errp)
 assert(ret == 0);
 }
 
-blkconf_blocksizes(&dev->conf);
+if (!blkconf_blocksizes(&dev->conf, errp)) {
+return;
+}
+
 if (dev->conf.logical_block_size != 512 ||
 dev->conf.physical_block_size != 512)
 {
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 6938a75aa5..413783693c 100644
--- a/

[PULL 36/43] qdev-properties: add getter for size32 and blocksize

2020-06-17 Thread Kevin Wolf
From: Roman Kagan 

Add getter for size32, and use it for blocksize, too.

In its human-readable branch, it reports approximate size in
human-readable units next to the exact byte value, like the getter for
64bit size does.

Adjust the expected test output accordingly.

Signed-off-by: Roman Kagan 
Reviewed-by: Eric Blake 
Message-Id: <20200528225516.1676602-8-rvka...@yandex-team.ru>
Signed-off-by: Kevin Wolf 
---
 hw/core/qdev-properties.c  |  15 +-
 tests/qemu-iotests/172.out | 530 ++---
 2 files changed, 278 insertions(+), 267 deletions(-)

diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index bd4abdc1d1..63d48db70c 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -730,6 +730,17 @@ const PropertyInfo qdev_prop_pci_devfn = {
 
 /* --- 32bit unsigned int 'size' type --- */
 
+static void get_size32(Object *obj, Visitor *v, const char *name, void *opaque,
+   Error **errp)
+{
+DeviceState *dev = DEVICE(obj);
+Property *prop = opaque;
+uint32_t *ptr = qdev_get_prop_ptr(dev, prop);
+uint64_t value = *ptr;
+
+visit_type_size(v, name, &value, errp);
+}
+
 static void set_size32(Object *obj, Visitor *v, const char *name, void *opaque,
Error **errp)
 {
@@ -763,7 +774,7 @@ static void set_size32(Object *obj, Visitor *v, const char *name, void *opaque,
 
 const PropertyInfo qdev_prop_size32 = {
 .name  = "size",
-.get = get_uint32,
+.get = get_size32,
 .set = set_size32,
 .set_default_value = set_default_value_uint,
 };
@@ -821,7 +832,7 @@ const PropertyInfo qdev_prop_blocksize = {
 .name  = "size",
 .description = "A power of two between " MIN_BLOCK_SIZE_STR
" and " MAX_BLOCK_SIZE_STR,
-.get   = get_uint32,
+.get   = get_size32,
 .set   = set_blocksize,
 .set_default_value = set_default_value_uint,
 };
diff --git a/tests/qemu-iotests/172.out b/tests/qemu-iotests/172.out
index 59cc70aebb..e782c5957e 100644
--- a/tests/qemu-iotests/172.out
+++ b/tests/qemu-iotests/172.out
@@ -24,11 +24,11 @@ Testing:
   dev: floppy, id ""
 unit = 0 (0x0)
 drive = "floppy0"
-logical_block_size = 512 (0x200)
-physical_block_size = 512 (0x200)
-min_io_size = 0 (0x0)
-opt_io_size = 0 (0x0)
-discard_granularity = 4294967295 (0xffffffff)
+logical_block_size = 512 (512 B)
+physical_block_size = 512 (512 B)
+min_io_size = 0 (0 B)
+opt_io_size = 0 (0 B)
+discard_granularity = 4294967295 (4 GiB)
 write-cache = "auto"
 share-rw = false
 drive-type = "288"
@@ -54,11 +54,11 @@ Testing: -fda TEST_DIR/t.qcow2
   dev: floppy, id ""
 unit = 0 (0x0)
 drive = "floppy0"
-logical_block_size = 512 (0x200)
-physical_block_size = 512 (0x200)
-min_io_size = 0 (0x0)
-opt_io_size = 0 (0x0)
-discard_granularity = 4294967295 (0xffffffff)
+logical_block_size = 512 (512 B)
+physical_block_size = 512 (512 B)
+min_io_size = 0 (0 B)
+opt_io_size = 0 (0 B)
+discard_granularity = 4294967295 (4 GiB)
 write-cache = "auto"
 share-rw = false
 drive-type = "144"
@@ -81,22 +81,22 @@ Testing: -fdb TEST_DIR/t.qcow2
   dev: floppy, id ""
 unit = 1 (0x1)
 drive = "floppy1"
-logical_block_size = 512 (0x200)
-physical_block_size = 512 (0x200)
-min_io_size = 0 (0x0)
-opt_io_size = 0 (0x0)
-discard_granularity = 4294967295 (0xffffffff)
+logical_block_size = 512 (512 B)
+physical_block_size = 512 (512 B)
+min_io_size = 0 (0 B)
+opt_io_size = 0 (0 B)
+discard_granularity = 4294967295 (4 GiB)
 write-cache = "auto"
 share-rw = false
 drive-type = "144"
   dev: floppy, id ""
 unit = 0 (0x0)
 drive = "floppy0"
-logical_block_size = 512 (0x200)
-physical_block_size = 512 (0x200)
-min_io_size = 0 (0x0)
-opt_io_size = 0 (0x0)
-discard_granularity = 4294967295 (0xffffffff)
+logical_block_size = 512 (512 B)
+physical_block_size = 512 (512 B)
+min_io_size = 0 (0 B)
+opt_io_size = 0 (0 B)
+discard_granularity = 4294967295 (4 GiB)
 write-cache = "auto"
 share-rw = false
 drive-type = "288"
@@ -119,22 +119,

[PULL 41/43] iotests/229: data_file is unsupported

2020-06-17 Thread Kevin Wolf
From: Max Reitz 

Fixes: d89ac3cf305b28c024a76805a84d75c0ee1e786f
Signed-off-by: Max Reitz 
Message-Id: <20200617104822.27525-5-mre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/229 | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/229 b/tests/qemu-iotests/229
index 99acb55ebb..89a5359f32 100755
--- a/tests/qemu-iotests/229
+++ b/tests/qemu-iotests/229
@@ -46,6 +46,9 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt qcow2 qed
 _supported_proto file
 _supported_os Linux
+# blkdebug can only inject errors on bs->file, so external data files
+# do not work with this test
+_unsupported_imgopts data_file
 
 
 DEST_IMG="$TEST_DIR/d.$IMGFMT"
-- 
2.25.4




[PULL 38/43] iotests.py: Add skip_for_formats() decorator

2020-06-17 Thread Kevin Wolf
From: Max Reitz 

Sometimes, we want to skip some test methods for certain formats.  This
decorator allows that.

Signed-off-by: Max Reitz 
Message-Id: <20200617104822.27525-2-mre...@redhat.com>
Tested-by: Thomas Huth 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/iotests.py | 16 
 tests/qemu-iotests/118|  7 +++
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index f20d90f969..5ea4c4df8b 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -1103,6 +1103,22 @@ def skip_if_unsupported(required_formats=(), read_only=False):
 return func_wrapper
 return skip_test_decorator
 
+def skip_for_formats(formats: Sequence[str] = ()) \
+-> Callable[[Callable[[QMPTestCase, List[Any], Dict[str, Any]], None]],
+Callable[[QMPTestCase, List[Any], Dict[str, Any]], None]]:
+'''Skip Test Decorator
+   Skips the test for the given formats'''
+def skip_test_decorator(func):
+def func_wrapper(test_case: QMPTestCase, *args: List[Any],
+ **kwargs: Dict[str, Any]) -> None:
+if imgfmt in formats:
+msg = f'{test_case}: Skipped for format {imgfmt}'
+test_case.case_skip(msg)
+else:
+func(test_case, *args, **kwargs)
+return func_wrapper
+return skip_test_decorator
+
 def skip_if_user_is_root(func):
 '''Skip Test Decorator
Runs the test only without root permissions'''
diff --git a/tests/qemu-iotests/118 b/tests/qemu-iotests/118
index adc8a848b5..2350929fd8 100755
--- a/tests/qemu-iotests/118
+++ b/tests/qemu-iotests/118
@@ -683,11 +683,10 @@ class TestBlockJobsAfterCycle(ChangeBaseClass):
 except OSError:
 pass
 
+# We need backing file support
+@iotests.skip_for_formats(('vpc', 'parallels', 'qcow', 'vdi', 'vmdk', 'raw',
+   'vhdx'))
 def test_snapshot_and_commit(self):
-# We need backing file support
-if iotests.imgfmt != 'qcow2' and iotests.imgfmt != 'qed':
-return
-
 result = self.vm.qmp('blockdev-snapshot-sync', device='drive0',
snapshot_file=new_img,
format=iotests.imgfmt)
-- 
2.25.4




[PULL 29/43] .gitignore: Ignore storage-daemon files

2020-06-17 Thread Kevin Wolf
From: Roman Bolshakov 

The files are generated.
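
(A leading "/qapi/..." pattern only matches the top-level qapi directory;
the "**/qapi/..." form used below also matches the generated copies under
storage-daemon/qapi/.)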

Fixes: 2af282ec51a ("qemu-storage-daemon: Add --monitor option")
Cc: Kevin Wolf 
Signed-off-by: Roman Bolshakov 
Message-Id: <20200612105830.17082-1-r.bolsha...@yadro.com>
Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
Signed-off-by: Kevin Wolf 
---
 .gitignore | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/.gitignore b/.gitignore
index 0c5af83aa7..90acb4347d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,18 +34,18 @@
 /qapi/qapi-builtin-types.[ch]
 /qapi/qapi-builtin-visit.[ch]
 /qapi/qapi-commands-*.[ch]
-/qapi/qapi-commands.[ch]
-/qapi/qapi-emit-events.[ch]
+**/qapi/qapi-commands.[ch]
+**/qapi/qapi-emit-events.[ch]
 /qapi/qapi-events-*.[ch]
-/qapi/qapi-events.[ch]
-/qapi/qapi-init-commands.[ch]
-/qapi/qapi-introspect.[ch]
+**/qapi/qapi-events.[ch]
+**/qapi/qapi-init-commands.[ch]
+**/qapi/qapi-introspect.[ch]
 /qapi/qapi-types-*.[ch]
-/qapi/qapi-types.[ch]
+**/qapi/qapi-types.[ch]
 /qapi/qapi-visit-*.[ch]
 !/qapi/qapi-visit-core.c
-/qapi/qapi-visit.[ch]
-/qapi/qapi-doc.texi
+**/qapi/qapi-visit.[ch]
+**/qapi/qapi-doc.texi
 /qemu-edid
 /qemu-img
 /qemu-nbd
@@ -59,6 +59,7 @@
 /qemu-keymap
 /qemu-monitor.texi
 /qemu-monitor-info.texi
+/qemu-storage-daemon
 /qemu-version.h
 /qemu-version.h.tmp
 /module_block.h
-- 
2.25.4




[PULL 34/43] qdev-properties: make blocksize accept size suffixes

2020-06-17 Thread Kevin Wolf
From: Roman Kagan 

It appears convenient to be able to specify physical_block_size and
logical_block_size using common size suffixes.

Teach the blocksize property setter to interpret them.  Also express the
upper and lower limits in the respective units.
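
For instance (an invented invocation), something like
"-device virtio-blk-pci,drive=drive0,physical_block_size=4k" should now be
accepted, with visit_type_size() expanding the suffix to 4096.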

Signed-off-by: Roman Kagan 
Reviewed-by: Eric Blake 
Message-Id: <20200528225516.1676602-6-rvka...@yandex-team.ru>
Signed-off-by: Kevin Wolf 
---
 hw/core/qdev-properties.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index 40c13f6ebe..c9af6a1341 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -14,6 +14,7 @@
 #include "qapi/visitor.h"
 #include "chardev/char.h"
 #include "qemu/uuid.h"
+#include "qemu/units.h"
 
 void qdev_prop_set_after_realize(DeviceState *dev, const char *name,
   Error **errp)
@@ -771,17 +772,18 @@ const PropertyInfo qdev_prop_size32 = {
 
 /* lower limit is sector size */
 #define MIN_BLOCK_SIZE  512
-#define MIN_BLOCK_SIZE_STR  stringify(MIN_BLOCK_SIZE)
+#define MIN_BLOCK_SIZE_STR  "512 B"
 /* upper limit is the max power of 2 that fits in uint16_t */
-#define MAX_BLOCK_SIZE  32768
-#define MAX_BLOCK_SIZE_STR  stringify(MAX_BLOCK_SIZE)
+#define MAX_BLOCK_SIZE  (32 * KiB)
+#define MAX_BLOCK_SIZE_STR  "32 KiB"
 
 static void set_blocksize(Object *obj, Visitor *v, const char *name,
   void *opaque, Error **errp)
 {
 DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
-uint16_t value, *ptr = qdev_get_prop_ptr(dev, prop);
+uint16_t *ptr = qdev_get_prop_ptr(dev, prop);
+uint64_t value;
 Error *local_err = NULL;
 
 if (dev->realized) {
@@ -789,7 +791,7 @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
 return;
 }
 
-visit_type_uint16(v, name, &value, &local_err);
+visit_type_size(v, name, &value, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 return;
@@ -797,7 +799,7 @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
 /* value of 0 means "unset" */
 if (value && (value < MIN_BLOCK_SIZE || value > MAX_BLOCK_SIZE)) {
 error_setg(errp,
-   "Property %s.%s doesn't take value %" PRIu16
+   "Property %s.%s doesn't take value %" PRIu64
" (minimum: " MIN_BLOCK_SIZE_STR
", maximum: " MAX_BLOCK_SIZE_STR ")",
dev->id ? : "", name, value);
@@ -816,7 +818,7 @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
 }
 
 const PropertyInfo qdev_prop_blocksize = {
-.name  = "uint16",
+.name  = "size",
 .description = "A power of two between " MIN_BLOCK_SIZE_STR
" and " MAX_BLOCK_SIZE_STR,
 .get   = get_uint16,
-- 
2.25.4




[PULL 32/43] qdev-properties: blocksize: use same limits in code and description

2020-06-17 Thread Kevin Wolf
From: Roman Kagan 

Make it easier (more visible) to maintain the limits on the blocksize
properties in sync with the respective description, by using macros both
in the code and in the description.

Signed-off-by: Roman Kagan 
Reviewed-by: Eric Blake 
Message-Id: <20200528225516.1676602-4-rvka...@yandex-team.ru>
Signed-off-by: Kevin Wolf 
---
 hw/core/qdev-properties.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index cc924815da..249dc69bd8 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -729,6 +729,13 @@ const PropertyInfo qdev_prop_pci_devfn = {
 
 /* --- blocksize --- */
 
+/* lower limit is sector size */
+#define MIN_BLOCK_SIZE  512
+#define MIN_BLOCK_SIZE_STR  stringify(MIN_BLOCK_SIZE)
+/* upper limit is the max power of 2 that fits in uint16_t */
+#define MAX_BLOCK_SIZE  32768
+#define MAX_BLOCK_SIZE_STR  stringify(MAX_BLOCK_SIZE)
+
 static void set_blocksize(Object *obj, Visitor *v, const char *name,
   void *opaque, Error **errp)
 {
@@ -736,8 +743,6 @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
 Property *prop = opaque;
 uint16_t value, *ptr = qdev_get_prop_ptr(dev, prop);
 Error *local_err = NULL;
-const int64_t min = 512;
-const int64_t max = 32768;
 
 if (dev->realized) {
 qdev_prop_set_after_realize(dev, name, errp);
@@ -750,9 +755,12 @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
 return;
 }
 /* value of 0 means "unset" */
-if (value && (value < min || value > max)) {
-error_setg(errp, QERR_PROPERTY_VALUE_OUT_OF_RANGE,
-   dev->id ? : "", name, (int64_t)value, min, max);
+if (value && (value < MIN_BLOCK_SIZE || value > MAX_BLOCK_SIZE)) {
+error_setg(errp,
+   "Property %s.%s doesn't take value %" PRIu16
+   " (minimum: " MIN_BLOCK_SIZE_STR
+   ", maximum: " MAX_BLOCK_SIZE_STR ")",
+   dev->id ? : "", name, value);
 return;
 }
 
@@ -769,7 +777,8 @@ static void set_blocksize(Object *obj, Visitor *v, const char *name,
 
 const PropertyInfo qdev_prop_blocksize = {
 .name  = "uint16",
-.description = "A power of two between 512 and 32768",
+.description = "A power of two between " MIN_BLOCK_SIZE_STR
+   " and " MAX_BLOCK_SIZE_STR,
 .get   = get_uint16,
 .set   = set_blocksize,
 .set_default_value = set_default_value_uint,
-- 
2.25.4




[PULL 39/43] iotests/041: Skip test_small_target for qed

2020-06-17 Thread Kevin Wolf
From: Max Reitz 

qed does not support shrinking images, so the test_small_target method
should be skipped to keep 041 passing.

Fixes: 16cea4ee1c8e5a69a058e76f426b2e17974d8d7d
Signed-off-by: Max Reitz 
Message-Id: <20200617104822.27525-3-mre...@redhat.com>
Tested-by: Thomas Huth 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/041 | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tests/qemu-iotests/041 b/tests/qemu-iotests/041
index 601c756117..b843f88a66 100755
--- a/tests/qemu-iotests/041
+++ b/tests/qemu-iotests/041
@@ -277,6 +277,8 @@ class TestSingleBlockdev(TestSingleDrive):
 result = self.vm.run_job('job0')
self.assertEqual(result, 'Source and target image have different sizes')
 
+# qed does not support shrinking
+@iotests.skip_for_formats(('qed',))
 def test_small_target(self):
 self.do_test_target_size(self.image_len // 2)
 
-- 
2.25.4




[PULL 22/43] hw/block/nvme: factor out cmb setup

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-17-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 43 ---
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a4022b0291..8aabb4c3c3 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -56,6 +56,7 @@
 
 #define NVME_REG_SIZE 0x1000
 #define NVME_DB_SIZE  4
+#define NVME_CMB_BIR 2
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -1438,6 +1439,28 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 id_ns->nuse = id_ns->ncap;
 }
 
+static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+NVME_CMBLOC_SET_BIR(n->bar.cmbloc, NVME_CMB_BIR);
+NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
+
+NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
+NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
+
+n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
+  "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
+ PCI_BASE_ADDRESS_SPACE_MEMORY |
+ PCI_BASE_ADDRESS_MEM_TYPE_64 |
+ PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
+}
+
 static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
 {
 uint8_t *pci_conf = pci_dev->config;
@@ -1514,25 +1537,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 n->bar.intmc = n->bar.intms = 0;
 
 if (n->params.cmb_size_mb) {
-
-NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
-NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
-
-NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
-NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
-NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
-NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
-
-n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
-memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
-  "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
-pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
-PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
-PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
-
+nvme_init_cmb(n, pci_dev);
 } else if (n->pmrdev) {
 /* Controller Capabilities register */
 NVME_CAP_SET_PMRS(n->bar.cap, 1);
-- 
2.25.4




[PULL 25/43] hw/block/nvme: factor out controller identify setup

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-20-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 49 ++---
 1 file changed, 26 insertions(+), 23 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 02a6a97df9..e10fc774fc 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1533,32 +1533,11 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
 }
 }
 
-static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 {
-NvmeCtrl *n = NVME(pci_dev);
 NvmeIdCtrl *id = &n->id_ctrl;
-Error *local_err = NULL;
-
-int i;
-uint8_t *pci_conf;
-
-nvme_check_constraints(n, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-return;
-}
-
-nvme_init_state(n);
-
-nvme_init_blk(n, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-return;
-}
-
-nvme_init_pci(n, pci_dev);
+uint8_t *pci_conf = pci_dev->config;
 
-pci_conf = pci_dev->config;
 id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
 id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
 strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
@@ -1591,6 +1570,30 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 
 n->bar.vs = 0x00010200;
 n->bar.intmc = n->bar.intms = 0;
+}
+
+static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+{
+NvmeCtrl *n = NVME(pci_dev);
+Error *local_err = NULL;
+
+int i;
+
+nvme_check_constraints(n, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+
+nvme_init_state(n);
+nvme_init_blk(n, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+
+nvme_init_pci(n, pci_dev);
+nvme_init_ctrl(n, pci_dev);
 
 for (i = 0; i < n->num_namespaces; i++) {
 nvme_init_namespace(n, &n->namespaces[i], &local_err);
-- 
2.25.4




[PULL 21/43] hw/block/nvme: factor out pci setup

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-16-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index c98af03f44..a4022b0291 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1438,6 +1438,22 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 id_ns->nuse = id_ns->ncap;
 }
 
+static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+uint8_t *pci_conf = pci_dev->config;
+
+pci_conf[PCI_INTERRUPT_PIN] = 1;
+pci_config_set_prog_interface(pci_conf, 0x2);
+pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+pcie_endpoint_cap_init(pci_dev, 0x80);
+
+memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
+  n->reg_size);
+pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
+ PCI_BASE_ADDRESS_MEM_TYPE_64, &n->iomem);
+msix_init_exclusive_bar(pci_dev, n->params.max_ioqpairs + 1, 4, NULL);
+}
+
 static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
 NvmeCtrl *n = NVME(pci_dev);
@@ -1461,19 +1477,9 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-pci_conf = pci_dev->config;
-pci_conf[PCI_INTERRUPT_PIN] = 1;
-pci_config_set_prog_interface(pci_dev->config, 0x2);
-pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
-pcie_endpoint_cap_init(pci_dev, 0x80);
-
-memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
-  "nvme", n->reg_size);
-pci_register_bar(pci_dev, 0,
-PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
-&n->iomem);
-msix_init_exclusive_bar(pci_dev, n->params.max_ioqpairs + 1, 4, NULL);
+nvme_init_pci(n, pci_dev);
 
+pci_conf = pci_dev->config;
 id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
 id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
 strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
-- 
2.25.4




[PULL 17/43] hw/block/nvme: factor out device state setup

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-12-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ee669ee8dc..b721cab9b0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1399,6 +1399,17 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 }
 }
 
+static void nvme_init_state(NvmeCtrl *n)
+{
+n->num_namespaces = 1;
+/* add one to max_ioqpairs to account for the admin queue pair */
+n->reg_size = pow2ceil(NVME_REG_SIZE +
+   2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
+n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
+n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
+n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
+}
+
 static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
 NvmeCtrl *n = NVME(pci_dev);
@@ -1415,6 +1426,8 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
+nvme_init_state(n);
+
 bs_size = blk_getlength(n->conf.blk);
 if (bs_size < 0) {
 error_setg(errp, "could not get backing file size");
@@ -1433,17 +1446,8 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
 pcie_endpoint_cap_init(pci_dev, 0x80);
 
-n->num_namespaces = 1;
-
-/* add one to max_ioqpairs to account for the admin queue pair */
-n->reg_size = pow2ceil(NVME_REG_SIZE +
-   2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
-n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
-n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
-n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
-
 memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
   "nvme", n->reg_size);
 pci_register_bar(pci_dev, 0,
-- 
2.25.4




[PULL 30/43] virtio-blk: store opt_io_size with correct size

2020-06-17 Thread Kevin Wolf
From: Roman Kagan 

The width of opt_io_size in virtio_blk_config is 32bit.  However, it's
written with virtio_stw_p; this may result in value truncation, and on
big-endian systems with legacy virtio in completely bogus readings in
the guest.

Use the appropriate accessor to store it.
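
A standalone sketch of the truncation itself (illustrative only, not QEMU
code); on big-endian hosts the 16-bit store additionally lands on the wrong
bytes of the 32-bit field, which is where the "completely bogus readings"
come from:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t opt_io_size = 0x12000;             /* 73728, > UINT16_MAX */
        uint16_t truncated = (uint16_t)opt_io_size; /* keeps the low half */

        printf("%u -> %u\n", (unsigned)opt_io_size, (unsigned)truncated);
        /* prints: 73728 -> 8192 */
        return 0;
    }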

Signed-off-by: Roman Kagan 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Kevin Wolf 
Message-Id: <20200528225516.1676602-2-rvka...@yandex-team.ru>
Signed-off-by: Kevin Wolf 
---
 hw/block/virtio-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 8882a1d1d4..6938a75aa5 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -930,7 +930,7 @@ static void virtio_blk_update_config(VirtIODevice *vdev, 
uint8_t *config)
 virtio_stw_p(vdev, &blkcfg.geometry.cylinders, conf->cyls);
 virtio_stl_p(vdev, &blkcfg.blk_size, blk_size);
 virtio_stw_p(vdev, &blkcfg.min_io_size, conf->min_io_size / blk_size);
-virtio_stw_p(vdev, &blkcfg.opt_io_size, conf->opt_io_size / blk_size);
+virtio_stl_p(vdev, &blkcfg.opt_io_size, conf->opt_io_size / blk_size);
 blkcfg.geometry.heads = conf->heads;
 /*
  * We must ensure that the block device capacity is a multiple of
-- 
2.25.4




[PULL 40/43] iotests/292: data_file is unsupported

2020-06-17 Thread Kevin Wolf
From: Max Reitz 

Fixes: e4d7019e1a81c61de6a925c3ac5bb6e62ea21b29
Signed-off-by: Max Reitz 
Message-Id: <20200617104822.27525-4-mre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/292 | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tests/qemu-iotests/292 b/tests/qemu-iotests/292
index a2de27cca4..83ab19231d 100755
--- a/tests/qemu-iotests/292
+++ b/tests/qemu-iotests/292
@@ -40,6 +40,11 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt qcow2
 _supported_proto file
 _supported_os Linux
+# We need qemu-img map to show the file where the data is allocated,
+# but with an external data file, it will show that instead of the
+# file we want to check.  So just skip this test for external data
+# files.
+_unsupported_imgopts data_file
 
 echo '### Create the backing image'
 BACKING_IMG="$TEST_IMG.base"
-- 
2.25.4




[PULL 28/43] hw/block/nvme: verify msix_init_exclusive_bar() return value

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Pass an Error to msix_init_exclusive_bar() and check it.

Signed-off-by: Klaus Jensen 
Message-Id: <20200609190333.59390-23-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index acc6dbc900..2a2e43f681 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1522,7 +1522,7 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
  PCI_BASE_ADDRESS_MEM_PREFETCH, &n->pmrdev->mr);
 }
 
-static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
+static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
 uint8_t *pci_conf = pci_dev->config;
 
@@ -1535,7 +1535,9 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
   n->reg_size);
 pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
  PCI_BASE_ADDRESS_MEM_TYPE_64, &n->iomem);
-msix_init_exclusive_bar(pci_dev, n->params.msix_qsize, 4, NULL);
+if (msix_init_exclusive_bar(pci_dev, n->params.msix_qsize, 4, errp)) {
+return;
+}
 
 if (n->params.cmb_size_mb) {
 nvme_init_cmb(n, pci_dev);
@@ -1603,7 +1605,12 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-nvme_init_pci(n, pci_dev);
+nvme_init_pci(n, pci_dev, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+
 nvme_init_ctrl(n, pci_dev);
 
 for (i = 0; i < n->num_namespaces; i++) {
-- 
2.25.4




[PULL 24/43] hw/block/nvme: do cmb/pmr init as part of pci init

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Philippe Mathieu-Daudé 
Message-Id: <20200609190333.59390-19-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b954e7b7b2..02a6a97df9 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1525,6 +1525,12 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
 pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
  PCI_BASE_ADDRESS_MEM_TYPE_64, &n->iomem);
 msix_init_exclusive_bar(pci_dev, n->params.max_ioqpairs + 1, 4, NULL);
+
+if (n->params.cmb_size_mb) {
+nvme_init_cmb(n, pci_dev);
+} else if (n->pmrdev) {
+nvme_init_pmr(n, pci_dev);
+}
 }
 
 static void nvme_realize(PCIDevice *pci_dev, Error **errp)
@@ -1586,12 +1592,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 n->bar.vs = 0x00010200;
 n->bar.intmc = n->bar.intms = 0;
 
-if (n->params.cmb_size_mb) {
-nvme_init_cmb(n, pci_dev);
-} else if (n->pmrdev) {
-nvme_init_pmr(n, pci_dev);
-}
-
 for (i = 0; i < n->num_namespaces; i++) {
 nvme_init_namespace(n, &n->namespaces[i], &local_err);
 if (local_err) {
-- 
2.25.4




[PULL 23/43] hw/block/nvme: factor out pmr setup

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Philippe Mathieu-Daudé 
Message-Id: <20200609190333.59390-18-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 95 ++---
 1 file changed, 51 insertions(+), 44 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 8aabb4c3c3..b954e7b7b2 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -57,6 +57,7 @@
 #define NVME_REG_SIZE 0x1000
 #define NVME_DB_SIZE  4
 #define NVME_CMB_BIR 2
+#define NVME_PMR_BIR 2
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -1461,6 +1462,55 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
  PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
 }
 
+static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+/* Controller Capabilities register */
+NVME_CAP_SET_PMRS(n->bar.cap, 1);
+
+/* PMR Capabities register */
+n->bar.pmrcap = 0;
+NVME_PMRCAP_SET_RDS(n->bar.pmrcap, 0);
+NVME_PMRCAP_SET_WDS(n->bar.pmrcap, 0);
+NVME_PMRCAP_SET_BIR(n->bar.pmrcap, NVME_PMR_BIR);
+NVME_PMRCAP_SET_PMRTU(n->bar.pmrcap, 0);
+/* Turn on bit 1 support */
+NVME_PMRCAP_SET_PMRWBM(n->bar.pmrcap, 0x02);
+NVME_PMRCAP_SET_PMRTO(n->bar.pmrcap, 0);
+NVME_PMRCAP_SET_CMSS(n->bar.pmrcap, 0);
+
+/* PMR Control register */
+n->bar.pmrctl = 0;
+NVME_PMRCTL_SET_EN(n->bar.pmrctl, 0);
+
+/* PMR Status register */
+n->bar.pmrsts = 0;
+NVME_PMRSTS_SET_ERR(n->bar.pmrsts, 0);
+NVME_PMRSTS_SET_NRDY(n->bar.pmrsts, 0);
+NVME_PMRSTS_SET_HSTS(n->bar.pmrsts, 0);
+NVME_PMRSTS_SET_CBAI(n->bar.pmrsts, 0);
+
+/* PMR Elasticity Buffer Size register */
+n->bar.pmrebs = 0;
+NVME_PMREBS_SET_PMRSZU(n->bar.pmrebs, 0);
+NVME_PMREBS_SET_RBB(n->bar.pmrebs, 0);
+NVME_PMREBS_SET_PMRWBZ(n->bar.pmrebs, 0);
+
+/* PMR Sustained Write Throughput register */
+n->bar.pmrswtp = 0;
+NVME_PMRSWTP_SET_PMRSWTU(n->bar.pmrswtp, 0);
+NVME_PMRSWTP_SET_PMRSWTV(n->bar.pmrswtp, 0);
+
+/* PMR Memory Space Control register */
+n->bar.pmrmsc = 0;
+NVME_PMRMSC_SET_CMSE(n->bar.pmrmsc, 0);
+NVME_PMRMSC_SET_CBA(n->bar.pmrmsc, 0);
+
+pci_register_bar(pci_dev, NVME_PMRCAP_BIR(n->bar.pmrcap),
+ PCI_BASE_ADDRESS_SPACE_MEMORY |
+ PCI_BASE_ADDRESS_MEM_TYPE_64 |
+ PCI_BASE_ADDRESS_MEM_PREFETCH, &n->pmrdev->mr);
+}
+
 static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
 {
 uint8_t *pci_conf = pci_dev->config;
@@ -1539,50 +1589,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 if (n->params.cmb_size_mb) {
 nvme_init_cmb(n, pci_dev);
 } else if (n->pmrdev) {
-/* Controller Capabilities register */
-NVME_CAP_SET_PMRS(n->bar.cap, 1);
-
-/* PMR Capabities register */
-n->bar.pmrcap = 0;
-NVME_PMRCAP_SET_RDS(n->bar.pmrcap, 0);
-NVME_PMRCAP_SET_WDS(n->bar.pmrcap, 0);
-NVME_PMRCAP_SET_BIR(n->bar.pmrcap, 2);
-NVME_PMRCAP_SET_PMRTU(n->bar.pmrcap, 0);
-/* Turn on bit 1 support */
-NVME_PMRCAP_SET_PMRWBM(n->bar.pmrcap, 0x02);
-NVME_PMRCAP_SET_PMRTO(n->bar.pmrcap, 0);
-NVME_PMRCAP_SET_CMSS(n->bar.pmrcap, 0);
-
-/* PMR Control register */
-n->bar.pmrctl = 0;
-NVME_PMRCTL_SET_EN(n->bar.pmrctl, 0);
-
-/* PMR Status register */
-n->bar.pmrsts = 0;
-NVME_PMRSTS_SET_ERR(n->bar.pmrsts, 0);
-NVME_PMRSTS_SET_NRDY(n->bar.pmrsts, 0);
-NVME_PMRSTS_SET_HSTS(n->bar.pmrsts, 0);
-NVME_PMRSTS_SET_CBAI(n->bar.pmrsts, 0);
-
-/* PMR Elasticity Buffer Size register */
-n->bar.pmrebs = 0;
-NVME_PMREBS_SET_PMRSZU(n->bar.pmrebs, 0);
-NVME_PMREBS_SET_RBB(n->bar.pmrebs, 0);
-NVME_PMREBS_SET_PMRWBZ(n->bar.pmrebs, 0);
-
-/* PMR Sustained Write Throughput register */
-n->bar.pmrswtp = 0;
-NVME_PMRSWTP_SET_PMRSWTU(n->bar.pmrswtp, 0);
-NVME_PMRSWTP_SET_PMRSWTV(n->bar.pmrswtp, 0);
-
-/* PMR Memory Space Control register */
-n->bar.pmrmsc = 0;
-NVME_PMRMSC_SET_CMSE(n->bar.pmrmsc, 0);
-NVME_PMRMSC_SET_CBA(n->bar.pmrmsc, 0);
-
-pci_register_bar(pci_dev, NVME_PMRCAP_BIR(n->bar.pmrcap),
-PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
-PCI_BASE_ADDRESS_MEM_PREFETCH, &n->pmrdev->mr);
+nvme_init_pmr(n, pci_dev);
 }
 
 for (i = 0; i < n->num_namespaces; i++) {
-- 
2.25.4




[PULL 26/43] hw/block/nvme: Verify msix_vector_use() returned value

2020-06-17 Thread Kevin Wolf
From: Philippe Mathieu-Daudé 

msix_vector_use() returns -EINVAL on error. Assert it won't.

Signed-off-by: Philippe Mathieu-Daudé 
Message-Id: <20200609190333.59390-21-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index e10fc774fc..fe17aa5d70 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -615,6 +615,10 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
 static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
 uint16_t cqid, uint16_t vector, uint16_t size, uint16_t irq_enabled)
 {
+int ret;
+
+ret = msix_vector_use(&n->parent_obj, vector);
+assert(ret == 0);
 cq->ctrl = n;
 cq->cqid = cqid;
 cq->size = size;
@@ -625,7 +629,6 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
 cq->head = cq->tail = 0;
 QTAILQ_INIT(&cq->req_list);
 QTAILQ_INIT(&cq->sq_list);
-msix_vector_use(&n->parent_obj, cq->vector);
 n->cq[cqid] = cq;
 cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
 }
-- 
2.25.4




[PULL 27/43] hw/block/nvme: add msix_qsize parameter

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Decouple the requested maximum number of ioqpairs (param max_ioqpairs)
from the number of MSI-X interrupt vectors by introducing a new
msix_qsize parameter and initialize MSI-X with that. This allows
emulating a device that has fewer vectors than I/O queue pairs and also
allows more than 2048 queue pairs. To keep the device behaving as
previously, use a msix_qsize default of 65 (default max_ioqpairs + 1).
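
With the new property one could, for example (invented values), run
"-device nvme,drive=nvme0,serial=deadbeef,max_ioqpairs=128,msix_qsize=16" to
emulate 128 I/O queue pairs sharing 16 MSI-X vectors; the defaults (64 and
65) keep the current behaviour.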

This decoupling was actually suggested by Maxim some time ago in a
slightly different context, so adding a Suggested-by.

Suggested-by: Maxim Levitsky 
Signed-off-by: Klaus Jensen 
Message-Id: <20200609190333.59390-22-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.h |  1 +
 hw/block/nvme.c | 17 +
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 61dd9b23b8..1d30c0bca2 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -7,6 +7,7 @@ typedef struct NvmeParams {
 char *serial;
 uint32_t num_queues; /* deprecated since 5.1 */
 uint32_t max_ioqpairs;
+uint16_t msix_qsize;
 uint32_t cmb_size_mb;
 } NvmeParams;
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index fe17aa5d70..acc6dbc900 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -54,6 +54,7 @@
 #include "trace.h"
 #include "nvme.h"
 
#define NVME_MAX_IOQPAIRS 0xffff
 #define NVME_REG_SIZE 0x1000
 #define NVME_DB_SIZE  4
 #define NVME_CMB_BIR 2
@@ -662,7 +663,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
-if (unlikely(vector > n->params.max_ioqpairs)) {
+if (unlikely(vector >= n->params.msix_qsize)) {
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -1371,9 +1372,16 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 }
 
 if (params->max_ioqpairs < 1 ||
-params->max_ioqpairs > PCI_MSIX_FLAGS_QSIZE) {
+params->max_ioqpairs > NVME_MAX_IOQPAIRS) {
 error_setg(errp, "max_ioqpairs must be between 1 and %d",
-   PCI_MSIX_FLAGS_QSIZE);
+   NVME_MAX_IOQPAIRS);
+return;
+}
+
+if (params->msix_qsize < 1 ||
+params->msix_qsize > PCI_MSIX_FLAGS_QSIZE + 1) {
+error_setg(errp, "msix_qsize must be between 1 and %d",
+   PCI_MSIX_FLAGS_QSIZE + 1);
 return;
 }
 
@@ -1527,7 +1535,7 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
   n->reg_size);
 pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
  PCI_BASE_ADDRESS_MEM_TYPE_64, &n->iomem);
-msix_init_exclusive_bar(pci_dev, n->params.max_ioqpairs + 1, 4, NULL);
+msix_init_exclusive_bar(pci_dev, n->params.msix_qsize, 4, NULL);
 
 if (n->params.cmb_size_mb) {
 nvme_init_cmb(n, pci_dev);
@@ -1634,6 +1642,7 @@ static Property nvme_props[] = {
 DEFINE_PROP_UINT32("cmb_size_mb", NvmeCtrl, params.cmb_size_mb, 0),
 DEFINE_PROP_UINT32("num_queues", NvmeCtrl, params.num_queues, 0),
 DEFINE_PROP_UINT32("max_ioqpairs", NvmeCtrl, params.max_ioqpairs, 64),
+DEFINE_PROP_UINT16("msix_qsize", NvmeCtrl, params.msix_qsize, 65),
 DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
2.25.4




[PULL 16/43] hw/block/nvme: factor out property/constraint checks

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-11-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 48 ++--
 1 file changed, 30 insertions(+), 18 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 61447220a8..ee669ee8dc 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1354,24 +1354,19 @@ static const MemoryRegionOps nvme_cmb_ops = {
 },
 };
 
-static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
-NvmeCtrl *n = NVME(pci_dev);
-NvmeIdCtrl *id = &n->id_ctrl;
-
-int i;
-int64_t bs_size;
-uint8_t *pci_conf;
+NvmeParams *params = &n->params;
 
-if (n->params.num_queues) {
+if (params->num_queues) {
 warn_report("num_queues is deprecated; please use max_ioqpairs "
 "instead");
 
-n->params.max_ioqpairs = n->params.num_queues - 1;
+params->max_ioqpairs = params->num_queues - 1;
 }
 
-if (n->params.max_ioqpairs < 1 ||
-n->params.max_ioqpairs > PCI_MSIX_FLAGS_QSIZE) {
+if (params->max_ioqpairs < 1 ||
+params->max_ioqpairs > PCI_MSIX_FLAGS_QSIZE) {
 error_setg(errp, "max_ioqpairs must be between 1 and %d",
PCI_MSIX_FLAGS_QSIZE);
 return;
@@ -1382,13 +1377,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-bs_size = blk_getlength(n->conf.blk);
-if (bs_size < 0) {
-error_setg(errp, "could not get backing file size");
-return;
-}
-
-if (!n->params.serial) {
+if (!params->serial) {
 error_setg(errp, "serial property not set");
 return;
 }
@@ -1408,6 +1397,29 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 
 host_memory_backend_set_mapped(n->pmrdev, true);
 }
+}
+
+static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+{
+NvmeCtrl *n = NVME(pci_dev);
+NvmeIdCtrl *id = &n->id_ctrl;
+Error *local_err = NULL;
+
+int i;
+int64_t bs_size;
+uint8_t *pci_conf;
+
+nvme_check_constraints(n, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+
+bs_size = blk_getlength(n->conf.blk);
+if (bs_size < 0) {
+error_setg(errp, "could not get backing file size");
+return;
+}
 
 blkconf_blocksizes(&n->conf);
 if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
-- 
2.25.4




[PULL 08/43] hw/block/nvme: rename trace events to pci_nvme

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Change the prefix of all nvme device related trace events to 'pci_nvme'
to not clash with trace events from the nvme block driver.

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-3-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c   | 198 +-
 hw/block/trace-events | 180 +++---
 2 files changed, 188 insertions(+), 190 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index c1476e8b2a..e8f5c5ab82 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -125,16 +125,16 @@ static void nvme_irq_assert(NvmeCtrl *n, NvmeCQueue *cq)
 {
 if (cq->irq_enabled) {
 if (msix_enabled(&(n->parent_obj))) {
-trace_nvme_irq_msix(cq->vector);
+trace_pci_nvme_irq_msix(cq->vector);
 msix_notify(&(n->parent_obj), cq->vector);
 } else {
-trace_nvme_irq_pin();
+trace_pci_nvme_irq_pin();
 assert(cq->cqid < 64);
 n->irq_status |= 1 << cq->cqid;
 nvme_irq_check(n);
 }
 } else {
-trace_nvme_irq_masked();
+trace_pci_nvme_irq_masked();
 }
 }
 
@@ -159,7 +159,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
 int num_prps = (len >> n->page_bits) + 1;
 
 if (unlikely(!prp1)) {
-trace_nvme_err_invalid_prp();
+trace_pci_nvme_err_invalid_prp();
 return NVME_INVALID_FIELD | NVME_DNR;
 } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
@@ -173,7 +173,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
 len -= trans_len;
 if (len) {
 if (unlikely(!prp2)) {
-trace_nvme_err_invalid_prp2_missing();
+trace_pci_nvme_err_invalid_prp2_missing();
 goto unmap;
 }
 if (len > n->page_size) {
@@ -189,7 +189,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
 
 if (i == n->max_prp_ents - 1 && len > n->page_size) {
 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
-trace_nvme_err_invalid_prplist_ent(prp_ent);
+trace_pci_nvme_err_invalid_prplist_ent(prp_ent);
 goto unmap;
 }
 
@@ -202,7 +202,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
 }
 
 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
-trace_nvme_err_invalid_prplist_ent(prp_ent);
+trace_pci_nvme_err_invalid_prplist_ent(prp_ent);
 goto unmap;
 }
 
@@ -217,7 +217,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
 }
 } else {
 if (unlikely(prp2 & (n->page_size - 1))) {
-trace_nvme_err_invalid_prp2_align(prp2);
+trace_pci_nvme_err_invalid_prp2_align(prp2);
 goto unmap;
 }
 if (qsg->nsg) {
@@ -265,20 +265,20 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
 QEMUIOVector iov;
 uint16_t status = NVME_SUCCESS;
 
-trace_nvme_dma_read(prp1, prp2);
+trace_pci_nvme_dma_read(prp1, prp2);
 
 if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 if (qsg.nsg > 0) {
 if (unlikely(dma_buf_read(ptr, len, &qsg))) {
-trace_nvme_err_invalid_dma();
+trace_pci_nvme_err_invalid_dma();
 status = NVME_INVALID_FIELD | NVME_DNR;
 }
 qemu_sglist_destroy(&qsg);
 } else {
 if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
-trace_nvme_err_invalid_dma();
+trace_pci_nvme_err_invalid_dma();
 status = NVME_INVALID_FIELD | NVME_DNR;
 }
 qemu_iovec_destroy(&iov);
@@ -367,7 +367,7 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
 uint32_t count = nlb << data_shift;
 
 if (unlikely(slba + nlb > ns->id_ns.nsze)) {
-trace_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
+trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
 return NVME_LBA_RANGE | NVME_DNR;
 }
 
@@ -395,11 +395,11 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
 int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
 enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
 
-trace_nvme_rw(is_write ? "write" : "read", nlb, data_size, slba);
+trace_pci_nvme_rw(is_write ? "write" : "read", nlb, data_size, slba);
 
 if (u

[PULL 19/43] hw/block/nvme: add namespace helpers

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Introduce some small helpers to make the next patches easier on the eye.
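
For example, with the 512-byte LBA format configured at realize time
(lbaf[0].ds = BDRV_SECTOR_BITS = 9) and a 1 GiB backing image,
nvme_ns_lbads() returns 9 and nvme_ns_nlbas() yields 1 GiB >> 9 = 2097152
LBAs.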

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-14-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.h | 17 +
 hw/block/nvme.c |  3 +--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index cedc8022db..61dd9b23b8 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -61,6 +61,17 @@ typedef struct NvmeNamespace {
 NvmeIdNsid_ns;
 } NvmeNamespace;
 
+static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
+{
+NvmeIdNs *id_ns = &ns->id_ns;
+return &id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
+}
+
+static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
+{
+return nvme_ns_lbaf(ns)->ds;
+}
+
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
 OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -97,4 +108,10 @@ typedef struct NvmeCtrl {
 NvmeIdCtrl  id_ctrl;
 } NvmeCtrl;
 
+/* calculate the number of LBAs that the namespace can accommodate */
+static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
+{
+return n->ns_size >> nvme_ns_lbads(ns);
+}
+
 #endif /* HW_NVME_H */
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 87f1f0d0d1..3f3db17231 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1573,8 +1573,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 id_ns->dps = 0;
 id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
 id_ns->ncap  = id_ns->nuse = id_ns->nsze =
-cpu_to_le64(n->ns_size >>
-id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas)].ds);
+cpu_to_le64(nvme_ns_nlbas(n, ns));
 }
 }
 
-- 
2.25.4




[PULL 20/43] hw/block/nvme: factor out namespace setup

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky  
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-15-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 46 ++
 1 file changed, 26 insertions(+), 20 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3f3db17231..c98af03f44 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1417,6 +1417,27 @@ static void nvme_init_blk(NvmeCtrl *n, Error **errp)
   false, errp);
 }
 
+static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+{
+int64_t bs_size;
+NvmeIdNs *id_ns = &ns->id_ns;
+
+bs_size = blk_getlength(n->conf.blk);
+if (bs_size < 0) {
+error_setg_errno(errp, -bs_size, "could not get backing file size");
+return;
+}
+
+n->ns_size = bs_size;
+
+id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
+
+/* no thin provisioning */
+id_ns->ncap = id_ns->nsze;
+id_ns->nuse = id_ns->ncap;
+}
+
 static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
 NvmeCtrl *n = NVME(pci_dev);
@@ -1424,7 +1445,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 Error *local_err = NULL;
 
 int i;
-int64_t bs_size;
 uint8_t *pci_conf;
 
 nvme_check_constraints(n, &local_err);
@@ -1435,12 +1455,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 
 nvme_init_state(n);
 
-bs_size = blk_getlength(n->conf.blk);
-if (bs_size < 0) {
-error_setg(errp, "could not get backing file size");
-return;
-}
-
 nvme_init_blk(n, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
@@ -1453,8 +1467,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
 pcie_endpoint_cap_init(pci_dev, 0x80);
 
-n->ns_size = bs_size / (uint64_t)n->num_namespaces;
-
 memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
   "nvme", n->reg_size);
 pci_register_bar(pci_dev, 0,
@@ -1563,17 +1575,11 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 }
 
 for (i = 0; i < n->num_namespaces; i++) {
-NvmeNamespace *ns = &n->namespaces[i];
-NvmeIdNs *id_ns = &ns->id_ns;
-id_ns->nsfeat = 0;
-id_ns->nlbaf = 0;
-id_ns->flbas = 0;
-id_ns->mc = 0;
-id_ns->dpc = 0;
-id_ns->dps = 0;
-id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
-id_ns->ncap  = id_ns->nuse = id_ns->nsze =
-cpu_to_le64(nvme_ns_nlbas(n, ns));
+nvme_init_namespace(n, &n->namespaces[i], &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
 }
 }
 
-- 
2.25.4




[PULL 13/43] hw/block/nvme: fix pin-based interrupt behavior

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

First, since the device only supports MSI-X or pin-based interrupts, if
MSI-X is not enabled, it should not accept interrupt vectors different
from 0 when creating completion queues.

Secondly, the irq_status NvmeCtrl member is meant to be compared to the
INTMS register, so it should only be 32 bits wide. And it is really only
useful when used with multi-message MSI.

Third, since we do not force a 1-to-1 correspondence between cqid and
interrupt vector, the irq_status register should not have bits set
according to cqid, but according to the associated interrupt vector.

Fix these issues, but keep irq_status available so we can easily support
multi-message MSI down the line.
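
For reference, a sketch of how the 32-bit irq_status is meant to interact
with INTMS when MSI-X is disabled (a paraphrase of nvme_irq_check(), not
necessarily the exact code in the tree):

static void nvme_irq_check(NvmeCtrl *n)
{
    if (msix_enabled(&n->parent_obj)) {
        return;
    }
    /* assert the pin while any pending vector bit is left unmasked */
    if (~n->bar.intms & n->irq_status) {
        pci_irq_assert(&n->parent_obj);
    } else {
        pci_irq_deassert(&n->parent_obj);
    }
}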

Fixes: 5e9aa92eb1a5 ("hw/block: Fix pin-based interrupt behaviour of NVMe")
Cc: "Michael S. Tsirkin" 
Cc: Marcel Apfelbaum 
Signed-off-by: Klaus Jensen 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-8-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.h |  2 +-
 hw/block/nvme.c | 12 
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 9df244c93c..91f16c8125 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -84,7 +84,7 @@ typedef struct NvmeCtrl {
 uint32_tcmbsz;
 uint32_tcmbloc;
 uint8_t *cmbuf;
-uint64_tirq_status;
+uint32_tirq_status;
 uint64_thost_timestamp; /* Timestamp sent by the host 
*/
 uint64_ttimestamp_set_qemu_clock_ms;/* QEMU clock time */
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index d6fcf078a4..ee514625ee 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -137,8 +137,8 @@ static void nvme_irq_assert(NvmeCtrl *n, NvmeCQueue *cq)
 msix_notify(&(n->parent_obj), cq->vector);
 } else {
 trace_pci_nvme_irq_pin();
-assert(cq->cqid < 64);
-n->irq_status |= 1 << cq->cqid;
+assert(cq->vector < 32);
+n->irq_status |= 1 << cq->vector;
 nvme_irq_check(n);
 }
 } else {
@@ -152,8 +152,8 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
 if (msix_enabled(&(n->parent_obj))) {
 return;
 } else {
-assert(cq->cqid < 64);
-n->irq_status &= ~(1 << cq->cqid);
+assert(cq->vector < 32);
+n->irq_status &= ~(1 << cq->vector);
 nvme_irq_check(n);
 }
 }
@@ -652,6 +652,10 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_pci_nvme_err_invalid_create_cq_addr(prp1);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
+if (unlikely(!msix_enabled(&n->parent_obj) && vector)) {
+trace_pci_nvme_err_invalid_create_cq_vector(vector);
+return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
+}
 if (unlikely(vector > n->params.num_queues)) {
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
-- 
2.25.4




[PULL 15/43] hw/block/nvme: remove redundant cmbloc/cmbsz members

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-10-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.h | 2 --
 hw/block/nvme.c | 7 ++-
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 26c38bd913..cedc8022db 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -82,8 +82,6 @@ typedef struct NvmeCtrl {
 uint32_tnum_namespaces;
 uint32_tmax_q_ents;
 uint64_tns_size;
-uint32_tcmbsz;
-uint32_tcmbloc;
 uint8_t *cmbuf;
 uint32_tirq_status;
 uint64_thost_timestamp; /* Timestamp sent by the host 
*/
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 1c1d2f8b77..61447220a8 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -76,7 +76,7 @@ static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 
 static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
-if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
+if (n->bar.cmbsz && nvme_addr_is_cmb(n, addr)) {
 memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
 return;
 }
@@ -170,7 +170,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector 
*iov, uint64_t prp1,
 if (unlikely(!prp1)) {
 trace_pci_nvme_err_invalid_prp();
 return NVME_INVALID_FIELD | NVME_DNR;
-} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
+} else if (n->bar.cmbsz && prp1 >= n->ctrl_mem.addr &&
prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
 qsg->nsg = 0;
 qemu_iovec_init(iov, num_prps);
@@ -1485,9 +1485,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
 NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
 
-n->cmbloc = n->bar.cmbloc;
-n->cmbsz = n->bar.cmbsz;
-
 n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
 memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
   "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
-- 
2.25.4




[PULL 14/43] hw/block/nvme: add max_ioqpairs device parameter

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

The num_queues device parameter has a slightly confusing meaning because
it accounts for the admin queue pair, which is not really optional.
Secondly, it is really the maximum number of queues allowed.

Add a new max_ioqpairs parameter that only accounts for I/O queue pairs,
but keep num_queues for compatibility.
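
For illustration (drive and serial values are placeholders, and this
assumes the compatibility mapping max_ioqpairs = num_queues - 1 implied
above), the following two invocations request the same queue resources:

-device nvme,drive=d0,serial=deadbeef,num_queues=65     # deprecated
-device nvme,drive=d0,serial=deadbeef,max_ioqpairs=64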

Signed-off-by: Klaus Jensen 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-9-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.h |  3 ++-
 hw/block/nvme.c | 51 ++---
 2 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 91f16c8125..26c38bd913 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -5,7 +5,8 @@
 
 typedef struct NvmeParams {
 char *serial;
-uint32_t num_queues;
+uint32_t num_queues; /* deprecated since 5.1 */
+uint32_t max_ioqpairs;
 uint32_t cmb_size_mb;
 } NvmeParams;
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ee514625ee..1c1d2f8b77 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -20,7 +20,7 @@
  *  -device nvme,drive=,serial=,id=, \
  *  cmb_size_mb=, \
  *  [pmrdev=,] \
- *  num_queues=
+ *  max_ioqpairs=
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -36,6 +36,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/units.h"
+#include "qemu/error-report.h"
 #include "hw/block/block.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci.h"
@@ -85,12 +86,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->params.max_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->params.max_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -656,7 +657,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
-if (unlikely(vector > n->params.num_queues)) {
+if (unlikely(vector > n->params.max_ioqpairs)) {
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -808,8 +809,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = cpu_to_le32((n->params.num_queues - 2) |
- ((n->params.num_queues - 2) << 16));
+result = cpu_to_le32((n->params.max_ioqpairs - 1) |
+ ((n->params.max_ioqpairs - 1) << 16));
 trace_pci_nvme_getfeat_numq(result);
 break;
 case NVME_TIMESTAMP:
@@ -853,10 +854,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_NUMBER_OF_QUEUES:
 trace_pci_nvme_setfeat_numq((dw11 & 0x) + 1,
 ((dw11 >> 16) & 0x) + 1,
-n->params.num_queues - 1,
-n->params.num_queues - 1);
-req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
-  ((n->params.num_queues - 2) << 16));
+n->params.max_ioqpairs,
+n->params.max_ioqpairs);
+req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
+  ((n->params.max_ioqpairs - 1) << 16));
 break;
 case NVME_TIMESTAMP:
 return nvme_set_feature_timestamp(n, cmd);
@@ -927,12 +928,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
 blk_drain(n->conf.blk);
 
-for (i = 0; i < n->params.num_queues; i++) {
+for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
 if (n->sq[i] != NULL) {
 nvme_free_sq(n->sq[i], n);
 }
 }
-for (i = 0; i < n->params.num_queues; i++) {
+for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
 if (n->cq[i] != NULL) {
 nvme_free_cq(n->cq[i], n);
 }
@@ -1362,8 +1363,17 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 int64_t bs_size;
 uint8_t *pci_conf;
 
-if (!n->params.num_queues) {
-error_setg(errp, "num_queues can't be zero");
+if (n->params.num_queues) {
+warn_report("num_queues is deprecated; please use max_ioqpairs "
+"instead");
+
+n->params.

[PULL 10/43] hw/block/nvme: move device parameters to separate struct

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Move the device configuration parameters to a separate struct to make it
explicit what is configurable and what is set internally.

Signed-off-by: Klaus Jensen 
Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Message-Id: <20200609190333.59390-5-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.h | 11 ---
 hw/block/nvme.c | 49 ++---
 2 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 6520a9f0be..9df244c93c 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -1,7 +1,14 @@
 #ifndef HW_NVME_H
 #define HW_NVME_H
+
 #include "block/nvme.h"
 
+typedef struct NvmeParams {
+char *serial;
+uint32_t num_queues;
+uint32_t cmb_size_mb;
+} NvmeParams;
+
 typedef struct NvmeAsyncEvent {
 QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
 NvmeAerResult result;
@@ -63,6 +70,7 @@ typedef struct NvmeCtrl {
 MemoryRegion ctrl_mem;
 NvmeBar  bar;
 BlockConfconf;
+NvmeParams   params;
 
 uint32_tpage_size;
 uint16_tpage_bits;
@@ -71,10 +79,8 @@ typedef struct NvmeCtrl {
 uint16_tsqe_size;
 uint32_treg_size;
 uint32_tnum_namespaces;
-uint32_tnum_queues;
 uint32_tmax_q_ents;
 uint64_tns_size;
-uint32_tcmb_size_mb;
 uint32_tcmbsz;
 uint32_tcmbloc;
 uint8_t *cmbuf;
@@ -82,7 +88,6 @@ typedef struct NvmeCtrl {
 uint64_thost_timestamp; /* Timestamp sent by the host 
*/
 uint64_ttimestamp_set_qemu_clock_ms;/* QEMU clock time */
 
-char*serial;
 HostMemoryBackend *pmrdev;
 
 NvmeNamespace   *namespaces;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 0d3f8f345f..bc2d9d2091 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -77,12 +77,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->num_queues && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->num_queues && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -644,7 +644,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_pci_nvme_err_invalid_create_cq_addr(prp1);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
-if (unlikely(vector > n->num_queues)) {
+if (unlikely(vector > n->params.num_queues)) {
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -796,7 +796,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 
16));
+result = cpu_to_le32((n->params.num_queues - 2) |
+ ((n->params.num_queues - 2) << 16));
 trace_pci_nvme_getfeat_numq(result);
 break;
 case NVME_TIMESTAMP:
@@ -840,9 +841,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_NUMBER_OF_QUEUES:
 trace_pci_nvme_setfeat_numq((dw11 & 0x) + 1,
 ((dw11 >> 16) & 0x) + 1,
-n->num_queues - 1, n->num_queues - 1);
-req->cqe.result =
-cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+n->params.num_queues - 1,
+n->params.num_queues - 1);
+req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
+  ((n->params.num_queues - 2) << 16));
 break;
 case NVME_TIMESTAMP:
 return nvme_set_feature_timestamp(n, cmd);
@@ -913,12 +915,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
 blk_drain(n->conf.blk);
 
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->sq[i] != NULL) {
 nvme_free_sq(n->sq[i], n);
 }
 }
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->cq[i] != NULL) {
 nvme_free_cq(n->cq[i], n);
 }
@@ -1348,7 +1350,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 int64_t bs_size;
 uint8_t *pci_conf;
 
-if (!n->num_queues) {
+if (!n->params.num_queues) {
 error_setg(errp, "num_queues can't be zero");
 return;
 }
@@ -1364,12 +1366,12 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 re

[PULL 18/43] hw/block/nvme: factor out block backend setup

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-13-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b721cab9b0..87f1f0d0d1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1410,6 +1410,13 @@ static void nvme_init_state(NvmeCtrl *n)
 n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
 }
 
+static void nvme_init_blk(NvmeCtrl *n, Error **errp)
+{
+blkconf_blocksizes(&n->conf);
+blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
+  false, errp);
+}
+
 static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
 NvmeCtrl *n = NVME(pci_dev);
@@ -1434,9 +1441,9 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-blkconf_blocksizes(&n->conf);
-if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
-   false, errp)) {
+nvme_init_blk(n, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
 return;
 }
 
-- 
2.25.4




[PULL 07/43] hw/block/nvme: fix pci doorbell size calculation

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

The size of the BAR is 0x1000 (main registers) + 8 bytes for each
queue. Currently, the size of the BAR is calculated like so:

n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);

Since the 'num_queues' parameter already accounts for the admin queue,
it need not be incremented by one. Also, the size should be initialized
to 0x1000, not 0x1004:

n->reg_size = pow2ceil(0x1000 + 2 * n->num_queues * 4);

Thus, with the default value of num_queues (64), we set aside room
for 1 admin queue and 63 I/O queues (4 bytes per doorbell, 2 doorbells
per queue).
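
As a quick sanity check (illustration only), plugging in the default
value gives:

n->reg_size = pow2ceil(0x1000 + 2 * 64 * 4)
            = pow2ceil(0x1200)
            = 0x2000   /* an 8 KiB BAR */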

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-2-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a21eeca2fb..c1476e8b2a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -53,6 +53,9 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_REG_SIZE 0x1000
+#define NVME_DB_SIZE  4
+
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
 (trace_##trace)(__VA_ARGS__); \
@@ -1401,7 +1404,9 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 pcie_endpoint_cap_init(pci_dev, 0x80);
 
 n->num_namespaces = 1;
-n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
+
+/* num_queues is really number of pairs, so each has two doorbells */
+n->reg_size = pow2ceil(NVME_REG_SIZE + 2 * n->num_queues * NVME_DB_SIZE);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
 n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
-- 
2.25.4




[PULL 03/43] virtio-blk: Refactor the code that processes queued requests

2020-06-17 Thread Kevin Wolf
From: Sergio Lopez 

Move the code that processes queued requests from
virtio_blk_dma_restart_bh() to its own, non-static, function. This
will allow us to call it from virtio_blk_data_plane_start() in a
future patch.

Signed-off-by: Sergio Lopez 
Message-Id: <20200603093240.40489-2-...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/hw/virtio/virtio-blk.h |  1 +
 hw/block/virtio-blk.c  | 16 +++-
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 1e62f869b2..f584ad9b86 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -86,5 +86,6 @@ typedef struct MultiReqBuffer {
 } MultiReqBuffer;
 
 bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq);
+void virtio_blk_process_queued_requests(VirtIOBlock *s);
 
 #endif
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index f5f6fc925e..978574e4da 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -819,15 +819,11 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, 
VirtQueue *vq)
 virtio_blk_handle_output_do(s, vq);
 }
 
-static void virtio_blk_dma_restart_bh(void *opaque)
+void virtio_blk_process_queued_requests(VirtIOBlock *s)
 {
-VirtIOBlock *s = opaque;
 VirtIOBlockReq *req = s->rq;
 MultiReqBuffer mrb = {};
 
-qemu_bh_delete(s->bh);
-s->bh = NULL;
-
 s->rq = NULL;
 
 aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
@@ -855,6 +851,16 @@ static void virtio_blk_dma_restart_bh(void *opaque)
 aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
+static void virtio_blk_dma_restart_bh(void *opaque)
+{
+VirtIOBlock *s = opaque;
+
+qemu_bh_delete(s->bh);
+s->bh = NULL;
+
+virtio_blk_process_queued_requests(s);
+}
+
 static void virtio_blk_dma_restart_cb(void *opaque, int running,
   RunState state)
 {
-- 
2.25.4




[PULL 11/43] hw/block/nvme: use constants in identify

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Signed-off-by: Klaus Jensen 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-6-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 include/block/nvme.h | 8 
 hw/block/nvme.c  | 8 
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 5525c8e343..1720ee1d51 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -705,6 +705,14 @@ typedef struct NvmePSD {
 uint8_t resv[16];
 } NvmePSD;
 
+#define NVME_IDENTIFY_DATA_SIZE 4096
+
+enum {
+NVME_ID_CNS_NS = 0x0,
+NVME_ID_CNS_CTRL   = 0x1,
+NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
+};
+
 typedef struct NvmeIdCtrl {
 uint16_tvid;
 uint16_tssvid;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index bc2d9d2091..2a26b8859a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -692,7 +692,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify 
*c)
 
 static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
 {
-static const int data_len = 4 * KiB;
+static const int data_len = NVME_IDENTIFY_DATA_SIZE;
 uint32_t min_nsid = le32_to_cpu(c->nsid);
 uint64_t prp1 = le64_to_cpu(c->prp1);
 uint64_t prp2 = le64_to_cpu(c->prp2);
@@ -722,11 +722,11 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
 NvmeIdentify *c = (NvmeIdentify *)cmd;
 
 switch (le32_to_cpu(c->cns)) {
-case 0x00:
+case NVME_ID_CNS_NS:
 return nvme_identify_ns(n, c);
-case 0x01:
+case NVME_ID_CNS_CTRL:
 return nvme_identify_ctrl(n, c);
-case 0x02:
+case NVME_ID_CNS_NS_ACTIVE_LIST:
 return nvme_identify_nslist(n, c);
 default:
 trace_pci_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
-- 
2.25.4




[PULL 05/43] block: Refactor subdirectory recursion during make

2020-06-17 Thread Kevin Wolf
From: Eric Blake 

Rather than listing block/monitor from the top-level Makefile.objs, we
should instead list monitor from block/Makefile.objs.

Suggested-by: Kevin Wolf 
Fixes: bb4e58c613
Signed-off-by: Eric Blake 
Message-Id: <20200608173339.3244211-1-ebl...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 Makefile.objs   | 2 +-
 block/Makefile.objs | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Makefile.objs b/Makefile.objs
index c09d95dfe3..7ce2588b89 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -13,7 +13,7 @@ chardev-obj-y = chardev/
 
 authz-obj-y = authz/
 
-block-obj-y = block/ block/monitor/ nbd/ scsi/
+block-obj-y = block/ nbd/ scsi/
 block-obj-y += block.o blockjob.o job.o
 block-obj-y += qemu-io-cmds.o
 block-obj-$(CONFIG_REPLICATION) += replication.o
diff --git a/block/Makefile.objs b/block/Makefile.objs
index 3635b6b4c1..96028eedce 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -46,6 +46,7 @@ block-obj-y += aio_task.o
 block-obj-y += backup-top.o
 block-obj-y += filter-compress.o
 common-obj-y += monitor/
+block-obj-y += monitor/
 
 block-obj-y += stream.o
 
-- 
2.25.4




[PULL 12/43] hw/block/nvme: refactor nvme_addr_read

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

Pull the controller memory buffer check to its own function. The check
will be used on its own in later patches.

Signed-off-by: Klaus Jensen 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-7-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 2a26b8859a..d6fcf078a4 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -65,14 +65,22 @@
 
 static void nvme_process_sq(void *opaque);
 
+static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
+{
+hwaddr low = n->ctrl_mem.addr;
+hwaddr hi  = n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size);
+
+return addr >= low && addr < hi;
+}
+
 static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
-if (n->cmbsz && addr >= n->ctrl_mem.addr &&
-addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
 memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
-} else {
-pci_dma_read(&n->parent_obj, addr, buf, size);
+return;
 }
+
+pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
-- 
2.25.4




[PULL 09/43] hw/block/nvme: remove superfluous breaks

2020-06-17 Thread Kevin Wolf
From: Klaus Jensen 

These break statements were left over when commit 3036a626e9ef ("nvme:
add Get/Set Feature Timestamp support") was merged.

Signed-off-by: Klaus Jensen 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Keith Busch 
Message-Id: <20200609190333.59390-4-...@irrelevant.dk>
Signed-off-by: Kevin Wolf 
---
 hw/block/nvme.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index e8f5c5ab82..0d3f8f345f 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -801,7 +801,6 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
-break;
 default:
 trace_pci_nvme_err_invalid_getfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -845,11 +844,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 req->cqe.result =
 cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
 break;
-
 case NVME_TIMESTAMP:
 return nvme_set_feature_timestamp(n, cmd);
-break;
-
 default:
 trace_pci_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
-- 
2.25.4




[PULL 04/43] virtio-blk: On restart, process queued requests in the proper context

2020-06-17 Thread Kevin Wolf
From: Sergio Lopez 

On restart, we were scheduling a BH to process queued requests, which
would run before starting up the data plane, leading to those requests
being assigned and started on coroutines in the main context.

This could cause requests to be wrongly processed in parallel from
different threads (the main thread and the iothread managing the data
plane), potentially leading to multiple issues.

For example, stopping and resuming a VM multiple times while the guest
is generating I/O on a virtio_blk device can trigger a crash with a
stack trace looking like this one:

<-->
 Thread 2 (Thread 0x7ff736765700 (LWP 1062503)):
 #0  0x5567a13b99d6 in iov_memset
 (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, fillc=0, 
bytes=7018105756081554803)
 at util/iov.c:69
 #1  0x5567a13bab73 in qemu_iovec_memset
 (qiov=0x7ff73ec99748, offset=516096, fillc=0, bytes=7018105756081554803) 
at util/iov.c:530
 #2  0x5567a12f411c in qemu_laio_process_completion (laiocb=0x7ff6512ee6c0) 
at block/linux-aio.c:86
 #3  0x5567a12f42ff in qemu_laio_process_completions (s=0x7ff7182e8420) at 
block/linux-aio.c:217
 #4  0x5567a12f480d in ioq_submit (s=0x7ff7182e8420) at 
block/linux-aio.c:323
 #5  0x5567a12f43d9 in qemu_laio_process_completions_and_submit 
(s=0x7ff7182e8420)
 at block/linux-aio.c:236
 #6  0x5567a12f44c2 in qemu_laio_poll_cb (opaque=0x7ff7182e8430) at 
block/linux-aio.c:267
 #7  0x5567a13aed83 in run_poll_handlers_once (ctx=0x5567a2b58c70, 
timeout=0x7ff7367645f8)
 at util/aio-posix.c:520
 #8  0x5567a13aee9f in run_poll_handlers (ctx=0x5567a2b58c70, max_ns=16000, 
timeout=0x7ff7367645f8)
 at util/aio-posix.c:562
 #9  0x5567a13aefde in try_poll_mode (ctx=0x5567a2b58c70, 
timeout=0x7ff7367645f8)
 at util/aio-posix.c:597
 #10 0x5567a13af115 in aio_poll (ctx=0x5567a2b58c70, blocking=true) at 
util/aio-posix.c:639
 #11 0x5567a109acca in iothread_run (opaque=0x5567a2b29760) at iothread.c:75
 #12 0x5567a13b2790 in qemu_thread_start (args=0x5567a2b694c0) at 
util/qemu-thread-posix.c:519
 #13 0x7ff73eedf2de in start_thread () at /lib64/libpthread.so.0
 #14 0x7ff73ec10e83 in clone () at /lib64/libc.so.6

 Thread 1 (Thread 0x7ff743986f00 (LWP 1062500)):
 #0  0x5567a13b99d6 in iov_memset
 (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, fillc=0, 
bytes=7018105756081554803)
 at util/iov.c:69
 #1  0x5567a13bab73 in qemu_iovec_memset
 (qiov=0x7ff73ec99748, offset=516096, fillc=0, bytes=7018105756081554803) 
at util/iov.c:530
 #2  0x5567a12f411c in qemu_laio_process_completion (laiocb=0x7ff6512ee6c0) 
at block/linux-aio.c:86
 #3  0x5567a12f42ff in qemu_laio_process_completions (s=0x7ff7182e8420) at 
block/linux-aio.c:217
 #4  0x5567a12f480d in ioq_submit (s=0x7ff7182e8420) at 
block/linux-aio.c:323
 #5  0x5567a12f4a2f in laio_do_submit (fd=19, laiocb=0x7ff5f4ff9ae0, 
offset=472363008, type=2)
 at block/linux-aio.c:375
 #6  0x5567a12f4af2 in laio_co_submit
 (bs=0x5567a2b8c460, s=0x7ff7182e8420, fd=19, offset=472363008, 
qiov=0x7ff5f4ff9ca0, type=2)
 at block/linux-aio.c:394
 #7  0x5567a12f1803 in raw_co_prw
 (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, 
type=2)
 at block/file-posix.c:1892
 #8  0x5567a12f1941 in raw_co_pwritev
 (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, 
flags=0)
 at block/file-posix.c:1925
 #9  0x5567a12fe3e1 in bdrv_driver_pwritev
 (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, 
qiov_offset=0, flags=0)
 at block/io.c:1183
 #10 0x5567a1300340 in bdrv_aligned_pwritev
 (child=0x5567a2b5b070, req=0x7ff5f4ff9db0, offset=472363008, bytes=20480, 
align=512, qiov=0x7ff72c0425b8, qiov_offset=0, flags=0) at block/io.c:1980
 #11 0x5567a1300b29 in bdrv_co_pwritev_part
 (child=0x5567a2b5b070, offset=472363008, bytes=20480, qiov=0x7ff72c0425b8, 
qiov_offset=0, flags=0)
 at block/io.c:2137
 #12 0x5567a12baba1 in qcow2_co_pwritev_task
 (bs=0x5567a2b92740, file_cluster_offset=472317952, offset=487305216, 
bytes=20480, qiov=0x7ff72c0425b8, qiov_offset=0, l2meta=0x0) at 
block/qcow2.c:2444
 #13 0x5567a12bacdb in qcow2_co_pwritev_task_entry (task=0x5567a2b48540) at 
block/qcow2.c:2475
 #14 0x5567a13167d8 in aio_task_co (opaque=0x5567a2b48540) at 
block/aio_task.c:45
 #15 0x5567a13cf00c in coroutine_trampoline (i0=738245600, i1=32759) at 
util/coroutine-ucontext.c:115
 #16 0x7ff73eb622e0 in __start_context () at /lib64/libc.so.6
 #17 0x7ff6626f1350 in  ()
 #18 0x in  ()
<-->

This is also known to cause crashes with this message (assertion
failed):

 aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule'

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1812765
Signed-off-by: Sergio Lopez 
Message-Id: <20200603093240.40489-3-...@redhat.com>
Signed-off-by: Kevin Wolf 

[PULL 02/43] icount: make dma reads deterministic

2020-06-17 Thread Kevin Wolf
From: Pavel Dovgalyuk 

A Windows guest sometimes makes DMA requests with overlapping
target addresses. This leads to the following iov structure for
the block driver:

addr size1
addr size2
addr size3

This means that three adjacent disk blocks should be read into the same
memory buffer. Windows does not expect anything specific from these bytes
(should it be data from the first block, or the last one, or some mix),
but uses them somehow. This leads to non-deterministic guest execution,
because the block driver does not preserve any ordering of the reads.

This situation was discussed on the mailing list at least twice:
https://lists.gnu.org/archive/html/qemu-devel/2010-09/msg01996.html
https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg05185.html

This patch makes such disk reads deterministic in icount mode.
It splits the whole request into several parts. Parts may overlap,
but the SGs inside one part do not. Parts that are processed later
overwrite earlier ones where they overlap.

Examples for different SG part sequences:

1)
A1 1000
A2 1000
A1 1000
A3 1000
->
One request is split into two.
A1 1000
A2 1000
--
A1 1000
A3 1000

2)
A1 800
A2 1000
A1 1000
->
A1 800
A2 1000
--
A1 1000
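
A minimal sketch of the overlap test that decides where a new group must
start (the actual patch below uses QEMU's ranges_overlap() helper, which
implements essentially the same predicate):

/* [a, a+alen) and [b, b+blen) overlap iff each one starts
 * before the other one ends */
static bool buffers_overlap(uintptr_t a, size_t alen,
                            uintptr_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}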

Signed-off-by: Pavel Dovgalyuk 
Message-Id: <159117972206.12193.12939621311413561779.stgit@pasha-ThinkPad-X280>
Signed-off-by: Kevin Wolf 
---
 dma-helpers.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/dma-helpers.c b/dma-helpers.c
index e8a26e81e1..2a77b5a9cb 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -13,6 +13,8 @@
 #include "trace-root.h"
 #include "qemu/thread.h"
 #include "qemu/main-loop.h"
+#include "sysemu/cpus.h"
+#include "qemu/range.h"
 
 /* #define DEBUG_IOMMU */
 
@@ -142,6 +144,26 @@ static void dma_blk_cb(void *opaque, int ret)
 cur_addr = dbs->sg->sg[dbs->sg_cur_index].base + dbs->sg_cur_byte;
 cur_len = dbs->sg->sg[dbs->sg_cur_index].len - dbs->sg_cur_byte;
 mem = dma_memory_map(dbs->sg->as, cur_addr, &cur_len, dbs->dir);
+/*
+ * Make reads deterministic in icount mode. Windows sometimes issues
+ * disk read requests with overlapping SGs. It leads
+ * to non-determinism, because resulting buffer contents may be mixed
+ * from several sectors. This code splits all SGs into several
+ * groups. SGs in every group do not overlap.
+ */
+if (mem && use_icount && dbs->dir == DMA_DIRECTION_FROM_DEVICE) {
+int i;
+for (i = 0 ; i < dbs->iov.niov ; ++i) {
+if (ranges_overlap((intptr_t)dbs->iov.iov[i].iov_base,
+   dbs->iov.iov[i].iov_len, (intptr_t)mem,
+   cur_len)) {
+dma_memory_unmap(dbs->sg->as, mem, cur_len,
+ dbs->dir, cur_len);
+mem = NULL;
+break;
+}
+}
+}
 if (!mem)
 break;
 qemu_iovec_add(&dbs->iov, mem, cur_len);
-- 
2.25.4




[PULL 00/43] Block layer patches

2020-06-17 Thread Kevin Wolf
The following changes since commit 5c24bce3056ff209a1ecc50ff4b7e65b85ad8e74:

  Merge remote-tracking branch 
'remotes/stsquad/tags/pull-testing-and-plugin-160620-2' into staging 
(2020-06-16 14:57:15 +0100)

are available in the Git repository at:

  git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 3419ec713f04c323b030e0763459435335b25476:

  iotests: Add copyright line in qcow2.py (2020-06-17 16:21:21 +0200)


Block layer patches:

- enhance handling of size-related BlockConf properties
- nvme: small fixes, refactoring and cleanups
- virtio-blk: On restart, process queued requests in the proper context
- icount: make dma reads deterministic
- iotests: Some fixes for rarely run cases
- .gitignore: Ignore storage-daemon files
- Minor code cleanups


Eric Blake (3):
  block: Refactor subdirectory recursion during make
  qcow2: Tweak comments on qcow2_get_persistent_dirty_bitmap_size
  iotests: Add copyright line in qcow2.py

Klaus Jensen (21):
  hw/block/nvme: fix pci doorbell size calculation
  hw/block/nvme: rename trace events to pci_nvme
  hw/block/nvme: remove superfluous breaks
  hw/block/nvme: move device parameters to separate struct
  hw/block/nvme: use constants in identify
  hw/block/nvme: refactor nvme_addr_read
  hw/block/nvme: fix pin-based interrupt behavior
  hw/block/nvme: add max_ioqpairs device parameter
  hw/block/nvme: remove redundant cmbloc/cmbsz members
  hw/block/nvme: factor out property/constraint checks
  hw/block/nvme: factor out device state setup
  hw/block/nvme: factor out block backend setup
  hw/block/nvme: add namespace helpers
  hw/block/nvme: factor out namespace setup
  hw/block/nvme: factor out pci setup
  hw/block/nvme: factor out cmb setup
  hw/block/nvme: factor out pmr setup
  hw/block/nvme: do cmb/pmr init as part of pci init
  hw/block/nvme: factor out controller identify setup
  hw/block/nvme: add msix_qsize parameter
  hw/block/nvme: verify msix_init_exclusive_bar() return value

Max Reitz (5):
  iotests.py: Add skip_for_formats() decorator
  iotests/041: Skip test_small_target for qed
  iotests/292: data_file is unsupported
  iotests/229: data_file is unsupported
  iotests/{190,291}: compat=0.10 is unsupported

Pavel Dovgaluk (1):
  icount: make dma reads deterministic

Philippe Mathieu-Daudé (2):
  hw/ide: Make IDEDMAOps handlers take a const IDEDMA pointer
  hw/block/nvme: Verify msix_vector_use() returned value

Roman Bolshakov (1):
  .gitignore: Ignore storage-daemon files

Roman Kagan (8):
  virtio-blk: store opt_io_size with correct size
  block: consolidate blocksize properties consistency checks
  qdev-properties: blocksize: use same limits in code and description
  qdev-properties: add size32 property type
  qdev-properties: make blocksize accept size suffixes
  block: make BlockConf size props 32bit and accept size suffixes
  qdev-properties: add getter for size32 and blocksize
  block: lift blocksize property limit to 2 MiB

Sergio Lopez (2):
  virtio-blk: Refactor the code that processes queued requests
  virtio-blk: On restart, process queued requests in the proper context

 hw/block/nvme.h|  34 ++-
 include/block/nvme.h   |   8 +
 include/hw/block/block.h   |  14 +-
 include/hw/ide/internal.h  |  12 +-
 include/hw/qdev-properties.h   |   5 +-
 include/hw/virtio/virtio-blk.h |   1 +
 block/qcow2-bitmap.c   |   9 +-
 dma-helpers.c  |  22 ++
 hw/block/block.c   |  40 ++-
 hw/block/dataplane/virtio-blk.c|   8 +
 hw/block/fdc.c |   5 +-
 hw/block/nvme.c| 574 +
 hw/block/swim.c|   5 +-
 hw/block/virtio-blk.c  |  39 +--
 hw/block/xen-block.c   |   6 +-
 hw/core/qdev-properties.c  |  85 +-
 hw/ide/ahci.c  |  18 +-
 hw/ide/core.c  |   6 +-
 hw/ide/macio.c |   6 +-
 hw/ide/pci.c   |  12 +-
 hw/ide/qdev.c  |   5 +-
 hw/scsi/scsi-disk.c|  12 +-
 hw/usb/dev-storage.c   |   5 +-
 tests/qemu-iotests/iotests.py  |  16 ++
 tests/qemu-iotests/qcow2.py|   2 +
 tests/qemu-iotests/qcow2_format.py |   1 +
 .gitignore |  17 +-
 Makefile.objs  |   2 +-
 block/Makefile.objs|   1 +
 hw/block/trace-events  | 180 ++--
 tests/qemu-iotests/041 |   2 +
 tests/qemu-iotests/118 |   7 +-
 tests/qemu-iotests/172.out | 532 +-
 tests/qemu-iot

[PULL 01/43] hw/ide: Make IDEDMAOps handlers take a const IDEDMA pointer

2020-06-17 Thread Kevin Wolf
From: Philippe Mathieu-Daudé 

Handlers don't need to modify the IDEDMA structure.
Make it const.

Signed-off-by: Philippe Mathieu-Daudé 
Message-Id: <20200512194917.15807-1-phi...@redhat.com>
Acked-by: John Snow 
Signed-off-by: Kevin Wolf 
---
 include/hw/ide/internal.h | 12 ++--
 hw/ide/ahci.c | 18 +-
 hw/ide/core.c |  6 +++---
 hw/ide/macio.c|  6 +++---
 hw/ide/pci.c  | 12 ++--
 5 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/include/hw/ide/internal.h b/include/hw/ide/internal.h
index 55da35d768..1a7869e85d 100644
--- a/include/hw/ide/internal.h
+++ b/include/hw/ide/internal.h
@@ -322,12 +322,12 @@ typedef enum { IDE_HD, IDE_CD, IDE_CFATA } IDEDriveKind;
 
 typedef void EndTransferFunc(IDEState *);
 
-typedef void DMAStartFunc(IDEDMA *, IDEState *, BlockCompletionFunc *);
-typedef void DMAVoidFunc(IDEDMA *);
-typedef int DMAIntFunc(IDEDMA *, bool);
-typedef int32_t DMAInt32Func(IDEDMA *, int32_t len);
-typedef void DMAu32Func(IDEDMA *, uint32_t);
-typedef void DMAStopFunc(IDEDMA *, bool);
+typedef void DMAStartFunc(const IDEDMA *, IDEState *, BlockCompletionFunc *);
+typedef void DMAVoidFunc(const IDEDMA *);
+typedef int DMAIntFunc(const IDEDMA *, bool);
+typedef int32_t DMAInt32Func(const IDEDMA *, int32_t len);
+typedef void DMAu32Func(const IDEDMA *, uint32_t);
+typedef void DMAStopFunc(const IDEDMA *, bool);
 
 struct unreported_events {
 bool eject_request;
diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index fc82cbd5f1..009120f88b 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -44,7 +44,7 @@ static int handle_cmd(AHCIState *s, int port, uint8_t slot);
 static void ahci_reset_port(AHCIState *s, int port);
 static bool ahci_write_fis_d2h(AHCIDevice *ad);
 static void ahci_init_d2h(AHCIDevice *ad);
-static int ahci_dma_prepare_buf(IDEDMA *dma, int32_t limit);
+static int ahci_dma_prepare_buf(const IDEDMA *dma, int32_t limit);
 static bool ahci_map_clb_address(AHCIDevice *ad);
 static bool ahci_map_fis_address(AHCIDevice *ad);
 static void ahci_unmap_clb_address(AHCIDevice *ad);
@@ -1338,7 +1338,7 @@ out:
 }
 
 /* Transfer PIO data between RAM and device */
-static void ahci_pio_transfer(IDEDMA *dma)
+static void ahci_pio_transfer(const IDEDMA *dma)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
 IDEState *s = &ad->port.ifs[0];
@@ -1397,7 +1397,7 @@ out:
 }
 }
 
-static void ahci_start_dma(IDEDMA *dma, IDEState *s,
+static void ahci_start_dma(const IDEDMA *dma, IDEState *s,
BlockCompletionFunc *dma_cb)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
@@ -1406,7 +1406,7 @@ static void ahci_start_dma(IDEDMA *dma, IDEState *s,
 dma_cb(s, 0);
 }
 
-static void ahci_restart_dma(IDEDMA *dma)
+static void ahci_restart_dma(const IDEDMA *dma)
 {
 /* Nothing to do, ahci_start_dma already resets s->io_buffer_offset.  */
 }
@@ -1415,7 +1415,7 @@ static void ahci_restart_dma(IDEDMA *dma)
  * IDE/PIO restarts are handled by the core layer, but NCQ commands
  * need an extra kick from the AHCI HBA.
  */
-static void ahci_restart(IDEDMA *dma)
+static void ahci_restart(const IDEDMA *dma)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
 int i;
@@ -1432,7 +1432,7 @@ static void ahci_restart(IDEDMA *dma)
  * Called in DMA and PIO R/W chains to read the PRDT.
  * Not shared with NCQ pathways.
  */
-static int32_t ahci_dma_prepare_buf(IDEDMA *dma, int32_t limit)
+static int32_t ahci_dma_prepare_buf(const IDEDMA *dma, int32_t limit)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
 IDEState *s = &ad->port.ifs[0];
@@ -1453,7 +1453,7 @@ static int32_t ahci_dma_prepare_buf(IDEDMA *dma, int32_t 
limit)
  * Called via dma_buf_commit, for both DMA and PIO paths.
  * sglist destruction is handled within dma_buf_commit.
  */
-static void ahci_commit_buf(IDEDMA *dma, uint32_t tx_bytes)
+static void ahci_commit_buf(const IDEDMA *dma, uint32_t tx_bytes)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
 
@@ -1461,7 +1461,7 @@ static void ahci_commit_buf(IDEDMA *dma, uint32_t 
tx_bytes)
 ad->cur_cmd->status = cpu_to_le32(tx_bytes);
 }
 
-static int ahci_dma_rw_buf(IDEDMA *dma, bool is_write)
+static int ahci_dma_rw_buf(const IDEDMA *dma, bool is_write)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
 IDEState *s = &ad->port.ifs[0];
@@ -1486,7 +1486,7 @@ static int ahci_dma_rw_buf(IDEDMA *dma, bool is_write)
 return 1;
 }
 
-static void ahci_cmd_done(IDEDMA *dma)
+static void ahci_cmd_done(const IDEDMA *dma)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
 
diff --git a/hw/ide/core.c b/hw/ide/core.c
index 689bb36409..d997a78e47 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -2570,16 +2570,16 @@ static void ide_init1(IDEBus *bus, int unit)
ide_sector_write_timer_cb, s);
 }
 
-static int ide_nop_int(IDEDMA *dma, bool is_write)
+static int ide_nop_int(const IDEDMA *d

[PULL 06/43] qcow2: Tweak comments on qcow2_get_persistent_dirty_bitmap_size

2020-06-17 Thread Kevin Wolf
From: Eric Blake 

For now, we don't have persistent bitmaps in any other formats, but
that might not be true in the future.  Make it obvious that our
incoming parameter is not necessarily a qcow2 image, and therefore is
limited to just the bdrv_dirty_bitmap_* API calls (rather than probing
into qcow2 internals).

Suggested-by: Kevin Wolf 
Signed-off-by: Eric Blake 
Message-Id: <20200608190821.3293867-1-ebl...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block/qcow2-bitmap.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index 7bf12502da..1f38806ca6 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -1757,19 +1757,20 @@ bool 
qcow2_supports_persistent_dirty_bitmap(BlockDriverState *bs)
 }
 
 /*
- * Compute the space required for bitmaps in @bs.
+ * Compute the space required to copy bitmaps from @in_bs.
  *
  * The computation is based as if copying to a new image with the
- * given @cluster_size, which may differ from the cluster size in @bs.
+ * given @cluster_size, which may differ from the cluster size in
+ * @in_bs; in fact, @in_bs might be something other than qcow2.
  */
-uint64_t qcow2_get_persistent_dirty_bitmap_size(BlockDriverState *bs,
+uint64_t qcow2_get_persistent_dirty_bitmap_size(BlockDriverState *in_bs,
 uint32_t cluster_size)
 {
 uint64_t bitmaps_size = 0;
 BdrvDirtyBitmap *bm;
 size_t bitmap_dir_size = 0;
 
-FOR_EACH_DIRTY_BITMAP(bs, bm) {
+FOR_EACH_DIRTY_BITMAP(in_bs, bm) {
 if (bdrv_dirty_bitmap_get_persistence(bm)) {
 const char *name = bdrv_dirty_bitmap_name(bm);
 uint32_t granularity = bdrv_dirty_bitmap_granularity(bm);
-- 
2.25.4




Re: [PATCH v4 0/4] Introduce 'yank' oob qmp command to recover from hanging qemu

2020-06-17 Thread Stefan Hajnoczi
On Sat, Jun 06, 2020 at 09:30:38PM +0200, Lukas Straub wrote:
> On Mon, 25 May 2020 17:44:12 +0200
> Lukas Straub  wrote:
> 
> > Hello Everyone,
> > In many cases, if qemu has a network connection (qmp, migration, chardev, 
> > etc.)
> > to some other server and that server dies or hangs, qemu hangs too.
> > These patches introduce the new 'yank' out-of-band qmp command to recover 
> > from
> > these kinds of hangs. The different subsystems register callbacks which get
> > executed with the yank command. For example the callback can shutdown() a
> > socket. This is intended for the colo use-case, but it can be used for other
> > things too of course.
> > 
> > Regards,
> > Lukas Straub
> 
> Hello Everyone,
> Can this be reviewed, it would be cool to have this in qemu 5.1.

Please see my reply to a previous version. Code that executes in the oob
environment needs to take special precautions, this needs to be
documented so that yank API users know what the limitations are.

Stefan




Re: [PATCH v2 1/4] Introduce yank feature

2020-06-17 Thread Stefan Hajnoczi
On Thu, May 21, 2020 at 04:48:06PM +0100, Daniel P. Berrangé wrote:
> On Thu, May 21, 2020 at 05:42:41PM +0200, Lukas Straub wrote:
> > On Thu, 21 May 2020 16:03:35 +0100
> > Stefan Hajnoczi  wrote:
> > 
> > > On Wed, May 20, 2020 at 11:05:39PM +0200, Lukas Straub wrote:
> > > > +void yank_generic_iochannel(void *opaque)
> > > > +{
> > > > +QIOChannel *ioc = QIO_CHANNEL(opaque);
> > > > +
> > > > +qio_channel_shutdown(ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> > > > +}
> > > > +
> > > > +void qmp_yank(strList *instances, Error **errp)
> > > > +{
> > > > +strList *tmp;
> > > > +struct YankInstance *instance;
> > > > +struct YankFuncAndParam *entry;
> > > > +
> > > > +qemu_mutex_lock(&lock);
> > > > +tmp = instances;
> > > > +for (; tmp; tmp = tmp->next) {
> > > > +instance = yank_find_instance(tmp->value);
> > > > +if (!instance) {
> > > > +error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
> > > > +  "Instance '%s' not found", tmp->value);
> > > > +qemu_mutex_unlock(&lock);
> > > > +return;
> > > > +}
> > > > +}
> > > > +tmp = instances;
> > > > +for (; tmp; tmp = tmp->next) {
> > > > +instance = yank_find_instance(tmp->value);
> > > > +assert(instance);
> > > > +QLIST_FOREACH(entry, &instance->yankfns, next) {
> > > > +entry->func(entry->opaque);
> > > > +}
> > > > +}
> > > > +qemu_mutex_unlock(&lock);
> > > > +}  
> > > 
> > > From docs/devel/qapi-code-gen.txt:
> > > 
> > >   An OOB-capable command handler must satisfy the following conditions:
> > > 
> > >   - It terminates quickly.
> > Check.
> > 
> > >   - It does not invoke system calls that may block.
> > brk/sbrk (malloc and friends):
> > The manpage doesn't say anything about blocking, but malloc is already used 
> > while handling the qmp command.
> > 
> > shutdown():
> > The manpage doesn't say anything about blocking, but this is already used 
> > in migration oob qmp commands.
> 
> It just marks the socket state in local kernel side. It doesn't involve
> any blocking roundtrips over the wire, so this is fine.
> 
> > 
> > There are no other syscalls involved to my knowledge.
> > 
> > >   - It does not access guest RAM that may block when userfaultfd is
> > > enabled for postcopy live migration.
> > Check.
> > 
> > >   - It takes only "fast" locks, i.e. all critical sections protected by
> > > any lock it takes also satisfy the conditions for OOB command
> > > handler code.
> > 
> > The lock in yank.c satisfies this requirement.
> > 
> > qio_channel_shutdown doesn't take any locks.
> 
> Agreed, I think the yank code is compliant with all the points
> listed above.

The code does not document this in any way. In fact, it's an interface
for registering arbitrary callback functions which execute in this
special environment.

If you don't document this explicitly it will break when someone changes
the code, even if it works right now.

Please document this in the yank APIs and also in the implementation.

For example, the yank QemuMutex has a priority inversion problem: no
other function may violate the OOB rules while holding the mutex,
otherwise the OOB function will hang waiting for the lock while the
other function is blocked.

Stefan




Re: [PATCH v2] qcow2: Fix preallocation on images with unaligned sizes

2020-06-17 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20200617140036.20311-1-be...@igalia.com/



Hi,

This series failed the asan build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
=== TEST SCRIPT END ===

  GEN docs/interop/qemu-qmp-ref.txt
  GEN docs/interop/qemu-qmp-ref.7
  CC  qga/commands.o
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  CC  qga/guest-agent-command-state.o
  CC  qga/main.o
  CC  qga/commands-posix.o
---
  GEN docs/interop/qemu-ga-ref.html
  GEN docs/interop/qemu-ga-ref.txt
  GEN docs/interop/qemu-ga-ref.7
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  AS  pc-bios/optionrom/multiboot.o
  AS  pc-bios/optionrom/linuxboot.o
  CC  pc-bios/optionrom/linuxboot_dma.o
---
  SIGNpc-bios/optionrom/kvmvapic.bin
  SIGNpc-bios/optionrom/multiboot.bin
  SIGNpc-bios/optionrom/linuxboot.bin
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKivshmem-server
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-nbd
  BUILD   pc-bios/optionrom/linuxboot_dma.img
  BUILD   pc-bios/optionrom/pvh.img
---
  BUILD   pc-bios/optionrom/pvh.raw
  SIGNpc-bios/optionrom/linuxboot_dma.bin
  SIGNpc-bios/optionrom/pvh.bin
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-storage-daemon
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-img
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-io
  LINKqemu-edid
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKfsdev/virtfs-proxy-helper
  LINKscsi/qemu-pr-helper
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKqemu-bridge-helper
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKvirtiofsd
/usr/bin/ld: 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors_vfork.S.o):
 warning: common of `__interception::real_vfork' overridden by definition from 
/usr/lib64/clang/10.0.0/lib/linux/libclang_rt.asan-x86_64.a(asan_interceptors.cpp.o)
  LINKvhost-user-input
/us

Re: [PATCH 1/2] iotests: Make _filter_img_create more active

2020-06-17 Thread Max Reitz
On 17.06.20 16:02, Maxim Levitsky wrote:
> On Wed, 2020-06-17 at 15:50 +0200, Max Reitz wrote:
>> On 17.06.20 13:46, Maxim Levitsky wrote:
>>> On Tue, 2020-06-16 at 15:17 +0200, Max Reitz wrote:
 Right now, _filter_img_create just filters out everything that looks
 format-dependent, and applies some filename filters.  That means that we
 have to add another filter line every time some format gets a new
 creation option.  This can be avoided by instead discarding everything
 and just keeping what we know is format-independent (format, size,
 backing file, encryption information[1], preallocation) or just
 interesting to have in the reference output (external data file path).

 Furthermore, we probably want to sort these options.  Format drivers are
 not required to define them in any specific order, so the output is
 effectively random (although this has never bothered us until now).  We
 need a specific order for our reference outputs, though.  Unfortunately,
 just using a plain "sort" would change a lot of existing reference
 outputs, so we have to pre-filter the option keys to keep our existing
 order (fmt, size, backing*, data, encryption info, preallocation).

 [1] Actually, the only thing that is really important is whether
 encryption is enabled or not.  A patch by Maxim thus removes all
 other "encrypt.*" options from the output:
 https://lists.nongnu.org/archive/html/qemu-block/2020-06/msg00339.html
 But that patch needs to come later so we can get away with changing
 as few reference outputs in this patch here as possible.

 Signed-off-by: Max Reitz 
 ---
  tests/qemu-iotests/112.out   |   2 +-
  tests/qemu-iotests/153   |   9 ++-
  tests/qemu-iotests/common.filter | 100 +++
  3 files changed, 81 insertions(+), 30 deletions(-)

 diff --git a/tests/qemu-iotests/112.out b/tests/qemu-iotests/112.out
 index ae0318cabe..182655dbf6 100644
 --- a/tests/qemu-iotests/112.out
 +++ b/tests/qemu-iotests/112.out
 @@ -5,7 +5,7 @@ QA output created by 112
  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and 
 may not exceed 64 bits
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and 
 may not exceed 64 bits
 -Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 refcount_bits=-1
 +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and 
 may not exceed 64 bits
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and 
 may not exceed 64 bits
 diff --git a/tests/qemu-iotests/153 b/tests/qemu-iotests/153
 index cf961d3609..11e3d28841 100755
 --- a/tests/qemu-iotests/153
 +++ b/tests/qemu-iotests/153
 @@ -167,11 +167,10 @@ done
  
  echo
  echo "== Creating ${TEST_IMG}.[abc] ==" | _filter_testdir
 -(
 -$QEMU_IMG create -f qcow2 "${TEST_IMG}.a" -b "${TEST_IMG}"
 -$QEMU_IMG create -f qcow2 "${TEST_IMG}.b" -b "${TEST_IMG}"
 -$QEMU_IMG create -f qcow2 "${TEST_IMG}.c" -b "${TEST_IMG}.b"
 -) | _filter_img_create
 +$QEMU_IMG create -f qcow2 "${TEST_IMG}.a" -b "${TEST_IMG}" | 
 _filter_img_create
 +$QEMU_IMG create -f qcow2 "${TEST_IMG}.b" -b "${TEST_IMG}" | 
 _filter_img_create
 +$QEMU_IMG create -f qcow2 "${TEST_IMG}.c" -b "${TEST_IMG}.b" \
 +| _filter_img_create
  
  echo
  echo "== Two devices sharing the same file in backing chain =="
>>>
>>> I guess this is done because now the filter expects only a single
>>> qemu-img output.
>>
>> Yes, that’s right.
>>
>>> IMHO this is better anyway.
>>>
 diff --git a/tests/qemu-iotests/common.filter 
 b/tests/qemu-iotests/common.filter
 index 03e4f71808..f104ad7a9b 100644
 --- a/tests/qemu-iotests/common.filter
 +++ b/tests/qemu-iotests/common.filter
 @@ -122,38 +122,90 @@ _filter_actual_image_size()
  # replace driver-specific options in the "Formatting..." line
  _filter_img_create()
  {
 -data_file_filter=()
 -if data_file=$(_get_data_file "$TEST_IMG"); then
 -data_file_filter=(-e "s# data_file=$data_file##")
 +# Keep QMP output unchanged
 +qmp_pre=''
 +qmp_post=''
 +to_filter=''
 +
 +while read -r line; do
 +if echo "$line" | grep -q '^{.*}$'; then
 +if [ -z "$to_filter" ]; then
 +# Use $'\n' so the newline is not dropped on variable
 +# expansion
 +qmp_pre="$qmp_pre$line"$'\n'
 +else
 +qmp_post="$qmp_post$line"$'\n'
 +fi
 +else

Re: [PATCH 0/5] iotests: Some fixes for rarely run cases

2020-06-17 Thread Max Reitz
On 17.06.20 16:19, Kevin Wolf wrote:
> Am 17.06.2020 um 16:11 hat Max Reitz geschrieben:
>> On 17.06.20 14:52, Kevin Wolf wrote:
>>> Am 17.06.2020 um 12:48 hat Max Reitz geschrieben:
 Hi,

 Thomas’s report
 (https://lists.nongnu.org/archive/html/qemu-block/2020-06/msg00791.html)
 has given me a nice excuse to write this series.

 There are some iotests that have recently started to fail in rarely
 exercised test environments (qed, qcow2 with data_file, qcow2 v2), and
 this series fixes what I found.
>>>
>>> Thanks, applied to the block branch.
>>
>> Sorry, I didn’t run iotest 297 before sending this series...
>>
>> The problems arise in patch 1:
>>
>> iotests.py:1113:0: C0301: Line too long (80/79) (line-too-long)
>> iotests.py:1106: error: Function is missing a return type annotation
>>
>> (So there’s a line with 80 characters, when 79 is the maximum (*shrug*),
>> and I failed to specify skip_for_format’s return type.)
>>
>> I think patch 1 needs the attached diff squashed in.  Are you willing to
>> do that or should I just send a v2?
> 
> I'm squashing it in. In fact, I already had fixed it, but I was too lazy
> to be more specific than Callable[..., Any], so I'll replace that with
> your version.

I have to admit I was intrigued to see how the actual signature would
turn out, so I followed it through until the bitter end. :)

Thanks!



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 0/5] iotests: Some fixes for rarely run cases

2020-06-17 Thread Kevin Wolf
Am 17.06.2020 um 16:11 hat Max Reitz geschrieben:
> On 17.06.20 14:52, Kevin Wolf wrote:
> > Am 17.06.2020 um 12:48 hat Max Reitz geschrieben:
> >> Hi,
> >>
> >> Thomas’s report
> >> (https://lists.nongnu.org/archive/html/qemu-block/2020-06/msg00791.html)
> >> has given me a nice excuse to write this series.
> >>
> >> There are some iotests that have recently started to fail in rarely
> >> exercised test environments (qed, qcow2 with data_file, qcow2 v2), and
> >> this series fixes what I found.
> > 
> > Thanks, applied to the block branch.
> 
> Sorry, I didn’t run iotest 297 before sending this series...
> 
> The problems arise in patch 1:
> 
> iotests.py:1113:0: C0301: Line too long (80/79) (line-too-long)
> iotests.py:1106: error: Function is missing a return type annotation
> 
> (So there’s a line with 80 characters, when 79 is the maximum (*shrug*),
> and I failed to specify skip_for_format’s return type.)
> 
> I think patch 1 needs the attached diff squashed in.  Are you willing to
> do that or should I just send a v2?

I'm squashing it in. In fact, I already had fixed it, but I was too lazy
to be more specific than Callable[..., Any], so I'll replace that with
your version.

Kevin


signature.asc
Description: PGP signature


Re: [PATCH v2] rbd: Use RBD fast-diff for querying actual allocation

2020-06-17 Thread Yi Li
ping?

On 6/11/20, Yi Li  wrote:
> Since Ceph version Infernalis (9.2.0) the new fast-diff mechanism
> of RBD allows for querying actual rbd image usage.
>
> Prior to this version there was no easy and fast way to query how
> much allocation an RBD image had inside a Ceph cluster.
>
> To use the fast-diff feature it needs to be enabled per RBD image
> and is only supported by Ceph clusters running version Infernalis
> (9.2.0) or newer.
>
> If the fast-diff feature is disabled or the fast-diff map is marked as
> invalid, qemu-img will report an allocation identical to the image capacity.
>
> 'qemu-img info rbd:cepharm/liyi-rbd' might, for example, output:
>
>   image: json:{"driver": "raw", "file": {"pool": "cepharm",
>   "image": "liyi-rbd", "driver": "rbd"}}
>   file format: raw
>   virtual size: 20 GiB (21474836480 bytes)
>   disk size: 0 B
>   cluster_size: 4194304
>
> Newly created rbds will have the fast-diff feature enabled.
>
> Signed-off-by: Yi Li 
> ---
>  block/rbd.c | 103 
>  1 file changed, 103 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 617553b022..c1e68ff7e9 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1107,6 +1107,108 @@ static int64_t qemu_rbd_getlength(BlockDriverState
> *bs)
>  return info.size;
>  }
>
> +#if LIBRBD_VERSION_CODE > 265
> +static int disk_usage_callback(uint64_t offset, size_t len, int exists,
> +   void *arg)
> +{
> +    uint64_t *used_size = (uint64_t *)(arg);
> +    if (exists) {
> +        (*used_size) += len;
> +    }
> +    return 0;
> +}
> +
> +static int qemu_rbd_getflags(rbd_image_t image, uint64_t *flags)
> +{
> +int r;
> +
> +r = rbd_get_flags(image, flags);
> +if (r < 0) {
> +return r;
> +}
> +return 0;
> +}
> +
> +static bool qemu_rbd_use_fastdiff(uint64_t features, uint64_t flags)
> +{
> +return (((features & RBD_FEATURE_FAST_DIFF) != 0ULL) &&
> +((flags & RBD_FLAG_FAST_DIFF_INVALID) == 0ULL));
> +}
> +
> +static int qemu_rbd_set_allocation(rbd_image_t image,
> +   rbd_image_info_t *info,
> +   uint64_t *used_size)
> +{
> +int r;
> +/*
> + * RBD image fast-diff feature enabled
> + * Querying for actual allocation.
> + */
> +r = rbd_diff_iterate2(image, NULL, 0, info->size, 0, 1,
> +  &disk_usage_callback,
> +  used_size);
> +if (r < 0) {
> +return r;
> +}
> +return 0;
> +}
> +
> +#else
> +static int qemu_rbd_getflags(rbd_image_t image G_GNUC_UNUSED, uint64_t
> *flags)
> +{
> +*flags = 0;
> +return 0;
> +}
> +
> +static bool qemu_rbd_use_fastdiff(uint64_t features G_GNUC_UNUSED,
> +  uint64_t feature_flags G_GNUC_UNUSED)
> +{
> +return false;
> +}
> +
> +static int qemu_rbd_set_allocation(rbd_image_t image G_GNUC_UNUSED,
> +   rbd_image_info_t *info G_GNUC_UNUSED,
> +   uint64_t *used_size G_GNUC_UNUSED)
> +{
> +return 0;
> +}
> +#endif
> +
> +static int64_t qemu_rbd_allocated_file_size(BlockDriverState *bs)
> +{
> +BDRVRBDState *s = bs->opaque;
> +rbd_image_info_t info;
> +int r;
> +uint64_t used_size = 0;
> +uint64_t features = 0;
> +uint64_t flags = 0;
> +
> +r = rbd_stat(s->image, &info, sizeof(info));
> +if (r < 0) {
> +return r;
> +}
> +
> +r = rbd_get_features(s->image, &features);
> +if (r < 0) {
> +return r;
> +}
> +
> +r = qemu_rbd_getflags(s->image, &flags);
> +if (r < 0) {
> +return r;
> +}
> +
> +if (qemu_rbd_use_fastdiff(features, flags)) {
> +r = qemu_rbd_set_allocation(s->image, &info, &used_size);
> +if (r < 0) {
> +return r;
> +}
> +} else {
> +used_size = info.size;
> +}
> +return used_size;
> +}
> +
>  static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
>   int64_t offset,
>   bool exact,
> @@ -1316,6 +1418,7 @@ static BlockDriver bdrv_rbd = {
>  .bdrv_get_info  = qemu_rbd_getinfo,
>  .create_opts= &qemu_rbd_create_opts,
>  .bdrv_getlength = qemu_rbd_getlength,
> +.bdrv_get_allocated_file_size = qemu_rbd_allocated_file_size,
>  .bdrv_co_truncate   = qemu_rbd_co_truncate,
>  .protocol_name  = "rbd",
>
> --
> 2.25.3
>
>
>
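
As an aside for readers who want to reproduce the fast-diff query outside
of QEMU: below is a minimal standalone sketch against the public
librbd/librados C API, mirroring the approach of the patch above. The pool
and image names are placeholders, all error handling is omitted, and it
assumes headers new enough to provide rbd_diff_iterate2() (build with
-lrados -lrbd):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <rados/librados.h>
    #include <rbd/librbd.h>

    /* Accumulate the length of every allocated extent. */
    static int count_extent(uint64_t offs, size_t len, int exists, void *arg)
    {
        (void)offs;
        if (exists) {
            *(uint64_t *)arg += len;
        }
        return 0;
    }

    int main(void)
    {
        rados_t cluster;
        rados_ioctx_t io;
        rbd_image_t image;
        rbd_image_info_t info;
        uint64_t used = 0;

        rados_create(&cluster, NULL);
        rados_conf_read_file(cluster, NULL);   /* default ceph.conf */
        rados_connect(cluster);
        rados_ioctx_create(cluster, "rbd", &io);            /* pool name */
        rbd_open_read_only(io, "demo-image", &image, NULL); /* image name */

        rbd_stat(image, &info, sizeof(info));
        /* Walk the whole image; same arguments as the patch's call. */
        rbd_diff_iterate2(image, NULL, 0, info.size, 0, 1,
                          count_extent, &used);
        printf("allocated: %" PRIu64 " of %" PRIu64 " bytes\n",
               used, info.size);

        rbd_close(image);
        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        return 0;
    }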
>



Re: [PATCH 0/5] iotests: Some fixes for rarely run cases

2020-06-17 Thread Max Reitz
On 17.06.20 14:52, Kevin Wolf wrote:
> Am 17.06.2020 um 12:48 hat Max Reitz geschrieben:
>> Hi,
>>
>> Thomas’s report
>> (https://lists.nongnu.org/archive/html/qemu-block/2020-06/msg00791.html)
>> has given me a nice excuse to write this series.
>>
>> There are some iotests that have recently started to fail in rarely
>> exercised test environments (qed, qcow2 with data_file, qcow2 v2), and
>> this series fixes what I found.
> 
> Thanks, applied to the block branch.

Sorry, I didn’t run iotest 297 before sending this series...

The problems arise in patch 1:

iotests.py:1113:0: C0301: Line too long (80/79) (line-too-long)
iotests.py:1106: error: Function is missing a return type annotation

(So there’s a line with 80 characters, when 79 is the maximum (*shrug*),
and I failed to specify skip_for_format’s return type.)

I think patch 1 needs the attached diff squashed in.  Are you willing to
do that or should I just send a v2?

Max
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 92c08b9dc6..5ea4c4df8b 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -1103,14 +1103,17 @@ def skip_if_unsupported(required_formats=(), read_only=False):
 return func_wrapper
 return skip_test_decorator
 
-def skip_for_formats(formats: Sequence[str] = ()):
+def skip_for_formats(formats: Sequence[str] = ()) \
+-> Callable[[Callable[[QMPTestCase, List[Any], Dict[str, Any]], None]],
+Callable[[QMPTestCase, List[Any], Dict[str, Any]], None]]:
 '''Skip Test Decorator
Skips the test for the given formats'''
 def skip_test_decorator(func):
 def func_wrapper(test_case: QMPTestCase, *args: List[Any],
  **kwargs: Dict[str, Any]) -> None:
 if imgfmt in formats:
-test_case.case_skip(f'{test_case}: Skipped for format {imgfmt}')
+msg = f'{test_case}: Skipped for format {imgfmt}'
+test_case.case_skip(msg)
 else:
 func(test_case, *args, **kwargs)
 return func_wrapper


signature.asc
Description: OpenPGP digital signature


[PATCH v2] qcow2: Fix preallocation on images with unaligned sizes

2020-06-17 Thread Alberto Garcia
When resizing an image with qcow2_co_truncate() using the falloc or
full preallocation modes, the code assumes that both the old and new
sizes are cluster-aligned.

There are two problems with this:

  1) The calculation of how many clusters are involved does not always
 get the right result.

 Example: creating a 60KB image and resizing it (with
 preallocation=full) to 80KB won't allocate the second cluster.

  2) No copy-on-write is performed, so in the previous example if
 there is a backing file then the first 60KB of the first cluster
 won't be filled with data from the backing file.

This patch fixes both issues.
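
To make the first problem concrete, here is a quick standalone check of
both calculations for the 60KB -> 80KB example with 64KB clusters. The
helper macros are re-implemented locally purely for illustration (and
assume power-of-two cluster sizes, as qcow2 guarantees); in QEMU they
come from the block layer headers:

    /* Standalone check of the cluster-count arithmetic for the example
     * above: a 60 KiB image resized to 80 KiB with 64 KiB clusters. */
    #include <stdio.h>
    #include <stdint.h>

    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))
    #define ROUND_UP(n, d)      (DIV_ROUND_UP(n, d) * (d))
    #define START_OF_CLUSTER(off, cs)  ((off) & ~((uint64_t)(cs) - 1))

    int main(void)
    {
        uint64_t cluster_size = 64 * 1024;
        uint64_t old_length = 60 * 1024;   /* unaligned old size */
        uint64_t offset = 80 * 1024;       /* new size */

        /* Old formula: only looks at the byte delta, so it misses the
         * partially filled last cluster. */
        uint64_t old_calc = DIV_ROUND_UP(offset - old_length, cluster_size);

        /* New formula: spans from the start of the cluster containing
         * old_length up to the cluster-aligned end of the new size. */
        uint64_t new_calc = (ROUND_UP(offset, cluster_size) -
                             START_OF_CLUSTER(old_length, cluster_size))
                            / cluster_size;

        /* Prints "old: 1 cluster(s), new: 2 cluster(s)" */
        printf("old: %llu cluster(s), new: %llu cluster(s)\n",
               (unsigned long long)old_calc, (unsigned long long)new_calc);
        return 0;
    }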

Signed-off-by: Alberto Garcia 
---
v2: iotests: don't check the image size if data_file is set [Max]

 block/qcow2.c  | 17 ++---
 tests/qemu-iotests/125 | 24 
 tests/qemu-iotests/125.out |  9 +
 3 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 0cd2e6757e..e20590c3b7 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4239,8 +4239,8 @@ static int coroutine_fn 
qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
 old_file_size = ROUND_UP(old_file_size, s->cluster_size);
 }
 
-nb_new_data_clusters = DIV_ROUND_UP(offset - old_length,
-s->cluster_size);
+nb_new_data_clusters = (ROUND_UP(offset, s->cluster_size) -
+start_of_cluster(s, old_length)) >> s->cluster_bits;
 
 /* This is an overestimation; we will not actually allocate space for
  * these in the file but just make sure the new refcount structures are
@@ -4317,10 +4317,21 @@ static int coroutine_fn 
qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
 int64_t nb_clusters = MIN(
 nb_new_data_clusters,
 s->l2_slice_size - offset_to_l2_slice_index(s, guest_offset));
-QCowL2Meta allocation = {
+unsigned cow_start_length = offset_into_cluster(s, guest_offset);
+QCowL2Meta allocation;
+guest_offset = start_of_cluster(s, guest_offset);
+allocation = (QCowL2Meta) {
 .offset   = guest_offset,
 .alloc_offset = host_offset,
 .nb_clusters  = nb_clusters,
+.cow_start= {
+.offset   = 0,
+.nb_bytes = cow_start_length,
+},
+.cow_end  = {
+.offset   = nb_clusters << s->cluster_bits,
+.nb_bytes = 0,
+},
 };
 qemu_co_queue_init(&allocation.dependent_requests);
 
diff --git a/tests/qemu-iotests/125 b/tests/qemu-iotests/125
index d510984045..7cb1c19730 100755
--- a/tests/qemu-iotests/125
+++ b/tests/qemu-iotests/125
@@ -164,6 +164,30 @@ for GROWTH_SIZE in 16 48 80; do
 done
 done
 
+# Test image resizing using preallocation and unaligned offsets
+$QEMU_IMG create -f raw "$TEST_IMG.base" 128k | _filter_img_create
+$QEMU_IO -c 'write -q -P 1 0 128k' -f raw "$TEST_IMG.base"
+for orig_size in 31k 33k; do
+echo "--- Resizing image from $orig_size to 96k ---"
+_make_test_img -F raw -b "$TEST_IMG.base" -o cluster_size=64k "$orig_size"
+$QEMU_IMG resize -f "$IMGFMT" --preallocation=full "$TEST_IMG" 96k
+# The first part of the image should contain data from the backing file
+$QEMU_IO -c "read -q -P 1 0 ${orig_size}" "$TEST_IMG"
+# The resized part of the image should contain zeroes
+$QEMU_IO -c "read -q -P 0 ${orig_size} 63k" "$TEST_IMG"
+# If the image does not have an external data file we can also verify its
+# actual size. The resized image should have 7 clusters:
+# header, L1 table, L2 table, refcount table, refcount block, 2 data 
clusters
+if ! _get_data_file "$TEST_IMG" > /dev/null; then
+expected_file_length=$((65536 * 7))
+file_length=$(stat -c '%s' "$TEST_IMG_FILE")
+if [ "$file_length" != "$expected_file_length" ]; then
+echo "ERROR: file length $file_length (expected 
$expected_file_length)"
+fi
+fi
+echo
+done
+
 # success, all done
 echo '*** done'
 rm -f $seq.full
diff --git a/tests/qemu-iotests/125.out b/tests/qemu-iotests/125.out
index 596905f533..7f76f7af20 100644
--- a/tests/qemu-iotests/125.out
+++ b/tests/qemu-iotests/125.out
@@ -767,4 +767,13 @@ wrote 2048000/2048000 bytes at offset 0
 wrote 81920/81920 bytes at offset 2048000
 80 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=raw size=131072
+--- Resizing image from 31k to 96k ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=31744 
backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=raw
+Image resized.
+
+--- Resizing image from 33k to 96k ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=33792 
backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=raw
+Image resized.
+

Re: [PATCH 1/2] iotests: Make _filter_img_create more active

2020-06-17 Thread Maxim Levitsky
On Wed, 2020-06-17 at 15:50 +0200, Max Reitz wrote:
> On 17.06.20 13:46, Maxim Levitsky wrote:
> > On Tue, 2020-06-16 at 15:17 +0200, Max Reitz wrote:
> > > Right now, _filter_img_create just filters out everything that looks
> > > format-dependent, and applies some filename filters.  That means that we
> > > have to add another filter line every time some format gets a new
> > > creation option.  This can be avoided by instead discarding everything
> > > and just keeping what we know is format-independent (format, size,
> > > backing file, encryption information[1], preallocation) or just
> > > interesting to have in the reference output (external data file path).
> > > 
> > > Furthermore, we probably want to sort these options.  Format drivers are
> > > not required to define them in any specific order, so the output is
> > > effectively random (although this has never bothered us until now).  We
> > > need a specific order for our reference outputs, though.  Unfortunately,
> > > just using a plain "sort" would change a lot of existing reference
> > > outputs, so we have to pre-filter the option keys to keep our existing
> > > order (fmt, size, backing*, data, encryption info, preallocation).
> > > 
> > > [1] Actually, the only thing that is really important is whether
> > > encryption is enabled or not.  A patch by Maxim thus removes all
> > > other "encrypt.*" options from the output:
> > > https://lists.nongnu.org/archive/html/qemu-block/2020-06/msg00339.html
> > > But that patch needs to come later so we can get away with changing
> > > as few reference outputs in this patch here as possible.
> > > 
> > > Signed-off-by: Max Reitz 
> > > ---
> > >  tests/qemu-iotests/112.out   |   2 +-
> > >  tests/qemu-iotests/153   |   9 ++-
> > >  tests/qemu-iotests/common.filter | 100 +++
> > >  3 files changed, 81 insertions(+), 30 deletions(-)
> > > 
> > > diff --git a/tests/qemu-iotests/112.out b/tests/qemu-iotests/112.out
> > > index ae0318cabe..182655dbf6 100644
> > > --- a/tests/qemu-iotests/112.out
> > > +++ b/tests/qemu-iotests/112.out
> > > @@ -5,7 +5,7 @@ QA output created by 112
> > >  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and 
> > > may not exceed 64 bits
> > >  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
> > >  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and 
> > > may not exceed 64 bits
> > > -Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 refcount_bits=-1
> > > +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
> > >  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and 
> > > may not exceed 64 bits
> > >  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
> > >  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and 
> > > may not exceed 64 bits
> > > diff --git a/tests/qemu-iotests/153 b/tests/qemu-iotests/153
> > > index cf961d3609..11e3d28841 100755
> > > --- a/tests/qemu-iotests/153
> > > +++ b/tests/qemu-iotests/153
> > > @@ -167,11 +167,10 @@ done
> > >  
> > >  echo
> > >  echo "== Creating ${TEST_IMG}.[abc] ==" | _filter_testdir
> > > -(
> > > -$QEMU_IMG create -f qcow2 "${TEST_IMG}.a" -b "${TEST_IMG}"
> > > -$QEMU_IMG create -f qcow2 "${TEST_IMG}.b" -b "${TEST_IMG}"
> > > -$QEMU_IMG create -f qcow2 "${TEST_IMG}.c" -b "${TEST_IMG}.b"
> > > -) | _filter_img_create
> > > +$QEMU_IMG create -f qcow2 "${TEST_IMG}.a" -b "${TEST_IMG}" | 
> > > _filter_img_create
> > > +$QEMU_IMG create -f qcow2 "${TEST_IMG}.b" -b "${TEST_IMG}" | 
> > > _filter_img_create
> > > +$QEMU_IMG create -f qcow2 "${TEST_IMG}.c" -b "${TEST_IMG}.b" \
> > > +| _filter_img_create
> > >  
> > >  echo
> > >  echo "== Two devices sharing the same file in backing chain =="
> > 
> > I guess this is done because now the filter expects only a single
> > qemu-img output.
> 
> Yes, that’s right.
> 
> > IMHO this is better anyway.
> > 
> > > diff --git a/tests/qemu-iotests/common.filter 
> > > b/tests/qemu-iotests/common.filter
> > > index 03e4f71808..f104ad7a9b 100644
> > > --- a/tests/qemu-iotests/common.filter
> > > +++ b/tests/qemu-iotests/common.filter
> > > @@ -122,38 +122,90 @@ _filter_actual_image_size()
> > >  # replace driver-specific options in the "Formatting..." line
> > >  _filter_img_create()
> > >  {
> > > -data_file_filter=()
> > > -if data_file=$(_get_data_file "$TEST_IMG"); then
> > > -data_file_filter=(-e "s# data_file=$data_file##")
> > > +# Keep QMP output unchanged
> > > +qmp_pre=''
> > > +qmp_post=''
> > > +to_filter=''
> > > +
> > > +while read -r line; do
> > > +if echo "$line" | grep -q '^{.*}$'; then
> > > +if [ -z "$to_filter" ]; then
> > > +# Use $'\n' so the newline is not dropped on variable
> > > +# expansion
> > > +qmp_pre="$qmp_pre$line"$'\n'
> > > +else
> > > +qmp_post="$qmp_p

Re: [PATCH 1/2] iotests: Make _filter_img_create more active

2020-06-17 Thread Max Reitz
On 17.06.20 13:46, Maxim Levitsky wrote:
> On Tue, 2020-06-16 at 15:17 +0200, Max Reitz wrote:
>> Right now, _filter_img_create just filters out everything that looks
>> format-dependent, and applies some filename filters.  That means that we
>> have to add another filter line every time some format gets a new
>> creation option.  This can be avoided by instead discarding everything
>> and just keeping what we know is format-independent (format, size,
>> backing file, encryption information[1], preallocation) or just
>> interesting to have in the reference output (external data file path).
>>
>> Furthermore, we probably want to sort these options.  Format drivers are
>> not required to define them in any specific order, so the output is
>> effectively random (although this has never bothered us until now).  We
>> need a specific order for our reference outputs, though.  Unfortunately,
>> just using a plain "sort" would change a lot of existing reference
>> outputs, so we have to pre-filter the option keys to keep our existing
>> order (fmt, size, backing*, data, encryption info, preallocation).
>>
>> [1] Actually, the only thing that is really important is whether
>> encryption is enabled or not.  A patch by Maxim thus removes all
>> other "encrypt.*" options from the output:
>> https://lists.nongnu.org/archive/html/qemu-block/2020-06/msg00339.html
>> But that patch needs to come later so we can get away with changing
>> as few reference outputs in this patch here as possible.
>>
>> Signed-off-by: Max Reitz 
>> ---
>>  tests/qemu-iotests/112.out   |   2 +-
>>  tests/qemu-iotests/153   |   9 ++-
>>  tests/qemu-iotests/common.filter | 100 +++
>>  3 files changed, 81 insertions(+), 30 deletions(-)
>>
>> diff --git a/tests/qemu-iotests/112.out b/tests/qemu-iotests/112.out
>> index ae0318cabe..182655dbf6 100644
>> --- a/tests/qemu-iotests/112.out
>> +++ b/tests/qemu-iotests/112.out
>> @@ -5,7 +5,7 @@ QA output created by 112
>>  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and may 
>> not exceed 64 bits
>>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
>>  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and may 
>> not exceed 64 bits
>> -Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 refcount_bits=-1
>> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
>>  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and may 
>> not exceed 64 bits
>>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
>>  qemu-img: TEST_DIR/t.IMGFMT: Refcount width must be a power of two and may 
>> not exceed 64 bits
>> diff --git a/tests/qemu-iotests/153 b/tests/qemu-iotests/153
>> index cf961d3609..11e3d28841 100755
>> --- a/tests/qemu-iotests/153
>> +++ b/tests/qemu-iotests/153
>> @@ -167,11 +167,10 @@ done
>>  
>>  echo
>>  echo "== Creating ${TEST_IMG}.[abc] ==" | _filter_testdir
>> -(
>> -$QEMU_IMG create -f qcow2 "${TEST_IMG}.a" -b "${TEST_IMG}"
>> -$QEMU_IMG create -f qcow2 "${TEST_IMG}.b" -b "${TEST_IMG}"
>> -$QEMU_IMG create -f qcow2 "${TEST_IMG}.c" -b "${TEST_IMG}.b"
>> -) | _filter_img_create
>> +$QEMU_IMG create -f qcow2 "${TEST_IMG}.a" -b "${TEST_IMG}" | 
>> _filter_img_create
>> +$QEMU_IMG create -f qcow2 "${TEST_IMG}.b" -b "${TEST_IMG}" | 
>> _filter_img_create
>> +$QEMU_IMG create -f qcow2 "${TEST_IMG}.c" -b "${TEST_IMG}.b" \
>> +| _filter_img_create
>>  
>>  echo
>>  echo "== Two devices sharing the same file in backing chain =="
> 
> I guess this is done because now the filter expects only a single
> qemu-img output.

Yes, that’s right.

> IMHO this is better anyway.
> 
>> diff --git a/tests/qemu-iotests/common.filter 
>> b/tests/qemu-iotests/common.filter
>> index 03e4f71808..f104ad7a9b 100644
>> --- a/tests/qemu-iotests/common.filter
>> +++ b/tests/qemu-iotests/common.filter
>> @@ -122,38 +122,90 @@ _filter_actual_image_size()
>>  # replace driver-specific options in the "Formatting..." line
>>  _filter_img_create()
>>  {
>> -data_file_filter=()
>> -if data_file=$(_get_data_file "$TEST_IMG"); then
>> -data_file_filter=(-e "s# data_file=$data_file##")
>> +# Keep QMP output unchanged
>> +qmp_pre=''
>> +qmp_post=''
>> +to_filter=''
>> +
>> +while read -r line; do
>> +if echo "$line" | grep -q '^{.*}$'; then
>> +if [ -z "$to_filter" ]; then
>> +# Use $'\n' so the newline is not dropped on variable
>> +# expansion
>> +qmp_pre="$qmp_pre$line"$'\n'
>> +else
>> +qmp_post="$qmp_post$line"$'\n'
>> +fi
>> +else
>> +to_filter="$to_filter$line"$'\n'
>> +fi
>> +done
> 
> The above code basically assumes that qmp output starts with '{' and ends 
> with '}'
> which I guess is fair, and then it assumes that we can have a set of qmp
> commands prior
> to the qemu-img line and another set of qmp comman

Re: [PATCH v3] block: Factor out bdrv_run_co()

2020-06-17 Thread Stefan Hajnoczi
On Thu, May 28, 2020 at 08:38:04PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 28.05.2020 18:17, Stefan Hajnoczi wrote:
> > On Wed, May 20, 2020 at 05:49:01PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > We have a few bdrv_*() functions that can either spawn a new coroutine
> > > and wait for it with BDRV_POLL_WHILE() or use a fastpath if they are
> > > already running in a coroutine. All of them duplicate basically the
> > > same code.
> > > 
> > > Factor the common code into a new function bdrv_run_co().
> > > 
> > > Signed-off-by: Kevin Wolf 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > [Factor out bdrv_run_co_entry too]
> > > ---
> > > 
> > > v3: keep created coroutine in BdrvRunCo struct for debugging [Kevin]
> > > 
> > >   block/io.c | 193 -
> > >   1 file changed, 72 insertions(+), 121 deletions(-)
> > 
> > Thanks, applied to my block tree:
> > https://github.com/stefanha/qemu/commits/block
> > 
> > Stefan
> > 
> 
> Actually, [PATCH v5 0/7] coroutines: generate wrapper code
> substitutes this patch. What do you think of it, could we take it instead?

This patch has already been merged but the "coroutines: generate wrapper
code" series can be reviewed and merged separately.

Stefan


signature.asc
Description: PGP signature


[PATCH v2 6/7] block/nvme: keep BDRVNVMeState pointer in NVMeQueuePair

2020-06-17 Thread Stefan Hajnoczi
Passing around both BDRVNVMeState and NVMeQueuePair is unwieldy. Reduce
the number of function arguments by keeping the BDRVNVMeState pointer in
NVMeQueuePair. This will come in handy when a BH is introduced in a
later patch and only one argument can be passed to it.
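
The pattern here is a plain back-pointer: the child object records its
parent once at creation time so that a single-argument callback can
recover all of its context. A tiny illustrative model (the names are
made up for this sketch, not the actual QEMU types):

    #include <stdio.h>

    typedef struct Parent Parent;   /* forward declaration, as in the patch */

    typedef struct {
        Parent *parent;             /* back-pointer, set once at creation */
        int index;
    } Child;

    struct Parent {
        int shared_state;
    };

    /* A BH-style callback receives only one opaque pointer... */
    static void bh_cb(void *opaque)
    {
        Child *c = opaque;
        /* ...but reaches the parent through the back-pointer. */
        printf("child %d sees shared_state %d\n",
               c->index, c->parent->shared_state);
    }

    int main(void)
    {
        Parent p = { .shared_state = 42 };
        Child c = { .parent = &p, .index = 0 };
        bh_cb(&c);
        return 0;
    }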

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Sergio Lopez 
Reviewed-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 70 
 1 file changed, 38 insertions(+), 32 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 426c77e5ab..8dc68d3daa 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -39,6 +39,8 @@
  */
 #define NVME_NUM_REQS (NVME_QUEUE_SIZE - 1)
 
+typedef struct BDRVNVMeState BDRVNVMeState;
+
 typedef struct {
 int32_t  head, tail;
 uint8_t  *queue;
@@ -59,8 +61,11 @@ typedef struct {
 typedef struct {
 QemuMutex   lock;
 
+/* Read from I/O code path, initialized under BQL */
+BDRVNVMeState   *s;
+int index;
+
 /* Fields protected by BQL */
-int index;
 uint8_t *prp_list_pages;
 
 /* Fields protected by @lock */
@@ -96,7 +101,7 @@ typedef volatile struct {
 
 QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
 
-typedef struct {
+struct BDRVNVMeState {
 AioContext *aio_context;
 QEMUVFIOState *vfio;
 NVMeRegs *regs;
@@ -130,7 +135,7 @@ typedef struct {
 
 /* PCI address (required for nvme_refresh_filename()) */
 char *device;
-} BDRVNVMeState;
+};
 
 #define NVME_BLOCK_OPT_DEVICE "device"
 #define NVME_BLOCK_OPT_NAMESPACE "namespace"
@@ -174,7 +179,7 @@ static void nvme_init_queue(BlockDriverState *bs, NVMeQueue 
*q,
 }
 }
 
-static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q)
+static void nvme_free_queue_pair(NVMeQueuePair *q)
 {
 qemu_vfree(q->prp_list_pages);
 qemu_vfree(q->sq.queue);
@@ -205,6 +210,7 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 uint64_t prp_list_iova;
 
 qemu_mutex_init(&q->lock);
+q->s = s;
 q->index = idx;
 qemu_co_queue_init(&q->free_req_queue);
 q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_NUM_REQS);
@@ -240,13 +246,15 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 
 return q;
 fail:
-nvme_free_queue_pair(bs, q);
+nvme_free_queue_pair(q);
 return NULL;
 }
 
 /* With q->lock */
-static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
+static void nvme_kick(NVMeQueuePair *q)
 {
+BDRVNVMeState *s = q->s;
+
 if (s->plugged || !q->need_kick) {
 return;
 }
@@ -295,21 +303,20 @@ static void nvme_put_free_req_locked(NVMeQueuePair *q, 
NVMeRequest *req)
 }
 
 /* With q->lock */
-static void nvme_wake_free_req_locked(BDRVNVMeState *s, NVMeQueuePair *q)
+static void nvme_wake_free_req_locked(NVMeQueuePair *q)
 {
 if (!qemu_co_queue_empty(&q->free_req_queue)) {
-replay_bh_schedule_oneshot_event(s->aio_context,
+replay_bh_schedule_oneshot_event(q->s->aio_context,
 nvme_free_req_queue_cb, q);
 }
 }
 
 /* Insert a request in the freelist and wake waiters */
-static void nvme_put_free_req_and_wake(BDRVNVMeState *s,  NVMeQueuePair *q,
-   NVMeRequest *req)
+static void nvme_put_free_req_and_wake(NVMeQueuePair *q, NVMeRequest *req)
 {
 qemu_mutex_lock(&q->lock);
 nvme_put_free_req_locked(q, req);
-nvme_wake_free_req_locked(s, q);
+nvme_wake_free_req_locked(q);
 qemu_mutex_unlock(&q->lock);
 }
 
@@ -336,8 +343,9 @@ static inline int nvme_translate_error(const NvmeCqe *c)
 }
 
 /* With q->lock */
-static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
+static bool nvme_process_completion(NVMeQueuePair *q)
 {
+BDRVNVMeState *s = q->s;
 bool progress = false;
 NVMeRequest *preq;
 NVMeRequest req;
@@ -386,7 +394,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
NVMeQueuePair *q)
 /* Notify the device so it can post more completions. */
 smp_mb_release();
 *q->cq.doorbell = cpu_to_le32(q->cq.head);
-nvme_wake_free_req_locked(s, q);
+nvme_wake_free_req_locked(q);
 }
 q->busy = false;
 return progress;
@@ -403,8 +411,7 @@ static void nvme_trace_command(const NvmeCmd *cmd)
 }
 }
 
-static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
-NVMeRequest *req,
+static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
 NvmeCmd *cmd, BlockCompletionFunc cb,
 void *opaque)
 {
@@ -413,15 +420,15 @@ static void nvme_submit_command(BDRVNVMeState *s, 
NVMeQueuePair *q,
 req->opaque = opaque;
 cmd->cid = cpu_to_le32(req->cid);
 
-trace_nvme_submit_command(s, q->index, req->cid);
+trace_nvme_submit_command(q->s, q->index, req->cid);
 nvme_trace_command(cmd);
 qemu_mutex_lock(&q->lock);
 memcpy((uint8_t *)q->sq.q

[PATCH v2 7/7] block/nvme: support nested aio_poll()

2020-06-17 Thread Stefan Hajnoczi
QEMU block drivers are supposed to support aio_poll() from I/O
completion callback functions. This means completion processing must be
re-entrant.

The standard approach is to schedule a BH during completion processing
and cancel it at the end of processing. If aio_poll() is invoked by a
callback function then the BH will run. The BH continues the suspended
completion processing.

All of this means that request A's cb() can synchronously wait for
request B to complete. Previously the nvme block driver would hang
because it didn't process completions from nested aio_poll().
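
The following toy model illustrates the schedule-at-start/cancel-at-end
idea in plain C, with no QEMU dependencies; the BH is modeled as a simple
pending flag polled by a stand-in event loop:

    #include <stdbool.h>
    #include <stdio.h>

    static bool bh_pending;      /* stands in for the QEMUBH */
    static int completions = 2;  /* two requests complete back to back */

    static void process_completions(void);

    /* Stand-in aio_poll(): runs the "BH" if it was left scheduled. */
    static void toy_aio_poll(void)
    {
        if (bh_pending) {
            bh_pending = false;
            process_completions();   /* resume suspended processing */
        }
    }

    static void request_cb(int id)
    {
        printf("cb for request %d\n", id);
        if (id == 0) {
            /* A callback that waits for another request re-enters the
             * loop; this is safe because the BH is still scheduled. */
            toy_aio_poll();
        }
    }

    static void process_completions(void)
    {
        bh_pending = true;            /* schedule the BH up front */
        while (completions > 0) {
            request_cb(--completions);   /* may call toy_aio_poll() */
        }
        bh_pending = false;           /* cancel the BH at the end */
    }

    int main(void)
    {
        process_completions();
        return 0;
    }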

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Sergio Lopez 
---
 block/nvme.c   | 67 --
 block/trace-events |  2 +-
 2 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 8dc68d3daa..374e268915 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -74,9 +74,11 @@ typedef struct {
 int cq_phase;
 int free_req_head;
 NVMeRequest reqs[NVME_NUM_REQS];
-boolbusy;
 int need_kick;
 int inflight;
+
+/* Thread-safe, no lock necessary */
+QEMUBH  *completion_bh;
 } NVMeQueuePair;
 
 /* Memory mapped registers */
@@ -140,6 +142,8 @@ struct BDRVNVMeState {
 #define NVME_BLOCK_OPT_DEVICE "device"
 #define NVME_BLOCK_OPT_NAMESPACE "namespace"
 
+static void nvme_process_completion_bh(void *opaque);
+
 static QemuOptsList runtime_opts = {
 .name = "nvme",
 .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
@@ -181,6 +185,9 @@ static void nvme_init_queue(BlockDriverState *bs, NVMeQueue 
*q,
 
 static void nvme_free_queue_pair(NVMeQueuePair *q)
 {
+if (q->completion_bh) {
+qemu_bh_delete(q->completion_bh);
+}
 qemu_vfree(q->prp_list_pages);
 qemu_vfree(q->sq.queue);
 qemu_vfree(q->cq.queue);
@@ -214,6 +221,8 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 q->index = idx;
 qemu_co_queue_init(&q->free_req_queue);
 q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_NUM_REQS);
+q->completion_bh = aio_bh_new(bdrv_get_aio_context(bs),
+  nvme_process_completion_bh, q);
 r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
   s->page_size * NVME_NUM_REQS,
   false, &prp_list_iova);
@@ -352,11 +361,21 @@ static bool nvme_process_completion(NVMeQueuePair *q)
 NvmeCqe *c;
 
 trace_nvme_process_completion(s, q->index, q->inflight);
-if (q->busy || s->plugged) {
-trace_nvme_process_completion_queue_busy(s, q->index);
+if (s->plugged) {
+trace_nvme_process_completion_queue_plugged(s, q->index);
 return false;
 }
-q->busy = true;
+
+/*
+ * Support re-entrancy when a request cb() function invokes aio_poll().
+ * Pending completions must be visible to aio_poll() so that a cb()
+ * function can wait for the completion of another request.
+ *
+ * The aio_poll() loop will execute our BH and we'll resume completion
+ * processing there.
+ */
+qemu_bh_schedule(q->completion_bh);
+
 assert(q->inflight >= 0);
 while (q->inflight) {
 int ret;
@@ -384,10 +403,10 @@ static bool nvme_process_completion(NVMeQueuePair *q)
 assert(req.cb);
 nvme_put_free_req_locked(q, preq);
 preq->cb = preq->opaque = NULL;
-qemu_mutex_unlock(&q->lock);
-req.cb(req.opaque, ret);
-qemu_mutex_lock(&q->lock);
 q->inflight--;
+qemu_mutex_unlock(&q->lock);
+req.cb(req.opaque, ret);
+qemu_mutex_lock(&q->lock);
 progress = true;
 }
 if (progress) {
@@ -396,10 +415,28 @@ static bool nvme_process_completion(NVMeQueuePair *q)
 *q->cq.doorbell = cpu_to_le32(q->cq.head);
 nvme_wake_free_req_locked(q);
 }
-q->busy = false;
+
+qemu_bh_cancel(q->completion_bh);
+
 return progress;
 }
 
+static void nvme_process_completion_bh(void *opaque)
+{
+NVMeQueuePair *q = opaque;
+
+/*
+ * We're being invoked because a nvme_process_completion() cb() function
+ * called aio_poll(). The callback may be waiting for further completions
+ * so notify the device that it has space to fill in more completions now.
+ */
+smp_mb_release();
+*q->cq.doorbell = cpu_to_le32(q->cq.head);
+nvme_wake_free_req_locked(q);
+
+nvme_process_completion(q);
+}
+
 static void nvme_trace_command(const NvmeCmd *cmd)
 {
 int i;
@@ -1309,6 +1346,13 @@ static void nvme_detach_aio_context(BlockDriverState *bs)
 {
 BDRVNVMeState *s = bs->opaque;
 
+for (int i = 0; i < s->nr_queues; i++) {
+NVMeQueuePair *q = s->queues[i];
+
+qemu_bh_delete(q->completion_bh);
+q->completion_bh = NULL;
+}
+
 aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
false, NULL, NULL);
 }
@@ -1321,6 +1365,13 @@

[PATCH v2 5/7] block/nvme: clarify that free_req_queue is protected by q->lock

2020-06-17 Thread Stefan Hajnoczi
Existing users access free_req_queue under q->lock. Document this.
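
For context, the convention being applied is to group struct fields under
a comment naming what protects them, so reviewers can see at a glance
which lock covers which field. A small self-contained illustration, with
pthread as a stand-in for QemuMutex (build with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    typedef struct {
        pthread_mutex_t lock;

        /* Immutable after creation; no lock needed */
        int index;

        /* Fields protected by @lock */
        int inflight;
    } Queue;

    int main(void)
    {
        Queue q = { .lock = PTHREAD_MUTEX_INITIALIZER, .index = 0 };

        pthread_mutex_lock(&q.lock);    /* honor the comment's contract */
        q.inflight++;
        pthread_mutex_unlock(&q.lock);

        printf("inflight: %d\n", q.inflight);
        return 0;
    }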

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Sergio Lopez 
Reviewed-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index 8e60882af3..426c77e5ab 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -57,7 +57,6 @@ typedef struct {
 } NVMeRequest;
 
 typedef struct {
-CoQueue free_req_queue;
 QemuMutex   lock;
 
 /* Fields protected by BQL */
@@ -65,6 +64,7 @@ typedef struct {
 uint8_t *prp_list_pages;
 
 /* Fields protected by @lock */
+CoQueue free_req_queue;
 NVMeQueue   sq, cq;
 int cq_phase;
 int free_req_head;
-- 
2.26.2



[PATCH v2 4/7] block/nvme: switch to a NVMeRequest freelist

2020-06-17 Thread Stefan Hajnoczi
There are three issues with the current NVMeRequest->busy field:
1. The busy field is accidentally accessed outside q->lock when request
   submission fails.
2. Waiters on free_req_queue are not woken when a request is returned
   early due to submission failure.
3. Finding a free request involves scanning all requests. This makes
   request submission O(n^2).

Switch to an O(1) freelist that is always accessed under the lock.

Also differentiate between NVME_QUEUE_SIZE, the actual SQ/CQ size, and
NVME_NUM_REQS, the number of usable requests. This makes the code
simpler than using NVME_QUEUE_SIZE everywhere and having to keep in mind
that one slot is reserved.
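
A stripped-down model of the index-linked freelist (standalone C for
illustration; the real code keeps the same "next" index inside each
NVMeRequest and takes q->lock around both operations):

    #include <assert.h>
    #include <stdio.h>

    #define NUM_REQS 4

    typedef struct {
        int free_next;          /* index of next free slot, -1 = none */
    } Req;

    static Req reqs[NUM_REQS];
    static int free_head = -1;

    static void freelist_init(void)
    {
        for (int i = 0; i < NUM_REQS; i++) {
            reqs[i].free_next = free_head;  /* push slot i */
            free_head = i;
        }
    }

    static Req *req_get(void)                /* O(1) pop */
    {
        if (free_head == -1) {
            return NULL;    /* caller would wait on free_req_queue */
        }
        Req *r = &reqs[free_head];
        free_head = r->free_next;
        r->free_next = -1;
        return r;
    }

    static void req_put(Req *r)              /* O(1) push */
    {
        r->free_next = free_head;
        free_head = (int)(r - reqs);
    }

    int main(void)
    {
        freelist_init();
        Req *a = req_get(), *b = req_get();
        assert(a && b && a != b);
        req_put(a);
        req_put(b);
        printf("free head after returning both: %d\n", free_head);
        return 0;
    }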

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Sergio Lopez 
---
 block/nvme.c | 81 ++--
 1 file changed, 54 insertions(+), 27 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 344893811a..8e60882af3 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -33,6 +33,12 @@
 #define NVME_QUEUE_SIZE 128
 #define NVME_BAR_SIZE 8192
 
+/*
+ * We have to leave one slot empty as that is the full queue case where
+ * head == tail + 1.
+ */
+#define NVME_NUM_REQS (NVME_QUEUE_SIZE - 1)
+
 typedef struct {
 int32_t  head, tail;
 uint8_t  *queue;
@@ -47,7 +53,7 @@ typedef struct {
 int cid;
 void *prp_list_page;
 uint64_t prp_list_iova;
-bool busy;
+int free_req_next; /* q->reqs[] index of next free req */
 } NVMeRequest;
 
 typedef struct {
@@ -61,7 +67,8 @@ typedef struct {
 /* Fields protected by @lock */
 NVMeQueue   sq, cq;
 int cq_phase;
-NVMeRequest reqs[NVME_QUEUE_SIZE];
+int free_req_head;
+NVMeRequest reqs[NVME_NUM_REQS];
 boolbusy;
 int need_kick;
 int inflight;
@@ -200,19 +207,23 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 qemu_mutex_init(&q->lock);
 q->index = idx;
 qemu_co_queue_init(&q->free_req_queue);
-q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_QUEUE_SIZE);
+q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_NUM_REQS);
 r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
-  s->page_size * NVME_QUEUE_SIZE,
+  s->page_size * NVME_NUM_REQS,
   false, &prp_list_iova);
 if (r) {
 goto fail;
 }
-for (i = 0; i < NVME_QUEUE_SIZE; i++) {
+q->free_req_head = -1;
+for (i = 0; i < NVME_NUM_REQS; i++) {
 NVMeRequest *req = &q->reqs[i];
 req->cid = i + 1;
+req->free_req_next = q->free_req_head;
+q->free_req_head = i;
 req->prp_list_page = q->prp_list_pages + i * s->page_size;
 req->prp_list_iova = prp_list_iova + i * s->page_size;
 }
+
 nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
@@ -254,13 +265,11 @@ static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
  */
 static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
 {
-int i;
-NVMeRequest *req = NULL;
+NVMeRequest *req;
 
 qemu_mutex_lock(&q->lock);
-while (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) {
-/* We have to leave one slot empty as that is the full queue case (head
- * == tail + 1). */
+
+while (q->free_req_head == -1) {
 if (qemu_in_coroutine()) {
 trace_nvme_free_req_queue_wait(q);
 qemu_co_queue_wait(&q->free_req_queue, &q->lock);
@@ -269,20 +278,41 @@ static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
 return NULL;
 }
 }
-for (i = 0; i < NVME_QUEUE_SIZE; i++) {
-if (!q->reqs[i].busy) {
-q->reqs[i].busy = true;
-req = &q->reqs[i];
-break;
-}
-}
-/* We have checked inflight and need_kick while holding q->lock, so one
- * free req must be available. */
-assert(req);
+
+req = &q->reqs[q->free_req_head];
+q->free_req_head = req->free_req_next;
+req->free_req_next = -1;
+
 qemu_mutex_unlock(&q->lock);
 return req;
 }
 
+/* With q->lock */
+static void nvme_put_free_req_locked(NVMeQueuePair *q, NVMeRequest *req)
+{
+req->free_req_next = q->free_req_head;
+q->free_req_head = req - q->reqs;
+}
+
+/* With q->lock */
+static void nvme_wake_free_req_locked(BDRVNVMeState *s, NVMeQueuePair *q)
+{
+if (!qemu_co_queue_empty(&q->free_req_queue)) {
+replay_bh_schedule_oneshot_event(s->aio_context,
+nvme_free_req_queue_cb, q);
+}
+}
+
+/* Insert a request in the freelist and wake waiters */
+static void nvme_put_free_req_and_wake(BDRVNVMeState *s,  NVMeQueuePair *q,
+   NVMeRequest *req)
+{
+qemu_mutex_lock(&q->lock);
+nvme_put_free_req_locked(q, req);
+nvme_wake_free_req_locked(s, q);
+qemu_mutex_unlock(&q->lock);
+}
+
 static inline int nvme_translate_error

[PATCH v2 2/7] block/nvme: drop tautologous assertion

2020-06-17 Thread Stefan Hajnoczi
nvme_process_completion() explicitly checks cid so the assertion that
follows is always true:

  if (cid == 0 || cid > NVME_QUEUE_SIZE) {
  ...
  continue;
  }
  assert(cid <= NVME_QUEUE_SIZE);

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Sergio Lopez 
Reviewed-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index e4375ec245..d567ece3f4 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -336,7 +336,6 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
NVMeQueuePair *q)
 cid);
 continue;
 }
-assert(cid <= NVME_QUEUE_SIZE);
 trace_nvme_complete_command(s, q->index, cid);
 preq = &q->reqs[cid - 1];
 req = *preq;
-- 
2.26.2



[PATCH v2 1/7] block/nvme: poll queues without q->lock

2020-06-17 Thread Stefan Hajnoczi
A lot of CPU time is spent simply locking/unlocking q->lock during
polling. Check for completion outside the lock to make q->lock disappear
from the profile.
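
For readers new to NVMe polling: the early check added below relies on the
completion-queue phase-bit convention, where the controller toggles bit 0
of each CQE's status field on every pass through the ring, so an entry is
new exactly when its phase bit differs from the phase already consumed by
the host. A small illustrative model, not the driver code itself:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define CQ_SIZE 4

    typedef struct {
        uint16_t status;        /* bit 0 is the phase bit */
    } Cqe;

    static Cqe cq[CQ_SIZE];
    static int head;            /* next entry to examine */
    static int cq_phase;        /* phase of already-consumed entries */

    static bool cqe_is_new(void)
    {
        /* Same expression as the early check in the patch. */
        return (cq[head].status & 0x1) != cq_phase;
    }

    static void consume_one(void)
    {
        head = (head + 1) % CQ_SIZE;
        if (head == 0) {
            cq_phase = !cq_phase;   /* flips on every ring wrap */
        }
    }

    int main(void)
    {
        cq[0].status = 0x1;     /* controller posts one completion */
        printf("new? %d\n", cqe_is_new());   /* 1: phase differs */
        consume_one();
        printf("new? %d\n", cqe_is_new());   /* 0: stale entry */
        return 0;
    }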

Signed-off-by: Stefan Hajnoczi 
---
 block/nvme.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index eb2f54dd9d..e4375ec245 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -512,6 +512,18 @@ static bool nvme_poll_queues(BDRVNVMeState *s)
 
 for (i = 0; i < s->nr_queues; i++) {
 NVMeQueuePair *q = s->queues[i];
+const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
+NvmeCqe *cqe = (NvmeCqe *)&q->cq.queue[cqe_offset];
+
+/*
+ * Do an early check for completions. q->lock isn't needed because
+ * nvme_process_completion() only runs in the event loop thread and
+ * cannot race with itself.
+ */
+if ((le16_to_cpu(cqe->status) & 0x1) == q->cq_phase) {
+continue;
+}
+
 qemu_mutex_lock(&q->lock);
 while (nvme_process_completion(s, q)) {
 /* Keep polling */
-- 
2.26.2



[PATCH v2 0/7] block/nvme: support nested aio_poll()

2020-06-17 Thread Stefan Hajnoczi
v2:
 * Reword comment in Patch 1 explaining why it's safe not to hold q->lock [Sergio]
 * Fix s/unwiedly/unwieldy/ typo in the Patch 6 commit description [Philippe]

This series allows aio_poll() to work from I/O request completion callbacks.
QEMU block drivers are supposed to support this because some code paths rely on
this behavior.

There was no measurable performance difference with nested aio_poll() support.

This patch series also contains cleanups that I made while examining and
benchmarking the code.

Stefan Hajnoczi (7):
  block/nvme: poll queues without q->lock
  block/nvme: drop tautologous assertion
  block/nvme: don't access CQE after moving cq.head
  block/nvme: switch to a NVMeRequest freelist
  block/nvme: clarify that free_req_queue is protected by q->lock
  block/nvme: keep BDRVNVMeState pointer in NVMeQueuePair
  block/nvme: support nested aio_poll()

 block/nvme.c   | 220 -
 block/trace-events |   2 +-
 2 files changed, 160 insertions(+), 62 deletions(-)

-- 
2.26.2



[PATCH v2 3/7] block/nvme: don't access CQE after moving cq.head

2020-06-17 Thread Stefan Hajnoczi
Do not access a CQE after incrementing q->cq.head and releasing q->lock.
It is unlikely that this causes problems in practice but it's a latent
bug.

The reason why it should be safe at the moment is that completion
processing is not re-entrant and the CQ doorbell isn't written until the
end of nvme_process_completion().

Make this change now because QEMU expects completion processing to be
re-entrant and later patches will do that.
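
The shape of the fix is the usual copy-before-advance rule for ring
buffers; a toy illustration in plain C:

    #include <stdint.h>
    #include <stdio.h>

    #define RING_SIZE 4

    typedef struct {
        int16_t status;
    } Entry;

    static Entry ring[RING_SIZE];
    static int head;

    static int pop_result(void)
    {
        /* Translate/copy while the slot is still ours... */
        int ret = ring[head].status ? -1 : 0;

        /* ...then advance head; after this the slot may be reused and
         * must not be touched again. */
        head = (head + 1) % RING_SIZE;
        return ret;
    }

    int main(void)
    {
        ring[0].status = 0;
        printf("result: %d\n", pop_result());
        return 0;
    }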

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Sergio Lopez 
Reviewed-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index d567ece3f4..344893811a 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -321,11 +321,14 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
NVMeQueuePair *q)
 q->busy = true;
 assert(q->inflight >= 0);
 while (q->inflight) {
+int ret;
 int16_t cid;
+
 c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
 if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
 break;
 }
+ret = nvme_translate_error(c);
 q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
 if (!q->cq.head) {
 q->cq_phase = !q->cq_phase;
@@ -344,7 +347,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
NVMeQueuePair *q)
 preq->busy = false;
 preq->cb = preq->opaque = NULL;
 qemu_mutex_unlock(&q->lock);
-req.cb(req.opaque, nvme_translate_error(c));
+req.cb(req.opaque, ret);
 qemu_mutex_lock(&q->lock);
 q->inflight--;
 progress = true;
-- 
2.26.2



Re: [PATCH] iotests: Add copyright line in qcow2.py

2020-06-17 Thread Kevin Wolf
Am 09.06.2020 um 22:59 hat Eric Blake geschrieben:
> The file qcow2.py was originally contributed in 2012 by Kevin Wolf,
> but was not given traditional boilerplate headers at the time.  The
> missing license was just rectified (commit 16306a7b39) using the
> project-default GPLv2+, but as Vladimir is not at Red Hat, he did not
> add a Copyright line.  All earlier contributions have come from CC'd
> authors, where all but Stefan used a Red Hat address at the time of
> the contribution, and that copyright carries over to the split to
> qcow2_format.py (d5262c7124).
> 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> CC: Eduardo Habkost 
> CC: Max Reitz 
> CC: Philippe Mathieu-Daudé 
> CC: Paolo Bonzini 
> Signed-off-by: Eric Blake 

Thanks, applied to the block branch.

Kevin




Re: [PATCH 0/5] iotests: Some fixes for rarely run cases

2020-06-17 Thread Kevin Wolf
Am 17.06.2020 um 12:48 hat Max Reitz geschrieben:
> Hi,
> 
> Thomas’s report
> (https://lists.nongnu.org/archive/html/qemu-block/2020-06/msg00791.html)
> has given me a nice excuse to write this series.
> 
> There are some iotests that have recently started to fail in rarely
> exercised test environments (qed, qcow2 with data_file, qcow2 v2), and
> this series fixes what I found.

Thanks, applied to the block branch.

Kevin




Re: [PATCH 1/7] block/nvme: poll queues without q->lock

2020-06-17 Thread Stefan Hajnoczi
On Fri, May 29, 2020 at 09:49:31AM +0200, Sergio Lopez wrote:
> On Thu, May 28, 2020 at 04:23:50PM +0100, Stefan Hajnoczi wrote:
> > On Mon, May 25, 2020 at 10:07:13AM +0200, Sergio Lopez wrote:
> > > On Tue, May 19, 2020 at 06:11:32PM +0100, Stefan Hajnoczi wrote:
> > > > A lot of CPU time is spent simply locking/unlocking q->lock during
> > > > polling. Check for completion outside the lock to make q->lock disappear
> > > > from the profile.
> > > >
> > > > Signed-off-by: Stefan Hajnoczi 
> > > > ---
> > > >  block/nvme.c | 12 
> > > >  1 file changed, 12 insertions(+)
> > > >
> > > > diff --git a/block/nvme.c b/block/nvme.c
> > > > index eb2f54dd9d..7eb4512666 100644
> > > > --- a/block/nvme.c
> > > > +++ b/block/nvme.c
> > > > @@ -512,6 +512,18 @@ static bool nvme_poll_queues(BDRVNVMeState *s)
> > > >
> > > >  for (i = 0; i < s->nr_queues; i++) {
> > > >  NVMeQueuePair *q = s->queues[i];
> > > > +const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
> > > > +NvmeCqe *cqe = (NvmeCqe *)&q->cq.queue[cqe_offset];
> > > > +
> > > > +/*
> > > > + * q->lock isn't needed for checking completion because
> > > > + * nvme_process_completion() only runs in the event loop 
> > > > thread and
> > > > + * cannot race with itself.
> > > > + */
> > > > +if ((le16_to_cpu(cqe->status) & 0x1) == q->cq_phase) {
> > > > +continue;
> > > > +}
> > > > +
> > >
> > > IIUC, this is introducing an early check of the phase bit to determine
> > > if there is something new in the queue.
> > >
> > > I'm fine with this optimization, but I have the feeling that the
> > > comment doesn't properly describe it.
> >
> > I'm not sure I understand. The comment explains why it's safe not to
> > take q->lock. Normally it would be taken. Without the comment readers
> > could be confused why we ignore the locking rules here.
> >
> > As for documenting the cqe->status expression itself, I didn't think of
> > explaining it since it's part of the theory of operation of this device.
> > Any polling driver will do this, there's nothing QEMU-specific or
> > unusual going on here.
> >
> > Would you like me to expand the comment explaining that NVMe polling
> > consists of checking the phase bit of the latest cqe to check for
> > readiness?
> >
> > Or maybe I misunderstood? :)
> 
> I was thinking of something like "Do an early check for
> completions. We don't need q->lock here because
> nvme_process_completion() only runs (...)"

Sure, will fix.

Stefan


signature.asc
Description: PGP signature

