date:20210419

Re: [PATCH 03/14] hw/block/nvme: rename __nvme_select_ns_iocs

2021-04-19 Thread Thomas Huth


On 19/04/2021 21.27, Klaus Jensen wrote:

From: Klaus Jensen 

Get rid of the (reserved) double underscore use.

Cc: Philippe Mathieu-Daudé 
Cc: Thomas Huth 
Signed-off-by: Klaus Jensen 
---
  hw/block/nvme.c | 47 +++
  1 file changed, 23 insertions(+), 24 deletions(-)


Reviewed-by: Thomas Huth

Re: [PATCH 02/14] hw/block/nvme: rename __nvme_advance_zone_wp

2021-04-19 Thread Thomas Huth


On 19/04/2021 21.27, Klaus Jensen wrote:

From: Klaus Jensen 

Get rid of the (reserved) double underscore use.

Cc: Philippe Mathieu-Daudé 
Cc: Thomas Huth 
Signed-off-by: Klaus Jensen 
---
  hw/block/nvme.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 002c0672b397..d1b94e36c6fb 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1745,8 +1745,8 @@ static inline uint16_t nvme_zrm_open(NvmeNamespace *ns, 
NvmeZone *zone)
  return nvme_zrm_open_flags(ns, zone, 0);
  }
  
-static void __nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,

-   uint32_t nlb)
+static void nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,
+ uint32_t nlb)
  {
  zone->d.wp += nlb;
  
@@ -1766,7 +1766,7 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req)

  nlb = le16_to_cpu(rw->nlb) + 1;
  zone = nvme_get_zone_by_slba(ns, slba);
  
-__nvme_advance_zone_wp(ns, zone, nlb);

+nvme_advance_zone_wp(ns, zone, nlb);
  }
  
  static inline bool nvme_is_write(NvmeRequest *req)

@@ -2155,7 +2155,7 @@ out:
  uint64_t sdlba = le64_to_cpu(copy->sdlba);
  NvmeZone *zone = nvme_get_zone_by_slba(ns, sdlba);
  
-__nvme_advance_zone_wp(ns, zone, ctx->nlb);

+nvme_advance_zone_wp(ns, zone, ctx->nlb);
  }
  
  g_free(ctx->bounce);




Reviewed-by: Thomas Huth

Re: [PATCH 01/14] hw/block/nvme: rename __nvme_zrm_open

2021-04-19 Thread Thomas Huth

On 19/04/2021 21.27, Klaus Jensen wrote:

From: Klaus Jensen 

Get rid of the (reserved) double underscore use.

Cc: Philippe Mathieu-Daudé 
Cc: Thomas Huth 
Signed-off-by: Klaus Jensen 
---
  hw/block/nvme.c | 16 ++--
  1 file changed, 10 insertions(+), 6 deletions(-)

I think it would be good to mention the change with NVME_ZRM_AUTO in the 
patch description, too.

Apart from that:
Reviewed-by: Thomas Huth

Re: [for-6.1 0/4] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci

2021-04-19 Thread Michael S. Tsirkin

On Wed, Apr 07, 2021 at 04:34:57PM +0200, Greg Kurz wrote:
> Now that virtio-scsi-pci and virtio-blk-pci map 1 virtqueue per vCPU,
> a serious slow down may be observed on setups with a big enough number
> of vCPUs.
> 
> Exemple with a pseries guest on a bi-POWER9 socket system (128 HW threads):
> 
>   virtio-scsi  virtio-blk
> 
> 1 0m20.922s   0m21.346s
> 2 0m21.230s   0m20.350s
> 4 0m21.761s   0m20.997s
> 8 0m22.770s   0m20.051s
> 160m22.038s   0m19.994s
> 320m22.928s   0m20.803s
> 640m26.583s   0m22.953s
> 128   0m41.273s   0m32.333s
> 256   2m4.727s1m16.924s
> 384   6m5.563s3m26.186s
> 
> Both perf and gprof indicate that QEMU is hogging CPUs when setting up
> the ioeventfds:
> 
>  67.88%  swapper [kernel.kallsyms]  [k] power_pmu_enable
>   9.47%  qemu-kvm[kernel.kallsyms]  [k] smp_call_function_single
>   8.64%  qemu-kvm[kernel.kallsyms]  [k] power_pmu_enable
> =>2.79%  qemu-kvmqemu-kvm   [.] memory_region_ioeventfd_before
> =>2.12%  qemu-kvmqemu-kvm   [.] 
> address_space_update_ioeventfds
>   0.56%  kworker/8:0-mm  [kernel.kallsyms]  [k] smp_call_function_single
> 
> address_space_update_ioeventfds() is called when committing an MR
> transaction, i.e. for each ioeventfd with the current code base,
> and it internally loops on all ioventfds:
> 
> static void address_space_update_ioeventfds(AddressSpace *as)
> {
> [...]
> FOR_EACH_FLAT_RANGE(fr, view) {
> for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
> 
> This means that the setup of ioeventfds for these devices has
> quadratic time complexity.
> 
> This series simply changes the device models to extend the transaction
> to all virtqueueues, like already done in the past in the generic
> code with 710fccf80d78 ("virtio: improve virtio devices initialization
> time").
> 
> Only virtio-scsi and virtio-blk are covered here, but a similar change
> might also be beneficial to other device types such as host-scsi-pci,
> vhost-user-scsi-pci and vhost-user-blk-pci.
> 
>   virtio-scsi  virtio-blk
> 
> 1 0m21.271s   0m22.076s
> 2 0m20.912s   0m19.716s
> 4 0m20.508s   0m19.310s
> 8 0m21.374s   0m20.273s
> 160m21.559s   0m21.374s
> 320m22.532s   0m21.271s
> 640m26.550s   0m22.007s
> 128   0m29.115s   0m27.446s
> 256   0m44.752s   0m41.004s
> 384   1m2.884s0m58.023s
> 
> This should fix https://bugzilla.redhat.com/show_bug.cgi?id=1927108
> which reported the issue for virtio-scsi-pci.
> 
> Changes since RFC:
> 
> As suggested by Stefan, splimplify the code by directly beginning and
> committing the memory transaction from the device model, without all
> the virtio specific proxying code and no changes needed in the memory
> subsystem.
> 
> Greg Kurz (4):
>   virtio-blk: Fix rollback path in virtio_blk_data_plane_start()
>   virtio-blk: Configure all host notifiers in a single MR transaction
>   virtio-scsi: Set host notifiers and callbacks separately
>   virtio-scsi: Configure all host notifiers in a single MR transaction
> 
>  hw/block/dataplane/virtio-blk.c | 36 +++--
>  hw/scsi/virtio-scsi-dataplane.c | 56 ++---
>  2 files changed, 72 insertions(+), 20 deletions(-)


Tagged for 6.1, thanks!

> -- 
> 2.26.3
>

[PATCH 09/14] hw/block/nvme: add metadata offset helper

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

Add an nvme_moff() helper.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h |  7 ++-
 hw/block/nvme-dif.c |  4 ++--
 hw/block/nvme-ns.c  |  2 +-
 hw/block/nvme.c | 12 ++--
 4 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index dc065e57b509..9349d1c33ad7 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -107,7 +107,7 @@ typedef struct NvmeNamespace {
 BlockConfblkconf;
 int32_t  bootindex;
 int64_t  size;
-int64_t  mdata_offset;
+int64_t  moff;
 NvmeIdNs id_ns;
 NvmeLBAF lbaf;
 size_t   lbasz;
@@ -158,6 +158,11 @@ static inline size_t nvme_m2b(NvmeNamespace *ns, uint64_t 
lba)
 return ns->lbaf.ms * lba;
 }
 
+static inline int64_t nvme_moff(NvmeNamespace *ns, uint64_t lba)
+{
+return ns->moff + nvme_m2b(ns, lba);
+}
+
 static inline bool nvme_ns_ext(NvmeNamespace *ns)
 {
 return !!NVME_ID_NS_FLBAS_EXTENDED(ns->id_ns.flbas);
diff --git a/hw/block/nvme-dif.c b/hw/block/nvme-dif.c
index c72e43195abf..88efcbe9bd60 100644
--- a/hw/block/nvme-dif.c
+++ b/hw/block/nvme-dif.c
@@ -306,7 +306,7 @@ static void nvme_dif_rw_mdata_in_cb(void *opaque, int ret)
 uint64_t slba = le64_to_cpu(rw->slba);
 uint32_t nlb = le16_to_cpu(rw->nlb) + 1;
 size_t mlen = nvme_m2b(ns, nlb);
-uint64_t offset = ns->mdata_offset + nvme_m2b(ns, slba);
+uint64_t offset = nvme_moff(ns, slba);
 BlockBackend *blk = ns->blkconf.blk;
 
 trace_pci_nvme_dif_rw_mdata_in_cb(nvme_cid(req), blk_name(blk));
@@ -335,7 +335,7 @@ static void nvme_dif_rw_mdata_out_cb(void *opaque, int ret)
 NvmeNamespace *ns = req->ns;
 NvmeRwCmd *rw = (NvmeRwCmd *)>cmd;
 uint64_t slba = le64_to_cpu(rw->slba);
-uint64_t offset = ns->mdata_offset + nvme_m2b(ns, slba);
+uint64_t offset = nvme_moff(ns, slba);
 BlockBackend *blk = ns->blkconf.blk;
 
 trace_pci_nvme_dif_rw_mdata_out_cb(nvme_cid(req), blk_name(blk));
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 2224b497e4b5..84f602652354 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -42,7 +42,7 @@ void nvme_ns_init_format(NvmeNamespace *ns)
 id_ns->ncap = id_ns->nsze;
 id_ns->nuse = id_ns->ncap;
 
-ns->mdata_offset = (int64_t)nlbas << ns->lbaf.ds;
+ns->moff = (int64_t)nlbas << ns->lbaf.ds;
 
 npdg = ns->blkconf.discard_granularity / ns->lbasz;
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index fcc0fe72dc33..b0a6c1457a88 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1827,7 +1827,7 @@ static void nvme_rw_cb(void *opaque, int ret)
 NvmeRwCmd *rw = (NvmeRwCmd *)>cmd;
 uint64_t slba = le64_to_cpu(rw->slba);
 uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
-uint64_t offset = ns->mdata_offset + nvme_m2b(ns, slba);
+uint64_t offset = nvme_moff(ns, slba);
 
 if (req->cmd.opcode == NVME_CMD_WRITE_ZEROES) {
 size_t mlen = nvme_m2b(ns, nlb);
@@ -1993,7 +1993,7 @@ static void nvme_verify_mdata_in_cb(void *opaque, int ret)
 uint64_t slba = le64_to_cpu(rw->slba);
 uint32_t nlb = le16_to_cpu(rw->nlb) + 1;
 size_t mlen = nvme_m2b(ns, nlb);
-uint64_t offset = ns->mdata_offset + nvme_m2b(ns, slba);
+uint64_t offset = nvme_moff(ns, slba);
 BlockBackend *blk = ns->blkconf.blk;
 
 trace_pci_nvme_verify_mdata_in_cb(nvme_cid(req), blk_name(blk));
@@ -2096,7 +2096,7 @@ static void nvme_aio_zone_reset_cb(void *opaque, int ret)
 }
 
 if (ns->lbaf.ms) {
-int64_t offset = ns->mdata_offset + nvme_m2b(ns, zone->d.zslba);
+int64_t offset = nvme_moff(ns, zone->d.zslba);
 
 blk_aio_pwrite_zeroes(ns->blkconf.blk, offset,
   nvme_m2b(ns, ns->zone_size), BDRV_REQ_MAY_UNMAP,
@@ -2167,7 +2167,7 @@ static void nvme_copy_cb(void *opaque, int ret)
 if (ns->lbaf.ms) {
 NvmeCopyCmd *copy = (NvmeCopyCmd *)>cmd;
 uint64_t sdlba = le64_to_cpu(copy->sdlba);
-int64_t offset = ns->mdata_offset + nvme_m2b(ns, sdlba);
+int64_t offset = nvme_moff(ns, sdlba);
 
 qemu_iovec_reset(>sg.iov);
 qemu_iovec_add(>sg.iov, ctx->mbounce, nvme_m2b(ns, ctx->nlb));
@@ -2463,7 +2463,7 @@ static void nvme_compare_data_cb(void *opaque, int ret)
 uint64_t slba = le64_to_cpu(rw->slba);
 uint32_t nlb = le16_to_cpu(rw->nlb) + 1;
 size_t mlen = nvme_m2b(ns, nlb);
-uint64_t offset = ns->mdata_offset + nvme_m2b(ns, slba);
+uint64_t offset = nvme_moff(ns, slba);
 
 ctx->mdata.bounce = g_malloc(mlen);
 
@@ -2744,7 +2744,7 @@ static uint16_t nvme_copy(NvmeCtrl *n, NvmeRequest *req)
 
 if (ns->lbaf.ms) {
 len = nvme_m2b(ns, nlb);
-offset = ns->mdata_offset + nvme_m2b(ns, slba);
+offset = nvme_moff(ns, slba);
 
 in_ctx = g_new(struct nvme_copy_in_ctx, 1);
 in_ctx->req = req;
-- 
2.31.1

[PATCH 12/14] hw/block/nvme: remove irrelevant zone resource checks

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

It is not an error to report more active/open zones supported than the
number of zones in the namespace.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme-ns.c | 13 -
 1 file changed, 13 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 84f602652354..2041d8138420 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -210,19 +210,6 @@ static int nvme_ns_zoned_check_calc_geometry(NvmeNamespace 
*ns, Error **errp)
 return -1;
 }
 
-if (ns->params.max_open_zones > ns->num_zones) {
-error_setg(errp,
-   "max_open_zones value %u exceeds the number of zones %u",
-   ns->params.max_open_zones, ns->num_zones);
-return -1;
-}
-if (ns->params.max_active_zones > ns->num_zones) {
-error_setg(errp,
-   "max_active_zones value %u exceeds the number of zones %u",
-   ns->params.max_active_zones, ns->num_zones);
-return -1;
-}
-
 if (ns->params.max_active_zones) {
 if (ns->params.max_open_zones > ns->params.max_active_zones) {
 error_setg(errp, "max_open_zones (%u) exceeds max_active_zones 
(%u)",
-- 
2.31.1

[PATCH 08/14] hw/block/nvme: cache lba and ms sizes

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

There is no need to look up the lba size and metadata size in the LBA
Format structure everytime we want to use it. And we use it a lot.

Cache the values in the NvmeNamespace and update them if the namespace
is formatted.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h | 37 ---
 hw/block/nvme-dif.c | 45 ++-
 hw/block/nvme-ns.c  | 26 +
 hw/block/nvme.c | 47 ++---
 4 files changed, 56 insertions(+), 99 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index d9bee7e5a05c..dc065e57b509 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -109,6 +109,8 @@ typedef struct NvmeNamespace {
 int64_t  size;
 int64_t  mdata_offset;
 NvmeIdNs id_ns;
+NvmeLBAF lbaf;
+size_t   lbasz;
 const uint32_t *iocs;
 uint8_t  csi;
 uint16_t status;
@@ -146,36 +148,14 @@ static inline uint32_t nvme_nsid(NvmeNamespace *ns)
 return 0;
 }
 
-static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
-{
-NvmeIdNs *id_ns = >id_ns;
-return _ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
-}
-
-static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
-{
-return nvme_ns_lbaf(ns)->ds;
-}
-
-/* convert an LBA to the equivalent in bytes */
 static inline size_t nvme_l2b(NvmeNamespace *ns, uint64_t lba)
 {
-return lba << nvme_ns_lbads(ns);
-}
-
-static inline size_t nvme_lsize(NvmeNamespace *ns)
-{
-return 1 << nvme_ns_lbads(ns);
-}
-
-static inline uint16_t nvme_msize(NvmeNamespace *ns)
-{
-return nvme_ns_lbaf(ns)->ms;
+return lba << ns->lbaf.ds;
 }
 
 static inline size_t nvme_m2b(NvmeNamespace *ns, uint64_t lba)
 {
-return nvme_msize(ns) * lba;
+return ns->lbaf.ms * lba;
 }
 
 static inline bool nvme_ns_ext(NvmeNamespace *ns)
@@ -183,15 +163,6 @@ static inline bool nvme_ns_ext(NvmeNamespace *ns)
 return !!NVME_ID_NS_FLBAS_EXTENDED(ns->id_ns.flbas);
 }
 
-/* calculate the number of LBAs that the namespace can accomodate */
-static inline uint64_t nvme_ns_nlbas(NvmeNamespace *ns)
-{
-if (nvme_msize(ns)) {
-return ns->size / (nvme_lsize(ns) + nvme_msize(ns));
-}
-return ns->size >> nvme_ns_lbads(ns);
-}
-
 static inline NvmeZoneState nvme_get_zone_state(NvmeZone *zone)
 {
 return zone->d.zs >> 4;
diff --git a/hw/block/nvme-dif.c b/hw/block/nvme-dif.c
index e269d275ebed..c72e43195abf 100644
--- a/hw/block/nvme-dif.c
+++ b/hw/block/nvme-dif.c
@@ -44,20 +44,18 @@ void nvme_dif_pract_generate_dif(NvmeNamespace *ns, uint8_t 
*buf, size_t len,
  uint32_t reftag)
 {
 uint8_t *end = buf + len;
-size_t lsize = nvme_lsize(ns);
-size_t msize = nvme_msize(ns);
 int16_t pil = 0;
 
 if (!(ns->id_ns.dps & NVME_ID_NS_DPS_FIRST_EIGHT)) {
-pil = nvme_msize(ns) - sizeof(NvmeDifTuple);
+pil = ns->lbaf.ms - sizeof(NvmeDifTuple);
 }
 
-trace_pci_nvme_dif_pract_generate_dif(len, lsize, lsize + pil, apptag,
-  reftag);
+trace_pci_nvme_dif_pract_generate_dif(len, ns->lbasz, ns->lbasz + pil,
+  apptag, reftag);
 
-for (; buf < end; buf += lsize, mbuf += msize) {
+for (; buf < end; buf += ns->lbasz, mbuf += ns->lbaf.ms) {
 NvmeDifTuple *dif = (NvmeDifTuple *)(mbuf + pil);
-uint16_t crc = crc_t10dif(0x0, buf, lsize);
+uint16_t crc = crc_t10dif(0x0, buf, ns->lbasz);
 
 if (pil) {
 crc = crc_t10dif(crc, mbuf, pil);
@@ -98,7 +96,7 @@ static uint16_t nvme_dif_prchk(NvmeNamespace *ns, 
NvmeDifTuple *dif,
 }
 
 if (ctrl & NVME_RW_PRINFO_PRCHK_GUARD) {
-uint16_t crc = crc_t10dif(0x0, buf, nvme_lsize(ns));
+uint16_t crc = crc_t10dif(0x0, buf, ns->lbasz);
 
 if (pil) {
 crc = crc_t10dif(crc, mbuf, pil);
@@ -137,8 +135,6 @@ uint16_t nvme_dif_check(NvmeNamespace *ns, uint8_t *buf, 
size_t len,
 uint16_t appmask, uint32_t reftag)
 {
 uint8_t *end = buf + len;
-size_t lsize = nvme_lsize(ns);
-size_t msize = nvme_msize(ns);
 int16_t pil = 0;
 uint16_t status;
 
@@ -148,12 +144,12 @@ uint16_t nvme_dif_check(NvmeNamespace *ns, uint8_t *buf, 
size_t len,
 }
 
 if (!(ns->id_ns.dps & NVME_ID_NS_DPS_FIRST_EIGHT)) {
-pil = nvme_msize(ns) - sizeof(NvmeDifTuple);
+pil = ns->lbaf.ms - sizeof(NvmeDifTuple);
 }
 
-trace_pci_nvme_dif_check(NVME_RW_PRINFO(ctrl), lsize + pil);
+trace_pci_nvme_dif_check(NVME_RW_PRINFO(ctrl), ns->lbasz + pil);
 
-for (; buf < end; buf += lsize, mbuf += msize) {
+for (; buf < end; buf += ns->lbasz, mbuf += ns->lbaf.ms) {
 NvmeDifTuple *dif = (NvmeDifTuple *)(mbuf + pil);
 
 status = nvme_dif_prchk(ns, dif, buf, mbuf, pil, ctrl, apptag,
@@ -176,20 +172,18 @@ uint16_t nvme_dif_mangle_mdata(NvmeNamespace *ns, uint8_t

[PATCH 10/14] hw/block/nvme: streamline namespace array indexing

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

Streamline namespace array indexing such that both the subsystem and
controller namespaces arrays are 1-indexed.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h | 4 ++--
 hw/block/nvme.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 9349d1c33ad7..ac3f0a886735 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -438,7 +438,7 @@ typedef struct NvmeCtrl {
 NvmeSubsystem   *subsys;
 
 NvmeNamespace   namespace;
-NvmeNamespace   *namespaces[NVME_MAX_NAMESPACES];
+NvmeNamespace   *namespaces[NVME_MAX_NAMESPACES + 1];
 NvmeSQueue  **sq;
 NvmeCQueue  **cq;
 NvmeSQueue  admin_sq;
@@ -460,7 +460,7 @@ static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t 
nsid)
 return NULL;
 }
 
-return n->namespaces[nsid - 1];
+return n->namespaces[nsid];
 }
 
 static inline NvmeCQueue *nvme_cq(NvmeRequest *req)
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b0a6c1457a88..0b96936129d6 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -4978,7 +4978,7 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, 
NvmeRequest *req)
 return NVME_NS_NOT_ATTACHED | NVME_DNR;
 }
 
-ctrl->namespaces[nsid - 1] = NULL;
+ctrl->namespaces[nsid] = NULL;
 ns->attached--;
 
 nvme_update_dmrsl(ctrl);
@@ -6151,7 +6151,7 @@ void nvme_attach_ns(NvmeCtrl *n, NvmeNamespace *ns)
 uint32_t nsid = ns->params.nsid;
 assert(nsid && nsid <= NVME_MAX_NAMESPACES);
 
-n->namespaces[nsid - 1] = ns;
+n->namespaces[nsid] = ns;
 ns->attached++;
 
 n->dmrsl = MIN_NON_ZERO(n->dmrsl,
-- 
2.31.1

[PATCH 06/14] hw/block/nvme: remove non-shared defines from header file

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

Remove non-shared defines from the shared header.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h| 2 --
 hw/block/nvme-ns.c | 1 +
 hw/block/nvme.c| 1 +
 3 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index d9374d3e33e0..2c4e7b90fa54 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -24,8 +24,6 @@
 
 #include "block/nvme.h"
 
-#define NVME_DEFAULT_ZONE_SIZE   (128 * MiB)
-#define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
 #define NVME_MAX_CONTROLLERS 32
 #define NVME_MAX_NAMESPACES  256
 
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index aae06987e49a..35c146633223 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -23,6 +23,7 @@
 #include "trace.h"
 
 #define MIN_DISCARD_GRANULARITY (4 * KiB)
+#define NVME_DEFAULT_ZONE_SIZE   (128 * MiB)
 
 void nvme_ns_init_format(NvmeNamespace *ns)
 {
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index f7c5e83e6800..2c0af579e7a8 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -159,6 +159,7 @@
 #define NVME_TEMPERATURE_WARNING 0x157
 #define NVME_TEMPERATURE_CRITICAL 0x175
 #define NVME_NUM_FW_SLOTS 1
+#define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
-- 
2.31.1

[PATCH 07/14] hw/block/nvme: replace nvme_ns_status

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

The inline nvme_ns_status() helper only has a single call site. Remove
it from the header file and inline it for real.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h |  5 -
 hw/block/nvme.c | 15 ---
 2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 2c4e7b90fa54..d9bee7e5a05c 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -137,11 +137,6 @@ typedef struct NvmeNamespace {
 } features;
 } NvmeNamespace;
 
-static inline uint16_t nvme_ns_status(NvmeNamespace *ns)
-{
-return ns->status;
-}
-
 static inline uint32_t nvme_nsid(NvmeNamespace *ns)
 {
 if (ns) {
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 2c0af579e7a8..bcef6038ae09 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -3594,8 +3594,8 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, 
NvmeRequest *req)
 
 static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
+NvmeNamespace *ns;
 uint32_t nsid = le32_to_cpu(req->cmd.nsid);
-uint16_t status;
 
 trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),
   req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));
@@ -3627,21 +3627,22 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 return nvme_flush(n, req);
 }
 
-req->ns = nvme_ns(n, nsid);
-if (unlikely(!req->ns)) {
+ns = nvme_ns(n, nsid);
+if (unlikely(!ns)) {
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-if (!(req->ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
+if (!(ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
 trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
 return NVME_INVALID_OPCODE | NVME_DNR;
 }
 
-status = nvme_ns_status(req->ns);
-if (unlikely(status)) {
-return status;
+if (ns->status) {
+return ns->status;
 }
 
+req->ns = ns;
+
 switch (req->cmd.opcode) {
 case NVME_CMD_WRITE_ZEROES:
 return nvme_write_zeroes(n, req);
-- 
2.31.1

[PATCH 04/14] hw/block/nvme: consolidate header files

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

In preparation for moving the nvme device into its own subtree, merge
the header files into one.

Also add missing copyright notice and add list of authors with
substantial contributions.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme-dif.h|  63 ---
 hw/block/nvme-ns.h | 229 
 hw/block/nvme-subsys.h |  59 ---
 hw/block/nvme.h| 383 +
 hw/block/nvme-dif.c|   1 -
 hw/block/nvme-ns.c |   1 -
 hw/block/nvme-subsys.c |   1 -
 hw/block/nvme.c|   2 -
 8 files changed, 348 insertions(+), 391 deletions(-)
 delete mode 100644 hw/block/nvme-dif.h
 delete mode 100644 hw/block/nvme-ns.h
 delete mode 100644 hw/block/nvme-subsys.h

diff --git a/hw/block/nvme-dif.h b/hw/block/nvme-dif.h
deleted file mode 100644
index 524faffbd7a0..
--- a/hw/block/nvme-dif.h
+++ /dev/null
@@ -1,63 +0,0 @@
-/*
- * QEMU NVM Express End-to-End Data Protection support
- *
- * Copyright (c) 2021 Samsung Electronics Co., Ltd.
- *
- * Authors:
- *   Klaus Jensen   
- *   Gollu Appalanaidu  
- */
-
-#ifndef HW_NVME_DIF_H
-#define HW_NVME_DIF_H
-
-/* from Linux kernel (crypto/crct10dif_common.c) */
-static const uint16_t t10_dif_crc_table[256] = {
-0x, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B,
-0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6,
-0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6,
-0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B,
-0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1,
-0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C,
-0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C,
-0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781,
-0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8,
-0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255,
-0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925,
-0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698,
-0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472,
-0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF,
-0x25D4, 0xAE63, 0xB90D, 0x32BA, 0x97D1, 0x1C66, 0x0B08, 0x80BF,
-0xCA69, 0x41DE, 0x56B0, 0xDD07, 0x786C, 0xF3DB, 0xE4B5, 0x6F02,
-0x3AB1, 0xB106, 0xA668, 0x2DDF, 0x88B4, 0x0303, 0x146D, 0x9FDA,
-0xD50C, 0x5EBB, 0x49D5, 0xC262, 0x6709, 0xECBE, 0xFBD0, 0x7067,
-0x6E7C, 0xE5CB, 0xF2A5, 0x7912, 0xDC79, 0x57CE, 0x40A0, 0xCB17,
-0x81C1, 0x0A76, 0x1D18, 0x96AF, 0x33C4, 0xB873, 0xAF1D, 0x24AA,
-0x932B, 0x189C, 0x0FF2, 0x8445, 0x212E, 0xAA99, 0xBDF7, 0x3640,
-0x7C96, 0xF721, 0xE04F, 0x6BF8, 0xCE93, 0x4524, 0x524A, 0xD9FD,
-0xC7E6, 0x4C51, 0x5B3F, 0xD088, 0x75E3, 0xFE54, 0xE93A, 0x628D,
-0x285B, 0xA3EC, 0xB482, 0x3F35, 0x9A5E, 0x11E9, 0x0687, 0x8D30,
-0xE232, 0x6985, 0x7EEB, 0xF55C, 0x5037, 0xDB80, 0xCCEE, 0x4759,
-0x0D8F, 0x8638, 0x9156, 0x1AE1, 0xBF8A, 0x343D, 0x2353, 0xA8E4,
-0xB6FF, 0x3D48, 0x2A26, 0xA191, 0x04FA, 0x8F4D, 0x9823, 0x1394,
-0x5942, 0xD2F5, 0xC59B, 0x4E2C, 0xEB47, 0x60F0, 0x779E, 0xFC29,
-0x4BA8, 0xC01F, 0xD771, 0x5CC6, 0xF9AD, 0x721A, 0x6574, 0xEEC3,
-0xA415, 0x2FA2, 0x38CC, 0xB37B, 0x1610, 0x9DA7, 0x8AC9, 0x017E,
-0x1F65, 0x94D2, 0x83BC, 0x080B, 0xAD60, 0x26D7, 0x31B9, 0xBA0E,
-0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3
-};
-
-uint16_t nvme_check_prinfo(NvmeNamespace *ns, uint16_t ctrl, uint64_t slba,
-   uint32_t reftag);
-uint16_t nvme_dif_mangle_mdata(NvmeNamespace *ns, uint8_t *mbuf, size_t mlen,
-   uint64_t slba);
-void nvme_dif_pract_generate_dif(NvmeNamespace *ns, uint8_t *buf, size_t len,
- uint8_t *mbuf, size_t mlen, uint16_t apptag,
- uint32_t reftag);
-uint16_t nvme_dif_check(NvmeNamespace *ns, uint8_t *buf, size_t len,
-uint8_t *mbuf, size_t mlen, uint16_t ctrl,
-uint64_t slba, uint16_t apptag,
-uint16_t appmask, uint32_t reftag);
-uint16_t nvme_dif_rw(NvmeCtrl *n, NvmeRequest *req);
-
-#endif /* HW_NVME_DIF_H */
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
deleted file mode 100644
index fb0a41f912e7..
--- a/hw/block/nvme-ns.h
+++ /dev/null
@@ -1,229 +0,0 @@
-/*
- * QEMU NVM Express Virtual Namespace
- *
- * Copyright (c) 2019 CNEX Labs
- * Copyright (c) 2020 Samsung Electronics
- *
- * Authors:
- *  Klaus Jensen  
- *
- * This work is licensed under the terms of the GNU GPL, version 2. See the
- * COPYING file in the top-level directory.
- *
- */
-
-#ifndef NVME_NS_H
-#define NVME_NS_H
-
-#include "qemu/uuid.h"
-
-#define TYPE_NVME_NS "nvme-ns"
-#define NVME_NS(obj) \
-OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
-
-typedef struct NvmeZone {
-NvmeZoneDescr   d;
-uint64_t

[PATCH 05/14] hw/block/nvme: cleanup includes

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

Clean up includes.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme-dif.c|  7 +++
 hw/block/nvme-ns.c | 11 ++-
 hw/block/nvme-subsys.c | 12 +---
 hw/block/nvme.c| 22 +-
 4 files changed, 15 insertions(+), 37 deletions(-)

diff --git a/hw/block/nvme-dif.c b/hw/block/nvme-dif.c
index 25e5a90854fa..e269d275ebed 100644
--- a/hw/block/nvme-dif.c
+++ b/hw/block/nvme-dif.c
@@ -9,12 +9,11 @@
  */
 
 #include "qemu/osdep.h"
-#include "hw/block/block.h"
-#include "sysemu/dma.h"
-#include "sysemu/block-backend.h"
 #include "qapi/error.h"
-#include "trace.h"
+#include "sysemu/block-backend.h"
+
 #include "nvme.h"
+#include "trace.h"
 
 uint16_t nvme_check_prinfo(NvmeNamespace *ns, uint16_t ctrl, uint64_t slba,
uint32_t reftag)
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index b538c4c4a13c..aae06987e49a 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -14,20 +14,13 @@
 
 #include "qemu/osdep.h"
 #include "qemu/units.h"
-#include "qemu/cutils.h"
-#include "qemu/log.h"
 #include "qemu/error-report.h"
-#include "hw/block/block.h"
-#include "hw/pci/pci.h"
+#include "qapi/error.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/block-backend.h"
-#include "qapi/error.h"
 
-#include "hw/qdev-properties.h"
-#include "hw/qdev-core.h"
-
-#include "trace.h"
 #include "nvme.h"
+#include "trace.h"
 
 #define MIN_DISCARD_GRANULARITY (4 * KiB)
 
diff --git a/hw/block/nvme-subsys.c b/hw/block/nvme-subsys.c
index b81067f7b0d3..192223d17ca1 100644
--- a/hw/block/nvme-subsys.c
+++ b/hw/block/nvme-subsys.c
@@ -6,19 +6,9 @@
  * This code is licensed under the GNU GPL v2.  Refer COPYING.
  */
 
-#include "qemu/units.h"
 #include "qemu/osdep.h"
-#include "qemu/uuid.h"
-#include "qemu/iov.h"
-#include "qemu/cutils.h"
 #include "qapi/error.h"
-#include "hw/qdev-properties.h"
-#include "hw/qdev-core.h"
-#include "hw/block/block.h"
-#include "block/aio.h"
-#include "block/accounting.h"
-#include "sysemu/sysemu.h"
-#include "hw/pci/pci.h"
+
 #include "nvme.h"
 
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index c51de480d9fd..f7c5e83e6800 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -135,24 +135,20 @@
  */
 
 #include "qemu/osdep.h"
-#include "qemu/units.h"
+#include "qemu/cutils.h"
 #include "qemu/error-report.h"
-#include "hw/block/block.h"
-#include "hw/pci/msix.h"
-#include "hw/pci/pci.h"
-#include "hw/qdev-properties.h"
-#include "migration/vmstate.h"
-#include "sysemu/sysemu.h"
+#include "qemu/log.h"
+#include "qemu/units.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
-#include "sysemu/hostmem.h"
+#include "sysemu/sysemu.h"
 #include "sysemu/block-backend.h"
-#include "exec/memory.h"
-#include "qemu/log.h"
-#include "qemu/module.h"
-#include "qemu/cutils.h"
-#include "trace.h"
+#include "sysemu/hostmem.h"
+#include "hw/pci/msix.h"
+#include "migration/vmstate.h"
+
 #include "nvme.h"
+#include "trace.h"
 
 #define NVME_MAX_IOQPAIRS 0x
 #define NVME_DB_SIZE  4
-- 
2.31.1

[PATCH 03/14] hw/block/nvme: rename __nvme_select_ns_iocs

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

Get rid of the (reserved) double underscore use.

Cc: Philippe Mathieu-Daudé 
Cc: Thomas Huth 
Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 47 +++
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index d1b94e36c6fb..f8209a92302b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -4916,7 +4916,25 @@ static void nvme_update_dmrsl(NvmeCtrl *n)
 }
 }
 
-static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns);
+static void nvme_select_iocs_ns(NvmeCtrl *n, NvmeNamespace *ns)
+{
+ns->iocs = nvme_cse_iocs_none;
+switch (ns->csi) {
+case NVME_CSI_NVM:
+if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
+ns->iocs = nvme_cse_iocs_nvm;
+}
+break;
+case NVME_CSI_ZONED:
+if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_CSI) {
+ns->iocs = nvme_cse_iocs_zoned;
+} else if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_NVM) {
+ns->iocs = nvme_cse_iocs_nvm;
+}
+break;
+}
+}
+
 static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req)
 {
 NvmeNamespace *ns;
@@ -4967,7 +4985,7 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, 
NvmeRequest *req)
 }
 
 nvme_attach_ns(ctrl, ns);
-__nvme_select_ns_iocs(ctrl, ns);
+nvme_select_iocs_ns(ctrl, ns);
 } else {
 if (!nvme_ns(ctrl, nsid)) {
 return NVME_NS_NOT_ATTACHED | NVME_DNR;
@@ -5268,26 +5286,7 @@ static void nvme_ctrl_shutdown(NvmeCtrl *n)
 }
 }
 
-static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns)
-{
-ns->iocs = nvme_cse_iocs_none;
-switch (ns->csi) {
-case NVME_CSI_NVM:
-if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
-ns->iocs = nvme_cse_iocs_nvm;
-}
-break;
-case NVME_CSI_ZONED:
-if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_CSI) {
-ns->iocs = nvme_cse_iocs_zoned;
-} else if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_NVM) {
-ns->iocs = nvme_cse_iocs_nvm;
-}
-break;
-}
-}
-
-static void nvme_select_ns_iocs(NvmeCtrl *n)
+static void nvme_select_iocs(NvmeCtrl *n)
 {
 NvmeNamespace *ns;
 int i;
@@ -5298,7 +5297,7 @@ static void nvme_select_ns_iocs(NvmeCtrl *n)
 continue;
 }
 
-__nvme_select_ns_iocs(n, ns);
+nvme_select_iocs_ns(n, ns);
 }
 }
 
@@ -5400,7 +5399,7 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 
 QTAILQ_INIT(>aer_queue);
 
-nvme_select_ns_iocs(n);
+nvme_select_iocs(n);
 
 return 0;
 }
-- 
2.31.1

[PATCH 00/14] hw(/block/)nvme: spring cleaning

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

This series consists of various clean up patches.

The final patch moves nvme emulation from hw/block to hw/nvme.

Klaus Jensen (14):
  hw/block/nvme: rename __nvme_zrm_open
  hw/block/nvme: rename __nvme_advance_zone_wp
  hw/block/nvme: rename __nvme_select_ns_iocs
  hw/block/nvme: consolidate header files
  hw/block/nvme: cleanup includes
  hw/block/nvme: remove non-shared defines from header file
  hw/block/nvme: replace nvme_ns_status
  hw/block/nvme: cache lba and ms sizes
  hw/block/nvme: add metadata offset helper
  hw/block/nvme: streamline namespace array indexing
  hw/block/nvme: remove num_namespaces member
  hw/block/nvme: remove irrelevant zone resource checks
  hw/block/nvme: move zoned constraints checks
  hw/nvme: move nvme emulation out of hw/block

 meson.build   |   1 +
 hw/block/nvme-dif.h   |  63 ---
 hw/block/nvme-ns.h| 229 -
 hw/block/nvme-subsys.h|  59 ---
 hw/block/nvme.h   | 266 ---
 hw/nvme/nvme.h| 547 ++
 hw/nvme/trace.h   |   1 +
 hw/{block/nvme.c => nvme/ctrl.c}  | 204 
 hw/{block/nvme-dif.c => nvme/dif.c}   |  57 +--
 hw/{block/nvme-ns.c => nvme/ns.c} | 104 ++--
 hw/{block/nvme-subsys.c => nvme/subsys.c} |  13 +-
 MAINTAINERS   |   2 +-
 hw/Kconfig|   1 +
 hw/block/Kconfig  |   5 -
 hw/block/meson.build  |   1 -
 hw/block/trace-events | 206 
 hw/meson.build|   1 +
 hw/nvme/Kconfig   |   4 +
 hw/nvme/meson.build   |   1 +
 hw/nvme/trace-events  | 204 
 20 files changed, 928 insertions(+), 1041 deletions(-)
 delete mode 100644 hw/block/nvme-dif.h
 delete mode 100644 hw/block/nvme-ns.h
 delete mode 100644 hw/block/nvme-subsys.h
 delete mode 100644 hw/block/nvme.h
 create mode 100644 hw/nvme/nvme.h
 create mode 100644 hw/nvme/trace.h
 rename hw/{block/nvme.c => nvme/ctrl.c} (98%)
 rename hw/{block/nvme-dif.c => nvme/dif.c} (90%)
 rename hw/{block/nvme-ns.c => nvme/ns.c} (87%)
 rename hw/{block/nvme-subsys.c => nvme/subsys.c} (85%)
 create mode 100644 hw/nvme/Kconfig
 create mode 100644 hw/nvme/meson.build
 create mode 100644 hw/nvme/trace-events

-- 
2.31.1

[PATCH 14/14] hw/nvme: move nvme emulation out of hw/block

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

With the introduction of the nvme-subsystem device we are really
cluttering up the hw/block directory.

As suggested by Philippe previously, move the nvme emulation to hw/nvme.

Suggested-by: Philippe Mathieu-Daudé 
Signed-off-by: Klaus Jensen 
---
 meson.build   |   1 +
 hw/{block => nvme}/nvme.h |   6 +-
 hw/nvme/trace.h   |   1 +
 hw/{block/nvme.c => nvme/ctrl.c}  |   0
 hw/{block/nvme-dif.c => nvme/dif.c}   |   0
 hw/{block/nvme-ns.c => nvme/ns.c} |   0
 hw/{block/nvme-subsys.c => nvme/subsys.c} |   0
 MAINTAINERS   |   2 +-
 hw/Kconfig|   1 +
 hw/block/Kconfig  |   5 -
 hw/block/meson.build  |   1 -
 hw/block/trace-events | 206 --
 hw/meson.build|   1 +
 hw/nvme/Kconfig   |   4 +
 hw/nvme/meson.build   |   1 +
 hw/nvme/trace-events  | 204 +
 16 files changed, 217 insertions(+), 216 deletions(-)
 rename hw/{block => nvme}/nvme.h (99%)
 create mode 100644 hw/nvme/trace.h
 rename hw/{block/nvme.c => nvme/ctrl.c} (100%)
 rename hw/{block/nvme-dif.c => nvme/dif.c} (100%)
 rename hw/{block/nvme-ns.c => nvme/ns.c} (100%)
 rename hw/{block/nvme-subsys.c => nvme/subsys.c} (100%)
 create mode 100644 hw/nvme/Kconfig
 create mode 100644 hw/nvme/meson.build
 create mode 100644 hw/nvme/trace-events

diff --git a/meson.build b/meson.build
index c6f4b0cf5e8a..59354cd53c97 100644
--- a/meson.build
+++ b/meson.build
@@ -1811,6 +1811,7 @@ if have_system
 'hw/misc/macio',
 'hw/net',
 'hw/net/can',
+'hw/nvme',
 'hw/nvram',
 'hw/pci',
 'hw/pci-host',
diff --git a/hw/block/nvme.h b/hw/nvme/nvme.h
similarity index 99%
rename from hw/block/nvme.h
rename to hw/nvme/nvme.h
index fb028d81d16f..81a35cda142b 100644
--- a/hw/block/nvme.h
+++ b/hw/nvme/nvme.h
@@ -15,8 +15,8 @@
  * This code is licensed under the GNU GPL v2 or later.
  */
 
-#ifndef HW_NVME_H
-#define HW_NVME_H
+#ifndef HW_NVME_INTERNAL_H
+#define HW_NVME_INTERNAL_H
 
 #include "qemu/uuid.h"
 #include "hw/pci/pci.h"
@@ -544,4 +544,4 @@ uint16_t nvme_dif_check(NvmeNamespace *ns, uint8_t *buf, 
size_t len,
 uint16_t nvme_dif_rw(NvmeCtrl *n, NvmeRequest *req);
 
 
-#endif /* HW_NVME_H */
+#endif /* HW_NVME_INTERNAL_H */
diff --git a/hw/nvme/trace.h b/hw/nvme/trace.h
new file mode 100644
index ..b398ea107f59
--- /dev/null
+++ b/hw/nvme/trace.h
@@ -0,0 +1 @@
+#include "trace/trace-hw_nvme.h"
diff --git a/hw/block/nvme.c b/hw/nvme/ctrl.c
similarity index 100%
rename from hw/block/nvme.c
rename to hw/nvme/ctrl.c
diff --git a/hw/block/nvme-dif.c b/hw/nvme/dif.c
similarity index 100%
rename from hw/block/nvme-dif.c
rename to hw/nvme/dif.c
diff --git a/hw/block/nvme-ns.c b/hw/nvme/ns.c
similarity index 100%
rename from hw/block/nvme-ns.c
rename to hw/nvme/ns.c
diff --git a/hw/block/nvme-subsys.c b/hw/nvme/subsys.c
similarity index 100%
rename from hw/block/nvme-subsys.c
rename to hw/nvme/subsys.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 36055f14c594..986e0e590de6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1972,7 +1972,7 @@ M: Keith Busch 
 M: Klaus Jensen 
 L: qemu-block@nongnu.org
 S: Supported
-F: hw/block/nvme*
+F: hw/nvme/*
 F: include/block/nvme.h
 F: tests/qtest/nvme-test.c
 F: docs/system/nvme.rst
diff --git a/hw/Kconfig b/hw/Kconfig
index ff40bd3f7bb7..61d0e654eb02 100644
--- a/hw/Kconfig
+++ b/hw/Kconfig
@@ -21,6 +21,7 @@ source mem/Kconfig
 source misc/Kconfig
 source net/Kconfig
 source nubus/Kconfig
+source nvme/Kconfig
 source nvram/Kconfig
 source pci-bridge/Kconfig
 source pci-host/Kconfig
diff --git a/hw/block/Kconfig b/hw/block/Kconfig
index 4fcd15216684..295441e64ab4 100644
--- a/hw/block/Kconfig
+++ b/hw/block/Kconfig
@@ -25,11 +25,6 @@ config ONENAND
 config TC58128
 bool
 
-config NVME_PCI
-bool
-default y if PCI_DEVICES
-depends on PCI
-
 config VIRTIO_BLK
 bool
 default y
diff --git a/hw/block/meson.build b/hw/block/meson.build
index 5b4a7699f98f..8b0de54db1fc 100644
--- a/hw/block/meson.build
+++ b/hw/block/meson.build
@@ -13,7 +13,6 @@ softmmu_ss.add(when: 'CONFIG_SSI_M25P80', if_true: 
files('m25p80.c'))
 softmmu_ss.add(when: 'CONFIG_SWIM', if_true: files('swim.c'))
 softmmu_ss.add(when: 'CONFIG_XEN', if_true: files('xen-block.c'))
 softmmu_ss.add(when: 'CONFIG_TC58128', if_true: files('tc58128.c'))
-softmmu_ss.add(when: 'CONFIG_NVME_PCI', if_true: files('nvme.c', 'nvme-ns.c', 
'nvme-subsys.c', 'nvme-dif.c'))
 
 specific_ss.add(when: 'CONFIG_VIRTIO_BLK', if_true: files('virtio-blk.c'))
 specific_ss.add(when: 'CONFIG_VHOST_USER_BLK', if_true: 
files('vhost-user-blk.c'))
diff --git a/hw/block/trace-events b/hw/block/trace-events
index fa12e3a67a75..646917d045f7 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events

[PATCH 11/14] hw/block/nvme: remove num_namespaces member

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

The NvmeCtrl num_namespaces member is just an indirection for the
NVME_MAX_NAMESPACES constant.

Remove the indirection.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h |  1 -
 hw/block/nvme.c | 30 +++---
 2 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index ac3f0a886735..fb028d81d16f 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -401,7 +401,6 @@ typedef struct NvmeCtrl {
 uint16_tcqe_size;
 uint16_tsqe_size;
 uint32_treg_size;
-uint32_tnum_namespaces;
 uint32_tmax_q_ents;
 uint8_t outstanding_aers;
 uint32_tirq_status;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 0b96936129d6..60424d9b19ea 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -382,7 +382,8 @@ static int nvme_addr_write(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 
 static bool nvme_nsid_valid(NvmeCtrl *n, uint32_t nsid)
 {
-return nsid && (nsid == NVME_NSID_BROADCAST || nsid <= n->num_namespaces);
+return nsid &&
+(nsid == NVME_NSID_BROADCAST || nsid <= NVME_MAX_NAMESPACES);
 }
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
@@ -2865,7 +2866,7 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
 /* 1-initialize; see comment in nvme_dsm */
 *num_flushes = 1;
 
-for (int i = 1; i <= n->num_namespaces; i++) {
+for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
 continue;
@@ -3835,7 +3836,7 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, 
uint32_t buf_len,
 } else {
 int i;
 
-for (i = 1; i <= n->num_namespaces; i++) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
 continue;
@@ -4332,7 +4333,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
NvmeRequest *req,
 return NVME_INVALID_NSID | NVME_DNR;
 }
 
-for (i = 1; i <= n->num_namespaces; i++) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
 if (!active) {
@@ -4380,7 +4381,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, 
NvmeRequest *req,
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-for (i = 1; i <= n->num_namespaces; i++) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
 if (!active) {
@@ -4646,7 +4647,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest 
*req)
 goto out;
 case NVME_VOLATILE_WRITE_CACHE:
 result = 0;
-for (i = 1; i <= n->num_namespaces; i++) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
 continue;
@@ -4796,7 +4797,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest 
*req)
 break;
 case NVME_ERROR_RECOVERY:
 if (nsid == NVME_NSID_BROADCAST) {
-for (i = 1; i <= n->num_namespaces; i++) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 
 if (!ns) {
@@ -4817,7 +4818,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest 
*req)
 }
 break;
 case NVME_VOLATILE_WRITE_CACHE:
-for (i = 1; i <= n->num_namespaces; i++) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
 continue;
@@ -5110,7 +5111,7 @@ static uint16_t nvme_format(NvmeCtrl *n, NvmeRequest *req)
 req->status = status;
 }
 } else {
-for (i = 1; i <= n->num_namespaces; i++) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
 continue;
@@ -5221,7 +5222,7 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
 NvmeNamespace *ns;
 int i;
 
-for (i = 1; i <= n->num_namespaces; i++) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
 continue;
@@ -5263,7 +5264,7 @@ static void nvme_ctrl_shutdown(NvmeCtrl *n)
 memory_region_msync(>pmr.dev->mr, 0, n->pmr.dev->size);
 }
 
-for (i = 1; i <= n->num_namespaces; i++) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
 continue;
@@ -5278,7 +5279,7 @@ static void nvme_select_iocs(NvmeCtrl *n)
 NvmeNamespace *ns;
 int i;
 
-for (i = 1; i <= n->num_namespaces; i++) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
 continue;
@@ -5905,7 +5906,6 @@ static void nvme_check_constraints(NvmeCtrl *n, Error 
**errp)
 
 static void nvme_init_state(NvmeCtrl *n)
 {
-n->num_namespaces = NVME_MAX_NAMESPACES;
 /* add one to max_ioqpairs to account for the admin queue pair */

[PATCH 13/14] hw/block/nvme: move zoned constraints checks

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

Validation of the max_active and max_open zoned parameters are
independent of any other state, so move them to the early
nvme_ns_check_constraints parameter checks.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme-ns.c | 52 +-
 1 file changed, 28 insertions(+), 24 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 2041d8138420..861b87f22bd8 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -210,30 +210,6 @@ static int nvme_ns_zoned_check_calc_geometry(NvmeNamespace 
*ns, Error **errp)
 return -1;
 }
 
-if (ns->params.max_active_zones) {
-if (ns->params.max_open_zones > ns->params.max_active_zones) {
-error_setg(errp, "max_open_zones (%u) exceeds max_active_zones 
(%u)",
-   ns->params.max_open_zones, ns->params.max_active_zones);
-return -1;
-}
-
-if (!ns->params.max_open_zones) {
-ns->params.max_open_zones = ns->params.max_active_zones;
-}
-}
-
-if (ns->params.zd_extension_size) {
-if (ns->params.zd_extension_size & 0x3f) {
-error_setg(errp,
-"zone descriptor extension size must be a multiple of 64B");
-return -1;
-}
-if ((ns->params.zd_extension_size >> 6) > 0xff) {
-error_setg(errp, "zone descriptor extension size is too large");
-return -1;
-}
-}
-
 return 0;
 }
 
@@ -403,6 +379,34 @@ static int nvme_ns_check_constraints(NvmeCtrl *n, 
NvmeNamespace *ns,
 }
 }
 
+if (ns->params.zoned) {
+if (ns->params.max_active_zones) {
+if (ns->params.max_open_zones > ns->params.max_active_zones) {
+error_setg(errp, "max_open_zones (%u) exceeds "
+   "max_active_zones (%u)", ns->params.max_open_zones,
+   ns->params.max_active_zones);
+return -1;
+}
+
+if (!ns->params.max_open_zones) {
+ns->params.max_open_zones = ns->params.max_active_zones;
+}
+}
+
+if (ns->params.zd_extension_size) {
+if (ns->params.zd_extension_size & 0x3f) {
+error_setg(errp, "zone descriptor extension size must be a "
+   "multiple of 64B");
+return -1;
+}
+if ((ns->params.zd_extension_size >> 6) > 0xff) {
+error_setg(errp,
+   "zone descriptor extension size is too large");
+return -1;
+}
+}
+}
+
 return 0;
 }
 
-- 
2.31.1

[PATCH 01/14] hw/block/nvme: rename __nvme_zrm_open

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

Get rid of the (reserved) double underscore use.

Cc: Philippe Mathieu-Daudé 
Cc: Thomas Huth 
Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 624a1431d072..002c0672b397 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1682,8 +1682,12 @@ static void nvme_zrm_auto_transition_zone(NvmeNamespace 
*ns)
 }
 }
 
-static uint16_t __nvme_zrm_open(NvmeNamespace *ns, NvmeZone *zone,
-bool implicit)
+enum {
+NVME_ZRM_AUTO = 1 << 0,
+};
+
+static uint16_t nvme_zrm_open_flags(NvmeNamespace *ns, NvmeZone *zone,
+int flags)
 {
 int act = 0;
 uint16_t status;
@@ -1707,7 +1711,7 @@ static uint16_t __nvme_zrm_open(NvmeNamespace *ns, 
NvmeZone *zone,
 
 nvme_aor_inc_open(ns);
 
-if (implicit) {
+if (flags & NVME_ZRM_AUTO) {
 nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_IMPLICITLY_OPEN);
 return NVME_SUCCESS;
 }
@@ -1715,7 +1719,7 @@ static uint16_t __nvme_zrm_open(NvmeNamespace *ns, 
NvmeZone *zone,
 /* fallthrough */
 
 case NVME_ZONE_STATE_IMPLICITLY_OPEN:
-if (implicit) {
+if (flags & NVME_ZRM_AUTO) {
 return NVME_SUCCESS;
 }
 
@@ -1733,12 +1737,12 @@ static uint16_t __nvme_zrm_open(NvmeNamespace *ns, 
NvmeZone *zone,
 
 static inline uint16_t nvme_zrm_auto(NvmeNamespace *ns, NvmeZone *zone)
 {
-return __nvme_zrm_open(ns, zone, true);
+return nvme_zrm_open_flags(ns, zone, NVME_ZRM_AUTO);
 }
 
 static inline uint16_t nvme_zrm_open(NvmeNamespace *ns, NvmeZone *zone)
 {
-return __nvme_zrm_open(ns, zone, false);
+return nvme_zrm_open_flags(ns, zone, 0);
 }
 
 static void __nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,
-- 
2.31.1

[PATCH 02/14] hw/block/nvme: rename __nvme_advance_zone_wp

2021-04-19 Thread Klaus Jensen

From: Klaus Jensen 

Get rid of the (reserved) double underscore use.

Cc: Philippe Mathieu-Daudé 
Cc: Thomas Huth 
Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 002c0672b397..d1b94e36c6fb 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1745,8 +1745,8 @@ static inline uint16_t nvme_zrm_open(NvmeNamespace *ns, 
NvmeZone *zone)
 return nvme_zrm_open_flags(ns, zone, 0);
 }
 
-static void __nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,
-   uint32_t nlb)
+static void nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,
+ uint32_t nlb)
 {
 zone->d.wp += nlb;
 
@@ -1766,7 +1766,7 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, 
NvmeRequest *req)
 nlb = le16_to_cpu(rw->nlb) + 1;
 zone = nvme_get_zone_by_slba(ns, slba);
 
-__nvme_advance_zone_wp(ns, zone, nlb);
+nvme_advance_zone_wp(ns, zone, nlb);
 }
 
 static inline bool nvme_is_write(NvmeRequest *req)
@@ -2155,7 +2155,7 @@ out:
 uint64_t sdlba = le64_to_cpu(copy->sdlba);
 NvmeZone *zone = nvme_get_zone_by_slba(ns, sdlba);
 
-__nvme_advance_zone_wp(ns, zone, ctx->nlb);
+nvme_advance_zone_wp(ns, zone, ctx->nlb);
 }
 
 g_free(ctx->bounce);
-- 
2.31.1

Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection

2021-04-19 Thread Peter Lieven




Von meinem iPhone gesendet

> Am 19.04.2021 um 14:31 schrieb Kevin Wolf :
> 
> Am 19.04.2021 um 11:13 hat Peter Lieven geschrieben:
>> 
>> 
 Am 19.04.2021 um 10:36 schrieb Peter Lieven :
>>> 
>>> 
>>> 
 Am 15.04.2021 um 17:22 schrieb Kevin Wolf :
 
 Peter, three years ago you changed 'qemu-img convert' to sacrifice some
 sparsification in order to get aligned requests on the target image. At
 the time, I thought the impact would be small, but it turns out that
 this can end up wasting gigabytes of storagee (like converting a fully
 zeroed 10 GB image taking 2.8 GB instead of a few kilobytes).
 
 https://bugzilla.redhat.com/show_bug.cgi?id=1882917
 
 I'm not entirely sure how to attack this best since this is a tradeoff,
 but maybe the approach in this series is still good enough for the case
 that you wanted to fix back then?
 
 Of course, it would be possible to have a more complete fix like looking
 forward a few blocks more before writing data, but that would probably
 not be entirely trivial because you would have to merge blocks with ZERO
 block status with DATA blocks that contain only zeros. I'm not sure if
 it's worth this complication of the code.
>>> 
>>> I will try to look into this asap.
>> 
>> Besides from the reproducer described in the ticket, I retried my old
>> conversion test in our environment:
>> 
>> Before commit 8dcd3c9b91: reads 4608 writes 14959
>> After commit 8dcd3c9b91: reads 0 writes 14924
>> With Kevins patch: reads 110 writes 14924
>> 
>> I think this is a good result if it avoids other issues.
> 
> Sounds like a promising way to make the tradeoff. Thanks for testing!

is this sth for 6.0-rc4?

Peter

Re: [for-6.1 0/4] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci

2021-04-19 Thread Greg Kurz

Ping ?

On Wed, 7 Apr 2021 16:34:57 +0200
Greg Kurz  wrote:

> Now that virtio-scsi-pci and virtio-blk-pci map 1 virtqueue per vCPU,
> a serious slow down may be observed on setups with a big enough number
> of vCPUs.
> 
> Exemple with a pseries guest on a bi-POWER9 socket system (128 HW threads):
> 
>   virtio-scsi  virtio-blk
> 
> 1 0m20.922s   0m21.346s
> 2 0m21.230s   0m20.350s
> 4 0m21.761s   0m20.997s
> 8 0m22.770s   0m20.051s
> 160m22.038s   0m19.994s
> 320m22.928s   0m20.803s
> 640m26.583s   0m22.953s
> 128   0m41.273s   0m32.333s
> 256   2m4.727s1m16.924s
> 384   6m5.563s3m26.186s
> 
> Both perf and gprof indicate that QEMU is hogging CPUs when setting up
> the ioeventfds:
> 
>  67.88%  swapper [kernel.kallsyms]  [k] power_pmu_enable
>   9.47%  qemu-kvm[kernel.kallsyms]  [k] smp_call_function_single
>   8.64%  qemu-kvm[kernel.kallsyms]  [k] power_pmu_enable
> =>2.79%  qemu-kvmqemu-kvm   [.] memory_region_ioeventfd_before
> =>2.12%  qemu-kvmqemu-kvm   [.] 
> address_space_update_ioeventfds
>   0.56%  kworker/8:0-mm  [kernel.kallsyms]  [k] smp_call_function_single
> 
> address_space_update_ioeventfds() is called when committing an MR
> transaction, i.e. for each ioeventfd with the current code base,
> and it internally loops on all ioventfds:
> 
> static void address_space_update_ioeventfds(AddressSpace *as)
> {
> [...]
> FOR_EACH_FLAT_RANGE(fr, view) {
> for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
> 
> This means that the setup of ioeventfds for these devices has
> quadratic time complexity.
> 
> This series simply changes the device models to extend the transaction
> to all virtqueueues, like already done in the past in the generic
> code with 710fccf80d78 ("virtio: improve virtio devices initialization
> time").
> 
> Only virtio-scsi and virtio-blk are covered here, but a similar change
> might also be beneficial to other device types such as host-scsi-pci,
> vhost-user-scsi-pci and vhost-user-blk-pci.
> 
>   virtio-scsi  virtio-blk
> 
> 1 0m21.271s   0m22.076s
> 2 0m20.912s   0m19.716s
> 4 0m20.508s   0m19.310s
> 8 0m21.374s   0m20.273s
> 160m21.559s   0m21.374s
> 320m22.532s   0m21.271s
> 640m26.550s   0m22.007s
> 128   0m29.115s   0m27.446s
> 256   0m44.752s   0m41.004s
> 384   1m2.884s0m58.023s
> 
> This should fix https://bugzilla.redhat.com/show_bug.cgi?id=1927108
> which reported the issue for virtio-scsi-pci.
> 
> Changes since RFC:
> 
> As suggested by Stefan, splimplify the code by directly beginning and
> committing the memory transaction from the device model, without all
> the virtio specific proxying code and no changes needed in the memory
> subsystem.
> 
> Greg Kurz (4):
>   virtio-blk: Fix rollback path in virtio_blk_data_plane_start()
>   virtio-blk: Configure all host notifiers in a single MR transaction
>   virtio-scsi: Set host notifiers and callbacks separately
>   virtio-scsi: Configure all host notifiers in a single MR transaction
> 
>  hw/block/dataplane/virtio-blk.c | 36 +++--
>  hw/scsi/virtio-scsi-dataplane.c | 56 ++---
>  2 files changed, 72 insertions(+), 20 deletions(-)
>

Re: [PATCH] block/file-posix: Fix problem with fallocate(PUNCH_HOLE) on GPFS

2021-04-19 Thread Kevin Wolf

Am 19.04.2021 um 07:06 hat Thomas Huth geschrieben:
> On 16/04/2021 22.34, Nir Soffer wrote:
> > On Fri, Apr 16, 2021 at 8:23 AM Thomas Huth  wrote:
> > > 
> > > A customer reported that running
> > > 
> > >   qemu-img convert -t none -O qcow2 -f qcow2 input.qcow2 output.qcow2
> > > 
> > > fails for them with the following error message when the images are
> > > stored on a GPFS file system:
> > > 
> > >   qemu-img: error while writing sector 0: Invalid argument
> > > 
> > > After analyzing the strace output, it seems like the problem is in
> > > handle_aiocb_write_zeroes(): The call to fallocate(FALLOC_FL_PUNCH_HOLE)
> > > returns EINVAL, which can apparently happen if the file system has
> > > a different idea of the granularity of the operation. It's arguably
> > > a bug in GPFS, since the PUNCH_HOLE mode should not result in EINVAL
> > > according to the man-page of fallocate(), but the file system is out
> > > there in production and so we have to deal with it. In commit 294682cc3a
> > > ("block: workaround for unaligned byte range in fallocate()") we also
> > > already applied the a work-around for the same problem to the earlier
> > > fallocate(FALLOC_FL_ZERO_RANGE) call, so do it now similar with the
> > > PUNCH_HOLE call.
> > > 
> > > Signed-off-by: Thomas Huth 
> > > ---
> > >   block/file-posix.c | 7 +++
> > >   1 file changed, 7 insertions(+)
> > > 
> > > diff --git a/block/file-posix.c b/block/file-posix.c
> > > index 20e14f8e96..7a40428d52 100644
> > > --- a/block/file-posix.c
> > > +++ b/block/file-posix.c
> > > @@ -1675,6 +1675,13 @@ static int handle_aiocb_write_zeroes(void *opaque)
> > >   }
> > >   s->has_fallocate = false;
> > >   } else if (ret != -ENOTSUP) {
> > > +if (ret == -EINVAL) {
> > > +/*
> > > + * File systems like GPFS do not like unaligned byte 
> > > ranges,
> > > + * treat it like unsupported (so caller falls back to 
> > > pwrite)
> > > + */
> > > +return -ENOTSUP;
> > 
> > This skips the next fallback, using plain fallocate(0) if we write
> > after the end of the file. Is this intended?
> > 
> > We can treat the buggy EINVAL return value as "filesystem is buggy,
> > let's not try other options", or "let's try the next option". Since falling
> > back to actually writing zeroes is so much slower, I think it is better to
> > try the next option.
> 
> I just did the same work-around as in commit 294682cc3a7 ... so if we agree
> to try the other options, too, we should change that spot, too...

Yes, changing both places to fall back to the next option feels right to
me.

> However, what is not clear to me, how would you handle s->has_write_zeroes
> and s->has_discard in such a case? Set them to "false"? ... but it could
> still work for some blocks with different alignment ... but if we keep them
> set to "true", the code tries again and again to call these ioctls, maybe
> wasting other precious cycles for this?

That it could still work for other requests is a good point. So I think
EINVAL shouldn't disable s->has_*, but otherwise behave the same as
ENOTSUP.

You're right that we're potentially wasting cycles for trying unaligned
requests again and again, but that they fail isn't our fault and the
benefit of having efficient zero writes with aligned requests seems more
important that losing a few cycles on unaligned requests.

> Maybe we should do a different approach instead: In case we hit a EINVAL
> here, print an error a la:
> 
>  error_report_once("You are running on a buggy file system, please complain
> to the file system vendor");
> 
> and return -ENOTSUP ... then it's hopefully clear to the users why they are
> getting a bad performance, and that they should complain to the file system
> vendor instead to get their problem fixed.

Sounds like a reasonable thing to do (probably in addition to the above)
when we know that a file system bug prevents us from getting optimal
performance.

Kevin

Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection

2021-04-19 Thread Kevin Wolf

Am 19.04.2021 um 11:13 hat Peter Lieven geschrieben:
> 
> 
> > Am 19.04.2021 um 10:36 schrieb Peter Lieven :
> > 
> > 
> > 
> >> Am 15.04.2021 um 17:22 schrieb Kevin Wolf :
> >> 
> >> Peter, three years ago you changed 'qemu-img convert' to sacrifice some
> >> sparsification in order to get aligned requests on the target image. At
> >> the time, I thought the impact would be small, but it turns out that
> >> this can end up wasting gigabytes of storagee (like converting a fully
> >> zeroed 10 GB image taking 2.8 GB instead of a few kilobytes).
> >> 
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1882917
> >> 
> >> I'm not entirely sure how to attack this best since this is a tradeoff,
> >> but maybe the approach in this series is still good enough for the case
> >> that you wanted to fix back then?
> >> 
> >> Of course, it would be possible to have a more complete fix like looking
> >> forward a few blocks more before writing data, but that would probably
> >> not be entirely trivial because you would have to merge blocks with ZERO
> >> block status with DATA blocks that contain only zeros. I'm not sure if
> >> it's worth this complication of the code.
> > 
> > I will try to look into this asap.
> 
> Besides from the reproducer described in the ticket, I retried my old
> conversion test in our environment:
> 
> Before commit 8dcd3c9b91: reads 4608 writes 14959
> After commit 8dcd3c9b91: reads 0 writes 14924
> With Kevins patch: reads 110 writes 14924
> 
> I think this is a good result if it avoids other issues.

Sounds like a promising way to make the tradeoff. Thanks for testing!

Kevin

Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection

2021-04-19 Thread Kevin Wolf

Am 19.04.2021 um 10:36 hat Peter Lieven geschrieben:
> 
> 
> > Am 15.04.2021 um 17:22 schrieb Kevin Wolf :
> > 
> > Peter, three years ago you changed 'qemu-img convert' to sacrifice some
> > sparsification in order to get aligned requests on the target image. At
> > the time, I thought the impact would be small, but it turns out that
> > this can end up wasting gigabytes of storagee (like converting a fully
> > zeroed 10 GB image taking 2.8 GB instead of a few kilobytes).
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1882917
> > 
> > I'm not entirely sure how to attack this best since this is a tradeoff,
> > but maybe the approach in this series is still good enough for the case
> > that you wanted to fix back then?
> > 
> > Of course, it would be possible to have a more complete fix like looking
> > forward a few blocks more before writing data, but that would probably
> > not be entirely trivial because you would have to merge blocks with ZERO
> > block status with DATA blocks that contain only zeros. I'm not sure if
> > it's worth this complication of the code.
> 
> I will try to look into this asap.
> 
> Is there a hint which FS I need to set the extent hint when creating
> the raw image? I was not able to do that.

Grepping the current kernel source, it seems extent size hints still
work only on XFS. But I don't think it's necessary for reproducing this
bug. In fact, disabling the extent size hint should cause a lot more
fragmentation, which should make the problem more visible.

Kevin

Re: [PATCH] hw/block/nvme: fix io-command set profile feature

2021-04-19 Thread Klaus Jensen


On Apr 19 16:18, Gollu Appalanaidu wrote:

Currently IO Command Set Profile feaure is supported, but
feature support flag not set and this feature is changable
add support for that.

Remove filling default value of feature in CQE CDW0 with zero,
since it fallbacks to default case and it is zero initialized,
if feature default value not set explicitly.

Signed-off-by: Gollu Appalanaidu 
---
hw/block/nvme.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 624a1431d0..b5d2c29fc4 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -185,6 +185,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
[NVME_WRITE_ATOMICITY]  = true,
[NVME_ASYNCHRONOUS_EVENT_CONF]  = true,
[NVME_TIMESTAMP]= true,
+[NVME_COMMAND_SET_PROFILE]  = true,
};

static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
@@ -194,6 +195,7 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
[NVME_NUMBER_OF_QUEUES] = NVME_FEAT_CAP_CHANGE,
[NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
[NVME_TIMESTAMP]= NVME_FEAT_CAP_CHANGE,
+[NVME_COMMAND_SET_PROFILE]  = NVME_FEAT_CAP_CHANGE,
};

static const uint32_t nvme_cse_acs[256] = {
@@ -4707,9 +4709,6 @@ defaults:
result |= NVME_INTVC_NOCOALESCING;
}
break;
-case NVME_COMMAND_SET_PROFILE:
-result = 0;
-break;
default:
result = nvme_feature_default[fid];
break;
--
2.17.1




LGTM.

Reviewed-by: Klaus Jensen 


signature.asc
Description: PGP signature

[PATCH] hw/block/nvme: fix io-command set profile feature

2021-04-19 Thread Gollu Appalanaidu

Currently IO Command Set Profile feaure is supported, but
feature support flag not set and this feature is changable
add support for that.

Remove filling default value of feature in CQE CDW0 with zero,
since it fallbacks to default case and it is zero initialized,
if feature default value not set explicitly.

Signed-off-by: Gollu Appalanaidu 
---
 hw/block/nvme.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 624a1431d0..b5d2c29fc4 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -185,6 +185,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
 [NVME_WRITE_ATOMICITY]  = true,
 [NVME_ASYNCHRONOUS_EVENT_CONF]  = true,
 [NVME_TIMESTAMP]= true,
+[NVME_COMMAND_SET_PROFILE]  = true,
 };
 
 static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
@@ -194,6 +195,7 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
 [NVME_NUMBER_OF_QUEUES] = NVME_FEAT_CAP_CHANGE,
 [NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
 [NVME_TIMESTAMP]= NVME_FEAT_CAP_CHANGE,
+[NVME_COMMAND_SET_PROFILE]  = NVME_FEAT_CAP_CHANGE,
 };
 
 static const uint32_t nvme_cse_acs[256] = {
@@ -4707,9 +4709,6 @@ defaults:
 result |= NVME_INTVC_NOCOALESCING;
 }
 break;
-case NVME_COMMAND_SET_PROFILE:
-result = 0;
-break;
 default:
 result = nvme_feature_default[fid];
 break;
-- 
2.17.1

Re: [PATCH v3 21/33] qemu-socket: pass monitor link to socket_get_fd directly

2021-04-19 Thread Vladimir Sementsov-Ogievskiy


19.04.2021 12:34, Daniel P. Berrangé wrote:

On Fri, Apr 16, 2021 at 11:08:59AM +0300, Vladimir Sementsov-Ogievskiy wrote:

Detecting monitor by current coroutine works bad when we are not in
coroutine context. And that's exactly so in nbd reconnect code, where
qio_channel_socket_connect_sync() is called from thread.

Add a possibility to pass monitor by hand, to be used in the following
commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/io/channel-socket.h| 20 
  include/qemu/sockets.h |  2 +-
  io/channel-socket.c| 17 +
  tests/unit/test-util-sockets.c | 16 
  util/qemu-sockets.c| 10 +-
  5 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index e747e63514..6d0915420d 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -78,6 +78,23 @@ qio_channel_socket_new_fd(int fd,
Error **errp);
  
  
+/**

+ * qio_channel_socket_connect_sync_mon:
+ * @ioc: the socket channel object
+ * @addr: the address to connect to
+ * @mon: current monitor. If NULL, it will be detected by
+ *   current coroutine.
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Attempt to connect to the address @addr. This method
+ * will run in the foreground so the caller will not regain
+ * execution control until the connection is established or
+ * an error occurs.
+ */
+int qio_channel_socket_connect_sync_mon(QIOChannelSocket *ioc,
+SocketAddress *addr,
+Monitor *mon,
+Error **errp);


I don't really like exposing the concept of the QEMU monitor in
the IO layer APIs. IMHO these ought to remain completely separate
subsystems from the API pov, and we ought to fix this problem by
making monitor_cur() work better in the scenario required.



Hmm..

I can add thread_mon hash-table to monitor/monitor.c, so we can set monitor for 
current thread in same way like for coroutine. And monitor_cur will look first at 
coroutine->monitor hash map and then to thread->monitor. Then, I'll pass needed 
monitor link to my specific thread, and thread will call 
monitor_set_cur_for_thread(), an then qio_channel_socket_connect_sync() will work 
correctly.

David, Markus, is it OK?

--
Best regards,
Vladimir

[PATCH v2 6/7] hw/block/pflash_cfi02: Remove pflash_setup_mappings()

2021-04-19 Thread Philippe Mathieu-Daudé

All boards calling pflash_cfi02_register() use nb_mappings=1,
which does not do any mapping:

  $ git grep -wl pflash_cfi02_register hw/
  hw/arm/xilinx_zynq.c
  hw/block/pflash_cfi02.c
  hw/lm32/lm32_boards.c
  hw/ppc/ppc405_boards.c
  hw/sh4/r2d.c

We can remove this now unneeded code.

Reviewed-by: David Gibson 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/pflash_cfi02.c | 35 ++-
 1 file changed, 2 insertions(+), 33 deletions(-)

diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index 02c514fb6e0..6f4b3e3c3fe 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -75,7 +75,6 @@ struct PFlashCFI02 {
 uint32_t nb_blocs[PFLASH_MAX_ERASE_REGIONS];
 uint32_t sector_len[PFLASH_MAX_ERASE_REGIONS];
 uint32_t chip_len;
-uint8_t mappings;
 uint8_t width;
 uint8_t be;
 int wcycle; /* if 0, the flash is read normally */
@@ -92,13 +91,6 @@ struct PFlashCFI02 {
 uint16_t unlock_addr1;
 uint8_t cfi_table[0x4d];
 QEMUTimer timer;
-/*
- * The device replicates the flash memory across its memory space.  Emulate
- * that by having a container (.mem) filled with an array of aliases
- * (.mem_mappings) pointing to the flash memory (.orig_mem).
- */
-MemoryRegion mem;
-MemoryRegion *mem_mappings;/* array; one per mapping */
 MemoryRegion orig_mem;
 bool rom_mode;
 int read_counter; /* used for lazy switch-back to rom mode */
@@ -158,23 +150,6 @@ static inline void toggle_dq2(PFlashCFI02 *pfl)
 pfl->status ^= 0x04;
 }
 
-/*
- * Set up replicated mappings of the same region.
- */
-static void pflash_setup_mappings(PFlashCFI02 *pfl)
-{
-unsigned i;
-hwaddr size = memory_region_size(>orig_mem);
-
-memory_region_init(>mem, OBJECT(pfl), "pflash", pfl->mappings * size);
-pfl->mem_mappings = g_new(MemoryRegion, pfl->mappings);
-for (i = 0; i < pfl->mappings; ++i) {
-memory_region_init_alias(>mem_mappings[i], OBJECT(pfl),
- "pflash-alias", >orig_mem, 0, size);
-memory_region_add_subregion(>mem, i * size, 
>mem_mappings[i]);
-}
-}
-
 static void pflash_reset_state_machine(PFlashCFI02 *pfl)
 {
 trace_pflash_reset(pfl->name);
@@ -917,12 +892,7 @@ static void pflash_cfi02_realize(DeviceState *dev, Error 
**errp)
 pfl->sector_erase_map = bitmap_new(pfl->total_sectors);
 
 pfl->rom_mode = true;
-if (pfl->mappings > 1) {
-pflash_setup_mappings(pfl);
-sysbus_init_mmio(SYS_BUS_DEVICE(dev), >mem);
-} else {
-sysbus_init_mmio(SYS_BUS_DEVICE(dev), >orig_mem);
-}
+sysbus_init_mmio(SYS_BUS_DEVICE(dev), >orig_mem);
 
 timer_init_ns(>timer, QEMU_CLOCK_VIRTUAL, pflash_timer, pfl);
 pfl->status = 0;
@@ -950,7 +920,6 @@ static Property pflash_cfi02_properties[] = {
 DEFINE_PROP_UINT32("num-blocks3", PFlashCFI02, nb_blocs[3], 0),
 DEFINE_PROP_UINT32("sector-length3", PFlashCFI02, sector_len[3], 0),
 DEFINE_PROP_UINT8("width", PFlashCFI02, width, 0),
-DEFINE_PROP_UINT8("mappings", PFlashCFI02, mappings, 0),
 DEFINE_PROP_UINT8("big-endian", PFlashCFI02, be, 0),
 DEFINE_PROP_UINT16("id0", PFlashCFI02, ident0, 0),
 DEFINE_PROP_UINT16("id1", PFlashCFI02, ident1, 0),
@@ -1008,6 +977,7 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base,
 {
 DeviceState *dev = qdev_new(TYPE_PFLASH_CFI02);
 
+assert(nb_mappings <= 1);
 if (blk) {
 qdev_prop_set_drive(dev, "drive", blk);
 }
@@ -1015,7 +985,6 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base,
 qdev_prop_set_uint32(dev, "num-blocks", size / sector_len);
 qdev_prop_set_uint32(dev, "sector-length", sector_len);
 qdev_prop_set_uint8(dev, "width", width);
-qdev_prop_set_uint8(dev, "mappings", nb_mappings);
 qdev_prop_set_uint8(dev, "big-endian", !!be);
 qdev_prop_set_uint16(dev, "id0", id0);
 qdev_prop_set_uint16(dev, "id1", id1);
-- 
2.26.3

[PATCH v2 5/7] hw/arm/digic: Map flash using memory_region_add_subregion_aliased()

2021-04-19 Thread Philippe Mathieu-Daudé

Instead of using a device specific feature for mapping the
flash memory multiple times over a wider region, use the
generic memory_region_add_subregion_aliased() helper.

There is no change in the memory layout.

* before:

  $ qemu-system-arm -M canon-a1100 -S -monitor stdio
  QEMU 5.2.90 monitor - type 'help' for more information
  (qemu) info mtree
  address-space: memory
- (prio 0, i/o): system
  -03ff (prio 0, ram): ram
  c021-c02100ff (prio 0, i/o): digic-timer
  c0210100-c02101ff (prio 0, i/o): digic-timer
  c0210200-c02102ff (prio 0, i/o): digic-timer
  c080-c0800017 (prio 0, i/o): digic-uart
  f800- (prio 0, i/o): pflash
f800-f83f (prio 0, romd): alias pflash-alias 
@pflash -003f
f840-f87f (prio 0, romd): alias pflash-alias 
@pflash -003f
f880-f8bf (prio 0, romd): alias pflash-alias 
@pflash -003f
...
ff40-ff7f (prio 0, romd): alias pflash-alias 
@pflash -003f
ff80-ffbf (prio 0, romd): alias pflash-alias 
@pflash -003f
ffc0- (prio 0, romd): alias pflash-alias 
@pflash -003f

* after:

  (qemu) info mtree
  address-space: memory
- (prio 0, i/o): system
  -03ff (prio 0, ram): ram
  c021-c02100ff (prio 0, i/o): digic-timer
  c0210100-c02101ff (prio 0, i/o): digic-timer
  c0210200-c02102ff (prio 0, i/o): digic-timer
  c080-c0800017 (prio 0, i/o): digic-uart
  f800- (prio 0, i/o): masked pflash [span of 4 
MiB]
f800-f83f (prio 0, romd): alias pflash [#0/32] 
@pflash -003f
f840-f87f (prio 0, romd): alias pflash [#1/32] 
@pflash -003f
f880-f8bf (prio 0, romd): alias pflash [#2/32] 
@pflash -003f
...
ff40-ff7f (prio 0, romd): alias pflash [#29/32] 
@pflash -003f
ff80-ffbf (prio 0, romd): alias pflash [#30/32] 
@pflash -003f
ffc0- (prio 0, romd): alias pflash [#31/32] 
@pflash -003f

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/arm/digic_boards.c | 8 +---
 hw/arm/Kconfig| 1 +
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/hw/arm/digic_boards.c b/hw/arm/digic_boards.c
index fc4a671b2e1..293402b1240 100644
--- a/hw/arm/digic_boards.c
+++ b/hw/arm/digic_boards.c
@@ -128,8 +128,7 @@ static void digic4_add_k8p3215uqb_rom(DigicState *s, hwaddr 
addr,
  FLASH_K8P3215UQB_SIZE / FLASH_K8P3215UQB_SECTOR_SIZE);
 qdev_prop_set_uint32(dev, "sector-length", FLASH_K8P3215UQB_SECTOR_SIZE);
 qdev_prop_set_uint8(dev, "width", 4); /* 32-bit */
-qdev_prop_set_uint8(dev, "mappings",
-DIGIC4_ROM_MAX_SIZE / FLASH_K8P3215UQB_SIZE);
+qdev_prop_set_uint8(dev, "mappings", 0);
 qdev_prop_set_uint8(dev, "big-endian", 0);
 qdev_prop_set_uint16(dev, "id0", 0x00ec);
 qdev_prop_set_uint16(dev, "id1", 0x007e);
@@ -140,7 +139,10 @@ static void digic4_add_k8p3215uqb_rom(DigicState *s, 
hwaddr addr,
 qdev_prop_set_string(dev, "name", "pflash");
 sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), _fatal);
 
-sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, addr);
+memory_region_add_subregion_aliased(get_system_memory(),
+addr, DIGIC4_ROM_MAX_SIZE,
+sysbus_mmio_get_region(SYS_BUS_DEVICE(dev), 0),
+FLASH_K8P3215UQB_SIZE);
 
 digic_load_rom(s, addr, FLASH_K8P3215UQB_SIZE, filename);
 }
diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index aa8553b3cd3..1a7b9724d6c 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -42,6 +42,7 @@ config DIGIC
 bool
 select PTIMER
 select PFLASH_CFI02
+select ALIASED_REGION
 
 config EXYNOS4
 bool
-- 
2.26.3

[PATCH v2 7/7] hw/block/pflash_cfi02: Simplify pflash_cfi02_register() prototype

2021-04-19 Thread Philippe Mathieu-Daudé

The previous commit removed the mapping code from TYPE_PFLASH_CFI02.
pflash_cfi02_register() doesn't use the 'nb_mappings' argument
anymore. Simply remove it to simplify.

Reviewed-by: David Gibson 
Signed-off-by: Philippe Mathieu-Daudé 
---
 include/hw/block/flash.h | 1 -
 hw/arm/digic_boards.c| 1 -
 hw/arm/musicpal.c| 1 -
 hw/arm/xilinx_zynq.c | 2 +-
 hw/block/pflash_cfi02.c  | 3 +--
 hw/lm32/lm32_boards.c| 4 ++--
 hw/ppc/ppc405_boards.c   | 6 +++---
 hw/sh4/r2d.c | 2 +-
 8 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/include/hw/block/flash.h b/include/hw/block/flash.h
index 7dde0adcee7..0e5dd818a9d 100644
--- a/include/hw/block/flash.h
+++ b/include/hw/block/flash.h
@@ -36,7 +36,6 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base,
hwaddr size,
BlockBackend *blk,
uint32_t sector_len,
-   int nb_mappings,
int width,
uint16_t id0, uint16_t id1,
uint16_t id2, uint16_t id3,
diff --git a/hw/arm/digic_boards.c b/hw/arm/digic_boards.c
index 293402b1240..eb694c70d4c 100644
--- a/hw/arm/digic_boards.c
+++ b/hw/arm/digic_boards.c
@@ -128,7 +128,6 @@ static void digic4_add_k8p3215uqb_rom(DigicState *s, hwaddr 
addr,
  FLASH_K8P3215UQB_SIZE / FLASH_K8P3215UQB_SECTOR_SIZE);
 qdev_prop_set_uint32(dev, "sector-length", FLASH_K8P3215UQB_SECTOR_SIZE);
 qdev_prop_set_uint8(dev, "width", 4); /* 32-bit */
-qdev_prop_set_uint8(dev, "mappings", 0);
 qdev_prop_set_uint8(dev, "big-endian", 0);
 qdev_prop_set_uint16(dev, "id0", 0x00ec);
 qdev_prop_set_uint16(dev, "id1", 0x007e);
diff --git a/hw/arm/musicpal.c b/hw/arm/musicpal.c
index 7d1f2f3fb3f..e882e11df36 100644
--- a/hw/arm/musicpal.c
+++ b/hw/arm/musicpal.c
@@ -1657,7 +1657,6 @@ static void musicpal_init(MachineState *machine)
 qdev_prop_set_uint32(dev, "num-blocks", flash_size / sector_size);
 qdev_prop_set_uint32(dev, "sector-length", sector_size);
 qdev_prop_set_uint8(dev, "width", 2); /* 16-bit */
-qdev_prop_set_uint8(dev, "mappings", 0);
 qdev_prop_set_uint8(dev, "big-endian", 0);
 qdev_prop_set_uint16(dev, "id0", 0x00bf);
 qdev_prop_set_uint16(dev, "id1", 0x236d);
diff --git a/hw/arm/xilinx_zynq.c b/hw/arm/xilinx_zynq.c
index 8db6cfd47f5..d12b00e7648 100644
--- a/hw/arm/xilinx_zynq.c
+++ b/hw/arm/xilinx_zynq.c
@@ -220,7 +220,7 @@ static void zynq_init(MachineState *machine)
 pflash_cfi02_register(0xe200, "zynq.pflash", FLASH_SIZE,
   dinfo ? blk_by_legacy_dinfo(dinfo) : NULL,
   FLASH_SECTOR_SIZE, 1,
-  1, 0x0066, 0x0022, 0x, 0x, 0x0555, 0x2aa,
+  0x0066, 0x0022, 0x, 0x, 0x0555, 0x2aa,
   0);
 
 /* Create the main clock source, and feed slcr with it */
diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index 6f4b3e3c3fe..2b412402fac 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -968,7 +968,7 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base,
hwaddr size,
BlockBackend *blk,
uint32_t sector_len,
-   int nb_mappings, int width,
+   int width,
uint16_t id0, uint16_t id1,
uint16_t id2, uint16_t id3,
uint16_t unlock_addr0,
@@ -977,7 +977,6 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base,
 {
 DeviceState *dev = qdev_new(TYPE_PFLASH_CFI02);
 
-assert(nb_mappings <= 1);
 if (blk) {
 qdev_prop_set_drive(dev, "drive", blk);
 }
diff --git a/hw/lm32/lm32_boards.c b/hw/lm32/lm32_boards.c
index b5d97dd53ed..96877ba7cfb 100644
--- a/hw/lm32/lm32_boards.c
+++ b/hw/lm32/lm32_boards.c
@@ -121,7 +121,7 @@ static void lm32_evr_init(MachineState *machine)
 pflash_cfi02_register(flash_base, "lm32_evr.flash", flash_size,
   dinfo ? blk_by_legacy_dinfo(dinfo) : NULL,
   flash_sector_size,
-  1, 2, 0x01, 0x7e, 0x43, 0x00, 0x555, 0x2aa, 1);
+  2, 0x01, 0x7e, 0x43, 0x00, 0x555, 0x2aa, 1);
 
 /* create irq lines */
 env->pic_state = lm32_pic_init(qemu_allocate_irq(cpu_irq_handler, cpu, 0));
@@ -218,7 +218,7 @@ static void lm32_uclinux_init(MachineState *machine)
 pflash_cfi02_register(flash_base, "lm32_uclinux.flash", flash_size,
   dinfo ? blk_by_legacy_dinfo(dinfo) : NULL,
   flash_sector_size,
-  1, 2, 0x01, 0x7e, 0x43, 0x00, 0x555,

[PATCH v2 4/7] hw/arm/digic: Open-code pflash_cfi02_register() call

2021-04-19 Thread Philippe Mathieu-Daudé

To be able to manually map the flash region on the main memory
(in the next commit), first expand the pflash_cfi02_register
in place.

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/arm/digic_boards.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/hw/arm/digic_boards.c b/hw/arm/digic_boards.c
index 6cdc1d83fca..fc4a671b2e1 100644
--- a/hw/arm/digic_boards.c
+++ b/hw/arm/digic_boards.c
@@ -31,6 +31,8 @@
 #include "hw/boards.h"
 #include "exec/address-spaces.h"
 #include "qemu/error-report.h"
+#include "hw/qdev-properties.h"
+#include "hw/misc/aliased_region.h"
 #include "hw/arm/digic.h"
 #include "hw/block/flash.h"
 #include "hw/loader.h"
@@ -120,12 +122,25 @@ static void digic4_add_k8p3215uqb_rom(DigicState *s, 
hwaddr addr,
 #define FLASH_K8P3215UQB_SIZE (4 * 1024 * 1024)
 #define FLASH_K8P3215UQB_SECTOR_SIZE (64 * 1024)
 
-pflash_cfi02_register(addr, "pflash", FLASH_K8P3215UQB_SIZE,
-  NULL, FLASH_K8P3215UQB_SECTOR_SIZE,
-  DIGIC4_ROM_MAX_SIZE / FLASH_K8P3215UQB_SIZE,
-  4,
-  0x00EC, 0x007E, 0x0003, 0x0001,
-  0x0555, 0x2aa, 0);
+DeviceState *dev = qdev_new(TYPE_PFLASH_CFI02);
+
+qdev_prop_set_uint32(dev, "num-blocks",
+ FLASH_K8P3215UQB_SIZE / FLASH_K8P3215UQB_SECTOR_SIZE);
+qdev_prop_set_uint32(dev, "sector-length", FLASH_K8P3215UQB_SECTOR_SIZE);
+qdev_prop_set_uint8(dev, "width", 4); /* 32-bit */
+qdev_prop_set_uint8(dev, "mappings",
+DIGIC4_ROM_MAX_SIZE / FLASH_K8P3215UQB_SIZE);
+qdev_prop_set_uint8(dev, "big-endian", 0);
+qdev_prop_set_uint16(dev, "id0", 0x00ec);
+qdev_prop_set_uint16(dev, "id1", 0x007e);
+qdev_prop_set_uint16(dev, "id2", 0x0003);
+qdev_prop_set_uint16(dev, "id3", 0x0001);
+qdev_prop_set_uint16(dev, "unlock-addr0", 0x0555);
+qdev_prop_set_uint16(dev, "unlock-addr1", 0x2aa);
+qdev_prop_set_string(dev, "name", "pflash");
+sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), _fatal);
+
+sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, addr);
 
 digic_load_rom(s, addr, FLASH_K8P3215UQB_SIZE, filename);
 }
-- 
2.26.3

[PATCH v2 3/7] hw/arm/musicpal: Map flash using memory_region_add_subregion_aliased()

2021-04-19 Thread Philippe Mathieu-Daudé

Instead of using a device specific feature for mapping the
flash memory multiple times over a wider region, use the
generic memory_region_add_subregion_aliased() helper.

There is no change in the memory layout:

- before:

  (qemu) info mtree
  fe00- (prio 0, i/o): pflash
fe00-fe7f (prio 0, romd): alias pflash-alias 
@musicpal.flash -007f
fe80-feff (prio 0, romd): alias pflash-alias 
@musicpal.flash -007f
ff00-ff7f (prio 0, romd): alias pflash-alias 
@musicpal.flash -007f
ff80- (prio 0, romd): alias pflash-alias 
@musicpal.flash -007f

- after:

  fe00- (prio 0, i/o): masked musicpal.flash [span 
of 8 MiB]
fe00-fe7f (prio 0, romd): alias musicpal.flash 
[#0/4] @musicpal.flash -007f
fe80-feff (prio 0, romd): alias musicpal.flash 
[#1/4] @musicpal.flash -007f
ff00-ff7f (prio 0, romd): alias musicpal.flash 
[#2/4] @musicpal.flash -007f
ff80- (prio 0, romd): alias musicpal.flash 
[#3/4] @musicpal.flash -007f

Flatview is the same:

  (qemu) info mtree -f
  FlatView #0
   AS "memory", root: system
   AS "cpu-memory-0", root: system
   AS "emac-dma", root: system
   Root memory region: system
fe00-fe7f (prio 0, romd): musicpal.flash
fe80-feff (prio 0, romd): musicpal.flash
ff00-ff7f (prio 0, romd): musicpal.flash
ff80- (prio 0, romd): musicpal.flash

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/arm/musicpal.c | 11 +++
 hw/arm/Kconfig|  1 +
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/hw/arm/musicpal.c b/hw/arm/musicpal.c
index 8b58b66f263..7d1f2f3fb3f 100644
--- a/hw/arm/musicpal.c
+++ b/hw/arm/musicpal.c
@@ -30,6 +30,7 @@
 #include "hw/irq.h"
 #include "hw/or-irq.h"
 #include "hw/audio/wm8750.h"
+#include "hw/misc/aliased_region.h"
 #include "sysemu/block-backend.h"
 #include "sysemu/runstate.h"
 #include "sysemu/dma.h"
@@ -1656,7 +1657,7 @@ static void musicpal_init(MachineState *machine)
 qdev_prop_set_uint32(dev, "num-blocks", flash_size / sector_size);
 qdev_prop_set_uint32(dev, "sector-length", sector_size);
 qdev_prop_set_uint8(dev, "width", 2); /* 16-bit */
-qdev_prop_set_uint8(dev, "mappings", MP_FLASH_SIZE_MAX / flash_size);
+qdev_prop_set_uint8(dev, "mappings", 0);
 qdev_prop_set_uint8(dev, "big-endian", 0);
 qdev_prop_set_uint16(dev, "id0", 0x00bf);
 qdev_prop_set_uint16(dev, "id1", 0x236d);
@@ -1667,14 +1668,16 @@ static void musicpal_init(MachineState *machine)
 qdev_prop_set_string(dev, "name", "musicpal.flash");
 sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), _fatal);
 
-sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0,
-0x1ULL - MP_FLASH_SIZE_MAX);
-
 /*
  * The original U-Boot accesses the flash at 0xFE00 instead of
  * 0xFF80 (if there is 8 MB flash). So remap flash access if the
  * image is smaller than 32 MB.
  */
+memory_region_add_subregion_aliased(get_system_memory(),
+0x1ULL - MP_FLASH_SIZE_MAX,
+MP_FLASH_SIZE_MAX,
+sysbus_mmio_get_region(SYS_BUS_DEVICE(dev), 0),
+flash_size);
 }
 sysbus_create_simple(TYPE_MV88W8618_FLASHCFG, MP_FLASHCFG_BASE, NULL);
 
diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index 8c37cf00da7..aa8553b3cd3 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -101,6 +101,7 @@ config MUSICPAL
 select MARVELL_88W8618
 select PTIMER
 select PFLASH_CFI02
+select ALIASED_REGION
 select SERIAL
 select WM8750
 
-- 
2.26.3

[PATCH v2 2/7] hw/arm/musicpal: Open-code pflash_cfi02_register() call

2021-04-19 Thread Philippe Mathieu-Daudé

To be able to manually map the flash region on the main memory
(in the next commit), first expand the pflash_cfi02_register
in place.

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/arm/musicpal.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/hw/arm/musicpal.c b/hw/arm/musicpal.c
index 9cebece2de0..8b58b66f263 100644
--- a/hw/arm/musicpal.c
+++ b/hw/arm/musicpal.c
@@ -10,6 +10,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/units.h"
 #include "qapi/error.h"
 #include "cpu.h"
 #include "hw/sysbus.h"
@@ -1640,6 +1641,7 @@ static void musicpal_init(MachineState *machine)
 /* Register flash */
 dinfo = drive_get(IF_PFLASH, 0, 0);
 if (dinfo) {
+static const size_t sector_size = 64 * KiB;
 BlockBackend *blk = blk_by_legacy_dinfo(dinfo);
 
 flash_size = blk_getlength(blk);
@@ -1649,17 +1651,30 @@ static void musicpal_init(MachineState *machine)
 exit(1);
 }
 
+dev = qdev_new(TYPE_PFLASH_CFI02);
+qdev_prop_set_drive(dev, "drive", blk);
+qdev_prop_set_uint32(dev, "num-blocks", flash_size / sector_size);
+qdev_prop_set_uint32(dev, "sector-length", sector_size);
+qdev_prop_set_uint8(dev, "width", 2); /* 16-bit */
+qdev_prop_set_uint8(dev, "mappings", MP_FLASH_SIZE_MAX / flash_size);
+qdev_prop_set_uint8(dev, "big-endian", 0);
+qdev_prop_set_uint16(dev, "id0", 0x00bf);
+qdev_prop_set_uint16(dev, "id1", 0x236d);
+qdev_prop_set_uint16(dev, "id2", 0x);
+qdev_prop_set_uint16(dev, "id3", 0x);
+qdev_prop_set_uint16(dev, "unlock-addr0", 0x);
+qdev_prop_set_uint16(dev, "unlock-addr1", 0x2aaa);
+qdev_prop_set_string(dev, "name", "musicpal.flash");
+sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), _fatal);
+
+sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0,
+0x1ULL - MP_FLASH_SIZE_MAX);
+
 /*
  * The original U-Boot accesses the flash at 0xFE00 instead of
  * 0xFF80 (if there is 8 MB flash). So remap flash access if the
  * image is smaller than 32 MB.
  */
-pflash_cfi02_register(0x1ULL - MP_FLASH_SIZE_MAX,
-  "musicpal.flash", flash_size,
-  blk, 0x1,
-  MP_FLASH_SIZE_MAX / flash_size,
-  2, 0x00BF, 0x236D, 0x, 0x,
-  0x, 0x2AAA, 0);
 }
 sysbus_create_simple(TYPE_MV88W8618_FLASHCFG, MP_FLASHCFG_BASE, NULL);
 
-- 
2.26.3

[RFC PATCH v2 1/7] hw/misc: Add device to help managing aliased memory regions

2021-04-19 Thread Philippe Mathieu-Daudé

// TODO explain here how buses work? when some address lines are
// not bound we get memory aliasing, high addresses are masked.
// etc...

Add a helper to manage this use case easily.

For example a having @span_size = @region_size / 4 we get such mapping:

  ^---^
  |   |
  |   |
  | +---+ | +-+  <--+
  |   | +-+ |
  |   | | | |
  |   |   +---> | alias#3 | |
  |   |   | | | |
  |   |   | +-+ |
  |   |   | +-+ |
  |   |   | | | |
  |   |   |   +---> | alias#2 | |
  |   |   |   | | | |region
  | container |   |   | +-+ | size
  |   |   |   | +-+ |
  |   |   |   | | | |
  |   |   |   |  +> | alias#1 | |
  |   |   |   |  |  | | |
  |   |   |   |  |  +-+  <--+   |
  |   | +-+---+--+--+   +-+ |   |
  |   | |   |   | | |span   |
  |   | | subregion +-> | alias#0 | |size   |
   offset |   | |   |   | | |   |
   +> | +---+ | +---+   +-+  <--+<--+
   |  |   |
   |  |   |
   |  |   |
   |  |   |
   |  |   |
   |  ^---^

Signed-off-by: Philippe Mathieu-Daudé 
---
Not really RFC, simply that I'v to add the technical description,
but I'd like to know if there could be a possibility to not accept
this device (because I missed something) before keeping working on
it. So far it is only used in hobbyist boards.

Cc: Peter Xu 
Cc: Paolo Bonzini 
---
 include/hw/misc/aliased_region.h |  87 
 hw/misc/aliased_region.c | 132 +++
 MAINTAINERS  |   6 ++
 hw/misc/Kconfig  |   3 +
 hw/misc/meson.build  |   1 +
 5 files changed, 229 insertions(+)
 create mode 100644 include/hw/misc/aliased_region.h
 create mode 100644 hw/misc/aliased_region.c

diff --git a/include/hw/misc/aliased_region.h b/include/hw/misc/aliased_region.h
new file mode 100644
index 000..0ce0d5d1cef
--- /dev/null
+++ b/include/hw/misc/aliased_region.h
@@ -0,0 +1,87 @@
+/*
+ * Aliased memory regions
+ *
+ * Copyright (c) 2018  Philippe Mathieu-Daudé 
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_MISC_ALIASED_REGION_H
+#define HW_MISC_ALIASED_REGION_H
+
+#include "exec/memory.h"
+#include "hw/sysbus.h"
+
+#define TYPE_ALIASED_REGION "aliased-memory-region"
+OBJECT_DECLARE_SIMPLE_TYPE(AliasedRegionState, ALIASED_REGION)
+
+struct AliasedRegionState {
+/*< private >*/
+SysBusDevice parent_obj;
+
+/*< public >*/
+MemoryRegion container;
+uint64_t region_size;
+uint64_t span_size;
+MemoryRegion *mr;
+
+struct {
+size_t count;
+MemoryRegion *alias;
+} mem;
+};
+
+/**
+ * memory_region_add_subregion_aliased:
+ * @container: the #MemoryRegion to contain the aliased subregions.
+ * @offset: the offset relative to @container where the aliased subregion
+ *  are added.
+ * @region_size: size of the region containing the aliased subregions.
+ * @subregion: the subregion to be aliased.
+ * @span_size: size between each aliased subregion
+ *
+ * This utility function creates and maps an instance of aliased-memory-region,
+ * which is a dummy device of a single region which simply contains multiple
+ * aliases of the provided @subregion, spanned over the @region_size every
+ * @span_size. The device is mapped at @offset within @container.
+ *
+ * For example a having @span_size = @region_size / 4 we get such mapping:
+ *
+ *   +---+
+ *   |   |
+ *   |   |
+ *   | +---+ | +-+  <--+
+ *   |   | +-+ |
+ *   |   | | | |
+ *   |   |   +---> | alias#3 | |
+ *   |   |   | | | |
+ *   |   |   | +-+ |
+ *   |   |   | +-+ |
+ *   |   |   | | | |
+ *   |   |   |   +---> | alias#2 |

[PATCH v2 0/7] hw/misc: Add memory_region_add_subregion_aliased() helper [pflash part]

2021-04-19 Thread Philippe Mathieu-Daudé

Hi,

This series introduce the memory_region_add_subregion_aliased()
helper which basically create a device which maps a subregion
multiple times.

Since v1:
- Split series in 2, keeping the I/O regions (showed with the q800
  machine) part for 2nd part
- Added R-b tags

Examples are easier, so having a subregion aliased every @span_size
then mapped onto a container at an offset, you get something like:

  ^---^
  |   |
  |   |
  | +---+ | +-+  <--+
  |   | +-+ |
  |   | | | |
  |   |   +---> | alias#3 | |
  |   |   | | | |
  |   |   | +-+ |
  |   |   | +-+ |
  |   |   | | | |
  |   |   |   +---> | alias#2 | |
  |   |   |   | | | |region
  | container |   |   | +-+ | size
  |   |   |   | +-+ |
  |   |   |   | | | |
  |   |   |   |  +> | alias#1 | |
  |   |   |   |  |  | | |
  |   |   |   |  |  +-+  <--+   |
  |   | +-+---+--+--+   +-+ |   |
  |   | |   |   | | |span   |
  |   | | subregion +-> | alias#0 | |size   |
   offset |   | |   |   | | |   |
   +> | +---+ | +---+   +-+  <--+<--+
   |  |   |
   |  |   |
   |  |   |
   |  |   |
   |  |   |
   |  ^---^

I know it need more documentation and tests, but I prefer to send
as draft RFC for early review before spending more time on it.

Supersedes: <20210326002728.1069834-1-f4...@amsat.org>
Based-on: <20210325120921.858993-1-f4...@amsat.org>
https://www.mail-archive.com/qemu-devel@nongnu.org/msg795218.html

Philippe Mathieu-Daudé (7):
  hw/misc: Add device to help managing aliased memory regions
  hw/arm/musicpal: Open-code pflash_cfi02_register() call
  hw/arm/musicpal: Map flash using memory_region_add_subregion_aliased()
  hw/arm/digic: Open-code pflash_cfi02_register() call
  hw/arm/digic: Map flash using memory_region_add_subregion_aliased()
  hw/block/pflash_cfi02: Remove pflash_setup_mappings()
  hw/block/pflash_cfi02: Simplify pflash_cfi02_register() prototype

 include/hw/block/flash.h |   1 -
 include/hw/misc/aliased_region.h |  87 
 hw/arm/digic_boards.c|  28 +--
 hw/arm/musicpal.c|  29 +--
 hw/arm/xilinx_zynq.c |   2 +-
 hw/block/pflash_cfi02.c  |  36 +
 hw/lm32/lm32_boards.c|   4 +-
 hw/misc/aliased_region.c | 132 +++
 hw/ppc/ppc405_boards.c   |   6 +-
 hw/sh4/r2d.c |   2 +-
 MAINTAINERS  |   6 ++
 hw/arm/Kconfig   |   2 +
 hw/misc/Kconfig  |   3 +
 hw/misc/meson.build  |   1 +
 14 files changed, 285 insertions(+), 54 deletions(-)
 create mode 100644 include/hw/misc/aliased_region.h
 create mode 100644 hw/misc/aliased_region.c

-- 
2.26.3

Re: [PATCH v3 21/33] qemu-socket: pass monitor link to socket_get_fd directly

2021-04-19 Thread Daniel P . Berrangé

On Fri, Apr 16, 2021 at 11:08:59AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Detecting monitor by current coroutine works bad when we are not in
> coroutine context. And that's exactly so in nbd reconnect code, where
> qio_channel_socket_connect_sync() is called from thread.
> 
> Add a possibility to pass monitor by hand, to be used in the following
> commit.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/io/channel-socket.h| 20 
>  include/qemu/sockets.h |  2 +-
>  io/channel-socket.c| 17 +
>  tests/unit/test-util-sockets.c | 16 
>  util/qemu-sockets.c| 10 +-
>  5 files changed, 47 insertions(+), 18 deletions(-)
> 
> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> index e747e63514..6d0915420d 100644
> --- a/include/io/channel-socket.h
> +++ b/include/io/channel-socket.h
> @@ -78,6 +78,23 @@ qio_channel_socket_new_fd(int fd,
>Error **errp);
>  
>  
> +/**
> + * qio_channel_socket_connect_sync_mon:
> + * @ioc: the socket channel object
> + * @addr: the address to connect to
> + * @mon: current monitor. If NULL, it will be detected by
> + *   current coroutine.
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Attempt to connect to the address @addr. This method
> + * will run in the foreground so the caller will not regain
> + * execution control until the connection is established or
> + * an error occurs.
> + */
> +int qio_channel_socket_connect_sync_mon(QIOChannelSocket *ioc,
> +SocketAddress *addr,
> +Monitor *mon,
> +Error **errp);

I don't really like exposing the concept of the QEMU monitor in
the IO layer APIs. IMHO these ought to remain completely separate
subsystems from the API pov, and we ought to fix this problem by
making monitor_cur() work better in the scenario required.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Re: [PATCH-for-6.1 0/2] hw/block/pflash_cfi02: Do not create aliases when not necessary

2021-04-19 Thread Philippe Mathieu-Daudé

On 3/25/21 1:09 PM, Philippe Mathieu-Daudé wrote:
> Simplify memory layout when no pflash_cfi02 mapping requested.

> Philippe Mathieu-Daud=C3=A9 (2):
>   hw/block/pflash_cfi02: Set romd mode in pflash_cfi02_realize()
>   hw/block/pflash_cfi02: Do not create aliases when not necessary

Thanks, series applied to pflash-next tree.

Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection

2021-04-19 Thread Peter Lieven



> Am 19.04.2021 um 10:36 schrieb Peter Lieven :
> 
> 
> 
>> Am 15.04.2021 um 17:22 schrieb Kevin Wolf :
>> 
>> Peter, three years ago you changed 'qemu-img convert' to sacrifice some
>> sparsification in order to get aligned requests on the target image. At
>> the time, I thought the impact would be small, but it turns out that
>> this can end up wasting gigabytes of storagee (like converting a fully
>> zeroed 10 GB image taking 2.8 GB instead of a few kilobytes).
>> 
>> https://bugzilla.redhat.com/show_bug.cgi?id=1882917
>> 
>> I'm not entirely sure how to attack this best since this is a tradeoff,
>> but maybe the approach in this series is still good enough for the case
>> that you wanted to fix back then?
>> 
>> Of course, it would be possible to have a more complete fix like looking
>> forward a few blocks more before writing data, but that would probably
>> not be entirely trivial because you would have to merge blocks with ZERO
>> block status with DATA blocks that contain only zeros. I'm not sure if
>> it's worth this complication of the code.
> 
> I will try to look into this asap.

Besides from the reproducer described in the ticket, I retried my old 
conversion test in our environment:

Before commit 8dcd3c9b91: reads 4608 writes 14959
After commit 8dcd3c9b91: reads 0 writes 14924
With Kevins patch: reads 110 writes 14924

I think this is a good result if it avoids other issues.

Peter

[PATCH v2 6/8] block: do not acquire AioContext in check_to_replace_node

2021-04-19 Thread Emanuele Giuseppe Esposito

Make the (only) caller do it.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c   | 22 +-
 blockdev.c|  7 ++-
 include/block/block.h |  1 +
 3 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/block.c b/block.c
index f40fb63c75..fd12f88062 100644
--- a/block.c
+++ b/block.c
@@ -6776,24 +6776,15 @@ bool bdrv_recurse_can_replace(BlockDriverState *bs,
  *
  * The result (whether the node can be replaced or not) is only valid
  * for as long as no graph or permission changes occur.
+ *
+ * Called with AioContext lock held.
  */
 BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
+BlockDriverState *to_replace_bs,
 const char *node_name, Error **errp)
 {
-BlockDriverState *to_replace_bs = bdrv_find_node(node_name);
-AioContext *aio_context;
-
-if (!to_replace_bs) {
-error_setg(errp, "Failed to find node with node-name='%s'", node_name);
-return NULL;
-}
-
-aio_context = bdrv_get_aio_context(to_replace_bs);
-aio_context_acquire(aio_context);
-
 if (bdrv_op_is_blocked(to_replace_bs, BLOCK_OP_TYPE_REPLACE, errp)) {
-to_replace_bs = NULL;
-goto out;
+return NULL;
 }
 
 /* We don't want arbitrary node of the BDS chain to be replaced only the 
top
@@ -6806,12 +6797,9 @@ BlockDriverState *check_to_replace_node(BlockDriverState 
*parent_bs,
"because it cannot be guaranteed that doing so would not "
"lead to an abrupt change of visible data",
node_name, parent_bs->node_name);
-to_replace_bs = NULL;
-goto out;
+return NULL;
 }
 
-out:
-aio_context_release(aio_context);
 return to_replace_bs;
 }
 
diff --git a/blockdev.c b/blockdev.c
index a57590aae4..e901107344 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3056,13 +3056,18 @@ static void blockdev_mirror_common(const char *job_id, 
BlockDriverState *bs,
 return;
 }
 
-to_replace_bs = check_to_replace_node(bs, replaces, errp);
+to_replace_bs = bdrv_find_node(replaces);
 if (!to_replace_bs) {
+error_setg(errp, "Failed to find node with node-name='%s'",
+   replaces);
 return;
 }
 
 replace_aio_context = bdrv_get_aio_context(to_replace_bs);
 aio_context_acquire(replace_aio_context);
+if (!check_to_replace_node(bs, to_replace_bs, replaces, errp)) {
+return;
+}
 replace_size = bdrv_getlength(to_replace_bs);
 aio_context_release(replace_aio_context);
 
diff --git a/include/block/block.h b/include/block/block.h
index b3f6e509d4..f57c34a19a 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -474,6 +474,7 @@ int bdrv_amend_options(BlockDriverState *bs_new, QemuOpts 
*opts,
 
 /* check if a named node can be replaced when doing drive-mirror */
 BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
+BlockDriverState *to_replace_bs,
 const char *node_name, Error **errp);
 
 /* async block I/O */
-- 
2.30.2

[PATCH v2 7/8] block/replication: do not acquire AioContext

2021-04-19 Thread Emanuele Giuseppe Esposito

Replication functions are mostly called when the BDS is quiescent and
does not have any pending I/O.  They do not need to synchronize on
anything since BDS and BB are now thread-safe.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/replication.c | 54 ++---
 1 file changed, 12 insertions(+), 42 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 97be7ef4de..25ee37b21b 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -45,6 +45,8 @@ typedef struct BDRVReplicationState {
 Error *blocker;
 bool orig_hidden_read_only;
 bool orig_secondary_read_only;
+
+/* This field is accessed asynchronously.  */
 int error;
 } BDRVReplicationState;
 
@@ -210,7 +212,7 @@ static int replication_return_value(BDRVReplicationState 
*s, int ret)
 }
 
 if (ret < 0) {
-s->error = ret;
+qatomic_set(>error, ret);
 ret = 0;
 }
 
@@ -307,6 +309,7 @@ out:
 return ret;
 }
 
+/* Called with no I/O pending.  */
 static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
 {
 Error *local_err = NULL;
@@ -420,7 +423,7 @@ static void backup_job_completed(void *opaque, int ret)
 
 if (s->stage != BLOCK_REPLICATION_FAILOVER) {
 /* The backup job is cancelled unexpectedly */
-s->error = -EIO;
+qatomic_set(>error, -EIO);
 }
 
 backup_job_cleanup(bs);
@@ -445,6 +448,7 @@ static bool check_top_bs(BlockDriverState *top_bs, 
BlockDriverState *bs)
 return false;
 }
 
+/* Called with no I/O pending.  */
 static void replication_start(ReplicationState *rs, ReplicationMode mode,
   Error **errp)
 {
@@ -452,12 +456,9 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 BDRVReplicationState *s;
 BlockDriverState *top_bs;
 int64_t active_length, hidden_length, disk_length;
-AioContext *aio_context;
 Error *local_err = NULL;
 BackupPerf perf = { .use_copy_range = true, .max_workers = 1 };
 
-aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
 s = bs->opaque;
 
 if (s->stage == BLOCK_REPLICATION_DONE ||
@@ -467,20 +468,17 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  * Ignore the request because the secondary side of replication
  * doesn't have to do anything anymore.
  */
-aio_context_release(aio_context);
 return;
 }
 
 if (s->stage != BLOCK_REPLICATION_NONE) {
 error_setg(errp, "Block replication is running or done");
-aio_context_release(aio_context);
 return;
 }
 
 if (s->mode != mode) {
 error_setg(errp, "The parameter mode's value is invalid, needs %d,"
" but got %d", s->mode, mode);
-aio_context_release(aio_context);
 return;
 }
 
@@ -492,21 +490,18 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 if (!s->active_disk || !s->active_disk->bs ||
 !s->active_disk->bs->backing) {
 error_setg(errp, "Active disk doesn't have backing file");
-aio_context_release(aio_context);
 return;
 }
 
 s->hidden_disk = s->active_disk->bs->backing;
 if (!s->hidden_disk->bs || !s->hidden_disk->bs->backing) {
 error_setg(errp, "Hidden disk doesn't have backing file");
-aio_context_release(aio_context);
 return;
 }
 
 s->secondary_disk = s->hidden_disk->bs->backing;
 if (!s->secondary_disk->bs || !bdrv_has_blk(s->secondary_disk->bs)) {
 error_setg(errp, "The secondary disk doesn't have block backend");
-aio_context_release(aio_context);
 return;
 }
 
@@ -518,7 +513,6 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 active_length != hidden_length || hidden_length != disk_length) {
 error_setg(errp, "Active disk, hidden disk, secondary disk's 
length"
" are not the same");
-aio_context_release(aio_context);
 return;
 }
 
@@ -529,7 +523,6 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 !s->hidden_disk->bs->drv->bdrv_make_empty) {
 error_setg(errp,
"Active disk or hidden disk doesn't support 
make_empty");
-aio_context_release(aio_context);
 return;
 }
 
@@ -537,7 +530,6 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 reopen_backing_file(bs, true, _err);
 if (local_err) {
 error_propagate(errp, local_err);
-aio_context_release(aio_context);
 return;
 }
 
@@ -550,7 +542,6 @@ static void replication_start(ReplicationState *rs,

[PATCH v2 8/8] block: do not take AioContext around reopen

2021-04-19 Thread Emanuele Giuseppe Esposito

Reopen needs to handle AioContext carefully due to calling
bdrv_drain_all_begin/end.  By not taking AioContext around calls to
bdrv_reopen_multiple, we can drop the function's release/acquire
pair and the AioContext argument too.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c |  4 
 block/mirror.c|  9 -
 blockdev.c| 19 ++-
 3 files changed, 6 insertions(+), 26 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 413af51f3b..6fdc698e9e 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2291,20 +2291,16 @@ int blk_commit_all(void)
 BlockBackend *blk = NULL;
 
 while ((blk = blk_all_next(blk)) != NULL) {
-AioContext *aio_context = blk_get_aio_context(blk);
 BlockDriverState *unfiltered_bs = bdrv_skip_filters(blk_bs(blk));
 
-aio_context_acquire(aio_context);
 if (blk_is_inserted(blk) && bdrv_cow_child(unfiltered_bs)) {
 int ret;
 
 ret = bdrv_commit(unfiltered_bs);
 if (ret < 0) {
-aio_context_release(aio_context);
 return ret;
 }
 }
-aio_context_release(aio_context);
 }
 return 0;
 }
diff --git a/block/mirror.c b/block/mirror.c
index 5a71bd8bbc..43174bbc6b 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -631,7 +631,6 @@ static int mirror_exit_common(Job *job)
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 BlockJob *bjob = >common;
 MirrorBDSOpaque *bs_opaque;
-AioContext *replace_aio_context = NULL;
 BlockDriverState *src;
 BlockDriverState *target_bs;
 BlockDriverState *mirror_top_bs;
@@ -699,11 +698,6 @@ static int mirror_exit_common(Job *job)
 }
 }
 
-if (s->to_replace) {
-replace_aio_context = bdrv_get_aio_context(s->to_replace);
-aio_context_acquire(replace_aio_context);
-}
-
 if (s->should_complete && !abort) {
 BlockDriverState *to_replace = s->to_replace ?: src;
 bool ro = bdrv_is_read_only(to_replace);
@@ -740,9 +734,6 @@ static int mirror_exit_common(Job *job)
 error_free(s->replace_blocker);
 bdrv_unref(s->to_replace);
 }
-if (replace_aio_context) {
-aio_context_release(replace_aio_context);
-}
 g_free(s->replaces);
 bdrv_unref(target_bs);
 
diff --git a/blockdev.c b/blockdev.c
index e901107344..1672ef756e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3469,7 +3469,6 @@ void qmp_change_backing_file(const char *device,
  Error **errp)
 {
 BlockDriverState *bs = NULL;
-AioContext *aio_context;
 BlockDriverState *image_bs = NULL;
 Error *local_err = NULL;
 bool ro;
@@ -3480,37 +3479,34 @@ void qmp_change_backing_file(const char *device,
 return;
 }
 
-aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
-
 image_bs = bdrv_lookup_bs(NULL, image_node_name, _err);
 if (local_err) {
 error_propagate(errp, local_err);
-goto out;
+return;
 }
 
 if (!image_bs) {
 error_setg(errp, "image file not found");
-goto out;
+return;
 }
 
 if (bdrv_find_base(image_bs) == image_bs) {
 error_setg(errp, "not allowing backing file change on an image "
  "without a backing file");
-goto out;
+return;
 }
 
 /* even though we are not necessarily operating on bs, we need it to
  * determine if block ops are currently prohibited on the chain */
 if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_CHANGE, errp)) {
-goto out;
+return;
 }
 
 /* final sanity check */
 if (!bdrv_chain_contains(bs, image_bs)) {
 error_setg(errp, "'%s' and image file are not in the same chain",
device);
-goto out;
+return;
 }
 
 /* if not r/w, reopen to make r/w */
@@ -3518,7 +3514,7 @@ void qmp_change_backing_file(const char *device,
 
 if (ro) {
 if (bdrv_reopen_set_read_only(image_bs, false, errp) != 0) {
-goto out;
+return;
 }
 }
 
@@ -3536,9 +3532,6 @@ void qmp_change_backing_file(const char *device,
 if (ro) {
 bdrv_reopen_set_read_only(image_bs, true, errp);
 }
-
-out:
-aio_context_release(aio_context);
 }
 
 void qmp_blockdev_add(BlockdevOptions *options, Error **errp)
-- 
2.30.2

[PATCH v2 5/8] block: add a few more notes on locking

2021-04-19 Thread Emanuele Giuseppe Esposito

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/block/block_int.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a1aad5ad2d..67a0777e12 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -154,7 +154,9 @@ struct BlockDriver {
  */
 bool supports_backing;
 
-/* For handling image reopen for split or non-split files */
+/* For handling image reopen for split or non-split files.  Called
+ * with no I/O pending.
+ */
 int (*bdrv_reopen_prepare)(BDRVReopenState *reopen_state,
BlockReopenQueue *queue, Error **errp);
 void (*bdrv_reopen_commit)(BDRVReopenState *reopen_state);
@@ -168,6 +170,7 @@ struct BlockDriver {
 /* Protocol drivers should implement this instead of bdrv_open */
 int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags,
   Error **errp);
+/* Called from main thread.  */
 void (*bdrv_close)(BlockDriverState *bs);
 
 
-- 
2.30.2

[PATCH v2 3/8] util: use RCU accessors for notifiers

2021-04-19 Thread Emanuele Giuseppe Esposito

Note that calling rcu_read_lock() is left to the caller.  In fact,
if the notifier is really only used within the BQL, it's unnecessary.

Even outside the BQL, RCU accessors can also be used with any API that has
the same contract as synchronize_rcu, i.e. it stops until all concurrent
readers complete, no matter how "readers" are defined.  In the next patch,
for example, synchronize_rcu's role is taken by bdrv_drain (which is a
superset of synchronize_rcu, since it also blocks new incoming readers).

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 util/notify.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/util/notify.c b/util/notify.c
index 76bab212ae..529f1d121e 100644
--- a/util/notify.c
+++ b/util/notify.c
@@ -15,6 +15,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/notify.h"
+#include "qemu/rcu_queue.h"
 
 void notifier_list_init(NotifierList *list)
 {
@@ -23,19 +24,19 @@ void notifier_list_init(NotifierList *list)
 
 void notifier_list_add(NotifierList *list, Notifier *notifier)
 {
-QLIST_INSERT_HEAD(>notifiers, notifier, node);
+QLIST_INSERT_HEAD_RCU(>notifiers, notifier, node);
 }
 
 void notifier_remove(Notifier *notifier)
 {
-QLIST_REMOVE(notifier, node);
+QLIST_REMOVE_RCU(notifier, node);
 }
 
 void notifier_list_notify(NotifierList *list, void *data)
 {
 Notifier *notifier, *next;
 
-QLIST_FOREACH_SAFE(notifier, >notifiers, node, next) {
+QLIST_FOREACH_SAFE_RCU(notifier, >notifiers, node, next) {
 notifier->notify(notifier, data);
 }
 }
@@ -53,12 +54,12 @@ void notifier_with_return_list_init(NotifierWithReturnList 
*list)
 void notifier_with_return_list_add(NotifierWithReturnList *list,
NotifierWithReturn *notifier)
 {
-QLIST_INSERT_HEAD(>notifiers, notifier, node);
+QLIST_INSERT_HEAD_RCU(>notifiers, notifier, node);
 }
 
 void notifier_with_return_remove(NotifierWithReturn *notifier)
 {
-QLIST_REMOVE(notifier, node);
+QLIST_REMOVE_RCU(notifier, node);
 }
 
 int notifier_with_return_list_notify(NotifierWithReturnList *list, void *data)
@@ -66,7 +67,7 @@ int notifier_with_return_list_notify(NotifierWithReturnList 
*list, void *data)
 NotifierWithReturn *notifier, *next;
 int ret = 0;
 
-QLIST_FOREACH_SAFE(notifier, >notifiers, node, next) {
+QLIST_FOREACH_SAFE_RCU(notifier, >notifiers, node, next) {
 ret = notifier->notify(notifier, data);
 if (ret != 0) {
 break;
-- 
2.30.2

[PATCH v2 4/8] block: make before-write notifiers thread-safe

2021-04-19 Thread Emanuele Giuseppe Esposito

Reads access the list in RCU style, so be careful to avoid use-after-free
scenarios in the backup block job.  Apart from this, all that's needed
is protecting updates with a mutex.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c   |  6 +++---
 block/io.c| 12 
 block/write-threshold.c   |  2 +-
 include/block/block_int.h | 37 +
 4 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index c5b887cec1..f40fb63c75 100644
--- a/block.c
+++ b/block.c
@@ -6434,7 +6434,7 @@ static void 
bdrv_do_remove_aio_context_notifier(BdrvAioNotifier *ban)
 g_free(ban);
 }
 
-static void bdrv_detach_aio_context(BlockDriverState *bs)
+void bdrv_detach_aio_context(BlockDriverState *bs)
 {
 BdrvAioNotifier *baf, *baf_tmp;
 
@@ -6462,8 +6462,8 @@ static void bdrv_detach_aio_context(BlockDriverState *bs)
 bs->aio_context = NULL;
 }
 
-static void bdrv_attach_aio_context(BlockDriverState *bs,
-AioContext *new_context)
+void bdrv_attach_aio_context(BlockDriverState *bs,
+ AioContext *new_context)
 {
 BdrvAioNotifier *ban, *ban_tmp;
 
diff --git a/block/io.c b/block/io.c
index ca2dca3007..8f6af6077b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3137,10 +3137,22 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, 
QEMUIOVector *qiov)
 return true;
 }
 
+static QemuSpin notifiers_spin_lock;
+
 void bdrv_add_before_write_notifier(BlockDriverState *bs,
 NotifierWithReturn *notifier)
 {
+qemu_spin_lock(_spin_lock);
 notifier_with_return_list_add(>before_write_notifiers, notifier);
+qemu_spin_unlock(_spin_lock);
+}
+
+void bdrv_remove_before_write_notifier(BlockDriverState *bs,
+   NotifierWithReturn *notifier)
+{
+qemu_spin_lock(_spin_lock);
+notifier_with_return_remove(notifier);
+qemu_spin_unlock(_spin_lock);
 }
 
 void bdrv_io_plug(BlockDriverState *bs)
diff --git a/block/write-threshold.c b/block/write-threshold.c
index 77c74bdaa7..b87b749b02 100644
--- a/block/write-threshold.c
+++ b/block/write-threshold.c
@@ -32,7 +32,7 @@ bool bdrv_write_threshold_is_set(const BlockDriverState *bs)
 static void write_threshold_disable(BlockDriverState *bs)
 {
 if (bdrv_write_threshold_is_set(bs)) {
-notifier_with_return_remove(>write_threshold_notifier);
+bdrv_remove_before_write_notifier(bs, >write_threshold_notifier);
 bs->write_threshold_offset = 0;
 }
 }
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 88e4111939..a1aad5ad2d 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1089,6 +1089,8 @@ bool bdrv_backing_overridden(BlockDriverState *bs);
 
 /**
  * bdrv_add_before_write_notifier:
+ * @bs: The #BlockDriverState for which to register the notifier
+ * @notifier: The notifier to add
  *
  * Register a callback that is invoked before write requests are processed but
  * after any throttling or waiting for overlapping requests.
@@ -1096,6 +1098,41 @@ bool bdrv_backing_overridden(BlockDriverState *bs);
 void bdrv_add_before_write_notifier(BlockDriverState *bs,
 NotifierWithReturn *notifier);
 
+/**
+ * bdrv_remove_before_write_notifier:
+ * @bs: The #BlockDriverState for which to register the notifier
+ * @notifier: The notifier to add
+ *
+ * Unregister a callback that is invoked before write requests are processed 
but
+ * after any throttling or waiting for overlapping requests.
+ *
+ * Note that more I/O could be pending on @bs.  You need to wait for
+ * it to finish, for example with bdrv_drain(), before freeing @notifier.
+ */
+void bdrv_remove_before_write_notifier(BlockDriverState *bs,
+   NotifierWithReturn *notifier);
+
+/**
+ * bdrv_detach_aio_context:
+ *
+ * May be called from .bdrv_detach_aio_context() to detach children from the
+ * current #AioContext.  This is only needed by block drivers that manage their
+ * own children.  Both ->file and ->backing are automatically handled and
+ * block drivers should not call this function on them explicitly.
+ */
+void bdrv_detach_aio_context(BlockDriverState *bs);
+
+/**
+ * bdrv_attach_aio_context:
+ *
+ * May be called from .bdrv_attach_aio_context() to attach children to the new
+ * #AioContext.  This is only needed by block drivers that manage their own
+ * children.  Both ->file and ->backing are automatically handled and block
+ * drivers should not call this function on them explicitly.
+ */
+void bdrv_attach_aio_context(BlockDriverState *bs,
+ AioContext *new_context);
+
 /**
  * bdrv_add_aio_context_notifier:
  *
-- 
2.30.2

[PATCH v2 2/8] block: protect write threshold QMP commands from concurrent requests

2021-04-19 Thread Emanuele Giuseppe Esposito

For simplicity, use bdrv_drained_begin/end to avoid concurrent
writes to the write threshold, or reading it while it is being set.
qmp_block_set_write_threshold is protected by the BQL.

Reviewed-by: Stefan Hajnoczi 
Co-developed-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/write-threshold.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/block/write-threshold.c b/block/write-threshold.c
index 63040fa4f2..77c74bdaa7 100644
--- a/block/write-threshold.c
+++ b/block/write-threshold.c
@@ -111,7 +111,6 @@ void qmp_block_set_write_threshold(const char *node_name,
Error **errp)
 {
 BlockDriverState *bs;
-AioContext *aio_context;
 
 bs = bdrv_find_node(node_name);
 if (!bs) {
@@ -119,10 +118,8 @@ void qmp_block_set_write_threshold(const char *node_name,
 return;
 }
 
-aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
-
+/* Avoid a concurrent write_threshold_disable.  */
+bdrv_drained_begin(bs);
 bdrv_write_threshold_set(bs, threshold_bytes);
-
-aio_context_release(aio_context);
+bdrv_drained_end(bs);
 }
-- 
2.30.2

[PATCH v2 0/8] Block layer thread-safety, continued

2021-04-19 Thread Emanuele Giuseppe Esposito

This and the following serie of patches are based on Paolo's
v1 patches sent in 2017[*]. They have been ported to the current QEMU
version, but the goal remains the same: 
- make the block layer thread-safe (patches 1-5), and
- remove aio_context_acquire/release (patches 6-8).

[*] = https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg01398.html

Signed-off-by: Emanuele Giuseppe Esposito 
---
v1 (2017) -> v2 (2021):
- v1 Patch "block-backup: add reqs_lock" has been dropped, because now
  is completely different from the old version and all functions
  that were affected by it have been moved or deleted. 
  It will be replaced by another serie that aims to thread safety to 
  block/block-copy.c
- remaining v1 patches will be integrated in next serie.
- Patch "block: do not acquire AioContext in check_to_replace_node"
  moves part of the logic of check_to_replace_node to the caller,
  so that the function can be included in the aio_context_acquire/release
  block that follows.

Emanuele Giuseppe Esposito (8):
  block: prepare write threshold code for thread safety
  block: protect write threshold QMP commands from concurrent requests
  util: use RCU accessors for notifiers
  block: make before-write notifiers thread-safe
  block: add a few more notes on locking
  block: do not acquire AioContext in check_to_replace_node
  block/replication: do not acquire AioContext
  block: do not take AioContext around reopen

 block.c   | 28 ++--
 block/block-backend.c |  4 ---
 block/io.c| 12 +
 block/mirror.c|  9 ---
 block/replication.c   | 54 +--
 block/write-threshold.c   | 39 ++--
 blockdev.c| 26 +--
 include/block/block.h |  1 +
 include/block/block_int.h | 42 +-
 util/notify.c | 13 +-
 10 files changed, 113 insertions(+), 115 deletions(-)

-- 
2.30.2

[PATCH v2 1/8] block: prepare write threshold code for thread safety

2021-04-19 Thread Emanuele Giuseppe Esposito

Reviewed-by: Stefan Hajnoczi 
Co-developed-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/write-threshold.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/block/write-threshold.c b/block/write-threshold.c
index 85b78dc2a9..63040fa4f2 100644
--- a/block/write-threshold.c
+++ b/block/write-threshold.c
@@ -37,18 +37,22 @@ static void write_threshold_disable(BlockDriverState *bs)
 }
 }
 
+static uint64_t treshold_overage(const BdrvTrackedRequest *req,
+uint64_t thres)
+{
+if (thres > 0 && req->offset + req->bytes > thres) {
+return req->offset + req->bytes - thres;
+} else {
+return 0;
+}
+}
+
 uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs,
const BdrvTrackedRequest *req)
 {
-if (bdrv_write_threshold_is_set(bs)) {
-if (req->offset > bs->write_threshold_offset) {
-return (req->offset - bs->write_threshold_offset) + req->bytes;
-}
-if ((req->offset + req->bytes) > bs->write_threshold_offset) {
-return (req->offset + req->bytes) - bs->write_threshold_offset;
-}
-}
-return 0;
+uint64_t thres = bdrv_write_threshold_get(bs);
+
+return treshold_overage(req, thres);
 }
 
 static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
@@ -56,14 +60,14 @@ static int coroutine_fn 
before_write_notify(NotifierWithReturn *notifier,
 {
 BdrvTrackedRequest *req = opaque;
 BlockDriverState *bs = req->bs;
-uint64_t amount = 0;
+uint64_t thres = bdrv_write_threshold_get(bs);
+uint64_t amount = treshold_overage(req, thres);
 
-amount = bdrv_write_threshold_exceeded(bs, req);
 if (amount > 0) {
 qapi_event_send_block_write_threshold(
 bs->node_name,
 amount,
-bs->write_threshold_offset);
+thres);
 
 /* autodisable to avoid flooding the monitor */
 write_threshold_disable(bs);
-- 
2.30.2

Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection

2021-04-19 Thread Peter Lieven




> Am 15.04.2021 um 17:22 schrieb Kevin Wolf :
> 
> Peter, three years ago you changed 'qemu-img convert' to sacrifice some
> sparsification in order to get aligned requests on the target image. At
> the time, I thought the impact would be small, but it turns out that
> this can end up wasting gigabytes of storagee (like converting a fully
> zeroed 10 GB image taking 2.8 GB instead of a few kilobytes).
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1882917
> 
> I'm not entirely sure how to attack this best since this is a tradeoff,
> but maybe the approach in this series is still good enough for the case
> that you wanted to fix back then?
> 
> Of course, it would be possible to have a more complete fix like looking
> forward a few blocks more before writing data, but that would probably
> not be entirely trivial because you would have to merge blocks with ZERO
> block status with DATA blocks that contain only zeros. I'm not sure if
> it's worth this complication of the code.

I will try to look into this asap.

Is there a hint which FS I need to set the extent hint when creating the raw 
image? I was not able to do that.

Peter

48 matches

Mail list logo