Re: [PATCH 03/14] hw/block/nvme: rename __nvme_select_ns_iocs
On 19/04/2021 21.27, Klaus Jensen wrote: From: Klaus Jensen Get rid of the (reserved) double underscore use. Cc: Philippe Mathieu-Daudé Cc: Thomas Huth Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 47 +++ 1 file changed, 23 insertions(+), 24 deletions(-) Reviewed-by: Thomas Huth
Re: [PATCH 02/14] hw/block/nvme: rename __nvme_advance_zone_wp
On 19/04/2021 21.27, Klaus Jensen wrote: From: Klaus Jensen Get rid of the (reserved) double underscore use. Cc: Philippe Mathieu-Daudé Cc: Thomas Huth Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 002c0672b397..d1b94e36c6fb 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -1745,8 +1745,8 @@ static inline uint16_t nvme_zrm_open(NvmeNamespace *ns, NvmeZone *zone) return nvme_zrm_open_flags(ns, zone, 0); } -static void __nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone, - uint32_t nlb) +static void nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone, + uint32_t nlb) { zone->d.wp += nlb; @@ -1766,7 +1766,7 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req) nlb = le16_to_cpu(rw->nlb) + 1; zone = nvme_get_zone_by_slba(ns, slba); -__nvme_advance_zone_wp(ns, zone, nlb); +nvme_advance_zone_wp(ns, zone, nlb); } static inline bool nvme_is_write(NvmeRequest *req) @@ -2155,7 +2155,7 @@ out: uint64_t sdlba = le64_to_cpu(copy->sdlba); NvmeZone *zone = nvme_get_zone_by_slba(ns, sdlba); -__nvme_advance_zone_wp(ns, zone, ctx->nlb); +nvme_advance_zone_wp(ns, zone, ctx->nlb); } g_free(ctx->bounce); Reviewed-by: Thomas Huth
Re: [PATCH 01/14] hw/block/nvme: rename __nvme_zrm_open
On 19/04/2021 21.27, Klaus Jensen wrote: From: Klaus Jensen Get rid of the (reserved) double underscore use. Cc: Philippe Mathieu-Daudé Cc: Thomas Huth Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 16 ++-- 1 file changed, 10 insertions(+), 6 deletions(-) I think it would be good to mention the change with NVME_ZRM_AUTO in the patch description, too. Apart from that: Reviewed-by: Thomas Huth
Re: [for-6.1 0/4] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci
On Wed, Apr 07, 2021 at 04:34:57PM +0200, Greg Kurz wrote: > Now that virtio-scsi-pci and virtio-blk-pci map 1 virtqueue per vCPU, > a serious slow down may be observed on setups with a big enough number > of vCPUs. > > Exemple with a pseries guest on a bi-POWER9 socket system (128 HW threads): > > virtio-scsi virtio-blk > > 1 0m20.922s 0m21.346s > 2 0m21.230s 0m20.350s > 4 0m21.761s 0m20.997s > 8 0m22.770s 0m20.051s > 160m22.038s 0m19.994s > 320m22.928s 0m20.803s > 640m26.583s 0m22.953s > 128 0m41.273s 0m32.333s > 256 2m4.727s1m16.924s > 384 6m5.563s3m26.186s > > Both perf and gprof indicate that QEMU is hogging CPUs when setting up > the ioeventfds: > > 67.88% swapper [kernel.kallsyms] [k] power_pmu_enable > 9.47% qemu-kvm[kernel.kallsyms] [k] smp_call_function_single > 8.64% qemu-kvm[kernel.kallsyms] [k] power_pmu_enable > =>2.79% qemu-kvmqemu-kvm [.] memory_region_ioeventfd_before > =>2.12% qemu-kvmqemu-kvm [.] > address_space_update_ioeventfds > 0.56% kworker/8:0-mm [kernel.kallsyms] [k] smp_call_function_single > > address_space_update_ioeventfds() is called when committing an MR > transaction, i.e. for each ioeventfd with the current code base, > and it internally loops on all ioventfds: > > static void address_space_update_ioeventfds(AddressSpace *as) > { > [...] > FOR_EACH_FLAT_RANGE(fr, view) { > for (i = 0; i < fr->mr->ioeventfd_nb; ++i) { > > This means that the setup of ioeventfds for these devices has > quadratic time complexity. > > This series simply changes the device models to extend the transaction > to all virtqueueues, like already done in the past in the generic > code with 710fccf80d78 ("virtio: improve virtio devices initialization > time"). > > Only virtio-scsi and virtio-blk are covered here, but a similar change > might also be beneficial to other device types such as host-scsi-pci, > vhost-user-scsi-pci and vhost-user-blk-pci. > > virtio-scsi virtio-blk > > 1 0m21.271s 0m22.076s > 2 0m20.912s 0m19.716s > 4 0m20.508s 0m19.310s > 8 0m21.374s 0m20.273s > 160m21.559s 0m21.374s > 320m22.532s 0m21.271s > 640m26.550s 0m22.007s > 128 0m29.115s 0m27.446s > 256 0m44.752s 0m41.004s > 384 1m2.884s0m58.023s > > This should fix https://bugzilla.redhat.com/show_bug.cgi?id=1927108 > which reported the issue for virtio-scsi-pci. > > Changes since RFC: > > As suggested by Stefan, splimplify the code by directly beginning and > committing the memory transaction from the device model, without all > the virtio specific proxying code and no changes needed in the memory > subsystem. > > Greg Kurz (4): > virtio-blk: Fix rollback path in virtio_blk_data_plane_start() > virtio-blk: Configure all host notifiers in a single MR transaction > virtio-scsi: Set host notifiers and callbacks separately > virtio-scsi: Configure all host notifiers in a single MR transaction > > hw/block/dataplane/virtio-blk.c | 36 +++-- > hw/scsi/virtio-scsi-dataplane.c | 56 ++--- > 2 files changed, 72 insertions(+), 20 deletions(-) Tagged for 6.1, thanks! > -- > 2.26.3 >
[PATCH 09/14] hw/block/nvme: add metadata offset helper
From: Klaus Jensen Add an nvme_moff() helper. Signed-off-by: Klaus Jensen --- hw/block/nvme.h | 7 ++- hw/block/nvme-dif.c | 4 ++-- hw/block/nvme-ns.c | 2 +- hw/block/nvme.c | 12 ++-- 4 files changed, 15 insertions(+), 10 deletions(-) diff --git a/hw/block/nvme.h b/hw/block/nvme.h index dc065e57b509..9349d1c33ad7 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -107,7 +107,7 @@ typedef struct NvmeNamespace { BlockConfblkconf; int32_t bootindex; int64_t size; -int64_t mdata_offset; +int64_t moff; NvmeIdNs id_ns; NvmeLBAF lbaf; size_t lbasz; @@ -158,6 +158,11 @@ static inline size_t nvme_m2b(NvmeNamespace *ns, uint64_t lba) return ns->lbaf.ms * lba; } +static inline int64_t nvme_moff(NvmeNamespace *ns, uint64_t lba) +{ +return ns->moff + nvme_m2b(ns, lba); +} + static inline bool nvme_ns_ext(NvmeNamespace *ns) { return !!NVME_ID_NS_FLBAS_EXTENDED(ns->id_ns.flbas); diff --git a/hw/block/nvme-dif.c b/hw/block/nvme-dif.c index c72e43195abf..88efcbe9bd60 100644 --- a/hw/block/nvme-dif.c +++ b/hw/block/nvme-dif.c @@ -306,7 +306,7 @@ static void nvme_dif_rw_mdata_in_cb(void *opaque, int ret) uint64_t slba = le64_to_cpu(rw->slba); uint32_t nlb = le16_to_cpu(rw->nlb) + 1; size_t mlen = nvme_m2b(ns, nlb); -uint64_t offset = ns->mdata_offset + nvme_m2b(ns, slba); +uint64_t offset = nvme_moff(ns, slba); BlockBackend *blk = ns->blkconf.blk; trace_pci_nvme_dif_rw_mdata_in_cb(nvme_cid(req), blk_name(blk)); @@ -335,7 +335,7 @@ static void nvme_dif_rw_mdata_out_cb(void *opaque, int ret) NvmeNamespace *ns = req->ns; NvmeRwCmd *rw = (NvmeRwCmd *)>cmd; uint64_t slba = le64_to_cpu(rw->slba); -uint64_t offset = ns->mdata_offset + nvme_m2b(ns, slba); +uint64_t offset = nvme_moff(ns, slba); BlockBackend *blk = ns->blkconf.blk; trace_pci_nvme_dif_rw_mdata_out_cb(nvme_cid(req), blk_name(blk)); diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c index 2224b497e4b5..84f602652354 100644 --- a/hw/block/nvme-ns.c +++ b/hw/block/nvme-ns.c @@ -42,7 +42,7 @@ void nvme_ns_init_format(NvmeNamespace *ns) id_ns->ncap = id_ns->nsze; id_ns->nuse = id_ns->ncap; -ns->mdata_offset = (int64_t)nlbas << ns->lbaf.ds; +ns->moff = (int64_t)nlbas << ns->lbaf.ds; npdg = ns->blkconf.discard_granularity / ns->lbasz; diff --git a/hw/block/nvme.c b/hw/block/nvme.c index fcc0fe72dc33..b0a6c1457a88 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -1827,7 +1827,7 @@ static void nvme_rw_cb(void *opaque, int ret) NvmeRwCmd *rw = (NvmeRwCmd *)>cmd; uint64_t slba = le64_to_cpu(rw->slba); uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1; -uint64_t offset = ns->mdata_offset + nvme_m2b(ns, slba); +uint64_t offset = nvme_moff(ns, slba); if (req->cmd.opcode == NVME_CMD_WRITE_ZEROES) { size_t mlen = nvme_m2b(ns, nlb); @@ -1993,7 +1993,7 @@ static void nvme_verify_mdata_in_cb(void *opaque, int ret) uint64_t slba = le64_to_cpu(rw->slba); uint32_t nlb = le16_to_cpu(rw->nlb) + 1; size_t mlen = nvme_m2b(ns, nlb); -uint64_t offset = ns->mdata_offset + nvme_m2b(ns, slba); +uint64_t offset = nvme_moff(ns, slba); BlockBackend *blk = ns->blkconf.blk; trace_pci_nvme_verify_mdata_in_cb(nvme_cid(req), blk_name(blk)); @@ -2096,7 +2096,7 @@ static void nvme_aio_zone_reset_cb(void *opaque, int ret) } if (ns->lbaf.ms) { -int64_t offset = ns->mdata_offset + nvme_m2b(ns, zone->d.zslba); +int64_t offset = nvme_moff(ns, zone->d.zslba); blk_aio_pwrite_zeroes(ns->blkconf.blk, offset, nvme_m2b(ns, ns->zone_size), BDRV_REQ_MAY_UNMAP, @@ -2167,7 +2167,7 @@ static void nvme_copy_cb(void *opaque, int ret) if (ns->lbaf.ms) { NvmeCopyCmd *copy = (NvmeCopyCmd *)>cmd; uint64_t sdlba = le64_to_cpu(copy->sdlba); -int64_t offset = ns->mdata_offset + nvme_m2b(ns, sdlba); +int64_t offset = nvme_moff(ns, sdlba); qemu_iovec_reset(>sg.iov); qemu_iovec_add(>sg.iov, ctx->mbounce, nvme_m2b(ns, ctx->nlb)); @@ -2463,7 +2463,7 @@ static void nvme_compare_data_cb(void *opaque, int ret) uint64_t slba = le64_to_cpu(rw->slba); uint32_t nlb = le16_to_cpu(rw->nlb) + 1; size_t mlen = nvme_m2b(ns, nlb); -uint64_t offset = ns->mdata_offset + nvme_m2b(ns, slba); +uint64_t offset = nvme_moff(ns, slba); ctx->mdata.bounce = g_malloc(mlen); @@ -2744,7 +2744,7 @@ static uint16_t nvme_copy(NvmeCtrl *n, NvmeRequest *req) if (ns->lbaf.ms) { len = nvme_m2b(ns, nlb); -offset = ns->mdata_offset + nvme_m2b(ns, slba); +offset = nvme_moff(ns, slba); in_ctx = g_new(struct nvme_copy_in_ctx, 1); in_ctx->req = req; -- 2.31.1
[PATCH 12/14] hw/block/nvme: remove irrelevant zone resource checks
From: Klaus Jensen It is not an error to report more active/open zones supported than the number of zones in the namespace. Signed-off-by: Klaus Jensen --- hw/block/nvme-ns.c | 13 - 1 file changed, 13 deletions(-) diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c index 84f602652354..2041d8138420 100644 --- a/hw/block/nvme-ns.c +++ b/hw/block/nvme-ns.c @@ -210,19 +210,6 @@ static int nvme_ns_zoned_check_calc_geometry(NvmeNamespace *ns, Error **errp) return -1; } -if (ns->params.max_open_zones > ns->num_zones) { -error_setg(errp, - "max_open_zones value %u exceeds the number of zones %u", - ns->params.max_open_zones, ns->num_zones); -return -1; -} -if (ns->params.max_active_zones > ns->num_zones) { -error_setg(errp, - "max_active_zones value %u exceeds the number of zones %u", - ns->params.max_active_zones, ns->num_zones); -return -1; -} - if (ns->params.max_active_zones) { if (ns->params.max_open_zones > ns->params.max_active_zones) { error_setg(errp, "max_open_zones (%u) exceeds max_active_zones (%u)", -- 2.31.1
[PATCH 08/14] hw/block/nvme: cache lba and ms sizes
From: Klaus Jensen There is no need to look up the lba size and metadata size in the LBA Format structure everytime we want to use it. And we use it a lot. Cache the values in the NvmeNamespace and update them if the namespace is formatted. Signed-off-by: Klaus Jensen --- hw/block/nvme.h | 37 --- hw/block/nvme-dif.c | 45 ++- hw/block/nvme-ns.c | 26 + hw/block/nvme.c | 47 ++--- 4 files changed, 56 insertions(+), 99 deletions(-) diff --git a/hw/block/nvme.h b/hw/block/nvme.h index d9bee7e5a05c..dc065e57b509 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -109,6 +109,8 @@ typedef struct NvmeNamespace { int64_t size; int64_t mdata_offset; NvmeIdNs id_ns; +NvmeLBAF lbaf; +size_t lbasz; const uint32_t *iocs; uint8_t csi; uint16_t status; @@ -146,36 +148,14 @@ static inline uint32_t nvme_nsid(NvmeNamespace *ns) return 0; } -static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns) -{ -NvmeIdNs *id_ns = >id_ns; -return _ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)]; -} - -static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns) -{ -return nvme_ns_lbaf(ns)->ds; -} - -/* convert an LBA to the equivalent in bytes */ static inline size_t nvme_l2b(NvmeNamespace *ns, uint64_t lba) { -return lba << nvme_ns_lbads(ns); -} - -static inline size_t nvme_lsize(NvmeNamespace *ns) -{ -return 1 << nvme_ns_lbads(ns); -} - -static inline uint16_t nvme_msize(NvmeNamespace *ns) -{ -return nvme_ns_lbaf(ns)->ms; +return lba << ns->lbaf.ds; } static inline size_t nvme_m2b(NvmeNamespace *ns, uint64_t lba) { -return nvme_msize(ns) * lba; +return ns->lbaf.ms * lba; } static inline bool nvme_ns_ext(NvmeNamespace *ns) @@ -183,15 +163,6 @@ static inline bool nvme_ns_ext(NvmeNamespace *ns) return !!NVME_ID_NS_FLBAS_EXTENDED(ns->id_ns.flbas); } -/* calculate the number of LBAs that the namespace can accomodate */ -static inline uint64_t nvme_ns_nlbas(NvmeNamespace *ns) -{ -if (nvme_msize(ns)) { -return ns->size / (nvme_lsize(ns) + nvme_msize(ns)); -} -return ns->size >> nvme_ns_lbads(ns); -} - static inline NvmeZoneState nvme_get_zone_state(NvmeZone *zone) { return zone->d.zs >> 4; diff --git a/hw/block/nvme-dif.c b/hw/block/nvme-dif.c index e269d275ebed..c72e43195abf 100644 --- a/hw/block/nvme-dif.c +++ b/hw/block/nvme-dif.c @@ -44,20 +44,18 @@ void nvme_dif_pract_generate_dif(NvmeNamespace *ns, uint8_t *buf, size_t len, uint32_t reftag) { uint8_t *end = buf + len; -size_t lsize = nvme_lsize(ns); -size_t msize = nvme_msize(ns); int16_t pil = 0; if (!(ns->id_ns.dps & NVME_ID_NS_DPS_FIRST_EIGHT)) { -pil = nvme_msize(ns) - sizeof(NvmeDifTuple); +pil = ns->lbaf.ms - sizeof(NvmeDifTuple); } -trace_pci_nvme_dif_pract_generate_dif(len, lsize, lsize + pil, apptag, - reftag); +trace_pci_nvme_dif_pract_generate_dif(len, ns->lbasz, ns->lbasz + pil, + apptag, reftag); -for (; buf < end; buf += lsize, mbuf += msize) { +for (; buf < end; buf += ns->lbasz, mbuf += ns->lbaf.ms) { NvmeDifTuple *dif = (NvmeDifTuple *)(mbuf + pil); -uint16_t crc = crc_t10dif(0x0, buf, lsize); +uint16_t crc = crc_t10dif(0x0, buf, ns->lbasz); if (pil) { crc = crc_t10dif(crc, mbuf, pil); @@ -98,7 +96,7 @@ static uint16_t nvme_dif_prchk(NvmeNamespace *ns, NvmeDifTuple *dif, } if (ctrl & NVME_RW_PRINFO_PRCHK_GUARD) { -uint16_t crc = crc_t10dif(0x0, buf, nvme_lsize(ns)); +uint16_t crc = crc_t10dif(0x0, buf, ns->lbasz); if (pil) { crc = crc_t10dif(crc, mbuf, pil); @@ -137,8 +135,6 @@ uint16_t nvme_dif_check(NvmeNamespace *ns, uint8_t *buf, size_t len, uint16_t appmask, uint32_t reftag) { uint8_t *end = buf + len; -size_t lsize = nvme_lsize(ns); -size_t msize = nvme_msize(ns); int16_t pil = 0; uint16_t status; @@ -148,12 +144,12 @@ uint16_t nvme_dif_check(NvmeNamespace *ns, uint8_t *buf, size_t len, } if (!(ns->id_ns.dps & NVME_ID_NS_DPS_FIRST_EIGHT)) { -pil = nvme_msize(ns) - sizeof(NvmeDifTuple); +pil = ns->lbaf.ms - sizeof(NvmeDifTuple); } -trace_pci_nvme_dif_check(NVME_RW_PRINFO(ctrl), lsize + pil); +trace_pci_nvme_dif_check(NVME_RW_PRINFO(ctrl), ns->lbasz + pil); -for (; buf < end; buf += lsize, mbuf += msize) { +for (; buf < end; buf += ns->lbasz, mbuf += ns->lbaf.ms) { NvmeDifTuple *dif = (NvmeDifTuple *)(mbuf + pil); status = nvme_dif_prchk(ns, dif, buf, mbuf, pil, ctrl, apptag, @@ -176,20 +172,18 @@ uint16_t nvme_dif_mangle_mdata(NvmeNamespace *ns, uint8_t
[PATCH 10/14] hw/block/nvme: streamline namespace array indexing
From: Klaus Jensen Streamline namespace array indexing such that both the subsystem and controller namespaces arrays are 1-indexed. Signed-off-by: Klaus Jensen --- hw/block/nvme.h | 4 ++-- hw/block/nvme.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/hw/block/nvme.h b/hw/block/nvme.h index 9349d1c33ad7..ac3f0a886735 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -438,7 +438,7 @@ typedef struct NvmeCtrl { NvmeSubsystem *subsys; NvmeNamespace namespace; -NvmeNamespace *namespaces[NVME_MAX_NAMESPACES]; +NvmeNamespace *namespaces[NVME_MAX_NAMESPACES + 1]; NvmeSQueue **sq; NvmeCQueue **cq; NvmeSQueue admin_sq; @@ -460,7 +460,7 @@ static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid) return NULL; } -return n->namespaces[nsid - 1]; +return n->namespaces[nsid]; } static inline NvmeCQueue *nvme_cq(NvmeRequest *req) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index b0a6c1457a88..0b96936129d6 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -4978,7 +4978,7 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req) return NVME_NS_NOT_ATTACHED | NVME_DNR; } -ctrl->namespaces[nsid - 1] = NULL; +ctrl->namespaces[nsid] = NULL; ns->attached--; nvme_update_dmrsl(ctrl); @@ -6151,7 +6151,7 @@ void nvme_attach_ns(NvmeCtrl *n, NvmeNamespace *ns) uint32_t nsid = ns->params.nsid; assert(nsid && nsid <= NVME_MAX_NAMESPACES); -n->namespaces[nsid - 1] = ns; +n->namespaces[nsid] = ns; ns->attached++; n->dmrsl = MIN_NON_ZERO(n->dmrsl, -- 2.31.1
[PATCH 06/14] hw/block/nvme: remove non-shared defines from header file
From: Klaus Jensen Remove non-shared defines from the shared header. Signed-off-by: Klaus Jensen --- hw/block/nvme.h| 2 -- hw/block/nvme-ns.c | 1 + hw/block/nvme.c| 1 + 3 files changed, 2 insertions(+), 2 deletions(-) diff --git a/hw/block/nvme.h b/hw/block/nvme.h index d9374d3e33e0..2c4e7b90fa54 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -24,8 +24,6 @@ #include "block/nvme.h" -#define NVME_DEFAULT_ZONE_SIZE (128 * MiB) -#define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB) #define NVME_MAX_CONTROLLERS 32 #define NVME_MAX_NAMESPACES 256 diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c index aae06987e49a..35c146633223 100644 --- a/hw/block/nvme-ns.c +++ b/hw/block/nvme-ns.c @@ -23,6 +23,7 @@ #include "trace.h" #define MIN_DISCARD_GRANULARITY (4 * KiB) +#define NVME_DEFAULT_ZONE_SIZE (128 * MiB) void nvme_ns_init_format(NvmeNamespace *ns) { diff --git a/hw/block/nvme.c b/hw/block/nvme.c index f7c5e83e6800..2c0af579e7a8 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -159,6 +159,7 @@ #define NVME_TEMPERATURE_WARNING 0x157 #define NVME_TEMPERATURE_CRITICAL 0x175 #define NVME_NUM_FW_SLOTS 1 +#define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB) #define NVME_GUEST_ERR(trace, fmt, ...) \ do { \ -- 2.31.1
[PATCH 07/14] hw/block/nvme: replace nvme_ns_status
From: Klaus Jensen The inline nvme_ns_status() helper only has a single call site. Remove it from the header file and inline it for real. Signed-off-by: Klaus Jensen --- hw/block/nvme.h | 5 - hw/block/nvme.c | 15 --- 2 files changed, 8 insertions(+), 12 deletions(-) diff --git a/hw/block/nvme.h b/hw/block/nvme.h index 2c4e7b90fa54..d9bee7e5a05c 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -137,11 +137,6 @@ typedef struct NvmeNamespace { } features; } NvmeNamespace; -static inline uint16_t nvme_ns_status(NvmeNamespace *ns) -{ -return ns->status; -} - static inline uint32_t nvme_nsid(NvmeNamespace *ns) { if (ns) { diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 2c0af579e7a8..bcef6038ae09 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -3594,8 +3594,8 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req) static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req) { +NvmeNamespace *ns; uint32_t nsid = le32_to_cpu(req->cmd.nsid); -uint16_t status; trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req), req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode)); @@ -3627,21 +3627,22 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req) return nvme_flush(n, req); } -req->ns = nvme_ns(n, nsid); -if (unlikely(!req->ns)) { +ns = nvme_ns(n, nsid); +if (unlikely(!ns)) { return NVME_INVALID_FIELD | NVME_DNR; } -if (!(req->ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) { +if (!(ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) { trace_pci_nvme_err_invalid_opc(req->cmd.opcode); return NVME_INVALID_OPCODE | NVME_DNR; } -status = nvme_ns_status(req->ns); -if (unlikely(status)) { -return status; +if (ns->status) { +return ns->status; } +req->ns = ns; + switch (req->cmd.opcode) { case NVME_CMD_WRITE_ZEROES: return nvme_write_zeroes(n, req); -- 2.31.1
[PATCH 04/14] hw/block/nvme: consolidate header files
From: Klaus Jensen In preparation for moving the nvme device into its own subtree, merge the header files into one. Also add missing copyright notice and add list of authors with substantial contributions. Signed-off-by: Klaus Jensen --- hw/block/nvme-dif.h| 63 --- hw/block/nvme-ns.h | 229 hw/block/nvme-subsys.h | 59 --- hw/block/nvme.h| 383 + hw/block/nvme-dif.c| 1 - hw/block/nvme-ns.c | 1 - hw/block/nvme-subsys.c | 1 - hw/block/nvme.c| 2 - 8 files changed, 348 insertions(+), 391 deletions(-) delete mode 100644 hw/block/nvme-dif.h delete mode 100644 hw/block/nvme-ns.h delete mode 100644 hw/block/nvme-subsys.h diff --git a/hw/block/nvme-dif.h b/hw/block/nvme-dif.h deleted file mode 100644 index 524faffbd7a0.. --- a/hw/block/nvme-dif.h +++ /dev/null @@ -1,63 +0,0 @@ -/* - * QEMU NVM Express End-to-End Data Protection support - * - * Copyright (c) 2021 Samsung Electronics Co., Ltd. - * - * Authors: - * Klaus Jensen - * Gollu Appalanaidu - */ - -#ifndef HW_NVME_DIF_H -#define HW_NVME_DIF_H - -/* from Linux kernel (crypto/crct10dif_common.c) */ -static const uint16_t t10_dif_crc_table[256] = { -0x, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B, -0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6, -0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6, -0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B, -0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1, -0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C, -0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C, -0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781, -0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8, -0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255, -0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925, -0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698, -0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472, -0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF, -0x25D4, 0xAE63, 0xB90D, 0x32BA, 0x97D1, 0x1C66, 0x0B08, 0x80BF, -0xCA69, 0x41DE, 0x56B0, 0xDD07, 0x786C, 0xF3DB, 0xE4B5, 0x6F02, -0x3AB1, 0xB106, 0xA668, 0x2DDF, 0x88B4, 0x0303, 0x146D, 0x9FDA, -0xD50C, 0x5EBB, 0x49D5, 0xC262, 0x6709, 0xECBE, 0xFBD0, 0x7067, -0x6E7C, 0xE5CB, 0xF2A5, 0x7912, 0xDC79, 0x57CE, 0x40A0, 0xCB17, -0x81C1, 0x0A76, 0x1D18, 0x96AF, 0x33C4, 0xB873, 0xAF1D, 0x24AA, -0x932B, 0x189C, 0x0FF2, 0x8445, 0x212E, 0xAA99, 0xBDF7, 0x3640, -0x7C96, 0xF721, 0xE04F, 0x6BF8, 0xCE93, 0x4524, 0x524A, 0xD9FD, -0xC7E6, 0x4C51, 0x5B3F, 0xD088, 0x75E3, 0xFE54, 0xE93A, 0x628D, -0x285B, 0xA3EC, 0xB482, 0x3F35, 0x9A5E, 0x11E9, 0x0687, 0x8D30, -0xE232, 0x6985, 0x7EEB, 0xF55C, 0x5037, 0xDB80, 0xCCEE, 0x4759, -0x0D8F, 0x8638, 0x9156, 0x1AE1, 0xBF8A, 0x343D, 0x2353, 0xA8E4, -0xB6FF, 0x3D48, 0x2A26, 0xA191, 0x04FA, 0x8F4D, 0x9823, 0x1394, -0x5942, 0xD2F5, 0xC59B, 0x4E2C, 0xEB47, 0x60F0, 0x779E, 0xFC29, -0x4BA8, 0xC01F, 0xD771, 0x5CC6, 0xF9AD, 0x721A, 0x6574, 0xEEC3, -0xA415, 0x2FA2, 0x38CC, 0xB37B, 0x1610, 0x9DA7, 0x8AC9, 0x017E, -0x1F65, 0x94D2, 0x83BC, 0x080B, 0xAD60, 0x26D7, 0x31B9, 0xBA0E, -0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3 -}; - -uint16_t nvme_check_prinfo(NvmeNamespace *ns, uint16_t ctrl, uint64_t slba, - uint32_t reftag); -uint16_t nvme_dif_mangle_mdata(NvmeNamespace *ns, uint8_t *mbuf, size_t mlen, - uint64_t slba); -void nvme_dif_pract_generate_dif(NvmeNamespace *ns, uint8_t *buf, size_t len, - uint8_t *mbuf, size_t mlen, uint16_t apptag, - uint32_t reftag); -uint16_t nvme_dif_check(NvmeNamespace *ns, uint8_t *buf, size_t len, -uint8_t *mbuf, size_t mlen, uint16_t ctrl, -uint64_t slba, uint16_t apptag, -uint16_t appmask, uint32_t reftag); -uint16_t nvme_dif_rw(NvmeCtrl *n, NvmeRequest *req); - -#endif /* HW_NVME_DIF_H */ diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h deleted file mode 100644 index fb0a41f912e7.. --- a/hw/block/nvme-ns.h +++ /dev/null @@ -1,229 +0,0 @@ -/* - * QEMU NVM Express Virtual Namespace - * - * Copyright (c) 2019 CNEX Labs - * Copyright (c) 2020 Samsung Electronics - * - * Authors: - * Klaus Jensen - * - * This work is licensed under the terms of the GNU GPL, version 2. See the - * COPYING file in the top-level directory. - * - */ - -#ifndef NVME_NS_H -#define NVME_NS_H - -#include "qemu/uuid.h" - -#define TYPE_NVME_NS "nvme-ns" -#define NVME_NS(obj) \ -OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS) - -typedef struct NvmeZone { -NvmeZoneDescr d; -uint64_t
[PATCH 05/14] hw/block/nvme: cleanup includes
From: Klaus Jensen Clean up includes. Signed-off-by: Klaus Jensen --- hw/block/nvme-dif.c| 7 +++ hw/block/nvme-ns.c | 11 ++- hw/block/nvme-subsys.c | 12 +--- hw/block/nvme.c| 22 +- 4 files changed, 15 insertions(+), 37 deletions(-) diff --git a/hw/block/nvme-dif.c b/hw/block/nvme-dif.c index 25e5a90854fa..e269d275ebed 100644 --- a/hw/block/nvme-dif.c +++ b/hw/block/nvme-dif.c @@ -9,12 +9,11 @@ */ #include "qemu/osdep.h" -#include "hw/block/block.h" -#include "sysemu/dma.h" -#include "sysemu/block-backend.h" #include "qapi/error.h" -#include "trace.h" +#include "sysemu/block-backend.h" + #include "nvme.h" +#include "trace.h" uint16_t nvme_check_prinfo(NvmeNamespace *ns, uint16_t ctrl, uint64_t slba, uint32_t reftag) diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c index b538c4c4a13c..aae06987e49a 100644 --- a/hw/block/nvme-ns.c +++ b/hw/block/nvme-ns.c @@ -14,20 +14,13 @@ #include "qemu/osdep.h" #include "qemu/units.h" -#include "qemu/cutils.h" -#include "qemu/log.h" #include "qemu/error-report.h" -#include "hw/block/block.h" -#include "hw/pci/pci.h" +#include "qapi/error.h" #include "sysemu/sysemu.h" #include "sysemu/block-backend.h" -#include "qapi/error.h" -#include "hw/qdev-properties.h" -#include "hw/qdev-core.h" - -#include "trace.h" #include "nvme.h" +#include "trace.h" #define MIN_DISCARD_GRANULARITY (4 * KiB) diff --git a/hw/block/nvme-subsys.c b/hw/block/nvme-subsys.c index b81067f7b0d3..192223d17ca1 100644 --- a/hw/block/nvme-subsys.c +++ b/hw/block/nvme-subsys.c @@ -6,19 +6,9 @@ * This code is licensed under the GNU GPL v2. Refer COPYING. */ -#include "qemu/units.h" #include "qemu/osdep.h" -#include "qemu/uuid.h" -#include "qemu/iov.h" -#include "qemu/cutils.h" #include "qapi/error.h" -#include "hw/qdev-properties.h" -#include "hw/qdev-core.h" -#include "hw/block/block.h" -#include "block/aio.h" -#include "block/accounting.h" -#include "sysemu/sysemu.h" -#include "hw/pci/pci.h" + #include "nvme.h" int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index c51de480d9fd..f7c5e83e6800 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -135,24 +135,20 @@ */ #include "qemu/osdep.h" -#include "qemu/units.h" +#include "qemu/cutils.h" #include "qemu/error-report.h" -#include "hw/block/block.h" -#include "hw/pci/msix.h" -#include "hw/pci/pci.h" -#include "hw/qdev-properties.h" -#include "migration/vmstate.h" -#include "sysemu/sysemu.h" +#include "qemu/log.h" +#include "qemu/units.h" #include "qapi/error.h" #include "qapi/visitor.h" -#include "sysemu/hostmem.h" +#include "sysemu/sysemu.h" #include "sysemu/block-backend.h" -#include "exec/memory.h" -#include "qemu/log.h" -#include "qemu/module.h" -#include "qemu/cutils.h" -#include "trace.h" +#include "sysemu/hostmem.h" +#include "hw/pci/msix.h" +#include "migration/vmstate.h" + #include "nvme.h" +#include "trace.h" #define NVME_MAX_IOQPAIRS 0x #define NVME_DB_SIZE 4 -- 2.31.1
[PATCH 03/14] hw/block/nvme: rename __nvme_select_ns_iocs
From: Klaus Jensen Get rid of the (reserved) double underscore use. Cc: Philippe Mathieu-Daudé Cc: Thomas Huth Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 47 +++ 1 file changed, 23 insertions(+), 24 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index d1b94e36c6fb..f8209a92302b 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -4916,7 +4916,25 @@ static void nvme_update_dmrsl(NvmeCtrl *n) } } -static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns); +static void nvme_select_iocs_ns(NvmeCtrl *n, NvmeNamespace *ns) +{ +ns->iocs = nvme_cse_iocs_none; +switch (ns->csi) { +case NVME_CSI_NVM: +if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) { +ns->iocs = nvme_cse_iocs_nvm; +} +break; +case NVME_CSI_ZONED: +if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_CSI) { +ns->iocs = nvme_cse_iocs_zoned; +} else if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_NVM) { +ns->iocs = nvme_cse_iocs_nvm; +} +break; +} +} + static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req) { NvmeNamespace *ns; @@ -4967,7 +4985,7 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req) } nvme_attach_ns(ctrl, ns); -__nvme_select_ns_iocs(ctrl, ns); +nvme_select_iocs_ns(ctrl, ns); } else { if (!nvme_ns(ctrl, nsid)) { return NVME_NS_NOT_ATTACHED | NVME_DNR; @@ -5268,26 +5286,7 @@ static void nvme_ctrl_shutdown(NvmeCtrl *n) } } -static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns) -{ -ns->iocs = nvme_cse_iocs_none; -switch (ns->csi) { -case NVME_CSI_NVM: -if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) { -ns->iocs = nvme_cse_iocs_nvm; -} -break; -case NVME_CSI_ZONED: -if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_CSI) { -ns->iocs = nvme_cse_iocs_zoned; -} else if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_NVM) { -ns->iocs = nvme_cse_iocs_nvm; -} -break; -} -} - -static void nvme_select_ns_iocs(NvmeCtrl *n) +static void nvme_select_iocs(NvmeCtrl *n) { NvmeNamespace *ns; int i; @@ -5298,7 +5297,7 @@ static void nvme_select_ns_iocs(NvmeCtrl *n) continue; } -__nvme_select_ns_iocs(n, ns); +nvme_select_iocs_ns(n, ns); } } @@ -5400,7 +5399,7 @@ static int nvme_start_ctrl(NvmeCtrl *n) QTAILQ_INIT(>aer_queue); -nvme_select_ns_iocs(n); +nvme_select_iocs(n); return 0; } -- 2.31.1
[PATCH 00/14] hw(/block/)nvme: spring cleaning
From: Klaus Jensen This series consists of various clean up patches. The final patch moves nvme emulation from hw/block to hw/nvme. Klaus Jensen (14): hw/block/nvme: rename __nvme_zrm_open hw/block/nvme: rename __nvme_advance_zone_wp hw/block/nvme: rename __nvme_select_ns_iocs hw/block/nvme: consolidate header files hw/block/nvme: cleanup includes hw/block/nvme: remove non-shared defines from header file hw/block/nvme: replace nvme_ns_status hw/block/nvme: cache lba and ms sizes hw/block/nvme: add metadata offset helper hw/block/nvme: streamline namespace array indexing hw/block/nvme: remove num_namespaces member hw/block/nvme: remove irrelevant zone resource checks hw/block/nvme: move zoned constraints checks hw/nvme: move nvme emulation out of hw/block meson.build | 1 + hw/block/nvme-dif.h | 63 --- hw/block/nvme-ns.h| 229 - hw/block/nvme-subsys.h| 59 --- hw/block/nvme.h | 266 --- hw/nvme/nvme.h| 547 ++ hw/nvme/trace.h | 1 + hw/{block/nvme.c => nvme/ctrl.c} | 204 hw/{block/nvme-dif.c => nvme/dif.c} | 57 +-- hw/{block/nvme-ns.c => nvme/ns.c} | 104 ++-- hw/{block/nvme-subsys.c => nvme/subsys.c} | 13 +- MAINTAINERS | 2 +- hw/Kconfig| 1 + hw/block/Kconfig | 5 - hw/block/meson.build | 1 - hw/block/trace-events | 206 hw/meson.build| 1 + hw/nvme/Kconfig | 4 + hw/nvme/meson.build | 1 + hw/nvme/trace-events | 204 20 files changed, 928 insertions(+), 1041 deletions(-) delete mode 100644 hw/block/nvme-dif.h delete mode 100644 hw/block/nvme-ns.h delete mode 100644 hw/block/nvme-subsys.h delete mode 100644 hw/block/nvme.h create mode 100644 hw/nvme/nvme.h create mode 100644 hw/nvme/trace.h rename hw/{block/nvme.c => nvme/ctrl.c} (98%) rename hw/{block/nvme-dif.c => nvme/dif.c} (90%) rename hw/{block/nvme-ns.c => nvme/ns.c} (87%) rename hw/{block/nvme-subsys.c => nvme/subsys.c} (85%) create mode 100644 hw/nvme/Kconfig create mode 100644 hw/nvme/meson.build create mode 100644 hw/nvme/trace-events -- 2.31.1
[PATCH 14/14] hw/nvme: move nvme emulation out of hw/block
From: Klaus Jensen With the introduction of the nvme-subsystem device we are really cluttering up the hw/block directory. As suggested by Philippe previously, move the nvme emulation to hw/nvme. Suggested-by: Philippe Mathieu-Daudé Signed-off-by: Klaus Jensen --- meson.build | 1 + hw/{block => nvme}/nvme.h | 6 +- hw/nvme/trace.h | 1 + hw/{block/nvme.c => nvme/ctrl.c} | 0 hw/{block/nvme-dif.c => nvme/dif.c} | 0 hw/{block/nvme-ns.c => nvme/ns.c} | 0 hw/{block/nvme-subsys.c => nvme/subsys.c} | 0 MAINTAINERS | 2 +- hw/Kconfig| 1 + hw/block/Kconfig | 5 - hw/block/meson.build | 1 - hw/block/trace-events | 206 -- hw/meson.build| 1 + hw/nvme/Kconfig | 4 + hw/nvme/meson.build | 1 + hw/nvme/trace-events | 204 + 16 files changed, 217 insertions(+), 216 deletions(-) rename hw/{block => nvme}/nvme.h (99%) create mode 100644 hw/nvme/trace.h rename hw/{block/nvme.c => nvme/ctrl.c} (100%) rename hw/{block/nvme-dif.c => nvme/dif.c} (100%) rename hw/{block/nvme-ns.c => nvme/ns.c} (100%) rename hw/{block/nvme-subsys.c => nvme/subsys.c} (100%) create mode 100644 hw/nvme/Kconfig create mode 100644 hw/nvme/meson.build create mode 100644 hw/nvme/trace-events diff --git a/meson.build b/meson.build index c6f4b0cf5e8a..59354cd53c97 100644 --- a/meson.build +++ b/meson.build @@ -1811,6 +1811,7 @@ if have_system 'hw/misc/macio', 'hw/net', 'hw/net/can', +'hw/nvme', 'hw/nvram', 'hw/pci', 'hw/pci-host', diff --git a/hw/block/nvme.h b/hw/nvme/nvme.h similarity index 99% rename from hw/block/nvme.h rename to hw/nvme/nvme.h index fb028d81d16f..81a35cda142b 100644 --- a/hw/block/nvme.h +++ b/hw/nvme/nvme.h @@ -15,8 +15,8 @@ * This code is licensed under the GNU GPL v2 or later. */ -#ifndef HW_NVME_H -#define HW_NVME_H +#ifndef HW_NVME_INTERNAL_H +#define HW_NVME_INTERNAL_H #include "qemu/uuid.h" #include "hw/pci/pci.h" @@ -544,4 +544,4 @@ uint16_t nvme_dif_check(NvmeNamespace *ns, uint8_t *buf, size_t len, uint16_t nvme_dif_rw(NvmeCtrl *n, NvmeRequest *req); -#endif /* HW_NVME_H */ +#endif /* HW_NVME_INTERNAL_H */ diff --git a/hw/nvme/trace.h b/hw/nvme/trace.h new file mode 100644 index ..b398ea107f59 --- /dev/null +++ b/hw/nvme/trace.h @@ -0,0 +1 @@ +#include "trace/trace-hw_nvme.h" diff --git a/hw/block/nvme.c b/hw/nvme/ctrl.c similarity index 100% rename from hw/block/nvme.c rename to hw/nvme/ctrl.c diff --git a/hw/block/nvme-dif.c b/hw/nvme/dif.c similarity index 100% rename from hw/block/nvme-dif.c rename to hw/nvme/dif.c diff --git a/hw/block/nvme-ns.c b/hw/nvme/ns.c similarity index 100% rename from hw/block/nvme-ns.c rename to hw/nvme/ns.c diff --git a/hw/block/nvme-subsys.c b/hw/nvme/subsys.c similarity index 100% rename from hw/block/nvme-subsys.c rename to hw/nvme/subsys.c diff --git a/MAINTAINERS b/MAINTAINERS index 36055f14c594..986e0e590de6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1972,7 +1972,7 @@ M: Keith Busch M: Klaus Jensen L: qemu-block@nongnu.org S: Supported -F: hw/block/nvme* +F: hw/nvme/* F: include/block/nvme.h F: tests/qtest/nvme-test.c F: docs/system/nvme.rst diff --git a/hw/Kconfig b/hw/Kconfig index ff40bd3f7bb7..61d0e654eb02 100644 --- a/hw/Kconfig +++ b/hw/Kconfig @@ -21,6 +21,7 @@ source mem/Kconfig source misc/Kconfig source net/Kconfig source nubus/Kconfig +source nvme/Kconfig source nvram/Kconfig source pci-bridge/Kconfig source pci-host/Kconfig diff --git a/hw/block/Kconfig b/hw/block/Kconfig index 4fcd15216684..295441e64ab4 100644 --- a/hw/block/Kconfig +++ b/hw/block/Kconfig @@ -25,11 +25,6 @@ config ONENAND config TC58128 bool -config NVME_PCI -bool -default y if PCI_DEVICES -depends on PCI - config VIRTIO_BLK bool default y diff --git a/hw/block/meson.build b/hw/block/meson.build index 5b4a7699f98f..8b0de54db1fc 100644 --- a/hw/block/meson.build +++ b/hw/block/meson.build @@ -13,7 +13,6 @@ softmmu_ss.add(when: 'CONFIG_SSI_M25P80', if_true: files('m25p80.c')) softmmu_ss.add(when: 'CONFIG_SWIM', if_true: files('swim.c')) softmmu_ss.add(when: 'CONFIG_XEN', if_true: files('xen-block.c')) softmmu_ss.add(when: 'CONFIG_TC58128', if_true: files('tc58128.c')) -softmmu_ss.add(when: 'CONFIG_NVME_PCI', if_true: files('nvme.c', 'nvme-ns.c', 'nvme-subsys.c', 'nvme-dif.c')) specific_ss.add(when: 'CONFIG_VIRTIO_BLK', if_true: files('virtio-blk.c')) specific_ss.add(when: 'CONFIG_VHOST_USER_BLK', if_true: files('vhost-user-blk.c')) diff --git a/hw/block/trace-events b/hw/block/trace-events index fa12e3a67a75..646917d045f7 100644 --- a/hw/block/trace-events +++ b/hw/block/trace-events
[PATCH 11/14] hw/block/nvme: remove num_namespaces member
From: Klaus Jensen The NvmeCtrl num_namespaces member is just an indirection for the NVME_MAX_NAMESPACES constant. Remove the indirection. Signed-off-by: Klaus Jensen --- hw/block/nvme.h | 1 - hw/block/nvme.c | 30 +++--- 2 files changed, 15 insertions(+), 16 deletions(-) diff --git a/hw/block/nvme.h b/hw/block/nvme.h index ac3f0a886735..fb028d81d16f 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -401,7 +401,6 @@ typedef struct NvmeCtrl { uint16_tcqe_size; uint16_tsqe_size; uint32_treg_size; -uint32_tnum_namespaces; uint32_tmax_q_ents; uint8_t outstanding_aers; uint32_tirq_status; diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 0b96936129d6..60424d9b19ea 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -382,7 +382,8 @@ static int nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size) static bool nvme_nsid_valid(NvmeCtrl *n, uint32_t nsid) { -return nsid && (nsid == NVME_NSID_BROADCAST || nsid <= n->num_namespaces); +return nsid && +(nsid == NVME_NSID_BROADCAST || nsid <= NVME_MAX_NAMESPACES); } static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid) @@ -2865,7 +2866,7 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req) /* 1-initialize; see comment in nvme_dsm */ *num_flushes = 1; -for (int i = 1; i <= n->num_namespaces; i++) { +for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { continue; @@ -3835,7 +3836,7 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len, } else { int i; -for (i = 1; i <= n->num_namespaces; i++) { +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { continue; @@ -4332,7 +4333,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req, return NVME_INVALID_NSID | NVME_DNR; } -for (i = 1; i <= n->num_namespaces; i++) { +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { if (!active) { @@ -4380,7 +4381,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req, return NVME_INVALID_FIELD | NVME_DNR; } -for (i = 1; i <= n->num_namespaces; i++) { +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { if (!active) { @@ -4646,7 +4647,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req) goto out; case NVME_VOLATILE_WRITE_CACHE: result = 0; -for (i = 1; i <= n->num_namespaces; i++) { +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { continue; @@ -4796,7 +4797,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req) break; case NVME_ERROR_RECOVERY: if (nsid == NVME_NSID_BROADCAST) { -for (i = 1; i <= n->num_namespaces; i++) { +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { @@ -4817,7 +4818,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req) } break; case NVME_VOLATILE_WRITE_CACHE: -for (i = 1; i <= n->num_namespaces; i++) { +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { continue; @@ -5110,7 +5111,7 @@ static uint16_t nvme_format(NvmeCtrl *n, NvmeRequest *req) req->status = status; } } else { -for (i = 1; i <= n->num_namespaces; i++) { +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { continue; @@ -5221,7 +5222,7 @@ static void nvme_ctrl_reset(NvmeCtrl *n) NvmeNamespace *ns; int i; -for (i = 1; i <= n->num_namespaces; i++) { +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { continue; @@ -5263,7 +5264,7 @@ static void nvme_ctrl_shutdown(NvmeCtrl *n) memory_region_msync(>pmr.dev->mr, 0, n->pmr.dev->size); } -for (i = 1; i <= n->num_namespaces; i++) { +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { continue; @@ -5278,7 +5279,7 @@ static void nvme_select_iocs(NvmeCtrl *n) NvmeNamespace *ns; int i; -for (i = 1; i <= n->num_namespaces; i++) { +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { ns = nvme_ns(n, i); if (!ns) { continue; @@ -5905,7 +5906,6 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) static void nvme_init_state(NvmeCtrl *n) { -n->num_namespaces = NVME_MAX_NAMESPACES; /* add one to max_ioqpairs to account for the admin queue pair */
[PATCH 13/14] hw/block/nvme: move zoned constraints checks
From: Klaus Jensen Validation of the max_active and max_open zoned parameters are independent of any other state, so move them to the early nvme_ns_check_constraints parameter checks. Signed-off-by: Klaus Jensen --- hw/block/nvme-ns.c | 52 +- 1 file changed, 28 insertions(+), 24 deletions(-) diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c index 2041d8138420..861b87f22bd8 100644 --- a/hw/block/nvme-ns.c +++ b/hw/block/nvme-ns.c @@ -210,30 +210,6 @@ static int nvme_ns_zoned_check_calc_geometry(NvmeNamespace *ns, Error **errp) return -1; } -if (ns->params.max_active_zones) { -if (ns->params.max_open_zones > ns->params.max_active_zones) { -error_setg(errp, "max_open_zones (%u) exceeds max_active_zones (%u)", - ns->params.max_open_zones, ns->params.max_active_zones); -return -1; -} - -if (!ns->params.max_open_zones) { -ns->params.max_open_zones = ns->params.max_active_zones; -} -} - -if (ns->params.zd_extension_size) { -if (ns->params.zd_extension_size & 0x3f) { -error_setg(errp, -"zone descriptor extension size must be a multiple of 64B"); -return -1; -} -if ((ns->params.zd_extension_size >> 6) > 0xff) { -error_setg(errp, "zone descriptor extension size is too large"); -return -1; -} -} - return 0; } @@ -403,6 +379,34 @@ static int nvme_ns_check_constraints(NvmeCtrl *n, NvmeNamespace *ns, } } +if (ns->params.zoned) { +if (ns->params.max_active_zones) { +if (ns->params.max_open_zones > ns->params.max_active_zones) { +error_setg(errp, "max_open_zones (%u) exceeds " + "max_active_zones (%u)", ns->params.max_open_zones, + ns->params.max_active_zones); +return -1; +} + +if (!ns->params.max_open_zones) { +ns->params.max_open_zones = ns->params.max_active_zones; +} +} + +if (ns->params.zd_extension_size) { +if (ns->params.zd_extension_size & 0x3f) { +error_setg(errp, "zone descriptor extension size must be a " + "multiple of 64B"); +return -1; +} +if ((ns->params.zd_extension_size >> 6) > 0xff) { +error_setg(errp, + "zone descriptor extension size is too large"); +return -1; +} +} +} + return 0; } -- 2.31.1
[PATCH 01/14] hw/block/nvme: rename __nvme_zrm_open
From: Klaus Jensen Get rid of the (reserved) double underscore use. Cc: Philippe Mathieu-Daudé Cc: Thomas Huth Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 16 ++-- 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 624a1431d072..002c0672b397 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -1682,8 +1682,12 @@ static void nvme_zrm_auto_transition_zone(NvmeNamespace *ns) } } -static uint16_t __nvme_zrm_open(NvmeNamespace *ns, NvmeZone *zone, -bool implicit) +enum { +NVME_ZRM_AUTO = 1 << 0, +}; + +static uint16_t nvme_zrm_open_flags(NvmeNamespace *ns, NvmeZone *zone, +int flags) { int act = 0; uint16_t status; @@ -1707,7 +1711,7 @@ static uint16_t __nvme_zrm_open(NvmeNamespace *ns, NvmeZone *zone, nvme_aor_inc_open(ns); -if (implicit) { +if (flags & NVME_ZRM_AUTO) { nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_IMPLICITLY_OPEN); return NVME_SUCCESS; } @@ -1715,7 +1719,7 @@ static uint16_t __nvme_zrm_open(NvmeNamespace *ns, NvmeZone *zone, /* fallthrough */ case NVME_ZONE_STATE_IMPLICITLY_OPEN: -if (implicit) { +if (flags & NVME_ZRM_AUTO) { return NVME_SUCCESS; } @@ -1733,12 +1737,12 @@ static uint16_t __nvme_zrm_open(NvmeNamespace *ns, NvmeZone *zone, static inline uint16_t nvme_zrm_auto(NvmeNamespace *ns, NvmeZone *zone) { -return __nvme_zrm_open(ns, zone, true); +return nvme_zrm_open_flags(ns, zone, NVME_ZRM_AUTO); } static inline uint16_t nvme_zrm_open(NvmeNamespace *ns, NvmeZone *zone) { -return __nvme_zrm_open(ns, zone, false); +return nvme_zrm_open_flags(ns, zone, 0); } static void __nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone, -- 2.31.1
[PATCH 02/14] hw/block/nvme: rename __nvme_advance_zone_wp
From: Klaus Jensen Get rid of the (reserved) double underscore use. Cc: Philippe Mathieu-Daudé Cc: Thomas Huth Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 002c0672b397..d1b94e36c6fb 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -1745,8 +1745,8 @@ static inline uint16_t nvme_zrm_open(NvmeNamespace *ns, NvmeZone *zone) return nvme_zrm_open_flags(ns, zone, 0); } -static void __nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone, - uint32_t nlb) +static void nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone, + uint32_t nlb) { zone->d.wp += nlb; @@ -1766,7 +1766,7 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req) nlb = le16_to_cpu(rw->nlb) + 1; zone = nvme_get_zone_by_slba(ns, slba); -__nvme_advance_zone_wp(ns, zone, nlb); +nvme_advance_zone_wp(ns, zone, nlb); } static inline bool nvme_is_write(NvmeRequest *req) @@ -2155,7 +2155,7 @@ out: uint64_t sdlba = le64_to_cpu(copy->sdlba); NvmeZone *zone = nvme_get_zone_by_slba(ns, sdlba); -__nvme_advance_zone_wp(ns, zone, ctx->nlb); +nvme_advance_zone_wp(ns, zone, ctx->nlb); } g_free(ctx->bounce); -- 2.31.1
Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection
Von meinem iPhone gesendet > Am 19.04.2021 um 14:31 schrieb Kevin Wolf : > > Am 19.04.2021 um 11:13 hat Peter Lieven geschrieben: >> >> Am 19.04.2021 um 10:36 schrieb Peter Lieven : >>> >>> >>> Am 15.04.2021 um 17:22 schrieb Kevin Wolf : Peter, three years ago you changed 'qemu-img convert' to sacrifice some sparsification in order to get aligned requests on the target image. At the time, I thought the impact would be small, but it turns out that this can end up wasting gigabytes of storagee (like converting a fully zeroed 10 GB image taking 2.8 GB instead of a few kilobytes). https://bugzilla.redhat.com/show_bug.cgi?id=1882917 I'm not entirely sure how to attack this best since this is a tradeoff, but maybe the approach in this series is still good enough for the case that you wanted to fix back then? Of course, it would be possible to have a more complete fix like looking forward a few blocks more before writing data, but that would probably not be entirely trivial because you would have to merge blocks with ZERO block status with DATA blocks that contain only zeros. I'm not sure if it's worth this complication of the code. >>> >>> I will try to look into this asap. >> >> Besides from the reproducer described in the ticket, I retried my old >> conversion test in our environment: >> >> Before commit 8dcd3c9b91: reads 4608 writes 14959 >> After commit 8dcd3c9b91: reads 0 writes 14924 >> With Kevins patch: reads 110 writes 14924 >> >> I think this is a good result if it avoids other issues. > > Sounds like a promising way to make the tradeoff. Thanks for testing! is this sth for 6.0-rc4? Peter
Re: [for-6.1 0/4] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci
Ping ? On Wed, 7 Apr 2021 16:34:57 +0200 Greg Kurz wrote: > Now that virtio-scsi-pci and virtio-blk-pci map 1 virtqueue per vCPU, > a serious slow down may be observed on setups with a big enough number > of vCPUs. > > Exemple with a pseries guest on a bi-POWER9 socket system (128 HW threads): > > virtio-scsi virtio-blk > > 1 0m20.922s 0m21.346s > 2 0m21.230s 0m20.350s > 4 0m21.761s 0m20.997s > 8 0m22.770s 0m20.051s > 160m22.038s 0m19.994s > 320m22.928s 0m20.803s > 640m26.583s 0m22.953s > 128 0m41.273s 0m32.333s > 256 2m4.727s1m16.924s > 384 6m5.563s3m26.186s > > Both perf and gprof indicate that QEMU is hogging CPUs when setting up > the ioeventfds: > > 67.88% swapper [kernel.kallsyms] [k] power_pmu_enable > 9.47% qemu-kvm[kernel.kallsyms] [k] smp_call_function_single > 8.64% qemu-kvm[kernel.kallsyms] [k] power_pmu_enable > =>2.79% qemu-kvmqemu-kvm [.] memory_region_ioeventfd_before > =>2.12% qemu-kvmqemu-kvm [.] > address_space_update_ioeventfds > 0.56% kworker/8:0-mm [kernel.kallsyms] [k] smp_call_function_single > > address_space_update_ioeventfds() is called when committing an MR > transaction, i.e. for each ioeventfd with the current code base, > and it internally loops on all ioventfds: > > static void address_space_update_ioeventfds(AddressSpace *as) > { > [...] > FOR_EACH_FLAT_RANGE(fr, view) { > for (i = 0; i < fr->mr->ioeventfd_nb; ++i) { > > This means that the setup of ioeventfds for these devices has > quadratic time complexity. > > This series simply changes the device models to extend the transaction > to all virtqueueues, like already done in the past in the generic > code with 710fccf80d78 ("virtio: improve virtio devices initialization > time"). > > Only virtio-scsi and virtio-blk are covered here, but a similar change > might also be beneficial to other device types such as host-scsi-pci, > vhost-user-scsi-pci and vhost-user-blk-pci. > > virtio-scsi virtio-blk > > 1 0m21.271s 0m22.076s > 2 0m20.912s 0m19.716s > 4 0m20.508s 0m19.310s > 8 0m21.374s 0m20.273s > 160m21.559s 0m21.374s > 320m22.532s 0m21.271s > 640m26.550s 0m22.007s > 128 0m29.115s 0m27.446s > 256 0m44.752s 0m41.004s > 384 1m2.884s0m58.023s > > This should fix https://bugzilla.redhat.com/show_bug.cgi?id=1927108 > which reported the issue for virtio-scsi-pci. > > Changes since RFC: > > As suggested by Stefan, splimplify the code by directly beginning and > committing the memory transaction from the device model, without all > the virtio specific proxying code and no changes needed in the memory > subsystem. > > Greg Kurz (4): > virtio-blk: Fix rollback path in virtio_blk_data_plane_start() > virtio-blk: Configure all host notifiers in a single MR transaction > virtio-scsi: Set host notifiers and callbacks separately > virtio-scsi: Configure all host notifiers in a single MR transaction > > hw/block/dataplane/virtio-blk.c | 36 +++-- > hw/scsi/virtio-scsi-dataplane.c | 56 ++--- > 2 files changed, 72 insertions(+), 20 deletions(-) >
Re: [PATCH] block/file-posix: Fix problem with fallocate(PUNCH_HOLE) on GPFS
Am 19.04.2021 um 07:06 hat Thomas Huth geschrieben: > On 16/04/2021 22.34, Nir Soffer wrote: > > On Fri, Apr 16, 2021 at 8:23 AM Thomas Huth wrote: > > > > > > A customer reported that running > > > > > > qemu-img convert -t none -O qcow2 -f qcow2 input.qcow2 output.qcow2 > > > > > > fails for them with the following error message when the images are > > > stored on a GPFS file system: > > > > > > qemu-img: error while writing sector 0: Invalid argument > > > > > > After analyzing the strace output, it seems like the problem is in > > > handle_aiocb_write_zeroes(): The call to fallocate(FALLOC_FL_PUNCH_HOLE) > > > returns EINVAL, which can apparently happen if the file system has > > > a different idea of the granularity of the operation. It's arguably > > > a bug in GPFS, since the PUNCH_HOLE mode should not result in EINVAL > > > according to the man-page of fallocate(), but the file system is out > > > there in production and so we have to deal with it. In commit 294682cc3a > > > ("block: workaround for unaligned byte range in fallocate()") we also > > > already applied the a work-around for the same problem to the earlier > > > fallocate(FALLOC_FL_ZERO_RANGE) call, so do it now similar with the > > > PUNCH_HOLE call. > > > > > > Signed-off-by: Thomas Huth > > > --- > > > block/file-posix.c | 7 +++ > > > 1 file changed, 7 insertions(+) > > > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > > index 20e14f8e96..7a40428d52 100644 > > > --- a/block/file-posix.c > > > +++ b/block/file-posix.c > > > @@ -1675,6 +1675,13 @@ static int handle_aiocb_write_zeroes(void *opaque) > > > } > > > s->has_fallocate = false; > > > } else if (ret != -ENOTSUP) { > > > +if (ret == -EINVAL) { > > > +/* > > > + * File systems like GPFS do not like unaligned byte > > > ranges, > > > + * treat it like unsupported (so caller falls back to > > > pwrite) > > > + */ > > > +return -ENOTSUP; > > > > This skips the next fallback, using plain fallocate(0) if we write > > after the end of the file. Is this intended? > > > > We can treat the buggy EINVAL return value as "filesystem is buggy, > > let's not try other options", or "let's try the next option". Since falling > > back to actually writing zeroes is so much slower, I think it is better to > > try the next option. > > I just did the same work-around as in commit 294682cc3a7 ... so if we agree > to try the other options, too, we should change that spot, too... Yes, changing both places to fall back to the next option feels right to me. > However, what is not clear to me, how would you handle s->has_write_zeroes > and s->has_discard in such a case? Set them to "false"? ... but it could > still work for some blocks with different alignment ... but if we keep them > set to "true", the code tries again and again to call these ioctls, maybe > wasting other precious cycles for this? That it could still work for other requests is a good point. So I think EINVAL shouldn't disable s->has_*, but otherwise behave the same as ENOTSUP. You're right that we're potentially wasting cycles for trying unaligned requests again and again, but that they fail isn't our fault and the benefit of having efficient zero writes with aligned requests seems more important that losing a few cycles on unaligned requests. > Maybe we should do a different approach instead: In case we hit a EINVAL > here, print an error a la: > > error_report_once("You are running on a buggy file system, please complain > to the file system vendor"); > > and return -ENOTSUP ... then it's hopefully clear to the users why they are > getting a bad performance, and that they should complain to the file system > vendor instead to get their problem fixed. Sounds like a reasonable thing to do (probably in addition to the above) when we know that a file system bug prevents us from getting optimal performance. Kevin
Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection
Am 19.04.2021 um 11:13 hat Peter Lieven geschrieben: > > > > Am 19.04.2021 um 10:36 schrieb Peter Lieven : > > > > > > > >> Am 15.04.2021 um 17:22 schrieb Kevin Wolf : > >> > >> Peter, three years ago you changed 'qemu-img convert' to sacrifice some > >> sparsification in order to get aligned requests on the target image. At > >> the time, I thought the impact would be small, but it turns out that > >> this can end up wasting gigabytes of storagee (like converting a fully > >> zeroed 10 GB image taking 2.8 GB instead of a few kilobytes). > >> > >> https://bugzilla.redhat.com/show_bug.cgi?id=1882917 > >> > >> I'm not entirely sure how to attack this best since this is a tradeoff, > >> but maybe the approach in this series is still good enough for the case > >> that you wanted to fix back then? > >> > >> Of course, it would be possible to have a more complete fix like looking > >> forward a few blocks more before writing data, but that would probably > >> not be entirely trivial because you would have to merge blocks with ZERO > >> block status with DATA blocks that contain only zeros. I'm not sure if > >> it's worth this complication of the code. > > > > I will try to look into this asap. > > Besides from the reproducer described in the ticket, I retried my old > conversion test in our environment: > > Before commit 8dcd3c9b91: reads 4608 writes 14959 > After commit 8dcd3c9b91: reads 0 writes 14924 > With Kevins patch: reads 110 writes 14924 > > I think this is a good result if it avoids other issues. Sounds like a promising way to make the tradeoff. Thanks for testing! Kevin
Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection
Am 19.04.2021 um 10:36 hat Peter Lieven geschrieben: > > > > Am 15.04.2021 um 17:22 schrieb Kevin Wolf : > > > > Peter, three years ago you changed 'qemu-img convert' to sacrifice some > > sparsification in order to get aligned requests on the target image. At > > the time, I thought the impact would be small, but it turns out that > > this can end up wasting gigabytes of storagee (like converting a fully > > zeroed 10 GB image taking 2.8 GB instead of a few kilobytes). > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1882917 > > > > I'm not entirely sure how to attack this best since this is a tradeoff, > > but maybe the approach in this series is still good enough for the case > > that you wanted to fix back then? > > > > Of course, it would be possible to have a more complete fix like looking > > forward a few blocks more before writing data, but that would probably > > not be entirely trivial because you would have to merge blocks with ZERO > > block status with DATA blocks that contain only zeros. I'm not sure if > > it's worth this complication of the code. > > I will try to look into this asap. > > Is there a hint which FS I need to set the extent hint when creating > the raw image? I was not able to do that. Grepping the current kernel source, it seems extent size hints still work only on XFS. But I don't think it's necessary for reproducing this bug. In fact, disabling the extent size hint should cause a lot more fragmentation, which should make the problem more visible. Kevin
Re: [PATCH] hw/block/nvme: fix io-command set profile feature
On Apr 19 16:18, Gollu Appalanaidu wrote: Currently IO Command Set Profile feaure is supported, but feature support flag not set and this feature is changable add support for that. Remove filling default value of feature in CQE CDW0 with zero, since it fallbacks to default case and it is zero initialized, if feature default value not set explicitly. Signed-off-by: Gollu Appalanaidu --- hw/block/nvme.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 624a1431d0..b5d2c29fc4 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -185,6 +185,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = { [NVME_WRITE_ATOMICITY] = true, [NVME_ASYNCHRONOUS_EVENT_CONF] = true, [NVME_TIMESTAMP]= true, +[NVME_COMMAND_SET_PROFILE] = true, }; static const uint32_t nvme_feature_cap[NVME_FID_MAX] = { @@ -194,6 +195,7 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = { [NVME_NUMBER_OF_QUEUES] = NVME_FEAT_CAP_CHANGE, [NVME_ASYNCHRONOUS_EVENT_CONF] = NVME_FEAT_CAP_CHANGE, [NVME_TIMESTAMP]= NVME_FEAT_CAP_CHANGE, +[NVME_COMMAND_SET_PROFILE] = NVME_FEAT_CAP_CHANGE, }; static const uint32_t nvme_cse_acs[256] = { @@ -4707,9 +4709,6 @@ defaults: result |= NVME_INTVC_NOCOALESCING; } break; -case NVME_COMMAND_SET_PROFILE: -result = 0; -break; default: result = nvme_feature_default[fid]; break; -- 2.17.1 LGTM. Reviewed-by: Klaus Jensen signature.asc Description: PGP signature
[PATCH] hw/block/nvme: fix io-command set profile feature
Currently IO Command Set Profile feaure is supported, but feature support flag not set and this feature is changable add support for that. Remove filling default value of feature in CQE CDW0 with zero, since it fallbacks to default case and it is zero initialized, if feature default value not set explicitly. Signed-off-by: Gollu Appalanaidu --- hw/block/nvme.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 624a1431d0..b5d2c29fc4 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -185,6 +185,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = { [NVME_WRITE_ATOMICITY] = true, [NVME_ASYNCHRONOUS_EVENT_CONF] = true, [NVME_TIMESTAMP]= true, +[NVME_COMMAND_SET_PROFILE] = true, }; static const uint32_t nvme_feature_cap[NVME_FID_MAX] = { @@ -194,6 +195,7 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = { [NVME_NUMBER_OF_QUEUES] = NVME_FEAT_CAP_CHANGE, [NVME_ASYNCHRONOUS_EVENT_CONF] = NVME_FEAT_CAP_CHANGE, [NVME_TIMESTAMP]= NVME_FEAT_CAP_CHANGE, +[NVME_COMMAND_SET_PROFILE] = NVME_FEAT_CAP_CHANGE, }; static const uint32_t nvme_cse_acs[256] = { @@ -4707,9 +4709,6 @@ defaults: result |= NVME_INTVC_NOCOALESCING; } break; -case NVME_COMMAND_SET_PROFILE: -result = 0; -break; default: result = nvme_feature_default[fid]; break; -- 2.17.1
Re: [PATCH v3 21/33] qemu-socket: pass monitor link to socket_get_fd directly
19.04.2021 12:34, Daniel P. Berrangé wrote: On Fri, Apr 16, 2021 at 11:08:59AM +0300, Vladimir Sementsov-Ogievskiy wrote: Detecting monitor by current coroutine works bad when we are not in coroutine context. And that's exactly so in nbd reconnect code, where qio_channel_socket_connect_sync() is called from thread. Add a possibility to pass monitor by hand, to be used in the following commit. Signed-off-by: Vladimir Sementsov-Ogievskiy --- include/io/channel-socket.h| 20 include/qemu/sockets.h | 2 +- io/channel-socket.c| 17 + tests/unit/test-util-sockets.c | 16 util/qemu-sockets.c| 10 +- 5 files changed, 47 insertions(+), 18 deletions(-) diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h index e747e63514..6d0915420d 100644 --- a/include/io/channel-socket.h +++ b/include/io/channel-socket.h @@ -78,6 +78,23 @@ qio_channel_socket_new_fd(int fd, Error **errp); +/** + * qio_channel_socket_connect_sync_mon: + * @ioc: the socket channel object + * @addr: the address to connect to + * @mon: current monitor. If NULL, it will be detected by + * current coroutine. + * @errp: pointer to a NULL-initialized error object + * + * Attempt to connect to the address @addr. This method + * will run in the foreground so the caller will not regain + * execution control until the connection is established or + * an error occurs. + */ +int qio_channel_socket_connect_sync_mon(QIOChannelSocket *ioc, +SocketAddress *addr, +Monitor *mon, +Error **errp); I don't really like exposing the concept of the QEMU monitor in the IO layer APIs. IMHO these ought to remain completely separate subsystems from the API pov, and we ought to fix this problem by making monitor_cur() work better in the scenario required. Hmm.. I can add thread_mon hash-table to monitor/monitor.c, so we can set monitor for current thread in same way like for coroutine. And monitor_cur will look first at coroutine->monitor hash map and then to thread->monitor. Then, I'll pass needed monitor link to my specific thread, and thread will call monitor_set_cur_for_thread(), an then qio_channel_socket_connect_sync() will work correctly. David, Markus, is it OK? -- Best regards, Vladimir
[PATCH v2 6/7] hw/block/pflash_cfi02: Remove pflash_setup_mappings()
All boards calling pflash_cfi02_register() use nb_mappings=1, which does not do any mapping: $ git grep -wl pflash_cfi02_register hw/ hw/arm/xilinx_zynq.c hw/block/pflash_cfi02.c hw/lm32/lm32_boards.c hw/ppc/ppc405_boards.c hw/sh4/r2d.c We can remove this now unneeded code. Reviewed-by: David Gibson Signed-off-by: Philippe Mathieu-Daudé --- hw/block/pflash_cfi02.c | 35 ++- 1 file changed, 2 insertions(+), 33 deletions(-) diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c index 02c514fb6e0..6f4b3e3c3fe 100644 --- a/hw/block/pflash_cfi02.c +++ b/hw/block/pflash_cfi02.c @@ -75,7 +75,6 @@ struct PFlashCFI02 { uint32_t nb_blocs[PFLASH_MAX_ERASE_REGIONS]; uint32_t sector_len[PFLASH_MAX_ERASE_REGIONS]; uint32_t chip_len; -uint8_t mappings; uint8_t width; uint8_t be; int wcycle; /* if 0, the flash is read normally */ @@ -92,13 +91,6 @@ struct PFlashCFI02 { uint16_t unlock_addr1; uint8_t cfi_table[0x4d]; QEMUTimer timer; -/* - * The device replicates the flash memory across its memory space. Emulate - * that by having a container (.mem) filled with an array of aliases - * (.mem_mappings) pointing to the flash memory (.orig_mem). - */ -MemoryRegion mem; -MemoryRegion *mem_mappings;/* array; one per mapping */ MemoryRegion orig_mem; bool rom_mode; int read_counter; /* used for lazy switch-back to rom mode */ @@ -158,23 +150,6 @@ static inline void toggle_dq2(PFlashCFI02 *pfl) pfl->status ^= 0x04; } -/* - * Set up replicated mappings of the same region. - */ -static void pflash_setup_mappings(PFlashCFI02 *pfl) -{ -unsigned i; -hwaddr size = memory_region_size(>orig_mem); - -memory_region_init(>mem, OBJECT(pfl), "pflash", pfl->mappings * size); -pfl->mem_mappings = g_new(MemoryRegion, pfl->mappings); -for (i = 0; i < pfl->mappings; ++i) { -memory_region_init_alias(>mem_mappings[i], OBJECT(pfl), - "pflash-alias", >orig_mem, 0, size); -memory_region_add_subregion(>mem, i * size, >mem_mappings[i]); -} -} - static void pflash_reset_state_machine(PFlashCFI02 *pfl) { trace_pflash_reset(pfl->name); @@ -917,12 +892,7 @@ static void pflash_cfi02_realize(DeviceState *dev, Error **errp) pfl->sector_erase_map = bitmap_new(pfl->total_sectors); pfl->rom_mode = true; -if (pfl->mappings > 1) { -pflash_setup_mappings(pfl); -sysbus_init_mmio(SYS_BUS_DEVICE(dev), >mem); -} else { -sysbus_init_mmio(SYS_BUS_DEVICE(dev), >orig_mem); -} +sysbus_init_mmio(SYS_BUS_DEVICE(dev), >orig_mem); timer_init_ns(>timer, QEMU_CLOCK_VIRTUAL, pflash_timer, pfl); pfl->status = 0; @@ -950,7 +920,6 @@ static Property pflash_cfi02_properties[] = { DEFINE_PROP_UINT32("num-blocks3", PFlashCFI02, nb_blocs[3], 0), DEFINE_PROP_UINT32("sector-length3", PFlashCFI02, sector_len[3], 0), DEFINE_PROP_UINT8("width", PFlashCFI02, width, 0), -DEFINE_PROP_UINT8("mappings", PFlashCFI02, mappings, 0), DEFINE_PROP_UINT8("big-endian", PFlashCFI02, be, 0), DEFINE_PROP_UINT16("id0", PFlashCFI02, ident0, 0), DEFINE_PROP_UINT16("id1", PFlashCFI02, ident1, 0), @@ -1008,6 +977,7 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base, { DeviceState *dev = qdev_new(TYPE_PFLASH_CFI02); +assert(nb_mappings <= 1); if (blk) { qdev_prop_set_drive(dev, "drive", blk); } @@ -1015,7 +985,6 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base, qdev_prop_set_uint32(dev, "num-blocks", size / sector_len); qdev_prop_set_uint32(dev, "sector-length", sector_len); qdev_prop_set_uint8(dev, "width", width); -qdev_prop_set_uint8(dev, "mappings", nb_mappings); qdev_prop_set_uint8(dev, "big-endian", !!be); qdev_prop_set_uint16(dev, "id0", id0); qdev_prop_set_uint16(dev, "id1", id1); -- 2.26.3
[PATCH v2 5/7] hw/arm/digic: Map flash using memory_region_add_subregion_aliased()
Instead of using a device specific feature for mapping the flash memory multiple times over a wider region, use the generic memory_region_add_subregion_aliased() helper. There is no change in the memory layout. * before: $ qemu-system-arm -M canon-a1100 -S -monitor stdio QEMU 5.2.90 monitor - type 'help' for more information (qemu) info mtree address-space: memory - (prio 0, i/o): system -03ff (prio 0, ram): ram c021-c02100ff (prio 0, i/o): digic-timer c0210100-c02101ff (prio 0, i/o): digic-timer c0210200-c02102ff (prio 0, i/o): digic-timer c080-c0800017 (prio 0, i/o): digic-uart f800- (prio 0, i/o): pflash f800-f83f (prio 0, romd): alias pflash-alias @pflash -003f f840-f87f (prio 0, romd): alias pflash-alias @pflash -003f f880-f8bf (prio 0, romd): alias pflash-alias @pflash -003f ... ff40-ff7f (prio 0, romd): alias pflash-alias @pflash -003f ff80-ffbf (prio 0, romd): alias pflash-alias @pflash -003f ffc0- (prio 0, romd): alias pflash-alias @pflash -003f * after: (qemu) info mtree address-space: memory - (prio 0, i/o): system -03ff (prio 0, ram): ram c021-c02100ff (prio 0, i/o): digic-timer c0210100-c02101ff (prio 0, i/o): digic-timer c0210200-c02102ff (prio 0, i/o): digic-timer c080-c0800017 (prio 0, i/o): digic-uart f800- (prio 0, i/o): masked pflash [span of 4 MiB] f800-f83f (prio 0, romd): alias pflash [#0/32] @pflash -003f f840-f87f (prio 0, romd): alias pflash [#1/32] @pflash -003f f880-f8bf (prio 0, romd): alias pflash [#2/32] @pflash -003f ... ff40-ff7f (prio 0, romd): alias pflash [#29/32] @pflash -003f ff80-ffbf (prio 0, romd): alias pflash [#30/32] @pflash -003f ffc0- (prio 0, romd): alias pflash [#31/32] @pflash -003f Signed-off-by: Philippe Mathieu-Daudé --- hw/arm/digic_boards.c | 8 +--- hw/arm/Kconfig| 1 + 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/hw/arm/digic_boards.c b/hw/arm/digic_boards.c index fc4a671b2e1..293402b1240 100644 --- a/hw/arm/digic_boards.c +++ b/hw/arm/digic_boards.c @@ -128,8 +128,7 @@ static void digic4_add_k8p3215uqb_rom(DigicState *s, hwaddr addr, FLASH_K8P3215UQB_SIZE / FLASH_K8P3215UQB_SECTOR_SIZE); qdev_prop_set_uint32(dev, "sector-length", FLASH_K8P3215UQB_SECTOR_SIZE); qdev_prop_set_uint8(dev, "width", 4); /* 32-bit */ -qdev_prop_set_uint8(dev, "mappings", -DIGIC4_ROM_MAX_SIZE / FLASH_K8P3215UQB_SIZE); +qdev_prop_set_uint8(dev, "mappings", 0); qdev_prop_set_uint8(dev, "big-endian", 0); qdev_prop_set_uint16(dev, "id0", 0x00ec); qdev_prop_set_uint16(dev, "id1", 0x007e); @@ -140,7 +139,10 @@ static void digic4_add_k8p3215uqb_rom(DigicState *s, hwaddr addr, qdev_prop_set_string(dev, "name", "pflash"); sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), _fatal); -sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, addr); +memory_region_add_subregion_aliased(get_system_memory(), +addr, DIGIC4_ROM_MAX_SIZE, +sysbus_mmio_get_region(SYS_BUS_DEVICE(dev), 0), +FLASH_K8P3215UQB_SIZE); digic_load_rom(s, addr, FLASH_K8P3215UQB_SIZE, filename); } diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig index aa8553b3cd3..1a7b9724d6c 100644 --- a/hw/arm/Kconfig +++ b/hw/arm/Kconfig @@ -42,6 +42,7 @@ config DIGIC bool select PTIMER select PFLASH_CFI02 +select ALIASED_REGION config EXYNOS4 bool -- 2.26.3
[PATCH v2 7/7] hw/block/pflash_cfi02: Simplify pflash_cfi02_register() prototype
The previous commit removed the mapping code from TYPE_PFLASH_CFI02. pflash_cfi02_register() doesn't use the 'nb_mappings' argument anymore. Simply remove it to simplify. Reviewed-by: David Gibson Signed-off-by: Philippe Mathieu-Daudé --- include/hw/block/flash.h | 1 - hw/arm/digic_boards.c| 1 - hw/arm/musicpal.c| 1 - hw/arm/xilinx_zynq.c | 2 +- hw/block/pflash_cfi02.c | 3 +-- hw/lm32/lm32_boards.c| 4 ++-- hw/ppc/ppc405_boards.c | 6 +++--- hw/sh4/r2d.c | 2 +- 8 files changed, 8 insertions(+), 12 deletions(-) diff --git a/include/hw/block/flash.h b/include/hw/block/flash.h index 7dde0adcee7..0e5dd818a9d 100644 --- a/include/hw/block/flash.h +++ b/include/hw/block/flash.h @@ -36,7 +36,6 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base, hwaddr size, BlockBackend *blk, uint32_t sector_len, - int nb_mappings, int width, uint16_t id0, uint16_t id1, uint16_t id2, uint16_t id3, diff --git a/hw/arm/digic_boards.c b/hw/arm/digic_boards.c index 293402b1240..eb694c70d4c 100644 --- a/hw/arm/digic_boards.c +++ b/hw/arm/digic_boards.c @@ -128,7 +128,6 @@ static void digic4_add_k8p3215uqb_rom(DigicState *s, hwaddr addr, FLASH_K8P3215UQB_SIZE / FLASH_K8P3215UQB_SECTOR_SIZE); qdev_prop_set_uint32(dev, "sector-length", FLASH_K8P3215UQB_SECTOR_SIZE); qdev_prop_set_uint8(dev, "width", 4); /* 32-bit */ -qdev_prop_set_uint8(dev, "mappings", 0); qdev_prop_set_uint8(dev, "big-endian", 0); qdev_prop_set_uint16(dev, "id0", 0x00ec); qdev_prop_set_uint16(dev, "id1", 0x007e); diff --git a/hw/arm/musicpal.c b/hw/arm/musicpal.c index 7d1f2f3fb3f..e882e11df36 100644 --- a/hw/arm/musicpal.c +++ b/hw/arm/musicpal.c @@ -1657,7 +1657,6 @@ static void musicpal_init(MachineState *machine) qdev_prop_set_uint32(dev, "num-blocks", flash_size / sector_size); qdev_prop_set_uint32(dev, "sector-length", sector_size); qdev_prop_set_uint8(dev, "width", 2); /* 16-bit */ -qdev_prop_set_uint8(dev, "mappings", 0); qdev_prop_set_uint8(dev, "big-endian", 0); qdev_prop_set_uint16(dev, "id0", 0x00bf); qdev_prop_set_uint16(dev, "id1", 0x236d); diff --git a/hw/arm/xilinx_zynq.c b/hw/arm/xilinx_zynq.c index 8db6cfd47f5..d12b00e7648 100644 --- a/hw/arm/xilinx_zynq.c +++ b/hw/arm/xilinx_zynq.c @@ -220,7 +220,7 @@ static void zynq_init(MachineState *machine) pflash_cfi02_register(0xe200, "zynq.pflash", FLASH_SIZE, dinfo ? blk_by_legacy_dinfo(dinfo) : NULL, FLASH_SECTOR_SIZE, 1, - 1, 0x0066, 0x0022, 0x, 0x, 0x0555, 0x2aa, + 0x0066, 0x0022, 0x, 0x, 0x0555, 0x2aa, 0); /* Create the main clock source, and feed slcr with it */ diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c index 6f4b3e3c3fe..2b412402fac 100644 --- a/hw/block/pflash_cfi02.c +++ b/hw/block/pflash_cfi02.c @@ -968,7 +968,7 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base, hwaddr size, BlockBackend *blk, uint32_t sector_len, - int nb_mappings, int width, + int width, uint16_t id0, uint16_t id1, uint16_t id2, uint16_t id3, uint16_t unlock_addr0, @@ -977,7 +977,6 @@ PFlashCFI02 *pflash_cfi02_register(hwaddr base, { DeviceState *dev = qdev_new(TYPE_PFLASH_CFI02); -assert(nb_mappings <= 1); if (blk) { qdev_prop_set_drive(dev, "drive", blk); } diff --git a/hw/lm32/lm32_boards.c b/hw/lm32/lm32_boards.c index b5d97dd53ed..96877ba7cfb 100644 --- a/hw/lm32/lm32_boards.c +++ b/hw/lm32/lm32_boards.c @@ -121,7 +121,7 @@ static void lm32_evr_init(MachineState *machine) pflash_cfi02_register(flash_base, "lm32_evr.flash", flash_size, dinfo ? blk_by_legacy_dinfo(dinfo) : NULL, flash_sector_size, - 1, 2, 0x01, 0x7e, 0x43, 0x00, 0x555, 0x2aa, 1); + 2, 0x01, 0x7e, 0x43, 0x00, 0x555, 0x2aa, 1); /* create irq lines */ env->pic_state = lm32_pic_init(qemu_allocate_irq(cpu_irq_handler, cpu, 0)); @@ -218,7 +218,7 @@ static void lm32_uclinux_init(MachineState *machine) pflash_cfi02_register(flash_base, "lm32_uclinux.flash", flash_size, dinfo ? blk_by_legacy_dinfo(dinfo) : NULL, flash_sector_size, - 1, 2, 0x01, 0x7e, 0x43, 0x00, 0x555,
[PATCH v2 4/7] hw/arm/digic: Open-code pflash_cfi02_register() call
To be able to manually map the flash region on the main memory (in the next commit), first expand the pflash_cfi02_register in place. Signed-off-by: Philippe Mathieu-Daudé --- hw/arm/digic_boards.c | 27 +-- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/hw/arm/digic_boards.c b/hw/arm/digic_boards.c index 6cdc1d83fca..fc4a671b2e1 100644 --- a/hw/arm/digic_boards.c +++ b/hw/arm/digic_boards.c @@ -31,6 +31,8 @@ #include "hw/boards.h" #include "exec/address-spaces.h" #include "qemu/error-report.h" +#include "hw/qdev-properties.h" +#include "hw/misc/aliased_region.h" #include "hw/arm/digic.h" #include "hw/block/flash.h" #include "hw/loader.h" @@ -120,12 +122,25 @@ static void digic4_add_k8p3215uqb_rom(DigicState *s, hwaddr addr, #define FLASH_K8P3215UQB_SIZE (4 * 1024 * 1024) #define FLASH_K8P3215UQB_SECTOR_SIZE (64 * 1024) -pflash_cfi02_register(addr, "pflash", FLASH_K8P3215UQB_SIZE, - NULL, FLASH_K8P3215UQB_SECTOR_SIZE, - DIGIC4_ROM_MAX_SIZE / FLASH_K8P3215UQB_SIZE, - 4, - 0x00EC, 0x007E, 0x0003, 0x0001, - 0x0555, 0x2aa, 0); +DeviceState *dev = qdev_new(TYPE_PFLASH_CFI02); + +qdev_prop_set_uint32(dev, "num-blocks", + FLASH_K8P3215UQB_SIZE / FLASH_K8P3215UQB_SECTOR_SIZE); +qdev_prop_set_uint32(dev, "sector-length", FLASH_K8P3215UQB_SECTOR_SIZE); +qdev_prop_set_uint8(dev, "width", 4); /* 32-bit */ +qdev_prop_set_uint8(dev, "mappings", +DIGIC4_ROM_MAX_SIZE / FLASH_K8P3215UQB_SIZE); +qdev_prop_set_uint8(dev, "big-endian", 0); +qdev_prop_set_uint16(dev, "id0", 0x00ec); +qdev_prop_set_uint16(dev, "id1", 0x007e); +qdev_prop_set_uint16(dev, "id2", 0x0003); +qdev_prop_set_uint16(dev, "id3", 0x0001); +qdev_prop_set_uint16(dev, "unlock-addr0", 0x0555); +qdev_prop_set_uint16(dev, "unlock-addr1", 0x2aa); +qdev_prop_set_string(dev, "name", "pflash"); +sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), _fatal); + +sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, addr); digic_load_rom(s, addr, FLASH_K8P3215UQB_SIZE, filename); } -- 2.26.3
[PATCH v2 3/7] hw/arm/musicpal: Map flash using memory_region_add_subregion_aliased()
Instead of using a device specific feature for mapping the flash memory multiple times over a wider region, use the generic memory_region_add_subregion_aliased() helper. There is no change in the memory layout: - before: (qemu) info mtree fe00- (prio 0, i/o): pflash fe00-fe7f (prio 0, romd): alias pflash-alias @musicpal.flash -007f fe80-feff (prio 0, romd): alias pflash-alias @musicpal.flash -007f ff00-ff7f (prio 0, romd): alias pflash-alias @musicpal.flash -007f ff80- (prio 0, romd): alias pflash-alias @musicpal.flash -007f - after: fe00- (prio 0, i/o): masked musicpal.flash [span of 8 MiB] fe00-fe7f (prio 0, romd): alias musicpal.flash [#0/4] @musicpal.flash -007f fe80-feff (prio 0, romd): alias musicpal.flash [#1/4] @musicpal.flash -007f ff00-ff7f (prio 0, romd): alias musicpal.flash [#2/4] @musicpal.flash -007f ff80- (prio 0, romd): alias musicpal.flash [#3/4] @musicpal.flash -007f Flatview is the same: (qemu) info mtree -f FlatView #0 AS "memory", root: system AS "cpu-memory-0", root: system AS "emac-dma", root: system Root memory region: system fe00-fe7f (prio 0, romd): musicpal.flash fe80-feff (prio 0, romd): musicpal.flash ff00-ff7f (prio 0, romd): musicpal.flash ff80- (prio 0, romd): musicpal.flash Signed-off-by: Philippe Mathieu-Daudé --- hw/arm/musicpal.c | 11 +++ hw/arm/Kconfig| 1 + 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/hw/arm/musicpal.c b/hw/arm/musicpal.c index 8b58b66f263..7d1f2f3fb3f 100644 --- a/hw/arm/musicpal.c +++ b/hw/arm/musicpal.c @@ -30,6 +30,7 @@ #include "hw/irq.h" #include "hw/or-irq.h" #include "hw/audio/wm8750.h" +#include "hw/misc/aliased_region.h" #include "sysemu/block-backend.h" #include "sysemu/runstate.h" #include "sysemu/dma.h" @@ -1656,7 +1657,7 @@ static void musicpal_init(MachineState *machine) qdev_prop_set_uint32(dev, "num-blocks", flash_size / sector_size); qdev_prop_set_uint32(dev, "sector-length", sector_size); qdev_prop_set_uint8(dev, "width", 2); /* 16-bit */ -qdev_prop_set_uint8(dev, "mappings", MP_FLASH_SIZE_MAX / flash_size); +qdev_prop_set_uint8(dev, "mappings", 0); qdev_prop_set_uint8(dev, "big-endian", 0); qdev_prop_set_uint16(dev, "id0", 0x00bf); qdev_prop_set_uint16(dev, "id1", 0x236d); @@ -1667,14 +1668,16 @@ static void musicpal_init(MachineState *machine) qdev_prop_set_string(dev, "name", "musicpal.flash"); sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), _fatal); -sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, -0x1ULL - MP_FLASH_SIZE_MAX); - /* * The original U-Boot accesses the flash at 0xFE00 instead of * 0xFF80 (if there is 8 MB flash). So remap flash access if the * image is smaller than 32 MB. */ +memory_region_add_subregion_aliased(get_system_memory(), +0x1ULL - MP_FLASH_SIZE_MAX, +MP_FLASH_SIZE_MAX, +sysbus_mmio_get_region(SYS_BUS_DEVICE(dev), 0), +flash_size); } sysbus_create_simple(TYPE_MV88W8618_FLASHCFG, MP_FLASHCFG_BASE, NULL); diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig index 8c37cf00da7..aa8553b3cd3 100644 --- a/hw/arm/Kconfig +++ b/hw/arm/Kconfig @@ -101,6 +101,7 @@ config MUSICPAL select MARVELL_88W8618 select PTIMER select PFLASH_CFI02 +select ALIASED_REGION select SERIAL select WM8750 -- 2.26.3
[PATCH v2 2/7] hw/arm/musicpal: Open-code pflash_cfi02_register() call
To be able to manually map the flash region on the main memory (in the next commit), first expand the pflash_cfi02_register in place. Signed-off-by: Philippe Mathieu-Daudé --- hw/arm/musicpal.c | 27 +-- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/hw/arm/musicpal.c b/hw/arm/musicpal.c index 9cebece2de0..8b58b66f263 100644 --- a/hw/arm/musicpal.c +++ b/hw/arm/musicpal.c @@ -10,6 +10,7 @@ */ #include "qemu/osdep.h" +#include "qemu/units.h" #include "qapi/error.h" #include "cpu.h" #include "hw/sysbus.h" @@ -1640,6 +1641,7 @@ static void musicpal_init(MachineState *machine) /* Register flash */ dinfo = drive_get(IF_PFLASH, 0, 0); if (dinfo) { +static const size_t sector_size = 64 * KiB; BlockBackend *blk = blk_by_legacy_dinfo(dinfo); flash_size = blk_getlength(blk); @@ -1649,17 +1651,30 @@ static void musicpal_init(MachineState *machine) exit(1); } +dev = qdev_new(TYPE_PFLASH_CFI02); +qdev_prop_set_drive(dev, "drive", blk); +qdev_prop_set_uint32(dev, "num-blocks", flash_size / sector_size); +qdev_prop_set_uint32(dev, "sector-length", sector_size); +qdev_prop_set_uint8(dev, "width", 2); /* 16-bit */ +qdev_prop_set_uint8(dev, "mappings", MP_FLASH_SIZE_MAX / flash_size); +qdev_prop_set_uint8(dev, "big-endian", 0); +qdev_prop_set_uint16(dev, "id0", 0x00bf); +qdev_prop_set_uint16(dev, "id1", 0x236d); +qdev_prop_set_uint16(dev, "id2", 0x); +qdev_prop_set_uint16(dev, "id3", 0x); +qdev_prop_set_uint16(dev, "unlock-addr0", 0x); +qdev_prop_set_uint16(dev, "unlock-addr1", 0x2aaa); +qdev_prop_set_string(dev, "name", "musicpal.flash"); +sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), _fatal); + +sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, +0x1ULL - MP_FLASH_SIZE_MAX); + /* * The original U-Boot accesses the flash at 0xFE00 instead of * 0xFF80 (if there is 8 MB flash). So remap flash access if the * image is smaller than 32 MB. */ -pflash_cfi02_register(0x1ULL - MP_FLASH_SIZE_MAX, - "musicpal.flash", flash_size, - blk, 0x1, - MP_FLASH_SIZE_MAX / flash_size, - 2, 0x00BF, 0x236D, 0x, 0x, - 0x, 0x2AAA, 0); } sysbus_create_simple(TYPE_MV88W8618_FLASHCFG, MP_FLASHCFG_BASE, NULL); -- 2.26.3
[RFC PATCH v2 1/7] hw/misc: Add device to help managing aliased memory regions
// TODO explain here how buses work? when some address lines are // not bound we get memory aliasing, high addresses are masked. // etc... Add a helper to manage this use case easily. For example a having @span_size = @region_size / 4 we get such mapping: ^---^ | | | | | +---+ | +-+ <--+ | | +-+ | | | | | | | | +---> | alias#3 | | | | | | | | | | | +-+ | | | | +-+ | | | | | | | | | | +---> | alias#2 | | | | | | | | |region | container | | | +-+ | size | | | | +-+ | | | | | | | | | | | | +> | alias#1 | | | | | | | | | | | | | | | +-+ <--+ | | | +-+---+--+--+ +-+ | | | | | | | | |span | | | | subregion +-> | alias#0 | |size | offset | | | | | | | | +> | +---+ | +---+ +-+ <--+<--+ | | | | | | | | | | | | | | | | ^---^ Signed-off-by: Philippe Mathieu-Daudé --- Not really RFC, simply that I'v to add the technical description, but I'd like to know if there could be a possibility to not accept this device (because I missed something) before keeping working on it. So far it is only used in hobbyist boards. Cc: Peter Xu Cc: Paolo Bonzini --- include/hw/misc/aliased_region.h | 87 hw/misc/aliased_region.c | 132 +++ MAINTAINERS | 6 ++ hw/misc/Kconfig | 3 + hw/misc/meson.build | 1 + 5 files changed, 229 insertions(+) create mode 100644 include/hw/misc/aliased_region.h create mode 100644 hw/misc/aliased_region.c diff --git a/include/hw/misc/aliased_region.h b/include/hw/misc/aliased_region.h new file mode 100644 index 000..0ce0d5d1cef --- /dev/null +++ b/include/hw/misc/aliased_region.h @@ -0,0 +1,87 @@ +/* + * Aliased memory regions + * + * Copyright (c) 2018 Philippe Mathieu-Daudé + * + * SPDX-License-Identifier: GPL-2.0-or-later + */ + +#ifndef HW_MISC_ALIASED_REGION_H +#define HW_MISC_ALIASED_REGION_H + +#include "exec/memory.h" +#include "hw/sysbus.h" + +#define TYPE_ALIASED_REGION "aliased-memory-region" +OBJECT_DECLARE_SIMPLE_TYPE(AliasedRegionState, ALIASED_REGION) + +struct AliasedRegionState { +/*< private >*/ +SysBusDevice parent_obj; + +/*< public >*/ +MemoryRegion container; +uint64_t region_size; +uint64_t span_size; +MemoryRegion *mr; + +struct { +size_t count; +MemoryRegion *alias; +} mem; +}; + +/** + * memory_region_add_subregion_aliased: + * @container: the #MemoryRegion to contain the aliased subregions. + * @offset: the offset relative to @container where the aliased subregion + * are added. + * @region_size: size of the region containing the aliased subregions. + * @subregion: the subregion to be aliased. + * @span_size: size between each aliased subregion + * + * This utility function creates and maps an instance of aliased-memory-region, + * which is a dummy device of a single region which simply contains multiple + * aliases of the provided @subregion, spanned over the @region_size every + * @span_size. The device is mapped at @offset within @container. + * + * For example a having @span_size = @region_size / 4 we get such mapping: + * + * +---+ + * | | + * | | + * | +---+ | +-+ <--+ + * | | +-+ | + * | | | | | + * | | +---> | alias#3 | | + * | | | | | | + * | | | +-+ | + * | | | +-+ | + * | | | | | | + * | | | +---> | alias#2 |
[PATCH v2 0/7] hw/misc: Add memory_region_add_subregion_aliased() helper [pflash part]
Hi, This series introduce the memory_region_add_subregion_aliased() helper which basically create a device which maps a subregion multiple times. Since v1: - Split series in 2, keeping the I/O regions (showed with the q800 machine) part for 2nd part - Added R-b tags Examples are easier, so having a subregion aliased every @span_size then mapped onto a container at an offset, you get something like: ^---^ | | | | | +---+ | +-+ <--+ | | +-+ | | | | | | | | +---> | alias#3 | | | | | | | | | | | +-+ | | | | +-+ | | | | | | | | | | +---> | alias#2 | | | | | | | | |region | container | | | +-+ | size | | | | +-+ | | | | | | | | | | | | +> | alias#1 | | | | | | | | | | | | | | | +-+ <--+ | | | +-+---+--+--+ +-+ | | | | | | | | |span | | | | subregion +-> | alias#0 | |size | offset | | | | | | | | +> | +---+ | +---+ +-+ <--+<--+ | | | | | | | | | | | | | | | | ^---^ I know it need more documentation and tests, but I prefer to send as draft RFC for early review before spending more time on it. Supersedes: <20210326002728.1069834-1-f4...@amsat.org> Based-on: <20210325120921.858993-1-f4...@amsat.org> https://www.mail-archive.com/qemu-devel@nongnu.org/msg795218.html Philippe Mathieu-Daudé (7): hw/misc: Add device to help managing aliased memory regions hw/arm/musicpal: Open-code pflash_cfi02_register() call hw/arm/musicpal: Map flash using memory_region_add_subregion_aliased() hw/arm/digic: Open-code pflash_cfi02_register() call hw/arm/digic: Map flash using memory_region_add_subregion_aliased() hw/block/pflash_cfi02: Remove pflash_setup_mappings() hw/block/pflash_cfi02: Simplify pflash_cfi02_register() prototype include/hw/block/flash.h | 1 - include/hw/misc/aliased_region.h | 87 hw/arm/digic_boards.c| 28 +-- hw/arm/musicpal.c| 29 +-- hw/arm/xilinx_zynq.c | 2 +- hw/block/pflash_cfi02.c | 36 + hw/lm32/lm32_boards.c| 4 +- hw/misc/aliased_region.c | 132 +++ hw/ppc/ppc405_boards.c | 6 +- hw/sh4/r2d.c | 2 +- MAINTAINERS | 6 ++ hw/arm/Kconfig | 2 + hw/misc/Kconfig | 3 + hw/misc/meson.build | 1 + 14 files changed, 285 insertions(+), 54 deletions(-) create mode 100644 include/hw/misc/aliased_region.h create mode 100644 hw/misc/aliased_region.c -- 2.26.3
Re: [PATCH v3 21/33] qemu-socket: pass monitor link to socket_get_fd directly
On Fri, Apr 16, 2021 at 11:08:59AM +0300, Vladimir Sementsov-Ogievskiy wrote: > Detecting monitor by current coroutine works bad when we are not in > coroutine context. And that's exactly so in nbd reconnect code, where > qio_channel_socket_connect_sync() is called from thread. > > Add a possibility to pass monitor by hand, to be used in the following > commit. > > Signed-off-by: Vladimir Sementsov-Ogievskiy > --- > include/io/channel-socket.h| 20 > include/qemu/sockets.h | 2 +- > io/channel-socket.c| 17 + > tests/unit/test-util-sockets.c | 16 > util/qemu-sockets.c| 10 +- > 5 files changed, 47 insertions(+), 18 deletions(-) > > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h > index e747e63514..6d0915420d 100644 > --- a/include/io/channel-socket.h > +++ b/include/io/channel-socket.h > @@ -78,6 +78,23 @@ qio_channel_socket_new_fd(int fd, >Error **errp); > > > +/** > + * qio_channel_socket_connect_sync_mon: > + * @ioc: the socket channel object > + * @addr: the address to connect to > + * @mon: current monitor. If NULL, it will be detected by > + * current coroutine. > + * @errp: pointer to a NULL-initialized error object > + * > + * Attempt to connect to the address @addr. This method > + * will run in the foreground so the caller will not regain > + * execution control until the connection is established or > + * an error occurs. > + */ > +int qio_channel_socket_connect_sync_mon(QIOChannelSocket *ioc, > +SocketAddress *addr, > +Monitor *mon, > +Error **errp); I don't really like exposing the concept of the QEMU monitor in the IO layer APIs. IMHO these ought to remain completely separate subsystems from the API pov, and we ought to fix this problem by making monitor_cur() work better in the scenario required. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH-for-6.1 0/2] hw/block/pflash_cfi02: Do not create aliases when not necessary
On 3/25/21 1:09 PM, Philippe Mathieu-Daudé wrote: > Simplify memory layout when no pflash_cfi02 mapping requested. > Philippe Mathieu-Daud=C3=A9 (2): > hw/block/pflash_cfi02: Set romd mode in pflash_cfi02_realize() > hw/block/pflash_cfi02: Do not create aliases when not necessary Thanks, series applied to pflash-next tree.
Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection
> Am 19.04.2021 um 10:36 schrieb Peter Lieven : > > > >> Am 15.04.2021 um 17:22 schrieb Kevin Wolf : >> >> Peter, three years ago you changed 'qemu-img convert' to sacrifice some >> sparsification in order to get aligned requests on the target image. At >> the time, I thought the impact would be small, but it turns out that >> this can end up wasting gigabytes of storagee (like converting a fully >> zeroed 10 GB image taking 2.8 GB instead of a few kilobytes). >> >> https://bugzilla.redhat.com/show_bug.cgi?id=1882917 >> >> I'm not entirely sure how to attack this best since this is a tradeoff, >> but maybe the approach in this series is still good enough for the case >> that you wanted to fix back then? >> >> Of course, it would be possible to have a more complete fix like looking >> forward a few blocks more before writing data, but that would probably >> not be entirely trivial because you would have to merge blocks with ZERO >> block status with DATA blocks that contain only zeros. I'm not sure if >> it's worth this complication of the code. > > I will try to look into this asap. Besides from the reproducer described in the ticket, I retried my old conversion test in our environment: Before commit 8dcd3c9b91: reads 4608 writes 14959 After commit 8dcd3c9b91: reads 0 writes 14924 With Kevins patch: reads 110 writes 14924 I think this is a good result if it avoids other issues. Peter
[PATCH v2 6/8] block: do not acquire AioContext in check_to_replace_node
Make the (only) caller do it. Reviewed-by: Stefan Hajnoczi Signed-off-by: Paolo Bonzini Signed-off-by: Emanuele Giuseppe Esposito --- block.c | 22 +- blockdev.c| 7 ++- include/block/block.h | 1 + 3 files changed, 12 insertions(+), 18 deletions(-) diff --git a/block.c b/block.c index f40fb63c75..fd12f88062 100644 --- a/block.c +++ b/block.c @@ -6776,24 +6776,15 @@ bool bdrv_recurse_can_replace(BlockDriverState *bs, * * The result (whether the node can be replaced or not) is only valid * for as long as no graph or permission changes occur. + * + * Called with AioContext lock held. */ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs, +BlockDriverState *to_replace_bs, const char *node_name, Error **errp) { -BlockDriverState *to_replace_bs = bdrv_find_node(node_name); -AioContext *aio_context; - -if (!to_replace_bs) { -error_setg(errp, "Failed to find node with node-name='%s'", node_name); -return NULL; -} - -aio_context = bdrv_get_aio_context(to_replace_bs); -aio_context_acquire(aio_context); - if (bdrv_op_is_blocked(to_replace_bs, BLOCK_OP_TYPE_REPLACE, errp)) { -to_replace_bs = NULL; -goto out; +return NULL; } /* We don't want arbitrary node of the BDS chain to be replaced only the top @@ -6806,12 +6797,9 @@ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs, "because it cannot be guaranteed that doing so would not " "lead to an abrupt change of visible data", node_name, parent_bs->node_name); -to_replace_bs = NULL; -goto out; +return NULL; } -out: -aio_context_release(aio_context); return to_replace_bs; } diff --git a/blockdev.c b/blockdev.c index a57590aae4..e901107344 100644 --- a/blockdev.c +++ b/blockdev.c @@ -3056,13 +3056,18 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, return; } -to_replace_bs = check_to_replace_node(bs, replaces, errp); +to_replace_bs = bdrv_find_node(replaces); if (!to_replace_bs) { +error_setg(errp, "Failed to find node with node-name='%s'", + replaces); return; } replace_aio_context = bdrv_get_aio_context(to_replace_bs); aio_context_acquire(replace_aio_context); +if (!check_to_replace_node(bs, to_replace_bs, replaces, errp)) { +return; +} replace_size = bdrv_getlength(to_replace_bs); aio_context_release(replace_aio_context); diff --git a/include/block/block.h b/include/block/block.h index b3f6e509d4..f57c34a19a 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -474,6 +474,7 @@ int bdrv_amend_options(BlockDriverState *bs_new, QemuOpts *opts, /* check if a named node can be replaced when doing drive-mirror */ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs, +BlockDriverState *to_replace_bs, const char *node_name, Error **errp); /* async block I/O */ -- 2.30.2
[PATCH v2 7/8] block/replication: do not acquire AioContext
Replication functions are mostly called when the BDS is quiescent and does not have any pending I/O. They do not need to synchronize on anything since BDS and BB are now thread-safe. Signed-off-by: Paolo Bonzini Signed-off-by: Emanuele Giuseppe Esposito --- block/replication.c | 54 ++--- 1 file changed, 12 insertions(+), 42 deletions(-) diff --git a/block/replication.c b/block/replication.c index 97be7ef4de..25ee37b21b 100644 --- a/block/replication.c +++ b/block/replication.c @@ -45,6 +45,8 @@ typedef struct BDRVReplicationState { Error *blocker; bool orig_hidden_read_only; bool orig_secondary_read_only; + +/* This field is accessed asynchronously. */ int error; } BDRVReplicationState; @@ -210,7 +212,7 @@ static int replication_return_value(BDRVReplicationState *s, int ret) } if (ret < 0) { -s->error = ret; +qatomic_set(>error, ret); ret = 0; } @@ -307,6 +309,7 @@ out: return ret; } +/* Called with no I/O pending. */ static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp) { Error *local_err = NULL; @@ -420,7 +423,7 @@ static void backup_job_completed(void *opaque, int ret) if (s->stage != BLOCK_REPLICATION_FAILOVER) { /* The backup job is cancelled unexpectedly */ -s->error = -EIO; +qatomic_set(>error, -EIO); } backup_job_cleanup(bs); @@ -445,6 +448,7 @@ static bool check_top_bs(BlockDriverState *top_bs, BlockDriverState *bs) return false; } +/* Called with no I/O pending. */ static void replication_start(ReplicationState *rs, ReplicationMode mode, Error **errp) { @@ -452,12 +456,9 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode, BDRVReplicationState *s; BlockDriverState *top_bs; int64_t active_length, hidden_length, disk_length; -AioContext *aio_context; Error *local_err = NULL; BackupPerf perf = { .use_copy_range = true, .max_workers = 1 }; -aio_context = bdrv_get_aio_context(bs); -aio_context_acquire(aio_context); s = bs->opaque; if (s->stage == BLOCK_REPLICATION_DONE || @@ -467,20 +468,17 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode, * Ignore the request because the secondary side of replication * doesn't have to do anything anymore. */ -aio_context_release(aio_context); return; } if (s->stage != BLOCK_REPLICATION_NONE) { error_setg(errp, "Block replication is running or done"); -aio_context_release(aio_context); return; } if (s->mode != mode) { error_setg(errp, "The parameter mode's value is invalid, needs %d," " but got %d", s->mode, mode); -aio_context_release(aio_context); return; } @@ -492,21 +490,18 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode, if (!s->active_disk || !s->active_disk->bs || !s->active_disk->bs->backing) { error_setg(errp, "Active disk doesn't have backing file"); -aio_context_release(aio_context); return; } s->hidden_disk = s->active_disk->bs->backing; if (!s->hidden_disk->bs || !s->hidden_disk->bs->backing) { error_setg(errp, "Hidden disk doesn't have backing file"); -aio_context_release(aio_context); return; } s->secondary_disk = s->hidden_disk->bs->backing; if (!s->secondary_disk->bs || !bdrv_has_blk(s->secondary_disk->bs)) { error_setg(errp, "The secondary disk doesn't have block backend"); -aio_context_release(aio_context); return; } @@ -518,7 +513,6 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode, active_length != hidden_length || hidden_length != disk_length) { error_setg(errp, "Active disk, hidden disk, secondary disk's length" " are not the same"); -aio_context_release(aio_context); return; } @@ -529,7 +523,6 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode, !s->hidden_disk->bs->drv->bdrv_make_empty) { error_setg(errp, "Active disk or hidden disk doesn't support make_empty"); -aio_context_release(aio_context); return; } @@ -537,7 +530,6 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode, reopen_backing_file(bs, true, _err); if (local_err) { error_propagate(errp, local_err); -aio_context_release(aio_context); return; } @@ -550,7 +542,6 @@ static void replication_start(ReplicationState *rs,
[PATCH v2 8/8] block: do not take AioContext around reopen
Reopen needs to handle AioContext carefully due to calling bdrv_drain_all_begin/end. By not taking AioContext around calls to bdrv_reopen_multiple, we can drop the function's release/acquire pair and the AioContext argument too. Signed-off-by: Paolo Bonzini Signed-off-by: Emanuele Giuseppe Esposito --- block/block-backend.c | 4 block/mirror.c| 9 - blockdev.c| 19 ++- 3 files changed, 6 insertions(+), 26 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 413af51f3b..6fdc698e9e 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -2291,20 +2291,16 @@ int blk_commit_all(void) BlockBackend *blk = NULL; while ((blk = blk_all_next(blk)) != NULL) { -AioContext *aio_context = blk_get_aio_context(blk); BlockDriverState *unfiltered_bs = bdrv_skip_filters(blk_bs(blk)); -aio_context_acquire(aio_context); if (blk_is_inserted(blk) && bdrv_cow_child(unfiltered_bs)) { int ret; ret = bdrv_commit(unfiltered_bs); if (ret < 0) { -aio_context_release(aio_context); return ret; } } -aio_context_release(aio_context); } return 0; } diff --git a/block/mirror.c b/block/mirror.c index 5a71bd8bbc..43174bbc6b 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -631,7 +631,6 @@ static int mirror_exit_common(Job *job) MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); BlockJob *bjob = >common; MirrorBDSOpaque *bs_opaque; -AioContext *replace_aio_context = NULL; BlockDriverState *src; BlockDriverState *target_bs; BlockDriverState *mirror_top_bs; @@ -699,11 +698,6 @@ static int mirror_exit_common(Job *job) } } -if (s->to_replace) { -replace_aio_context = bdrv_get_aio_context(s->to_replace); -aio_context_acquire(replace_aio_context); -} - if (s->should_complete && !abort) { BlockDriverState *to_replace = s->to_replace ?: src; bool ro = bdrv_is_read_only(to_replace); @@ -740,9 +734,6 @@ static int mirror_exit_common(Job *job) error_free(s->replace_blocker); bdrv_unref(s->to_replace); } -if (replace_aio_context) { -aio_context_release(replace_aio_context); -} g_free(s->replaces); bdrv_unref(target_bs); diff --git a/blockdev.c b/blockdev.c index e901107344..1672ef756e 100644 --- a/blockdev.c +++ b/blockdev.c @@ -3469,7 +3469,6 @@ void qmp_change_backing_file(const char *device, Error **errp) { BlockDriverState *bs = NULL; -AioContext *aio_context; BlockDriverState *image_bs = NULL; Error *local_err = NULL; bool ro; @@ -3480,37 +3479,34 @@ void qmp_change_backing_file(const char *device, return; } -aio_context = bdrv_get_aio_context(bs); -aio_context_acquire(aio_context); - image_bs = bdrv_lookup_bs(NULL, image_node_name, _err); if (local_err) { error_propagate(errp, local_err); -goto out; +return; } if (!image_bs) { error_setg(errp, "image file not found"); -goto out; +return; } if (bdrv_find_base(image_bs) == image_bs) { error_setg(errp, "not allowing backing file change on an image " "without a backing file"); -goto out; +return; } /* even though we are not necessarily operating on bs, we need it to * determine if block ops are currently prohibited on the chain */ if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_CHANGE, errp)) { -goto out; +return; } /* final sanity check */ if (!bdrv_chain_contains(bs, image_bs)) { error_setg(errp, "'%s' and image file are not in the same chain", device); -goto out; +return; } /* if not r/w, reopen to make r/w */ @@ -3518,7 +3514,7 @@ void qmp_change_backing_file(const char *device, if (ro) { if (bdrv_reopen_set_read_only(image_bs, false, errp) != 0) { -goto out; +return; } } @@ -3536,9 +3532,6 @@ void qmp_change_backing_file(const char *device, if (ro) { bdrv_reopen_set_read_only(image_bs, true, errp); } - -out: -aio_context_release(aio_context); } void qmp_blockdev_add(BlockdevOptions *options, Error **errp) -- 2.30.2
[PATCH v2 5/8] block: add a few more notes on locking
Reviewed-by: Stefan Hajnoczi Signed-off-by: Paolo Bonzini Signed-off-by: Emanuele Giuseppe Esposito --- include/block/block_int.h | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index a1aad5ad2d..67a0777e12 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -154,7 +154,9 @@ struct BlockDriver { */ bool supports_backing; -/* For handling image reopen for split or non-split files */ +/* For handling image reopen for split or non-split files. Called + * with no I/O pending. + */ int (*bdrv_reopen_prepare)(BDRVReopenState *reopen_state, BlockReopenQueue *queue, Error **errp); void (*bdrv_reopen_commit)(BDRVReopenState *reopen_state); @@ -168,6 +170,7 @@ struct BlockDriver { /* Protocol drivers should implement this instead of bdrv_open */ int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags, Error **errp); +/* Called from main thread. */ void (*bdrv_close)(BlockDriverState *bs); -- 2.30.2
[PATCH v2 3/8] util: use RCU accessors for notifiers
Note that calling rcu_read_lock() is left to the caller. In fact, if the notifier is really only used within the BQL, it's unnecessary. Even outside the BQL, RCU accessors can also be used with any API that has the same contract as synchronize_rcu, i.e. it stops until all concurrent readers complete, no matter how "readers" are defined. In the next patch, for example, synchronize_rcu's role is taken by bdrv_drain (which is a superset of synchronize_rcu, since it also blocks new incoming readers). Reviewed-by: Stefan Hajnoczi Signed-off-by: Paolo Bonzini Signed-off-by: Emanuele Giuseppe Esposito --- util/notify.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/util/notify.c b/util/notify.c index 76bab212ae..529f1d121e 100644 --- a/util/notify.c +++ b/util/notify.c @@ -15,6 +15,7 @@ #include "qemu/osdep.h" #include "qemu/notify.h" +#include "qemu/rcu_queue.h" void notifier_list_init(NotifierList *list) { @@ -23,19 +24,19 @@ void notifier_list_init(NotifierList *list) void notifier_list_add(NotifierList *list, Notifier *notifier) { -QLIST_INSERT_HEAD(>notifiers, notifier, node); +QLIST_INSERT_HEAD_RCU(>notifiers, notifier, node); } void notifier_remove(Notifier *notifier) { -QLIST_REMOVE(notifier, node); +QLIST_REMOVE_RCU(notifier, node); } void notifier_list_notify(NotifierList *list, void *data) { Notifier *notifier, *next; -QLIST_FOREACH_SAFE(notifier, >notifiers, node, next) { +QLIST_FOREACH_SAFE_RCU(notifier, >notifiers, node, next) { notifier->notify(notifier, data); } } @@ -53,12 +54,12 @@ void notifier_with_return_list_init(NotifierWithReturnList *list) void notifier_with_return_list_add(NotifierWithReturnList *list, NotifierWithReturn *notifier) { -QLIST_INSERT_HEAD(>notifiers, notifier, node); +QLIST_INSERT_HEAD_RCU(>notifiers, notifier, node); } void notifier_with_return_remove(NotifierWithReturn *notifier) { -QLIST_REMOVE(notifier, node); +QLIST_REMOVE_RCU(notifier, node); } int notifier_with_return_list_notify(NotifierWithReturnList *list, void *data) @@ -66,7 +67,7 @@ int notifier_with_return_list_notify(NotifierWithReturnList *list, void *data) NotifierWithReturn *notifier, *next; int ret = 0; -QLIST_FOREACH_SAFE(notifier, >notifiers, node, next) { +QLIST_FOREACH_SAFE_RCU(notifier, >notifiers, node, next) { ret = notifier->notify(notifier, data); if (ret != 0) { break; -- 2.30.2
[PATCH v2 4/8] block: make before-write notifiers thread-safe
Reads access the list in RCU style, so be careful to avoid use-after-free scenarios in the backup block job. Apart from this, all that's needed is protecting updates with a mutex. Reviewed-by: Stefan Hajnoczi Signed-off-by: Paolo Bonzini Signed-off-by: Emanuele Giuseppe Esposito --- block.c | 6 +++--- block/io.c| 12 block/write-threshold.c | 2 +- include/block/block_int.h | 37 + 4 files changed, 53 insertions(+), 4 deletions(-) diff --git a/block.c b/block.c index c5b887cec1..f40fb63c75 100644 --- a/block.c +++ b/block.c @@ -6434,7 +6434,7 @@ static void bdrv_do_remove_aio_context_notifier(BdrvAioNotifier *ban) g_free(ban); } -static void bdrv_detach_aio_context(BlockDriverState *bs) +void bdrv_detach_aio_context(BlockDriverState *bs) { BdrvAioNotifier *baf, *baf_tmp; @@ -6462,8 +6462,8 @@ static void bdrv_detach_aio_context(BlockDriverState *bs) bs->aio_context = NULL; } -static void bdrv_attach_aio_context(BlockDriverState *bs, -AioContext *new_context) +void bdrv_attach_aio_context(BlockDriverState *bs, + AioContext *new_context) { BdrvAioNotifier *ban, *ban_tmp; diff --git a/block/io.c b/block/io.c index ca2dca3007..8f6af6077b 100644 --- a/block/io.c +++ b/block/io.c @@ -3137,10 +3137,22 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov) return true; } +static QemuSpin notifiers_spin_lock; + void bdrv_add_before_write_notifier(BlockDriverState *bs, NotifierWithReturn *notifier) { +qemu_spin_lock(_spin_lock); notifier_with_return_list_add(>before_write_notifiers, notifier); +qemu_spin_unlock(_spin_lock); +} + +void bdrv_remove_before_write_notifier(BlockDriverState *bs, + NotifierWithReturn *notifier) +{ +qemu_spin_lock(_spin_lock); +notifier_with_return_remove(notifier); +qemu_spin_unlock(_spin_lock); } void bdrv_io_plug(BlockDriverState *bs) diff --git a/block/write-threshold.c b/block/write-threshold.c index 77c74bdaa7..b87b749b02 100644 --- a/block/write-threshold.c +++ b/block/write-threshold.c @@ -32,7 +32,7 @@ bool bdrv_write_threshold_is_set(const BlockDriverState *bs) static void write_threshold_disable(BlockDriverState *bs) { if (bdrv_write_threshold_is_set(bs)) { -notifier_with_return_remove(>write_threshold_notifier); +bdrv_remove_before_write_notifier(bs, >write_threshold_notifier); bs->write_threshold_offset = 0; } } diff --git a/include/block/block_int.h b/include/block/block_int.h index 88e4111939..a1aad5ad2d 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -1089,6 +1089,8 @@ bool bdrv_backing_overridden(BlockDriverState *bs); /** * bdrv_add_before_write_notifier: + * @bs: The #BlockDriverState for which to register the notifier + * @notifier: The notifier to add * * Register a callback that is invoked before write requests are processed but * after any throttling or waiting for overlapping requests. @@ -1096,6 +1098,41 @@ bool bdrv_backing_overridden(BlockDriverState *bs); void bdrv_add_before_write_notifier(BlockDriverState *bs, NotifierWithReturn *notifier); +/** + * bdrv_remove_before_write_notifier: + * @bs: The #BlockDriverState for which to register the notifier + * @notifier: The notifier to add + * + * Unregister a callback that is invoked before write requests are processed but + * after any throttling or waiting for overlapping requests. + * + * Note that more I/O could be pending on @bs. You need to wait for + * it to finish, for example with bdrv_drain(), before freeing @notifier. + */ +void bdrv_remove_before_write_notifier(BlockDriverState *bs, + NotifierWithReturn *notifier); + +/** + * bdrv_detach_aio_context: + * + * May be called from .bdrv_detach_aio_context() to detach children from the + * current #AioContext. This is only needed by block drivers that manage their + * own children. Both ->file and ->backing are automatically handled and + * block drivers should not call this function on them explicitly. + */ +void bdrv_detach_aio_context(BlockDriverState *bs); + +/** + * bdrv_attach_aio_context: + * + * May be called from .bdrv_attach_aio_context() to attach children to the new + * #AioContext. This is only needed by block drivers that manage their own + * children. Both ->file and ->backing are automatically handled and block + * drivers should not call this function on them explicitly. + */ +void bdrv_attach_aio_context(BlockDriverState *bs, + AioContext *new_context); + /** * bdrv_add_aio_context_notifier: * -- 2.30.2
[PATCH v2 2/8] block: protect write threshold QMP commands from concurrent requests
For simplicity, use bdrv_drained_begin/end to avoid concurrent writes to the write threshold, or reading it while it is being set. qmp_block_set_write_threshold is protected by the BQL. Reviewed-by: Stefan Hajnoczi Co-developed-by: Paolo Bonzini Signed-off-by: Emanuele Giuseppe Esposito --- block/write-threshold.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/block/write-threshold.c b/block/write-threshold.c index 63040fa4f2..77c74bdaa7 100644 --- a/block/write-threshold.c +++ b/block/write-threshold.c @@ -111,7 +111,6 @@ void qmp_block_set_write_threshold(const char *node_name, Error **errp) { BlockDriverState *bs; -AioContext *aio_context; bs = bdrv_find_node(node_name); if (!bs) { @@ -119,10 +118,8 @@ void qmp_block_set_write_threshold(const char *node_name, return; } -aio_context = bdrv_get_aio_context(bs); -aio_context_acquire(aio_context); - +/* Avoid a concurrent write_threshold_disable. */ +bdrv_drained_begin(bs); bdrv_write_threshold_set(bs, threshold_bytes); - -aio_context_release(aio_context); +bdrv_drained_end(bs); } -- 2.30.2
[PATCH v2 0/8] Block layer thread-safety, continued
This and the following serie of patches are based on Paolo's v1 patches sent in 2017[*]. They have been ported to the current QEMU version, but the goal remains the same: - make the block layer thread-safe (patches 1-5), and - remove aio_context_acquire/release (patches 6-8). [*] = https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg01398.html Signed-off-by: Emanuele Giuseppe Esposito --- v1 (2017) -> v2 (2021): - v1 Patch "block-backup: add reqs_lock" has been dropped, because now is completely different from the old version and all functions that were affected by it have been moved or deleted. It will be replaced by another serie that aims to thread safety to block/block-copy.c - remaining v1 patches will be integrated in next serie. - Patch "block: do not acquire AioContext in check_to_replace_node" moves part of the logic of check_to_replace_node to the caller, so that the function can be included in the aio_context_acquire/release block that follows. Emanuele Giuseppe Esposito (8): block: prepare write threshold code for thread safety block: protect write threshold QMP commands from concurrent requests util: use RCU accessors for notifiers block: make before-write notifiers thread-safe block: add a few more notes on locking block: do not acquire AioContext in check_to_replace_node block/replication: do not acquire AioContext block: do not take AioContext around reopen block.c | 28 ++-- block/block-backend.c | 4 --- block/io.c| 12 + block/mirror.c| 9 --- block/replication.c | 54 +-- block/write-threshold.c | 39 ++-- blockdev.c| 26 +-- include/block/block.h | 1 + include/block/block_int.h | 42 +- util/notify.c | 13 +- 10 files changed, 113 insertions(+), 115 deletions(-) -- 2.30.2
[PATCH v2 1/8] block: prepare write threshold code for thread safety
Reviewed-by: Stefan Hajnoczi Co-developed-by: Paolo Bonzini Signed-off-by: Emanuele Giuseppe Esposito --- block/write-threshold.c | 28 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/block/write-threshold.c b/block/write-threshold.c index 85b78dc2a9..63040fa4f2 100644 --- a/block/write-threshold.c +++ b/block/write-threshold.c @@ -37,18 +37,22 @@ static void write_threshold_disable(BlockDriverState *bs) } } +static uint64_t treshold_overage(const BdrvTrackedRequest *req, +uint64_t thres) +{ +if (thres > 0 && req->offset + req->bytes > thres) { +return req->offset + req->bytes - thres; +} else { +return 0; +} +} + uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs, const BdrvTrackedRequest *req) { -if (bdrv_write_threshold_is_set(bs)) { -if (req->offset > bs->write_threshold_offset) { -return (req->offset - bs->write_threshold_offset) + req->bytes; -} -if ((req->offset + req->bytes) > bs->write_threshold_offset) { -return (req->offset + req->bytes) - bs->write_threshold_offset; -} -} -return 0; +uint64_t thres = bdrv_write_threshold_get(bs); + +return treshold_overage(req, thres); } static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, @@ -56,14 +60,14 @@ static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, { BdrvTrackedRequest *req = opaque; BlockDriverState *bs = req->bs; -uint64_t amount = 0; +uint64_t thres = bdrv_write_threshold_get(bs); +uint64_t amount = treshold_overage(req, thres); -amount = bdrv_write_threshold_exceeded(bs, req); if (amount > 0) { qapi_event_send_block_write_threshold( bs->node_name, amount, -bs->write_threshold_offset); +thres); /* autodisable to avoid flooding the monitor */ write_threshold_disable(bs); -- 2.30.2
Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection
> Am 15.04.2021 um 17:22 schrieb Kevin Wolf : > > Peter, three years ago you changed 'qemu-img convert' to sacrifice some > sparsification in order to get aligned requests on the target image. At > the time, I thought the impact would be small, but it turns out that > this can end up wasting gigabytes of storagee (like converting a fully > zeroed 10 GB image taking 2.8 GB instead of a few kilobytes). > > https://bugzilla.redhat.com/show_bug.cgi?id=1882917 > > I'm not entirely sure how to attack this best since this is a tradeoff, > but maybe the approach in this series is still good enough for the case > that you wanted to fix back then? > > Of course, it would be possible to have a more complete fix like looking > forward a few blocks more before writing data, but that would probably > not be entirely trivial because you would have to merge blocks with ZERO > block status with DATA blocks that contain only zeros. I'm not sure if > it's worth this complication of the code. I will try to look into this asap. Is there a hint which FS I need to set the extent hint when creating the raw image? I was not able to do that. Peter