Re: [PATCH v2 11/13] block/export: convert vhost-user-blk server to block export API

2020-09-29 Thread Markus Armbruster
Stefan Hajnoczi  writes:

> Use the new QAPI block exports API instead of defining our own QOM
> objects.
>
> This is a large change because the lifecycle of VuBlockDev needs to
> follow BlockExportDriver. QOM properties are replaced by QAPI options
> objects.
>
> VuBlockDev is renamed VuBlkExport and contains a BlockExport field.
> Several fields can be dropped since BlockExport already has equivalents.
>
> The file names and meson build integration will be adjusted in a future
> patch. libvhost-user should probably be built as a static library that
> is linked into QEMU instead of as a .c file that results in duplicate
> compilation.
>
> The new command-line syntax is:
>
>   $ qemu-storage-daemon \
>   --blockdev file,node-name=drive0,filename=test.img \
>   --export vhost-user-blk,node-name=drive0,id=export0,unix-socket=/tmp/vhost-user-blk.sock
>
> Note that unix-socket is optional because we may wish to accept chardevs
> too in the future.
>
> Signed-off-by: Stefan Hajnoczi 
> ---
> v2:
>  * Replace str unix-socket with SocketAddress addr to match NBD and
>support file descriptor passing
>  * Make addr mandatory [Markus]
>  * Update vhost-user-blk-test.c to use --export syntax
> ---
>  qapi/block-export.json   |  21 +-
>  block/export/vhost-user-blk-server.h |  23 +-
>  block/export/export.c|   8 +-
>  block/export/vhost-user-blk-server.c | 452 +++
>  tests/qtest/vhost-user-blk-test.c|   2 +-
>  util/vhost-user-server.c |  10 +-
>  block/export/meson.build |   1 +
>  block/meson.build|   1 -
>  8 files changed, 158 insertions(+), 360 deletions(-)
>
> diff --git a/qapi/block-export.json b/qapi/block-export.json
> index ace0d66e17..2e44625bb1 100644
> --- a/qapi/block-export.json
> +++ b/qapi/block-export.json
> @@ -84,6 +84,21 @@
>'data': { '*name': 'str', '*description': 'str',
>  '*bitmap': 'str' } }
>  
> +##
> +# @BlockExportOptionsVhostUserBlk:
> +#
> +# A vhost-user-blk block export.
> +#
> +# @addr: The vhost-user socket on which to listen. Both 'unix' and 'fd'
> +#SocketAddress types are supported. Passed fds must be UNIX domain
> +#sockets.

"addr.type must be 'unix' or 'fd'" is not visible in introspection.
Awkward.  Practical problem only if other addresses ever become
available here.  Is that possible?

> +# @logical-block-size: Logical block size in bytes. Defaults to 512 bytes.
> +#
> +# Since: 5.2
> +##
> +{ 'struct': 'BlockExportOptionsVhostUserBlk',
> +  'data': { 'addr': 'SocketAddress', '*logical-block-size': 'size' } }
> +
>  ##
>  # @NbdServerAddOptions:
>  #
> @@ -180,11 +195,12 @@
>  # An enumeration of block export types
>  #
>  # @nbd: NBD export
> +# @vhost-user-blk: vhost-user-blk export (since 5.2)
>  #
>  # Since: 4.2
>  ##
>  { 'enum': 'BlockExportType',
> -  'data': [ 'nbd' ] }
> +  'data': [ 'nbd', 'vhost-user-blk' ] }
>  
>  ##
>  # @BlockExportOptions:
> @@ -213,7 +229,8 @@
>  '*writethrough': 'bool' },
>'discriminator': 'type',
>'data': {
> -  'nbd': 'BlockExportOptionsNbd'
> +  'nbd': 'BlockExportOptionsNbd',
> +  'vhost-user-blk': 'BlockExportOptionsVhostUserBlk'
> } }
>  
>  ##
[...]




Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 28 12:42, Klaus Jensen wrote:
> On Sep 28 11:35, Dmitry Fomichev wrote:
> > The emulation code has been changed to advertise NVM Command Set when
> > "zoned" device property is not set (default) and Zoned Namespace
> > Command Set otherwise.
> > 
> > Handlers for three new NVMe commands introduced in Zoned Namespace
> > Command Set specification are added, namely for Zone Management
> > Receive, Zone Management Send and Zone Append.
> > 
> > Device initialization code has been extended to create a proper
> > configuration for zoned operation using device properties.
> > 
> > Read/Write command handler is modified to only allow writes at the
> > write pointer if the namespace is zoned. For Zone Append command,
> > writes implicitly happen at the write pointer and the starting write
> > pointer value is returned as the result of the command. Write Zeroes
> > handler is modified to add zoned checks that are identical to those
> > done as a part of Write flow.
> > 
> > The code to support Zone Descriptor Extensions is not included in
> > this commit and ZDES 0 is always reported. A later commit in this
> > series will add ZDE support.
> > 
> > This commit doesn't yet include checks for active and open zone
> > limits. It is assumed that there are no limits on either active or
> > open zones.
> > 
> 
> I think the fill_pattern feature stands separate, so it would be nice to
> extract that to a patch on its own.
> 

Please disregard this.

Since the fill_pattern feature is tightly bound to reading in zones, it
doesn't really make sense to extract it.


signature.asc
Description: PGP signature


Re: [PATCH v2] hw/ide: check null block before _cancel_dma_sync

2020-09-29 Thread P J P
+-- On Tue, 29 Sep 2020, Li Qiang wrote --+
| P J P wrote on Tue, 29 Sep 2020 at 2:22 PM:
| > +-- On Fri, 18 Sep 2020, Li Qiang wrote --+
| > | P J P wrote on Fri, 18 Sep 2020 at 6:26 PM:
| > | > +-- On Fri, 18 Sep 2020, Li Qiang wrote --+
| > | > | Update v2: use an assert() call
| > | > |   -> https://lists.nongnu.org/archive/html/qemu-devel/2020-08/msg08336.html
| > |
| > | In 'ide_ioport_write' the guest can set 'bus->unit' to 0 or 1 by issuing
| > | 'ATA_IOPORT_WR_DEVICE_HEAD'. So in this case the guest can set the
| > | active ifs. If the guest sets this to 1,
| > |
| > | then 'idebus_active_if' will return 'IDEBus.ifs[1]' and thus 's->blk'
| > | will be NULL.
| >
| > Right, guest does select the drive via
| >
| >   portio_write
| >->ide_ioport_write
| >   case ATA_IOPORT_WR_DEVICE_HEAD:
| >   /* FIXME: HOB readback uses bit 7 */
| >   bus->ifs[0].select = (val & ~0x10) | 0xa0;
| >   bus->ifs[1].select = (val | 0x10) | 0xa0;
| >   /* select drive */
| >   bus->unit = (val >> 4) & 1; <== set bus->unit=0x1
| >   break;
| >
| >
| > | So from what you (Peter) are saying, we need to check the value in the
| > | 'ATA_IOPORT_WR_DEVICE_HEAD' handler, i.e. whether the guest has set a
| > | valid 'bus->unit'. This can also work, I think.
| >
| > Yes, with the following fix, an assert(3) in ide_cancel_dma_sync fails.
| >
| > ===
| > diff --git a/hw/ide/core.c b/hw/ide/core.c
| > index f76f7e5234..cb55cc8b0f 100644
| > --- a/hw/ide/core.c
| > +++ b/hw/ide/core.c
| > @@ -1300,7 +1300,11 @@ void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val)
| >  bus->ifs[0].select = (val & ~0x10) | 0xa0;
| >  bus->ifs[1].select = (val | 0x10) | 0xa0;
| >  /* select drive */
| > +uint8_t bu = bus->unit;
| >  bus->unit = (val >> 4) & 1;
| > +if (!bus->ifs[bus->unit].blk) {
| > +bus->unit = bu;
| > +}
| >  break;
| >  default:
| >
| > qemu-system-x86_64: ../hw/ide/core.c:724: ide_cancel_dma_sync: Assertion
| > `s->bus->dma->aiocb == NULL' failed.
| > Aborted (core dumped)
| 
| This is what I am worried about: 'ide_ioport_write' sets 'bus->unit', but it
| also changes 'bus->ifs[0].select'. There may be some other corner case that
| causes an inconsistency. And if we choose this method we need to dig deeper
| into the AHCI spec to know how things really work.
| 
| > ===
| >
| > | As 'ide_exec_cmd' and other functions in 'hw/ide/core.c' check
| > | 's->blk' directly, I think just checking it in 'ide_cancel_dma_sync' is
| > | enough, and this is also more consistent with the other functions.
| > | 'ide_cancel_dma_sync' is also called by 'cmd_device_reset', which is one
| > | of the 'ide_cmd_table' handlers.
| >
| >   Yes, I'm okay with either approach. The earlier v1 patch checks 's->blk' in
| > ide_cancel_dma_sync().
| 
| I prefer the 'check s->blk at the beginning of ide_cancel_dma_sync' method,
| slightly different from your earlier patch.
| 
| Anyway, let the maintainer do the choices.
| 
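
(For reference, the "check s->blk at the beginning" approach under
discussion amounts to roughly the following sketch -- an illustration of
the idea, not the actual v1 patch:

    void ide_cancel_dma_sync(IDEState *s)
    {
        /* The selected drive has no block backend; nothing to cancel. */
        if (!s->blk) {
            return;
        }

        if (s->bus->dma->aiocb) {
            blk_drain(s->blk);
            assert(s->bus->dma->aiocb == NULL);
        }
    }
)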

@John ...wdyt?

Thank you.
--
Prasad J Pandit / Red Hat Product Security Team
8685 545E B54C 486B C6EB 271E E285 8B5A F050 DE8D

[PATCH 2/2] hw/block/m25p80: Fix nonvolatile-cfg property default value

2020-09-29 Thread Joe Komlodi
The nvcfg register bits are all 1s, unless previously modified.

Signed-off-by: Joe Komlodi 
---
 hw/block/m25p80.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 43830c9..69c88d4 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -1334,7 +1334,7 @@ static int m25p80_pre_save(void *opaque)
 
 static Property m25p80_properties[] = {
 /* This is default value for Micron flash */
-DEFINE_PROP_UINT32("nonvolatile-cfg", Flash, nonvolatile_cfg, 0x8FFF),
+DEFINE_PROP_UINT32("nonvolatile-cfg", Flash, nonvolatile_cfg, 0x),
 DEFINE_PROP_UINT8("spansion-cr1nv", Flash, spansion_cr1nv, 0x0),
 DEFINE_PROP_UINT8("spansion-cr2nv", Flash, spansion_cr2nv, 0x8),
 DEFINE_PROP_UINT8("spansion-cr3nv", Flash, spansion_cr3nv, 0x2),
-- 
2.7.4




[PATCH 0/2] hw/block/m25p80: Fix Numonyx flash dummy cycle register behavior

2020-09-29 Thread Joe Komlodi
Hi all,

This series addresses a couple of issues with dummy cycle counts on Numonyx
flashes.

The first patch fixes the behavior of the dummy cycle register so it's closer to
how hardware behaves.
As a consequence, it also corrects the amount of dummy cycles QIOR and QIOR4
commands need by default.

The second patch changes the default value of the nvcfg register so it
matches what would be in hardware from the factory.

Thanks!
Joe

Joe Komlodi (2):
  hw/block/m25p80: Fix Numonyx dummy cycle register behavior
  hw/block/m25p80: Fix nonvolatile-cfg property default value

 hw/block/m25p80.c | 28 
 1 file changed, 24 insertions(+), 4 deletions(-)

-- 
2.7.4




[PATCH 1/2] hw/block/m25p80: Fix Numonyx dummy cycle register behavior

2020-09-29 Thread Joe Komlodi
Numonyx chips determine the number of cycles to wait based on bits 7:4 in the
volatile configuration register.

However, if these bits are 0x0 or 0xF, the number of dummy cycles to wait is
10 on a QIOR or QIOR4 command, or 8 on any other currently supported
fast read command. [1]

[1] http://www.micron.com/-/media/client/global/documents/products/
data-sheet/nor-flash/serial-nor/n25q/n25q_512mb_1_8v_65nm.pdf

Page 22 note 2, and page 30 notes 5 and 10.

Signed-off-by: Joe Komlodi 
---
 hw/block/m25p80.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 483925f..43830c9 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -820,6 +820,26 @@ static void reset_memory(Flash *s)
 trace_m25p80_reset_done(s);
 }
 
+static uint8_t numonyx_fast_read_num_dummies(Flash *s)
+{
+uint8_t cycle_count;
+uint8_t num_dummies;
+assert(get_man(s) == MAN_NUMONYX);
+
+cycle_count = extract32(s->volatile_cfg, 4, 4);
+if (cycle_count == 0x0 || cycle_count == 0x0F) {
+if (s->cmd_in_progress == QIOR || s->cmd_in_progress == QIOR4) {
+num_dummies = 10;
+} else {
+num_dummies = 8;
+}
+} else {
+num_dummies = cycle_count;
+}
+
+return num_dummies;
+}
+
 static void decode_fast_read_cmd(Flash *s)
 {
 s->needed_bytes = get_addr_length(s);
@@ -829,7 +849,7 @@ static void decode_fast_read_cmd(Flash *s)
 s->needed_bytes += 8;
 break;
 case MAN_NUMONYX:
-s->needed_bytes += extract32(s->volatile_cfg, 4, 4);
+s->needed_bytes += numonyx_fast_read_num_dummies(s);
 break;
 case MAN_MACRONIX:
 if (extract32(s->volatile_cfg, 6, 2) == 1) {
@@ -868,7 +888,7 @@ static void decode_dio_read_cmd(Flash *s)
 );
 break;
 case MAN_NUMONYX:
-s->needed_bytes += extract32(s->volatile_cfg, 4, 4);
+s->needed_bytes += numonyx_fast_read_num_dummies(s);
 break;
 case MAN_MACRONIX:
 switch (extract32(s->volatile_cfg, 6, 2)) {
@@ -908,7 +928,7 @@ static void decode_qio_read_cmd(Flash *s)
 );
 break;
 case MAN_NUMONYX:
-s->needed_bytes += extract32(s->volatile_cfg, 4, 4);
+s->needed_bytes += numonyx_fast_read_num_dummies(s);
 break;
 case MAN_MACRONIX:
 switch (extract32(s->volatile_cfg, 6, 2)) {
-- 
2.7.4




[PATCH v2 14/14] hw/block/nvme: allow open to close transitions by controller

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

Allow the controller to release open resources by transitioning
implicitly and explicitly opened zones to closed. This is done using a
naive "least recently opened" strategy.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme-ns.h|   5 ++
 hw/block/nvme-ns.c|   5 ++
 hw/block/nvme.c   | 105 +++---
 hw/block/trace-events |   5 ++
 4 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index ff34cd37af7d..491a77f3ae2f 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -62,6 +62,8 @@ typedef struct NvmeZone {
 uint8_t*zde;
 
 uint64_t wp_staging;
+
+QTAILQ_ENTRY(NvmeZone) lru_entry;
 } NvmeZone;
 
 typedef struct NvmeNamespace {
@@ -101,6 +103,9 @@ typedef struct NvmeNamespace {
 struct {
 uint32_t open;
 uint32_t active;
+
+QTAILQ_HEAD(, NvmeZone) lru_open;
+QTAILQ_HEAD(, NvmeZone) lru_active;
 } resources;
 } zns;
 } NvmeNamespace;
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 9584fbb3f62d..26c9f846417a 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -225,6 +225,9 @@ void nvme_ns_zns_init_zone_state(NvmeNamespace *ns)
ns->zns.resources.open = ns->params.zns.mor != 0xffffffff ?
ns->params.zns.mor + 1 : ns->zns.num_zones;
 
+QTAILQ_INIT(&ns->zns.resources.lru_open);
+QTAILQ_INIT(&ns->zns.resources.lru_active);
+
 for (int i = 0; i < ns->zns.num_zones; i++) {
NvmeZone *zone = &ns->zns.zones[i];
zone->zd = &ns->zns.zd[i];
@@ -248,6 +251,8 @@ void nvme_ns_zns_init_zone_state(NvmeNamespace *ns)
 
 if (ns->zns.resources.active) {
 ns->zns.resources.active--;
+QTAILQ_INSERT_TAIL(&ns->zns.resources.lru_active, zone,
+   lru_entry);
 continue;
 }
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index fc5b119e3f35..34093f33ad1a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1209,12 +1209,61 @@ static inline void nvme_zone_reset_wp(NvmeZone *zone)
 zone->wp_staging = nvme_zslba(zone);
 }
 
-static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
-NvmeZoneState to)
+static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
+NvmeZone *zone, NvmeZoneState to,
+NvmeRequest *req);
+
+static uint16_t nvme_zrm_release_open(NvmeCtrl *n, NvmeNamespace *ns,
+  NvmeRequest *req)
+{
+NvmeZone *candidate;
+NvmeZoneState zs;
+uint16_t status;
+
+trace_pci_nvme_zone_zrm_release_open(nvme_cid(req), ns->params.nsid);
+
+QTAILQ_FOREACH(candidate, &ns->zns.resources.lru_open, lru_entry) {
+zs = nvme_zs(candidate);
+
+trace_pci_nvme_zone_zrm_candidate(nvme_cid(req), ns->params.nsid,
+  nvme_zslba(candidate),
+  nvme_wp(candidate), zs);
+
+/* skip explicitly opened zones */
+if (zs == NVME_ZS_ZSEO) {
+continue;
+}
+
+/* the zone cannot be closed if it is currently writing */
+if (candidate->wp_staging != nvme_wp(candidate)) {
+continue;
+}
+
+status = nvme_zrm_transition(n, ns, candidate, NVME_ZS_ZSC, req);
+if (status) {
+return status;
+}
+
+if (nvme_zns_commit_zone(ns, candidate) < 0) {
+return NVME_INTERNAL_DEV_ERROR;
+}
+
+return NVME_SUCCESS;
+}
+
+return NVME_TOO_MANY_OPEN_ZONES;
+}
+
+static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
+NvmeZone *zone, NvmeZoneState to,
+NvmeRequest *req)
 {
 NvmeZoneState from = nvme_zs(zone);
+uint16_t status;
+
+trace_pci_nvme_zone_zrm_transition(nvme_cid(req), ns->params.nsid,
+   nvme_zslba(zone), nvme_zs(zone), to);
 
-/* fast path */
 if (from == to) {
 return NVME_SUCCESS;
 }
@@ -1229,25 +1278,32 @@ static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
 
 case NVME_ZS_ZSC:
 if (!ns->zns.resources.active) {
+trace_pci_nvme_err_too_many_active_zones(nvme_cid(req));
 return NVME_TOO_MANY_ACTIVE_ZONES;
 }
 
 ns->zns.resources.active--;
+QTAILQ_INSERT_TAIL(&ns->zns.resources.lru_active, zone, lru_entry);
 
 break;
 
 case NVME_ZS_ZSIO:
 case NVME_ZS_ZSEO:
 if (!ns->zns.resources.active) {
+trace_pci_nvme_err_too_many_active_zones(nvme_cid(req));
 return NVME_TOO_MANY_ACTIVE_ZONES;
 }
 
 if (!ns->zns.resources.open) {
-return 

[PATCH v2 12/14] hw/block/nvme: add the zone append command

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

Add the Zone Append command.
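
In short -- as a condensed sketch of the handler below, not the literal
diff -- the append is steered to the zone's current write pointer and the
assigned LBA is returned through the completion entry:

    if (req->cmd.opcode == NVME_CMD_ZONE_APPEND) {
        slba = zone->wp_staging;               /* write lands at the wp */
        req->cqe.qw0 = cpu_to_le64(slba);      /* report the assigned LBA */
    }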

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h   |  6 
 include/block/nvme.h  |  7 +
 hw/block/nvme.c   | 71 +++
 hw/block/trace-events |  1 +
 4 files changed, 79 insertions(+), 6 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index c704663e0a3e..8cd2d936548e 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -16,6 +16,10 @@ typedef struct NvmeParams {
 uint32_t aer_max_queued;
 uint8_t  mdts;
 bool use_intel_id;
+
+struct {
+uint8_t zasl;
+} zns;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -41,6 +45,7 @@ static inline bool nvme_req_is_write(NvmeRequest *req)
 switch (req->cmd.opcode) {
 case NVME_CMD_WRITE:
 case NVME_CMD_WRITE_ZEROES:
+case NVME_CMD_ZONE_APPEND:
 return true;
 default:
 return false;
@@ -73,6 +78,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
 case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
 case NVME_CMD_ZONE_MGMT_SEND:   return "NVME_ZONED_CMD_ZONE_MGMT_SEND";
 case NVME_CMD_ZONE_MGMT_RECV:   return "NVME_ZONED_CMD_ZONE_MGMT_RECV";
+case NVME_CMD_ZONE_APPEND:  return "NVME_ZONED_CMD_ZONE_APPEND";
 default:return "NVME_NVM_CMD_UNKNOWN";
 }
 }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 967b42eb5da7..5f8914f594f4 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -483,6 +483,7 @@ enum NvmeIoCommands {
 NVME_CMD_DSM= 0x09,
 NVME_CMD_ZONE_MGMT_SEND = 0x79,
 NVME_CMD_ZONE_MGMT_RECV = 0x7a,
+NVME_CMD_ZONE_APPEND= 0x7d,
 };
 
 typedef struct QEMU_PACKED NvmeDeleteQ {
@@ -1018,6 +1019,11 @@ enum NvmeIdCtrlLpa {
 NVME_LPA_EXTENDED = 1 << 2,
 };
 
+typedef struct QEMU_PACKED NvmeIdCtrlZns {
+uint8_t zasl;
+uint8_t rsvd1[4095];
+} NvmeIdCtrlZns;
+
 #define NVME_CTRL_SQES_MIN(sqes) ((sqes) & 0xf)
 #define NVME_CTRL_SQES_MAX(sqes) (((sqes) >> 4) & 0xf)
 #define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf)
@@ -1242,6 +1248,7 @@ static inline void _nvme_check_size(void)
 QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
 QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
 QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
+QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrlZns) != 4096);
 QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsNvm) != 4096);
 QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZns) != 4096);
 QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5c109cab58e8..a891ee284d1d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -166,6 +166,8 @@ static const NvmeEffectsLog nvme_effects[NVME_IOCS_MAX] = {
 [NVME_CMD_ZONE_MGMT_RECV] = NVME_EFFECTS_CSUPP,
 [NVME_CMD_ZONE_MGMT_SEND] = NVME_EFFECTS_CSUPP |
 NVME_EFFECTS_LBCC,
+[NVME_CMD_ZONE_APPEND]= NVME_EFFECTS_CSUPP |
+NVME_EFFECTS_LBCC,
 },
 },
 };
@@ -1041,6 +1043,21 @@ static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len)
 return NVME_SUCCESS;
 }
 
+static inline uint16_t nvme_check_zasl(NvmeCtrl *n, size_t len)
+{
+uint8_t zasl = n->params.zns.zasl;
+
+if (!zasl) {
+return nvme_check_mdts(n, len);
+}
+
+if (len > n->page_size << zasl) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+return NVME_SUCCESS;
+}
+
 static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
  uint64_t slba, uint32_t nlb)
 {
@@ -1410,6 +1427,7 @@ static uint16_t nvme_do_aio(BlockBackend *blk, int64_t offset, size_t len,
 break;
 
 case NVME_CMD_WRITE:
+case NVME_CMD_ZONE_APPEND:
 is_write = true;
 
 /* fallthrough */
@@ -1945,12 +1963,40 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
 uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
 size_t len = nvme_l2b(ns, nlb);
 
-bool is_write = nvme_req_is_write(req);
+bool is_append, is_write = nvme_req_is_write(req);
 uint16_t status;
 
 trace_pci_nvme_rwz(nvme_cid(req), nvme_io_opc_str(rw->opcode),
nvme_nsid(ns), nlb, len, slba);
 
+if (req->cmd.opcode == NVME_CMD_ZONE_APPEND) {
+uint64_t wp;
+is_append = true;
+
+zone = nvme_ns_get_zone(ns, slba);
+if (!zone) {
+trace_pci_nvme_err_invalid_zone(nvme_cid(req), slba);
+status = NVME_INVALID_FIELD | NVME_DNR;
+goto invalid;
+}
+
+wp = zone->wp_staging;
+
+if (slba != nvme_zslba(zone)) {
+trace_pci_nvme_err_invalid_zslba(nvme_cid(req), slba);
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+status = nvme_check_zasl(n, len);
+if (status) {
+trace_pci_nvme_err_zasl(nvme_cid(req), len);
+   

[PATCH v2 10/14] hw/block/nvme: add the zone management receive command

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

Add the Zone Management Receive command.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme-ns.h|  11 +++-
 hw/block/nvme.h   |   1 +
 include/block/nvme.h  |  46 ++
 hw/block/nvme-ns.c|  49 ---
 hw/block/nvme.c   | 135 ++
 hw/block/trace-events |   1 +
 6 files changed, 233 insertions(+), 10 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index e16f50dc4bb8..82cb0b0bce82 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -37,9 +37,10 @@ typedef struct NvmePstateHeader {
 struct {
 uint64_t zcap;
 uint64_t zsze;
+uint8_t  zdes;
 } QEMU_PACKED zns;
 
-uint8_t  rsvd3088[1008];
+uint8_t  rsvd3089[1007];
 } QEMU_PACKED NvmePstateHeader;
 
 typedef struct NvmeNamespaceParams {
@@ -50,11 +51,13 @@ typedef struct NvmeNamespaceParams {
 struct {
 uint64_t zcap;
 uint64_t zsze;
+uint8_t  zdes;
 } zns;
 } NvmeNamespaceParams;
 
 typedef struct NvmeZone {
 NvmeZoneDescriptor *zd;
+uint8_t*zde;
 
 uint64_t wp_staging;
 } NvmeZone;
@@ -91,6 +94,7 @@ typedef struct NvmeNamespace {
 
 NvmeZone   *zones;
 NvmeZoneDescriptor *zd;
+uint8_t*zde;
 } zns;
 } NvmeNamespace;
 
@@ -183,6 +187,11 @@ static inline void nvme_zs_set(NvmeZone *zone, NvmeZoneState zs)
 zone->zd->zs = zs << 4;
 }
 
+static inline size_t nvme_ns_zdes_bytes(NvmeNamespace *ns)
+{
+return ns->params.zns.zdes << 6;
+}
+
 static inline bool nvme_ns_zone_wp_valid(NvmeZone *zone)
 {
 switch (nvme_zs(zone)) {
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index f66ed9ab7eff..523eef0bcad8 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -71,6 +71,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
 case NVME_CMD_WRITE:return "NVME_NVM_CMD_WRITE";
 case NVME_CMD_READ: return "NVME_NVM_CMD_READ";
 case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
+case NVME_CMD_ZONE_MGMT_RECV:   return "NVME_ZONED_CMD_ZONE_MGMT_RECV";
 default:return "NVME_NVM_CMD_UNKNOWN";
 }
 }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 2e523c9d97b4..9bacf48ee9e9 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -481,6 +481,7 @@ enum NvmeIoCommands {
 NVME_CMD_COMPARE= 0x05,
 NVME_CMD_WRITE_ZEROES   = 0x08,
 NVME_CMD_DSM= 0x09,
+NVME_CMD_ZONE_MGMT_RECV = 0x7a,
 };
 
 typedef struct QEMU_PACKED NvmeDeleteQ {
@@ -593,6 +594,44 @@ enum {
 NVME_RW_PRINFO_PRCHK_REF= 1 << 10,
 };
 
+typedef struct QEMU_PACKED NvmeZoneManagementRecvCmd {
+uint8_t opcode;
+uint8_t flags;
+uint16_tcid;
+uint32_tnsid;
+uint8_t rsvd8[16];
+NvmeCmdDptr dptr;
+uint64_tslba;
+uint32_tnumdw;
+uint8_t zra;
+uint8_t zrasp;
+uint8_t zrasf;
+uint8_t rsvd55[9];
+} NvmeZoneManagementRecvCmd;
+
+typedef enum NvmeZoneManagementRecvAction {
+NVME_CMD_ZONE_MGMT_RECV_REPORT_ZONES  = 0x0,
+NVME_CMD_ZONE_MGMT_RECV_EXTENDED_REPORT_ZONES = 0x1,
+} NvmeZoneManagementRecvAction;
+
+typedef enum NvmeZoneManagementRecvActionSpecificField {
+NVME_CMD_ZONE_MGMT_RECV_LIST_ALL  = 0x0,
+NVME_CMD_ZONE_MGMT_RECV_LIST_ZSE  = 0x1,
+NVME_CMD_ZONE_MGMT_RECV_LIST_ZSIO = 0x2,
+NVME_CMD_ZONE_MGMT_RECV_LIST_ZSEO = 0x3,
+NVME_CMD_ZONE_MGMT_RECV_LIST_ZSC  = 0x4,
+NVME_CMD_ZONE_MGMT_RECV_LIST_ZSF  = 0x5,
+NVME_CMD_ZONE_MGMT_RECV_LIST_ZSRO = 0x6,
+NVME_CMD_ZONE_MGMT_RECV_LIST_ZSO  = 0x7,
+} NvmeZoneManagementRecvActionSpecificField;
+
+#define NVME_CMD_ZONE_MGMT_RECEIVE_PARTIAL 0x1
+
+typedef struct QEMU_PACKED NvmeZoneReportHeader {
+uint64_t num_zones;
+uint8_t  rsvd[56];
+} NvmeZoneReportHeader;
+
 typedef struct QEMU_PACKED NvmeDsmCmd {
 uint8_t opcode;
 uint8_t flags;
@@ -812,6 +851,12 @@ typedef struct QEMU_PACKED NvmeZoneDescriptor {
 uint8_t  rsvd32[32];
 } NvmeZoneDescriptor;
 
+#define NVME_ZA_ZDEV (1 << 7)
+
+#define NVME_ZA_SET(za, attrs)   ((za) |= (attrs))
+#define NVME_ZA_CLEAR(za, attrs) ((za) &= ~(attrs))
+#define NVME_ZA_CLEAR_ALL(za)((za) = 0x0)
+
 enum NvmeSmartWarn {
 NVME_SMART_SPARE  = 1 << 0,
 NVME_SMART_TEMPERATURE= 1 << 1,
@@ -1162,6 +1207,7 @@ static inline void _nvme_check_size(void)
 QEMU_BUILD_BUG_ON(sizeof(NvmeIdentify) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeRwCmd) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeDsmCmd) != 64);
+QEMU_BUILD_BUG_ON(sizeof(NvmeZoneManagementRecvCmd) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeRangeType) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index ba32ea6d1326..bc49c7f2674f 100644
--- 

[PATCH v2 13/14] hw/block/nvme: track and enforce zone resources

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

Track number of open/active resources.

Signed-off-by: Klaus Jensen 
---
 docs/specs/nvme.txt  |  7 +
 hw/block/nvme-ns.h   |  7 +
 include/block/nvme.h |  2 ++
 hw/block/nvme-ns.c   | 25 +++--
 hw/block/nvme.c  | 67 +++-
 5 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index b23e59dd3075..e3810843cd2d 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -42,6 +42,13 @@ nvme-ns Options
  zns.zcap; if the zone capacity is a power of two, the zone size will be
  set to that, otherwise it will default to the next power of two.
 
+  `zns.mar`; Specifies the number of active resources available. This is a 0s
+ based value.
+
+  `zns.mor`; Specifies the number of open resources available. This is a 0s
+ based value.
+
+
 Reference Specifications
 
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 82cb0b0bce82..ff34cd37af7d 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -52,6 +52,8 @@ typedef struct NvmeNamespaceParams {
 uint64_t zcap;
 uint64_t zsze;
 uint8_t  zdes;
+uint32_t mar;
+uint32_t mor;
 } zns;
 } NvmeNamespaceParams;
 
@@ -95,6 +97,11 @@ typedef struct NvmeNamespace {
 NvmeZone   *zones;
 NvmeZoneDescriptor *zd;
 uint8_t*zde;
+
+struct {
+uint32_t open;
+uint32_t active;
+} resources;
 } zns;
 } NvmeNamespace;
 
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 5f8914f594f4..d51f397e7ff1 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -776,6 +776,8 @@ enum NvmeStatusCodes {
 NVME_ZONE_IS_READ_ONLY  = 0x01ba,
 NVME_ZONE_IS_OFFLINE= 0x01bb,
 NVME_ZONE_INVALID_WRITE = 0x01bc,
+NVME_TOO_MANY_ACTIVE_ZONES  = 0x01bd,
+NVME_TOO_MANY_OPEN_ZONES= 0x01be,
 NVME_INVALID_ZONE_STATE_TRANSITION = 0x01bf,
 NVME_WRITE_FAULT= 0x0280,
 NVME_UNRECOVERED_READ   = 0x0281,
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index bc49c7f2674f..9584fbb3f62d 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -119,8 +119,8 @@ static void nvme_ns_init_zoned(NvmeNamespace *ns)
 ns->zns.zde = g_malloc0_n(ns->zns.num_zones, nvme_ns_zdes_bytes(ns));
 }
 
-id_ns_zns->mar = 0xffffffff;
-id_ns_zns->mor = 0xffffffff;
+id_ns_zns->mar = cpu_to_le32(ns->params.zns.mar);
+id_ns_zns->mor = cpu_to_le32(ns->params.zns.mor);
 }
 
 static void nvme_ns_init(NvmeNamespace *ns)
@@ -220,6 +220,11 @@ static int nvme_ns_pstate_init(NvmeNamespace *ns, Error **errp)
 
 void nvme_ns_zns_init_zone_state(NvmeNamespace *ns)
 {
+ns->zns.resources.active = ns->params.zns.mar != 0xffffffff ?
+ns->params.zns.mar + 1 : ns->zns.num_zones;
+ns->zns.resources.open = ns->params.zns.mor != 0xffffffff ?
+ns->params.zns.mor + 1 : ns->zns.num_zones;
+
 for (int i = 0; i < ns->zns.num_zones; i++) {
NvmeZone *zone = &ns->zns.zones[i];
zone->zd = &ns->zns.zd[i];
@@ -238,9 +243,15 @@ void nvme_ns_zns_init_zone_state(NvmeNamespace *ns)
 if (nvme_wp(zone) == nvme_zslba(zone) &&
 !(zone->zd->za & NVME_ZA_ZDEV)) {
 nvme_zs_set(zone, NVME_ZS_ZSE);
+continue;
 }
 
-continue;
+if (ns->zns.resources.active) {
+ns->zns.resources.active--;
+continue;
+}
+
+/* fallthrough */
 
 case NVME_ZS_ZSIO:
 case NVME_ZS_ZSEO:
@@ -462,6 +473,12 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
 return -1;
 }
 
+if (ns->params.zns.mor > ns->params.zns.mar) {
+error_setg(errp, "maximum open resources (zns.mor) must be less "
+   "than or equal to maximum active resources (zns.mar)");
+return -1;
+}
+
 break;
 
 default:
@@ -547,6 +564,8 @@ static Property nvme_ns_props[] = {
 DEFINE_PROP_UINT64("zns.zcap", NvmeNamespace, params.zns.zcap, 0),
 DEFINE_PROP_UINT64("zns.zsze", NvmeNamespace, params.zns.zsze, 0),
 DEFINE_PROP_UINT8("zns.zdes", NvmeNamespace, params.zns.zdes, 0),
+DEFINE_PROP_UINT32("zns.mar", NvmeNamespace, params.zns.mar, 0x),
+DEFINE_PROP_UINT32("zns.mor", NvmeNamespace, params.zns.mor, 0x),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a891ee284d1d..fc5b119e3f35 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1221,6 +1221,40 @@ static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
 
 switch (from) {
 case NVME_ZS_ZSE:
+switch (to) {
+case NVME_ZS_ZSF:
+case NVME_ZS_ZSRO:
+case NVME_ZS_ZSO:
+break;
+
+case NVME_ZS_ZSC:
+   

[PATCH v2 11/14] hw/block/nvme: add the zone management send command

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

Add the Zone Management Send command.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h   |   1 +
 include/block/nvme.h  |  29 +++
 hw/block/nvme.c   | 552 --
 hw/block/trace-events |  11 +
 4 files changed, 574 insertions(+), 19 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 523eef0bcad8..c704663e0a3e 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -71,6 +71,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
 case NVME_CMD_WRITE:return "NVME_NVM_CMD_WRITE";
 case NVME_CMD_READ: return "NVME_NVM_CMD_READ";
 case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
+case NVME_CMD_ZONE_MGMT_SEND:   return "NVME_ZONED_CMD_ZONE_MGMT_SEND";
 case NVME_CMD_ZONE_MGMT_RECV:   return "NVME_ZONED_CMD_ZONE_MGMT_RECV";
 default:return "NVME_NVM_CMD_UNKNOWN";
 }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 9bacf48ee9e9..967b42eb5da7 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -481,6 +481,7 @@ enum NvmeIoCommands {
 NVME_CMD_COMPARE= 0x05,
 NVME_CMD_WRITE_ZEROES   = 0x08,
 NVME_CMD_DSM= 0x09,
+NVME_CMD_ZONE_MGMT_SEND = 0x79,
 NVME_CMD_ZONE_MGMT_RECV = 0x7a,
 };
 
@@ -594,6 +595,32 @@ enum {
 NVME_RW_PRINFO_PRCHK_REF= 1 << 10,
 };
 
+typedef struct QEMU_PACKED NvmeZoneManagementSendCmd {
+uint8_t opcode;
+uint8_t flags;
+uint16_tcid;
+uint32_tnsid;
+uint32_trsvd8[4];
+NvmeCmdDptr dptr;
+uint64_tslba;
+uint32_trsvd48;
+uint8_t zsa;
+uint8_t zsflags;
+uint16_trsvd54;
+uint32_trsvd56[2];
+} NvmeZoneManagementSendCmd;
+
+#define NVME_CMD_ZONE_MGMT_SEND_SELECT_ALL(zsflags) ((zsflags) & 0x1)
+
+typedef enum NvmeZoneManagementSendAction {
+NVME_CMD_ZONE_MGMT_SEND_CLOSE   = 0x1,
+NVME_CMD_ZONE_MGMT_SEND_FINISH  = 0x2,
+NVME_CMD_ZONE_MGMT_SEND_OPEN= 0x3,
+NVME_CMD_ZONE_MGMT_SEND_RESET   = 0x4,
+NVME_CMD_ZONE_MGMT_SEND_OFFLINE = 0x5,
+NVME_CMD_ZONE_MGMT_SEND_SET_ZDE = 0x10,
+} NvmeZoneManagementSendAction;
+
 typedef struct QEMU_PACKED NvmeZoneManagementRecvCmd {
 uint8_t opcode;
 uint8_t flags;
@@ -748,6 +775,7 @@ enum NvmeStatusCodes {
 NVME_ZONE_IS_READ_ONLY  = 0x01ba,
 NVME_ZONE_IS_OFFLINE= 0x01bb,
 NVME_ZONE_INVALID_WRITE = 0x01bc,
+NVME_INVALID_ZONE_STATE_TRANSITION = 0x01bf,
 NVME_WRITE_FAULT= 0x0280,
 NVME_UNRECOVERED_READ   = 0x0281,
 NVME_E2E_GUARD_ERROR= 0x0282,
@@ -1207,6 +1235,7 @@ static inline void _nvme_check_size(void)
 QEMU_BUILD_BUG_ON(sizeof(NvmeIdentify) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeRwCmd) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeDsmCmd) != 64);
+QEMU_BUILD_BUG_ON(sizeof(NvmeZoneManagementSendCmd) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeZoneManagementRecvCmd) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeRangeType) != 64);
 QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 1e6c57752769..5c109cab58e8 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -164,6 +164,8 @@ static const NvmeEffectsLog nvme_effects[NVME_IOCS_MAX] = {
 .iocs = {
 NVME_EFFECTS_NVM_INITIALIZER,
 [NVME_CMD_ZONE_MGMT_RECV] = NVME_EFFECTS_CSUPP,
+[NVME_CMD_ZONE_MGMT_SEND] = NVME_EFFECTS_CSUPP |
+NVME_EFFECTS_LBCC,
 },
 },
 };
@@ -1064,21 +1066,20 @@ static inline uint16_t nvme_check_dulbe(NvmeNamespace *ns, uint64_t slba,
 return NVME_SUCCESS;
 }
 
-static int nvme_allocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb)
+static int __nvme_allocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb,
+   bool deallocate)
 {
 int nlongs, idx;
 int64_t offset;
 unsigned long *map, *src;
 int ret;
 
-if (!(ns->pstate.blk && nvme_check_dulbe(ns, slba, nlb))) {
-return 0;
+if (deallocate) {
+bitmap_clear(ns->pstate.utilization.map, slba, nlb);
+} else {
+bitmap_set(ns->pstate.utilization.map, slba, nlb);
 }
 
-trace_pci_nvme_allocate(nvme_nsid(ns), slba, nlb);
-
-bitmap_set(ns->pstate.utilization.map, slba, nlb);
-
 /*
  * The bitmap is an array of unsigned longs, so calculate the index given
  * the size of a long.
@@ -1123,6 +1124,28 @@ static int nvme_allocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb)
 return ret;
 }
 
+static int nvme_allocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb)
+{
+if (!(ns->pstate.blk && nvme_check_dulbe(ns, slba, nlb))) {
+return 0;
+}
+
+trace_pci_nvme_allocate(nvme_nsid(ns), slba, nlb);
+
+return __nvme_allocate(ns, slba, nlb, false /* deallocate */);
+}
+
+static int nvme_deallocate(NvmeNamespace *ns, uint64_t 

[PATCH v2 07/14] hw/block/nvme: add commands supported and effects log page

2020-09-29 Thread Klaus Jensen
From: Gollu Appalanaidu 

This adds support for the Commands Supported and Effects log page. See
NVM Express Spec 1.3d, sec. 5.14.1.5 ("Commands Supported and Effects")

Signed-off-by: Gollu Appalanaidu 
Signed-off-by: Klaus Jensen 
---
 include/block/nvme.h | 21 +
 hw/block/nvme.c  | 75 +++-
 2 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index abd49d371e63..5a5e19f6bedc 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -734,6 +734,24 @@ typedef struct QEMU_PACKED NvmeSmartLog {
 uint8_t reserved2[320];
 } NvmeSmartLog;
 
+typedef struct QEMU_PACKED NvmeEffectsLog {
+uint32_t acs[256];
+uint32_t iocs[256];
+uint8_t  rsvd2048[2048];
+} NvmeEffectsLog;
+
+enum {
+NVME_EFFECTS_CSUPP  = 1 <<  0,
+NVME_EFFECTS_LBCC   = 1 <<  1,
+NVME_EFFECTS_NCC= 1 <<  2,
+NVME_EFFECTS_NIC= 1 <<  3,
+NVME_EFFECTS_CCC= 1 <<  4,
+NVME_EFFECTS_CSE_SINGLE = 1 << 16,
+NVME_EFFECTS_CSE_MULTI  = 1 << 17,
+NVME_EFFECTS_CSE_MASK   = 3 << 16,
+NVME_EFFECTS_UUID_SEL   = 1 << 19,
+};
+
 enum NvmeSmartWarn {
 NVME_SMART_SPARE  = 1 << 0,
 NVME_SMART_TEMPERATURE= 1 << 1,
@@ -746,6 +764,7 @@ enum NvmeLogIdentifier {
 NVME_LOG_ERROR_INFO = 0x01,
 NVME_LOG_SMART_INFO = 0x02,
 NVME_LOG_FW_SLOT_INFO   = 0x03,
+NVME_LOG_EFFECTS= 0x05,
 };
 
 typedef struct QEMU_PACKED NvmePSD {
@@ -857,6 +876,7 @@ enum NvmeIdCtrlFrmw {
 };
 
 enum NvmeIdCtrlLpa {
+NVME_LPA_EFFECTS_LOG  = 1 << 1,
 NVME_LPA_EXTENDED = 1 << 2,
 };
 
@@ -1064,5 +1084,6 @@ static inline void _nvme_check_size(void)
 QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
 QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
 QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
+QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
 }
 #endif
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index be5a0a7dfa09..0bc19a1e3688 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -81,6 +81,7 @@
 #define NVME_TEMPERATURE_WARNING 0x157
 #define NVME_TEMPERATURE_CRITICAL 0x175
 #define NVME_NUM_FW_SLOTS 1
+#define NVME_MAX_ADM_IO_CMDS 0xFF
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -112,6 +113,46 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
 [NVME_TIMESTAMP]= NVME_FEAT_CAP_CHANGE,
 };
 
+#define NVME_EFFECTS_ADMIN_INITIALIZER \
+[NVME_ADM_CMD_DELETE_SQ]= NVME_EFFECTS_CSUPP,  \
+[NVME_ADM_CMD_CREATE_SQ]= NVME_EFFECTS_CSUPP,  \
+[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_EFFECTS_CSUPP,  \
+[NVME_ADM_CMD_DELETE_CQ]= NVME_EFFECTS_CSUPP,  \
+[NVME_ADM_CMD_CREATE_CQ]= NVME_EFFECTS_CSUPP,  \
+[NVME_ADM_CMD_IDENTIFY] = NVME_EFFECTS_CSUPP,  \
+[NVME_ADM_CMD_ABORT]= NVME_EFFECTS_CSUPP,  \
+[NVME_ADM_CMD_SET_FEATURES] = NVME_EFFECTS_CSUPP | \
+  NVME_EFFECTS_CCC |   \
+  NVME_EFFECTS_NIC |   \
+  NVME_EFFECTS_NCC,\
+[NVME_ADM_CMD_GET_FEATURES] = NVME_EFFECTS_CSUPP,  \
+[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_EFFECTS_CSUPP
+
+#define NVME_EFFECTS_NVM_INITIALIZER   \
+[NVME_CMD_FLUSH]= NVME_EFFECTS_CSUPP | \
+  NVME_EFFECTS_LBCC,   \
+[NVME_CMD_WRITE]= NVME_EFFECTS_CSUPP | \
+  NVME_EFFECTS_LBCC,   \
+[NVME_CMD_READ] = NVME_EFFECTS_CSUPP,  \
+[NVME_CMD_WRITE_ZEROES] = NVME_EFFECTS_CSUPP | \
+  NVME_EFFECTS_LBCC
+
+static const NvmeEffectsLog nvme_effects_admin_only = {
+.acs = {
+NVME_EFFECTS_ADMIN_INITIALIZER,
+},
+};
+
+static const NvmeEffectsLog nvme_effects = {
+.acs = {
+NVME_EFFECTS_ADMIN_INITIALIZER,
+},
+
+.iocs = {
+NVME_EFFECTS_NVM_INITIALIZER,
+},
+};
+
 static void nvme_process_sq(void *opaque);
 
 static uint16_t nvme_cid(NvmeRequest *req)
@@ -1382,6 +1423,36 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
 DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_effects_log(NvmeCtrl *n, uint32_t buf_len, uint64_t off,
+ NvmeRequest *req)
+{
+const NvmeEffectsLog *effects;
+
+uint32_t trans_len;
+
+if (off > sizeof(NvmeEffectsLog)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+switch (NVME_CC_CSS(n->bar.cc)) {
+case NVME_CC_CSS_ADMIN_ONLY:
+effects = &nvme_effects_admin_only;
+break;
+
+case NVME_CC_CSS_NVM:
+effects = &nvme_effects;
+break;
+
+default:
+return NVME_INTERNAL_DEV_ERROR | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(NvmeEffectsLog) - off, buf_len);
+
+return nvme_dma(n, (uint8_t *)effects + off, 

[PATCH v2 08/14] hw/block/nvme: support namespace types

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

Implement support for TP 4056 ("Namespace Types"). This adds the 'iocs'
(I/O Command Set) device parameter to the nvme-ns device.

Signed-off-by: Klaus Jensen 
---
 docs/specs/nvme.txt   |   3 +
 hw/block/nvme-ns.h|  14 ++-
 hw/block/nvme.h   |   3 +
 include/block/nvme.h  |  58 +--
 block/nvme.c  |   4 +-
 hw/block/nvme-ns.c|  29 +-
 hw/block/nvme.c   | 228 ++
 hw/block/trace-events |   6 +-
 8 files changed, 285 insertions(+), 60 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index 6d00ac064998..a13c7a5dbe86 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -12,6 +12,9 @@ nvme-ns Options
  namespace. It is specified in terms of a power of two. Only values between
  9 and 12 (both inclusive) are supported.
 
+  `iocs`; The "I/O Command Set" associated with the namespace. E.g. 0x0 for the
+ NVM Command Set (the default), or 0x2 for the Zoned Namespace Command Set.
+
   `pstate`; This parameter specifies another blockdev to be used for storing
  persistent state such as logical block allocation tracking. Adding this
  parameter enables various optional features of the device.
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 0ad83910dde9..d39fa79fa682 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -29,12 +29,14 @@ typedef struct NvmePstateHeader {
 int64_t  blk_len;
 
 uint8_t  lbads;
+uint8_t  iocs;
 
-uint8_t  rsvd17[4079];
+uint8_t  rsvd18[4078];
 } QEMU_PACKED NvmePstateHeader;
 
 typedef struct NvmeNamespaceParams {
 uint32_t nsid;
+uint8_t  iocs;
 uint8_t  lbads;
 } NvmeNamespaceParams;
 
@@ -43,7 +45,8 @@ typedef struct NvmeNamespace {
 BlockConfblkconf;
 int32_t  bootindex;
 int64_t  size;
-NvmeIdNs id_ns;
+uint8_t  iocs;
+void *id_ns[NVME_IOCS_MAX];
 
 struct {
 BlockBackend *blk;
@@ -70,9 +73,14 @@ static inline uint32_t nvme_nsid(NvmeNamespace *ns)
 return -1;
 }
 
+static inline NvmeIdNsNvm *nvme_ns_id_nvm(NvmeNamespace *ns)
+{
+return ns->id_ns[NVME_IOCS_NVM];
+}
+
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
 {
-NvmeIdNs *id_ns = &ns->id_ns;
+NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
return &id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
 }
 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index ccf52ac7bb82..f66ed9ab7eff 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -123,6 +123,7 @@ typedef struct NvmeFeatureVal {
 };
 uint32_tasync_config;
 uint32_tvwc;
+uint32_tiocsci;
 } NvmeFeatureVal;
 
 typedef struct NvmeCtrl {
@@ -150,6 +151,7 @@ typedef struct NvmeCtrl {
 uint64_ttimestamp_set_qemu_clock_ms;/* QEMU clock time */
 uint64_tstarttime_ms;
 uint16_ttemperature;
+uint64_tiocscs[512];
 
 HostMemoryBackend *pmrdev;
 
@@ -165,6 +167,7 @@ typedef struct NvmeCtrl {
 NvmeSQueue  admin_sq;
 NvmeCQueue  admin_cq;
 NvmeIdCtrl  id_ctrl;
+void*id_ctrl_iocss[NVME_IOCS_MAX];
 NvmeFeatureVal  features;
 } NvmeCtrl;
 
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 5a5e19f6bedc..792fccf8c81f 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -82,6 +82,11 @@ enum NvmeCapMask {
 #define NVME_CAP_SET_PMRS(cap, val) (cap |= (uint64_t)(val & CAP_PMR_MASK)\
 << CAP_PMR_SHIFT)
 
+enum NvmeCapCss {
+NVME_CAP_CSS_NVM = 1 << 0,
+NVME_CAP_CSS_CSI = 1 << 6,
+};
+
 enum NvmeCcShift {
 CC_EN_SHIFT = 0,
 CC_CSS_SHIFT= 4,
@@ -112,6 +117,7 @@ enum NvmeCcMask {
 
 enum NvmeCcCss {
 NVME_CC_CSS_NVM= 0x0,
+NVME_CC_CSS_ALL= 0x6,
 NVME_CC_CSS_ADMIN_ONLY = 0x7,
 };
 
@@ -383,6 +389,11 @@ enum NvmePmrmscMask {
 #define NVME_PMRMSC_SET_CBA(pmrmsc, val)   \
 (pmrmsc |= (uint64_t)(val & PMRMSC_CBA_MASK) << PMRMSC_CBA_SHIFT)
 
+enum NvmeCommandSet {
+NVME_IOCS_NVM = 0x0,
+NVME_IOCS_MAX = 0x1,
+};
+
 enum NvmeSglDescriptorType {
 NVME_SGL_DESCR_TYPE_DATA_BLOCK  = 0x0,
 NVME_SGL_DESCR_TYPE_BIT_BUCKET  = 0x1,
@@ -531,8 +542,13 @@ typedef struct QEMU_PACKED NvmeIdentify {
 uint64_trsvd2[2];
 uint64_tprp1;
 uint64_tprp2;
-uint32_tcns;
-uint32_trsvd11[5];
+uint8_t cns;
+uint8_t rsvd3;
+uint16_tcntid;
+uint16_tnvmsetid;
+uint8_t rsvd4;
+uint8_t csi;
+uint32_trsvd11[4];
 } NvmeIdentify;
 
 typedef struct QEMU_PACKED NvmeRwCmd {
@@ -624,8 +640,15 @@ typedef struct QEMU_PACKED NvmeAerResult {
 } NvmeAerResult;
 
 typedef struct QEMU_PACKED NvmeCqe {
-uint32_tresult;
-uint32_trsvd;
+union {
+struct {
+uint32_tdw0;
+uint32_tdw1;
+};
+
+uint64_t qw0;
+};
+
 uint16_tsq_head;
 

[PATCH v2 05/14] hw/block/nvme: consolidate read, write and write zeroes

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

Consolidate the read/write and write zeroes functions.
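
The consolidation leans on the NVMe opcode encoding: bits 1:0 of the
opcode indicate whether the command transfers data. A sketch of the idea
(the mask itself is defined in the diff below):

    /* Write (0x01) and Read (0x02) have non-zero transfer bits; Write
     * Zeroes (0x08) does not, so it skips MDTS and DPTR mapping. */
    if (req->cmd.opcode & NVME_CMD_OPCODE_DATA_TRANSFER_MASK) {
        /* enforce MDTS and map the data buffer */
    }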

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h   | 11 
 include/block/nvme.h  |  2 ++
 hw/block/nvme.c   | 65 +++
 hw/block/trace-events |  3 +-
 4 files changed, 37 insertions(+), 44 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a50..ccf52ac7bb82 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -36,6 +36,17 @@ typedef struct NvmeRequest {
 QTAILQ_ENTRY(NvmeRequest)entry;
 } NvmeRequest;
 
+static inline bool nvme_req_is_write(NvmeRequest *req)
+{
+switch (req->cmd.opcode) {
+case NVME_CMD_WRITE:
+case NVME_CMD_WRITE_ZEROES:
+return true;
+default:
+return false;
+}
+}
+
 static inline const char *nvme_adm_opc_str(uint8_t opc)
 {
 switch (opc) {
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 7a30cf285ae0..999b4f8ae0d4 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -438,6 +438,8 @@ typedef struct QEMU_PACKED NvmeCmd {
 uint32_tcdw15;
 } NvmeCmd;
 
+#define NVME_CMD_OPCODE_DATA_TRANSFER_MASK 0x3
+
 #define NVME_CMD_FLAGS_FUSE(flags) (flags & 0x3)
 #define NVME_CMD_FLAGS_PSDT(flags) ((flags >> 6) & 0x3)
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 27af2f0b38d5..795c7e7c529f 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -997,48 +997,19 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
 return nvme_do_aio(ns->blkconf.blk, 0, 0, req);
 }
 
-static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
 {
NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
 NvmeNamespace *ns = req->ns;
+
 uint64_t slba = le64_to_cpu(rw->slba);
 uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
-uint64_t offset = nvme_l2b(ns, slba);
-uint32_t count = nvme_l2b(ns, nlb);
+size_t len = nvme_l2b(ns, nlb);
+
 uint16_t status;
 
-trace_pci_nvme_write_zeroes(nvme_cid(req), nvme_nsid(ns), slba, nlb);
-
-status = nvme_check_bounds(n, ns, slba, nlb);
-if (status) {
-trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
-return status;
-}
-
-return nvme_do_aio(ns->blkconf.blk, offset, count, req);
-}
-
-static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
-{
-NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
-NvmeNamespace *ns = req->ns;
-uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
-uint64_t slba = le64_to_cpu(rw->slba);
-
-uint64_t data_size = nvme_l2b(ns, nlb);
-uint64_t data_offset = nvme_l2b(ns, slba);
-enum BlockAcctType acct = req->cmd.opcode == NVME_CMD_WRITE ?
-BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
-uint16_t status;
-
-trace_pci_nvme_rw(nvme_cid(req), nvme_io_opc_str(rw->opcode),
-  nvme_nsid(ns), nlb, data_size, slba);
-
-status = nvme_check_mdts(n, data_size);
-if (status) {
-trace_pci_nvme_err_mdts(nvme_cid(req), data_size);
-goto invalid;
-}
+trace_pci_nvme_rwz(nvme_cid(req), nvme_io_opc_str(rw->opcode),
+   nvme_nsid(ns), nlb, len, slba);
 
 status = nvme_check_bounds(n, ns, slba, nlb);
 if (status) {
@@ -1046,15 +1017,26 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
 goto invalid;
 }
 
-status = nvme_map_dptr(n, data_size, req);
-if (status) {
-goto invalid;
+if (req->cmd.opcode & NVME_CMD_OPCODE_DATA_TRANSFER_MASK) {
+status = nvme_check_mdts(n, len);
+if (status) {
+trace_pci_nvme_err_mdts(nvme_cid(req), len);
+goto invalid;
+}
+
+status = nvme_map_dptr(n, len, req);
+if (status) {
+goto invalid;
+}
 }
 
-return nvme_do_aio(ns->blkconf.blk, data_offset, data_size, req);
+return nvme_do_aio(ns->blkconf.blk, nvme_l2b(ns, slba), len, req);
 
 invalid:
-block_acct_invalid(blk_get_stats(ns->blkconf.blk), acct);
+block_acct_invalid(blk_get_stats(ns->blkconf.blk),
+   nvme_req_is_write(req) ? BLOCK_ACCT_WRITE :
+   BLOCK_ACCT_READ);
+
 return status;
 }
 
@@ -1082,10 +1064,9 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 case NVME_CMD_FLUSH:
 return nvme_flush(n, req);
 case NVME_CMD_WRITE_ZEROES:
-return nvme_write_zeroes(n, req);
 case NVME_CMD_WRITE:
 case NVME_CMD_READ:
-return nvme_rw(n, req);
+return nvme_rwz(n, req);
 default:
 trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
 return NVME_INVALID_OPCODE | NVME_DNR;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 9e7507c5abde..b18056c49836 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -40,9 +40,8 @@ pci_nvme_map_prp(uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2,
 pci_nvme_map_sgl(uint16_t cid, uint8_t typ, uint64_t len) "cid 

[PATCH v2 09/14] hw/block/nvme: add basic read/write for zoned namespaces

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

This adds basic read and write for zoned namespaces.

A zoned namespace is created by setting the iocs namespace parameter to
0x2 and specifying the zns.zcap parameter (zone capacity) in number of
logical blocks per zone. If a zone size (zns.zsze) is not specified, the
namespace device will set the zone size to be the next power of two and
fit in as many zones as possible on the underlying namespace blockdev.
This behavior is not required by the specification, but ensures that the
device can be initialized by the Linux kernel nvme driver, which
requires a power of two zone size.
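
The zone size default described above amounts to rounding the zone
capacity up to a power of two; a minimal sketch, assuming QEMU's
pow2ceil() from "qemu/host-utils.h" rather than the literal patch code:

    uint64_t zsze = ns->params.zns.zsze ?
        ns->params.zns.zsze : pow2ceil(ns->params.zns.zcap);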

If the namespace has an associated 'pstate' blockdev it will be used to
store the zone states persistently. Only zone state changes are
persisted, that is, zone write pointer updates are only persistent if
the zone is explicitly closed. On boot up, any zones that were in an
Opened state will be transitioned to Full.

Signed-off-by: Klaus Jensen 
---
 docs/specs/nvme.txt   |   7 ++
 hw/block/nvme-ns.h| 116 +++-
 include/block/nvme.h  |  59 ++-
 hw/block/nvme-ns.c| 184 +++-
 hw/block/nvme.c   | 239 +-
 hw/block/trace-events |   8 ++
 6 files changed, 604 insertions(+), 9 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index a13c7a5dbe86..b23e59dd3075 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -34,6 +34,13 @@ nvme-ns Options
 values (such as lbads) differ from those specified for the nvme-ns
  device an error is produced.
 
+  `zns.zcap`; If `iocs` is 0x2, this specifies the zone capacity. It is
+ specified in units of logical blocks.
+
+  `zns.zsze`; If `iocs` is 0x2, this specifies the zone size. It is specified
+ in units of the logical blocks. If not specified, the value depends on
+ zns.zcap; if the zone capacity is a power of two, the zone size will be
+ set to that, otherwise it will default to the next power of two.
 
 Reference Specifications
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index d39fa79fa682..e16f50dc4bb8 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -31,15 +31,34 @@ typedef struct NvmePstateHeader {
 uint8_t  lbads;
 uint8_t  iocs;
 
-uint8_t  rsvd18[4078];
+uint8_t  rsvd18[3054];
+
+/* offset 0xc00 */
+struct {
+uint64_t zcap;
+uint64_t zsze;
+} QEMU_PACKED zns;
+
+uint8_t  rsvd3088[1008];
 } QEMU_PACKED NvmePstateHeader;
 
 typedef struct NvmeNamespaceParams {
 uint32_t nsid;
 uint8_t  iocs;
 uint8_t  lbads;
+
+struct {
+uint64_t zcap;
+uint64_t zsze;
+} zns;
 } NvmeNamespaceParams;
 
+typedef struct NvmeZone {
+NvmeZoneDescriptor *zd;
+
+uint64_t wp_staging;
+} NvmeZone;
+
 typedef struct NvmeNamespace {
 DeviceState  parent_obj;
 BlockConfblkconf;
@@ -55,6 +74,10 @@ typedef struct NvmeNamespace {
 unsigned long *map;
 int64_t   offset;
 } utilization;
+
+struct {
+int64_t offset;
+} zns;
 } pstate;
 
 NvmeNamespaceParams params;
@@ -62,8 +85,20 @@ typedef struct NvmeNamespace {
 struct {
 uint32_t err_rec;
 } features;
+
+struct {
+int num_zones;
+
+NvmeZone   *zones;
+NvmeZoneDescriptor *zd;
+} zns;
 } NvmeNamespace;
 
+static inline bool nvme_ns_zoned(NvmeNamespace *ns)
+{
+return ns->iocs == NVME_IOCS_ZONED;
+}
+
 static inline uint32_t nvme_nsid(NvmeNamespace *ns)
 {
 if (ns) {
@@ -78,17 +113,39 @@ static inline NvmeIdNsNvm *nvme_ns_id_nvm(NvmeNamespace *ns)
 return ns->id_ns[NVME_IOCS_NVM];
 }
 
+static inline NvmeIdNsZns *nvme_ns_id_zoned(NvmeNamespace *ns)
+{
+return ns->id_ns[NVME_IOCS_ZONED];
+}
+
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
 {
 NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
return &id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
 }
 
+static inline NvmeLBAFE *nvme_ns_lbafe(NvmeNamespace *ns)
+{
+NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
+NvmeIdNsZns *id_ns_zns = nvme_ns_id_zoned(ns);
+return &id_ns_zns->lbafe[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
+}
+
 static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
 {
 return nvme_ns_lbaf(ns)->ds;
 }
 
+static inline uint64_t nvme_ns_zsze(NvmeNamespace *ns)
+{
+return nvme_ns_lbafe(ns)->zsze;
+}
+
+static inline uint64_t nvme_ns_zsze_bytes(NvmeNamespace *ns)
+{
+return nvme_ns_zsze(ns) << nvme_ns_lbads(ns);
+}
+
/* calculate the number of LBAs that the namespace can accommodate */
 static inline uint64_t nvme_ns_nlbas(NvmeNamespace *ns)
 {
@@ -101,12 +158,69 @@ static inline size_t nvme_l2b(NvmeNamespace *ns, uint64_t lba)
 return lba << nvme_ns_lbads(ns);
 }
 
+static inline int nvme_ns_zone_idx(NvmeNamespace *ns, uint64_t lba)
+{
+return lba / nvme_ns_zsze(ns);
+}
+
+static inline NvmeZone 

[PATCH v2 03/14] hw/block/nvme: make lba data size configurable

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

Allow the LBA data size (lbads) to be set between 9 and 12.

Signed-off-by: Klaus Jensen 
Acked-by: Keith Busch 
Reviewed-by: Maxim Levitsky 
Reviewed-by: Philippe Mathieu-Daudé 
---
 docs/specs/nvme.txt | 11 ++-
 hw/block/nvme-ns.h  |  1 +
 hw/block/nvme-ns.c  |  8 +++-
 hw/block/nvme.c |  1 +
 4 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index 56d393884e7a..438ca50d698c 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -1,7 +1,16 @@
 NVM Express Controller
 ==
 
-The nvme device (-device nvme) emulates an NVM Express Controller.
+The nvme device (-device nvme) emulates an NVM Express Controller. It is used
+together with nvme-ns devices (-device nvme-ns) which emulate an NVM Express
+Namespace.
+
+nvme-ns Options
+---
+
+  `lbads`; The "LBA Data Size (LBADS)" indicates the LBA data size used by the
+ namespace. It is specified in terms of a power of two. Only values between
+ 9 and 12 (both inclusive) are supported.
 
 
 Reference Specifications
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 83734f4606e1..78b0d1a00672 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -21,6 +21,7 @@
 
 typedef struct NvmeNamespaceParams {
 uint32_t nsid;
+uint8_t  lbads;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 2ba0263ddaca..576c7486f45b 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -36,7 +36,7 @@ static void nvme_ns_init(NvmeNamespace *ns)
 ns->id_ns.dlfeat = 0x9;
 }
 
-id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+id_ns->lbaf[0].ds = ns->params.lbads;
 
 id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
 
@@ -77,6 +77,11 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
 return -1;
 }
 
+if (ns->params.lbads < 9 || ns->params.lbads > 12) {
+error_setg(errp, "unsupported lbads (supported: 9-12)");
+return -1;
+}
+
 return 0;
 }
 
@@ -125,6 +130,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 static Property nvme_ns_props[] = {
 DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
 DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
+DEFINE_PROP_UINT8("lbads", NvmeNamespace, params.lbads, BDRV_SECTOR_BITS),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3cbc3c7b75b1..758f58c88026 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -2812,6 +2812,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 if (n->namespace.blkconf.blk) {
ns = &n->namespace;
 ns->params.nsid = 1;
+ns->params.lbads = BDRV_SECTOR_BITS;
 
 if (nvme_ns_setup(n, ns, errp)) {
 return;
-- 
2.28.0




[PATCH v2 06/14] hw/block/nvme: add support for dulbe and block utilization tracking

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

This adds support for reporting the Deallocated or Unwritten Logical
Block error (DULBE). This requires tracking the allocated/deallocated
status of all logical blocks.

Introduce a bitmap that does this. The bitmap is persisted on the new
'pstate' blockdev that is associated with a namespace. If no such drive
is attached, the controller will not indicate support for DULBE.
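
Conceptually, the DULBE check is a scan of that bitmap; a minimal sketch,
assuming QEMU's find_next_zero_bit() from "qemu/bitops.h" (the helper name
is hypothetical):

    /* Return NVME_DULB if any block in [slba, slba + nlb) is unwritten. */
    static uint16_t nvme_check_dulbe_sketch(NvmeNamespace *ns,
                                            uint64_t slba, uint32_t nlb)
    {
        unsigned long *map = ns->pstate.utilization.map;
        uint64_t end = slba + nlb;

        if (find_next_zero_bit(map, end, slba) < end) {
            return NVME_DULB;
        }

        return NVME_SUCCESS;
    }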

Signed-off-by: Klaus Jensen 
---
 docs/specs/nvme.txt   |  19 +
 hw/block/nvme-ns.h|  32 
 include/block/nvme.h  |   5 ++
 hw/block/nvme-ns.c| 186 ++
 hw/block/nvme.c   | 130 -
 hw/block/trace-events |   2 +
 6 files changed, 371 insertions(+), 3 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index 438ca50d698c..6d00ac064998 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -12,6 +12,25 @@ nvme-ns Options
  namespace. It is specified in terms of a power of two. Only values between
  9 and 12 (both inclusive) are supported.
 
+  `pstate`; This parameter specifies another blockdev to be used for storing
+ persistent state such as logical block allocation tracking. Adding this
+ parameter enables various optional features of the device.
+
+   -drive id=pstate,file=pstate.img,format=raw
+   -device nvme-ns,pstate=pstate,...
+
+ To reset (or initialize) state, the blockdev image should be of zero size:
+
+   qemu-img create -f raw pstate.img 0
+
+ The image will be initialized with a file format header and truncated to
+ the required size.
+
+ If the pstate given is of non-zero size, it will be assumed to contain
+ previous saved state and will be checked for consistency. If any stored
+ values (such as lbads) differs from those specified for the nvme-ns
+ device an error is produced.
+
 
 Reference Specifications
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 78b0d1a00672..0ad83910dde9 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -19,6 +19,20 @@
 #define NVME_NS(obj) \
 OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
 
+#define NVME_PSTATE_MAGIC ((0x00 << 24) | ('S' << 16) | ('P' << 8) | 'N')
+#define NVME_PSTATE_V1 1
+
+typedef struct NvmePstateHeader {
+uint32_t magic;
+uint32_t version;
+
+int64_t  blk_len;
+
+uint8_t  lbads;
+
+uint8_t  rsvd17[4079];
+} QEMU_PACKED NvmePstateHeader;
+
 typedef struct NvmeNamespaceParams {
 uint32_t nsid;
 uint8_t  lbads;
@@ -31,7 +45,20 @@ typedef struct NvmeNamespace {
 int64_t  size;
 NvmeIdNs id_ns;
 
+struct {
+BlockBackend *blk;
+
+struct {
+unsigned long *map;
+int64_t   offset;
+} utilization;
+} pstate;
+
 NvmeNamespaceParams params;
+
+struct {
+uint32_t err_rec;
+} features;
 } NvmeNamespace;
 
 static inline uint32_t nvme_nsid(NvmeNamespace *ns)
@@ -72,4 +99,9 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error 
**errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_flush(NvmeNamespace *ns);
 
+static inline void _nvme_ns_check_size(void)
+{
+QEMU_BUILD_BUG_ON(sizeof(NvmePstateHeader) != 4096);
+}
+
 #endif /* NVME_NS_H */
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 999b4f8ae0d4..abd49d371e63 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -683,6 +683,7 @@ enum NvmeStatusCodes {
 NVME_E2E_REF_ERROR  = 0x0284,
 NVME_CMP_FAILURE= 0x0285,
 NVME_ACCESS_DENIED  = 0x0286,
+NVME_DULB   = 0x0287,
 NVME_MORE   = 0x2000,
 NVME_DNR= 0x4000,
NVME_NO_COMPLETE= 0xffff,
@@ -898,6 +899,9 @@ enum NvmeIdCtrlLpa {
 #define NVME_AEC_NS_ATTR(aec)   ((aec >> 8) & 0x1)
 #define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
 
#define NVME_ERR_REC_TLER(err_rec)  (err_rec & 0xffff)
+#define NVME_ERR_REC_DULBE(err_rec) (err_rec & 0x1)
+
 enum NvmeFeatureIds {
 NVME_ARBITRATION= 0x1,
 NVME_POWER_MANAGEMENT   = 0x2,
@@ -1018,6 +1022,7 @@ enum NvmeNsIdentifierType {
 
 
 #define NVME_ID_NS_NSFEAT_THIN(nsfeat)  ((nsfeat & 0x1))
+#define NVME_ID_NS_NSFEAT_DULBE(nsfeat) ((nsfeat >> 2) & 0x1)
 #define NVME_ID_NS_FLBAS_EXTENDED(flbas)((flbas >> 4) & 0x1)
 #define NVME_ID_NS_FLBAS_INDEX(flbas)   ((flbas & 0xf))
 #define NVME_ID_NS_MC_SEPARATE(mc)  ((mc >> 1) & 0x1)
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 576c7486f45b..5e24b1a5dacd 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -25,9 +25,36 @@
 #include "hw/qdev-properties.h"
 #include "hw/qdev-core.h"
 
+#include "trace.h"
+
 #include "nvme.h"
 #include "nvme-ns.h"
 
+static int nvme_blk_truncate(BlockBackend *blk, size_t len, Error **errp)
+{
+int ret;
+uint64_t perm, shared_perm;
+
+blk_get_perm(blk, &perm, &shared_perm);
+
+
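
The read-path behavior this enables can be sketched as follows (the helper
shape is illustrative, not the exact patch code; the utilization map is the
bitmap introduced by this patch):

    static inline bool nvme_ns_range_allocated(NvmeNamespace *ns,
                                               uint64_t slba, uint32_t nlb)
    {
        /* fully allocated iff there is no zero bit in [slba, slba + nlb) */
        return find_next_zero_bit(ns->pstate.utilization.map,
                                  slba + nlb, slba) == slba + nlb;
    }

    /* in the read path, if the host enabled DULBE via the error
     * recovery feature, reading deallocated blocks must fail: */
    if (NVME_ERR_REC_DULBE(ns->features.err_rec) &&
        !nvme_ns_range_allocated(ns, slba, nlb)) {
        return NVME_DULB;
    }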

[PATCH v2 02/14] hw/block/nvme: add trace event for requests with non-zero status code

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

If a command results in a non-zero status code, trace it.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c   | 6 ++
 hw/block/trace-events | 1 +
 2 files changed, 7 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 84b6b516fa7b..3cbc3c7b75b1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -777,6 +777,12 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 assert(cq->cqid == req->sq->cqid);
 trace_pci_nvme_enqueue_req_completion(nvme_cid(req), cq->cqid,
   req->status);
+
+if (req->status) {
+trace_pci_nvme_err_req_status(nvme_cid(req), nvme_nsid(req->ns),
+  req->status, req->cmd.opcode);
+}
+
QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 5a239b80bf36..9e7507c5abde 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -89,6 +89,7 @@ pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
+pci_nvme_err_req_status(uint16_t cid, uint32_t nsid, uint16_t status, uint8_t 
opc) "cid %"PRIu16" nsid %"PRIu32" status 0x%"PRIx16" opc 0x%"PRIx8""
 pci_nvme_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
 pci_nvme_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
 pci_nvme_err_cfs(void) "controller fatal status"
-- 
2.28.0




[PATCH v2 04/14] hw/block/nvme: reject io commands if only admin command set selected

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

If the host sets CC.CSS to 111b, all commands submitted to I/O queues
should be completed with status Invalid Command Opcode.

Note that this is technically a v1.4 feature, but it does not hurt to
implement before we finally bump the reported version implemented.

Signed-off-by: Klaus Jensen 
---
 include/block/nvme.h | 5 +
 hw/block/nvme.c  | 4 
 2 files changed, 9 insertions(+)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 58647bcdad0b..7a30cf285ae0 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -110,6 +110,11 @@ enum NvmeCcMask {
 #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
 #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
 
+enum NvmeCcCss {
+NVME_CC_CSS_NVM= 0x0,
+NVME_CC_CSS_ADMIN_ONLY = 0x7,
+};
+
 enum NvmeCstsShift {
 CSTS_RDY_SHIFT  = 0,
 CSTS_CFS_SHIFT  = 1,
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 758f58c88026..27af2f0b38d5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1065,6 +1065,10 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),
   req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));
 
+if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {
+return NVME_INVALID_OPCODE | NVME_DNR;
+}
+
 if (!nvme_nsid_valid(n, nsid)) {
 return NVME_INVALID_NSID | NVME_DNR;
 }
-- 
2.28.0
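
For context, CC.CSS is how a host selects the command set when it enables
the controller; a host-side sketch (the register helpers and NVME_REG_CC
name are hypothetical; field positions per NVMe 1.4: CC.EN is bit 0,
CC.CSS is bits 6:4):

    uint32_t cc = nvme_reg_read32(ctrl, NVME_REG_CC);   /* hypothetical */

    cc &= ~(0x7 << 4);                  /* clear CC.CSS */
    cc |= NVME_CC_CSS_ADMIN_ONLY << 4;  /* 111b: Admin command set only */
    cc |= 0x1;                          /* CC.EN */

    nvme_reg_write32(ctrl, NVME_REG_CC, cc);            /* hypothetical */

Any command subsequently submitted to an I/O queue then completes with
Invalid Command Opcode, as implemented above.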




[PATCH v2 00/14] hw/block/nvme: zoned namespace command set

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

So, things escalated a bit.

I'm removing the old cover letter. I don't think we need it anymore - you all
know the reason I keep posting versions of this implementation ;)

Based-on: <20200922084533.1273962-1-...@irrelevant.dk>

Changes for v2
~~

  * "hw/block/nvme: add support for dulbe and block utilization tracking"
- Factor out pstate init/load into separate functions.

- Fixed a stupid off-by-1 bug that would trigger when resetting the
  last zone.

- I added a more formalized pstate file format that includes a
  header. This is pretty similar to what is done in Dmitry's series,
  but with fewer fields. The device parameters for nvme-ns are still
  the "authoritative" ones, so if any parameters that influence LBA
  size, number of zones, etc. do not match, an error indicating the
  discrepancy will be produced. IIRC, Dmitry's version does the
  same.

  It is set up such that newer versions can load pstate files from
  older versions. The file format header is not unlike a standard
  nvme datastructure with reserved areas. This means that when
  adding new command sets that require persistent state, it is not
  needed to bump the version number, unless the header has to change
  dramatically.  This is demonstrated when the zoned namespace
  command set support is added in "hw/block/nvme: add basic
  read/write for zoned namespaces".

  * "hw/block/nvme: add basic read/write for zoned namespaces"
- The device will now transition all opened zones to Closed on
  "normal shutdown". You can force the "transition to Full" behavior
  by killing QEMU from the monitor.

  * "hw/block/nvme: add the zone append command"
- Slightly reordered the logic so an LBA Out of Range error is
  returned before Invalid Field in Command for normal read/write
  commands.

  * "hw/block/nvme: support zone active excursions"
- Dropped. Optional and non-critical.

  * "hw/block/nvme: support reset/finish recommended limits"
- Dropped. Optional and non-critical.

Gollu Appalanaidu (1):
  hw/block/nvme: add commands supported and effects log page

Klaus Jensen (13):
  hw/block/nvme: add nsid to get/setfeat trace events
  hw/block/nvme: add trace event for requests with non-zero status code
  hw/block/nvme: make lba data size configurable
  hw/block/nvme: reject io commands if only admin command set selected
  hw/block/nvme: consolidate read, write and write zeroes
  hw/block/nvme: add support for dulbe and block utilization tracking
  hw/block/nvme: support namespace types
  hw/block/nvme: add basic read/write for zoned namespaces
  hw/block/nvme: add the zone management receive command
  hw/block/nvme: add the zone management send command
  hw/block/nvme: add the zone append command
  hw/block/nvme: track and enforce zone resources
  hw/block/nvme: allow open to close transitions by controller

 docs/specs/nvme.txt   |   47 +-
 hw/block/nvme-ns.h|  180 -
 hw/block/nvme.h   |   22 +
 include/block/nvme.h  |  230 +-
 block/nvme.c  |4 +-
 hw/block/nvme-ns.c|  456 +++-
 hw/block/nvme.c   | 1560 +++--
 hw/block/trace-events |   42 +-
 8 files changed, 2443 insertions(+), 98 deletions(-)

-- 
2.28.0
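
To make the consistency check described above concrete, the load path could
look roughly like the following sketch (function name, error wording and the
on-disk endianness are illustrative, not the exact patch code; the header
fields are those of NvmePstateHeader from the series):

    static int nvme_ns_pstate_load(NvmeNamespace *ns, Error **errp)
    {
        NvmePstateHeader header;
        int ret;

        /* read the 4 KiB header at the start of the pstate blockdev */
        ret = blk_pread(ns->pstate.blk, 0, &header, sizeof(header));
        if (ret < 0) {
            error_setg_errno(errp, -ret, "could not read pstate header");
            return ret;
        }

        if (le32_to_cpu(header.magic) != NVME_PSTATE_MAGIC) {
            error_setg(errp, "invalid pstate header magic");
            return -1;
        }

        if (le32_to_cpu(header.version) > NVME_PSTATE_V1) {
            error_setg(errp, "unsupported pstate version");
            return -1;
        }

        /* device parameters are authoritative; reject mismatching state */
        if (header.lbads != ns->params.lbads) {
            error_setg(errp, "lbads parameter inconsistent with pstate");
            return -1;
        }

        return 0;
    }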




[PATCH v2 01/14] hw/block/nvme: add nsid to get/setfeat trace events

2020-09-29 Thread Klaus Jensen
From: Klaus Jensen 

Include the namespace id in the pci_nvme_{get,set}feat trace events.

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.c   | 4 ++--
 hw/block/trace-events | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index da8344f196a8..84b6b516fa7b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1660,7 +1660,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest 
*req)
 [NVME_ARBITRATION] = NVME_ARB_AB_NOLIMIT,
 };
 
-trace_pci_nvme_getfeat(nvme_cid(req), fid, sel, dw11);
+trace_pci_nvme_getfeat(nvme_cid(req), nsid, fid, sel, dw11);
 
 if (!nvme_feature_support[fid]) {
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -1798,7 +1798,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest 
*req)
 uint8_t fid = NVME_GETSETFEAT_FID(dw10);
 uint8_t save = NVME_SETFEAT_SAVE(dw10);
 
-trace_pci_nvme_setfeat(nvme_cid(req), fid, save, dw11);
+trace_pci_nvme_setfeat(nvme_cid(req), nsid, fid, save, dw11);
 
 if (save) {
 return NVME_FID_NOT_SAVEABLE | NVME_DNR;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 446cca08e9ab..5a239b80bf36 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -53,8 +53,8 @@ pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_ns_descr_list(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_get_log(uint16_t cid, uint8_t lid, uint8_t lsp, uint8_t rae, uint32_t 
len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" lsp 0x%"PRIx8" rae 0x%"PRIx8" 
len %"PRIu32" off %"PRIu64""
-pci_nvme_getfeat(uint16_t cid, uint8_t fid, uint8_t sel, uint32_t cdw11) "cid 
%"PRIu16" fid 0x%"PRIx8" sel 0x%"PRIx8" cdw11 0x%"PRIx32""
-pci_nvme_setfeat(uint16_t cid, uint8_t fid, uint8_t save, uint32_t cdw11) "cid 
%"PRIu16" fid 0x%"PRIx8" save 0x%"PRIx8" cdw11 0x%"PRIx32""
+pci_nvme_getfeat(uint16_t cid, uint32_t nsid, uint8_t fid, uint8_t sel, 
uint32_t cdw11) "cid %"PRIu16" nsid 0x%"PRIx32" fid 0x%"PRIx8" sel 0x%"PRIx8" 
cdw11 0x%"PRIx32""
+pci_nvme_setfeat(uint16_t cid, uint32_t nsid, uint8_t fid, uint8_t save, 
uint32_t cdw11) "cid %"PRIu16" nsid 0x%"PRIx32" fid 0x%"PRIx8" save 0x%"PRIx8" 
cdw11 0x%"PRIx32""
 pci_nvme_getfeat_vwcache(const char* result) "get feature volatile write 
cache, result=%s"
 pci_nvme_getfeat_numq(int result) "get feature number of queues, result=%d"
 pci_nvme_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested 
cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
-- 
2.28.0




Re: [PATCH v3 07/18] hw/block/nvme: add support for the get log page command

2020-09-29 Thread Keith Busch
On Wed, Sep 30, 2020 at 12:42:48AM +0200, Klaus Jensen wrote:
> On Sep 29 15:34, Keith Busch wrote:
> > Yeah, looks safe as-is, but we're missing out on returning the spec
> > required 'Invalid Field'.
> 
> I can't see where it says that we should do that? Invalid Field in
> Command if offset is *greater* than the size of the log page.
> 
> Some dynamic log pages have side-effects of being read, so while this is
> a super weird way of specifying that we want nothing returned, I think
> it is valid?

Eh, when spec says "size of the log page", I assume they're using the
"zeroes based" definition for size as aligned with the NUMD field. So
512 is bigger than the size of the smart log occupying bytes 0-511.

But I guess there's room to see it the other way, so maybe it is a
way to request a no data log.



Re: [PATCH v3 07/18] hw/block/nvme: add support for the get log page command

2020-09-29 Thread Klaus Jensen
On Sep 29 15:34, Keith Busch wrote:
> On Tue, Sep 29, 2020 at 11:46:00PM +0200, Klaus Jensen wrote:
> > On Sep 29 14:11, Peter Maydell wrote:
> > > > +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > > > buf_len,
> > > > + uint64_t off, NvmeRequest *req)
> > > > +{
> > > > +uint32_t trans_len;
> > > > +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
> > > > +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
> > > > +NvmeFwSlotInfoLog fw_log = {
> > > > +.afi = 0x1,
> > > > +};
> > > > +
> > > > +strpadcpy((char *)&fw_log.frs1, sizeof(fw_log.frs1), "1.0", ' ');
> > > > +
> > > > +if (off > sizeof(fw_log)) {
> > > > +return NVME_INVALID_FIELD | NVME_DNR;
> > > > +}
> > > > +
> > > > +trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > > > +
> > > > +return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, 
> > > > prp1,
> > > > + prp2);
> > > 
> > > Coverity warns about the same structure here (CID 1432411).
> > > 
> > > thanks
> > > -- PMM
> > 
> > Hi Peter,
> > 
> > Thanks. This is somewhere in the middle of a bunch of patches I got
> > merged I think, commit 94a7897c41db? I just requested Coverity access.
> > 
> > What happens is that nvme_dma_read_prp will call into nvme_map_prp which
> > wont map anything because len is 0. This will cause the statically
> > allocated QEMUSGList and QEMUIOVector in the request to be
> > uninitialized. Returning from nvme_map_prp, nvme_dma_read_prp will
> > notice that req->qsg.nsg is zero so it will default to the iov and move
> > into qemu_iovec_{to,from}_buf(&req->iov, ...). In there we actually pass
> > the NULL struct iovec, but since there is a __builtin_constant_p(bytes)
> > condition at the end of it all, we never follow it.
> > 
> > Not "serious" I think, but definitely not good. We will of course fix
> > this up.
> > 
> > @keith, do you agree with my analysis?
> 
> Yeah, looks safe as-is, but we're missing out on returning the spec
> required 'Invalid Field'.

I can't see where it says that we should do that? Invalid Field in
Command if offset is *greater* than the size of the log page.

Some dynamic log pages have side-effects of being read, so while this is
a super weird way of specifying that we want nothing returned, I think
it is valid?




Re: [PATCH v3 07/18] hw/block/nvme: add support for the get log page command

2020-09-29 Thread Keith Busch
On Tue, Sep 29, 2020 at 11:46:00PM +0200, Klaus Jensen wrote:
> On Sep 29 14:11, Peter Maydell wrote:
> > > +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > > buf_len,
> > > + uint64_t off, NvmeRequest *req)
> > > +{
> > > +uint32_t trans_len;
> > > +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
> > > +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
> > > +NvmeFwSlotInfoLog fw_log = {
> > > +.afi = 0x1,
> > > +};
> > > +
> > > +strpadcpy((char *)&fw_log.frs1, sizeof(fw_log.frs1), "1.0", ' ');
> > > +
> > > +if (off > sizeof(fw_log)) {
> > > +return NVME_INVALID_FIELD | NVME_DNR;
> > > +}
> > > +
> > > +trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > > +
> > > +return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, 
> > > prp1,
> > > + prp2);
> > 
> > Coverity warns about the same structure here (CID 1432411).
> > 
> > thanks
> > -- PMM
> 
> Hi Peter,
> 
> Thanks. This is somewhere in the middle of a bunch of patches I got
> merged I think, commit 94a7897c41db? I just requested Coverity access.
> 
> What happens is that nvme_dma_read_prp will call into nvme_map_prp which
> wont map anything because len is 0. This will cause the statically
> allocated QEMUSGList and QEMUIOVector in the request to be
> uninitialized. Returning from nvme_map_prp, nvme_dma_read_prp will
> notice that req->qsg.nsg is zero so it will default to the iov and move
> into qemu_iovec_{to,from}_buf(&req->iov, ...). In there we actually pass
> the NULL struct iovec, but since there is a __builtin_constant_p(bytes)
> condition at the end of it all, we never follow it.
> 
> Not "serious" I think, but definitely not good. We will of course fix
> this up.
> 
> @keith, do you agree with my analysis?

Yeah, looks safe as-is, but we're missing out on returning the spec
required 'Invalid Field'.



Re: [PATCH v3 07/18] hw/block/nvme: add support for the get log page command

2020-09-29 Thread Klaus Jensen
On Sep 29 14:11, Peter Maydell wrote:
> On Mon, 6 Jul 2020 at 07:15, Klaus Jensen  wrote:
> >
> > From: Klaus Jensen 
> >
> > Add support for the Get Log Page command and basic implementations of
> > the mandatory Error Information, SMART / Health Information and Firmware
> > Slot Information log pages.
> >
> > In violation of the specification, the SMART / Health Information log
> > page does not persist information over the lifetime of the controller
> > because the device has no place to store such persistent state.
> >
> > Note that the LPA field in the Identify Controller data structure
> > intentionally has bit 0 cleared because there is no namespace specific
> > information in the SMART / Health information log page.
> >
> > Required for compliance with NVMe revision 1.3d. See NVM Express 1.3d,
> > Section 5.14 ("Get Log Page command").
> 
> Hi; Coverity reports a potential issue in this code
> (CID 1432413):
> 
> > +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > buf_len,
> > +uint64_t off, NvmeRequest *req)
> > +{
> > +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
> > +uint32_t nsid = le32_to_cpu(cmd->nsid);
> > +
> > +uint32_t trans_len;
> > +time_t current_ms;
> > +uint64_t units_read = 0, units_written = 0;
> > +uint64_t read_commands = 0, write_commands = 0;
> > +NvmeSmartLog smart;
> > +BlockAcctStats *s;
> > +
> > +if (nsid && nsid != 0xffffffff) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +s = blk_get_stats(n->conf.blk);
> > +
> > +units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > +units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > +read_commands = s->nr_ops[BLOCK_ACCT_READ];
> > +write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> > +
> > +if (off > sizeof(smart)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> 
> Here we check for off > sizeof(smart), which means that we allow
> off == sizeof(smart)...
> 
> > +
> > +trans_len = MIN(sizeof(smart) - off, buf_len);
> 
> > +return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > + prp2);
> 
> ...in which case the pointer we pass to nvme_dma_read_prp() will
> be off the end of the 'smart' object.
> 
> Now we are passing 0 as the trans_len, so I *think* this function
> will not actually read the buffer (Coverity is not smart
> enough to see this); so I could just close the Coverity issue as
> a false-positive. But maybe there is a clearer-to-humans as well
> as clearer-to-Coverity way to write this. What do you think ?
> 
> > +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > buf_len,
> > + uint64_t off, NvmeRequest *req)
> > +{
> > +uint32_t trans_len;
> > +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
> > +NvmeFwSlotInfoLog fw_log = {
> > +.afi = 0x1,
> > +};
> > +
> > +strpadcpy((char *)&fw_log.frs1, sizeof(fw_log.frs1), "1.0", ' ');
> > +
> > +if (off > sizeof(fw_log)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > +
> > +return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > + prp2);
> 
> Coverity warns about the same structure here (CID 1432411).
> 
> thanks
> -- PMM

Hi Peter,

Thanks. This is somewhere in the middle of a bunch of patches I got
merged I think, commit 94a7897c41db? I just requested Coverity access.

What happens is that nvme_dma_read_prp will call into nvme_map_prp which
wont map anything because len is 0. This will cause the statically
allocated QEMUSGList and QEMUIOVector in the request to be
uninitialized. Returning from nvme_map_prp, nvme_dma_read_prp will
notice that req->qsg.nsg is zero so it will default to the iov and move
into qemu_iovec_{to,from}_buf(&req->iov, ...). In there we actually pass
the NULL struct iovec, but since there is a __builtin_constant_p(bytes)
condition at the end of it all, we never follow it.

Not "serious" I think, but definitely not good. We will of course fix
this up.

@keith, do you agree with my analysis?
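
For what it's worth, a clearer-to-humans (and to Coverity) shape for these
functions might be an explicit early return for the zero-length case, shown
here against the smart log variant as a sketch; whether off == sizeof(smart)
should instead be rejected is exactly the spec question discussed above:

    if (off > sizeof(smart)) {
        return NVME_INVALID_FIELD | NVME_DNR;
    }

    trans_len = MIN(sizeof(smart) - off, buf_len);
    if (!trans_len) {
        /* nothing to transfer; never form the one-past-the-end pointer */
        return NVME_SUCCESS;
    }

    return nvme_dma_read_prp(n, (uint8_t *)&smart + off, trans_len, prp1,
                             prp2);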




RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Matias Bjorling


> -Original Message-
> From: Klaus Jensen 
> Sent: Tuesday, 29 September 2020 20.00
> To: Keith Busch 
> Cc: Damien Le Moal ; Fam Zheng
> ; Kevin Wolf ; qemu-
> bl...@nongnu.org; Niklas Cassel ; Klaus Jensen
> ; qemu-de...@nongnu.org; Alistair Francis
> ; Philippe Mathieu-Daudé ;
> Matias Bjorling 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and
> Zoned Namespace Command Set
> 
> On Sep 29 10:29, Keith Busch wrote:
> > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > It is unmistakably clear that you are invalidating my arguments
> > > about portability and endianness issues by suggesting that we just
> > > remove persistent state and deal with it later, but persistence is
> > > the killer feature that sets the QEMU emulated device apart from
> > > other emulation options. It is not about using emulation in
> > > production (because yeah, why would you?), but persistence is what
> > > makes it possible to develop and test "zoned FTLs" or something that
> requires recovery at power up.
> > > This is what allows testing of how your host software deals with
> > > opened zones being transitioned to FULL on power up and the
> > > persistent tracking of LBA allocation (in my series) can be used to
> > > properly test error recovery if you lost state in the app.
> >
> > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > spec suggests it should be CLOSED. The spec does appear to support
> > going to FULL on a NVM Subsystem Reset, though. Actually, now that I'm
> > looking at this part of the spec, these implicit transitions seem a
> > bit less clear than I expected. I'm not sure it's clear enough to
> > evaluate qemu's compliance right now.
> >
> > But I don't see what testing these transitions has to do with having a
> > persistent state. You can reboot your VM without tearing down the
> > running QEMU instance. You can also unbind the driver or shutdown the
> > controller within the running operating system. That should make those
> > implicit state transitions reachable in order to exercise your FTL's
> > recovery.
> >
> 
> Oh dear - don't "spec" with me ;)
> 
> NVMe v1.4 Section 7.3.1:
> 
> An NVM Subsystem Reset is initiated when:
>   * Main power is applied to the NVM subsystem;
>   * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> field;
>   * Requested using a method defined in the NVMe Management
> Interface specification; or
>   * A vendor specific event occurs.
> 
> In the context of QEMU, "Main power" is tearing down QEMU and starting it
> from scratch. Just like on a "real" host, unbinding the driver, rebooting or
> shutting down the controller does not cause a subsystem reset (and does not
> cause the zones to change state). And since the device does not indicate
> support for the optional NSSR.NSSRC register, that way to initiate a subsystem
> cannot be used.
> 
> The reason for moving to FULL is that write pointer updates are not persisted
> on each advancement, only when the zone state changes. So zones that were
> opened might have valid data, but invalid write pointer.
> So the device transitions them to FULL as it is allowed to.
> 

How about when one must also recover from intermediate states (i.e., 
open/closed upon power loss)? For example, I would hope a real SSD 
implementation does not transition zones to full when it has thousands of 
zones open simultaneously. That could be a disaster for the PE cycles, and a 
lot of media going to waste. One would want applications to support that kind 
of failure mode as well. 


RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Matias Bjorling


> -Original Message-
> From: Klaus Jensen 
> Sent: Tuesday, 29 September 2020 20.36
> To: Matias Bjorling 
> Cc: Keith Busch ; Damien Le Moal
> ; Fam Zheng ; Kevin Wolf
> ; qemu-block@nongnu.org; Niklas Cassel
> ; Klaus Jensen ; qemu-
> de...@nongnu.org; Alistair Francis ; Philippe
> Mathieu-Daudé 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and
> Zoned Namespace Command Set
> 
> On Sep 29 18:17, Matias Bjorling wrote:
> >
> >
> > > -Original Message-
> > > From: Klaus Jensen 
> > > Sent: Tuesday, 29 September 2020 20.00
> > > To: Keith Busch 
> > > Cc: Damien Le Moal ; Fam Zheng
> > > ; Kevin Wolf ; qemu-
> > > bl...@nongnu.org; Niklas Cassel ; Klaus
> > > Jensen ; qemu-de...@nongnu.org; Alistair
> > > Francis ; Philippe Mathieu-Daudé
> > > ; Matias Bjorling 
> > > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> > > and Zoned Namespace Command Set
> > >
> > > On Sep 29 10:29, Keith Busch wrote:
> > > > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > > > It is unmistakably clear that you are invalidating my arguments
> > > > > about portability and endianness issues by suggesting that we
> > > > > just remove persistent state and deal with it later, but
> > > > > persistence is the killer feature that sets the QEMU emulated
> > > > > device apart from other emulation options. It is not about using
> > > > > emulation in production (because yeah, why would you?), but
> > > > > persistence is what makes it possible to develop and test "zoned
> > > > > FTLs" or something that
> > > requires recovery at power up.
> > > > > This is what allows testing of how your host software deals with
> > > > > opened zones being transitioned to FULL on power up and the
> > > > > persistent tracking of LBA allocation (in my series) can be used
> > > > > to properly test error recovery if you lost state in the app.
> > > >
> > > > Hold up -- why does an OPEN zone transition to FULL on power up?
> > > > The spec suggests it should be CLOSED. The spec does appear to
> > > > support going to FULL on a NVM Subsystem Reset, though. Actually,
> > > > now that I'm looking at this part of the spec, these implicit
> > > > transitions seem a bit less clear than I expected. I'm not sure
> > > > it's clear enough to evaluate qemu's compliance right now.
> > > >
> > > > But I don't see what testing these transitions has to do with
> > > > having a persistent state. You can reboot your VM without tearing
> > > > down the running QEMU instance. You can also unbind the driver or
> > > > shutdown the controller within the running operating system. That
> > > > should make those implicit state transitions reachable in order to
> > > > exercise your FTL's recovery.
> > > >
> > >
> > > Oh dear - don't "spec" with me ;)
> > >
> > > NVMe v1.4 Section 7.3.1:
> > >
> > > An NVM Subsystem Reset is initiated when:
> > >   * Main power is applied to the NVM subsystem;
> > >   * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> > > field;
> > >   * Requested using a method defined in the NVMe Management
> > > Interface specification; or
> > >   * A vendor specific event occurs.
> > >
> > > In the context of QEMU, "Main power" is tearing down QEMU and
> > > starting it from scratch. Just like on a "real" host, unbinding the
> > > driver, rebooting or shutting down the controller does not cause a
> > > subsystem reset (and does not cause the zones to change state). And
> > > since the device does not indicate support for the optional
> > > NSSR.NSSRC register, that way to initiate a subsystem cannot be used.
> > >
> > > The reason for moving to FULL is that write pointer updates are not
> > > persisted on each advancement, only when the zone state changes. So
> > > zones that were opened might have valid data, but invalid write pointer.
> > > So the device transitions them to FULL as it is allowed to.
> > >
> >
> > How about when one must also recover from intermediate states (i.e.,
> > open/closed upon power loss)? For example, I would hope a real SSD
> > implementation does not transition zones to full when it has thousands of
> > zones open simultaneously. That could be a disaster for the PE cycles, and
> > a lot of media going to waste. One would want applications to support that
> > kind of failure mode as well.
> 
> Christ. The WDC Strike Force is really jumping out of lightspeed here.
> I'm afraid I don't have an opposing force to engage with. So I'll be your only
> boxing bag for the evening.
> 
> As Keith just said, "Opened" is not a valid initial state. Didn't you write the
> spec? ;) As for Closed, they will be brought up as is.

Upon power failure, a zone in the Explicitly Opened state or the Implicitly 
Opened state that has LBAs written can be transitioned by the controller to 
either the Full or Closed state.

In the previous mail, I wanted to point out that if the intention of qemu was 
to test applications upon power 

Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Keith Busch
All,

Let's de-escalate this, please. There's no reason to doubt Klaus wants
to see this to work well, just as everyone else does. We unfortunately
have conflicting proposals posted, and everyone is passionate enough
about their work, but please simmer down.

As I mentioned earlier, I'd like to refocus on the basic implementation
and save the persistent state discussion once the core is solid. After
going through it all, I feel there's enough to discuss there to keep us
busy for little while longer. Additional comments on the code will be
coming from me later today.



RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Dmitry Fomichev



> -Original Message-
> From: Keith Busch 
> Sent: Tuesday, September 29, 2020 3:22 PM
> To: Klaus Jensen 
> Cc: Dmitry Fomichev ; Kevin Wolf
> ; Fam Zheng ; Damien Le Moal
> ; qemu-block@nongnu.org; Niklas Cassel
> ; Klaus Jensen ; qemu-
> de...@nongnu.org; Alistair Francis ; Philippe
> Mathieu-Daudé ; Matias Bjorling
> 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> All,
> 
> Let's de-escalate this, please. There's no reason to doubt Klaus wants
> to see this to work well, just as everyone else does. We unfortunately
> have conflicting proposals posted, and everyone is passionate enough
> about their work, but please simmer down.
> 
> As I mentioned earlier, I'd like to refocus on the basic implementation
> and save the persistent state discussion once the core is solid. After
> going through it all, I feel there's enough to discuss there to keep us
> busy for little while longer. Additional comments on the code will be
> coming from me later today.

OK, I agree with this and I will not be replying to the email prior to this
one in the thread. Let's calm down so we will be able to have a beer at a
conference one day :)

The one thing that I would like to cover is the lack of response to Klaus'
ZNS patchset. Klaus, you are right to complain about it. Since discovering
the large backlog of NVMe patches that you had pending
(something that we were not aware of at the time of publishing our patches),
we made the decision to rebase our series on top of the patches that you
had posted before the publication of the WDC ZNS patchset. Since then,
I got caught in the constant cycle of rebasing our patches on top of your
series and that prevented me from doing much in terms of reviewing of
your commits. Now, once we seem to catch up with the current head of
development, I should be able to do more of this. There is absolutely no
ill will involved :)

Dmitry



Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 29 15:42, Dmitry Fomichev wrote:
> > -Original Message-
> > From: Klaus Jensen 
> > Sent: Monday, September 28, 2020 2:37 AM
> > To: Dmitry Fomichev 
> > Cc: Keith Busch ; Damien Le Moal
> > ; Klaus Jensen ; Kevin
> > Wolf ; Philippe Mathieu-Daudé ;
> > Maxim Levitsky ; Fam Zheng ;
> > Niklas Cassel ; qemu-block@nongnu.org; qemu-
> > de...@nongnu.org; Alistair Francis ; Matias
> > Bjorling 
> > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> > and Zoned Namespace Command Set
> > 
> > On Sep 28 02:33, Dmitry Fomichev wrote:
> > > > -Original Message-
> > > > From: Klaus Jensen 
> > > >
> > > > If it really needs to be memory mapped, then I think a hostmem-based
> > > > approach similar to what Andrzej did for PMR is needed (I think that
> > > > will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> > > > slightly tricky to get it to work on all platforms AFAIK).
> > >
> > > Ok, it looks that using the HostMemoryBackendFile backend will be
> > > more appropriate. This will remove the need for conditional compile.
> > >
> > > The mmap() portability is pretty decent across software platforms.
> > > Any poor Windows user who is forced to emulate ZNS on mingw will be
> > > able to do so, just without having zone state persistency. Considering
> > > how specialized this stuff is in first place, I estimate the number of 
> > > users
> > > affected by this "limitation" to be exactly zero.
> > >
> > 
> > QEMU is a cross platform project - we should strive for portability.
> > 
> > Alienating developers that use a Windows platform and calling them out
> > as "poor" is not exactly good for the zoned ecosystem.
> > 
> 
> Wow. By bringing up political correctness here you are basically admitting
> the fact that you have no real technical argument here.

I prefer that we support all platforms if and when we can. That's a
technical argument, not a personal one like those you are using now.

> The whole Windows issue is red herring that you are using to attack
> the code that is absolutely legit, but comes from a competitor.

I can't even...

> Your initial complaint was that it doesn't compile in mingw and that
> it uses "wrong" API. You have even suggested the API to use. Now, the
> code uses that API and builds fine, but now it's still not good simply
> because you "do not like it". It's a disgrace.
> 

I answered this in a previous reply.

> > > > But really,
> > > > since we do not require memory semantics for this, then I think the
> > > > abstraction is fundamentally wrong.
> > > >
> > >
> > > Seriously, what is wrong with using mmap :) ? It is used successfully for
> > > similar applications, for example -
> > > https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c
> > >
> > 
> > There is nothing fundamentally wrong with mmap. I just think it is the
> > wrong abstraction here (and it limits portability for no good reason).
> > For PMR there is a good reason - it requires memory semantics.
> > 
> 
> We are trying to emulate NVMe controller NVRAM.  The best abstraction
> for emulating NVRAM would be... NVRAM!
> 

You never brought that up before and sure it could be a fair argument,
except it is not true.

PMR is emulating NVRAM (and requires memory semantics). Persistent state
is not emulating anything. It is an implementation detail.

> > > > I am, of course, blowing my own horn, since my implementation uses a
> > > > portable blockdev for this.
> > > >
> > >
> > > You are making it sound like the entire WDC series relies on this 
> > > approach.
> > > Actually, the persistency is introduced in the second to last patch in the
> > > series and it only adds a couple of lines of code in the i/o path to mark
> > > zones dirty. This is possible because of using mmap() and I find the way
> > > it is done to be quite elegant, not ugly :)
> > >
> > 
> > No, I understand that your implementation works fine without
> > persistance, but persistance is key. That is why my series adds it in
> > the first patch. Without persistence it is just a toy. And the QEMU
> > device is not just an "NVMe-version" of null_blk.
> > 
> > And I don't think I ever called the use of mmap ugly. I called out the
> > physical memory API shenanigans as a hack.
> > 
> > > > Another issue is the complete lack of endian conversions. Does it
> > > > matter? It depends. Will anyone ever use this on a big endian host and
> > > > move the meta data backing file to a little endian host? Probably not.
> > > > So does it really matter? Probably not, but it is cutting corners.
> > > >
> > 
> > After I had replied this, I considered a follow-up, because there are
> > probably QEMU developers that would call me out on this.
> > 
> > This definitely DOES matter to QEMU.
> > 
> > >
> > > Great point on endianness! Naturally, all file backed values are stored in
> > > their native endianness. This way, there is no extra overhead on big 
> > > endian
> > > hardware architectures. 

Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 29 18:17, Matias Bjorling wrote:
> 
> 
> > -Original Message-
> > From: Klaus Jensen 
> > Sent: Tuesday, 29 September 2020 20.00
> > To: Keith Busch 
> > Cc: Damien Le Moal ; Fam Zheng
> > ; Kevin Wolf ; qemu-
> > bl...@nongnu.org; Niklas Cassel ; Klaus Jensen
> > ; qemu-de...@nongnu.org; Alistair Francis
> > ; Philippe Mathieu-Daudé ;
> > Matias Bjorling 
> > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and
> > Zoned Namespace Command Set
> > 
> > On Sep 29 10:29, Keith Busch wrote:
> > > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > > It is unmistakably clear that you are invalidating my arguments
> > > > about portability and endianness issues by suggesting that we just
> > > > remove persistent state and deal with it later, but persistence is
> > > > the killer feature that sets the QEMU emulated device apart from
> > > > other emulation options. It is not about using emulation in
> > > > production (because yeah, why would you?), but persistence is what
> > > > makes it possible to develop and test "zoned FTLs" or something that
> > requires recovery at power up.
> > > > This is what allows testing of how your host software deals with
> > > > opened zones being transitioned to FULL on power up and the
> > > > persistent tracking of LBA allocation (in my series) can be used to
> > > > properly test error recovery if you lost state in the app.
> > >
> > > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > > spec suggests it should be CLOSED. The spec does appear to support
> > > going to FULL on a NVM Subsystem Reset, though. Actually, now that I'm
> > > looking at this part of the spec, these implicit transitions seem a
> > > bit less clear than I expected. I'm not sure it's clear enough to
> > > evaluate qemu's compliance right now.
> > >
> > > But I don't see what testing these transitions has to do with having a
> > > persistent state. You can reboot your VM without tearing down the
> > > running QEMU instance. You can also unbind the driver or shutdown the
> > > controller within the running operating system. That should make those
> > > implicit state transitions reachable in order to exercise your FTL's
> > > recovery.
> > >
> > 
> > Oh dear - don't "spec" with me ;)
> > 
> > NVMe v1.4 Section 7.3.1:
> > 
> > An NVM Subsystem Reset is initiated when:
> >   * Main power is applied to the NVM subsystem;
> >   * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> > field;
> >   * Requested using a method defined in the NVMe Management
> > Interface specification; or
> >   * A vendor specific event occurs.
> > 
> > In the context of QEMU, "Main power" is tearing down QEMU and starting it
> > from scratch. Just like on a "real" host, unbinding the driver, rebooting or
> > shutting down the controller does not cause a subsystem reset (and does not
> > cause the zones to change state). And since the device does not indicate
> > support for the optional NSSR.NSSRC register, that way to initiate a 
> > subsystem
> > cannot be used.
> > 
> > The reason for moving to FULL is that write pointer updates are not 
> > persisted
> > on each advancement, only when the zone state changes. So zones that were
> > opened might have valid data, but invalid write pointer.
> > So the device transitions them to FULL as it is allowed to.
> > 
> 
> How about when one must also recover from intermediate states (i.e.,
> open/closed upon power loss)? For example, I would hope a real SSD
> implementation does not transition zones to full when it has thousands of
> zones open simultaneously. That could be a disaster for the PE cycles, and
> a lot of media going to waste. One would want applications to support that
> kind of failure mode as well.

Christ. The WDC Strike Force is really jumping out of lightspeed here.
I'm afraid I don't have an opposing force to engage with. So I'll be
your only boxing bag for the evening.

As Keith just said, "Opened" is not a valid initial state. Didn't you
write the spec? ;) As for Closed, they will be brought up as is.

With that in mind, I'm not sure what you specifically refer to? I'll
gently remind you that the QEMU nvme device is not a real SSD and does
not deal with NAND so it does not really do any "recovering" of
intermediate states on power on if that is what you refer to?




Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 29 11:15, Keith Busch wrote:
> On Tue, Sep 29, 2020 at 08:00:04PM +0200, Klaus Jensen wrote:
> > On Sep 29 10:29, Keith Busch wrote:
> > > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > > It is unmistakably clear that you are invalidating my arguments about
> > > > portability and endianness issues by suggesting that we just remove
> > > > persistent state and deal with it later, but persistence is the killer
> > > > feature that sets the QEMU emulated device apart from other emulation
> > > > options. It is not about using emulation in production (because yeah,
> > > > why would you?), but persistence is what makes it possible to develop
> > > > and test "zoned FTLs" or something that requires recovery at power up.
> > > > This is what allows testing of how your host software deals with opened
> > > > zones being transitioned to FULL on power up and the persistent tracking
> > > > of LBA allocation (in my series) can be used to properly test error
> > > > recovery if you lost state in the app.
> > > 
> > > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > > spec suggests it should be CLOSED. The spec does appear to support going
> > > to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
> > > at this part of the spec, these implicit transitions seem a bit less
> > > clear than I expected. I'm not sure it's clear enough to evaluate qemu's
> > > compliance right now.
> > > 
> > > But I don't see what testing these transitions has to do with having a
> > > persistent state. You can reboot your VM without tearing down the
> > > running QEMU instance. You can also unbind the driver or shutdown the
> > > controller within the running operating system. That should make those
> > > implicit state transitions reachable in order to exercise your FTL's
> > > recovery.
> > > 
> > 
> > Oh dear - don't "spec" with me ;)
> > 
> > NVMe v1.4 Section 7.3.1:
> > 
> > An NVM Subsystem Reset is initiated when:
> >   * Main power is applied to the NVM subsystem;
> >   * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> > field;
> >   * Requested using a method defined in the NVMe Management
> > Interface specification; or
> >   * A vendor specific event occurs.
>  
> Okay. I wish the nvme twg would strip the changelog from the published
> TPs. We have unhelpful statements like this in the ZNS spec:
> 
>   "Default active zones to transition to Closed state on power/controller 
> reset."
> 
> > In the context of QEMU, "Main power" is tearing down QEMU and starting
> > it from scratch. Just like on a "real" host, unbinding the driver,
> > rebooting or shutting down the controller does not cause a subsystem
> > reset (and does not cause the zones to change state). 
> 
> That can't be right. The ZNS spec says:
> 
>   The initial state of a zone state machine is set as a result of:
> a) an NVM Subsystem Reset; or
    b) all controllers in the NVM subsystem reporting Shutdown
       processing complete (i.e., 10b in the Shutdown Status (SHST)
       register)
> 
> So a CC.SHN had better cause an implicit transition of open zones to
> their "initial" state since 'open' is not a valid initial state.

Oh snap; true, you got me there.




Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Keith Busch
On Tue, Sep 29, 2020 at 08:00:04PM +0200, Klaus Jensen wrote:
> On Sep 29 10:29, Keith Busch wrote:
> > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > It is unmistakably clear that you are invalidating my arguments about
> > > portability and endianness issues by suggesting that we just remove
> > > persistent state and deal with it later, but persistence is the killer
> > > feature that sets the QEMU emulated device apart from other emulation
> > > options. It is not about using emulation in production (because yeah,
> > > why would you?), but persistence is what makes it possible to develop
> > > and test "zoned FTLs" or something that requires recovery at power up.
> > > This is what allows testing of how your host software deals with opened
> > > zones being transitioned to FULL on power up and the persistent tracking
> > > of LBA allocation (in my series) can be used to properly test error
> > > recovery if you lost state in the app.
> > 
> > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > spec suggests it should be CLOSED. The spec does appear to support going
> > to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
> > at this part of the spec, these implicit transitions seem a bit less
> > clear than I expected. I'm not sure it's clear enough to evaluate qemu's
> > compliance right now.
> > 
> > But I don't see what testing these transitions has to do with having a
> > persistent state. You can reboot your VM without tearing down the
> > running QEMU instance. You can also unbind the driver or shutdown the
> > controller within the running operating system. That should make those
> > implicit state transitions reachable in order to exercise your FTL's
> > recovery.
> > 
> 
> Oh dear - don't "spec" with me ;)
> 
> NVMe v1.4 Section 7.3.1:
> 
> An NVM Subsystem Reset is initiated when:
>   * Main power is applied to the NVM subsystem;
>   * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> field;
>   * Requested using a method defined in the NVMe Management
> Interface specification; or
>   * A vendor specific event occurs.
 
Okay. I wish the nvme twg would strip the changelog from the published
TPs. We have unhelpful statements like this in the ZNS spec:

  "Default active zones to transition to Closed state on power/controller 
reset."

> In the context of QEMU, "Main power" is tearing down QEMU and starting
> it from scratch. Just like on a "real" host, unbinding the driver,
> rebooting or shutting down the controller does not cause a subsystem
> reset (and does not cause the zones to change state). 

That can't be right. The ZNS spec says:

  The initial state of a zone state machine is set as a result of:
a) an NVM Subsystem Reset; or
    b) all controllers in the NVM subsystem reporting Shutdown
       processing complete (i.e., 10b in the Shutdown Status (SHST)
       register)

So a CC.SHN had better cause an implicit transition of open zones to
their "initial" state since 'open' is not a valid initial state.



Re: [PATCH] job: delete job_{lock, unlock} functions and replace them with lock guard

2020-09-29 Thread John Snow

On 9/29/20 9:42 AM, Elena Afanasova wrote:

Signed-off-by: Elena Afanasova 


Hi, can I have a commit message here, please?


---
  job.c | 46 +-
  1 file changed, 17 insertions(+), 29 deletions(-)

diff --git a/job.c b/job.c
index 8fecf38960..89ceb53434 100644
--- a/job.c
+++ b/job.c
@@ -79,16 +79,6 @@ struct JobTxn {
   * job_enter. */
  static QemuMutex job_mutex;
  
-static void job_lock(void)
-{
-qemu_mutex_lock(&job_mutex);
-}
-
-static void job_unlock(void)
-{
-qemu_mutex_unlock(&job_mutex);
-}
-
  static void __attribute__((__constructor__)) job_init(void)
  {
  qemu_mutex_init(&job_mutex);
@@ -437,21 +427,19 @@ void job_enter_cond(Job *job, bool(*fn)(Job *job))
  return;
  }
  
-job_lock();
-if (job->busy) {
-job_unlock();
-return;
-}
+WITH_QEMU_LOCK_GUARD(&job_mutex) {
+if (job->busy) {
+return;
+}
 
-if (fn && !fn(job)) {
-job_unlock();
-return;
-}
+if (fn && !fn(job)) {
+return;
+}
 
-assert(!job->deferred_to_main_loop);
-timer_del(&job->sleep_timer);
-job->busy = true;
-job_unlock();
+assert(!job->deferred_to_main_loop);
+timer_del(&job->sleep_timer);
+job->busy = true;
+}
  aio_co_enter(job->aio_context, job->co);
  }
  
@@ -468,13 +456,13 @@ void job_enter(Job *job)

   * called explicitly. */
  static void coroutine_fn job_do_yield(Job *job, uint64_t ns)
  {
-job_lock();
-if (ns != -1) {
-timer_mod(&job->sleep_timer, ns);
+WITH_QEMU_LOCK_GUARD(&job_mutex) {
+if (ns != -1) {
+timer_mod(&job->sleep_timer, ns);
+}
+job->busy = false;
+job_event_idle(job);


Is this new macro safe to use in a coroutine context?


  }
-job->busy = false;
-job_event_idle(job);
-job_unlock();
  qemu_coroutine_yield();
  
  /* Set by job_enter_cond() before re-entering the coroutine.  */




I haven't looked into WITH_QEMU_LOCK_GUARD before, I assume it's new. If 
it works like I think it does, this change seems good.


(I'm assuming it works like a Python context manager and it drops the 
lock when it leaves the scope of the macro using GCC/Clang language 
extensions.)
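
For reference, WITH_QEMU_LOCK_GUARD (include/qemu/lockable.h) can be
pictured as a scoped guard built on the GCC/Clang cleanup attribute; here
is a simplified stand-in (a sketch, not the actual macro, which works on
any QemuLockable):

    static inline void sketch_guard_cleanup(QemuMutex **m)
    {
        if (*m) {
            qemu_mutex_unlock(*m);
        }
    }

    /*
     * Run the body once with the mutex held. On normal completion the
     * third for-clause unlocks; on early return/break/goto the cleanup
     * handler unlocks when the guard variable leaves scope.
     */
    #define SKETCH_LOCK_GUARD(m)                                        \
        for (QemuMutex *guard_                                          \
                 __attribute__((cleanup(sketch_guard_cleanup)))         \
                 = (qemu_mutex_lock(m), (m));                           \
             guard_;                                                    \
             qemu_mutex_unlock(guard_), guard_ = NULL)

So yes, it behaves much like a Python context manager. As for coroutine
context: the macro itself does not yield, and the usual caveat is simply
not to yield while the mutex is held; the patch keeps the
qemu_coroutine_yield() call outside the guarded scope.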





Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 29 10:29, Keith Busch wrote:
> On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > It is unmistakably clear that you are invalidating my arguments about
> > portability and endianness issues by suggesting that we just remove
> > persistent state and deal with it later, but persistence is the killer
> > feature that sets the QEMU emulated device apart from other emulation
> > options. It is not about using emulation in production (because yeah,
> > why would you?), but persistence is what makes it possible to develop
> > and test "zoned FTLs" or something that requires recovery at power up.
> > This is what allows testing of how your host software deals with opened
> > zones being transitioned to FULL on power up and the persistent tracking
> > of LBA allocation (in my series) can be used to properly test error
> > recovery if you lost state in the app.
> 
> Hold up -- why does an OPEN zone transition to FULL on power up? The
> spec suggests it should be CLOSED. The spec does appear to support going
> to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
> at this part of the spec, these implicit transitions seem a bit less
> clear than I expected. I'm not sure it's clear enough to evaluate qemu's
> compliance right now.
> 
> But I don't see what testing these transitions has to do with having a
> persistent state. You can reboot your VM without tearing down the
> running QEMU instance. You can also unbind the driver or shutdown the
> controller within the running operating system. That should make those
> implicit state transitions reachable in order to exercise your FTL's
> recovery.
> 

Oh dear - don't "spec" with me ;)

NVMe v1.4 Section 7.3.1:

An NVM Subsystem Reset is initiated when:
  * Main power is applied to the NVM subsystem;
  * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
field;
  * Requested using a method defined in the NVMe Management
Interface specification; or
  * A vendor specific event occurs.

In the context of QEMU, "Main power" is tearing down QEMU and starting
it from scratch. Just like on a "real" host, unbinding the driver,
rebooting or shutting down the controller does not cause a subsystem
reset (and does not cause the zones to change state). And since the
device does not indicate support for the optional NSSR.NSSRC register,
that way to initiate a subsystem cannot be used.

The reason for moving to FULL is that write pointer updates are not
persisted on each advancement, only when the zone state changes. So
zones that were opened might have valid data, but invalid write pointer.
So the device transitions them to FULL as it is allowed to.

QED.

> I agree the persistent state provides conveniences for developers. I
> just don't want to gate ZNS enabling on it either since the core design
> doesn't depend on it.

I just don't see why we can't have the icing on the cake when it is
already there :)
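
Concretely, the start-up logic described here could be sketched like so
(nvme_zs/nvme_zs_set and the NVME_ZS_* names follow the zoned series; treat
this as an illustration rather than the exact patch code):

    static void nvme_zone_restore(NvmeZone *zone)
    {
        switch (nvme_zs(zone)) {
        case NVME_ZS_ZSIO: /* implicitly opened */
        case NVME_ZS_ZSEO: /* explicitly opened */
            /*
             * The write pointer is only persisted on zone state
             * changes, so it cannot be trusted for zones that were
             * open; transition them to Full, as the spec allows.
             */
            nvme_zs_set(zone, NVME_ZS_ZSF);
            break;
        default:
            /* Empty, Closed, Full, Read Only and Offline zones come
             * back up with their persisted state. */
            break;
        }
    }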




Re: [PATCH 4/4] qemu-storage-daemon: Remove QemuOpts from --object parser

2020-09-29 Thread Eric Blake
On 9/29/20 12:26 PM, Kevin Wolf wrote:
> The command line parser for --object parses the input twice: Once into
> QemuOpts just for detecting help options, and then again into a QDict
> using the keyval parser for actually creating the object.
> 
> Now that the keyval parser can also detect help options, we can simplify
> this and remove the QemuOpts part.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  storage-daemon/qemu-storage-daemon.c | 15 ---
>  1 file changed, 4 insertions(+), 11 deletions(-)
> 
> diff --git a/storage-daemon/qemu-storage-daemon.c 
> b/storage-daemon/qemu-storage-daemon.c
> index bb9cb740f0..7cbdbf0b23 100644
> --- a/storage-daemon/qemu-storage-daemon.c
> +++ b/storage-daemon/qemu-storage-daemon.c
> @@ -264,21 +264,14 @@ static void process_options(int argc, char *argv[])
>  }
>  case OPTION_OBJECT:
>  {
> -QemuOpts *opts;
> -const char *type;
>  QDict *args;
> +bool help;
>  
> -/* FIXME The keyval parser rejects 'help' arguments, so we 
> must
> - * unconditionall try QemuOpts first. */

And you're fixing a typo by deleting it ;)

> -opts = qemu_opts_parse(&qemu_object_opts,
> -   optarg, true, &error_fatal);
> -type = qemu_opt_get(opts, "qom-type");
> -if (type && user_creatable_print_help(type, opts)) {
> +args = keyval_parse(optarg, "qom-type", &help, &error_fatal);
> +if (help) {
> +user_creatable_print_help_from_qdict(args);
>  exit(EXIT_SUCCESS);
>  }

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
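
For reference, the user-visible effect of the series is that help requests
now flow through the keyval parser, e.g. (object type picked for
illustration):

    $ qemu-storage-daemon --object secret,help

which lists the properties of the secret object, so the QemuOpts detour
removed above is no longer needed.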





Re: [PATCH 3/4] qom: Add user_creatable_print_help_from_qdict()

2020-09-29 Thread Eric Blake
On 9/29/20 12:26 PM, Kevin Wolf wrote:
> This adds a function that, given a QDict of non-help options, prints
> help for user creatable objects.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  include/qom/object_interfaces.h | 9 +
>  qom/object_interfaces.c | 9 +
>  2 files changed, 18 insertions(+)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: [PATCH 2/4] qom: Factor out helpers from user_creatable_print_help()

2020-09-29 Thread Eric Blake
On 9/29/20 12:26 PM, Kevin Wolf wrote:
> This creates separate helper functions for printing a list of user
> creatable object types and for printing a list of properties of a given
> type. This allows using these parts without having a QemuOpts.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  qom/object_interfaces.c | 90 -
>  1 file changed, 52 insertions(+), 38 deletions(-)
> 

Awkward diff as presented; ignoring whitespace makes it easier.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: [PATCH 1/4] keyval: Parse help options

2020-09-29 Thread Eric Blake
On 9/29/20 12:26 PM, Kevin Wolf wrote:
> This adds a new parameter 'help' to keyval_parse() that enables parsing
> of help options. If NULL is passed, the function behaves the same as
> before. But if a bool pointer is given, it contains the information
> whether an option "help" without value was given (which would otherwise
> either result in an error or be interpreted as the value for an implied
> key).
> 
> Signed-off-by: Kevin Wolf 
> ---

> +++ b/util/keyval.c

Might be nice to see this before the testsuite changes by tweaking the
git orderfile.

> @@ -166,7 +166,7 @@ static QObject *keyval_parse_put(QDict *cur,
>   * On failure, return NULL.
>   */
>  static const char *keyval_parse_one(QDict *qdict, const char *params,
> -const char *implied_key,
> +const char *implied_key, bool *help,
>  Error **errp)
>  {
>  const char *key, *key_end, *s, *end;
> @@ -179,6 +179,16 @@ static const char *keyval_parse_one(QDict *qdict, const char *params,
>  
>  key = params;
>  len = strcspn(params, "=,");
> +
> +if (help && key[len] != '=' && !strncmp(key, "help", len)) {

What if the user typed "help,," to get "help," as the value of the
implied key?
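
For concreteness, a sketch of the two parses in question (written
against the keyval_parse() signature this patch introduces; ",," is
the existing keyval escape for a literal comma in an implied-key
value):

    bool help;
    Error *err = NULL;

    /* Bare "help": the new check fires and *help becomes true. */
    QDict *d1 = keyval_parse("help", "id", &help, &err);

    /* "help,,": should desugar to {"id": "help,"} -- but strcspn()
     * stops at the first ',', so len is 4, strncmp(key, "help", len)
     * matches, and the help branch is taken instead of the
     * implied-key path. */
    QDict *d2 = keyval_parse("help,,", "id", &help, &err);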

> +*help = true;
> +s = key + len;
> +if (key[len] != '\0') {
> +s++;
> +}
> +return s;
> +}
> +
>  if (implied_key && len && key[len] != '=') {
>  /* Desugar implied key */
>  key = implied_key;
> @@ -388,21 +398,33 @@ static QObject *keyval_listify(QDict *cur, GSList *key_of_cur, Error **errp)
>  
>  /*
>   * Parse @params in QEMU's traditional KEY=VALUE,... syntax.
> + *
>   * If @implied_key, the first KEY= can be omitted.  @implied_key is
>   * implied then, and VALUE can't be empty or contain ',' or '='.
> + *
> + * If @help is given, an option "help" without a value isn't added to
> + * the resulting dictionary, but instead sets @help to true. If no
> + * help option is found, @help is false on return. All other options
> + * are parsed and returned normally so that context specific help can
> + * be printed.
> + *
>   * On success, return a dictionary of the parsed keys and values.
>   * On failure, store an error through @errp and return NULL.
>   */
>  QDict *keyval_parse(const char *params, const char *implied_key,
> -Error **errp)
> +bool *help, Error **errp)
>  {
>  QDict *qdict = qdict_new();
>  QObject *listified;
>  const char *s;
>  
> +if (help) {
> +*help = false;
> +}
> +
>  s = params;
>  while (*s) {
> -s = keyval_parse_one(qdict, s, implied_key, errp);
> +s = keyval_parse_one(qdict, s, implied_key, help, errp);
>  if (!s) {
>  qobject_unref(qdict);
>  return NULL;
> 

I like it, but wonder if you are missing one corner case.


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Keith Busch
On Tue, Sep 29, 2020 at 11:13:51AM +, Damien Le Moal wrote:
> OK. Then let's move the persistence implementation as the last patch in the
> series. This way, if it is still controversial, it will not block the rest.
> 
> Here is what I propose:
> Dmitry: remove persistence stuff from your patches, address comments and 
> resend.
> Klaus: Rebase your persistence patch(es) with reworked format on top of Dmitry
> series and send.
> 
> That creates a pipeline for reviews and persistence is not a blocker. And I
> agree that other ZNS feature can come after we get all of that done first.
> 
> Thoughts ? Keith ? Would that work for you ?

That works for me. I will have comments for Dmitry's v5, though, so
please wait one more day before considering a respin.



Re: [PATCH v2 3/4] block: move block exports to libblockdev

2020-09-29 Thread Eric Blake
On 9/29/20 7:55 AM, Stefan Hajnoczi wrote:
> Block exports are used by softmmu, qemu-storage-daemon, and qemu-nbd.
> They are not used by other programs and are not otherwise needed in
> libblock.
> 
> Undo the recent move of blockdev-nbd.c from blockdev_ss into block_ss.
> Since bdrv_close_all() (libblock) calls blk_exp_close_all()
> (libblockdev) a stub function is required..
> 
> Make qemu-ndb.c use signal handling utility functions instead of
> duplicating the code. This helps because os-posix.c is in libblockdev
> and it depends on a qemu_system_killed() symbol that qemu-nbd.c lacks.
> Once we use the signal handling utility functions we also end up
> providing the necessary symbol.

Hmm. I just stumbled on a long-standing bug in qemu-nbd - it installs a
SIGTERM handler, but not a SIGINT or SIGHUP handler.  This matters in
the following sequence:

qemu-nbd -f qcow2 -B bitmap image   # Ctrl-C
qemu-nbd -f qcow2 -B bitmap image

because the first instance dies with SIGINT but there is no handler
installed, qemu-nbd does not release the bitmap from being marked
in-use, and the second instance then fails with:

qemu-nbd: Bitmap 'b0' is inconsistent and cannot be used

And to my surprise, while I was trying to find the root cause to fixing
the bug I just found, I noticed that your patch happens to fix that...

> +++ b/qemu-nbd.c

> @@ -581,20 +586,12 @@ int main(int argc, char **argv)
>  const char *pid_file_name = NULL;
>  BlockExportOptions *export_opts;
>  
> +os_setup_early_signal_handling();
> +
>  #if HAVE_NBD_DEVICE
> -/* The client thread uses SIGTERM to interrupt the server.  A signal
> - * handler ensures that "qemu-nbd -v -c" exits with a nice status code.
> - */
> -struct sigaction sa_sigterm;
> -memset(&sa_sigterm, 0, sizeof(sa_sigterm));
> -sa_sigterm.sa_handler = termsig_handler;
> -sigaction(SIGTERM, &sa_sigterm, NULL);
> +os_setup_signal_handling();

...by installing a SIGINT handler.

Is HAVE_NBD_DEVICE really the best gate for this code, or is it really
whether we are compiling for mingw?  At any rate, you may want to add a
link to https://bugzilla.redhat.com/show_bug.cgi?id=1883608 in the
commit message, and/or separate the bug fix out into a separate commit.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Keith Busch
On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> It is unmistakably clear that you are invalidating my arguments about
> portability and endianness issues by suggesting that we just remove
> persistent state and deal with it later, but persistence is the killer
> feature that sets the QEMU emulated device apart from other emulation
> options. It is not about using emulation in production (because yeah,
> why would you?), but persistence is what makes it possible to develop
> and test "zoned FTLs" or something that requires recovery at power up.
> This is what allows testing of how your host software deals with opened
> zones being transitioned to FULL on power up and the persistent tracking
> of LBA allocation (in my series) can be used to properly test error
> recovery if you lost state in the app.

Hold up -- why does an OPEN zone transition to FULL on power up? The
spec suggests it should be CLOSED. The spec does appear to support going
to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
at this part of the spec, these implicit transitions seem a bit less
clear than I expected. I'm not sure it's clear enough to evaluate qemu's
compliance right now.

But I don't see what testing these transitions has to do with having a
persistent state. You can reboot your VM without tearing down the
running QEMU instance. You can also unbind the driver or shutdown the
controller within the running operating system. That should make those
implicit state transitions reachable in order to exercise your FTL's
recovery.
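
For instance (guest-side sketch; the PCI address is hypothetical and
depends on the VM configuration):

    # reboot the guest without tearing down QEMU
    (qemu) system_reset

    # or shut the controller down and bring it back up from a Linux
    # guest by unbinding/rebinding the nvme driver
    $ echo 0000:00:04.0 > /sys/bus/pci/drivers/nvme/unbind
    $ echo 0000:00:04.0 > /sys/bus/pci/drivers/nvme/bind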

I agree the persistent state provides conveniences for developers. I
just don't want to gate ZNS enabling on it either since the core design
doesn't depend on it.



[PATCH 2/4] qom: Factor out helpers from user_creatable_print_help()

2020-09-29 Thread Kevin Wolf
This creates separate helper functions for printing a list of user
creatable object types and for printing a list of properties of a given
type. This allows using these parts without having a QemuOpts.

Signed-off-by: Kevin Wolf 
---
 qom/object_interfaces.c | 90 -
 1 file changed, 52 insertions(+), 38 deletions(-)

diff --git a/qom/object_interfaces.c b/qom/object_interfaces.c
index e8e1523960..3fd1da157e 100644
--- a/qom/object_interfaces.c
+++ b/qom/object_interfaces.c
@@ -214,54 +214,68 @@ char *object_property_help(const char *name, const char *type,
 return g_string_free(str, false);
 }
 
-bool user_creatable_print_help(const char *type, QemuOpts *opts)
+static void user_creatable_print_types(void)
+{
+GSList *l, *list;
+
+printf("List of user creatable objects:\n");
+list = object_class_get_list_sorted(TYPE_USER_CREATABLE, false);
+for (l = list; l != NULL; l = l->next) {
+ObjectClass *oc = OBJECT_CLASS(l->data);
+printf("  %s\n", object_class_get_name(oc));
+}
+g_slist_free(list);
+}
+
+static bool user_creatable_print_type_properties(const char *type)
 {
 ObjectClass *klass;
+ObjectPropertyIterator iter;
+ObjectProperty *prop;
+GPtrArray *array;
+int i;
 
-if (is_help_option(type)) {
-GSList *l, *list;
+klass = object_class_by_name(type);
+if (!klass) {
+return false;
+}
 
-printf("List of user creatable objects:\n");
-list = object_class_get_list_sorted(TYPE_USER_CREATABLE, false);
-for (l = list; l != NULL; l = l->next) {
-ObjectClass *oc = OBJECT_CLASS(l->data);
-printf("  %s\n", object_class_get_name(oc));
+array = g_ptr_array_new();
+object_class_property_iter_init(&iter, klass);
+while ((prop = object_property_iter_next(&iter))) {
+if (!prop->set) {
+continue;
 }
-g_slist_free(list);
-return true;
+
+g_ptr_array_add(array,
+object_property_help(prop->name, prop->type,
+ prop->defval, prop->description));
 }
+g_ptr_array_sort(array, (GCompareFunc)qemu_pstrcmp0);
+if (array->len > 0) {
+printf("%s options:\n", type);
+} else {
+printf("There are no options for %s.\n", type);
+}
+for (i = 0; i < array->len; i++) {
+printf("%s\n", (char *)array->pdata[i]);
+}
+g_ptr_array_set_free_func(array, g_free);
+g_ptr_array_free(array, true);
+return true;
+}
 
-klass = object_class_by_name(type);
-if (klass && qemu_opt_has_help_opt(opts)) {
-ObjectPropertyIterator iter;
-ObjectProperty *prop;
-GPtrArray *array = g_ptr_array_new();
-int i;
-
-object_class_property_iter_init(&iter, klass);
-while ((prop = object_property_iter_next(&iter))) {
-if (!prop->set) {
-continue;
-}
-
-g_ptr_array_add(array,
-object_property_help(prop->name, prop->type,
- prop->defval, prop->description));
-}
-g_ptr_array_sort(array, (GCompareFunc)qemu_pstrcmp0);
-if (array->len > 0) {
-printf("%s options:\n", type);
-} else {
-printf("There are no options for %s.\n", type);
-}
-for (i = 0; i < array->len; i++) {
-printf("%s\n", (char *)array->pdata[i]);
-}
-g_ptr_array_set_free_func(array, g_free);
-g_ptr_array_free(array, true);
+bool user_creatable_print_help(const char *type, QemuOpts *opts)
+{
+if (is_help_option(type)) {
+user_creatable_print_types();
 return true;
 }
 
+if (qemu_opt_has_help_opt(opts)) {
+return user_creatable_print_type_properties(type);
+}
+
 return false;
 }
 
-- 
2.25.4




[PATCH 3/4] qom: Add user_creatable_print_help_from_qdict()

2020-09-29 Thread Kevin Wolf
This adds a function that, given a QDict of non-help options, prints
help for user creatable objects.

Signed-off-by: Kevin Wolf 
---
 include/qom/object_interfaces.h | 9 +
 qom/object_interfaces.c | 9 +
 2 files changed, 18 insertions(+)

diff --git a/include/qom/object_interfaces.h b/include/qom/object_interfaces.h
index f118fb516b..53b114b11a 100644
--- a/include/qom/object_interfaces.h
+++ b/include/qom/object_interfaces.h
@@ -161,6 +161,15 @@ int user_creatable_add_opts_foreach(void *opaque,
  */
 bool user_creatable_print_help(const char *type, QemuOpts *opts);
 
+/**
+ * user_creatable_print_help_from_qdict:
+ * @args: options to create
+ *
+ * Prints help considering the other options given in @args (if "qom-type" is
+ * given and valid, print properties for the type, otherwise print valid types)
+ */
+void user_creatable_print_help_from_qdict(QDict *args);
+
 /**
  * user_creatable_del:
  * @id: the unique ID for the object
diff --git a/qom/object_interfaces.c b/qom/object_interfaces.c
index 3fd1da157e..ed896fe764 100644
--- a/qom/object_interfaces.c
+++ b/qom/object_interfaces.c
@@ -279,6 +279,15 @@ bool user_creatable_print_help(const char *type, QemuOpts *opts)
 return false;
 }
 
+void user_creatable_print_help_from_qdict(QDict *args)
+{
+const char *type = qdict_get_try_str(args, "qom-type");
+
+if (!type || !user_creatable_print_type_properties(type)) {
+user_creatable_print_types();
+}
+}
+
 bool user_creatable_del(const char *id, Error **errp)
 {
 Object *container;
-- 
2.25.4




[PATCH 0/4] qemu-storage-daemon: Remove QemuOpts from --object parser

2020-09-29 Thread Kevin Wolf
This replaces the QemuOpts-based help code for --object in the storage
daemon with code based on the keyval parser.

Kevin Wolf (4):
  keyval: Parse help options
  qom: Factor out helpers from user_creatable_print_help()
  qom: Add user_creatable_print_help_from_qdict()
  qemu-storage-daemon: Remove QemuOpts from --object parser

 include/qemu/option.h|   2 +-
 include/qom/object_interfaces.h  |   9 ++
 qapi/qobject-input-visitor.c |   2 +-
 qom/object_interfaces.c  |  99 ++---
 storage-daemon/qemu-storage-daemon.c |  15 +--
 tests/test-keyval.c  | 157 ---
 util/keyval.c|  28 -
 7 files changed, 196 insertions(+), 116 deletions(-)

-- 
2.25.4




[PATCH 4/4] qemu-storage-daemon: Remove QemuOpts from --object parser

2020-09-29 Thread Kevin Wolf
The command line parser for --object parses the input twice: Once into
QemuOpts just for detecting help options, and then again into a QDict
using the keyval parser for actually creating the object.

Now that the keyval parser can also detect help options, we can simplify
this and remove the QemuOpts part.

Signed-off-by: Kevin Wolf 
---
 storage-daemon/qemu-storage-daemon.c | 15 ---
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
index bb9cb740f0..7cbdbf0b23 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -264,21 +264,14 @@ static void process_options(int argc, char *argv[])
 }
 case OPTION_OBJECT:
 {
-QemuOpts *opts;
-const char *type;
 QDict *args;
+bool help;
 
-/* FIXME The keyval parser rejects 'help' arguments, so we must
- * unconditionall try QemuOpts first. */
-opts = qemu_opts_parse(&qemu_object_opts,
-   optarg, true, &error_fatal);
-type = qemu_opt_get(opts, "qom-type");
-if (type && user_creatable_print_help(type, opts)) {
+args = keyval_parse(optarg, "qom-type", &help, &error_fatal);
+if (help) {
+user_creatable_print_help_from_qdict(args);
 exit(EXIT_SUCCESS);
 }
-qemu_opts_del(opts);
-
-args = keyval_parse(optarg, "qom-type", NULL, &error_fatal);
 user_creatable_add_dict(args, true, _fatal);
 qobject_unref(args);
 break;
-- 
2.25.4




[PATCH 1/4] keyval: Parse help options

2020-09-29 Thread Kevin Wolf
This adds a new parameter 'help' to keyval_parse() that enables parsing
of help options. If NULL is passed, the function behaves the same as
before. But if a bool pointer is given, it contains the information
whether an option "help" without value was given (which would otherwise
either result in an error or be interpreted as the value for an implied
key).

Signed-off-by: Kevin Wolf 
---
 include/qemu/option.h|   2 +-
 qapi/qobject-input-visitor.c |   2 +-
 storage-daemon/qemu-storage-daemon.c |   2 +-
 tests/test-keyval.c  | 157 ---
 util/keyval.c|  28 -
 5 files changed, 123 insertions(+), 68 deletions(-)

diff --git a/include/qemu/option.h b/include/qemu/option.h
index 05e8a15c73..ac69352e0e 100644
--- a/include/qemu/option.h
+++ b/include/qemu/option.h
@@ -149,6 +149,6 @@ void qemu_opts_free(QemuOptsList *list);
 QemuOptsList *qemu_opts_append(QemuOptsList *dst, QemuOptsList *list);
 
 QDict *keyval_parse(const char *params, const char *implied_key,
-Error **errp);
+bool *help, Error **errp);
 
 #endif
diff --git a/qapi/qobject-input-visitor.c b/qapi/qobject-input-visitor.c
index f918a05e5f..7b184b50a7 100644
--- a/qapi/qobject-input-visitor.c
+++ b/qapi/qobject-input-visitor.c
@@ -757,7 +757,7 @@ Visitor *qobject_input_visitor_new_str(const char *str,
 assert(args);
 v = qobject_input_visitor_new(QOBJECT(args));
 } else {
-args = keyval_parse(str, implied_key, errp);
+args = keyval_parse(str, implied_key, NULL, errp);
 if (!args) {
 return NULL;
 }
diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
index e6157ff518..bb9cb740f0 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -278,7 +278,7 @@ static void process_options(int argc, char *argv[])
 }
 qemu_opts_del(opts);
 
-args = keyval_parse(optarg, "qom-type", &error_fatal);
+args = keyval_parse(optarg, "qom-type", NULL, &error_fatal);
 user_creatable_add_dict(args, true, _fatal);
 qobject_unref(args);
 break;
diff --git a/tests/test-keyval.c b/tests/test-keyval.c
index e331a84149..1ac65c371e 100644
--- a/tests/test-keyval.c
+++ b/tests/test-keyval.c
@@ -27,27 +27,28 @@ static void test_keyval_parse(void)
     QDict *qdict, *sub_qdict;
     char long_key[129];
     char *params;
+    bool help;
 
     /* Nothing */
-    qdict = keyval_parse("", NULL, &error_abort);
+    qdict = keyval_parse("", NULL, NULL, &error_abort);
     g_assert_cmpuint(qdict_size(qdict), ==, 0);
     qobject_unref(qdict);
 
     /* Empty key (qemu_opts_parse() accepts this) */
-    qdict = keyval_parse("=val", NULL, &err);
+    qdict = keyval_parse("=val", NULL, NULL, &err);
     error_free_or_abort(&err);
     g_assert(!qdict);
 
     /* Empty key fragment */
-    qdict = keyval_parse(".", NULL, &err);
+    qdict = keyval_parse(".", NULL, NULL, &err);
     error_free_or_abort(&err);
     g_assert(!qdict);
-    qdict = keyval_parse("key.", NULL, &err);
+    qdict = keyval_parse("key.", NULL, NULL, &err);
     error_free_or_abort(&err);
     g_assert(!qdict);
 
     /* Invalid non-empty key (qemu_opts_parse() doesn't care) */
-    qdict = keyval_parse("7up=val", NULL, &err);
+    qdict = keyval_parse("7up=val", NULL, NULL, &err);
     error_free_or_abort(&err);
     g_assert(!qdict);
 
@@ -56,25 +57,25 @@ static void test_keyval_parse(void)
     long_key[127] = 'z';
     long_key[128] = 0;
     params = g_strdup_printf("k.%s=v", long_key);
-    qdict = keyval_parse(params + 2, NULL, &err);
+    qdict = keyval_parse(params + 2, NULL, NULL, &err);
     error_free_or_abort(&err);
     g_assert(!qdict);
 
     /* Overlong key fragment */
-    qdict = keyval_parse(params, NULL, &err);
+    qdict = keyval_parse(params, NULL, NULL, &err);
     error_free_or_abort(&err);
     g_assert(!qdict);
     g_free(params);
 
     /* Long key (qemu_opts_parse() accepts and truncates silently) */
     params = g_strdup_printf("k.%s=v", long_key + 1);
-    qdict = keyval_parse(params + 2, NULL, &error_abort);
+    qdict = keyval_parse(params + 2, NULL, NULL, &error_abort);
     g_assert_cmpuint(qdict_size(qdict), ==, 1);
     g_assert_cmpstr(qdict_get_try_str(qdict, long_key + 1), ==, "v");
     qobject_unref(qdict);
 
     /* Long key fragment */
-    qdict = keyval_parse(params, NULL, &error_abort);
+    qdict = keyval_parse(params, NULL, NULL, &error_abort);
     g_assert_cmpuint(qdict_size(qdict), ==, 1);
     sub_qdict = qdict_get_qdict(qdict, "k");
     g_assert(sub_qdict);
@@ -84,25 +85,25 @@ static void test_keyval_parse(void)
     g_free(params);
 
     /* Crap after valid key */
-    qdict = keyval_parse("key[0]=val", NULL, &err);
+    qdict = keyval_parse("key[0]=val", NULL, NULL, &err);
     error_free_or_abort(&err);
     g_assert(!qdict);
 
     /* Multiple keys, last one wins */
-    qdict = keyval_parse("a=1,b=2,,x,a=3", 

Re: [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence

2020-09-29 Thread Klaus Jensen
On Sep 29 15:43, Dmitry Fomichev wrote:
> 
> 
> > -----Original Message-----
> > From: Klaus Jensen 
> > Sent: Monday, September 28, 2020 3:52 AM
> > To: Dmitry Fomichev 
> > Cc: Keith Busch ; Klaus Jensen
> > ; Kevin Wolf ; Philippe
> > Mathieu-Daudé ; Maxim Levitsky
> > ; Fam Zheng ; Niklas Cassel
> > ; Damien Le Moal ;
> > qemu-block@nongnu.org; qemu-de...@nongnu.org; Alistair Francis
> > ; Matias Bjorling 
> > Subject: Re: [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for
> > persistence
> > 
> > On Sep 28 11:35, Dmitry Fomichev wrote:
> > > A ZNS drive that is emulated by this module is currently initialized
> > > with all zones Empty upon startup. However, actual ZNS SSDs save the
> > > state and condition of all zones in their internal NVRAM in the event
> > > of power loss. When such a drive is powered up again, it closes or
> > > finishes all zones that were open at the moment of shutdown. Besides
> > > that, the write pointer position as well as the state and condition
> > > of all zones is preserved across power-downs.
> > >
> > > This commit adds the capability to have a persistent zone metadata
> > > to the device. The new optional module property, "zone_file",
> > > is introduced. If added to the command line, this property specifies
> > > the name of the file that stores the zone metadata. If "zone_file" is
> > > omitted, the device will be initialized with all zones empty, the same
> > > as before.
> > >
> > > If zone metadata is configured to be persistent, then zone descriptor
> > > extensions also persist across controller shutdowns.
> > >
> > > Signed-off-by: Dmitry Fomichev 
> > > ---
> > >  hw/block/nvme-ns.c| 341
> > --
> > >  hw/block/nvme-ns.h|  33 
> > >  hw/block/nvme.c   |   2 +
> > >  hw/block/trace-events |   1 +
> > >  4 files changed, 362 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > > index 47751f2d54..a94021da81 100644
> > > --- a/hw/block/nvme-ns.c
> > > +++ b/hw/block/nvme-ns.c
> > > @@ -293,12 +421,180 @@ static void nvme_init_zone_meta(NvmeNamespace *ns)
> > >  i--;
> > >  }
> > >  }
> > > +
> > > +if (ns->params.zone_file) {
> > > +nvme_set_zone_meta_dirty(ns);
> > > +}
> > > +}
> > > +
> > > +static int nvme_open_zone_file(NvmeNamespace *ns, bool *init_meta,
> > > +   Error **errp)
> > > +{
> > > +Object *file_be;
> > > +HostMemoryBackend *fb;
> > > +struct stat statbuf;
> > > +int ret;
> > > +
> > > +ret = stat(ns->params.zone_file, &statbuf);
> > > +if (ret && errno == ENOENT) {
> > > +*init_meta = true;
> > > +} else if (!S_ISREG(statbuf.st_mode)) {
> > > +error_setg(errp, "\"%s\" is not a regular file",
> > > +   ns->params.zone_file);
> > > +return -1;
> > > +}
> > > +
> > > +file_be = object_new(TYPE_MEMORY_BACKEND_FILE);
> > > +object_property_set_str(file_be, "mem-path", ns->params.zone_file,
> > > +                        &error_abort);
> > > +object_property_set_int(file_be, "size", ns->meta_size, &error_abort);
> > > +object_property_set_bool(file_be, "share", true, &error_abort);
> > > +object_property_set_bool(file_be, "discard-data", false, &error_abort);
> > > +if (!user_creatable_complete(USER_CREATABLE(file_be), errp)) {
> > > +object_unref(file_be);
> > > +return -1;
> > > +}
> > > +object_property_add_child(OBJECT(ns), "_fb", file_be);
> > > +object_unref(file_be);
> > > +
> > > +fb = MEMORY_BACKEND(file_be);
> > > +ns->zone_mr = host_memory_backend_get_memory(fb);
> > > +
> > > +return 0;
> > > +}
> > > +
> > > +static int nvme_map_zone_file(NvmeNamespace *ns, bool *init_meta)
> > > +{
> > > +ns->zone_meta = (void *)memory_region_get_ram_ptr(ns->zone_mr);
> > 
> > I forgot that the HostMemoryBackend doesn't magically make the memory
> > available to the device, so of course this is still needed.
> > 
> > Anyway.
> > 
> > No reason for me to keep complaining about this. I do not like it, I
> > will not ACK it and I think I made my reasons pretty clear.
> 
> So, memory_region_msync() is ok, but memory_region_get_ram_ptr() is not??
> This is the same API! You are really splitting hairs here to suit your agenda.
> Moving goal posts again
> 
> The "I do not like it" part is priceless. It is great that we have mail 
> archives available.
> 

If you read my review again, its pretty clear that I am calling out the
abstraction. I was clear that if it *really* had to be mmap based, then
it should use hostmem. Sorry for moving your patchset forward by
suggesting an improvement.

But again, I also made it pretty clear that I did not agree with the
abstraction. And that I very much disliked that it was non-portable. And
had endiannes issues. I made it SUPER clear that that was why I "did not
like it".




Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 29 15:43, Dmitry Fomichev wrote:
> > -----Original Message-----
> > From: Qemu-block  > bounces+dmitry.fomichev=wdc@nongnu.org> On Behalf Of Klaus
> > Jensen
> > Sent: Tuesday, September 29, 2020 6:47 AM
> > To: Damien Le Moal 
> > Cc: Fam Zheng ; Kevin Wolf ; qemu-
> > bl...@nongnu.org; Niklas Cassel ; Klaus Jensen
> > ; qemu-de...@nongnu.org; Alistair Francis
> > ; Keith Busch ; Philippe
> > Mathieu-Daudé ; Matias Bjorling
> > 
> > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> > and Zoned Namespace Command Set
> > 
> > On Sep 28 22:54, Damien Le Moal wrote:
> > > On 2020/09/29 6:25, Keith Busch wrote:
> > > > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> > > >> On Sep 28 02:33, Dmitry Fomichev wrote:
> > > >>> You are making it sound like the entire WDC series relies on this
> > approach.
> > > >>> Actually, the persistency is introduced in the second to last patch 
> > > >>> in the
> > > >>> series and it only adds a couple of lines of code in the i/o path to 
> > > >>> mark
> > > >>> zones dirty. This is possible because of using mmap() and I find the 
> > > >>> way
> > > >>> it is done to be quite elegant, not ugly :)
> > > >>>
> > > >>
> > > >> No, I understand that your implementation works fine without
> > > >> persistence, but persistence is key. That is why my series adds it in
> > > >> the first patch. Without persistence it is just a toy. And the QEMU
> > > >> device is not just an "NVMe-version" of null_blk.
> > > >
> > > > I really think we should be a bit more cautious of commiting to an
> > > > on-disk format for the persistent state. Both this and Klaus' persistent
> > > > state feels a bit ad-hoc, and with all the other knobs provided, it
> > > > looks too easy to have out-of-sync states, or just not being able to
> > > > boot at all if qemu versions have different on-disk formats.
> > > >
> > > > Is anyone really considering zone emulation for production level stuff
> > > > anyway? I can't imagine a real scenario where you'd want put yourself
> > > > through that: you are just giving yourself all the downsides of a zoned
> > > > block device and none of the benefits. AFAIK, this is provided as a
> > > > development vehicle, closer to a "toy".
> > > >
> > > > I think we should consider trimming this down to a more minimal set that
> > > > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > > > bells & whistles and flush out the meta data's data marshalling scheme
> > > > for persistence later.
> > >
> > > +1 on this. Removing the persistence also removes the debate on
> > endianess. With
> > > that out of the way, it should be straightforward to get agreement on a
> > series
> > > that can be merged quickly to get developers started with testing ZNS
> > software
> > > with QEMU. That is the most important goal here. 5.9 is around the corner,
> > we
> > > need something for people to get started with ZNS quickly.
> > >
> > 
> > Wait. What. No. Stop!
> > 
> > It is unmistakably clear that you are invalidating my arguments about
> > portability and endianness issues by suggesting that we just remove
> > persistent state and deal with it later, but persistence is the killer
> > feature that sets the QEMU emulated device apart from other emulation
> > options. It is not about using emulation in production (because yeah,
> > why would you?), but persistence is what makes it possible to develop
> > and test "zoned FTLs" or something that requires recovery at power up.
> > This is what allows testing of how your host software deals with opened
> > zones being transitioned to FULL on power up and the persistent tracking
> > of LBA allocation (in my series) can be used to properly test error
> > recovery if you lost state in the app.
> > 
> > Please, work with me on this instead of just removing such an essential
> > feature. Since persistence seems to be the only thing we are really
> > discussing, we should have plenty of time until the soft-freeze to come
> > up with a proper solution on that.
> > 
> > I agree that my version had a format that was pretty ad-hoc and that
> > won't fly - it needs magic and version capabilities like in Dmitry's
> > series, which incidentally looks a lot like what we did in the
> > OpenChannel implementation, so I agree with the strategy.
> 
> Are you insinuating that I somehow took stuff from OCSSD code and am
> trying to claim priority this way? I am not at all that familiar with that code.
> And I've already sent you the link to tcmu-runner code that served me
> as an inspiration for implementing persistence in WDC patchset.
> That code has been around for years, uses mmap, works great and has
> nothing to do with you.
> 

No. I am not insinuating anything. The OpenChannel device also used a
blockdev, but, yes, incidentally (and sorry, I should not have used
that word), it looked like how we did it there and I noted that I agreed
with the strategy.

> > 
> > ZNS-wise, the only thing my 

Re: [PATCH] qcow2: Use L1E_SIZE in qcow2_write_l1_entry()

2020-09-29 Thread Eric Blake
On 9/28/20 11:23 AM, Alberto Garcia wrote:
> We overlooked these in 02b1ecfa100e7ecc2306560cd27a4a2622bfeb04
> 
> Signed-off-by: Alberto Garcia 
> ---
>  block/qcow2-cluster.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Reviewed-by: Eric Blake 

> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 9acc6ce4ae..aa87d3e99b 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -240,14 +240,14 @@ int qcow2_write_l1_entry(BlockDriverState *bs, int 
> l1_index)
>  }
>  
>  ret = qcow2_pre_write_overlap_check(bs, QCOW2_OL_ACTIVE_L1,
> -s->l1_table_offset + 8 * l1_start_index, bufsize, false);
> +s->l1_table_offset + L1E_SIZE * l1_start_index, bufsize, false);
>  if (ret < 0) {
>  return ret;
>  }
>  
>  BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
>  ret = bdrv_pwrite_sync(bs->file,
> -   s->l1_table_offset + 8 * l1_start_index,
> +   s->l1_table_offset + L1E_SIZE * l1_start_index,
> buf, bufsize);
>  if (ret < 0) {
>  return ret;
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Re: [PATCH v2 4/4] block/export: add iothread and fixed-iothread options

2020-09-29 Thread Stefan Hajnoczi
On Tue, Sep 29, 2020 at 08:07:38AM -0500, Eric Blake wrote:
> On 9/29/20 7:55 AM, Stefan Hajnoczi wrote:
> > Make it possible to specify the iothread where the export will run. By
> > default the block node can be moved to other AioContexts later and the
> > export will follow. The fixed-iothread option forces strict behavior
> > that prevents changing AioContext while the export is active. See the
> > QAPI docs for details.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> > Note the x-blockdev-set-iothread QMP command can be used to do the same,
> > but not from the command-line. And it requires sending an additional
> > command.
> > 
> > In the long run vhost-user-blk will support per-virtqueue iothread
> > mappings. But for now a single iothread makes sense and most other
> > transports will just use one iothread anyway.
> > ---
> >   qapi/block-export.json   | 11 ++
> >   block/export/export.c| 31 +++-
> >   block/export/vhost-user-blk-server.c |  5 -
> >   nbd/server.c |  2 --
> >   4 files changed, 45 insertions(+), 4 deletions(-)
> > 
> > diff --git a/qapi/block-export.json b/qapi/block-export.json
> > index 87ac5117cd..e2cb21f5f1 100644
> > --- a/qapi/block-export.json
> > +++ b/qapi/block-export.json
> > @@ -219,11 +219,22 @@
> >   #export before completion is signalled. (since: 5.2;
> >   #default: false)
> >   #
> > +# @iothread: The name of the iothread object where the export will run. The
> > +#default is to use the thread currently associated with the #
> 
> Stray #
> 
> > +#block node. (since: 5.2)
> > +#
> > +# @fixed-iothread: True prevents the block node from being moved to another
> > +#  thread while the export is active. If true and 
> > @iothread is
> > +#  given, export creation fails if the block node cannot be
> > +#  moved to the iothread. The default is false.
> > +#
> 
> Missing a '(since 5.2)' tag.  (Hmm, we're inconsistent on whether it is
> 'since 5.2' or 'since: 5.2' inside () parentheticals; Markus, is that
> something we should be cleaning up as part of the conversion to rST?)
> 
> > @@ -63,10 +64,11 @@ static const BlockExportDriver *blk_exp_find_driver(BlockExportType type)
> >   BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
> >   {
> > +bool fixed_iothread = export->has_fixed_iothread && export->fixed_iothread;
> 
> Technically, our QAPI code guarantees that export->fixed_iothread is false
> if export->has_fixed_iothread is false.  And someday I'd love to let QAPI
> express default values for bools so that we don't need a has_FOO field when
> a default has been expressed.  But neither of those points affect this
> patch; what you have is correct even if it is verbose.
> 
> Otherwise looks reasonable.

Great, thanks for pointing this out.

I'll wait for comments from Kevin. These things could be fixed when
merging.
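
For reference, a hypothetical invocation combining the new options
(option names as in the QAPI schema of this series; the flattened
addr.* keys assume the usual SocketAddress keyval mapping):

  $ qemu-storage-daemon \
      --object iothread,id=iothread0 \
      --blockdev file,node-name=drive0,filename=test.img \
      --export vhost-user-blk,id=export0,node-name=drive0,addr.type=unix,addr.path=/tmp/vublk.sock,iothread=iothread0,fixed-iothread=on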

Stefan




Re: [PULL 5/5] crypto/tls-cipher-suites: Produce fw_cfg consumable blob

2020-09-29 Thread Kevin Wolf
On 04.07.2020 at 18:39, Philippe Mathieu-Daudé wrote:
> Since our format is consumable by the fw_cfg device,
> we can implement the FW_CFG_DATA_GENERATOR interface.
> 
> Example of use to dump the cipher suites (if tracing enabled):
> 
>   $ qemu-system-x86_64 -S \
> -object tls-cipher-suites,id=mysuite1,priority=@SYSTEM \
> -fw_cfg name=etc/path/to/ciphers,gen_id=mysuite1 \
> -trace qcrypto\*
>   159066.197123:qcrypto_tls_cipher_suite_priority priority: @SYSTEM
>   159066.197219:qcrypto_tls_cipher_suite_info data=[0x13,0x02] version=TLS1.3 name=TLS_AES_256_GCM_SHA384
>   159066.197228:qcrypto_tls_cipher_suite_info data=[0x13,0x03] version=TLS1.3 name=TLS_CHACHA20_POLY1305_SHA256
>   159066.197233:qcrypto_tls_cipher_suite_info data=[0x13,0x01] version=TLS1.3 name=TLS_AES_128_GCM_SHA256
>   159066.197236:qcrypto_tls_cipher_suite_info data=[0x13,0x04] version=TLS1.3 name=TLS_AES_128_CCM_SHA256
>   159066.197240:qcrypto_tls_cipher_suite_info data=[0xc0,0x30] version=TLS1.2 name=TLS_ECDHE_RSA_AES_256_GCM_SHA384
>   159066.197245:qcrypto_tls_cipher_suite_info data=[0xcc,0xa8] version=TLS1.2 name=TLS_ECDHE_RSA_CHACHA20_POLY1305
>   159066.197250:qcrypto_tls_cipher_suite_info data=[0xc0,0x14] version=TLS1.0 name=TLS_ECDHE_RSA_AES_256_CBC_SHA1
>   159066.197254:qcrypto_tls_cipher_suite_info data=[0xc0,0x2f] version=TLS1.2 name=TLS_ECDHE_RSA_AES_128_GCM_SHA256
>   159066.197258:qcrypto_tls_cipher_suite_info data=[0xc0,0x13] version=TLS1.0 name=TLS_ECDHE_RSA_AES_128_CBC_SHA1
>   159066.197261:qcrypto_tls_cipher_suite_info data=[0xc0,0x2c] version=TLS1.2 name=TLS_ECDHE_ECDSA_AES_256_GCM_SHA384
>   159066.197266:qcrypto_tls_cipher_suite_info data=[0xcc,0xa9] version=TLS1.2 name=TLS_ECDHE_ECDSA_CHACHA20_POLY1305
>   159066.197270:qcrypto_tls_cipher_suite_info data=[0xc0,0xad] version=TLS1.2 name=TLS_ECDHE_ECDSA_AES_256_CCM
>   159066.197274:qcrypto_tls_cipher_suite_info data=[0xc0,0x0a] version=TLS1.0 name=TLS_ECDHE_ECDSA_AES_256_CBC_SHA1
>   159066.197278:qcrypto_tls_cipher_suite_info data=[0xc0,0x2b] version=TLS1.2 name=TLS_ECDHE_ECDSA_AES_128_GCM_SHA256
>   159066.197283:qcrypto_tls_cipher_suite_info data=[0xc0,0xac] version=TLS1.2 name=TLS_ECDHE_ECDSA_AES_128_CCM
>   159066.197287:qcrypto_tls_cipher_suite_info data=[0xc0,0x09] version=TLS1.0 name=TLS_ECDHE_ECDSA_AES_128_CBC_SHA1
>   159066.197291:qcrypto_tls_cipher_suite_info data=[0x00,0x9d] version=TLS1.2 name=TLS_RSA_AES_256_GCM_SHA384
>   159066.197296:qcrypto_tls_cipher_suite_info data=[0xc0,0x9d] version=TLS1.2 name=TLS_RSA_AES_256_CCM
>   159066.197300:qcrypto_tls_cipher_suite_info data=[0x00,0x35] version=TLS1.0 name=TLS_RSA_AES_256_CBC_SHA1
>   159066.197304:qcrypto_tls_cipher_suite_info data=[0x00,0x9c] version=TLS1.2 name=TLS_RSA_AES_128_GCM_SHA256
>   159066.197308:qcrypto_tls_cipher_suite_info data=[0xc0,0x9c] version=TLS1.2 name=TLS_RSA_AES_128_CCM
>   159066.197312:qcrypto_tls_cipher_suite_info data=[0x00,0x2f] version=TLS1.0 name=TLS_RSA_AES_128_CBC_SHA1
>   159066.197316:qcrypto_tls_cipher_suite_info data=[0x00,0x9f] version=TLS1.2 name=TLS_DHE_RSA_AES_256_GCM_SHA384
>   159066.197320:qcrypto_tls_cipher_suite_info data=[0xcc,0xaa] version=TLS1.2 name=TLS_DHE_RSA_CHACHA20_POLY1305
>   159066.197325:qcrypto_tls_cipher_suite_info data=[0xc0,0x9f] version=TLS1.2 name=TLS_DHE_RSA_AES_256_CCM
>   159066.197329:qcrypto_tls_cipher_suite_info data=[0x00,0x39] version=TLS1.0 name=TLS_DHE_RSA_AES_256_CBC_SHA1
>   159066.197333:qcrypto_tls_cipher_suite_info data=[0x00,0x9e] version=TLS1.2 name=TLS_DHE_RSA_AES_128_GCM_SHA256
>   159066.197337:qcrypto_tls_cipher_suite_info data=[0xc0,0x9e] version=TLS1.2 name=TLS_DHE_RSA_AES_128_CCM
>   159066.197341:qcrypto_tls_cipher_suite_info data=[0x00,0x33] version=TLS1.0 name=TLS_DHE_RSA_AES_128_CBC_SHA1
>   159066.197345:qcrypto_tls_cipher_suite_count count: 29
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Reviewed-by: Daniel P. Berrangé 
> Acked-by: Laszlo Ersek 
> Message-Id: <20200623172726.21040-6-phi...@redhat.com>

I noticed only now that this breaks '--object help' in
qemu-storage-daemon:

$ qemu-storage-daemon --object help
List of user creatable objects:
qemu-storage-daemon: missing interface 'fw_cfg-data-generator' for object 'tls-creds'
Aborted (core dumped)

The reason is that we don't (and can't) link hw/nvram/fw_cfg.c into the
storage daemon because it requires other system emulator stuff.

Kevin




RE: [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence

2020-09-29 Thread Dmitry Fomichev


> -----Original Message-----
> From: Klaus Jensen 
> Sent: Monday, September 28, 2020 3:52 AM
> To: Dmitry Fomichev 
> Cc: Keith Busch ; Klaus Jensen
> ; Kevin Wolf ; Philippe
> Mathieu-Daudé ; Maxim Levitsky
> ; Fam Zheng ; Niklas Cassel
> ; Damien Le Moal ;
> qemu-block@nongnu.org; qemu-de...@nongnu.org; Alistair Francis
> ; Matias Bjorling 
> Subject: Re: [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for
> persistence
> 
> On Sep 28 11:35, Dmitry Fomichev wrote:
> > A ZNS drive that is emulated by this module is currently initialized
> > with all zones Empty upon startup. However, actual ZNS SSDs save the
> > state and condition of all zones in their internal NVRAM in the event
> > of power loss. When such a drive is powered up again, it closes or
> > finishes all zones that were open at the moment of shutdown. Besides
> > that, the write pointer position as well as the state and condition
> > of all zones is preserved across power-downs.
> >
> > This commit adds the capability to have a persistent zone metadata
> > to the device. The new optional module property, "zone_file",
> > is introduced. If added to the command line, this property specifies
> > the name of the file that stores the zone metadata. If "zone_file" is
> > omitted, the device will be initialized with all zones empty, the same
> > as before.
> >
> > If zone metadata is configured to be persistent, then zone descriptor
> > extensions also persist across controller shutdowns.
> >
> > Signed-off-by: Dmitry Fomichev 
> > ---
> >  hw/block/nvme-ns.c| 341
> --
> >  hw/block/nvme-ns.h|  33 
> >  hw/block/nvme.c   |   2 +
> >  hw/block/trace-events |   1 +
> >  4 files changed, 362 insertions(+), 15 deletions(-)
> >
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 47751f2d54..a94021da81 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -293,12 +421,180 @@ static void nvme_init_zone_meta(NvmeNamespace *ns)
> >  i--;
> >  }
> >  }
> > +
> > +if (ns->params.zone_file) {
> > +nvme_set_zone_meta_dirty(ns);
> > +}
> > +}
> > +
> > +static int nvme_open_zone_file(NvmeNamespace *ns, bool *init_meta,
> > +   Error **errp)
> > +{
> > +Object *file_be;
> > +HostMemoryBackend *fb;
> > +struct stat statbuf;
> > +int ret;
> > +
> > +ret = stat(ns->params.zone_file, &statbuf);
> > +if (ret && errno == ENOENT) {
> > +*init_meta = true;
> > +} else if (!S_ISREG(statbuf.st_mode)) {
> > +error_setg(errp, "\"%s\" is not a regular file",
> > +   ns->params.zone_file);
> > +return -1;
> > +}
> > +
> > +file_be = object_new(TYPE_MEMORY_BACKEND_FILE);
> > +object_property_set_str(file_be, "mem-path", ns->params.zone_file,
> > +                        &error_abort);
> > +object_property_set_int(file_be, "size", ns->meta_size, &error_abort);
> > +object_property_set_bool(file_be, "share", true, &error_abort);
> > +object_property_set_bool(file_be, "discard-data", false, &error_abort);
> > +if (!user_creatable_complete(USER_CREATABLE(file_be), errp)) {
> > +object_unref(file_be);
> > +return -1;
> > +}
> > +object_property_add_child(OBJECT(ns), "_fb", file_be);
> > +object_unref(file_be);
> > +
> > +fb = MEMORY_BACKEND(file_be);
> > +ns->zone_mr = host_memory_backend_get_memory(fb);
> > +
> > +return 0;
> > +}
> > +
> > +static int nvme_map_zone_file(NvmeNamespace *ns, bool *init_meta)
> > +{
> > +ns->zone_meta = (void *)memory_region_get_ram_ptr(ns->zone_mr);
> 
> I forgot that the HostMemoryBackend doesn't magically make the memory
> available to the device, so of course this is still needed.
> 
> Anyway.
> 
> No reason for me to keep complaining about this. I do not like it, I
> will not ACK it and I think I made my reasons pretty clear.

So, memory_region_msync() is ok, but memory_region_get_ram_ptr() is not??
This is the same API! You are really splitting hairs here to suit your agenda.
Moving goal posts again

The "I do not like it" part is priceless. It is great that we have mail 
archives available.



RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Dmitry Fomichev
> -Original Message-
> From: Qemu-block  bounces+dmitry.fomichev=wdc@nongnu.org> On Behalf Of Klaus
> Jensen
> Sent: Tuesday, September 29, 2020 6:47 AM
> To: Damien Le Moal 
> Cc: Fam Zheng ; Kevin Wolf ; qemu-
> bl...@nongnu.org; Niklas Cassel ; Klaus Jensen
> ; qemu-de...@nongnu.org; Alistair Francis
> ; Keith Busch ; Philippe
> Mathieu-Daudé ; Matias Bjorling
> 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> On Sep 28 22:54, Damien Le Moal wrote:
> > On 2020/09/29 6:25, Keith Busch wrote:
> > > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> > >> On Sep 28 02:33, Dmitry Fomichev wrote:
> > >>> You are making it sound like the entire WDC series relies on this
> approach.
> > >>> Actually, the persistency is introduced in the second to last patch in 
> > >>> the
> > >>> series and it only adds a couple of lines of code in the i/o path to 
> > >>> mark
> > >>> zones dirty. This is possible because of using mmap() and I find the way
> > >>> it is done to be quite elegant, not ugly :)
> > >>>
> > >>
> > >> No, I understand that your implementation works fine without
> > >> persistence, but persistence is key. That is why my series adds it in
> > >> the first patch. Without persistence it is just a toy. And the QEMU
> > >> device is not just an "NVMe-version" of null_blk.
> > >
> > > I really think we should be a bit more cautious of commiting to an
> > > on-disk format for the persistent state. Both this and Klaus' persistent
> > > state feels a bit ad-hoc, and with all the other knobs provided, it
> > > looks too easy to have out-of-sync states, or just not being able to
> > > boot at all if qemu versions have different on-disk formats.
> > >
> > > Is anyone really considering zone emulation for production level stuff
> > > anyway? I can't imagine a real scenario where you'd want put yourself
> > > through that: you are just giving yourself all the downsides of a zoned
> > > block device and none of the benefits. AFAIK, this is provided as a
> > > development vehicle, closer to a "toy".
> > >
> > > I think we should consider trimming this down to a more minimal set that
> > > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > > bells & whistles and flush out the meta data's data marshalling scheme
> > > for persistence later.
> >
> > +1 on this. Removing the persistence also removes the debate on
> endianess. With
> > that out of the way, it should be straightforward to get agreement on a
> series
> > that can be merged quickly to get developers started with testing ZNS
> software
> > with QEMU. That is the most important goal here. 5.9 is around the corner,
> we
> > need something for people to get started with ZNS quickly.
> >
> 
> Wait. What. No. Stop!
> 
> It is unmistakably clear that you are invalidating my arguments about
> portability and endianness issues by suggesting that we just remove
> persistent state and deal with it later, but persistence is the killer
> feature that sets the QEMU emulated device apart from other emulation
> options. It is not about using emulation in production (because yeah,
> why would you?), but persistence is what makes it possible to develop
> and test "zoned FTLs" or something that requires recovery at power up.
> This is what allows testing of how your host software deals with opened
> zones being transitioned to FULL on power up and the persistent tracking
> of LBA allocation (in my series) can be used to properly test error
> recovery if you lost state in the app.
> 
> Please, work with me on this instead of just removing such an essential
> feature. Since persistence seems to be the only thing we are really
> discussing, we should have plenty of time until the soft-freeze to come
> up with a proper solution on that.
> 
> I agree that my version had a format that was pretty ad-hoc and that
> won't fly - it needs magic and version capabilities like in Dmitry's
> series, which incidentally looks a lot like what we did in the
> OpenChannel implementation, so I agree with the strategy.

Are you insinuating that I somehow took stuff from OCSSD code and am
trying to claim priority this way? I am not at all that familiar with that code.
And I've already sent you the link to tcmu-runner code that served me
as an inspiration for implementing persistence in WDC patchset.
That code has been around for years, uses mmap, works great and has
nothing to do with you.

> 
> ZNS-wise, the only thing my implementation stores is the zone
> descriptors (in spec-native little-endian format) and the zone
> descriptor extensions. So there are no endian issues with those. The
> allocation tracking bitmap is always stored in little endian, but
> converted to big-endian if running on a big-endian host.
> 
> Let me just conjure something up.
> 
> #define NVME_PSTATE_MAGIC ...
> #define NVME_PSTATE_V1    1
> 
> typedef struct NvmePstateHeader {
> uint32_t 
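
A minimal sketch of where that header seems to be going, assuming only
the magic and version fields named above (every other field here is
invented for illustration, not taken from the series):

    typedef struct NvmePstateHeader {
        uint32_t magic;    /* NVME_PSTATE_MAGIC */
        uint32_t version;  /* NVME_PSTATE_V1 */
        uint64_t blk_len;  /* size of the backing blockdev */
        uint8_t  lbads;    /* LBA data size */
        uint8_t  iocs;     /* enabled I/O command set */
        uint8_t  rsvd[46];
    } QEMU_PACKED NvmePstateHeader;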

RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Dmitry Fomichev
> -----Original Message-----
> From: Klaus Jensen 
> Sent: Monday, September 28, 2020 2:37 AM
> To: Dmitry Fomichev 
> Cc: Keith Busch ; Damien Le Moal
> ; Klaus Jensen ; Kevin
> Wolf ; Philippe Mathieu-Daudé ;
> Maxim Levitsky ; Fam Zheng ;
> Niklas Cassel ; qemu-block@nongnu.org; qemu-
> de...@nongnu.org; Alistair Francis ; Matias
> Bjorling 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> On Sep 28 02:33, Dmitry Fomichev wrote:
> > > -----Original Message-----
> > > From: Klaus Jensen 
> > >
> > > If it really needs to be memory mapped, then I think a hostmem-based
> > > approach similar to what Andrzej did for PMR is needed (I think that
> > > will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> > > slightly tricky to get it to work on all platforms AFAIK).
> >
> > Ok, it looks that using the HostMemoryBackendFile backend will be
> > more appropriate. This will remove the need for conditional compile.
> >
> > The mmap() portability is pretty decent across software platforms.
> > Any poor Windows user who is forced to emulate ZNS on mingw will be
> > able to do so, just without having zone state persistency. Considering
> > how specialized this stuff is in the first place, I estimate the number of users
> > affected by this "limitation" to be exactly zero.
> >
> 
> QEMU is a cross platform project - we should strive for portability.
> 
> Alienating developers that use a Windows platform and calling them out
> as "poor" is not exactly good for the zoned ecosystem.
> 

Wow. By bringing up political correctness here you are basically admitting
the fact that you have no real technical argument here. The whole Windows
issue is red herring that you are using to attack the code that is absolutely
legit, but comes from a competitor. Your initial complaint was that it
doesn't compile in mingw and that it uses "wrong" API. You have even
suggested the API to use. Now, the code uses that API and builds fine, but
now it's still not good simply because you "do not like it". It's a disgrace.

> > > But really,
> > > since we do not require memory semantics for this, then I think the
> > > abstraction is fundamentally wrong.
> > >
> >
> > Seriously, what is wrong with using mmap :) ? It is used successfully for
> > similar applications, for example -
> > https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c
> >
> 
> There is nothing fundamentally wrong with mmap. I just think it is the
> wrong abstraction here (and it limits portability for no good reason).
> For PMR there is a good reason - it requires memory semantics.
> 

We are trying to emulate NVMe controller NVRAM.  The best abstraction
for emulating NVRAM would be... NVRAM!

> > > I am, of course, blowing my own horn, since my implementation uses a
> > > portable blockdev for this.
> > >
> >
> > You are making it sound like the entire WDC series relies on this approach.
> > Actually, the persistency is introduced in the second to last patch in the
> > series and it only adds a couple of lines of code in the i/o path to mark
> > zones dirty. This is possible because of using mmap() and I find the way
> > it is done to be quite elegant, not ugly :)
> >
> 
> No, I understand that your implementation works fine without
> persistence, but persistence is key. That is why my series adds it in
> the first patch. Without persistence it is just a toy. And the QEMU
> device is not just an "NVMe-version" of null_blk.
> 
> And I don't think I ever called the use of mmap ugly. I called out the
> physical memory API shenanigans as a hack.
> 
> > > Another issue is the complete lack of endian conversions. Does it
> > > matter? It depends. Will anyone ever use this on a big endian host and
> > > move the meta data backing file to a little endian host? Probably not.
> > > So does it really matter? Probably not, but it is cutting corners.
> > >
> 
> After I had replied this, I considered a follow-up, because there are
> probably QEMU developers that would call me out on this.
> 
> This definitely DOES matter to QEMU.
> 
> >
> > Great point on endianness! Naturally, all file backed values are stored in
> > their native endianness. This way, there is no extra overhead on big endian
> > hardware architectures. Portability concerns can be easily addressed by
> > storing metadata endianness as a byte flag in its header. Then, during
> > initialization, the metadata validation code can detect the possible
> > discrepancy in endianness and automatically convert the metadata to the
> > endianness of the host. This part is out of scope of this series, but I 
> > would
> > be able to contribute such a solution as an enhancement in the future.
> >
> 
> It is not out of scope. I don't see why we should merge something that
> is arguably buggy.

Again, wow! Now you turned around and arbitrarily elevated this issue from
moderate ("Does it matter?, cutting corners") to severe ("buggy"). Likely
because v5 of 

Re: qcow2 merge_cow() question

2020-09-29 Thread Alberto Garcia
On Fri 21 Aug 2020 03:42:29 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
>>> What are these ifs for?
>>>
>>> /* The data (middle) region must be immediately after the
>>>  * start region */
>>> if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
>>> continue;
>>> }
>>> 
>>> 
>>> /* The end region must be immediately after the data (middle)
>>>  * region */
>>> if (m->offset + m->cow_end.offset != offset + bytes) {
>>> continue;
>>> }
>>>
>>> How is it possible that data doesn't immediately follow start cow
>>> region or end cow region doesn't immediately follow data region?
>> 
>> They are sanity checks. They maybe cannot happen in practice and in
>> that case I suppose they should be replaced with assertions but this
>> should be checked carefully. If I remember correctly I was wary of
>> overlooking a case where this could happen.
>> 
>> In particular, that function receives only one data region but a list
>> of QCowL2Meta objects. I think you can get more than one QCowL2Meta
>> if the same request involves a mix of copied and newly allocated
>> clusters, but that shouldn't be a problem either.
>
> OK, thanks. So, intuitively it shouldn't happen, but there should be
> some careful investigation to change them to assertions.

I was having a look at this and here's a simple example of how this can
happen:

qemu-img create -f qcow2 -o cluster_size=1k img.qcow2 1M
qemu-io -c 'write 0 3k' img.qcow2
qemu-io -c 'discard 0 1k' img.qcow2
qemu-io -c 'discard 2k 1k' img.qcow2
qemu-io -c 'write 512 2k' img.qcow2

The last write request can be performed with one single write operation
but it needs to allocate clusters #0 and #2.

This means that merge_cow() is called with offset=512, bytes=2k and two
QCowL2Meta structures:

  - The first one with cow_start={0, 512} and cow_end={1k, 0} 
  - The second one with cow_start={2k, 0} and cow_end={2560, 512}

In theory it should be possible to combine both into one that has
cow_start={0, 512} and cow_end={2560, 512}, but I don't think this
situation happens very often so I wouldn't go that way.
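
A rough sketch of the layout in that example (1k clusters, guest
offsets; cluster #1 is still allocated from the first write):

      cluster #0           cluster #1           cluster #2
    |----+----------|    |--------------|    |-------+-----|
    0   512         1k   1k            2k    2k    2560    3k
         <-------------- write 512..2560 ---------->
    cow_start={0,512}                        cow_end={2560,512}
    (head of newly                           (tail of newly
     allocated #0)                            allocated #2)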

In any case, the checks have to remain and they cannot be turned into
assertions.

Berto



Re: [PATCH v2] hw/ide: check null block before _cancel_dma_sync

2020-09-29 Thread Li Qiang
P J P wrote on Tue, Sep 29, 2020 at 2:22 PM:
>
>   Hello Li,
>
> +-- On Fri, 18 Sep 2020, Li Qiang wrote --+
> | P J P wrote on Fri, Sep 18, 2020 at 6:26 PM:
> | > +-- On Fri, 18 Sep 2020, Li Qiang wrote --+
> | > | Update v2: use an assert() call
> | > |   
> ->https://lists.nongnu.org/archive/html/qemu-devel/2020-08/msg08336.html
> |
> | In 'ide_ioport_write' the guest can set 'bus->unit' to 0 or 1 by issue
> | 'ATA_IOPORT_WR_DEVICE_HEAD'. So this case the guest can set the active ifs.
> | If the guest set this to 1.
> |
> | Then in 'idebus_active_if' will return 'IDEBus.ifs[1]' and thus the 's->blk'
> | will be NULL.
>
> Right, guest does select the drive via
>
>   portio_write
>->ide_ioport_write
>   case ATA_IOPORT_WR_DEVICE_HEAD:
>   /* FIXME: HOB readback uses bit 7 */
>   bus->ifs[0].select = (val & ~0x10) | 0xa0;
>   bus->ifs[1].select = (val | 0x10) | 0xa0;
>   /* select drive */
>   bus->unit = (val >> 4) & 1; <== set bus->unit=0x1
>   break;
>
>
> | So from your (Peter's) saying, we need to check the value in
> | 'ATA_IOPORT_WR_DEVICE_HEAD' handler. To say if the guest
> | set a valid 'bus->unit'. This can also work I think.
>
> Yes, with the following fix, an assert(3) in ide_cancel_dma_sync fails.
>
> ===
> diff --git a/hw/ide/core.c b/hw/ide/core.c
> index f76f7e5234..cb55cc8b0f 100644
> --- a/hw/ide/core.c
> +++ b/hw/ide/core.c
> @@ -1300,7 +1300,11 @@ void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val)
>  bus->ifs[0].select = (val & ~0x10) | 0xa0;
>  bus->ifs[1].select = (val | 0x10) | 0xa0;
>  /* select drive */
> +uint8_t bu = bus->unit;
>  bus->unit = (val >> 4) & 1;
> +if (!bus->ifs[bus->unit].blk) {
> +bus->unit = bu;
> +}
>  break;
>  default:
>
> qemu-system-x86_64: ../hw/ide/core.c:724: ide_cancel_dma_sync: Assertion `s->bus->dma->aiocb == NULL' failed.
> Aborted (core dumped)

This is what I am worried about: 'ide_ioport_write' sets 'bus->unit',
but it also changes 'bus->ifs[0].select'.
There may also be other corner cases that cause inconsistent state.
If we choose this method, we need to dig deeper into the AHCI spec to
know how things really work.


> ===
>
> | As 'ide_exec_cmd' and other functions in 'hw/ide/core.c' check
> | 's->blk' directly, I think checking it in 'ide_cancel_dma_sync' is
> | enough, and it is also more consistent with the other functions.
> | 'ide_cancel_dma_sync' is also called by 'cmd_device_reset' which is one of
> | the 'ide_cmd_table' handler.
>
>   Yes, I'm okay with either approach. Earlier patch v1 checks 's->blk' in
> ide_cancel_dma_sync().

I prefer checking 's->blk' at the beginning of 'ide_cancel_dma_sync',
e.g. the sketch below. It is slightly different from your earlier patch.
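To make the comparison concrete, a minimal sketch of that approach
(illustrative only, not the exact v1 patch):

    void ide_cancel_dma_sync(IDEState *s)
    {
        /* Bail out early when the selected drive has no block backend
         * attached (an empty IDE slot), before touching s->bus->dma. */
        if (!s->blk) {
            return;
        }

        /* ... existing drain and cancel logic follows here ... */
    }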

Anyway, let the maintainer make the choice.

Thanks,
Li Qiang

>
> | BTW, where is the Peter's email saying this, just want to learn something,
> | :).
>
>   -> https://lists.nongnu.org/archive/html/qemu-devel/2020-09/msg05820.html
>
> Thank you.
> --
> Prasad J Pandit / Red Hat Product Security Team
> 8685 545E B54C 486B C6EB 271E E285 8B5A F050 DE8D



[PATCH] job: delete job_{lock, unlock} functions and replace them with lock guard

2020-09-29 Thread Elena Afanasova
Signed-off-by: Elena Afanasova 
---
 job.c | 46 +-
 1 file changed, 17 insertions(+), 29 deletions(-)

diff --git a/job.c b/job.c
index 8fecf38960..89ceb53434 100644
--- a/job.c
+++ b/job.c
@@ -79,16 +79,6 @@ struct JobTxn {
  * job_enter. */
 static QemuMutex job_mutex;
 
-static void job_lock(void)
-{
-    qemu_mutex_lock(&job_mutex);
-}
-
-static void job_unlock(void)
-{
-    qemu_mutex_unlock(&job_mutex);
-}
-
 static void __attribute__((__constructor__)) job_init(void)
 {
     qemu_mutex_init(&job_mutex);
@@ -437,21 +427,19 @@ void job_enter_cond(Job *job, bool(*fn)(Job *job))
         return;
     }
 
-    job_lock();
-    if (job->busy) {
-        job_unlock();
-        return;
-    }
+    WITH_QEMU_LOCK_GUARD(&job_mutex) {
+        if (job->busy) {
+            return;
+        }
 
-    if (fn && !fn(job)) {
-        job_unlock();
-        return;
-    }
+        if (fn && !fn(job)) {
+            return;
+        }
 
-    assert(!job->deferred_to_main_loop);
-    timer_del(&job->sleep_timer);
-    job->busy = true;
-    job_unlock();
+        assert(!job->deferred_to_main_loop);
+        timer_del(&job->sleep_timer);
+        job->busy = true;
+    }
 aio_co_enter(job->aio_context, job->co);
 }
 
@@ -468,13 +456,13 @@ void job_enter(Job *job)
  * called explicitly. */
 static void coroutine_fn job_do_yield(Job *job, uint64_t ns)
 {
-    job_lock();
-    if (ns != -1) {
-        timer_mod(&job->sleep_timer, ns);
+    WITH_QEMU_LOCK_GUARD(&job_mutex) {
+        if (ns != -1) {
+            timer_mod(&job->sleep_timer, ns);
+        }
+        job->busy = false;
+        job_event_idle(job);
     }
-    job->busy = false;
-    job_event_idle(job);
-    job_unlock();
     qemu_coroutine_yield();
 
 /* Set by job_enter_cond() before re-entering the coroutine.  */
-- 
2.25.1
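
For readers unfamiliar with the macro, a small self-contained
illustration of the pattern this patch relies on; the Worker type and
function names are made up:

    #include "qemu/osdep.h"
    #include "qemu/lockable.h"

    typedef struct Worker {
        bool busy;
    } Worker;

    static QemuMutex worker_mutex;

    static void worker_kick(Worker *w)
    {
        /* placeholder for the real wakeup work */
    }

    static void worker_wake(Worker *w)
    {
        WITH_QEMU_LOCK_GUARD(&worker_mutex) {
            if (w->busy) {
                return;    /* the guard releases worker_mutex here too */
            }
            w->busy = true;
        }
        /* worker_mutex is already released at this point */
        worker_kick(w);
    }

The early return inside the guarded block is what removes the manual
job_unlock() calls on every exit path in the patch above.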




Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Damien Le Moal
On 2020/09/29 19:46, Klaus Jensen wrote:
> On Sep 28 22:54, Damien Le Moal wrote:
>> On 2020/09/29 6:25, Keith Busch wrote:
>>> On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
 On Sep 28 02:33, Dmitry Fomichev wrote:
> You are making it sound like the entire WDC series relies on this 
> approach.
> Actually, the persistency is introduced in the second to last patch in the
> series and it only adds a couple of lines of code in the i/o path to mark
> zones dirty. This is possible because of using mmap() and I find the way
> it is done to be quite elegant, not ugly :)
>

 No, I understand that your implementation works fine without
 persistance, but persistance is key. That is why my series adds it in
 the first patch. Without persistence it is just a toy. And the QEMU
 device is not just an "NVMe-version" of null_blk.
>>>
>>> I really think we should be a bit more cautious of committing to an
>>> on-disk format for the persistent state. Both this and Klaus' persistent
>>> state feels a bit ad-hoc, and with all the other knobs provided, it
>>> looks too easy to have out-of-sync states, or just not being able to
>>> boot at all if a qemu versions have different on-disk formats.
>>>
>>> Is anyone really considering zone emulation for production level stuff
>>> anyway? I can't imagine a real scenario where you'd want put yourself
>>> through that: you are just giving yourself all the downsides of a zoned
>>> block device and none of the benefits. AFAIK, this is provided as a
>>> development vehicle, closer to a "toy".
>>>
>>> I think we should consider trimming this down to a more minimal set that
>>> we *do* agree on and commit for inclusion ASAP. We can iterate all the
>>> bells & whistles and flush out the meta data's data marshalling scheme
>>> for persistence later.
>>
>> +1 on this. Removing the persistence also removes the debate on endianness.
>> With that out of the way, it should be straightforward to get agreement on
>> a series that can be merged quickly to get developers started with testing
>> ZNS software with QEMU. That is the most important goal here. 5.9 is around
>> the corner, we need something for people to get started with ZNS quickly.
>>
> 
> Wait. What. No. Stop!
> 
> It is unmistakably clear that you are invalidating my arguments about
> portability and endianness issues by suggesting that we just remove
> persistent state and deal with it later, but persistence is the killer
> feature that sets the QEMU emulated device apart from other emulation
> options. It is not about using emulation in production (because yeah,
> why would you?), but persistence is what makes it possible to develop
> and test "zoned FTLs" or something that requires recovery at power up.
> This is what allows testing of how your host software deals with opened
> zones being transitioned to FULL on power up and the persistent tracking
> of LBA allocation (in my series) can be used to properly test error
> recovery if you lost state in the app.

I am not invalidating anything. I am in violent agreement with you about the
usefulness of persistence. My point was that I agree with Keith: let's first get
the base emulation in and improve on top of it. And the base emulation does not
need to include persistence and endianness of the saved zone meta for now. The
result of this would still be super useful to have in stable.

Then let's add persistence and others bells and whistles on top (see below).

> Please, work with me on this instead of just removing such an essential
> feature. Since persistence seems to be the only thing we are really
> discussing, we should have plenty of time until the soft-freeze to come
> up with a proper solution on that.
> 
> I agree that my version had a format that was pretty ad-hoc and that
> won't fly - it needs magic and version capabilities like in Dmitry's
> series, which incidentially looks a lot like what we did in the
> OpenChannel implementation, so I agree with the strategy.
> 
> ZNS-wise, the only thing my implementation stores is the zone
> descriptors (in spec-native little-endian format) and the zone
> descriptor extensions. So there are no endian issues with those. The
> allocation tracking bitmap is always stored in little endian, but
> converted to big-endian if running on a big-endian host.
> 
> Let me just conjure something up.
> 
> #define NVME_PSTATE_MAGIC ...
> #define NVME_PSTATE_V1 1
> 
> typedef struct NvmePstateHeader {
> uint32_t magic;
> uint32_t version;
> 
> uint64_t blk_len;
> 
> uint8_t  lbads;
> uint8_t  iocs;
> 
> uint8_t  rsvd18[3054];
> 
> struct {
> uint64_t zsze;
> uint8_t  zdes;
> } QEMU_PACKED zns;
> 
> uint8_t  rsvd3089[1007];
> } QEMU_PACKED NvmePstateHeader;
> 
> With such a header we have all we need. We can bail out if any
> parameters do not match and similar 
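
Something along those lines would fall out of the header at load time;
a sketch, with the validation function and error handling invented, and
assuming the header fields are stored little-endian:

    static int nvme_pstate_validate(const NvmePstateHeader *hdr,
                                    uint8_t lbads, uint8_t iocs)
    {
        if (le32_to_cpu(hdr->magic) != NVME_PSTATE_MAGIC) {
            return -EINVAL;   /* not a pstate blob */
        }
        if (le32_to_cpu(hdr->version) > NVME_PSTATE_V1) {
            return -ENOTSUP;  /* written by a newer QEMU */
        }
        if (hdr->lbads != lbads || hdr->iocs != iocs) {
            return -EINVAL;   /* device parameters changed under us */
        }
        return 0;
    }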

Re: [PATCH v3 07/18] hw/block/nvme: add support for the get log page command

2020-09-29 Thread Peter Maydell
On Mon, 6 Jul 2020 at 07:15, Klaus Jensen  wrote:
>
> From: Klaus Jensen 
>
> Add support for the Get Log Page command and basic implementations of
> the mandatory Error Information, SMART / Health Information and Firmware
> Slot Information log pages.
>
> In violation of the specification, the SMART / Health Information log
> page does not persist information over the lifetime of the controller
> because the device has no place to store such persistent state.
>
> Note that the LPA field in the Identify Controller data structure
> intentionally has bit 0 cleared because there is no namespace specific
> information in the SMART / Health information log page.
>
> Required for compliance with NVMe revision 1.3d. See NVM Express 1.3d,
> Section 5.14 ("Get Log Page command").

Hi; Coverity reports a potential issue in this code
(CID 1432413):

> +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> +uint64_t off, NvmeRequest *req)
> +{
> +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
> +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
> +uint32_t nsid = le32_to_cpu(cmd->nsid);
> +
> +uint32_t trans_len;
> +time_t current_ms;
> +uint64_t units_read = 0, units_written = 0;
> +uint64_t read_commands = 0, write_commands = 0;
> +NvmeSmartLog smart;
> +BlockAcctStats *s;
> +
> +    if (nsid && nsid != 0xffffffff) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +s = blk_get_stats(n->conf.blk);
> +
> +units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> +units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> +read_commands = s->nr_ops[BLOCK_ACCT_READ];
> +write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> +
> +if (off > sizeof(smart)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}

Here we check for off > sizeof(smart), which means that we allow
off == sizeof(smart)...

> +
> +trans_len = MIN(sizeof(smart) - off, buf_len);

> +    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> +                             prp2);

...in which case the pointer we pass to nvme_dma_read_prp() will
be off the end of the 'smart' object.

Now we are passing 0 as the trans_len, so I *think* this function
will not actually read the buffer (Coverity is not smart
enough to see this); so I could just close the Coverity issue as
a false-positive. But maybe there is a clearer-to-humans as well
as clearer-to-Coverity way to write this. What do you think ?
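
One possible shape, sketched below: it keeps off == sizeof(smart) legal
but returns before a pointer past the end of 'smart' is ever formed
(assuming NVME_SUCCESS is the zero status):

    if (off > sizeof(smart)) {
        return NVME_INVALID_FIELD | NVME_DNR;
    }

    trans_len = MIN(sizeof(smart) - off, buf_len);
    if (!trans_len) {
        return NVME_SUCCESS;
    }

    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len,
                             prp1, prp2);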

> +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> + uint64_t off, NvmeRequest *req)
> +{
> +uint32_t trans_len;
> +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
> +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
> +NvmeFwSlotInfoLog fw_log = {
> +.afi = 0x1,
> +};
> +
> +    strpadcpy((char *)&fw_log.frs1, sizeof(fw_log.frs1), "1.0", ' ');
> +
> +if (off > sizeof(fw_log)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +trans_len = MIN(sizeof(fw_log) - off, buf_len);
> +
> +    return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> +                             prp2);

Coverity warns about the same structure here (CID 1432411).

thanks
-- PMM



Re: [PATCH v2 4/4] block/export: add iothread and fixed-iothread options

2020-09-29 Thread Eric Blake

On 9/29/20 7:55 AM, Stefan Hajnoczi wrote:

Make it possible to specify the iothread where the export will run. By
default the block node can be moved to other AioContexts later and the
export will follow. The fixed-iothread option forces strict behavior
that prevents changing AioContext while the export is active. See the
QAPI docs for details.

Signed-off-by: Stefan Hajnoczi 
---
Note the x-blockdev-set-iothread QMP command can be used to do the same,
but not from the command-line. And it requires sending an additional
command.

In the long run vhost-user-blk will support per-virtqueue iothread
mappings. But for now a single iothread makes sense and most other
transports will just use one iothread anyway.
---
  qapi/block-export.json   | 11 ++
  block/export/export.c| 31 +++-
  block/export/vhost-user-blk-server.c |  5 -
  nbd/server.c |  2 --
  4 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index 87ac5117cd..e2cb21f5f1 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -219,11 +219,22 @@
  #export before completion is signalled. (since: 5.2;
  #default: false)
  #
+# @iothread: The name of the iothread object where the export will run. The
+#default is to use the thread currently associated with the #


Stray #


+#block node. (since: 5.2)
+#
+# @fixed-iothread: True prevents the block node from being moved to another
+#  thread while the export is active. If true and @iothread is
+#  given, export creation fails if the block node cannot be
+#  moved to the iothread. The default is false.
+#


Missing a '(since 5.2)' tag.  (Hmm, we're inconsistent on whether it is 
'since 5.2' or 'since: 5.2' inside () parentheticals; Markus, is that 
something we should be cleaning up as part of the conversion to rST?)



@@ -63,10 +64,11 @@ static const BlockExportDriver *blk_exp_find_driver(BlockExportType type)
 
 BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 {
+    bool fixed_iothread = export->has_fixed_iothread && export->fixed_iothread;


Technically, our QAPI code guarantees that export->fixed_iothread is 
false if export->has_fixed_iothread is false.  And someday I'd love to 
let QAPI express default values for bools so that we don't need a 
has_FOO field when a default has been expressed.  But neither of those 
points affect this patch; what you have is correct even if it is verbose.


Otherwise looks reasonable.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 3/4] block: move block exports to libblockdev

2020-09-29 Thread Eric Blake

On 9/29/20 7:55 AM, Stefan Hajnoczi wrote:

Block exports are used by softmmu, qemu-storage-daemon, and qemu-nbd.
They are not used by other programs and are not otherwise needed in
libblock.

Undo the recent move of blockdev-nbd.c from blockdev_ss into block_ss.
Since bdrv_close_all() (libblock) calls blk_exp_close_all()
(libblockdev), a stub function is required.

Make qemu-ndb.c use signal handling utility functions instead of


nbd


duplicating the code. This helps because os-posix.c is in libblockdev
and it depends on a qemu_system_killed() symbol that qemu-nbd.c lacks.
Once we use the signal handling utility functions we also end up
providing the necessary symbol.

Signed-off-by: Stefan Hajnoczi 
---


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 3/4] block: move block exports to libblockdev

2020-09-29 Thread Paolo Bonzini
On 29/09/20 14:55, Stefan Hajnoczi wrote:
> Block exports are used by softmmu, qemu-storage-daemon, and qemu-nbd.
> They are not used by other programs and are not otherwise needed in
> libblock.
> 
> Undo the recent move of blockdev-nbd.c from blockdev_ss into block_ss.
> Since bdrv_close_all() (libblock) calls blk_exp_close_all()
> (libblockdev), a stub function is required.
> 
> Make qemu-ndb.c use signal handling utility functions instead of
> duplicating the code. This helps because os-posix.c is in libblockdev
> and it depends on a qemu_system_killed() symbol that qemu-nbd.c lacks.
> Once we use the signal handling utility functions we also end up
> providing the necessary symbol.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  qemu-nbd.c| 21 +
>  stubs/blk-exp-close-all.c |  7 +++
>  block/export/meson.build  |  4 ++--
>  meson.build   |  4 ++--
>  nbd/meson.build   |  2 ++
>  stubs/meson.build |  1 +
>  6 files changed, 23 insertions(+), 16 deletions(-)
>  create mode 100644 stubs/blk-exp-close-all.c
> 
> diff --git a/qemu-nbd.c b/qemu-nbd.c
> index 6d7ac7490f..06774ca615 100644
> --- a/qemu-nbd.c
> +++ b/qemu-nbd.c
> @@ -25,6 +25,7 @@
>  #include "qapi/error.h"
>  #include "qemu/cutils.h"
>  #include "sysemu/block-backend.h"
> +#include "sysemu/runstate.h" /* for qemu_system_killed() prototype */
>  #include "block/block_int.h"
>  #include "block/nbd.h"
>  #include "qemu/main-loop.h"
> @@ -155,7 +156,11 @@ QEMU_COPYRIGHT "\n"
>  }
>  
>  #if HAVE_NBD_DEVICE
> -static void termsig_handler(int signum)
> +/*
> + * The client thread uses SIGTERM to interrupt the server.  A signal
> + * handler ensures that "qemu-nbd -v -c" exits with a nice status code.
> + */
> +void qemu_system_killed(int signum, pid_t pid)
>  {
>      atomic_cmpxchg(&state, RUNNING, TERMINATE);
>  qemu_notify_event();
> @@ -581,20 +586,12 @@ int main(int argc, char **argv)
>  const char *pid_file_name = NULL;
>  BlockExportOptions *export_opts;
>  
> +os_setup_early_signal_handling();
> +
>  #if HAVE_NBD_DEVICE
> -/* The client thread uses SIGTERM to interrupt the server.  A signal
> - * handler ensures that "qemu-nbd -v -c" exits with a nice status code.
> - */
> -    struct sigaction sa_sigterm;
> -    memset(&sa_sigterm, 0, sizeof(sa_sigterm));
> -    sa_sigterm.sa_handler = termsig_handler;
> -    sigaction(SIGTERM, &sa_sigterm, NULL);
> +os_setup_signal_handling();
>  #endif /* HAVE_NBD_DEVICE */
>  
> -#ifdef CONFIG_POSIX
> -signal(SIGPIPE, SIG_IGN);
> -#endif
> -
>  socket_init();
>  error_init(argv[0]);
>  module_call_init(MODULE_INIT_TRACE);
> diff --git a/stubs/blk-exp-close-all.c b/stubs/blk-exp-close-all.c
> new file mode 100644
> index 00..1c71316763
> --- /dev/null
> +++ b/stubs/blk-exp-close-all.c
> @@ -0,0 +1,7 @@
> +#include "qemu/osdep.h"
> +#include "block/export.h"
> +
> +/* Only used in programs that support block exports (libblockdev.fa) */
> +void blk_exp_close_all(void)
> +{
> +}
> diff --git a/block/export/meson.build b/block/export/meson.build
> index 469a7aa0f5..a2772a0dce 100644
> --- a/block/export/meson.build
> +++ b/block/export/meson.build
> @@ -1,2 +1,2 @@
> -block_ss.add(files('export.c'))
> -block_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user-blk-server.c'))
> +blockdev_ss.add(files('export.c'))
> +blockdev_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user-blk-server.c'))
> diff --git a/meson.build b/meson.build
> index 18d689b423..0e9528adab 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -835,7 +835,6 @@ subdir('dump')
>  
>  block_ss.add(files(
>'block.c',
> -  'blockdev-nbd.c',
>'blockjob.c',
>'job.c',
>'qemu-io-cmds.c',
> @@ -848,6 +847,7 @@ subdir('block')
>  
>  blockdev_ss.add(files(
>'blockdev.c',
> +  'blockdev-nbd.c',
>'iothread.c',
>'job-qmp.c',
>  ))
> @@ -1171,7 +1171,7 @@ if have_tools
>qemu_io = executable('qemu-io', files('qemu-io.c'),
>   dependencies: [block, qemuutil], install: true)
>qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
> -   dependencies: [block, qemuutil], install: true)
> +   dependencies: [blockdev, qemuutil], install: true)
>  
>subdir('storage-daemon')
>subdir('contrib/rdmacm-mux')
> diff --git a/nbd/meson.build b/nbd/meson.build
> index 0c00a776d3..2baaa36948 100644
> --- a/nbd/meson.build
> +++ b/nbd/meson.build
> @@ -1,5 +1,7 @@
>  block_ss.add(files(
>'client.c',
>'common.c',
> +))
> +blockdev_ss.add(files(
>'server.c',
>  ))
> diff --git a/stubs/meson.build b/stubs/meson.build
> index e0b322bc28..0fdcf93c09 100644
> --- a/stubs/meson.build
> +++ b/stubs/meson.build
> @@ -1,6 +1,7 @@
>  stub_ss.add(files('arch_type.c'))
>  stub_ss.add(files('bdrv-next-monitor-owned.c'))
>  stub_ss.add(files('blk-commit-all.c'))
> +stub_ss.add(files('blk-exp-close-all.c'))
>  stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
>  

[PATCH v2 3/4] block: move block exports to libblockdev

2020-09-29 Thread Stefan Hajnoczi
Block exports are used by softmmu, qemu-storage-daemon, and qemu-nbd.
They are not used by other programs and are not otherwise needed in
libblock.

Undo the recent move of blockdev-nbd.c from blockdev_ss into block_ss.
Since bdrv_close_all() (libblock) calls blk_exp_close_all()
(libblockdev), a stub function is required.

Make qemu-ndb.c use signal handling utility functions instead of
duplicating the code. This helps because os-posix.c is in libblockdev
and it depends on a qemu_system_killed() symbol that qemu-nbd.c lacks.
Once we use the signal handling utility functions we also end up
providing the necessary symbol.

Signed-off-by: Stefan Hajnoczi 
---
 qemu-nbd.c| 21 +
 stubs/blk-exp-close-all.c |  7 +++
 block/export/meson.build  |  4 ++--
 meson.build   |  4 ++--
 nbd/meson.build   |  2 ++
 stubs/meson.build |  1 +
 6 files changed, 23 insertions(+), 16 deletions(-)
 create mode 100644 stubs/blk-exp-close-all.c

diff --git a/qemu-nbd.c b/qemu-nbd.c
index 6d7ac7490f..06774ca615 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -25,6 +25,7 @@
 #include "qapi/error.h"
 #include "qemu/cutils.h"
 #include "sysemu/block-backend.h"
+#include "sysemu/runstate.h" /* for qemu_system_killed() prototype */
 #include "block/block_int.h"
 #include "block/nbd.h"
 #include "qemu/main-loop.h"
@@ -155,7 +156,11 @@ QEMU_COPYRIGHT "\n"
 }
 
 #if HAVE_NBD_DEVICE
-static void termsig_handler(int signum)
+/*
+ * The client thread uses SIGTERM to interrupt the server.  A signal
+ * handler ensures that "qemu-nbd -v -c" exits with a nice status code.
+ */
+void qemu_system_killed(int signum, pid_t pid)
 {
     atomic_cmpxchg(&state, RUNNING, TERMINATE);
 qemu_notify_event();
@@ -581,20 +586,12 @@ int main(int argc, char **argv)
 const char *pid_file_name = NULL;
 BlockExportOptions *export_opts;
 
+os_setup_early_signal_handling();
+
 #if HAVE_NBD_DEVICE
-/* The client thread uses SIGTERM to interrupt the server.  A signal
- * handler ensures that "qemu-nbd -v -c" exits with a nice status code.
- */
-    struct sigaction sa_sigterm;
-    memset(&sa_sigterm, 0, sizeof(sa_sigterm));
-    sa_sigterm.sa_handler = termsig_handler;
-    sigaction(SIGTERM, &sa_sigterm, NULL);
+os_setup_signal_handling();
 #endif /* HAVE_NBD_DEVICE */
 
-#ifdef CONFIG_POSIX
-signal(SIGPIPE, SIG_IGN);
-#endif
-
 socket_init();
 error_init(argv[0]);
 module_call_init(MODULE_INIT_TRACE);
diff --git a/stubs/blk-exp-close-all.c b/stubs/blk-exp-close-all.c
new file mode 100644
index 0000000000..1c71316763
--- /dev/null
+++ b/stubs/blk-exp-close-all.c
@@ -0,0 +1,7 @@
+#include "qemu/osdep.h"
+#include "block/export.h"
+
+/* Only used in programs that support block exports (libblockdev.fa) */
+void blk_exp_close_all(void)
+{
+}
diff --git a/block/export/meson.build b/block/export/meson.build
index 469a7aa0f5..a2772a0dce 100644
--- a/block/export/meson.build
+++ b/block/export/meson.build
@@ -1,2 +1,2 @@
-block_ss.add(files('export.c'))
-block_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user-blk-server.c'))
+blockdev_ss.add(files('export.c'))
+blockdev_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user-blk-server.c'))
diff --git a/meson.build b/meson.build
index 18d689b423..0e9528adab 100644
--- a/meson.build
+++ b/meson.build
@@ -835,7 +835,6 @@ subdir('dump')
 
 block_ss.add(files(
   'block.c',
-  'blockdev-nbd.c',
   'blockjob.c',
   'job.c',
   'qemu-io-cmds.c',
@@ -848,6 +847,7 @@ subdir('block')
 
 blockdev_ss.add(files(
   'blockdev.c',
+  'blockdev-nbd.c',
   'iothread.c',
   'job-qmp.c',
 ))
@@ -1171,7 +1171,7 @@ if have_tools
   qemu_io = executable('qemu-io', files('qemu-io.c'),
  dependencies: [block, qemuutil], install: true)
   qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
-   dependencies: [block, qemuutil], install: true)
+   dependencies: [blockdev, qemuutil], install: true)
 
   subdir('storage-daemon')
   subdir('contrib/rdmacm-mux')
diff --git a/nbd/meson.build b/nbd/meson.build
index 0c00a776d3..2baaa36948 100644
--- a/nbd/meson.build
+++ b/nbd/meson.build
@@ -1,5 +1,7 @@
 block_ss.add(files(
   'client.c',
   'common.c',
+))
+blockdev_ss.add(files(
   'server.c',
 ))
diff --git a/stubs/meson.build b/stubs/meson.build
index e0b322bc28..0fdcf93c09 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -1,6 +1,7 @@
 stub_ss.add(files('arch_type.c'))
 stub_ss.add(files('bdrv-next-monitor-owned.c'))
 stub_ss.add(files('blk-commit-all.c'))
+stub_ss.add(files('blk-exp-close-all.c'))
 stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
 stub_ss.add(files('change-state-handler.c'))
 stub_ss.add(files('clock-warp.c'))
-- 
2.26.2



[PATCH v2 4/4] block/export: add iothread and fixed-iothread options

2020-09-29 Thread Stefan Hajnoczi
Make it possible to specify the iothread where the export will run. By
default the block node can be moved to other AioContexts later and the
export will follow. The fixed-iothread option forces strict behavior
that prevents changing AioContext while the export is active. See the
QAPI docs for details.

Signed-off-by: Stefan Hajnoczi 
---
Note the x-blockdev-set-iothread QMP command can be used to do the same,
but not from the command-line. And it requires sending an additional
command.

In the long run vhost-user-blk will support per-virtqueue iothread
mappings. But for now a single iothread makes sense and most other
transports will just use one iothread anyway.
---
 qapi/block-export.json   | 11 ++
 block/export/export.c| 31 +++-
 block/export/vhost-user-blk-server.c |  5 -
 nbd/server.c |  2 --
 4 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index 87ac5117cd..e2cb21f5f1 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -219,11 +219,22 @@
 #export before completion is signalled. (since: 5.2;
 #default: false)
 #
+# @iothread: The name of the iothread object where the export will run. The
+#default is to use the thread currently associated with the #
+#block node. (since: 5.2)
+#
+# @fixed-iothread: True prevents the block node from being moved to another
+#  thread while the export is active. If true and @iothread is
+#  given, export creation fails if the block node cannot be
+#  moved to the iothread. The default is false.
+#
 # Since: 4.2
 ##
 { 'union': 'BlockExportOptions',
   'base': { 'type': 'BlockExportType',
 'id': 'str',
+   '*fixed-iothread': 'bool',
+   '*iothread': 'str',
 'node-name': 'str',
 '*writable': 'bool',
 '*writethrough': 'bool' },
diff --git a/block/export/export.c b/block/export/export.c
index 550897e236..a5b6b02703 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -15,6 +15,7 @@
 
 #include "block/block.h"
 #include "sysemu/block-backend.h"
+#include "sysemu/iothread.h"
 #include "block/export.h"
 #include "block/nbd.h"
 #include "qapi/error.h"
@@ -63,10 +64,11 @@ static const BlockExportDriver *blk_exp_find_driver(BlockExportType type)
 
 BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 {
+bool fixed_iothread = export->has_fixed_iothread && export->fixed_iothread;
 const BlockExportDriver *drv;
 BlockExport *exp = NULL;
 BlockDriverState *bs;
-BlockBackend *blk;
+BlockBackend *blk = NULL;
 AioContext *ctx;
 uint64_t perm;
 int ret;
@@ -102,6 +104,28 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 ctx = bdrv_get_aio_context(bs);
 aio_context_acquire(ctx);
 
+if (export->has_iothread) {
+IOThread *iothread;
+AioContext *new_ctx;
+
+iothread = iothread_by_id(export->iothread);
+if (!iothread) {
+error_setg(errp, "iothread \"%s\" not found", export->iothread);
+goto fail;
+}
+
+new_ctx = iothread_get_aio_context(iothread);
+
+ret = bdrv_try_set_aio_context(bs, new_ctx, errp);
+if (ret == 0) {
+aio_context_release(ctx);
+aio_context_acquire(new_ctx);
+ctx = new_ctx;
+} else if (fixed_iothread) {
+goto fail;
+}
+}
+
 /*
  * Block exports are used for non-shared storage migration. Make sure
  * that BDRV_O_INACTIVE is cleared and the image is ready for write
@@ -116,6 +140,11 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 }
 
 blk = blk_new(ctx, perm, BLK_PERM_ALL);
+
+if (!fixed_iothread) {
+blk_set_allow_aio_context_change(blk, true);
+}
+
 ret = blk_insert_bs(blk, bs, errp);
 if (ret < 0) {
 goto fail;
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index 81072a5a46..a1c37548e1 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -323,13 +323,17 @@ static const VuDevIface vu_blk_iface = {
 static void blk_aio_attached(AioContext *ctx, void *opaque)
 {
 VuBlkExport *vexp = opaque;
+
+vexp->export.ctx = ctx;
     vhost_user_server_attach_aio_context(&vexp->vu_server, ctx);
 }
 
 static void blk_aio_detach(void *opaque)
 {
 VuBlkExport *vexp = opaque;
+
     vhost_user_server_detach_aio_context(&vexp->vu_server);
+vexp->export.ctx = NULL;
 }
 
 static void
@@ -384,7 +388,6 @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
     vu_blk_initialize_config(blk_bs(exp->blk), &vexp->blkcfg,
                              logical_block_size);
 
-blk_set_allow_aio_context_change(exp->blk, true);
 

[PATCH v2 1/4] util/vhost-user-server: use static library in meson.build

2020-09-29 Thread Stefan Hajnoczi
Don't compile contrib/libvhost-user/libvhost-user.c again. Instead build
the static library once and then reuse it throughout QEMU.

Also switch from CONFIG_LINUX to CONFIG_VHOST_USER, which is what the
vhost-user tools (vhost-user-gpu, etc) do.

Signed-off-by: Stefan Hajnoczi 
---
 block/export/export.c | 8 
 block/export/meson.build  | 2 +-
 contrib/libvhost-user/meson.build | 1 +
 meson.build   | 6 +-
 tests/qtest/meson.build   | 2 +-
 util/meson.build  | 4 +++-
 6 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/block/export/export.c b/block/export/export.c
index bd7cac241f..550897e236 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -17,17 +17,17 @@
 #include "sysemu/block-backend.h"
 #include "block/export.h"
 #include "block/nbd.h"
-#if CONFIG_LINUX
-#include "block/export/vhost-user-blk-server.h"
-#endif
 #include "qapi/error.h"
 #include "qapi/qapi-commands-block-export.h"
 #include "qapi/qapi-events-block-export.h"
 #include "qemu/id.h"
+#ifdef CONFIG_VHOST_USER
+#include "vhost-user-blk-server.h"
+#endif
 
 static const BlockExportDriver *blk_exp_drivers[] = {
     &blk_exp_nbd,
-#if CONFIG_LINUX
+#ifdef CONFIG_VHOST_USER
     &blk_exp_vhost_user_blk,
 #endif
 };
diff --git a/block/export/meson.build b/block/export/meson.build
index ef3a9576f7..469a7aa0f5 100644
--- a/block/export/meson.build
+++ b/block/export/meson.build
@@ -1,2 +1,2 @@
 block_ss.add(files('export.c'))
-block_ss.add(when: 'CONFIG_LINUX', if_true: files('vhost-user-blk-server.c', '../../contrib/libvhost-user/libvhost-user.c'))
+block_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user-blk-server.c'))
diff --git a/contrib/libvhost-user/meson.build b/contrib/libvhost-user/meson.build
index e68dd1a581..a261e7665f 100644
--- a/contrib/libvhost-user/meson.build
+++ b/contrib/libvhost-user/meson.build
@@ -1,3 +1,4 @@
 libvhost_user = static_library('vhost-user',
                                files('libvhost-user.c', 'libvhost-user-glib.c'),
                                build_by_default: false)
+vhost_user = declare_dependency(link_with: libvhost_user)
diff --git a/meson.build b/meson.build
index 4c6c7310fa..eb84b97ebb 100644
--- a/meson.build
+++ b/meson.build
@@ -788,6 +788,11 @@ trace_events_subdirs += [
   'util',
 ]
 
+vhost_user = not_found
+if 'CONFIG_VHOST_USER' in config_host
+  subdir('contrib/libvhost-user')
+endif
+
 subdir('qapi')
 subdir('qobject')
 subdir('stubs')
@@ -1169,7 +1174,6 @@ if have_tools
  install: true)
 
   if 'CONFIG_VHOST_USER' in config_host
-subdir('contrib/libvhost-user')
 subdir('contrib/vhost-user-blk')
 subdir('contrib/vhost-user-gpu')
 subdir('contrib/vhost-user-input')
diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index c72821b09a..aa8d0985e1 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -191,7 +191,7 @@ qos_test_ss.add(
 )
 qos_test_ss.add(when: 'CONFIG_VIRTFS', if_true: files('virtio-9p-test.c'))
 qos_test_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user-test.c'))
-qos_test_ss.add(when: 'CONFIG_LINUX', if_true: files('vhost-user-blk-test.c'))
+qos_test_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user-blk-test.c'))
 
 extra_qtest_deps = {
   'bios-tables-test': [io],
diff --git a/util/meson.build b/util/meson.build
index 2296e81b34..9b2a7a5de9 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -66,7 +66,9 @@ if have_block
   util_ss.add(files('main-loop.c'))
   util_ss.add(files('nvdimm-utils.c'))
   util_ss.add(files('qemu-coroutine.c', 'qemu-coroutine-lock.c', 'qemu-coroutine-io.c'))
-  util_ss.add(when: 'CONFIG_LINUX', if_true: files('vhost-user-server.c'))
+  util_ss.add(when: 'CONFIG_VHOST_USER', if_true: [
+    files('vhost-user-server.c'), vhost_user
+  ])
   util_ss.add(files('block-helpers.c'))
   util_ss.add(files('qemu-coroutine-sleep.c'))
   util_ss.add(files('qemu-co-shared-resource.c'))
-- 
2.26.2



[PATCH v2 2/4] qemu-storage-daemon: avoid compiling blockdev_ss twice

2020-09-29 Thread Stefan Hajnoczi
Introduce libblkdev.fa to avoid recompiling blockdev_ss twice.

Suggested-by: Paolo Bonzini 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Stefan Hajnoczi 
---
 meson.build| 12 ++--
 storage-daemon/meson.build |  3 +--
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/meson.build b/meson.build
index eb84b97ebb..18d689b423 100644
--- a/meson.build
+++ b/meson.build
@@ -857,7 +857,6 @@ blockdev_ss.add(files(
 blockdev_ss.add(when: 'CONFIG_POSIX', if_true: files('os-posix.c'))
 softmmu_ss.add(when: 'CONFIG_WIN32', if_true: [files('os-win32.c')])
 
-softmmu_ss.add_all(blockdev_ss)
 softmmu_ss.add(files(
   'bootdevice.c',
   'dma-helpers.c',
@@ -952,6 +951,15 @@ block = declare_dependency(link_whole: [libblock],
link_args: '@block.syms',
dependencies: [crypto, io])
 
+blockdev_ss = blockdev_ss.apply(config_host, strict: false)
+libblockdev = static_library('blockdev', blockdev_ss.sources() + genh,
+ dependencies: blockdev_ss.dependencies(),
+ name_suffix: 'fa',
+ build_by_default: false)
+
+blockdev = declare_dependency(link_whole: [libblockdev],
+  dependencies: [block])
+
 qmp_ss = qmp_ss.apply(config_host, strict: false)
 libqmp = static_library('qmp', qmp_ss.sources() + genh,
 dependencies: qmp_ss.dependencies(),
@@ -968,7 +976,7 @@ foreach m : block_mods + softmmu_mods
 install_dir: config_host['qemu_moddir'])
 endforeach
 
-softmmu_ss.add(authz, block, chardev, crypto, io, qmp)
+softmmu_ss.add(authz, blockdev, chardev, crypto, io, qmp)
 common_ss.add(qom, qemuutil)
 
 common_ss.add_all(when: 'CONFIG_SOFTMMU', if_true: [softmmu_ss])
diff --git a/storage-daemon/meson.build b/storage-daemon/meson.build
index 0409acc3f5..c5adce81c3 100644
--- a/storage-daemon/meson.build
+++ b/storage-daemon/meson.build
@@ -1,7 +1,6 @@
 qsd_ss = ss.source_set()
 qsd_ss.add(files('qemu-storage-daemon.c'))
-qsd_ss.add(block, chardev, qmp, qom, qemuutil)
-qsd_ss.add_all(blockdev_ss)
+qsd_ss.add(blockdev, chardev, qmp, qom, qemuutil)
 
 subdir('qapi')
 
-- 
2.26.2



[PATCH v2 0/4] block/export: add BlockExportOptions->iothread member

2020-09-29 Thread Stefan Hajnoczi
v2:
 * Add fixed-iothread option to set AioContext change policy [Kevin]
 * Use os-posix.c signal handling utilities in qemu-nbd.c [Paolo]

This series adjusts the build system and then adds a
BlockExportOptions->iothread member so that it is possible to set the iothread
for an export.
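
For illustration, an invocation could look like this (a sketch: the
file names, IDs, and NBD server setup are made up, not taken from the
series):

  $ qemu-storage-daemon \
      --object iothread,id=iothread0 \
      --blockdev file,node-name=disk0,filename=disk.img \
      --nbd-server addr.type=unix,addr.path=/tmp/nbd.sock \
      --export nbd,id=export0,node-name=disk0,iothread=iothread0,fixed-iothread=on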

Based-on: 20200924151549.913737-1-stefa...@redhat.com ("[PATCH v2 00/13] 
block/export: convert vhost-user-blk-server to block exports API")

Stefan Hajnoczi (4):
  util/vhost-user-server: use static library in meson.build
  qemu-storage-daemon: avoid compiling blockdev_ss twice
  block: move block exports to libblockdev
  block/export: add iothread and fixed-iothread options

 qapi/block-export.json   | 11 
 block/export/export.c| 39 
 block/export/vhost-user-blk-server.c |  5 +++-
 nbd/server.c |  2 --
 qemu-nbd.c   | 21 +++
 stubs/blk-exp-close-all.c|  7 +
 block/export/meson.build |  4 +--
 contrib/libvhost-user/meson.build|  1 +
 meson.build  | 22 
 nbd/meson.build  |  2 ++
 storage-daemon/meson.build   |  3 +--
 stubs/meson.build|  1 +
 tests/qtest/meson.build  |  2 +-
 util/meson.build |  4 ++-
 14 files changed, 93 insertions(+), 31 deletions(-)
 create mode 100644 stubs/blk-exp-close-all.c

-- 
2.26.2



[PATCH v10 7/9] stream: skip filters when writing backing file name to QCOW2 header

2020-09-29 Thread Andrey Shinkevich via
Avoid writing a filter JSON-name to the QCOW2 image header when the
backing file is changed after the block-stream job.
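
For example (made-up chain): with a COR filter inserted above the base,
the backing name could otherwise be recorded as a JSON pseudo-filename
such as

  json:{"driver": "copy-on-read", "file": {"driver": "qcow2", ...}}

whereas with bdrv_skip_filters() the name written to the header is the
plain filename of the first non-filter node, e.g. base.qcow2.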

Signed-off-by: Andrey Shinkevich 
---
 block/stream.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index e0540ee..b0719e9 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -65,6 +65,7 @@ static int stream_prepare(Job *job)
 BlockDriverState *bs = blk_bs(bjob->blk);
 BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
 BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base);
+BlockDriverState *base_metadata = bdrv_skip_filters(base);
 Error *local_err = NULL;
 int ret = 0;
 
@@ -73,10 +74,10 @@ static int stream_prepare(Job *job)
 
 if (bdrv_cow_child(unfiltered_bs)) {
 const char *base_id = NULL, *base_fmt = NULL;
-if (base) {
-base_id = s->backing_file_str;
-if (base->drv) {
-base_fmt = base->drv->format_name;
+if (base_metadata) {
+base_id = base_metadata->filename;
+if (base_metadata->drv) {
+base_fmt = base_metadata->drv->format_name;
 }
 }
     bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
-- 
1.8.3.1




[PATCH v10 9/9] block: apply COR-filter to block-stream jobs

2020-09-29 Thread Andrey Shinkevich via
This patch completes the series with the COR-filter insertion for
block-stream operations. Adding the filter makes it possible for copied
regions to be discarded in backing files during the block-stream job,
which reduces the disk space overhead.
The COR-filter insertion requires changes in the iotests case
245:test_block_stream_4, which reopens the backing chain during a
block-stream job. There are changes in iotests #030 as well.
The iotests case 030:test_stream_parallel was deleted due to multiple
conflicts between the concurrent job operations over the same backing
chain. The base backing node for one job is the top node for another
job, and it may change when the filter node is inserted into the
backing chain while both jobs are running. Another issue is that parts
of the backing chain are frozen by the running job and may not be
changed by the concurrent job when needed. The concept of parallel
jobs sharing nodes of a common backing chain is no longer considered
viable.

Signed-off-by: Andrey Shinkevich 
---
 block/stream.c | 93 ++
 tests/qemu-iotests/030 | 51 +++--
 tests/qemu-iotests/030.out |  4 +-
 tests/qemu-iotests/141.out |  2 +-
 tests/qemu-iotests/245 | 19 +++---
 5 files changed, 83 insertions(+), 86 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index fe2663f..240b3dc 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -17,8 +17,10 @@
 #include "block/blockjob_int.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
+#include "qapi/qmp/qdict.h"
 #include "qemu/ratelimit.h"
 #include "sysemu/block-backend.h"
+#include "block/copy-on-read.h"
 
 enum {
 /*
@@ -33,6 +35,8 @@ typedef struct StreamBlockJob {
 BlockJob common;
 BlockDriverState *base_overlay; /* COW overlay (stream from this) */
 BlockDriverState *above_base;   /* Node directly above the base */
+BlockDriverState *cor_filter_bs;
+BlockDriverState *target_bs;
 BlockdevOnError on_error;
 bool bs_read_only;
 bool chain_frozen;
@@ -52,23 +56,20 @@ static void stream_abort(Job *job)
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 
 if (s->chain_frozen) {
-        BlockJob *bjob = &s->common;
-        bdrv_unfreeze_backing_chain(blk_bs(bjob->blk), s->above_base);
+        bdrv_unfreeze_backing_chain(s->cor_filter_bs, s->above_base);
 }
 }
 
 static int stream_prepare(Job *job)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
-    BlockJob *bjob = &s->common;
-    BlockDriverState *bs = blk_bs(bjob->blk);
-    BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
+    BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
 BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base);
 BlockDriverState *base_metadata = bdrv_skip_filters(base);
 Error *local_err = NULL;
 int ret = 0;
 
-bdrv_unfreeze_backing_chain(bs, s->above_base);
+bdrv_unfreeze_backing_chain(s->cor_filter_bs, s->above_base);
 s->chain_frozen = false;
 
 if (bdrv_cow_child(unfiltered_bs)) {
@@ -94,13 +95,14 @@ static void stream_clean(Job *job)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
     BlockJob *bjob = &s->common;
-    BlockDriverState *bs = blk_bs(bjob->blk);
+
+    bdrv_cor_filter_drop(s->cor_filter_bs);
 
 /* Reopen the image back in read-only mode if necessary */
 if (s->bs_read_only) {
 /* Give up write permissions before making it read-only */
         blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
-bdrv_reopen_set_read_only(bs, true, NULL);
+bdrv_reopen_set_read_only(s->target_bs, true, NULL);
 }
 }
 
@@ -108,9 +110,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 BlockBackend *blk = s->common.blk;
-BlockDriverState *bs = blk_bs(blk);
-BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
-bool enable_cor = !bdrv_cow_child(s->base_overlay);
+BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
 int64_t len;
 int64_t offset = 0;
 uint64_t delay_ns = 0;
@@ -122,21 +122,12 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 return 0;
 }
 
-len = bdrv_getlength(bs);
+len = bdrv_getlength(s->target_bs);
 if (len < 0) {
 return len;
 }
     job_progress_set_remaining(&s->common.job, len);
 
-/* Turn on copy-on-read for the whole block device so that guest read
- * requests help us make progress.  Only do this when copying the entire
- * backing chain since the copy-on-read operation does not take base into
- * account.
- */
-if (enable_cor) {
-bdrv_enable_copy_on_read(bs);
-}
-
 for ( ; offset < len; offset += n) {
 bool copy;
 int ret;
@@ -195,10 +186,6 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 }
 }
 
-

[PATCH v10 6/9] copy-on-read: skip non-guest reads if no copy needed

2020-09-29 Thread Andrey Shinkevich via
If the BDRV_REQ_PREFETCH flag is set, pass it on to the COR driver so
that unneeded reads can be skipped. The flag can be taken into account
when optimizing the COR algorithms. At the moment, this check takes
effect during the block-stream job.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 14 ++
 block/io.c   |  2 +-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index f53f7e0..5389dca 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -145,10 +145,16 @@ static int coroutine_fn cor_co_preadv_part(BlockDriverState *bs,
 }
 }
 
-        ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
-                                  local_flags);
-        if (ret < 0) {
-            return ret;
+        if ((flags & BDRV_REQ_PREFETCH) &&
+            !(local_flags & BDRV_REQ_COPY_ON_READ)) {
+            /* Skip non-guest reads if no copy is needed */
+        } else {
+            ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
+                                      local_flags);
+            if (ret < 0) {
+                return ret;
+            }
         }
 
 offset += n;
diff --git a/block/io.c b/block/io.c
index 11df188..62b75a5 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1388,7 +1388,7 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
         qemu_iovec_init_buf(&local_qiov, bounce_buffer, pnum);
 
         ret = bdrv_driver_preadv(bs, cluster_offset, pnum,
-                                 &local_qiov, 0, 0);
+                                 &local_qiov, 0, flags & BDRV_REQ_PREFETCH);
 if (ret < 0) {
 goto err;
 }
-- 
1.8.3.1




[PATCH v10 4/9] copy-on-read: pass base node name to COR driver

2020-09-29 Thread Andrey Shinkevich via
To limit the guest's COR operations to the part of the backing chain
above the base node during the stream job, pass the base node name to
the copy-on-read driver. The rest of the functionality will be
implemented in the patch that follows.
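
A sketch of how a caller might pass the new option when opening the
filter node (node names invented):

    QDict *opts = qdict_new();

    qdict_put_str(opts, "driver", "copy-on-read");
    qdict_put_str(opts, "file", "top-fmt-node");   /* existing node name */
    qdict_put_str(opts, "base", "base-fmt-node");  /* consumed by cor_open() */

    filter_bs = bdrv_open(NULL, NULL, opts, BDRV_O_RDWR, errp);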

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 3c8231f..e04092f 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -24,19 +24,23 @@
 #include "block/block_int.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
+#include "qapi/qmp/qerror.h"
 #include "qapi/qmp/qdict.h"
 #include "block/copy-on-read.h"
 
 
 typedef struct BDRVStateCOR {
 bool active;
+BlockDriverState *base_bs;
 } BDRVStateCOR;
 
 
 static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
+BlockDriverState *base_bs = NULL;
 BDRVStateCOR *state = bs->opaque;
+const char *base_node = qdict_get_try_str(options, "base");
 
 bs->file = bdrv_open_child(NULL, options, "file", bs, _of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
@@ -52,7 +56,16 @@ static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
 bs->file->bs->supported_zero_flags);
 
+if (base_node) {
+qdict_del(options, "base");
+base_bs = bdrv_lookup_bs(NULL, base_node, errp);
+if (!base_bs) {
+error_setg(errp, QERR_BASE_NOT_FOUND, base_node);
+return -EINVAL;
+}
+}
 state->active = true;
+state->base_bs = base_bs;
 
 /*
  * We don't need to call bdrv_child_refresh_perms() now as the permissions
-- 
1.8.3.1




[PATCH v10 8/9] block: remove unused backing-file name parameter

2020-09-29 Thread Andrey Shinkevich via
The block-stream QMP parameter backing-file is no longer in use. It
designated a backing file name to set in the QCOW2 image header after
the block-stream job finished. The base file name is used instead.

Signed-off-by: Andrey Shinkevich 
---
 block/monitor/block-hmp-cmds.c |  2 +-
 block/stream.c |  6 +-
 blockdev.c | 17 +
 include/block/block_int.h  |  2 +-
 qapi/block-core.json   | 17 +
 5 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 4e66775..5f19499 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -506,7 +506,7 @@ void hmp_block_stream(Monitor *mon, const QDict *qdict)
 int64_t speed = qdict_get_try_int(qdict, "speed", 0);
 
     qmp_block_stream(true, device, device, base != NULL, base, false, NULL,
-                     false, NULL, qdict_haskey(qdict, "speed"), speed, true,
+                     qdict_haskey(qdict, "speed"), speed, true,
                      BLOCKDEV_ON_ERROR_REPORT, false, NULL, false, false, false,
                      false, &error);
 
diff --git a/block/stream.c b/block/stream.c
index b0719e9..fe2663f 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -34,7 +34,6 @@ typedef struct StreamBlockJob {
 BlockDriverState *base_overlay; /* COW overlay (stream from this) */
 BlockDriverState *above_base;   /* Node directly above the base */
 BlockdevOnError on_error;
-char *backing_file_str;
 bool bs_read_only;
 bool chain_frozen;
 } StreamBlockJob;
@@ -103,8 +102,6 @@ static void stream_clean(Job *job)
 blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, _abort);
 bdrv_reopen_set_read_only(bs, true, NULL);
 }
-
-g_free(s->backing_file_str);
 }
 
 static int coroutine_fn stream_run(Job *job, Error **errp)
@@ -220,7 +217,7 @@ static const BlockJobDriver stream_job_driver = {
 };
 
 void stream_start(const char *job_id, BlockDriverState *bs,
-  BlockDriverState *base, const char *backing_file_str,
+  BlockDriverState *base,
   int creation_flags, int64_t speed,
   BlockdevOnError on_error,
   const char *filter_node_name,
@@ -295,7 +292,6 @@ void stream_start(const char *job_id, BlockDriverState *bs,
 
 s->base_overlay = base_overlay;
 s->above_base = above_base;
-s->backing_file_str = g_strdup(backing_file_str);
 s->bs_read_only = bs_read_only;
 s->chain_frozen = true;
 
diff --git a/blockdev.c b/blockdev.c
index d719c47..b223601 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2486,7 +2486,6 @@ out:
 void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
   bool has_base, const char *base,
   bool has_base_node, const char *base_node,
-  bool has_backing_file, const char *backing_file,
   bool has_speed, int64_t speed,
   bool has_on_error, BlockdevOnError on_error,
   bool has_filter_node_name, const char *filter_node_name,
@@ -2498,7 +2497,6 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
 BlockDriverState *base_bs = NULL;
 AioContext *aio_context;
 Error *local_err = NULL;
-const char *base_name = NULL;
 int job_flags = JOB_DEFAULT;
 
 if (!has_on_error) {
@@ -2526,7 +2524,6 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
 goto out;
 }
 assert(bdrv_get_aio_context(base_bs) == aio_context);
-base_name = base;
 }
 
 if (has_base_node) {
@@ -2541,7 +2538,6 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
 }
 assert(bdrv_get_aio_context(base_bs) == aio_context);
 bdrv_refresh_filename(base_bs);
-base_name = base_bs->filename;
 }
 
 /* Check for op blockers in the whole chain between bs and base */
@@ -2553,17 +2549,6 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
 }
 }
 
-/* if we are streaming the entire chain, the result will have no backing
- * file, and specifying one is therefore an error */
-if (base_bs == NULL && has_backing_file) {
-error_setg(errp, "backing file specified, but streaming the "
- "entire chain");
-goto out;
-}
-
-/* backing_file string overrides base bs filename */
-base_name = has_backing_file ? backing_file : base_name;
-
 if (has_auto_finalize && !auto_finalize) {
 job_flags |= JOB_MANUAL_FINALIZE;
 }
@@ -2571,7 +2556,7 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
 job_flags |= JOB_MANUAL_DISMISS;
 }
 
-stream_start(has_job_id ? job_id : NULL, bs, base_bs, base_name,
+stream_start(has_job_id ? 

[PATCH v10 3/9] qapi: add filter-node-name to block-stream

2020-09-29 Thread Andrey Shinkevich via
Provide the possibility to pass the 'filter-node-name' parameter to the
block-stream job as it is done for the commit block job.
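
A hypothetical QMP invocation using the new option (node names
invented):

    { "execute": "block-stream",
      "arguments": { "device": "top-node",
                     "base-node": "base-node",
                     "filter-node-name": "stream-cor-filter",
                     "job-id": "stream0" } }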

Signed-off-by: Andrey Shinkevich 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/monitor/block-hmp-cmds.c | 4 ++--
 block/stream.c | 4 +++-
 blockdev.c | 4 +++-
 include/block/block_int.h  | 7 ++-
 qapi/block-core.json   | 6 ++
 5 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 4d3db5e..4e66775 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -507,8 +507,8 @@ void hmp_block_stream(Monitor *mon, const QDict *qdict)
 
     qmp_block_stream(true, device, device, base != NULL, base, false, NULL,
                      false, NULL, qdict_haskey(qdict, "speed"), speed, true,
-                     BLOCKDEV_ON_ERROR_REPORT, false, false, false, false,
-                     &error);
+                     BLOCKDEV_ON_ERROR_REPORT, false, NULL, false, false, false,
+                     false, &error);
 
 hmp_handle_error(mon, error);
 }
diff --git a/block/stream.c b/block/stream.c
index 8ce6729..e0540ee 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -221,7 +221,9 @@ static const BlockJobDriver stream_job_driver = {
 void stream_start(const char *job_id, BlockDriverState *bs,
   BlockDriverState *base, const char *backing_file_str,
   int creation_flags, int64_t speed,
-  BlockdevOnError on_error, Error **errp)
+  BlockdevOnError on_error,
+  const char *filter_node_name,
+  Error **errp)
 {
 StreamBlockJob *s;
 BlockDriverState *iter;
diff --git a/blockdev.c b/blockdev.c
index bebd3ba..d719c47 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2489,6 +2489,7 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
   bool has_backing_file, const char *backing_file,
   bool has_speed, int64_t speed,
   bool has_on_error, BlockdevOnError on_error,
+  bool has_filter_node_name, const char *filter_node_name,
   bool has_auto_finalize, bool auto_finalize,
   bool has_auto_dismiss, bool auto_dismiss,
   Error **errp)
@@ -2571,7 +2572,8 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
 }
 
     stream_start(has_job_id ? job_id : NULL, bs, base_bs, base_name,
-                 job_flags, has_speed ? speed : 0, on_error, &local_err);
+                 job_flags, has_speed ? speed : 0, on_error,
+                 filter_node_name, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 goto out;
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 38cad9d..f782737 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1134,6 +1134,9 @@ int is_windows_drive(const char *filename);
  *  See @BlockJobCreateFlags
  * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
  * @on_error: The action to take upon error.
+ * @filter_node_name: The node name that should be assigned to the filter
+ * driver that the commit job inserts into the graph above @bs. NULL means
+ * that a node name should be autogenerated.
  * @errp: Error object.
  *
  * Start a streaming operation on @bs.  Clusters that are unallocated
@@ -1146,7 +1149,9 @@ int is_windows_drive(const char *filename);
 void stream_start(const char *job_id, BlockDriverState *bs,
   BlockDriverState *base, const char *backing_file_str,
   int creation_flags, int64_t speed,
-  BlockdevOnError on_error, Error **errp);
+  BlockdevOnError on_error,
+  const char *filter_node_name,
+  Error **errp);
 
 /**
  * commit_start:
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 3c16f1e..32fb097 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2533,6 +2533,11 @@
 #'stop' and 'enospc' can only be used if the block device
 #supports io-status (see BlockInfo).  Since 1.3.
 #
+# @filter-node-name: the node name that should be assigned to the
+#filter driver that the stream job inserts into the graph
+#above @device. If this option is not given, a node name is
+#autogenerated. (Since: 5.2)
+#
 # @auto-finalize: When false, this job will wait in a PENDING state after it has
 # finished its work, waiting for @block-job-finalize before
 # making any block graph changes.
@@ -2563,6 +2568,7 @@
   'data': { '*job-id': 'str', 'device': 'str', '*base': 'str',
 '*base-node': 'str', '*backing-file': 'str', '*speed': 'int',
 '*on-error': 'BlockdevOnError',
+ 

[PATCH v10 2/9] copy-on-read: add filter append/drop functions

2020-09-29 Thread Andrey Shinkevich via
Provide an API for COR-filter insertion and removal.
Also, drop the filter child permissions for the inactive state when the
filter node is being removed.
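
A sketch of the intended usage from a job's point of view; the option
keys mirror cor_open(), everything else is illustrative:

    QDict *opts = qdict_new();
    BlockDriverState *filter_bs;

    qdict_put_str(opts, "driver", "copy-on-read");
    qdict_put_str(opts, "node-name", "cor-filter0");
    qdict_put_str(opts, "file", bdrv_get_node_name(bs));

    filter_bs = bdrv_cor_filter_append(bs, opts, BDRV_O_RDWR, errp);
    if (!filter_bs) {
        return;                       /* errp is already set */
    }

    /* ... run the job on top of filter_bs ... */

    bdrv_cor_filter_drop(filter_bs);  /* restores the original chain */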

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 84 
 block/copy-on-read.h | 35 ++
 2 files changed, 119 insertions(+)
 create mode 100644 block/copy-on-read.h

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index cb03e0f..3c8231f 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -23,11 +23,21 @@
 #include "qemu/osdep.h"
 #include "block/block_int.h"
 #include "qemu/module.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "block/copy-on-read.h"
+
+
+typedef struct BDRVStateCOR {
+bool active;
+} BDRVStateCOR;
 
 
 static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
+BDRVStateCOR *state = bs->opaque;
+
     bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
                                BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
                                false, errp);
@@ -42,6 +52,13 @@ static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
 bs->file->bs->supported_zero_flags);
 
+state->active = true;
+
+/*
+ * We don't need to call bdrv_child_refresh_perms() now as the permissions
+ * will be updated later when the filter node gets its parent.
+ */
+
 return 0;
 }
 
@@ -57,6 +74,17 @@ static void cor_child_perm(BlockDriverState *bs, BdrvChild *c,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
 {
+BDRVStateCOR *s = bs->opaque;
+
+if (!s->active) {
+/*
+ * While the filter is being removed
+ */
+*nperm = 0;
+*nshared = BLK_PERM_ALL;
+return;
+}
+
 *nperm = perm & PERM_PASSTHROUGH;
 *nshared = (shared & PERM_PASSTHROUGH) | PERM_UNCHANGED;
 
@@ -135,6 +163,7 @@ static void cor_lock_medium(BlockDriverState *bs, bool locked)
 
 static BlockDriver bdrv_copy_on_read = {
 .format_name= "copy-on-read",
+.instance_size  = sizeof(BDRVStateCOR),
 
 .bdrv_open  = cor_open,
 .bdrv_child_perm= cor_child_perm,
@@ -159,4 +188,59 @@ static void bdrv_copy_on_read_init(void)
 bdrv_register(_copy_on_read);
 }
 
+
+BlockDriverState *bdrv_cor_filter_append(BlockDriverState *bs,
+ QDict *node_options,
+ int flags, Error **errp)
+{
+BlockDriverState *cor_filter_bs;
+Error *local_err = NULL;
+
+cor_filter_bs = bdrv_open(NULL, NULL, node_options, flags, errp);
+if (cor_filter_bs == NULL) {
+error_prepend(errp, "Could not create COR-filter node: ");
+return NULL;
+}
+
+bdrv_drained_begin(bs);
+    bdrv_replace_node(bs, cor_filter_bs, &local_err);
+bdrv_drained_end(bs);
+
+if (local_err) {
+bdrv_unref(cor_filter_bs);
+error_propagate(errp, local_err);
+return NULL;
+}
+
+return cor_filter_bs;
+}
+
+
+void bdrv_cor_filter_drop(BlockDriverState *cor_filter_bs)
+{
+BdrvChild *child;
+BlockDriverState *bs;
+BDRVStateCOR *s = cor_filter_bs->opaque;
+
+child = bdrv_filter_child(cor_filter_bs);
+if (!child) {
+return;
+}
+bs = child->bs;
+
+/* Retain the BDS until we complete the graph change. */
+bdrv_ref(bs);
+/* Hold a guest back from writing while permissions are being reset. */
+bdrv_drained_begin(bs);
+/* Drop permissions before the graph change. */
+s->active = false;
+    bdrv_child_refresh_perms(cor_filter_bs, child, &error_abort);
+    bdrv_replace_node(cor_filter_bs, bs, &error_abort);
+
+bdrv_drained_end(bs);
+bdrv_unref(bs);
+bdrv_unref(cor_filter_bs);
+}
+
+
 block_init(bdrv_copy_on_read_init);
diff --git a/block/copy-on-read.h b/block/copy-on-read.h
new file mode 100644
index 000..d6f2422
--- /dev/null
+++ b/block/copy-on-read.h
@@ -0,0 +1,35 @@
+/*
+ * Copy-on-read filter block driver
+ *
+ * The filter driver performs Copy-On-Read (COR) operations
+ *
+ * Copyright (c) 2018-2020 Virtuozzo International GmbH.
+ *
+ * Author:
+ *   Andrey Shinkevich 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a 

[PATCH v10 1/9] copy-on-read: Support preadv/pwritev_part functions

2020-09-29 Thread Andrey Shinkevich via
Add support for the recently introduced functions bdrv_co_preadv_part()
and bdrv_co_pwritev_part() to the COR-filter driver.
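
For context, the _part variants take an additional qiov_offset argument
and operate on a sub-range of the I/O vector. A rough illustration (a
hypothetical coroutine-context caller, not part of this patch):

    QEMUIOVector qiov;
    uint8_t buf[8192];
    int ret;

    /* Read 4 KiB from guest offset 0 into the second half of buf,
     * i.e. starting at byte offset 4096 within the qiov. */
    qemu_iovec_init_buf(&qiov, buf, sizeof(buf));
    ret = bdrv_co_preadv_part(bs->file, 0, 4096, &qiov, 4096, 0);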

Signed-off-by: Andrey Shinkevich 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/copy-on-read.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 2816e61..cb03e0f 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -74,21 +74,25 @@ static int64_t cor_getlength(BlockDriverState *bs)
 }
 
 
-static int coroutine_fn cor_co_preadv(BlockDriverState *bs,
-  uint64_t offset, uint64_t bytes,
-  QEMUIOVector *qiov, int flags)
+static int coroutine_fn cor_co_preadv_part(BlockDriverState *bs,
+   uint64_t offset, uint64_t bytes,
+   QEMUIOVector *qiov,
+   size_t qiov_offset,
+   int flags)
 {
-return bdrv_co_preadv(bs->file, offset, bytes, qiov,
-  flags | BDRV_REQ_COPY_ON_READ);
+return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
+   flags | BDRV_REQ_COPY_ON_READ);
 }
 
 
-static int coroutine_fn cor_co_pwritev(BlockDriverState *bs,
-   uint64_t offset, uint64_t bytes,
-   QEMUIOVector *qiov, int flags)
+static int coroutine_fn cor_co_pwritev_part(BlockDriverState *bs,
+uint64_t offset,
+uint64_t bytes,
+QEMUIOVector *qiov,
+size_t qiov_offset, int flags)
 {
-
-return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
+return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
+flags);
 }
 
 
@@ -137,8 +141,8 @@ static BlockDriver bdrv_copy_on_read = {
 
 .bdrv_getlength = cor_getlength,
 
-.bdrv_co_preadv = cor_co_preadv,
-.bdrv_co_pwritev= cor_co_pwritev,
+.bdrv_co_preadv_part= cor_co_preadv_part,
+.bdrv_co_pwritev_part   = cor_co_pwritev_part,
 .bdrv_co_pwrite_zeroes  = cor_co_pwrite_zeroes,
 .bdrv_co_pdiscard   = cor_co_pdiscard,
 .bdrv_co_pwritev_compressed = cor_co_pwritev_compressed,
-- 
1.8.3.1




[PATCH v10 0/9] Apply COR-filter to the block-stream permanently

2020-09-29 Thread Andrey Shinkevich via
Although the patch "freeze link to base node..." was removed from the
series in the current version 9, the iotest case test_stream_parallel does
not pass after the COR-filter is inserted into the backing chain. As the
test case may not be initialized, it does not make sense and was removed
again.
The check with bdrv_is_allocated_above() takes place both in the COR-filter
and in the block-stream job. An optimization of the block-stream job based
on the filter functionality may be made in a separate series.

v10:
  02: The missed new file block/copy-on-read.h added
v9:
  02: Refactored.
  04: Base node name is used instead of the file name.
  05: New implementation based on Max' review.
  06: New.
  07: New. The patch "freeze link to base node..." was deleted.
  08: New.
  09: The filter node options are initialized.

The v8 Message-Id:
<1598633579-221780-1-git-send-email-andrey.shinkev...@virtuozzo.com>

Andrey Shinkevich (9):
  copy-on-read: Support preadv/pwritev_part functions
  copy-on-read: add filter append/drop functions
  qapi: add filter-node-name to block-stream
  copy-on-read: pass base node name to COR driver
  copy-on-read: limit guest COR activity to base in COR driver
  copy-on-read: skip non-guest reads if no copy needed
  stream: skip filters when writing backing file name to QCOW2 header
  block: remove unused backing-file name parameter
  block: apply COR-filter to block-stream jobs

 block/copy-on-read.c   | 165 ++---
 block/copy-on-read.h   |  35 +
 block/io.c |   2 +-
 block/monitor/block-hmp-cmds.c |   6 +-
 block/stream.c | 112 +---
 blockdev.c |  21 +-
 include/block/block_int.h  |   9 ++-
 qapi/block-core.json   |  23 ++
 tests/qemu-iotests/030 |  51 ++---
 tests/qemu-iotests/030.out |   4 +-
 tests/qemu-iotests/141.out |   2 +-
 tests/qemu-iotests/245 |  19 +++--
 12 files changed, 302 insertions(+), 147 deletions(-)
 create mode 100644 block/copy-on-read.h

-- 
1.8.3.1




[PATCH v10 5/9] copy-on-read: limit guest COR activity to base in COR driver

2020-09-29 Thread Andrey Shinkevich via
Limit the guest's COR operations to the part of the backing chain above
the base node when the base node name is given. This will be useful for a
block-stream job when the COR-filter is applied.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 38 --
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index e04092f..f53f7e0 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -121,8 +121,42 @@ static int coroutine_fn 
cor_co_preadv_part(BlockDriverState *bs,
size_t qiov_offset,
int flags)
 {
-return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
-   flags | BDRV_REQ_COPY_ON_READ);
+int64_t n = 0;
+int64_t size = offset + bytes;
+int local_flags;
+int ret;
+BDRVStateCOR *state = bs->opaque;
+
+if (!state->base_bs) {
+return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
+   flags | BDRV_REQ_COPY_ON_READ);
+}
+
+while (offset < size) {
+local_flags = flags;
+
+/* In case of failure, try to copy-on-read anyway */
+ret = bdrv_is_allocated(bs->file->bs, offset, bytes, &n);
+if (!ret) {
+ret = bdrv_is_allocated_above(bdrv_cow_bs(bs->file->bs),
+  state->base_bs, false, offset, n, &n);
+if (ret > 0) {
+local_flags |= BDRV_REQ_COPY_ON_READ;
+}
+}
+
+ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
+  local_flags);
+if (ret < 0) {
+return ret;
+}
+
+offset += n;
+qiov_offset += n;
+bytes -= n;
+}
+
+return 0;
 }
 
 
-- 
1.8.3.1




Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 28 22:54, Damien Le Moal wrote:
> On 2020/09/29 6:25, Keith Busch wrote:
> > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> >> On Sep 28 02:33, Dmitry Fomichev wrote:
> >>> You are making it sound like the entire WDC series relies on this 
> >>> approach.
> >>> Actually, the persistency is introduced in the second to last patch in the
> >>> series and it only adds a couple of lines of code in the i/o path to mark
> >>> zones dirty. This is possible because of using mmap() and I find the way
> >>> it is done to be quite elegant, not ugly :)
> >>>
> >>
> >> No, I understand that your implementation works fine without
> >> persistance, but persistance is key. That is why my series adds it in
> >> the first patch. Without persistence it is just a toy. And the QEMU
> >> device is not just an "NVMe-version" of null_blk.
> > 
> > I really think we should be a bit more cautious of committing to an
> > on-disk format for the persistent state. Both this and Klaus' persistent
> > state feels a bit ad-hoc, and with all the other knobs provided, it
> > looks too easy to have out-of-sync states, or just not being able to
> > boot at all if a qemu versions have different on-disk formats.
> > 
> > Is anyone really considering zone emulation for production level stuff
> > anyway? I can't imagine a real scenario where you'd want put yourself
> > through that: you are just giving yourself all the downsides of a zoned
> > block device and none of the benefits. AFAIK, this is provided as a
> > development vehicle, closer to a "toy".
> > 
> > I think we should consider trimming this down to a more minimal set that
> > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > bells & whistles and flush out the meta data's data marshalling scheme
> > for persistence later.
> 
> +1 on this. Removing the persistence also removes the debate on endianness. 
> With
> that out of the way, it should be straightforward to get agreement on a series
> that can be merged quickly to get developers started with testing ZNS software
> with QEMU. That is the most important goal here. 5.9 is around the corner, we
> need something for people to get started with ZNS quickly.
> 

Wait. What. No. Stop!

It is unmistakably clear that you are invalidating my arguments about
portability and endianness issues by suggesting that we just remove
persistent state and deal with it later, but persistence is the killer
feature that sets the QEMU emulated device apart from other emulation
options. It is not about using emulation in production (because yeah,
why would you?), but persistence is what makes it possible to develop
and test "zoned FTLs" or something that requires recovery at power up.
This is what allows testing of how your host software deals with opened
zones being transitioned to FULL on power up, and the persistent tracking
of LBA allocation (in my series) can be used to properly test error
recovery if you lose state in the app.

Please, work with me on this instead of just removing such an essential
feature. Since persistence seems to be the only thing we are really
discussing, we should have plenty of time until the soft-freeze to come
up with a proper solution on that.

I agree that my version had a format that was pretty ad-hoc and that
won't fly - it needs magic and version capabilities like in Dmitry's
series, which incidentally looks a lot like what we did in the
OpenChannel implementation, so I agree with the strategy.

ZNS-wise, the only thing my implementation stores is the zone
descriptors (in spec-native little-endian format) and the zone
descriptor extensions. So there are no endian issues with those. The
allocation tracking bitmap is always stored in little endian, but
converted to big-endian if running on a big-endian host.
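
Concretely, the load-path conversion amounts to something like this
(helper name conjured for illustration; le64_to_cpu() is from
"qemu/bswap.h", and the store path uses cpu_to_le64() the same way):

/* a no-op on little-endian hosts */
static void pstate_bitmap_le_to_host(uint64_t *map, size_t nlongs)
{
    size_t i;

    for (i = 0; i < nlongs; i++) {
        map[i] = le64_to_cpu(map[i]);
    }
}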

Let me just conjure something up.

#define NVME_PSTATE_MAGIC ...
#define NVME_PSTATE_V1 1

typedef struct NvmePstateHeader {
uint32_t magic;
uint32_t version;

uint64_t blk_len;

uint8_t  lbads;
uint8_t  iocs;

uint8_t  rsvd18[3054];

struct {
uint64_t zsze;
uint8_t  zdes;
} QEMU_PACKED zns;

uint8_t  rsvd3089[1007];
} QEMU_PACKED NvmePstateHeader;

With such a header we have all we need. We can bail out if any
parameters do not match, and, similar to nvme data structures, it contains
reserved areas for future use. I'll be posting a v2 with this. If this
still feels too ad-hoc, we can be inspired by QCOW2 and the "extension"
feature.
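
The load-time "bail out" could then be roughly the following (a sketch
only; the function name and parameter plumbing are conjured, assuming
the header fields are stored little-endian):

static int nvme_pstate_check(NvmePstateHeader *header, uint64_t blk_len,
                             uint8_t lbads, uint8_t iocs)
{
    if (le32_to_cpu(header->magic) != NVME_PSTATE_MAGIC) {
        return -EINVAL;
    }

    if (le32_to_cpu(header->version) > NVME_PSTATE_V1) {
        return -EINVAL;
    }

    /* refuse state that does not match the device configuration */
    if (le64_to_cpu(header->blk_len) != blk_len ||
        header->lbads != lbads || header->iocs != iocs) {
        return -EINVAL;
    }

    return 0;
}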

I can agree that we drop other optional features like zone excursions
and the reset/finish recommended limit simulation, but PLEASE DO NOT
remove persistence and upstream a half-baked version when we are so
close and have time to get it right.




Re: [PATCH v5 0/7] vhost-user-blk: fix the migration issue and enhance qtests

2020-09-29 Thread Dima Stepanov
On Tue, Sep 29, 2020 at 03:13:09AM -0400, Michael S. Tsirkin wrote:
> On Sun, Sep 27, 2020 at 09:48:28AM +0300, Dima Stepanov wrote:
> > On Thu, Sep 24, 2020 at 07:26:14AM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Sep 11, 2020 at 11:39:42AM +0300, Dima Stepanov wrote:
> > > > v4 -> v5:
> > > >   - vhost: check queue state in the vhost_dev_set_log routine
> > > > tests/qtest/vhost-user-test: prepare the tests for adding new
> > > > dev class
> > > > tests/qtest/vhost-user-test: add support for the
> > > > vhost-user-blk device
> > > > tests/qtest/vhost-user-test: add migrate_reconnect test
> > > > Reviewed-by: Raphael Norwitz
> > > >   - Update qtest, by merging vhost-user-blk "if" case with the
> > > > virtio-blk case.
> > > 
> > > I dropped patches 3-7 since they were stalling on some systems.
> > > Pls work with Peter Maydell (cc'd) to figure it out.
> > Thanks!
> > 
> > Peter, can you share any details for the stalling errors with me?
> 
> I can say for sure that even on x86/linux the affected tests take
> much longer to run with these applied.
> I'd suggest making sure there are no timeouts involved in the good case 
Could you help me reproduce it? On my system I see only a 10+ second
increase for the qos-test set to pass (both x86_64 and i386). On the
current master I'm running it like:
  - ./configure  --target-list="x86_64-softmmu i386-softmmu"
  - with no patch set:
time QTEST_QEMU_BINARY=./build/x86_64-softmmu/qemu-system-x86_64 
./build/tests/qtest/qos-test
real    0m6.394s
user    0m3.643s
sys     0m3.477s
  - without patch 7 (which also includes the vhost-user-net tests):
real    0m9.955s
user    0m4.133s
sys     0m4.397s
  - full patch set:
real    0m17.263s
user    0m4.530s
sys     0m4.802s
For the i386 target I see pretty much the same numbers:
  time QTEST_QEMU_BINARY=./build/i386-softmmu/qemu-system-i386 
./build/tests/qtest/qos-test
  real    0m17.386s
  user    0m4.503s
  sys     0m4.911s
So it looks like I'm missing some step to reproduce the issue.

And if I run the exact test, it takes ~2-3s to pass:
  $ time QTEST_QEMU_BINARY=./build/x86_64-softmmu/qemu-system-x86_64 
./build/tests/qtest/qos-test -p 
/x86_64/pc/i440FX-pcihost/pci-bus-pc/pci-bus/vhost-user-blk-pci/vhost-user-blk/vhost-user-blk-tests/reconnect
  
/x86_64/pc/i440FX-pcihost/pci-bus-pc/pci-bus/vhost-user-blk-pci/vhost-user-blk/vhost-user-blk-tests/reconnect:
 OK
  real    0m2.253s
  user    0m0.118s
  sys     0m0.104s
And same numbers for i386.

> 
> > > 
> > > 
> > > > v3 -> v4:
> > > >   - vhost: recheck dev state in the vhost_migration_log routine
> > > > Reviewed-by: Raphael Norwitz
> > > >   - vhost: check queue state in the vhost_dev_set_log routine
> > > > Use "continue" instead of "break" to handle non-initialized
> > > > virtqueue case.
> > > > 
> > > > v2 -> v3:
> > > >   - update commit message for the 
> > > > "vhost: recheck dev state in the vhost_migration_log routine" commit
> > > >   - rename "started" field of the VhostUserBlk structure to
> > > > "started_vu", so there will be no confustion with the VHOST started
> > > > field
> > > >   - update vhost-user-test.c to always initialize nq local variable
> > > > (spotted by patchew)
> > > > 
> > > > v1 -> v2:
> > > >   - add comments to connected/started fields in the header file
> > > >   - move the "s->started" logic from the vhost_user_blk_disconnect
> > > > routine to the vhost_user_blk_stop routine
> > > > 
> > > > Reference e-mail threads:
> > > >   - https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg01509.html
> > > >   - https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg05241.html
> > > > 
> > > > If vhost-user daemon is used as a backend for the vhost device, then we
> > > > should consider a possibility of disconnect at any moment. There was a 
> > > > general
> > > > question here: should we consider it as an error or okay state for the 
> > > > vhost-user
> > > > devices during migration process?
> > > > I think the disconnect event for the vhost-user devices should not 
> > > > break the
> > > > migration process, because:
> > > >   - the device will be in the stopped state, so it will not be changed
> > > > during migration
> > > >   - if reconnect will be made the migration log will be reinitialized as
> > > > part of reconnect/init process:
> > > > #0  vhost_log_global_start (listener=0x563989cf7be0)
> > > > at hw/virtio/vhost.c:920
> > > > #1  0x56398603d8bc in listener_add_address_space 
> > > > (listener=0x563989cf7be0,
> > > > as=0x563986ea4340 )
> > > > at softmmu/memory.c:2664
> > > > #2  0x56398603dd30 in memory_listener_register 
> > > > (listener=0x563989cf7be0,
> > > > as=0x563986ea4340 )
> > > > at softmmu/memory.c:2740
> > > > #3  0x563985fd6956 in vhost_dev_init (hdev=0x563989cf7bd8,
> > > > opaque=0x563989cf7e30, 

[PULL v4 06/48] vhost: recheck dev state in the vhost_migration_log routine

2020-09-29 Thread Michael S. Tsirkin
From: Dima Stepanov 

vhost-user devices can get a disconnect in the middle of the VHOST-USER
handshake at migration start. If the disconnect event happens right
before sending the next VHOST-USER command, the vhost_dev_set_log()
call in the vhost_migration_log() function will return an error. This
error will trigger the assert() and kill the QEMU migration source
process. For vhost-user devices the disconnect event should not break
the migration process, because:
  - the device will be in the stopped state, so it will not be changed
during migration
  - if a reconnect is made, the migration log will be reinitialized as
part of reconnect/init process:
#0  vhost_log_global_start (listener=0x563989cf7be0)
at hw/virtio/vhost.c:920
#1  0x56398603d8bc in listener_add_address_space 
(listener=0x563989cf7be0,
as=0x563986ea4340 <address_space_memory>)
at softmmu/memory.c:2664
#2  0x56398603dd30 in memory_listener_register (listener=0x563989cf7be0,
as=0x563986ea4340 <address_space_memory>)
at softmmu/memory.c:2740
#3  0x563985fd6956 in vhost_dev_init (hdev=0x563989cf7bd8,
opaque=0x563989cf7e30, backend_type=VHOST_BACKEND_TYPE_USER,
busyloop_timeout=0)
at hw/virtio/vhost.c:1385
#4  0x563985f7d0b8 in vhost_user_blk_connect (dev=0x563989cf7990)
at hw/block/vhost-user-blk.c:315
#5  0x563985f7d3f6 in vhost_user_blk_event (opaque=0x563989cf7990,
event=CHR_EVENT_OPENED)
at hw/block/vhost-user-blk.c:379
Update the vhost-user-blk device with an internal started_vu field which
will be used for initialization (vhost_user_blk_start) and cleanup
(vhost_user_blk_stop). This additional flag in the VhostUserBlk structure
will be used to track whether the device really needs to be stopped and
cleaned up at the vhost-user level.
The disconnect event will set the overall VHOST device (not vhost-user) to
the stopped state, so it can be used by the general vhost_migration_log
routine.
Such an approach could be propagated to the other vhost-user devices, but
a better idea is just to use the same connect/disconnect code for all the
vhost-user devices.

This migration issue was slightly discussed earlier:
  - https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg01509.html
  - https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg05241.html
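
(The hw/virtio/vhost.c hunk is truncated in the diff below; the gist of
the recheck, paraphrased as a sketch rather than the verbatim hunk, is:)

    /* in vhost_migration_log(): */
    r = vhost_dev_set_log(dev, enable);
    ...
    /*
     * A vhost-user device may have been stopped by a disconnect while
     * the log was being set up. That is fine for migration, so report
     * success instead of killing the migration source.
     */
    if (r < 0 && !dev->started) {
        r = 0;
    }
    return r;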

Signed-off-by: Dima Stepanov 
Reviewed-by: Raphael Norwitz 
Message-Id: 
<9fbfba06791a87813fcee3e2315f0b904cc6789a.1599813294.git.dimas...@yandex-team.ru>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 include/hw/virtio/vhost-user-blk.h | 10 ++
 hw/block/vhost-user-blk.c  | 19 ---
 hw/virtio/vhost.c  | 27 ---
 3 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/include/hw/virtio/vhost-user-blk.h 
b/include/hw/virtio/vhost-user-blk.h
index f536576d20..7c91f15040 100644
--- a/include/hw/virtio/vhost-user-blk.h
+++ b/include/hw/virtio/vhost-user-blk.h
@@ -40,7 +40,17 @@ struct VHostUserBlk {
 VhostUserState vhost_user;
 struct vhost_virtqueue *vhost_vqs;
 VirtQueue **virtqs;
+
+/*
+ * There are at least two steps of initialization of the
+ * vhost-user device. The first is a "connect" step and
+ * second is a "start" step. Make a separation between
+ * those initialization phases by using two fields.
+ */
+/* vhost_user_blk_connect/vhost_user_blk_disconnect */
 bool connected;
+/* vhost_user_blk_start/vhost_user_blk_stop */
+bool started_vu;
 };
 
 #endif
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 39aec42dae..a076b1e54d 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -150,6 +150,7 @@ static int vhost_user_blk_start(VirtIODevice *vdev)
 error_report("Error starting vhost: %d", -ret);
 goto err_guest_notifiers;
 }
+s->started_vu = true;
 
 /* guest_notifier_mask/pending not used yet, so just unmask
  * everything here. virtio-pci will do the right thing by
@@ -175,6 +176,11 @@ static void vhost_user_blk_stop(VirtIODevice *vdev)
 VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
 int ret;
 
+if (!s->started_vu) {
+return;
+}
+s->started_vu = false;
+
 if (!k->set_guest_notifiers) {
 return;
 }
@@ -341,9 +347,7 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
 }
 s->connected = false;
 
-if (s->dev.started) {
-vhost_user_blk_stop(vdev);
-}
+vhost_user_blk_stop(vdev);
 
 vhost_dev_cleanup(&s->dev);
 }
@@ -399,6 +403,15 @@ static void vhost_user_blk_event(void *opaque, 
QEMUChrEvent event)
 NULL, NULL, false);
 aio_bh_schedule_oneshot(ctx, vhost_user_blk_chr_closed_bh, opaque);
 }
+
+/*
+ * Move vhost device to the stopped state. The vhost-user device
+ * will be cleaned up and disconnected in a BH. This can be useful in
+ * the vhost 

Re: [PATCH v5 0/7] vhost-user-blk: fix the migration issue and enhance qtests

2020-09-29 Thread Michael S. Tsirkin
On Sun, Sep 27, 2020 at 09:48:28AM +0300, Dima Stepanov wrote:
> On Thu, Sep 24, 2020 at 07:26:14AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Sep 11, 2020 at 11:39:42AM +0300, Dima Stepanov wrote:
> > > v4 -> v5:
> > >   - vhost: check queue state in the vhost_dev_set_log routine
> > > tests/qtest/vhost-user-test: prepare the tests for adding new
> > > dev class
> > > tests/qtest/vhost-user-test: add support for the
> > > vhost-user-blk device
> > > tests/qtest/vhost-user-test: add migrate_reconnect test
> > > Reviewed-by: Raphael Norwitz
> > >   - Update qtest, by merging vhost-user-blk "if" case with the
> > > virtio-blk case.
> > 
> > I dropped patches 3-7 since they were stalling on some systems.
> > Pls work with Peter Maydell (cc'd) to figure it out.
> Thanks!
> 
> Peter, can you share any details for the stalling errors with me?

I can say for sure that even on x86/linux the affected tests take
much longer to run with these applied.
I'd suggest making sure there are no timeouts involved in the good case 

> > 
> > 
> > > v3 -> v4:
> > >   - vhost: recheck dev state in the vhost_migration_log routine
> > > Reviewed-by: Raphael Norwitz
> > >   - vhost: check queue state in the vhost_dev_set_log routine
> > > Use "continue" instead of "break" to handle non-initialized
> > > virtqueue case.
> > > 
> > > v2 -> v3:
> > >   - update commit message for the 
> > > "vhost: recheck dev state in the vhost_migration_log routine" commit
> > >   - rename "started" field of the VhostUserBlk structure to
> > > "started_vu", so there will be no confustion with the VHOST started
> > > field
> > >   - update vhost-user-test.c to always initialize nq local variable
> > > (spotted by patchew)
> > > 
> > > v1 -> v2:
> > >   - add comments to connected/started fields in the header file
> > >   - move the "s->started" logic from the vhost_user_blk_disconnect
> > > routine to the vhost_user_blk_stop routine
> > > 
> > > Reference e-mail threads:
> > >   - https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg01509.html
> > >   - https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg05241.html
> > > 
> > > If vhost-user daemon is used as a backend for the vhost device, then we
> > > should consider a possibility of disconnect at any moment. There was a 
> > > general
> > > question here: should we consider it as an error or okay state for the 
> > > vhost-user
> > > devices during migration process?
> > > I think the disconnect event for the vhost-user devices should not break 
> > > the
> > > migration process, because:
> > >   - the device will be in the stopped state, so it will not be changed
> > > during migration
> > >   - if reconnect will be made the migration log will be reinitialized as
> > > part of reconnect/init process:
> > > #0  vhost_log_global_start (listener=0x563989cf7be0)
> > > at hw/virtio/vhost.c:920
> > > #1  0x56398603d8bc in listener_add_address_space 
> > > (listener=0x563989cf7be0,
> > > as=0x563986ea4340 <address_space_memory>)
> > > at softmmu/memory.c:2664
> > > #2  0x56398603dd30 in memory_listener_register 
> > > (listener=0x563989cf7be0,
> > > as=0x563986ea4340 <address_space_memory>)
> > > at softmmu/memory.c:2740
> > > #3  0x563985fd6956 in vhost_dev_init (hdev=0x563989cf7bd8,
> > > opaque=0x563989cf7e30, backend_type=VHOST_BACKEND_TYPE_USER,
> > > busyloop_timeout=0)
> > > at hw/virtio/vhost.c:1385
> > > #4  0x563985f7d0b8 in vhost_user_blk_connect (dev=0x563989cf7990)
> > > at hw/block/vhost-user-blk.c:315
> > > #5  0x563985f7d3f6 in vhost_user_blk_event (opaque=0x563989cf7990,
> > > event=CHR_EVENT_OPENED)
> > > at hw/block/vhost-user-blk.c:379
> > > The first patch in the patchset fixes this issue by setting vhost device 
> > > to the
> > > stopped state in the disconnect handler and check it the 
> > > vhost_migration_log()
> > > routine before returning from the function.
> > > qtest framework was updated to test vhost-user-blk functionality. The
> > > vhost-user-blk/vhost-user-blk-tests/migrate_reconnect test was added to 
> > > reproduce
> > > the original issue found.
> > > 
> > > Dima Stepanov (7):
> > >   vhost: recheck dev state in the vhost_migration_log routine
> > >   vhost: check queue state in the vhost_dev_set_log routine
> > >   tests/qtest/vhost-user-test: prepare the tests for adding new dev
> > > class
> > >   tests/qtest/libqos/virtio-blk: add support for vhost-user-blk
> > >   tests/qtest/vhost-user-test: add support for the vhost-user-blk device
> > >   tests/qtest/vhost-user-test: add migrate_reconnect test
> > >   tests/qtest/vhost-user-test: enable the reconnect tests
> > > 
> > >  hw/block/vhost-user-blk.c  |  19 ++-
> > >  hw/virtio/vhost.c  |  39 -
> > >  include/hw/virtio/vhost-user-blk.h |  10 ++
> > >  tests/qtest/libqos/virtio-blk.c|  14 +-
> > >  tests/qtest/vhost-user-test.c   

Re: [PATCH v2] hw/ide: check null block before _cancel_dma_sync

2020-09-29 Thread P J P
  Hello Li,

+-- On Fri, 18 Sep 2020, Li Qiang wrote --+
| P J P wrote on Fri, 18 Sep 2020 at 6:26 PM:
| > +-- On Fri, 18 Sep 2020, Li Qiang wrote --+
| > | Update v2: use an assert() call
| > |   ->https://lists.nongnu.org/archive/html/qemu-devel/2020-08/msg08336.html
| 
| In 'ide_ioport_write' the guest can set 'bus->unit' to 0 or 1 by issuing
| 'ATA_IOPORT_WR_DEVICE_HEAD'. So in this case the guest can select the
| active ifs.
| 
| If the guest sets this to 1, then 'idebus_active_if' will return
| 'IDEBus.ifs[1]' and thus 's->blk' will be NULL.

Right, guest does select the drive via

  portio_write
   ->ide_ioport_write
  case ATA_IOPORT_WR_DEVICE_HEAD:
  /* FIXME: HOB readback uses bit 7 */
  bus->ifs[0].select = (val & ~0x10) | 0xa0;
  bus->ifs[1].select = (val | 0x10) | 0xa0;
  /* select drive */
  bus->unit = (val >> 4) & 1; <== set bus->unit=0x1
  break;


| So from what you (Peter) are saying, we need to check the value in the
| 'ATA_IOPORT_WR_DEVICE_HEAD' handler, to see whether the guest
| set a valid 'bus->unit'. This can also work, I think.

Yes, with the following fix, an assert(3) in ide_cancel_dma_sync fails.

===
diff --git a/hw/ide/core.c b/hw/ide/core.c
index f76f7e5234..cb55cc8b0f 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -1300,7 +1300,11 @@ void ide_ioport_write(void *opaque, uint32_t addr,
uint32_t val)
 bus->ifs[0].select = (val & ~0x10) | 0xa0;
 bus->ifs[1].select = (val | 0x10) | 0xa0;
 /* select drive */
+uint8_t bu = bus->unit;
 bus->unit = (val >> 4) & 1;
+if (!bus->ifs[bus->unit].blk) {
+bus->unit = bu;
+}
 break;
 default:

qemu-system-x86_64: ../hw/ide/core.c:724: ide_cancel_dma_sync: Assertion 
`s->bus->dma->aiocb == NULL' failed.
Aborted (core dumped)
===
 
| As 'ide_exec_cmd' and the other functions in 'hw/ide/core.c' check
| 's->blk' directly, I think just checking it in 'ide_cancel_dma_sync' is
| enough, and it is also more consistent with the other functions.
| 'ide_cancel_dma_sync' is also called by 'cmd_device_reset', which is one of
| the 'ide_cmd_table' handlers.

  Yes, I'm okay with either approach. The earlier patch v1 checks 's->blk' in
ide_cancel_dma_sync().
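
I.e., something along these lines (a sketch of the v1 approach, not the
exact committed hunk):

    void ide_cancel_dma_sync(IDEState *s)
    {
        /* Bail out if the selected drive has no block backend attached. */
        if (s->blk == NULL) {
            return;
        }
        ...
    }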
 
| BTW, where is Peter's email saying this? I just want to learn something,
| :).

  -> https://lists.nongnu.org/archive/html/qemu-devel/2020-09/msg05820.html

Thank you.
--
Prasad J Pandit / Red Hat Product Security Team
8685 545E B54C 486B C6EB 271E E285 8B5A F050 DE8D