Re: [PATCH v2 19/20] nvme: make lba data size configurable

2019-11-12 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:24:00PM +0000, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:50, Klaus Jensen  wrote:
> >  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> > -DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > +DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> > +DEFINE_PROP_UINT8("lbads", _state, _props.lbads, 9)
> >
> Could we actually use BDRV_SECTOR_BITS instead of magic numbers?
> 
 
Yes, better. Fixed in two places.
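
For reference, the two spots would then look roughly like this (BDRV_SECTOR_BITS
is 9, so behaviour is unchanged; the exact hunks may differ in the next
revision):

    /* hw/block/nvme-ns.h */
    DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)

    /* hw/block/nvme.c, nvme_realize() */
    ns->params.lbads = BDRV_SECTOR_BITS;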



Re: [PATCH v2 04/20] nvme: populate the mandatory subnqn and ver fields

2019-11-12 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:04:45PM +0000, Beata Michalska wrote:
> Hi Klaus
> 
> On Tue, 15 Oct 2019 at 11:42, Klaus Jensen  wrote:
> > +n->bar.vs = 0x00010201;
> 
> Very minor:
> 
> The version number is being set twice in the patch series already.
> And it is being set in two places.
> It might be worth to make a #define out of it so that only one
> needs to be changed.
> 

I think you are right. I'll do that.
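
Something along these lines, probably (the macro name here is just an example):

    #define NVME_SPEC_VER 0x00010201

    id->ver = cpu_to_le32(NVME_SPEC_VER);
    ...
    n->bar.vs = NVME_SPEC_VER;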



Re: [PATCH v2 06/20] nvme: add support for the abort command

2019-11-12 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:04:38PM +0000, Beata Michalska wrote:
> Hi Klaus
> 

Hi Beata,

Thank you very much for your thorough reviews! I'll start going through
them one by one :) You might have seen that I've posted a v3, but I will
make sure to consolidate between v2 and v3!

> On Tue, 15 Oct 2019 at 11:41, Klaus Jensen  wrote:
> >
> > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > Section 5.1 ("Abort command").
> >
> > The Abort command is a best effort command; for now, the device always
> > fails to abort the given command.
> >
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 16 
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index daa2367b0863..84e4f2ea7a15 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -741,6 +741,18 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd 
> > *cmd)
> >  }
> >  }
> >
> > +static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
> > +
> > +req->cqe.result = 1;
> > +if (nvme_check_sqid(n, sqid)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> Shouldn't we validate the CID as well ?
> 

According to the specification it is "implementation specific if/when a
controller chooses to complete the command when the command to abort is
not found".

I'm interpreting this to mean that, yes, an invalid command identifier
could be given in the command, but this implementation does not care
about that.

I still think the controller should check the validity of the submission
queue identifier though. It is a general invariant that the sqid should
be valid.

> > +return NVME_SUCCESS;
> > +}
> > +
> >  static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
> >  {
> >  trace_nvme_setfeat_timestamp(ts);
> > @@ -859,6 +871,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  trace_nvme_err_invalid_setfeat(dw10);
> >  return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> > +
> >  return NVME_SUCCESS;
> >  }
> >
> > @@ -875,6 +888,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  return nvme_create_cq(n, cmd);
> >  case NVME_ADM_CMD_IDENTIFY:
> >  return nvme_identify(n, cmd);
> > +case NVME_ADM_CMD_ABORT:
> > +return nvme_abort(n, cmd, req);
> >  case NVME_ADM_CMD_SET_FEATURES:
> >  return nvme_set_feature(n, cmd, req);
> >  case NVME_ADM_CMD_GET_FEATURES:
> > @@ -1388,6 +1403,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >  id->ieee[2] = 0xb3;
> >  id->ver = cpu_to_le32(0x00010201);
> >  id->oacs = cpu_to_le16(0);
> > +id->acl = 3;
> So we are setting the max number of concurrent commands
> but there is no logic to enforce that and wrap up with the
> status suggested by specification.
> 

That is true, but because the controller always completes the Abort
command immediately this cannot happen. If the controller did try to
abort executing commands, the Abort command would need to linger in the
controller state until a completion queue entry had been posted for the
command to be aborted, and only then could the completion queue entry for
the Abort command itself be posted. This takes up resources in the
controller and is the reason for the Abort Command Limit.

You could argue that we should set ACL to 0 then, but the specification
recommends a value of 3 and I do not see any harm in conveying a
"reasonable", though inconsequential, value.



Re: [PATCH v3 0/8] blockdev: avoid acquiring AioContext lock twice at do_drive_backup and do_blockdev_backup

2019-11-12 Thread no-reply
Patchew URL: https://patchew.org/QEMU/20191112113012.71136-1-...@redhat.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  TEST    iotest-qcow2: 268
Failures: 141
Failed 1 of 108 iotests
make: *** [check-tests/check-block.sh] Error 1
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in 
sys.exit(main())
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=5e0a4e7f97154a93b182d709969b9417', '-u', 
'1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-6a9_8q0n/src/docker-src.2019-11-12-17.38.46.26027:/var/tmp/qemu:z,ro',
 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=5e0a4e7f97154a93b182d709969b9417
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-6a9_8q0n/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    10m57.839s
user    0m8.062s


The full log is available at
http://patchew.org/logs/20191112113012.71136-1-...@redhat.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

Re: [PATCH 2/2] iotests: Test multiple blockdev-snapshot calls

2019-11-12 Thread Peter Krempa
On Fri, Nov 08, 2019 at 09:53:12 +0100, Kevin Wolf wrote:
> Test that doing a second blockdev-snapshot doesn't make the first
> overlay's backing file go away.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  tests/qemu-iotests/273 |  76 +
>  tests/qemu-iotests/273.out | 337 +
>  tests/qemu-iotests/group   |   1 +
>  3 files changed, 414 insertions(+)
>  create mode 100755 tests/qemu-iotests/273
>  create mode 100644 tests/qemu-iotests/273.out

Didn't apply cleanly for me.

> 
> diff --git a/tests/qemu-iotests/273 b/tests/qemu-iotests/273
> new file mode 100755
> index 00..60076de7ce
> --- /dev/null
> +++ b/tests/qemu-iotests/273
> @@ -0,0 +1,76 @@
> +#!/usr/bin/env bash
> +#
> +# Test large write to a qcow2 image

Cut?


Rest looks good

Reviewed-by: Peter Krempa 

> +#
> +# Copyright (C) 2019 Red Hat, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +#
> +
> +seq=$(basename "$0")
> +echo "QA output created by $seq"
> +
> +status=1 # failure is the default!
> +
> +_cleanup()
> +{
> +_cleanup_test_img
> +}
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ./common.rc
> +. ./common.filter
> +
> +# This is a qcow2 regression test
> +_supported_fmt qcow2
> +_supported_proto file
> +_supported_os Linux
> +
> +do_run_qemu()
> +{
> +echo Testing: "$@"
> +$QEMU -nographic -qmp-pretty stdio -nodefaults "$@"

-qmp-pretty, that's useful

> +echo
> +}
> +
> +run_qemu()
> +{
> +do_run_qemu "$@" 2>&1 | _filter_testdir | _filter_qemu | _filter_qmp |
> +_filter_generated_node_ids | _filter_imgfmt | 
> _filter_actual_image_size
> +}
> +
> +TEST_IMG="$TEST_IMG.base" _make_test_img 64M
> +TEST_IMG="$TEST_IMG.mid" _make_test_img -b "$TEST_IMG.base"
> +_make_test_img -b "$TEST_IMG.mid"
> +
> +run_qemu \
> +-blockdev file,node-name=base,filename="$TEST_IMG.base" \
> + -blockdev file,node-name=midf,filename="$TEST_IMG.mid" \
> + -blockdev 
> '{"driver":"qcow2","node-name":"mid","file":"midf","backing":null}' \
> + -blockdev file,node-name=topf,filename="$TEST_IMG" \
> + -blockdev 
> '{"driver":"qcow2","file":"topf","node-name":"top","backing":null}' \
> +< +{"execute":"qmp_capabilities"}
> +{"execute":"blockdev-snapshot","arguments":{"node":"base","overlay":"mid"}}
> +{"execute":"blockdev-snapshot","arguments":{"node":"mid","overlay":"top"}}
> +{"execute":"query-named-block-nodes"}
> +{"execute":"x-debug-query-block-graph"}

Oh, this too!

> +{"execute":"quit"}
> +EOF
> +
> +# success, all done
> +echo "*** done"
> +rm -f $seq.full
> +status=0




Re: [PATCH 1/2] block: Remove 'backing': null from bs->{explicit_,}options

2019-11-12 Thread Peter Krempa
On Fri, Nov 08, 2019 at 09:53:11 +0100, Kevin Wolf wrote:
> bs->options and bs->explicit_options shouldn't contain any options for
> child nodes. bdrv_open_inherited() takes care to remove any options that
> match a child name after opening the image and the same is done when
> reopening.
> 
> However, we miss the case of 'backing': null, which is a child option,
> but results in no child being created. This means that a 'backing': null
> remains in bs->options and bs->explicit_options.
> 
> A typical use for 'backing': null is in live snapshots: blockdev-add for
> the qcow2 overlay makes sure not to open the backing file (because it is

Note that we also use '"backing": null' as a terminator for the last
image in the chain if the user configures the chain manually.

This is kind-of a protection from opening the backing file from the
header if it was misconfigured somehow. I think this functionality
should be kept despite probably not making practical sense.

In my testing this scenario worked properly.

> already opened and blockdev-snapshot will attach it). After doing a
> blockdev-snapshot, bs->options and bs->explicit_options become
> inconsistent with the actual state (bs has a backing file now, but the
> options still say null). On the next occasion that the image is
> reopened, e.g. switching it from read-write to read-only when another
> snapshot is taken, the option will take effect again and the node
> incorrectly loses its backing file.
> 
> Fix bdrv_open_inherited() to remove the 'backing' option from
> bs->options and bs->explicit_options even for the case where it
> specifies that no backing file is wanted.
> 
> Reported-by: Peter Krempa 
> Signed-off-by: Kevin Wolf 

The fix looks sane-enough to me and works as expected, but since I'm not
familiar enough with this code I'm comfortable only with a:

Tested-by: Peter Krempa 

> ---
>  block.c | 2 ++
>  1 file changed, 2 insertions(+)




Re: [RFC v5 000/126] error: auto propagated local_err

2019-11-12 Thread Vladimir Sementsov-Ogievskiy
12.11.2019 16:46, Cornelia Huck wrote:
> On Fri, 8 Nov 2019 22:57:25 +0400
> Marc-André Lureau  wrote:
> 
>> Hi
>>
>> On Fri, Nov 8, 2019 at 7:31 PM Vladimir Sementsov-Ogievskiy
>>  wrote:
>>>
>>> Finally, what is the plan?
>>>
>>> Markus what do you think?
>>>
>>> Now a lot of patches are reviewed, but a lot of are not.
>>>
>>> Is there any hope that all patches will be reviewed? Should I resend the
>>> whole series, or may be reduce it to reviewed subsystems only?
>>
>> I don't think we have well established rules for whole-tree cleanups
>> like this. In the past, several cleanup series got lost.
> 
> Yes, it is always problematic if a series touches a lot of different
> subsystems.
> 
>>
>> It will take ages to get every subsystem maintainer to review the
>> patches. Most likely, since they are quite systematic, there isn't
>> much to say and it is easy to miss something that has some hidden
>> ramifications. Perhaps whole-tree cleanups should require at least 2
>> reviewers to bypass the subsytem maintainer review? But my past
>> experience with this kind of exercice doesn't encourage me, and
>> probably I am not the only one.
> 
> It's not just the reviews; it's easy to miss compile problems on less
> mainstream architectures (and even easier to miss functional problems
> there, although they are probably less likely with automated rework.)
> CI can probably help, but that's something for the future.
> 
> Anyway, I've now gotten around to that series; spotted one problem in
> s390x code, I think.
> 
> One thing that's helpful for such a large series is a git branch that
> makes it easy to give the series a quick go. (You can use patchew, but
> it takes time before it gets all mails, so just pushing it somewhere
> and letting people know is a good idea anyway.)
> 

Thanks for review!

The series are posted here:

https://src.openvz.org/users/vsementsov/repos/qemu/browse

https://src.openvz.org/scm/~vsementsov/qemu.git #tag up-auto-local-err-v5


-- 
Best regards,
Vladimir


Re: [PATCH v2 15/20] nvme: add support for scatter gather lists

2019-11-12 Thread Beata Michalska
Hi Klaus,

On Tue, 15 Oct 2019 at 11:57, Klaus Jensen  wrote:
>
> For now, support the Data Block, Segment and Last Segment descriptor
> types.
>
> See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").
>
> Signed-off-by: Klaus Jensen 
> ---
>  block/nvme.c  |  18 +-
>  hw/block/nvme.c   | 380 --
>  hw/block/trace-events |   3 +
>  include/block/nvme.h  |  62 ++-
>  4 files changed, 398 insertions(+), 65 deletions(-)
>
> diff --git a/block/nvme.c b/block/nvme.c
> index 5be3a39b632e..8825c19c72c2 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -440,7 +440,7 @@ static void nvme_identify(BlockDriverState *bs, int 
> namespace, Error **errp)
>  error_setg(errp, "Cannot map buffer for DMA");
>  goto out;
>  }
> -cmd.prp1 = cpu_to_le64(iova);
> +cmd.dptr.prp.prp1 = cpu_to_le64(iova);
>
>  if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
>  error_setg(errp, "Failed to identify controller");
> @@ -529,7 +529,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
> **errp)
>  }
>  cmd = (NvmeCmd) {
>  .opcode = NVME_ADM_CMD_CREATE_CQ,
> -.prp1 = cpu_to_le64(q->cq.iova),
> +.dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
>  .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
>  .cdw11 = cpu_to_le32(0x3),
>  };
> @@ -540,7 +540,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
> **errp)
>  }
>  cmd = (NvmeCmd) {
>  .opcode = NVME_ADM_CMD_CREATE_SQ,
> -.prp1 = cpu_to_le64(q->sq.iova),
> +.dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
>  .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
>  .cdw11 = cpu_to_le32(0x1 | (n << 16)),
>  };
> @@ -889,16 +889,16 @@ try_map:
>  case 0:
>  abort();
>  case 1:
> -cmd->prp1 = pagelist[0];
> -cmd->prp2 = 0;
> +cmd->dptr.prp.prp1 = pagelist[0];
> +cmd->dptr.prp.prp2 = 0;
>  break;
>  case 2:
> -cmd->prp1 = pagelist[0];
> -cmd->prp2 = pagelist[1];
> +cmd->dptr.prp.prp1 = pagelist[0];
> +cmd->dptr.prp.prp2 = pagelist[1];
>  break;
>  default:
> -cmd->prp1 = pagelist[0];
> -cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
> +cmd->dptr.prp.prp1 = pagelist[0];
> +cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + 
> sizeof(uint64_t));
>  break;
>  }
>  trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index f4b9bd36a04e..0a5cd079df9a 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -296,6 +296,198 @@ unmap:
>  return status;
>  }
>
> +static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
> +NvmeSglDescriptor *segment, uint64_t nsgld, uint32_t *len,
> +NvmeRequest *req)
> +{
> +dma_addr_t addr, trans_len;
> +
> +for (int i = 0; i < nsgld; i++) {
> +if (NVME_SGL_TYPE(segment[i].type) != SGL_DESCR_TYPE_DATA_BLOCK) {
> +trace_nvme_err_invalid_sgl_descriptor(req->cid,
> +NVME_SGL_TYPE(segment[i].type));
> +return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
> +}
> +
> +if (*len == 0) {
> +if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
> +trace_nvme_err_invalid_sgl_excess_length(req->cid);
> +return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> +}
> +
> +break;
> +}
> +
> +addr = le64_to_cpu(segment[i].addr);
> +trans_len = MIN(*len, le64_to_cpu(segment[i].len));
> +
> +if (nvme_addr_is_cmb(n, addr)) {
> +/*
> + * All data and metadata, if any, associated with a particular
> + * command shall be located in either the CMB or host memory. 
> Thus,
> + * if an address if found to be in the CMB and we have already
> + * mapped data that is in host memory, the use is invalid.
> + */
> +if (!nvme_req_is_cmb(req) && qsg->size) {
> +return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> +}
> +
> +nvme_req_set_cmb(req);
> +} else {
> +/*
> + * Similarly, if the address does not reference the CMB, but we
> + * have already established that the request has data or metadata
> + * in the CMB, the use is invalid.
> + */
> +if (nvme_req_is_cmb(req)) {
> +return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> +}
> +}
> +
> +qemu_sglist_add(qsg, addr, trans_len);
> +
> +*len -= trans_len;
> +}
> +
> +return NVME_SUCCESS;
> +}
> +
> +static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
> +NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
> +{
> +const int MAX_NSGLD = 256;
> +
> +

Re: [PATCH v2 14/20] nvme: allow multiple aios per command

2019-11-12 Thread Beata Michalska
Hi Klaus,

On Tue, 15 Oct 2019 at 11:55, Klaus Jensen  wrote:
>
> This refactors how the device issues asynchronous block backend
> requests. The NvmeRequest now holds a queue of NvmeAIOs that are
> associated with the command. This allows multiple aios to be issued for
> a command. Only when all requests have been completed will the device
> post a completion queue entry.
>
> Because the device is currently guaranteed to only issue a single aio
> request per command, the benefit is not immediately obvious. But this
> functionality is required to support metadata.
>
> Signed-off-by: Klaus Jensen 
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme.c   | 455 +-
>  hw/block/nvme.h   | 165 ---
>  hw/block/trace-events |   8 +
>  3 files changed, 511 insertions(+), 117 deletions(-)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index cbc0b6a660b6..f4b9bd36a04e 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -25,6 +25,8 @@
>   *  Default: 64
>   *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
>   *  Default: 0 (disabled)
> + *   mdts= : Maximum Data Transfer Size (power of two)
> + *  Default: 7
>   */
>
>  #include "qemu/osdep.h"
> @@ -56,6 +58,7 @@
>  } while (0)
>
>  static void nvme_process_sq(void *opaque);
> +static void nvme_aio_cb(void *opaque, int ret);
>
>  static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
>  {
> @@ -197,7 +200,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList 
> *qsg, uint64_t prp1,
>  }
>
>  if (nvme_addr_is_cmb(n, prp1)) {
> -req->is_cmb = true;
> +nvme_req_set_cmb(req);
>  }
>
>  pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
> @@ -255,8 +258,8 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList 
> *qsg, uint64_t prp1,
>  }
>
>  addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
> -if ((req->is_cmb && !addr_is_cmb) ||
> -(!req->is_cmb && addr_is_cmb)) {
> +if ((nvme_req_is_cmb(req) && !addr_is_cmb) ||
> +(!nvme_req_is_cmb(req) && addr_is_cmb)) {
>  status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
>  goto unmap;
>  }
> @@ -269,8 +272,8 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList 
> *qsg, uint64_t prp1,
>  }
>  } else {
>  bool addr_is_cmb = nvme_addr_is_cmb(n, prp2);
> -if ((req->is_cmb && !addr_is_cmb) ||
> -(!req->is_cmb && addr_is_cmb)) {
> +if ((nvme_req_is_cmb(req) && !addr_is_cmb) ||
> +(!nvme_req_is_cmb(req) && addr_is_cmb)) {
>  status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
>  goto unmap;
>  }
> @@ -312,7 +315,7 @@ static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t 
> *ptr, uint32_t len,
>  return status;
>  }
>
> -if (req->is_cmb) {
> +if (nvme_req_is_cmb(req)) {
>  QEMUIOVector iov;
>
>  qemu_iovec_init(&iov, qsg.nsg);
> @@ -341,19 +344,18 @@ static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t 
> *ptr, uint32_t len,
Any reason why the nvme_dma_write_prp is missing the changes applied
to nvme_dma_read_prp ?

>  static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
>  uint64_t prp1, uint64_t prp2, NvmeRequest *req)
>  {
> -QEMUSGList qsg;
>  uint16_t status = NVME_SUCCESS;
>
> -status = nvme_map_prp(n, &qsg, prp1, prp2, len, req);
> +status = nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
>  if (status) {
>  return status;
>  }
>
> -if (req->is_cmb) {
> +if (nvme_req_is_cmb(req)) {
>  QEMUIOVector iov;
>
> -qemu_iovec_init(&iov, qsg.nsg);
> -dma_to_cmb(n, &qsg, &iov);
> +qemu_iovec_init(&iov, req->qsg.nsg);
> +dma_to_cmb(n, &req->qsg, &iov);
>
>  if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
>  trace_nvme_err_invalid_dma();
> @@ -365,17 +367,137 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
> *ptr, uint32_t len,
>  goto out;
>  }
>
> -if (unlikely(dma_buf_read(ptr, len, &qsg))) {
> +if (unlikely(dma_buf_read(ptr, len, &req->qsg))) {
>  trace_nvme_err_invalid_dma();
>  status = NVME_INVALID_FIELD | NVME_DNR;
>  }
>
>  out:
> -qemu_sglist_destroy(&qsg);
> +qemu_sglist_destroy(&req->qsg);
>
>  return status;
>  }
>
> +static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +NvmeNamespace *ns = req->ns;
> +
> +uint32_t len = req->nlb << nvme_ns_lbads(ns);
> +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +
> +return nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
> +}
> +
> +static void nvme_aio_destroy(NvmeAIO *aio)
> +{
> +if (aio->iov.nalloc) {
> +

Re: [PATCH v2 19/20] nvme: make lba data size configurable

2019-11-12 Thread Beata Michalska
Hi Klaus,

On Tue, 15 Oct 2019 at 11:50, Klaus Jensen  wrote:
>
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme-ns.c | 2 +-
>  hw/block/nvme-ns.h | 4 +++-
>  hw/block/nvme.c| 1 +
>  3 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index aa76bb63ef45..70ff622a5729 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -18,7 +18,7 @@ static int nvme_ns_init(NvmeNamespace *ns)
>  {
>  NvmeIdNs *id_ns = >id_ns;
>
> -id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> +id_ns->lbaf[0].ds = ns->params.lbads;
>  id_ns->nuse = id_ns->ncap = id_ns->nsze =
>  cpu_to_le64(nvme_ns_nlbas(ns));
>
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 64dd054cf6a9..aa1c81d85cde 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -6,10 +6,12 @@
>  OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
>
>  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> -DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> +DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> +DEFINE_PROP_UINT8("lbads", _state, _props.lbads, 9)
>
Could we actually use BDRV_SECTOR_BITS instead of magic numbers?

BR

Beata


>  typedef struct NvmeNamespaceParams {
>  uint32_t nsid;
> +uint8_t  lbads;
>  } NvmeNamespaceParams;
>
>  typedef struct NvmeNamespace {
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 67f92bf5a3ac..d0103c16cfe9 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -2602,6 +2602,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> **errp)
>  if (n->namespace.conf.blk) {
>  ns = >namespace;
>  ns->params.nsid = 1;
> +ns->params.lbads = 9;
>
>  if (nvme_ns_setup(n, ns, _err)) {
>  error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
> --
> 2.23.0
>
>



Re: [PATCH v2 13/20] nvme: refactor prp mapping

2019-11-12 Thread Beata Michalska
Hi Klaus,

On Tue, 15 Oct 2019 at 11:57, Klaus Jensen  wrote:
>
> Instead of handling both QSGs and IOVs in multiple places, simply use
> QSGs everywhere by assuming that the request does not involve the
> controller memory buffer (CMB). If the request is found to involve the
> CMB, convert the QSG to an IOV and issue the I/O. The QSG is converted
> to an IOV by the dma helpers anyway, so the CMB path is not unfairly
> affected by this simplifying change.
>

Out of curiosity, in how many cases the SG list will have to
be converted to IOV ? Does that justify creating the SG list in vain ?

> As a side-effect, this patch also allows PRPs to be located in the CMB.
> The logic ensures that if some of the PRP is in the CMB, all of it must
> be located there, as per the specification.
>
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme.c   | 255 --
>  hw/block/nvme.h   |   4 +-
>  hw/block/trace-events |   1 +
>  include/block/nvme.h  |   1 +
>  4 files changed, 174 insertions(+), 87 deletions(-)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 1e2320b38b14..cbc0b6a660b6 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -179,138 +179,200 @@ static void nvme_set_error_page(NvmeCtrl *n, uint16_t 
> sqid, uint16_t cid,
>  n->elp_index = (n->elp_index + 1) % n->params.elpe;
>  }
>
> -static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t 
> prp1,
> - uint64_t prp2, uint32_t len, NvmeCtrl *n)
> +static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
> +uint64_t prp2, uint32_t len, NvmeRequest *req)
>  {
>  hwaddr trans_len = n->page_size - (prp1 % n->page_size);
>  trans_len = MIN(len, trans_len);
>  int num_prps = (len >> n->page_bits) + 1;
> +uint16_t status = NVME_SUCCESS;
> +bool prp_list_in_cmb = false;
> +
> +trace_nvme_map_prp(req->cid, req->cmd.opcode, trans_len, len, prp1, prp2,
> +num_prps);
>
>  if (unlikely(!prp1)) {
>  trace_nvme_err_invalid_prp();
>  return NVME_INVALID_FIELD | NVME_DNR;
> -} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
> -   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
> -qsg->nsg = 0;
> -qemu_iovec_init(iov, num_prps);
> -qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], 
> trans_len);
> -} else {
> -pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
> -qemu_sglist_add(qsg, prp1, trans_len);
>  }
> +
> +if (nvme_addr_is_cmb(n, prp1)) {
> +req->is_cmb = true;
> +}
> +
This seems to be used here and within read/write functions which are calling
this one. Maybe there is a nicer way to track that instead of passing
the request
from multiple places ?

> +pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
> +qemu_sglist_add(qsg, prp1, trans_len);
> +
>  len -= trans_len;
>  if (len) {
>  if (unlikely(!prp2)) {
>  trace_nvme_err_invalid_prp2_missing();
> +status = NVME_INVALID_FIELD | NVME_DNR;
>  goto unmap;
>  }
> +
>  if (len > n->page_size) {
>  uint64_t prp_list[n->max_prp_ents];
>  uint32_t nents, prp_trans;
>  int i = 0;
>
> +if (nvme_addr_is_cmb(n, prp2)) {
> +prp_list_in_cmb = true;
> +}
> +
>  nents = (len + n->page_size - 1) >> n->page_bits;
>  prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> -nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
> +nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
>  while (len != 0) {
> +bool addr_is_cmb;
>  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
>
>  if (i == n->max_prp_ents - 1 && len > n->page_size) {
>  if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
>  trace_nvme_err_invalid_prplist_ent(prp_ent);
> +status = NVME_INVALID_FIELD | NVME_DNR;
> +goto unmap;
> +}
> +
> +addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
> +if ((prp_list_in_cmb && !addr_is_cmb) ||
> +(!prp_list_in_cmb && addr_is_cmb)) {

Minor: Same condition (based on different vars) is being used in
multiple places. Might be worth to move it outside and just pass in
the needed values.
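
Something like this hypothetical helper, for instance (the name is made up,
purely to illustrate the idea):

    static inline bool nvme_cmb_mismatch(bool mapped_in_cmb, bool addr_is_cmb)
    {
        return mapped_in_cmb != addr_is_cmb;
    }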

> +status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
>  goto unmap;
>  }
>
>  i = 0;
>  nents = (len + n->page_size - 1) >> n->page_bits;
>  prp_trans = MIN(n->max_prp_ents, nents) * 
> sizeof(uint64_t);
> -nvme_addr_read(n, prp_ent, (void *)prp_list,
> -prp_trans);
> +   

Re: [PATCH v2 12/20] nvme: bump supported specification version to 1.3

2019-11-12 Thread Beata Michalska
Hi Klaus,

On Tue, 15 Oct 2019 at 11:52, Klaus Jensen  wrote:
>
> Add the new Namespace Identification Descriptor List (CNS 03h) and track
> creation of queues to enable the controller to return Command Sequence
> Error if Set Features is called for Number of Queues after any queues
> have been created.
>
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme.c   | 82 +++
>  hw/block/nvme.h   |  1 +
>  hw/block/trace-events |  8 +++--
>  include/block/nvme.h  | 30 +---
>  4 files changed, 100 insertions(+), 21 deletions(-)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index e7d46dcc6afe..1e2320b38b14 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -9,20 +9,22 @@
>   */
>
>  /**
> - * Reference Specification: NVM Express 1.2.1
> + * Reference Specification: NVM Express 1.3d
>   *
>   *   https://nvmexpress.org/resources/specifications/
>   */
>
>  /**
>   * Usage: add options:
> - *  -drive file=,if=none,id=
> - *  -device nvme,drive=,serial=,id=, \
> - *  cmb_size_mb=, \
> - *  num_queues=
> + * -drive file=,if=none,id=
> + * -device nvme,drive=,serial=,id=
>   *
> - * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
> - * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> + * Advanced optional options:
> + *
> + *   num_queues=  : Maximum number of IO Queues.
> + *  Default: 64
> + *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
> + *  Default: 0 (disabled)
>   */
>
>  #include "qemu/osdep.h"
> @@ -345,6 +347,8 @@ static void nvme_post_cqes(void *opaque)
>  static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>  {
>  assert(cq->cqid == req->sq->cqid);
> +
> +trace_nvme_enqueue_req_completion(req->cid, cq->cqid, req->status);
>  QTAILQ_REMOVE(&cq->sq->out_req_list, req, entry);
>  QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
>  timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> @@ -530,6 +534,7 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
>  if (sq->sqid) {
>  g_free(sq);
>  }
> +n->qs_created--;
>  }
>
>  static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
> @@ -596,6 +601,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, 
> uint64_t dma_addr,
>  cq = n->cq[cqid];
>  QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
>  n->sq[sqid] = sq;
> +n->qs_created++;
>  }
>
>  static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
> @@ -742,7 +748,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, 
> NvmeRequest *req)
>  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
>  uint32_t dw12 = le32_to_cpu(cmd->cdw12);
>  uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> -uint16_t lid = dw10 & 0xff;
> +uint8_t  lid = dw10 & 0xff;
> +uint8_t  lsp = (dw10 >> 8) & 0xf;
>  uint8_t  rae = (dw10 >> 15) & 0x1;
>  uint32_t numdl, numdu;
>  uint64_t off, lpol, lpou;
> @@ -760,7 +767,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, 
> NvmeRequest *req)
>  return NVME_INVALID_FIELD | NVME_DNR;
>  }
>
> -trace_nvme_get_log(req->cid, lid, rae, len, off);
> +trace_nvme_get_log(req->cid, lid, lsp, rae, len, off);
>
>  switch (lid) {
>  case NVME_LOG_ERROR_INFO:
> @@ -784,6 +791,7 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
>  if (cq->cqid) {
>  g_free(cq);
>  }
> +n->qs_created--;
>  }
>
>  static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
> @@ -824,6 +832,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, 
> uint64_t dma_addr,
>  msix_vector_use(&n->parent_obj, cq->vector);
>  n->cq[cqid] = cq;
>  cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
> +n->qs_created++;
>  }
>
>  static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
> @@ -897,7 +906,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
> NvmeIdentify *c)
>  prp1, prp2);
>  }
>
> -static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
> +static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
>  {
>  static const int data_len = 4 * KiB;
>  uint32_t min_nsid = le32_to_cpu(c->nsid);
> @@ -907,7 +916,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
> NvmeIdentify *c)
>  uint16_t ret;
>  int i, j = 0;
>
> -trace_nvme_identify_nslist(min_nsid);
> +trace_nvme_identify_ns_list(min_nsid);
>
>  list = g_malloc0(data_len);
>  for (i = 0; i < n->num_namespaces; i++) {
> @@ -924,6 +933,41 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
> NvmeIdentify *c)
>  return ret;
>  }
>
> +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> +{
> +static const int len = 4096;
> +
> +struct ns_descr {
> +uint8_t nidt;
> +uint8_t nidl;
> +uint8_t rsvd2[2];
> +uint8_t nid[16];
> +};

Re: [PATCH v2 08/20] nvme: add support for the get log page command

2019-11-12 Thread Beata Michalska
Hi Klaus,


On Tue, 15 Oct 2019 at 11:45, Klaus Jensen  wrote:
>
> Add support for the Get Log Page command and basic implementations
> of the mandatory Error Information, SMART/Health Information and
> Firmware Slot Information log pages.
>
> In violation of the specification, the SMART/Health Information log page
> does not persist information over the lifetime of the controller because
> the device has no place to store such persistent state.
>
> Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> Section 5.10 ("Get Log Page command").
>
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme.c   | 150 +-
>  hw/block/nvme.h   |   9 ++-
>  hw/block/trace-events |   2 +
>  include/block/nvme.h  |   2 +-
>  4 files changed, 160 insertions(+), 3 deletions(-)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 1fdb3b8655ed..4412a3bea3bc 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -44,6 +44,7 @@
>  #include "nvme.h"
>
>  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> +#define NVME_TEMPERATURE 0x143
>
>  #define NVME_GUEST_ERR(trace, fmt, ...) \
>  do { \
> @@ -577,6 +578,137 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd 
> *cmd)
>  return NVME_SUCCESS;
>  }
>
> +static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd,
> +uint32_t buf_len, uint64_t off, NvmeRequest *req)
> +{
> +uint32_t trans_len;
> +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +
> +if (off > sizeof(*n->elpes) * (n->params.elpe + 1)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +trans_len = MIN(sizeof(*n->elpes) * (n->params.elpe + 1) - off, buf_len);
> +
> +return nvme_dma_read_prp(n, (uint8_t *) n->elpes + off, trans_len, prp1,
> +prp2);
> +}
> +
> +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> +uint64_t off, NvmeRequest *req)
> +{
> +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +uint32_t nsid = le32_to_cpu(cmd->nsid);
> +
> +uint32_t trans_len;
> +time_t current_ms;
> +uint64_t units_read = 0, units_written = 0, read_commands = 0,
> +write_commands = 0;
> +NvmeSmartLog smart;
> +BlockAcctStats *s;
> +
> +if (!nsid || (nsid != 0xffffffff && nsid > n->num_namespaces)) {
> +trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
> +return NVME_INVALID_NSID | NVME_DNR;
> +}
> +
The LPA '0' bit is cleared now - which means there is no support
for per-namespace data. So in theory, if that was the aim, this condition
should check for values different from 0x0 and 0xffffffff and either
abort the command or treat that as a request for controller-specific data.
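
Something along these lines, perhaps (just a sketch of the idea, not taken
from the patch):

    if (nsid && nsid != 0xffffffff) {
        /* per-namespace SMART data is not supported; either reject the
         * nsid here or fall through to the controller-wide data */
        return NVME_INVALID_FIELD | NVME_DNR;
    }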

> +s = blk_get_stats(n->conf.blk);
> +
> +units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> +units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> +read_commands = s->nr_ops[BLOCK_ACCT_READ];
> +write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> +
> +if (off > sizeof(smart)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +trans_len = MIN(sizeof(smart) - off, buf_len);
> +
> +memset(&smart, 0x0, sizeof(smart));
> +
> +smart.data_units_read[0] = cpu_to_le64(units_read / 1000);
> +smart.data_units_written[0] = cpu_to_le64(units_written / 1000);
> +smart.host_read_commands[0] = cpu_to_le64(read_commands);
> +smart.host_write_commands[0] = cpu_to_le64(write_commands);
> +
> +smart.number_of_error_log_entries[0] = cpu_to_le64(0);
> +smart.temperature[0] = n->temperature & 0xff;
> +smart.temperature[1] = (n->temperature >> 8) & 0xff;
> +
> +if (n->features.temp_thresh <= n->temperature) {
> +smart.critical_warning |= NVME_SMART_TEMPERATURE;
> +}
> +
> +current_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> +smart.power_on_hours[0] = cpu_to_le64(
> +(((current_ms - n->starttime_ms) / 1000) / 60) / 60);
> +
> +return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> +prp2);
> +}
> +
> +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> +uint64_t off, NvmeRequest *req)
> +{
> +uint32_t trans_len;
> +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +NvmeFwSlotInfoLog fw_log;
> +
> +if (off > sizeof(fw_log)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +memset(&fw_log, 0, sizeof(NvmeFwSlotInfoLog));
> +
> +trans_len = MIN(sizeof(fw_log) - off, buf_len);
> +
> +return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> +prp2);
> +}
> +
> +static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> +uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> +uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> +uint32_t dw13 = 

Re: [PATCH v2 04/20] nvme: populate the mandatory subnqn and ver fields

2019-11-12 Thread Beata Michalska
Hi Klaus

On Tue, 15 Oct 2019 at 11:42, Klaus Jensen  wrote:
>
> Required for compliance with NVMe revision 1.2.1 or later. See NVM
> Express 1.2.1, Section 5.11 ("Identify command"), Figure 90 and Section
> 7.9 ("NVMe Qualified Names").
>
> This also bumps the supported version to 1.2.1.
>
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 277700fdcc58..16f0fba10b08 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -9,9 +9,9 @@
>   */
>
>  /**
> - * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
> + * Reference Specification: NVM Express 1.2.1
>   *
> - *  http://www.nvmexpress.org/resources/
> + *   https://nvmexpress.org/resources/specifications/
>   */
>
>  /**
> @@ -1366,6 +1366,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> **errp)
>  id->ieee[0] = 0x00;
>  id->ieee[1] = 0x02;
>  id->ieee[2] = 0xb3;
> +id->ver = cpu_to_le32(0x00010201);
>  id->oacs = cpu_to_le16(0);
>  id->frmw = 7 << 1;
>  id->lpa = 1 << 0;
> @@ -1373,6 +1374,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> **errp)
>  id->cqes = (0x4 << 4) | 0x4;
>  id->nn = cpu_to_le32(n->num_namespaces);
>  id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
> +
> +strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
> +pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
> +
>  id->psd[0].mp = cpu_to_le16(0x9c4);
>  id->psd[0].enlat = cpu_to_le32(0x10);
>  id->psd[0].exlat = cpu_to_le32(0x4);
> @@ -1387,7 +1392,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> **errp)
>  NVME_CAP_SET_CSS(n->bar.cap, 1);
>  NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
>
> -n->bar.vs = 0x00010200;
> +n->bar.vs = 0x00010201;

Very minor:

The version number is being set twice in the patch series already.
And it is being set in two places.
It might be worth to make a #define out of it so that only one
needs to be changed.

BR


Beata
>  n->bar.intmc = n->bar.intms = 0;
>
>  if (n->params.cmb_size_mb) {
> --
> 2.23.0
>
>



Re: [PATCH v2 09/20] nvme: add support for the asynchronous event request command

2019-11-12 Thread Beata Michalska
Hi Klaus,

On Tue, 15 Oct 2019 at 11:49, Klaus Jensen  wrote:
>
> Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> Section 5.2 ("Asynchronous Event Request command").
>
> Mostly imported from Keith's qemu-nvme tree. Modified to not enqueue
> events if something of the same type is already queued (but not cleared
> by the host).
>
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme.c   | 180 --
>  hw/block/nvme.h   |  13 ++-
>  hw/block/trace-events |   8 ++
>  include/block/nvme.h  |   4 +-
>  4 files changed, 196 insertions(+), 9 deletions(-)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 4412a3bea3bc..5cdee37582f9 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -334,6 +334,46 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
> NvmeRequest *req)
>  timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>  }
>
> +static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type,
> +uint8_t event_info, uint8_t log_page)
> +{
> +NvmeAsyncEvent *event;
> +
> +trace_nvme_enqueue_event(event_type, event_info, log_page);
> +
> +/*
> + * Do not enqueue the event if something of this type is already queued.
> + * This bounds the size of the event queue and makes sure it does not 
> grow
> + * indefinitely when events are not processed by the host (i.e. does not
> + * issue any AERs).
> + */
> +if (n->aer_mask_queued & (1 << event_type)) {
> +trace_nvme_enqueue_event_masked(event_type);
> +return;
> +}
> +n->aer_mask_queued |= (1 << event_type);
> +
> +event = g_new(NvmeAsyncEvent, 1);
> +event->result = (NvmeAerResult) {
> +.event_type = event_type,
> +.event_info = event_info,
> +.log_page   = log_page,
> +};
> +
> +QTAILQ_INSERT_TAIL(&n->aer_queue, event, entry);
> +
> +timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> +}
> +
> +static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
> +{
> +n->aer_mask &= ~(1 << event_type);
> +if (!QTAILQ_EMPTY(&n->aer_queue)) {
> +timer_mod(n->aer_timer,
> +qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> +}
> +}
> +
>  static void nvme_rw_cb(void *opaque, int ret)
>  {
>  NvmeRequest *req = opaque;
> @@ -578,7 +618,7 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
>  return NVME_SUCCESS;
>  }
>
> -static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd,
> +static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
>  uint32_t buf_len, uint64_t off, NvmeRequest *req)
>  {
>  uint32_t trans_len;
> @@ -591,12 +631,16 @@ static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd 
> *cmd,
>
>  trans_len = MIN(sizeof(*n->elpes) * (n->params.elpe + 1) - off, buf_len);
>
> +if (!rae) {
> +nvme_clear_events(n, NVME_AER_TYPE_ERROR);
> +}
> +
>  return nvme_dma_read_prp(n, (uint8_t *) n->elpes + off, trans_len, prp1,
>  prp2);
>  }
>
> -static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> -uint64_t off, NvmeRequest *req)
> +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> +uint32_t buf_len, uint64_t off, NvmeRequest *req)
>  {
>  uint64_t prp1 = le64_to_cpu(cmd->prp1);
>  uint64_t prp2 = le64_to_cpu(cmd->prp2);
> @@ -646,6 +690,10 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd 
> *cmd, uint32_t buf_len,
>  smart.power_on_hours[0] = cpu_to_le64(
>  (((current_ms - n->starttime_ms) / 1000) / 60) / 60);
>
> +if (!rae) {
> +nvme_clear_events(n, NVME_AER_TYPE_SMART);
> +}
> +
> +return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
>  prp2);
>  }
> @@ -698,9 +746,9 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, 
> NvmeRequest *req)
>
>  switch (lid) {
>  case NVME_LOG_ERROR_INFO:
> -return nvme_error_info(n, cmd, len, off, req);
> +return nvme_error_info(n, cmd, rae, len, off, req);
>  case NVME_LOG_SMART_INFO:
> -return nvme_smart_info(n, cmd, len, off, req);
> +return nvme_smart_info(n, cmd, rae, len, off, req);
>  case NVME_LOG_FW_SLOT_INFO:
>  return nvme_fw_log_info(n, cmd, len, off, req);
>  default:
> @@ -958,6 +1006,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
> *cmd, NvmeRequest *req)
>  break;
>  case NVME_TIMESTAMP:
>  return nvme_get_feature_timestamp(n, cmd);
> +case NVME_ASYNCHRONOUS_EVENT_CONF:
> +result = cpu_to_le32(n->features.async_config);
> +break;
>  default:
>  trace_nvme_err_invalid_getfeat(dw10);
>  return NVME_INVALID_FIELD | NVME_DNR;
> @@ -993,6 +1044,12 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> *cmd, NvmeRequest *req)
>  switch (dw10) {
>  case NVME_TEMPERATURE_THRESHOLD:
>  n->features.temp_thresh = dw11;
> +

Re: [PATCH v2 06/20] nvme: add support for the abort command

2019-11-12 Thread Beata Michalska
Hi Klaus

On Tue, 15 Oct 2019 at 11:41, Klaus Jensen  wrote:
>
> Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> Section 5.1 ("Abort command").
>
> The Abort command is a best effort command; for now, the device always
> fails to abort the given command.
>
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme.c | 16 
>  1 file changed, 16 insertions(+)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index daa2367b0863..84e4f2ea7a15 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -741,6 +741,18 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
>  }
>  }
>
> +static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
> +
> +req->cqe.result = 1;
> +if (nvme_check_sqid(n, sqid)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
Shouldn't we validate the CID as well ?

> +return NVME_SUCCESS;
> +}
> +
>  static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
>  {
>  trace_nvme_setfeat_timestamp(ts);
> @@ -859,6 +871,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> *cmd, NvmeRequest *req)
>  trace_nvme_err_invalid_setfeat(dw10);
>  return NVME_INVALID_FIELD | NVME_DNR;
>  }
> +
>  return NVME_SUCCESS;
>  }
>
> @@ -875,6 +888,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
> NvmeRequest *req)
>  return nvme_create_cq(n, cmd);
>  case NVME_ADM_CMD_IDENTIFY:
>  return nvme_identify(n, cmd);
> +case NVME_ADM_CMD_ABORT:
> +return nvme_abort(n, cmd, req);
>  case NVME_ADM_CMD_SET_FEATURES:
>  return nvme_set_feature(n, cmd, req);
>  case NVME_ADM_CMD_GET_FEATURES:
> @@ -1388,6 +1403,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> **errp)
>  id->ieee[2] = 0xb3;
>  id->ver = cpu_to_le32(0x00010201);
>  id->oacs = cpu_to_le16(0);
> +id->acl = 3;
So we are setting the max number of concurrent commands
but there is no logic to enforce that and wrap up with the
status suggested by specification.

BR


Beata
>  id->frmw = 7 << 1;
>  id->lpa = 1 << 0;
>  id->sqes = (0x6 << 4) | 0x6;
> --
> 2.23.0
>
>



Re: [RFC PATCH 18/18] qemu-storage-daemon: Add --monitor option

2019-11-12 Thread Markus Armbruster
Kevin Wolf  writes:

> This adds and parses the --monitor option, so that a QMP monitor can be
> used in the storage daemon. The monitor offers commands defined in the
> QAPI schema at storage-daemon/qapi/qapi-schema.json.

I feel we should explain the module sharing between the two QAPI
schemata here.  It's not exactly trivial, as we'll see below.

> Signed-off-by: Kevin Wolf 
> ---
>  storage-daemon/qapi/qapi-schema.json | 15 
>  qemu-storage-daemon.c| 34 
>  Makefile | 30 
>  Makefile.objs|  4 ++--
>  monitor/Makefile.objs|  2 ++
>  qapi/Makefile.objs   |  5 
>  qom/Makefile.objs|  1 +
>  scripts/qapi/gen.py  |  5 
>  storage-daemon/Makefile.objs |  1 +
>  storage-daemon/qapi/Makefile.objs|  1 +
>  10 files changed, 96 insertions(+), 2 deletions(-)
>  create mode 100644 storage-daemon/qapi/qapi-schema.json
>  create mode 100644 storage-daemon/Makefile.objs
>  create mode 100644 storage-daemon/qapi/Makefile.objs
>
> diff --git a/storage-daemon/qapi/qapi-schema.json 
> b/storage-daemon/qapi/qapi-schema.json
> new file mode 100644
> index 00..58c561ebea
> --- /dev/null
> +++ b/storage-daemon/qapi/qapi-schema.json
> @@ -0,0 +1,15 @@
> +# -*- Mode: Python -*-
> +
> +{ 'include': '../../qapi/pragma.json' }
> +
> +{ 'include': '../../qapi/block.json' }
> +{ 'include': '../../qapi/block-core.json' }
> +{ 'include': '../../qapi/char.json' }
> +{ 'include': '../../qapi/common.json' }
> +{ 'include': '../../qapi/crypto.json' }
> +{ 'include': '../../qapi/introspect.json' }
> +{ 'include': '../../qapi/job.json' }
> +{ 'include': '../../qapi/monitor.json' }
> +{ 'include': '../../qapi/qom.json' }
> +{ 'include': '../../qapi/sockets.json' }
> +{ 'include': '../../qapi/transaction.json' }
> diff --git a/qemu-storage-daemon.c b/qemu-storage-daemon.c
> index 46e0a6ea56..4939e6b41f 100644
> --- a/qemu-storage-daemon.c
> +++ b/qemu-storage-daemon.c
> @@ -28,12 +28,16 @@
>  #include "block/nbd.h"
>  #include "chardev/char.h"
>  #include "crypto/init.h"
> +#include "monitor/monitor.h"
> +#include "monitor/monitor-internal.h"
>  
>  #include "qapi/error.h"
>  #include "qapi/qapi-commands-block.h"
>  #include "qapi/qapi-commands-block-core.h"
> +#include "qapi/qapi-commands-monitor.h"

Aren't these three redundant with ...

>  #include "qapi/qapi-visit-block.h"
>  #include "qapi/qapi-visit-block-core.h"
> +#include "qapi/qmp/qstring.h"
>  #include "qapi/qobject-input-visitor.h"
>  
>  #include "qemu-common.h"
> @@ -46,6 +50,8 @@
>  #include "qemu/option.h"
>  #include "qom/object_interfaces.h"
>  
> +#include "storage-daemon/qapi/qapi-commands.h"
> +

... this one now?

>  #include "sysemu/runstate.h"
>  #include "trace/control.h"
>  
> @@ -58,6 +64,11 @@ void qemu_system_killed(int signal, pid_t pid)
>  exit_requested = true;
>  }
>  
> +void qmp_quit(Error **errp)
> +{
> +exit_requested = true;
> +}
> +
>  static void help(void)
>  {
>  printf(
> @@ -101,6 +112,7 @@ enum {
>  OPTION_OBJECT = 256,
>  OPTION_BLOCKDEV,
>  OPTION_CHARDEV,
> +OPTION_MONITOR,
>  OPTION_NBD_SERVER,
>  OPTION_EXPORT,
>  };

Recommend to keep these sorted alphabetically.

> @@ -116,6 +128,17 @@ static QemuOptsList qemu_object_opts = {
>  },
>  };
>  
> +static void init_qmp_commands(void)
> +{
> +qmp_init_marshal(&qmp_commands);
> +qmp_register_command(&qmp_commands, "query-qmp-schema",
> + qmp_query_qmp_schema, QCO_ALLOW_PRECONFIG);
> +
> +QTAILQ_INIT(&qmp_cap_negotiation_commands);
> +qmp_register_command(&qmp_cap_negotiation_commands, "qmp_capabilities",
> + qmp_marshal_qmp_capabilities, QCO_ALLOW_PRECONFIG);
> +}

Duplicates monitor_init_qmp_commands() less two 'gen': false commands
that don't exist in the storage daemon.

Hmm, could we let commands.py generate command registration even for
'gen': false commands?

> +
>  static void init_export(BlockExport *export, Error **errp)
>  {
>  switch (export->type) {
> @@ -138,6 +161,7 @@ static int process_options(int argc, char *argv[], Error 
> **errp)
>  {"object", required_argument, 0, OPTION_OBJECT},
>  {"blockdev", required_argument, 0, OPTION_BLOCKDEV},
>  {"chardev", required_argument, 0, OPTION_CHARDEV},
> +{"monitor", required_argument, 0, OPTION_MONITOR},
>  {"nbd-server", required_argument, 0, OPTION_NBD_SERVER},
>  {"export", required_argument, 0, OPTION_EXPORT},
>  {"version", no_argument, 0, 'V'},
> @@ -208,6 +232,14 @@ static int process_options(int argc, char *argv[], Error 
> **errp)
>  qemu_opts_del(opts);
>  break;
>  }
> +case OPTION_MONITOR:
> +{
> +QemuOpts *opts = qemu_opts_parse(_mon_opts,
> + 

Re: [RFC v5 000/126] error: auto propagated local_err

2019-11-12 Thread Cornelia Huck
On Fri, 8 Nov 2019 22:57:25 +0400
Marc-André Lureau  wrote:

> Hi
> 
> On Fri, Nov 8, 2019 at 7:31 PM Vladimir Sementsov-Ogievskiy
>  wrote:
> >
> > Finally, what is the plan?
> >
> > Markus what do you think?
> >
> > Now a lot of patches are reviewed, but a lot of are not.
> >
> > Is there any hope that all patches will be reviewed? Should I resend the
> > whole series, or may be reduce it to reviewed subsystems only?  
> 
> I don't think we have well established rules for whole-tree cleanups
> like this. In the past, several cleanup series got lost.

Yes, it is always problematic if a series touches a lot of different
subsystems.

> 
> It will take ages to get every subsystem maintainer to review the
> patches. Most likely, since they are quite systematic, there isn't
> much to say and it is easy to miss something that has some hidden
> ramifications. Perhaps whole-tree cleanups should require at least 2
> reviewers to bypass the subsytem maintainer review? But my past
> experience with this kind of exercice doesn't encourage me, and
> probably I am not the only one.

It's not just the reviews; it's easy to miss compile problems on less
mainstream architectures (and even easier to miss functional problems
there, although they are probably less likely with automated rework.)
CI can probably help, but that's something for the future.

Anyway, I've now gotten around to that series; spotted one problem in
s390x code, I think.

One thing that's helpful for such a large series is a git branch that
makes it easy to give the series a quick go. (You can use patchew, but
it takes time before it gets all mails, so just pushing it somewhere
and letting people know is a good idea anyway.)




Re: [RFC v5 026/126] python: add commit-per-subsystem.py

2019-11-12 Thread Cornelia Huck
On Fri, 11 Oct 2019 19:04:12 +0300
Vladimir Sementsov-Ogievskiy  wrote:

> Add script to automatically commit tree-wide changes per-subsystem.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---

I think this still needs some notes as to the supposed usage.




Re: [PATCH v2 00/11] RFC crypto/luks: encryption key managment using amend interface

2019-11-12 Thread Maxim Levitsky
On Tue, 2019-11-12 at 12:58 +0100, Max Reitz wrote:
> On 08.11.19 16:07, Maxim Levitsky wrote:
> > On Fri, 2019-10-04 at 21:10 +0200, Max Reitz wrote:
> > > On 13.09.19 00:30, Maxim Levitsky wrote:
> > > > This patch series is continuation of my work to add encryption
> > > > key management to luks/qcow2 with luks.
> > > > 
> > > > This is second version of this patch set.
> > > > The changes are mostly addressing the review feedback,
> > > > plus I tested (and fixed sadly) the somewhat ugly code
> > > > that allows to still write share a raw luks device,
> > > > while preventing the key management from happening in this case,
> > > > as it is unsafe.
> > > > I added a new iotest dedicated to that as well.
> > > > 
> > > > Best regards,
> > > > Maxim Levitsky
> > > 
> > > At least for an RFC looks good from my perspective.  I didn’t look at
> > > the crypto things very closely (assuming Dan would do so), and I didn’t
> > > check the iotests in detail.  (But it definitely doesn’t look like they
> > > lack in breadth.  Maybe I’d like to see a test that you cannot have
> > > other useful nodes attached to the LUKS or qcow2 node while the
> > > amendment process is ongoing (because CONSISTENT_READ is unshared).  But
> > > that’s the only thing I can think of.)
> > 
> > Could you elaborate on this? 
> > 
> > Inside the same process several users can access that luks node at the same
> > time while one of them changes encryption keys, since this doesn't affect 
> > IO of the data.
> > 
> > Two users in same process I was *I think* told that can't do the amend in 
> > the same time
> > since qmp is protected with a lock. However since I use a block job (to be 
> > consistent with blockdev-create)
> > I wonder if several qmp amend commands couldn't race one with another. 
> > These jobs is running
> > on the block device AIO context (I changed this recently after a review), 
> > but still I am not sure
> > there can't be a race.
> > 
> > And when there is access to the same image from multiple processes, I do 
> > have a test that
> > checks that as long as more than one process has the image open, no one can 
> > change the encryption keys
> > (this is only relevant for raw luks format, since for qcow2 this is 
> > forbidden anyway).
> 
> Yes, sorry, I don’t remember/know where I got the qcow2 part from.  (I
> probably just forgot during after reviewing that only LUKS’s permissions
> are changed by this series.)
> 
> But for LUKS, those changed permissions do apply.  If you can’t do
> something between two different qemu instances, you can’t do it in a
> single one: The file locks are equivalent to the internal permission mask.
> 
> So if you can’t change the encryption keys while another process has the
> image open, you can’t change the encryption keys while another node uses
> the file node in the same process.  For example, you can’t attach two
> LUKS nodes to a single file node and then change the keys on one of the
> nodes.
> 
> Max
> 
Ah, I understand now. I'll add a test for that!

Best regards,
Maxim Levitsky




Re: [PATCH v2 00/11] RFC crypto/luks: encryption key managment using amend interface

2019-11-12 Thread Max Reitz
On 08.11.19 16:07, Maxim Levitsky wrote:
> On Fri, 2019-10-04 at 21:10 +0200, Max Reitz wrote:
>> On 13.09.19 00:30, Maxim Levitsky wrote:
>>> This patch series is a continuation of my work to add encryption
>>> key management to luks/qcow2 with luks.
>>>
>>> This is the second version of this patch set.
>>> The changes are mostly addressing the review feedback,
>>> plus I tested (and, sadly, fixed) the somewhat ugly code
>>> that still allows write-sharing a raw luks device
>>> while preventing the key management from happening in this case,
>>> as it is unsafe.
>>> I added a new iotest dedicated to that as well.
>>>
>>> Best regards,
>>> Maxim Levitsky
>>
>> At least for an RFC looks good from my perspective.  I didn’t look at
>> the crypto things very closely (assuming Dan would do so), and I didn’t
>> check the iotests in detail.  (But it definitely doesn’t look like they
>> lack in breadth.  Maybe I’d like to see a test that you cannot have
>> other useful nodes attached to the LUKS or qcow2 node while the
>> amendment process is ongoing (because CONSISTENT_READ is unshared).  But
>> that’s the only thing I can think of.)
> Could you elaborate on this? 
> 
> Inside the same process several users can access that luks node at the same
> time while one of them changes encryption keys, since this doesn't affect IO 
> of the data.
> 
> Two users in the same process can't, I was told (I think), do the amend
> at the same time, since QMP is protected with a lock. However, since I
> use a block job (to be consistent with blockdev-create), I wonder if
> several QMP amend commands couldn't race with one another. These jobs
> run in the block device's AIO context (I changed this recently after a
> review), but I am still not sure there can't be a race.
> 
> And when there is access to the same image from multiple processes, I do
> have a test that checks that as long as more than one process has the
> image open, no one can change the encryption keys (this is only relevant
> for the raw luks format, since for qcow2 this is forbidden anyway).

Yes, sorry, I don’t remember/know where I got the qcow2 part from.  (I
probably just forgot after reviewing that only LUKS’s permissions
are changed by this series.)

But for LUKS, those changed permissions do apply.  If you can’t do
something between two different qemu instances, you can’t do it in a
single one: The file locks are equivalent to the internal permission mask.

So if you can’t change the encryption keys while another process has the
image open, you can’t change the encryption keys while another node uses
the file node in the same process.  For example, you can’t attach two
LUKS nodes to a single file node and then change the keys on one of the
nodes.
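
For reference, this is roughly how block/file-posix.c realizes that
equivalence (a sketch from memory, not verbatim code; the exact base
constants live in file-posix.c):

/*
 * Sketch: the permission mask is encoded as advisory byte-range locks on
 * the image file.  Holding permission bit i locks one well-known byte,
 * refusing to share bit i locks another, so a conflicting opener fails in
 * exactly the same way whether it lives in this process or in another one.
 */
static off_t perm_lock_byte(int perm_bit, bool shared)
{
    /* assumed to correspond to RAW_LOCK_PERM_BASE / RAW_LOCK_SHARED_BASE */
    return (shared ? 200 : 100) + perm_bit;
}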

Max





[PATCH v3 3/8] blockdev: place drive_backup_prepare with the other related transaction functions

2019-11-12 Thread Sergio Lopez
Move drive_backup_prepare() to be side by side with the other
related transaction helper functions.

Signed-off-by: Sergio Lopez 
---
 blockdev.c | 247 +++--
 1 file changed, 125 insertions(+), 122 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index e8b673c5f3..b32855f702 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1761,8 +1761,131 @@ typedef struct DriveBackupState {
 BlockJob *job;
 } DriveBackupState;
 
-static BlockJob *do_drive_backup(DriveBackup *backup, JobTxn *txn,
-Error **errp);
+static BlockJob *do_backup_common(BackupCommon *backup,
+  BlockDriverState *bs,
+  BlockDriverState *target_bs,
+  AioContext *aio_context,
+  JobTxn *txn, Error **errp);
+
+static void drive_backup_prepare(BlkActionState *common, Error **errp)
+{
+DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
+DriveBackup *backup;
+BlockDriverState *bs;
+BlockDriverState *target_bs;
+BlockDriverState *source = NULL;
+AioContext *aio_context;
+QDict *options;
+Error *local_err = NULL;
+int flags;
+int64_t size;
+bool set_backing_hd = false;
+
+assert(common->action->type == TRANSACTION_ACTION_KIND_DRIVE_BACKUP);
+backup = common->action->u.drive_backup.data;
+
+if (!backup->has_mode) {
+backup->mode = NEW_IMAGE_MODE_ABSOLUTE_PATHS;
+}
+
+bs = bdrv_lookup_bs(backup->device, backup->device, errp);
+if (!bs) {
+return;
+}
+
+if (!bs->drv) {
+error_setg(errp, "Device has no medium");
+return;
+}
+
+aio_context = bdrv_get_aio_context(bs);
+aio_context_acquire(aio_context);
+
+/* Paired with .clean() */
+bdrv_drained_begin(bs);
+
+if (!backup->has_format) {
+backup->format = backup->mode == NEW_IMAGE_MODE_EXISTING ?
+ NULL : (char *) bs->drv->format_name;
+}
+
+/* Early check to avoid creating target */
+if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKUP_SOURCE, errp)) {
+goto out;
+}
+
+flags = bs->open_flags | BDRV_O_RDWR;
+
+/*
+ * See if we have a backing HD we can use to create our new image
+ * on top of.
+ */
+if (backup->sync == MIRROR_SYNC_MODE_TOP) {
+source = backing_bs(bs);
+if (!source) {
+backup->sync = MIRROR_SYNC_MODE_FULL;
+}
+}
+if (backup->sync == MIRROR_SYNC_MODE_NONE) {
+source = bs;
+flags |= BDRV_O_NO_BACKING;
+set_backing_hd = true;
+}
+
+size = bdrv_getlength(bs);
+if (size < 0) {
+error_setg_errno(errp, -size, "bdrv_getlength failed");
+goto out;
+}
+
+if (backup->mode != NEW_IMAGE_MODE_EXISTING) {
+assert(backup->format);
+if (source) {
+bdrv_refresh_filename(source);
+bdrv_img_create(backup->target, backup->format, source->filename,
+source->drv->format_name, NULL,
+size, flags, false, &local_err);
+} else {
+bdrv_img_create(backup->target, backup->format, NULL, NULL, NULL,
+size, flags, false, &local_err);
+}
+}
+
+if (local_err) {
+error_propagate(errp, local_err);
+goto out;
+}
+
+options = qdict_new();
+qdict_put_str(options, "discard", "unmap");
+qdict_put_str(options, "detect-zeroes", "unmap");
+if (backup->format) {
+qdict_put_str(options, "driver", backup->format);
+}
+
+target_bs = bdrv_open(backup->target, NULL, options, flags, errp);
+if (!target_bs) {
+goto out;
+}
+
+if (set_backing_hd) {
+bdrv_set_backing_hd(target_bs, source, &local_err);
+if (local_err) {
+goto unref;
+}
+}
+
+state->bs = bs;
+
+state->job = do_backup_common(qapi_DriveBackup_base(backup),
+  bs, target_bs, aio_context,
+  common->block_job_txn, errp);
+
+unref:
+bdrv_unref(target_bs);
+out:
+aio_context_release(aio_context);
+}
 
 static void drive_backup_commit(BlkActionState *common)
 {
@@ -3553,126 +3676,6 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 return job;
 }
 
-static void drive_backup_prepare(BlkActionState *common, Error **errp)
-{
-DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
-DriveBackup *backup;
-BlockDriverState *bs;
-BlockDriverState *target_bs;
-BlockDriverState *source = NULL;
-AioContext *aio_context;
-QDict *options;
-Error *local_err = NULL;
-int flags;
-int64_t size;
-bool set_backing_hd = false;
-
-assert(common->action->type == TRANSACTION_ACTION_KIND_DRIVE_BACKUP);
-backup = common->action->u.drive_backup.data;
-
-if (!backup->has_mode) {

[PATCH v3 5/8] blockdev: merge blockdev_backup_prepare with do_blockdev_backup

2019-11-12 Thread Sergio Lopez
Consolidate blockdev_backup_prepare() with do_blockdev_backup() as a
first step towards streamlining all functionality through
transactions.

Signed-off-by: Sergio Lopez 
---
 blockdev.c | 64 +-
 1 file changed, 15 insertions(+), 49 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 5e85fc042e..128707cdc6 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1940,47 +1940,6 @@ typedef struct BlockdevBackupState {
 BlockJob *job;
 } BlockdevBackupState;
 
-static BlockJob *do_blockdev_backup(BlockdevBackup *backup, JobTxn *txn,
-Error **errp);
-
-static void blockdev_backup_prepare(BlkActionState *common, Error **errp)
-{
-BlockdevBackupState *state = DO_UPCAST(BlockdevBackupState, common, 
common);
-BlockdevBackup *backup;
-BlockDriverState *bs, *target;
-AioContext *aio_context;
-Error *local_err = NULL;
-
-assert(common->action->type == TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP);
-backup = common->action->u.blockdev_backup.data;
-
-bs = bdrv_lookup_bs(backup->device, backup->device, errp);
-if (!bs) {
-return;
-}
-
-target = bdrv_lookup_bs(backup->target, backup->target, errp);
-if (!target) {
-return;
-}
-
-aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
-state->bs = bs;
-
-/* Paired with .clean() */
-bdrv_drained_begin(state->bs);
-
-state->job = do_blockdev_backup(backup, common->block_job_txn, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-goto out;
-}
-
-out:
-aio_context_release(aio_context);
-}
-
 static void blockdev_backup_commit(BlkActionState *common)
 {
 BlockdevBackupState *state = DO_UPCAST(BlockdevBackupState, common, 
common);
@@ -3695,32 +3654,39 @@ XDbgBlockGraph *qmp_x_debug_query_block_graph(Error 
**errp)
 return bdrv_get_xdbg_block_graph(errp);
 }
 
-BlockJob *do_blockdev_backup(BlockdevBackup *backup, JobTxn *txn,
- Error **errp)
+static void blockdev_backup_prepare(BlkActionState *common, Error **errp)
 {
+BlockdevBackupState *state = DO_UPCAST(BlockdevBackupState, common, 
common);
+BlockdevBackup *backup;
 BlockDriverState *bs;
 BlockDriverState *target_bs;
 AioContext *aio_context;
-BlockJob *job;
+
+assert(common->action->type == TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP);
+backup = common->action->u.blockdev_backup.data;
 
 bs = bdrv_lookup_bs(backup->device, backup->device, errp);
 if (!bs) {
-return NULL;
+return;
 }
 
 target_bs = bdrv_lookup_bs(backup->target, backup->target, errp);
 if (!target_bs) {
-return NULL;
+return;
 }
 
 aio_context = bdrv_get_aio_context(bs);
 aio_context_acquire(aio_context);
+state->bs = bs;
 
-job = do_backup_common(qapi_BlockdevBackup_base(backup),
-   bs, target_bs, aio_context, txn, errp);
+/* Paired with .clean() */
+bdrv_drained_begin(state->bs);
+
+state->job = do_backup_common(qapi_BlockdevBackup_base(backup),
+  bs, target_bs, aio_context,
+  common->block_job_txn, errp);
 
 aio_context_release(aio_context);
-return job;
 }
 
 void qmp_blockdev_backup(BlockdevBackup *arg, Error **errp)
-- 
2.23.0




[PATCH v3 6/8] blockdev: place blockdev_backup_prepare with the other related transaction helpers

2019-11-12 Thread Sergio Lopez
Move blockdev_backup_prepare() to be side by side with the other
related transaction helper functions.

Signed-off-by: Sergio Lopez 
---
 blockdev.c | 70 +++---
 1 file changed, 35 insertions(+), 35 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 128707cdc6..f94aaa98f0 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1940,6 +1940,41 @@ typedef struct BlockdevBackupState {
 BlockJob *job;
 } BlockdevBackupState;
 
+static void blockdev_backup_prepare(BlkActionState *common, Error **errp)
+{
+BlockdevBackupState *state = DO_UPCAST(BlockdevBackupState, common, 
common);
+BlockdevBackup *backup;
+BlockDriverState *bs;
+BlockDriverState *target_bs;
+AioContext *aio_context;
+
+assert(common->action->type == TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP);
+backup = common->action->u.blockdev_backup.data;
+
+bs = bdrv_lookup_bs(backup->device, backup->device, errp);
+if (!bs) {
+return;
+}
+
+target_bs = bdrv_lookup_bs(backup->target, backup->target, errp);
+if (!target_bs) {
+return;
+}
+
+aio_context = bdrv_get_aio_context(bs);
+aio_context_acquire(aio_context);
+state->bs = bs;
+
+/* Paired with .clean() */
+bdrv_drained_begin(state->bs);
+
+state->job = do_backup_common(qapi_BlockdevBackup_base(backup),
+  bs, target_bs, aio_context,
+  common->block_job_txn, errp);
+
+aio_context_release(aio_context);
+}
+
 static void blockdev_backup_commit(BlkActionState *common)
 {
 BlockdevBackupState *state = DO_UPCAST(BlockdevBackupState, common, 
common);
@@ -3654,41 +3689,6 @@ XDbgBlockGraph *qmp_x_debug_query_block_graph(Error 
**errp)
 return bdrv_get_xdbg_block_graph(errp);
 }
 
-static void blockdev_backup_prepare(BlkActionState *common, Error **errp)
-{
-BlockdevBackupState *state = DO_UPCAST(BlockdevBackupState, common, 
common);
-BlockdevBackup *backup;
-BlockDriverState *bs;
-BlockDriverState *target_bs;
-AioContext *aio_context;
-
-assert(common->action->type == TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP);
-backup = common->action->u.blockdev_backup.data;
-
-bs = bdrv_lookup_bs(backup->device, backup->device, errp);
-if (!bs) {
-return;
-}
-
-target_bs = bdrv_lookup_bs(backup->target, backup->target, errp);
-if (!target_bs) {
-return;
-}
-
-aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
-state->bs = bs;
-
-/* Paired with .clean() */
-bdrv_drained_begin(state->bs);
-
-state->job = do_backup_common(qapi_BlockdevBackup_base(backup),
-  bs, target_bs, aio_context,
-  common->block_job_txn, errp);
-
-aio_context_release(aio_context);
-}
-
 void qmp_blockdev_backup(BlockdevBackup *arg, Error **errp)
 {
 BlockJob *job;
-- 
2.23.0




[PATCH v3 8/8] blockdev: honor bdrv_try_set_aio_context() context requirements

2019-11-12 Thread Sergio Lopez
bdrv_try_set_aio_context() requires that the old context is held, and
the new context is not held. Fix all the occurrences where it's not
done this way.

Suggested-by: Max Reitz 
Signed-off-by: Sergio Lopez 
---
 blockdev.c | 67 ++
 1 file changed, 58 insertions(+), 9 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 152a0f7454..b0647d8d33 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1535,6 +1535,7 @@ static void external_snapshot_prepare(BlkActionState 
*common,
  DO_UPCAST(ExternalSnapshotState, common, common);
 TransactionAction *action = common->action;
 AioContext *aio_context;
+AioContext *old_context;
 int ret;
 
 /* 'blockdev-snapshot' and 'blockdev-snapshot-sync' have similar
@@ -1675,11 +1676,20 @@ static void external_snapshot_prepare(BlkActionState 
*common,
 goto out;
 }
 
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+old_context = bdrv_get_aio_context(state->new_bs);
+aio_context_release(aio_context);
+aio_context_acquire(old_context);
+
 ret = bdrv_try_set_aio_context(state->new_bs, aio_context, errp);
 if (ret < 0) {
-goto out;
+aio_context_release(old_context);
+return;
 }
 
+aio_context_release(old_context);
+aio_context_acquire(aio_context);
+
 /* This removes our old bs and adds the new bs. This is an operation that
  * can fail, so we need to do it in .prepare; undoing it for abort is
  * always possible. */
@@ -1775,11 +1785,13 @@ static void drive_backup_prepare(BlkActionState 
*common, Error **errp)
 BlockDriverState *target_bs;
 BlockDriverState *source = NULL;
 AioContext *aio_context;
+AioContext *old_context;
 QDict *options;
 Error *local_err = NULL;
 int flags;
 int64_t size;
 bool set_backing_hd = false;
+int ret;
 
 assert(common->action->type == TRANSACTION_ACTION_KIND_DRIVE_BACKUP);
 backup = common->action->u.drive_backup.data;
@@ -1868,6 +1880,20 @@ static void drive_backup_prepare(BlkActionState *common, 
Error **errp)
 goto out;
 }
 
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+old_context = bdrv_get_aio_context(target_bs);
+aio_context_release(aio_context);
+aio_context_acquire(old_context);
+
+ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
+if (ret < 0) {
+aio_context_release(old_context);
+return;
+ }
+
+aio_context_release(old_context);
+aio_context_acquire(aio_context);
+
 if (set_backing_hd) {
 bdrv_set_backing_hd(target_bs, source, &local_err);
 if (local_err) {
@@ -1947,6 +1973,8 @@ static void blockdev_backup_prepare(BlkActionState 
*common, Error **errp)
 BlockDriverState *bs;
 BlockDriverState *target_bs;
 AioContext *aio_context;
+AioContext *old_context;
+int ret;
 
 assert(common->action->type == TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP);
 backup = common->action->u.blockdev_backup.data;
@@ -1961,7 +1989,18 @@ static void blockdev_backup_prepare(BlkActionState 
*common, Error **errp)
 return;
 }
 
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
 aio_context = bdrv_get_aio_context(bs);
+old_context = bdrv_get_aio_context(target_bs);
+aio_context_acquire(old_context);
+
+ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
+if (ret < 0) {
+aio_context_release(old_context);
+return;
+}
+
+aio_context_release(old_context);
 aio_context_acquire(aio_context);
 state->bs = bs;
 
@@ -3562,7 +3601,6 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 BlockJob *job = NULL;
 BdrvDirtyBitmap *bmap = NULL;
 int job_flags = JOB_DEFAULT;
-int ret;
 
 if (!backup->has_speed) {
 backup->speed = 0;
@@ -3586,11 +3624,6 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 backup->compress = false;
 }
 
-ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
-if (ret < 0) {
-return NULL;
-}
-
 if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) ||
 (backup->sync == MIRROR_SYNC_MODE_INCREMENTAL)) {
 /* done before desugaring 'incremental' to print the right message */
@@ -3825,6 +3858,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
 BlockDriverState *bs;
 BlockDriverState *source, *target_bs;
 AioContext *aio_context;
+AioContext *old_context;
 BlockMirrorBackingMode backing_mode;
 Error *local_err = NULL;
 QDict *options = NULL;
@@ -3937,12 +3971,22 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
(arg->mode == NEW_IMAGE_MODE_EXISTING ||
 !bdrv_has_zero_init(target_bs)));
 
+
+/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+old_context = 

[PATCH v3 7/8] blockdev: change qmp_blockdev_backup to make use of transactions

2019-11-12 Thread Sergio Lopez
Change qmp_blockdev_backup() to create and start a transaction instead
of calling do_blockdev_backup() directly.

Signed-off-by: Sergio Lopez 
---
 blockdev.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index f94aaa98f0..152a0f7454 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3689,13 +3689,13 @@ XDbgBlockGraph *qmp_x_debug_query_block_graph(Error 
**errp)
 return bdrv_get_xdbg_block_graph(errp);
 }
 
-void qmp_blockdev_backup(BlockdevBackup *arg, Error **errp)
+void qmp_blockdev_backup(BlockdevBackup *backup, Error **errp)
 {
-BlockJob *job;
-job = do_blockdev_backup(arg, NULL, errp);
-if (job) {
-job_start(&job->job);
-}
+TransactionAction action = {
+.type = TRANSACTION_ACTION_KIND_BLOCKDEV_BACKUP,
+.u.blockdev_backup.data = backup,
+};
+blockdev_do_action(&action, errp);
 }
 
 /* Parameter check and block job starting for drive mirroring.
-- 
2.23.0




[PATCH v3 2/8] blockdev: fix coding style issues in drive_backup_prepare

2019-11-12 Thread Sergio Lopez
Fix a couple of minor coding style issues in drive_backup_prepare.

Signed-off-by: Sergio Lopez 
---
 blockdev.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 5d30aff1e5..e8b673c5f3 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3592,7 +3592,7 @@ static void drive_backup_prepare(BlkActionState *common, 
Error **errp)
 
 if (!backup->has_format) {
 backup->format = backup->mode == NEW_IMAGE_MODE_EXISTING ?
- NULL : (char*) bs->drv->format_name;
+ NULL : (char *) bs->drv->format_name;
 }
 
 /* Early check to avoid creating target */
@@ -3602,8 +3602,10 @@ static void drive_backup_prepare(BlkActionState *common, 
Error **errp)
 
 flags = bs->open_flags | BDRV_O_RDWR;
 
-/* See if we have a backing HD we can use to create our new image
- * on top of. */
+/*
+ * See if we have a backing HD we can use to create our new image
+ * on top of.
+ */
 if (backup->sync == MIRROR_SYNC_MODE_TOP) {
 source = backing_bs(bs);
 if (!source) {
-- 
2.23.0




[PATCH v3 4/8] blockdev: change qmp_drive_backup to make use of transactions

2019-11-12 Thread Sergio Lopez
Change qmp_drive_backup() to create and start a transaction instead of
calling do_drive_backup directly.

Signed-off-by: Sergio Lopez 
---
 blockdev.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index b32855f702..5e85fc042e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3676,14 +3676,13 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 return job;
 }
 
-void qmp_drive_backup(DriveBackup *arg, Error **errp)
+void qmp_drive_backup(DriveBackup *backup, Error **errp)
 {
-
-BlockJob *job;
-job = do_drive_backup(arg, NULL, errp);
-if (job) {
-job_start(&job->job);
-}
+TransactionAction action = {
+.type = TRANSACTION_ACTION_KIND_DRIVE_BACKUP,
+.u.drive_backup.data = backup,
+};
+blockdev_do_action(&action, errp);
 }
 
 BlockDeviceInfoList *qmp_query_named_block_nodes(Error **errp)
-- 
2.23.0




[PATCH v3 0/8] blockdev: avoid acquiring AioContext lock twice at do_drive_backup and do_blockdev_backup

2019-11-12 Thread Sergio Lopez
do_drive_backup() acquires the AioContext lock of the corresponding
BlockDriverState. This is not a problem when it's called from
qmp_drive_backup(), but drive_backup_prepare() also acquires the lock
before calling it. The same things happens with do_blockdev_backup()
and blockdev_backup_prepare().

This patch series merges do_drive_backup() with drive_backup_prepare()
and do_blockdev_backup() with blockdev_backup_prepare(), and ensures
they're only getting called from a transaction context. This way,
there's a single code path for both transaction requests and qmp
commands, as suggested by Kevin Wolf.

We also take this opportunity to ensure we're honoring the context
acquisition semantics required by bdrv_try_set_aio_context, as
suggested by Max Reitz.
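
In short, the locking pattern applied by the last patch looks roughly like
this (a sketch only, assuming the node's new context is already held on
entry; it is not a drop-in function from the series):

static int set_aio_context_sketch(BlockDriverState *target_bs,
                                  AioContext *new_context, Error **errp)
{
    /* bdrv_try_set_aio_context() wants the node's current ("old") context
     * held and the new context *not* held, so swap the locks around the
     * call and then restore the caller's locking state. */
    AioContext *old_context = bdrv_get_aio_context(target_bs);
    int ret;

    aio_context_release(new_context);
    aio_context_acquire(old_context);

    ret = bdrv_try_set_aio_context(target_bs, new_context, errp);

    aio_context_release(old_context);
    aio_context_acquire(new_context);

    return ret;
}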

---
Changelog:

v3:
 - Rework the whole patch series to fix the issue by consolidating all
   operations in the transaction model. (thanks Kevin Wolf)

v2:
 - Honor bdrv_try_set_aio_context() context acquisition requirements
   (thanks Max Reitz).
 - Release the context at drive_backup_prepare() instead of avoiding
   re-acquiring it at do_drive_baclup(). (thanks Max Reitz)
 - Convert a single patch into a two-patch series.
---

Sergio Lopez (8):
  blockdev: merge drive_backup_prepare with do_drive_backup
  blockdev: fix coding style issues in drive_backup_prepare
  blockdev: place drive_backup_prepare with the other related
transaction functions
  blockdev: change qmp_drive_backup to make use of transactions
  blockdev: merge blockdev_backup_prepare with do_blockdev_backup
  blockdev: place blockdev_backup_prepare with the other related
transaction helpers
  blockdev: change qmp_blockdev_backup to make use of transactions
  blockdev: honor bdrv_try_set_aio_context() context requirements

 blockdev.c | 349 ++---
 1 file changed, 171 insertions(+), 178 deletions(-)

-- 
2.23.0




[PATCH v3 1/8] blockdev: merge drive_backup_prepare with do_drive_backup

2019-11-12 Thread Sergio Lopez
Consolidate drive_backup_prepare() with do_drive_backup() as a first
step towards streamlining all functionality through transactions.

Signed-off-by: Sergio Lopez 
---
 blockdev.c | 58 +++---
 1 file changed, 16 insertions(+), 42 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 8e029e9c01..5d30aff1e5 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1764,40 +1764,6 @@ typedef struct DriveBackupState {
 static BlockJob *do_drive_backup(DriveBackup *backup, JobTxn *txn,
 Error **errp);
 
-static void drive_backup_prepare(BlkActionState *common, Error **errp)
-{
-DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
-BlockDriverState *bs;
-DriveBackup *backup;
-AioContext *aio_context;
-Error *local_err = NULL;
-
-assert(common->action->type == TRANSACTION_ACTION_KIND_DRIVE_BACKUP);
-backup = common->action->u.drive_backup.data;
-
-bs = bdrv_lookup_bs(backup->device, backup->device, errp);
-if (!bs) {
-return;
-}
-
-aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
-
-/* Paired with .clean() */
-bdrv_drained_begin(bs);
-
-state->bs = bs;
-
-state->job = do_drive_backup(backup, common->block_job_txn, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-goto out;
-}
-
-out:
-aio_context_release(aio_context);
-}
-
 static void drive_backup_commit(BlkActionState *common)
 {
 DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
@@ -3587,13 +3553,13 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 return job;
 }
 
-static BlockJob *do_drive_backup(DriveBackup *backup, JobTxn *txn,
- Error **errp)
+static void drive_backup_prepare(BlkActionState *common, Error **errp)
 {
+DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
+DriveBackup *backup;
 BlockDriverState *bs;
 BlockDriverState *target_bs;
 BlockDriverState *source = NULL;
-BlockJob *job = NULL;
 AioContext *aio_context;
 QDict *options;
 Error *local_err = NULL;
@@ -3601,23 +3567,29 @@ static BlockJob *do_drive_backup(DriveBackup *backup, 
JobTxn *txn,
 int64_t size;
 bool set_backing_hd = false;
 
+assert(common->action->type == TRANSACTION_ACTION_KIND_DRIVE_BACKUP);
+backup = common->action->u.drive_backup.data;
+
 if (!backup->has_mode) {
 backup->mode = NEW_IMAGE_MODE_ABSOLUTE_PATHS;
 }
 
 bs = bdrv_lookup_bs(backup->device, backup->device, errp);
 if (!bs) {
-return NULL;
+return;
 }
 
 if (!bs->drv) {
 error_setg(errp, "Device has no medium");
-return NULL;
+return;
 }
 
 aio_context = bdrv_get_aio_context(bs);
 aio_context_acquire(aio_context);
 
+/* Paired with .clean() */
+bdrv_drained_begin(bs);
+
 if (!backup->has_format) {
 backup->format = backup->mode == NEW_IMAGE_MODE_EXISTING ?
  NULL : (char*) bs->drv->format_name;
@@ -3687,14 +3659,16 @@ static BlockJob *do_drive_backup(DriveBackup *backup, 
JobTxn *txn,
 }
 }
 
-job = do_backup_common(qapi_DriveBackup_base(backup),
-   bs, target_bs, aio_context, txn, errp);
+state->bs = bs;
+
+state->job = do_backup_common(qapi_DriveBackup_base(backup),
+  bs, target_bs, aio_context,
+  common->block_job_txn, errp);
 
 unref:
 bdrv_unref(target_bs);
 out:
 aio_context_release(aio_context);
-return job;
 }
 
 void qmp_drive_backup(DriveBackup *arg, Error **errp)
-- 
2.23.0




Re: API definition for LUKS key management

2019-11-12 Thread Daniel P . Berrangé
On Tue, Nov 12, 2019 at 10:12:45AM +0100, Kevin Wolf wrote:
> Am 11.11.2019 um 19:34 hat Daniel P. Berrangé geschrieben:
> > On Mon, Nov 11, 2019 at 05:58:20PM +0200, Maxim Levitsky wrote:
> > > One of the concerns that was raised during the review was that amend 
> > > interface for luks that I propose is
> > > different from the amend interface used currently for qcow2.
> > > 
> > > qcow2 amend interface specifies all the format options, thus overwrites 
> > > the existing options.
> > > Thus it seems natural to make the luks amend interface work the same way, 
> > > that it receive an array
> > > of 8 slots, and for each slot specify if it is active, and if true what 
> > > password to put in it.
> > > This does allow to add and erase the keyslots, but it doesn't allow:
> > > 
> > >* add a password without knowing all other passwords that exist in 
> > > existing keyslots
> > >  this can be mitigated by specifying which keyslots to modify for 
> > > example by omitting the
> > >  keyslots that shouldn't be touched from the array (passing null 
> > > placeholder instead)
> > >  but then it already doesn't follow the 'specify all the options each 
> > > time' principle.
> > 
> > I think this is highly undesirable, as we must not assume that the
> > mgmt app has access to all the passwords currently set.
> 
> And I think this shows the problem that we really have with the crypto
> driver and amend: For every other driver, if you must, you can query the
> current settings and just write them back.
> 
> The difference here is that crypto doesn't allow to directly query or
> specify the content of some options (the keyslots), but provides only a
> way to derive that content from a secret, and obviously there is no way
> back from the stored data to the secret (that's what it's for).
> 
> I think we have two options here:
> 
> 1. Add a special "don't touch this" value for keyslots. Normally, just
>leaving out the value would be suitable syntax for this. Here,
>however, we have a list of keyslots, so we can't leave anything out.
> 
>We could use something like an alternate between str (new secret ID),
>null (erase keyslot) and empty dict (leave it alone) - the latter
>feels a bit hackish, but maybe it's not too bad. If the list is
>shorter than 8 entries, the rest is assumed to mean "leave it alone",
>too.

I'd be very wary of having a "null" vs "empty dict" distinction to
mean "erase" vs "don't touch".

It feels like that is designed to maximise the chances of someone
shooting themselves in the foot by accidentally using "null" instead
of an "empty dict".

The reason for the use of "active=yes" / "active=no" is that it is
reasonably explicit about wanting to erase a keyslot, and it does
actually map to the key slot on disk, which has an "active" field
taking two magic values.
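
For reference, the on-disk state this maps to looks roughly like the
following (a sketch of the LUKS1 key slot layout; the constant and field
names here are illustrative, not necessarily the exact QEMU identifiers):

#include <stdint.h>

#define LUKS_KEY_SLOT_ENABLED  0x00AC71F3   /* "active=yes" */
#define LUKS_KEY_SLOT_DISABLED 0x0000DEAD   /* "active=no"  */

typedef struct LUKSKeySlotOnDisk {
    uint32_t active;              /* one of the two magic values above */
    uint32_t iterations;          /* PBKDF2 iteration count for this slot */
    uint8_t  salt[32];            /* per-slot salt */
    uint32_t key_material_offset; /* sector of the split (AF) key material */
    uint32_t stripes;             /* anti-forensic split factor */
} LUKSKeySlotOnDisk;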

> 2. Allow to query and set the raw key, which doesn't require a password

I don't think I understand what you mean here. If you don't have a
password the only change you can make is to erase key slots.

> > >* erase all keyslots matching a password - this is really hard to do 
> > > using this approach,
> > >  unless we give user some kind of api to try each keyslot with given 
> > > password,
> > >  which is kind of ugly and might be racy as well.
> > 
> > > So what do you think?
> > 
> > The point of using "amend" is that we already have some of the boilerplate
> > supporting framework around that, so it saves effort for both QEMU and
> > our users. If the semantics of "amend" don't fit nicely though, then the
> > benefit of re-using "amend" is cancelled out and we should go back to
> > considering a separate "key-manage" command.
> 
> This wouldn't solve the fundamental problem that the crypto block
> driver, as it currently is, isn't able to provide a blockdev-amend
> callback. It's worse for qcow2 because qcow2 already implements amend.
> 
> I think we need to find a solution for the amend API.


BTW, looking forward to the future, if we ever implement LUKS version 2
support there are a bunch more things that can be tweaked at runtime. Most
notable is that it is possible to change the master key, and change the
encryption algorithm choices. Both of these then need to trigger a bulk
re-encrypt of the entire disk contents, which takes a long time.

I doubt we'll do this in the near term, but we should consider how this
might fit into whatever scheme we pick for updates.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: API definition for LUKS key management

2019-11-12 Thread Max Reitz
On 12.11.19 10:12, Kevin Wolf wrote:
> Am 11.11.2019 um 19:34 hat Daniel P. Berrangé geschrieben:
>> On Mon, Nov 11, 2019 at 05:58:20PM +0200, Maxim Levitsky wrote:
>>> One of the concerns that was raised during the review was that amend 
>>> interface for luks that I propose is
>>> different from the amend interface used currently for qcow2.
>>>
>>> qcow2 amend interface specifies all the format options, thus overwrites the 
>>> existing options.
>>> Thus it seems natural to make the luks amend interface work the same way, 
>>> that it receive an array
>>> of 8 slots, and for each slot specify if it is active, and if true what 
>>> password to put in it.
>>> This does allow to add and erase the keyslots, but it doesn't allow:
>>>
>>>* add a password without knowing all other passwords that exist in 
>>> existing keyslots
>>>  this can be mitigated by specifying which keyslots to modify for 
>>> example by omitting the
>>>  keyslots that shouldn't be touched from the array (passing null 
>>> placeholder instead)
>>>  but then it already doesn't follow the 'specify all the options each 
>>> time' principle.
>>
>> I think this is highly undesirable, as we must not assume that the
>> mgmt app has access to all the passwords currently set.
> 
> And I think this shows the problem that we really have with the crypto
> driver and amend: For every other driver, if you must, you can query the
> current settings and just write them back.
> 
> The difference here is that crypto doesn't allow to directly query or
> specify the content of some options (the keyslots), but provides only a
> way to derive that content from a secret, and obviously there is no way
> back from the stored data to the secret (that's what it's for).
> 
> I think we have two options here:
> 
> 1. Add a special "don't touch this" value for keyslots. Normally, just
>leaving out the value would be suitable syntax for this. Here,
>however, we have a list of keyslots, so we can't leave anything out.
> 
>We could use something like an alternate between str (new secret ID),
>null (erase keyslot) and empty dict (leave it alone) - the latter
>feels a bit hackish, but maybe it's not too bad.

I thought of something similar, but how would that look on the command line?

Though I suppose if the worst thing were how it looks on the command
line, we could introduce a new qemu-img subcommand that then internally
translates into the right amend syntax.

>If the list is
>shorter than 8 entries, the rest is assumed to mean "leave it alone",
>too.
> 
> 2. Allow to query and set the raw key, which doesn't require a password
> 
>> The two key use cases for having multiple key slots are
>>
>>   - To enable a two-phase change of passwords to ensure new password
>> is safely written out before erasing the old password
>> 
>>   - To allow for multiple access passwords with different controls
>> or access to when each password is made available.
>>
>> eg each VM may have a separate "backup password" securely
>> stored off host that is only made available for use when
>> doing disaster recovery.
>>
>> the second use case is doomed if you need to always provide all
>> current passwords when changing any key slots.
> 
> That providing all current passwords doesn't work is obvious.
> 
>>>* erase all keyslots matching a password - this is really hard to do 
>>> using this approach,
>>>  unless we give user some kind of api to try each keyslot with given 
>>> password,
>>>  which is kind of ugly and might be racy as well.
>>
>>> So what do you think?
>>
>> The point of using "amend" is that we already have some of the boilerplate
>> supporting framework around that, so it saves effort for both QEMU and
>> our users. If the semantics of "amend" don't fit nicely though, then the
>> benefit of re-using "amend" is cancelled out and we should go back to
>> considering a separate "key-manage" command.
> 
> This wouldn't solve the fundamental problem that the crypto block
> driver, as it currently is, isn't able to provide a blockdev-amend
> callback. It's worse for qcow2 because qcow2 already implements amend.

Hm, well, I would have assumed this is only bad on the premise that we
want to have amend be complete at some point.  Do we?

While I do think it might be nice to be able to change e.g. cluster_size
especially for the upcoming subcluster extension (in addition to
enabling subclusters on an existing image), I seriously doubt anyone’s
going to implement it.  (Maybe enabling subclusters, but not changing
cluster_size.)

> I think we need to find a solution for the amend API.

I do think it’s weird to look for non-amend solutions when it clearly
looks like an amend problem, but OTOH I don’t think it would be that bad
to disregard amend.  (Provided there are good reasons for disregarding it.)

Max





Re: [PATCH v6 1/3] block: introduce compress filter driver

2019-11-12 Thread Vladimir Sementsov-Ogievskiy
11.11.2019 19:04, Andrey Shinkevich wrote:
> Allow writing all the data compressed through the filter driver.
> The written data will be aligned by the cluster size.
> Based on the QEMU current implementation, that data can be written to
> unallocated clusters only. May be used for a backup job.
> 
> Suggested-by: Max Reitz 
> Signed-off-by: Andrey Shinkevich 
> ---
>   block/Makefile.objs |   1 +
>   block/filter-compress.c | 212 
> 
>   qapi/block-core.json|  10 ++-
>   3 files changed, 219 insertions(+), 4 deletions(-)
>   create mode 100644 block/filter-compress.c
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index e394fe0..330529b 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -43,6 +43,7 @@ block-obj-y += crypto.o
>   
>   block-obj-y += aio_task.o
>   block-obj-y += backup-top.o
> +block-obj-y += filter-compress.o
>   
>   common-obj-y += stream.o
>   
> diff --git a/block/filter-compress.c b/block/filter-compress.c
> new file mode 100644
> index 000..a7b0337
> --- /dev/null
> +++ b/block/filter-compress.c
> @@ -0,0 +1,212 @@
> +/*
> + * Compress filter block driver
> + *
> + * Copyright (c) 2019 Virtuozzo International GmbH
> + *
> + * Author:
> + *   Andrey Shinkevich 
> + *   (based on block/copy-on-read.c by Max Reitz)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation; either version 2 or
> + * (at your option) any later version of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "block/block_int.h"
> +#include "qemu/module.h"
> +
> +
> +static int zip_open(BlockDriverState *bs, QDict *options, int flags,
> + Error **errp)
> +{
> +bs->backing = bdrv_open_child(NULL, options, "file", bs, _file, 
> false,
> +  errp);
> +if (!bs->backing) {
> +return -EINVAL;
> +}
> +
> +bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
> +BDRV_REQ_WRITE_COMPRESSED |
> +(BDRV_REQ_FUA & bs->backing->bs->supported_write_flags);
> +
> +bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
> +((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
> +bs->backing->bs->supported_zero_flags);
> +
> +return 0;
> +}
> +
> +
> +#define PERM_PASSTHROUGH (BLK_PERM_CONSISTENT_READ \
> +  | BLK_PERM_WRITE \
> +  | BLK_PERM_RESIZE)
> +#define PERM_UNCHANGED (BLK_PERM_ALL & ~PERM_PASSTHROUGH)
> +
> +static void zip_child_perm(BlockDriverState *bs, BdrvChild *c,
> +const BdrvChildRole *role,
> +BlockReopenQueue *reopen_queue,
> +uint64_t perm, uint64_t shared,
> +uint64_t *nperm, uint64_t *nshared)
> +{
> +*nperm = perm & PERM_PASSTHROUGH;
> +*nshared = (shared & PERM_PASSTHROUGH) | PERM_UNCHANGED;
> +
> +/*
> + * We must not request write permissions for an inactive node, the child
> + * cannot provide it.
> + */
> +if (!(bs->open_flags & BDRV_O_INACTIVE)) {
> +*nperm |= BLK_PERM_WRITE_UNCHANGED;
> +}
> +}
> +
> +
> +static int64_t zip_getlength(BlockDriverState *bs)
> +{
> +return bdrv_getlength(bs->backing->bs);
> +}
> +
> +
> +static int coroutine_fn zip_co_truncate(BlockDriverState *bs, int64_t offset,
> + bool exact, PreallocMode prealloc,
> + Error **errp)
> +{
> +return bdrv_co_truncate(bs->backing, offset, exact, prealloc, errp);
> +}
> +
> +
> +static int coroutine_fn zip_co_preadv(BlockDriverState *bs,
> +   uint64_t offset, uint64_t bytes,
> +   QEMUIOVector *qiov, int flags)
> +{
> +return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
> +}
> +
> +
> +static int coroutine_fn zip_co_preadv_part(BlockDriverState *bs,
> +uint64_t offset, uint64_t bytes,
> +QEMUIOVector *qiov,
> +size_t qiov_offset,
> +int flags)
> +{
> +return bdrv_co_preadv_part(bs->backing, offset, bytes, qiov, qiov_offset,
> +   flags);
> +}
> +
> +
> +static int coroutine_fn 

Re: [PATCH v6 1/3] block: introduce compress filter driver

2019-11-12 Thread Vladimir Sementsov-Ogievskiy
12.11.2019 13:07, Andrey Shinkevich wrote:
> On 12/11/2019 12:39, Kevin Wolf wrote:
>> Am 11.11.2019 um 17:04 hat Andrey Shinkevich geschrieben:
>>> Allow writing all the data compressed through the filter driver.
>>> The written data will be aligned by the cluster size.
>>> Based on the QEMU current implementation, that data can be written to
>>> unallocated clusters only. May be used for a backup job.
>>>
>>> Suggested-by: Max Reitz 
>>> Signed-off-by: Andrey Shinkevich 
>>
>>> +static BlockDriver bdrv_compress = {
>>> +.format_name= "compress",
>>> +
>>> +.bdrv_open  = zip_open,
>>> +.bdrv_child_perm= zip_child_perm,
>>
>> Why do you call the functions zip_* when the driver is called compress?
>> I think zip would be a driver for zip archives, which we don't use here.
>>
> 
> Kevin,
> Thanks for your response.
> I was trying to settle on a short form for 'compress'.
> I will change 'zip' to something like 'compr'.

I'd keep it compress, it sounds better.

> 
>>> +.bdrv_getlength = zip_getlength,
>>> +.bdrv_co_truncate   = zip_co_truncate,
>>> +
>>> +.bdrv_co_preadv = zip_co_preadv,
>>> +.bdrv_co_preadv_part= zip_co_preadv_part,
>>> +.bdrv_co_pwritev= zip_co_pwritev,
>>> +.bdrv_co_pwritev_part   = zip_co_pwritev_part,
>>
>> If you implement .bdrv_co_preadv/pwritev_part, isn't the implementation
>> of .bdrv_co_preadv/pwritev (without _part) dead code?
>>
> 
> Understood and will remove the dead path.
> 
>>> +.bdrv_co_pwrite_zeroes  = zip_co_pwrite_zeroes,
>>> +.bdrv_co_pdiscard   = zip_co_pdiscard,
>>> +.bdrv_refresh_limits= zip_refresh_limits,
>>> +
>>> +.bdrv_eject = zip_eject,
>>> +.bdrv_lock_medium   = zip_lock_medium,
>>> +
>>> +.bdrv_co_block_status   = 
>>> bdrv_co_block_status_from_backing,
>>
>> Why not use bs->file? (Well, apart from the still not merged filter
>> series by Max...)
>>
> 
> We need to keep a backing chain unbroken with the filter inserted. So,
> the backing child should not be zero. It is necessary for the backup
> job, for instance. When I initialized both children pointing to the same
> node, the code didn't work properly. I have to reproduce it again to
> tell you exactly what happened then and will appreciate your advice
> about a proper design.
> 
> Andrey


File-child based filters are a pain: they need Max's 42-patch series
(which seems postponed now :( ) to work correctly (or at least more
correctly than now). File-child based filters break backing chains, while
backing-child based ones work well. So I don't know of any benefit of
file-child based filters, and I think there is no reason to create new
ones. If in the future we need file-child support in the filter for some
reason, it's simple to add (the filter will still have only one child, but
it may be backing or file, as requested by the user).

So, I propose to start with a backing child, which works better, and add
file-child support in the future if needed.

Also, note that the backup-top filter uses a backing child, and there is a
reason to make it a public filter (to realize image fleecing without a
backup job), so we'll have a public backing-child based filter anyway.

Also, we have a pending series about using the COR filter in block-stream
(it breaks the backing chain, as it is file-child based), and now I think
that the simplest way to fix it is to support a backing child in
block-stream (so the block-stream job will create a COR filter with a
backing child instead of a file child).

> 
>>> +.bdrv_recurse_is_first_non_filter   = zip_recurse_is_first_non_filter,
>>> +
>>> +.has_variable_length= true,
>>> +.is_filter  = true,
>>> +};
>>
>> Kevin
>>
> 


-- 
Best regards,
Vladimir



Re: [PATCH v6 1/3] block: introduce compress filter driver

2019-11-12 Thread Andrey Shinkevich
On 12/11/2019 12:39, Kevin Wolf wrote:
> Am 11.11.2019 um 17:04 hat Andrey Shinkevich geschrieben:
>> Allow writing all the data compressed through the filter driver.
>> The written data will be aligned by the cluster size.
>> Based on the QEMU current implementation, that data can be written to
>> unallocated clusters only. May be used for a backup job.
>>
>> Suggested-by: Max Reitz 
>> Signed-off-by: Andrey Shinkevich 
> 
>> +static BlockDriver bdrv_compress = {
>> +.format_name= "compress",
>> +
>> +.bdrv_open  = zip_open,
>> +.bdrv_child_perm= zip_child_perm,
> 
> Why do you call the functions zip_* when the driver is called compress?
> I think zip would be a driver for zip archives, which we don't use here.
> 

Kevin,
Thanks for your response.
I was trying to settle on a short form for 'compress'.
I will change 'zip' to something like 'compr'.

>> +.bdrv_getlength = zip_getlength,
>> +.bdrv_co_truncate   = zip_co_truncate,
>> +
>> +.bdrv_co_preadv = zip_co_preadv,
>> +.bdrv_co_preadv_part= zip_co_preadv_part,
>> +.bdrv_co_pwritev= zip_co_pwritev,
>> +.bdrv_co_pwritev_part   = zip_co_pwritev_part,
> 
> If you implement .bdrv_co_preadv/pwritev_part, isn't the implementation
> of .bdrv_co_preadv/pwritev (without _part) dead code?
> 

Understood and will remove the dead path.

>> +.bdrv_co_pwrite_zeroes  = zip_co_pwrite_zeroes,
>> +.bdrv_co_pdiscard   = zip_co_pdiscard,
>> +.bdrv_refresh_limits= zip_refresh_limits,
>> +
>> +.bdrv_eject = zip_eject,
>> +.bdrv_lock_medium   = zip_lock_medium,
>> +
>> +.bdrv_co_block_status   = bdrv_co_block_status_from_backing,
> 
> Why not use bs->file? (Well, apart from the still not merged filter
> series by Max...)
> 

We need to keep a backing chain unbroken with the filter inserted. So, 
the backing child should not be zero. It is necessary for the backup 
job, for instance. When I initialized both children pointing to the same 
node, the code didn't work properly. I have to reproduce it again to 
tell you exactly what happened then and will appreciate your advice 
about a proper design.

Andrey

>> +.bdrv_recurse_is_first_non_filter   = zip_recurse_is_first_non_filter,
>> +
>> +.has_variable_length= true,
>> +.is_filter  = true,
>> +};
> 
> Kevin
> 

-- 
With the best regards,
Andrey Shinkevich



Re: API definition for LUKS key management

2019-11-12 Thread Daniel P . Berrangé
On Mon, Nov 11, 2019 at 05:58:20PM +0200, Maxim Levitsky wrote:
> I will try to explain the interface with a bunch of examples:

I want to fill in equivalent examples from cryptsetup for the sake
of comparison

> # adds a new password, defined by qemu secret 'sec0' to first unused slot
> # give user a error if all keyslots are occupied
> qemu-img amend --secret ... -o key-secret=sec1 image.luks

  cryptsetup luksAddKey --key-file currentkey.txt \
--new-key-file newkey.txt \
/dev/mapper/foo

> # erases all keyslots that can be opened by password that is contained in a 
> qemu secret 'sec0'
> # active=off means that the given password/keyslot won't be active after the 
> operation
> qemu-img amend --secret ... -o key-secret=sec0,active=off image.luks


  cryptsetup luksRemoveKey --key-file currentkey.txt \
   /dev/mapper/foo

> # erase the slot 5 (this is more low level command, less expected to be used)
> qemu-img amend --secret ... -o slot=5,active=off image.luks

  cryptsetup luksKillSlot /dev/mapper/foo 5

> # add new secret to slot 5 (will fail if the slot is already marked as active)
> qemu-img amend --secret ... -o slot=5,key-secret=sec1 image.luks

  cryptsetup luksAddKey --key-file currentkey.txt \
--new-key-file newkey.txt \
--key-slot 5 \
/dev/mapper/foo

They look very different in syntax because they are taking two different
approaches.

The cryptsetup approach is functional, where the user states what action
they want performed.

The qemu-img amend approach is declarative, where the user expresses what
end state they want the image to be in.

The challenge is that the qemu-img proposal isn't fully declarative as
we are only partially expressing the end state, attempting to leave other
slots unspecified. This is good as it means we don't have to specify a
huge amount of data on the CLI. This is bad as it makes it less obvious
what the actual effects of the commands will be, and I feel users will end
up needing to read the manual every time to re-learn it.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [RFC PATCH 16/18] qapi: Create 'pragma' module

2019-11-12 Thread Markus Armbruster
Kevin Wolf  writes:

> We want to share the whitelists between the system emulator schema and
> the storage daemon schema, so move all the pragmas from the main schema
> file into a separate file that can be included from both.

Confusing because the storage daemon schema doesn't exist at this point.
PATCH 13's commit message has the same issue.

I'll revisit this when I review PATCH 18.

>
> Signed-off-by: Kevin Wolf 
> ---
>  qapi/pragma.json  | 24 
>  qapi/qapi-schema.json | 25 +
>  qapi/Makefile.objs|  2 +-
>  3 files changed, 26 insertions(+), 25 deletions(-)
>  create mode 100644 qapi/pragma.json
>
> diff --git a/qapi/pragma.json b/qapi/pragma.json
> new file mode 100644
> index 00..cffae27666
> --- /dev/null
> +++ b/qapi/pragma.json
> @@ -0,0 +1,24 @@
> +{ 'pragma': { 'doc-required': true } }
> +
> +# Whitelists to permit QAPI rule violations; think twice before you
> +# add to them!
> +{ 'pragma': {
> +# Commands allowed to return a non-dictionary:
> +'returns-whitelist': [
> +'human-monitor-command',
> +'qom-get',
> +'query-migrate-cache-size',
> +'query-tpm-models',
> +'query-tpm-types',
> +'ringbuf-read' ],
> +'name-case-whitelist': [
> +'ACPISlotType', # DIMM, visible through 
> query-acpi-ospm-status
> +'CpuInfoMIPS',  # PC, visible through query-cpu
> +'CpuInfoTricore',   # PC, visible through query-cpu
> +'BlockdevVmdkSubformat',# all members, to match VMDK spec 
> spellings
> +'BlockdevVmdkAdapterType',  # legacyESX, to match VMDK spec spellings
> +'QapiErrorClass',   # all members, visible through errors
> +'UuidInfo', # UUID, visible through query-uuid
> +'X86CPURegister32', # all members, visible indirectly 
> through qom-get
> +'CpuInfo'   # CPU, visible through query-cpu
> +] } }
> diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
> index be90422ffe..85b4048535 100644
> --- a/qapi/qapi-schema.json
> +++ b/qapi/qapi-schema.json
> @@ -49,30 +49,7 @@
>  #
>  ##
>  
> -{ 'pragma': { 'doc-required': true } }
> -
> -# Whitelists to permit QAPI rule violations; think twice before you
> -# add to them!
> -{ 'pragma': {
> -# Commands allowed to return a non-dictionary:
> -'returns-whitelist': [
> -'human-monitor-command',
> -'qom-get',
> -'query-migrate-cache-size',
> -'query-tpm-models',
> -'query-tpm-types',
> -'ringbuf-read' ],
> -'name-case-whitelist': [
> -'ACPISlotType', # DIMM, visible through 
> query-acpi-ospm-status
> -'CpuInfoMIPS',  # PC, visible through query-cpu
> -'CpuInfoTricore',   # PC, visible through query-cpu
> -'BlockdevVmdkSubformat',# all members, to match VMDK spec 
> spellings
> -'BlockdevVmdkAdapterType',  # legacyESX, to match VMDK spec spellings
> -'QapiErrorClass',   # all members, visible through errors
> -'UuidInfo', # UUID, visible through query-uuid
> -'X86CPURegister32', # all members, visible indirectly 
> through qom-get
> -'CpuInfo'   # CPU, visible through query-cpu
> -] } }
> +{ 'include': 'pragma.json' }
>  
>  # Documentation generated with qapi-gen.py is in source order, with
>  # included sub-schemas inserted at the first include directive
> diff --git a/qapi/Makefile.objs b/qapi/Makefile.objs
> index 519b6f1a8e..3e04e299ed 100644
> --- a/qapi/Makefile.objs
> +++ b/qapi/Makefile.objs
> @@ -7,7 +7,7 @@ util-obj-y += qapi-util.o
>  
>  QAPI_COMMON_MODULES = audio authz block-core block char common crypto
>  QAPI_COMMON_MODULES += dump error introspect job machine migration misc 
> monitor
> -QAPI_COMMON_MODULES += net qdev qom rdma rocker run-state sockets tpm
> +QAPI_COMMON_MODULES += net pragma qdev qom rdma rocker run-state sockets tpm
>  QAPI_COMMON_MODULES += trace transaction ui
>  QAPI_TARGET_MODULES = machine-target misc-target
>  QAPI_MODULES = $(QAPI_COMMON_MODULES) $(QAPI_TARGET_MODULES)

Only works because we accept and ignore names in these whitelists that
don't exist in the schema.

Which parts of the whitelists are actually needed in the storage daemon?




Re: [PATCH v6 1/3] block: introduce compress filter driver

2019-11-12 Thread Kevin Wolf
Am 11.11.2019 um 17:04 hat Andrey Shinkevich geschrieben:
> Allow writing all the data compressed through the filter driver.
> The written data will be aligned by the cluster size.
> Based on the QEMU current implementation, that data can be written to
> unallocated clusters only. May be used for a backup job.
> 
> Suggested-by: Max Reitz 
> Signed-off-by: Andrey Shinkevich 

> +static BlockDriver bdrv_compress = {
> +.format_name= "compress",
> +
> +.bdrv_open  = zip_open,
> +.bdrv_child_perm= zip_child_perm,

Why do you call the functions zip_* when the driver is called compress?
I think zip would be a driver for zip archives, which we don't use here.

> +.bdrv_getlength = zip_getlength,
> +.bdrv_co_truncate   = zip_co_truncate,
> +
> +.bdrv_co_preadv = zip_co_preadv,
> +.bdrv_co_preadv_part= zip_co_preadv_part,
> +.bdrv_co_pwritev= zip_co_pwritev,
> +.bdrv_co_pwritev_part   = zip_co_pwritev_part,

If you implement .bdrv_co_preadv/pwritev_part, isn't the implementation
of .bdrv_co_preadv/pwritev (without _part) dead code?
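
(For context, a rough paraphrase of the dispatch in block/io.c's
bdrv_driver_preadv() — not the exact code, and the qiov slicing done for
the fallback path is omitted — which is why the plain callbacks become
unreachable once the _part variants are set:)

    if (drv->bdrv_co_preadv_part) {
        return drv->bdrv_co_preadv_part(bs, offset, bytes, qiov, qiov_offset,
                                        flags);
    }
    /* only reached for drivers that do not provide a _part callback */
    return drv->bdrv_co_preadv(bs, offset, bytes, qiov, flags);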

> +.bdrv_co_pwrite_zeroes  = zip_co_pwrite_zeroes,
> +.bdrv_co_pdiscard   = zip_co_pdiscard,
> +.bdrv_refresh_limits= zip_refresh_limits,
> +
> +.bdrv_eject = zip_eject,
> +.bdrv_lock_medium   = zip_lock_medium,
> +
> +.bdrv_co_block_status   = bdrv_co_block_status_from_backing,

Why not use bs->file? (Well, apart from the still not merged filter
series by Max...)

> +.bdrv_recurse_is_first_non_filter   = zip_recurse_is_first_non_filter,
> +
> +.has_variable_length= true,
> +.is_filter  = true,
> +};

Kevin




Re: API definition for LUKS key management

2019-11-12 Thread Kevin Wolf
Am 11.11.2019 um 19:34 hat Daniel P. Berrangé geschrieben:
> On Mon, Nov 11, 2019 at 05:58:20PM +0200, Maxim Levitsky wrote:
> > One of the concerns that was raised during the review was that amend 
> > interface for luks that I propose is
> > different from the amend interface used currently for qcow2.
> > 
> > qcow2 amend interface specifies all the format options, thus overwrites the 
> > existing options.
> > Thus it seems natural to make the luks amend interface work the same way, 
> > that it receive an array
> > of 8 slots, and for each slot specify if it is active, and if true what 
> > password to put in it.
> > This does allow to add and erase the keyslots, but it doesn't allow:
> > 
> >* add a password without knowing all other passwords that exist in 
> > existing keyslots
> >  this can be mitigated by specifying which keyslots to modify for 
> > example by omitting the
> >  keyslots that shouldn't be touched from the array (passing null 
> > placeholder instead)
> >  but then it already doesn't follow the 'specify all the options each 
> > time' principle.
> 
> I think this is highly undesirable, as we must not assume that the
> mgmt app has access to all the passwords currently set.

And I think this shows the problem that we really have with the crypto
driver and amend: For every other driver, if you must, you can query the
current settings and just write them back.

The difference here is that crypto doesn't allow to directly query or
specify the content of some options (the keyslots), but provides only a
way to derive that content from a secret, and obviously there is no way
back from the stored data to the secret (that's what it's for).

I think we have two options here:

1. Add a special "don't touch this" value for keyslots. Normally, just
   leaving out the value would be suitable syntax for this. Here,
   however, we have a list of keyslots, so we can't leave anything out.

   We could use something like an alternate between str (new secret ID),
   null (erase keyslot) and empty dict (leave it alone) - the latter
   feels a bit hackish, but maybe it's not too bad. If the list is
   shorter than 8 entries, the rest is assumed to mean "leave it alone",
   too.

2. Allow to query and set the raw key, which doesn't require a password

> The two key use cases for having multiple key slots are
> 
>   - To enable a two-phase change of passwords to ensure new password
> is safely written out before erasing the old password
> 
>   - To allow for multiple access passwords with different controls
> or access to when each password is made available.
> 
> eg each VM may have a separate "backup password" securely
> stored off host that is only made available for use when
> doing disaster recovery.
> 
> the second use case is doomed if you need to always provide all
> current passwords when changing any key slots.

That providing all current passwords doesn't work is obvious.

> >* erase all keyslots matching a password - this is really hard to do
> >  using this approach, unless we give the user some kind of API to try
> >  each keyslot with a given password, which is kind of ugly and might
> >  be racy as well.
> 
> > So what do you think?
> 
> The point of using "amend" is that we already have some of the boilerplate
> supporting framework around that, so it saves effort for both QEMU and
> our users. If the semantics of "amend" don't fit nicely though, then the
> benefit of re-using "amend" is cancelled out and we should go back to
> considering a separate "key-manage" command.

This wouldn't solve the fundamental problem that the crypto block
driver, as it currently is, isn't able to provide a blockdev-amend
callback. It's worse for qcow2 because qcow2 already implements amend.

I think we need to find a solution for the amend API.

Kevin




Re: [PATCH v6 1/3] block: introduce compress filter driver

2019-11-12 Thread Vladimir Sementsov-Ogievskiy
11.11.2019 23:47, Eric Blake wrote:
> On 11/11/19 10:04 AM, Andrey Shinkevich wrote:
>> Allow writing all the data compressed through the filter driver.
>> The written data will be aligned to the cluster size.
>> With the current QEMU implementation, that data can be written to
>> unallocated clusters only. It may be used for a backup job.
>>
>> Suggested-by: Max Reitz 
>> Signed-off-by: Andrey Shinkevich 
>> ---
>>   block/Makefile.objs |   1 +
>>   block/filter-compress.c | 212 
>> 
>>   qapi/block-core.json    |  10 ++-
>>   3 files changed, 219 insertions(+), 4 deletions(-)
>>   create mode 100644 block/filter-compress.c
> 
>> +++ b/qapi/block-core.json
>> @@ -2884,15 +2884,16 @@
>>   # @copy-on-read: Since 3.0
>>   # @blklogwrites: Since 3.0
>>   # @blkreplay: Since 4.2
>> +# @compress: Since 4.2
> 
> Are we still trying to get this in 4.2, even though soft freeze is past?  Or 
> are we going to have to defer it to 5.0 as a new feature?
> 

5.0 of course

-- 
Best regards,
Vladimir



Re: Fwd: how to create a block device?

2019-11-12 Thread Kevin Wolf
Am 12.11.2019 um 01:33 hat Liu Jaloo geschrieben:
> thanks Kevin for the reply.
> I want to create a Slackware USB installer disk with usbimg2disk.sh.
> I created a disk with the command:
> $ qemu-img create slackware64-current-udisk.img 8G
> then, in the Slackware source directory:
> $ sudo ./usbimg2disk.sh -i usbboot.img -o slackware64-current-udisk.img
> it reports an error like this:
> *** Not a block device:
> '/home/jaloo/qemu/slackware64-current/slackware64-current-udisk.img' !
> 
> in usbimg2disk.sh
>   # Sanity checks:
>   if [ ! -b $TOWIPE ]; then
>     echo "*** Not a block device: '$TOWIPE' !"
>     exit 1
>   fi
> 
> so my question is: how do I create a block device that satisfies the
> Bash conditional expression:
>    if -b
> 
> -b file
> 
> True if file exists and is a block special file.

You don't need to use usbimg2disk.sh. This tool is meant for the case
where you have a (physical) USB disk with an existing file system that
you don't want to overwrite: it adds the installer to that existing
file system.

As we're talking about virtual disks here, we have as many fresh USB
disks as we like, and usbboot.img is already one. Just attach
usbboot.img directly to the VM.
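
Untested, but something along these lines should work (adjust the
machine type and memory to taste):

  $ qemu-system-x86_64 -m 2G \
      -usb \
      -drive if=none,id=usbstick,format=raw,file=usbboot.img \
      -device usb-storage,drive=usbstick,bootindex=0

This presents usbboot.img to the guest as a USB stick and boots from it;
any disks you want to install to can be added with further -drive
options.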

Kevin

> On Mon, Nov 11, 2019 at 5:19 PM Kevin Wolf  wrote:
> 
> > Am 09.11.2019 um 10:05 hat Liu Jaloo geschrieben:
> > > block device for the Bash conditional expression
> > > if -b
> > >
> > > -- Forwarded message -
> > > From: Liu Jaloo 
> > > Date: Sat, Nov 9, 2019 at 4:46 PM
> > > Subject: how to create a block devie?
> > > To: 
> > >
> > >
> > > I want to emulate a USB disk for installing an operating system; how do I
> > > create a block device using qemu-img that works with the dd command?
> >
> > You don't need to write an image to a block device on the host for
> > emulating a USB disk in QEMU. Just use the image file directly.
> >
> > Kevin
> >
> >




Re: [RFC PATCH 15/18] qapi: Support empty modules

2019-11-12 Thread Markus Armbruster
Kevin Wolf  writes:

> If you added an include file that doesn't contain any definitions, no
> source files would be generated for it. However, in other source files,
> you would still get an #include for the header files of the empty
> module.

Bug.

Cause: we always generate the #include for module.h, but module.h itself
only when visiting the module's first definition.  If there are no
definitions, we don't generate it.

> The intended behaviour is that empty source files are created for empty
> modules.

Yes.

>  This patch makes QAPISchema keep a list of all modules
> (including empty ones) and modifies visit() to first visit all modules
> in that list.

Minimally invasive fix.  Backends still initialize module output on
first visit_module(), but now all modules are visited upfront.

Separating "initialize module" from "switch to module" might be easier
to understand.  Idea, not demand.
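
For instance (visit_module_init() is invented here, just to illustrate
the split):

    def visit(self, visitor):
        visitor.visit_begin(self)
        # initialize output for every module up front, empty or not ...
        for module in self._modules:
            visitor.visit_module_init(module)
        # ... so that visit_module() only means "switch to this module"
        module = None
        visitor.visit_module(module)
        for entity in self._entity_list:
            if visitor.visit_needed(entity):
                if entity.module != module:
                    module = entity.module
                    visitor.visit_module(module)
                entity.visit(visitor)
        visitor.visit_end()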

> Some test reference outputs need to be updated due to the additional
> visitor calls.
>
> Signed-off-by: Kevin Wolf 
> ---
>  scripts/qapi/schema.py   | 9 +
>  tests/qapi-schema/comments.out   | 2 ++
>  tests/qapi-schema/doc-bad-section.out| 2 ++
>  tests/qapi-schema/doc-good.out   | 2 ++
>  tests/qapi-schema/empty.out  | 2 ++
>  tests/qapi-schema/event-case.out | 2 ++
>  tests/qapi-schema/include-repetition.out | 4 
>  tests/qapi-schema/include-simple.out | 3 +++
>  tests/qapi-schema/indented-expr.out  | 2 ++
>  tests/qapi-schema/qapi-schema-test.out   | 4 
>  10 files changed, 32 insertions(+)
>
> diff --git a/scripts/qapi/schema.py b/scripts/qapi/schema.py
> index 38041098bd..e1b034d67d 100644
> --- a/scripts/qapi/schema.py
> +++ b/scripts/qapi/schema.py
> @@ -749,6 +749,7 @@ class QAPISchema(object):
>          self.docs = parser.docs
>          self._entity_list = []
>          self._entity_dict = {}
> +        self._modules = [os.path.basename(fname)]
>          self._predefining = True
>          self._def_predefineds()
>          self._predefining = False
> @@ -800,6 +801,8 @@ class QAPISchema(object):
>              main_info = main_info.parent
>          fname = os.path.relpath(include, os.path.dirname(main_info.fname))
>          self._def_entity(QAPISchemaInclude(fname, info))
> +        if fname not in self._modules:
> +            self._modules.append(fname)
>  
>      def _def_builtin_type(self, name, json_type, c_type):
>          self._def_entity(QAPISchemaBuiltinType(name, json_type, c_type))
> @@ -1033,6 +1036,12 @@ class QAPISchema(object):
>          visitor.visit_begin(self)
>          module = None
>          visitor.visit_module(module)
> +
> +        # Make sure that all modules are visited, even if they contain no
> +        # entities
> +        for module in self._modules:
> +            visitor.visit_module(module)
> +

Slightly neater, I think:

   visitor.visit_begin(self)
  +
  +# Visit all modules, to ensure @visitor sees them
  +for module in self._modules:
  +    visitor.visit_module(module)
  +
   module = None
   visitor.visit_module(module)

This way, we keep starting with module None rather than whatever user
module comes last.  The .out diffs below then don't add another "module
None" line.

>          for entity in self._entity_list:
>              if visitor.visit_needed(entity):
>                  if entity.module != module:
> diff --git a/tests/qapi-schema/comments.out b/tests/qapi-schema/comments.out
> index 273f0f54e1..fa7e95d1cc 100644
> --- a/tests/qapi-schema/comments.out
> +++ b/tests/qapi-schema/comments.out
> @@ -1,4 +1,6 @@
>  module None
> +module comments.json
> +module None
>  object q_empty
>  enum QType
>  prefix QTYPE
[...]