Re: [Qemu-block] [Qemu-devel] [PATCH 0/5] Few fixes for userspace NVME driver

2019-06-02 Thread Maxim Levitsky
On Mon, 2019-04-15 at 16:57 +0300, Maxim Levitsky wrote:
> CC: Fam Zheng 
> CC: Kevin Wolf 
> CC: Max Reitz 
> CC: qemu-de...@nongnu.org
> 
> 
> Hi!
> These are a few assorted fixes and features for the userspace
> nvme driver.
> 
> Tested on my laptop with my Samsung X5 Thunderbolt drive (which happens to
> have 4K sectors), including the discard and write zeros support.
> 
> Also included are a bunch of fixes that were sitting in my queue from the
> period when I developed the nvme-mdev driver.
> 
> Best regards,
>   Maxim Levitsky
> 
> Maxim Levitsky (5):
>   block/nvme: don't flip CQ phase bits
>   block/nvme: fix doorbell stride
>   block/nvme: support larger that 512 bytes sector devices
>   block/nvme: add support for write zeros
>   block/nvme: add support for discard
> 
>  block/nvme.c | 194 +--
>  block/trace-events   |   3 +
>  include/block/nvme.h |  17 +++-
>  3 files changed, 204 insertions(+), 10 deletions(-)
> 

Ping.

Best regards,
Maxim Levitsky





Re: [Qemu-block] [PATCH v2 0/5] Few fixes for userspace NVME driver

2019-06-03 Thread Maxim Levitsky
On Wed, 2019-04-17 at 22:53 +0300, Maxim Levitsky wrote:
> Hi!
> These are a few assorted fixes and features for the userspace
> nvme driver.
> 
> Tested on my laptop with my Samsung X5 Thunderbolt drive (which happens to
> have 4K sectors), including the discard and write zeros support.
> 
> Also included are a bunch of fixes that were sitting in my queue from the
> period when I developed the nvme-mdev driver.
> 
> Best regards,
>     Maxim Levitsky
> 
> Maxim Levitsky (5):
>   block/nvme: don't flip CQ phase bits
>   block/nvme: fix doorbell stride
>   block/nvme: support larger that 512 bytes sector devices
>   block/nvme: add support for write zeros
>   block/nvme: add support for discard
> 
>  block/nvme.c | 193 +--
>  block/trace-events   |   3 +
>  include/block/nvme.h |  19 -
>  3 files changed, 205 insertions(+), 10 deletions(-)
> 

Ping.

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v2 1/5] block/nvme: don't flip CQ phase bits

2019-06-05 Thread Maxim Levitsky
On Mon, 2019-06-03 at 18:25 -0400, John Snow wrote:
> 
> On 4/17/19 3:53 PM, Maxim Levitsky wrote:
> > Phase bits are only set by the hardware to indicate new completions
> > and not by the device driver.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 0684bbd077..2d208000df 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -340,8 +340,6 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
> > NVMeQueuePair *q)
> >  qemu_mutex_lock(&q->lock);
> >  c->cid = cpu_to_le16(0);
> >  q->inflight--;
> > -/* Flip Phase Tag bit. */
> > -c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
> >  progress = true;
> >  }
> >  if (progress) {
> > 
> 
> Since you've not got much traction on this and you've pinged a v2, can
> you point me to a spec or a reproducer that illustrates the problem?
> 
> (Or wait for more NVME knowledgeable people to give you a review...!)

"A Completion Queue entry is posted to the Completion Queue when the controller 
write of that Completion
Queue entry to the next free Completion Queue slot inverts the Phase Tag (P) 
bit from its previous value
in memory. The controller may generate an interrupt to the host to indicate 
that one or more Completion
Queue entries have been posted."
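
In other words, the host never writes the Phase Tag back; it only compares it
against the phase value it currently expects. Roughly, using the names from
block/nvme.c (just a sketch, not the final patch):

    /* q->cq_phase holds the "stale" phase value (and flips each time the
     * head wraps); an entry whose Phase Tag still equals it has not been
     * posted by the controller yet. */
    for (;;) {
        NvmeCqe *c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
        if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
            break;                      /* not written by the device yet */
        }
        /* ... hand the completion to its callback ... */
        q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
        if (q->cq.head == 0) {
            q->cq_phase = !q->cq_phase; /* controller flips the tag each pass */
        }
    }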



Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v2 5/5] block/nvme: add support for discard

2019-06-06 Thread Maxim Levitsky
On Thu, 2019-06-06 at 11:19 +0800, Fam Zheng wrote:
> On Wed, 04/17 22:53, Maxim Levitsky wrote:
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c   | 80 ++
> >  block/trace-events |  2 ++
> >  2 files changed, 82 insertions(+)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 35b925899f..b83912c627 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -110,6 +110,7 @@ typedef struct {
> >  bool plugged;
> >  
> >  bool supports_write_zeros;
> > +bool supports_discard;
> >  
> >  CoMutex dma_map_lock;
> >  CoQueue dma_flush_queue;
> > @@ -462,6 +463,7 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >  
> >  
> >  s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
> > +s->supports_discard = (idctrl->oncs & NVME_ONCS_DSM) != 0;
> >  
> >  memset(resp, 0, 4096);
> >  
> > @@ -1144,6 +1146,83 @@ static coroutine_fn int 
> > nvme_co_pwrite_zeroes(BlockDriverState *bs,
> >  }
> >  
> >  
> > +static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
> > +int64_t offset, int bytes)
> 
> While you respin, you can align the parameters.

Hi Fam!!


I didn't know that this is also required by the QEMU coding style (it's kind of
suggested in the kernel).
I'll be more than glad to do so!
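
For example:

static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
                                         int64_t offset, int bytes)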


> 
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +NVMeQueuePair *ioq = s->queues[1];
> > +NVMeRequest *req;
> > +NvmeDsmRange *buf;
> > +QEMUIOVector local_qiov;
> > +int r;
> > +
> > +NvmeCmd cmd = {
> > +.opcode = NVME_CMD_DSM,
> > +.nsid = cpu_to_le32(s->nsid),
> > +.cdw10 = 0, /*number of ranges - 0 based*/
> > +.cdw11 = cpu_to_le32(1 << 2), /*deallocate bit*/
> > +};
> > +
> > +NVMeCoData data = {
> > +.ctx = bdrv_get_aio_context(bs),
> > +.ret = -EINPROGRESS,
> > +};
> > +
> > +if (!s->supports_discard) {
> > +return -ENOTSUP;
> > +}
> > +
> > +assert(s->nr_queues > 1);
> > +
> > +buf = qemu_try_blockalign0(bs, 4096);
> > +if (!buf) {
> > +return -ENOMEM;
> > +}
> > +
> > +buf->nlb = bytes >> s->blkshift;
> > +buf->slba = offset >> s->blkshift;
> 
> This buffer is for the device, do we need to do anything about the endianness?

Thank you very very much, this is indeed an endianness bug.
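
Something like this should do it (just a sketch; the DSM range buffer is read
by the controller, so each field needs to be stored little-endian):

    buf->nlb   = cpu_to_le32(bytes >> s->blkshift);
    buf->slba  = cpu_to_le64(offset >> s->blkshift);
    buf->cattr = 0;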


Thanks a lot for the review,
Best regards,
Maxim Levitsky

> 
> > +buf->cattr = 0;
> > +
> > +qemu_iovec_init(&local_qiov, 1);
> > +qemu_iovec_add(&local_qiov, buf, 4096);
> > +
> > +req = nvme_get_free_req(ioq);
> > +assert(req);
> > +
> > +qemu_co_mutex_lock(&s->dma_map_lock);
> > +r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
> > +qemu_co_mutex_unlock(&s->dma_map_lock);
> > +
> > +if (r) {
> > +req->busy = false;
> > +return r;
> > +}
> > +
> > +trace_nvme_dsm(s, offset, bytes);
> > +
> > +nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> > +
> > +data.co = qemu_coroutine_self();
> > +while (data.ret == -EINPROGRESS) {
> > +qemu_coroutine_yield();
> > +}
> > +
> > +qemu_co_mutex_lock(&s->dma_map_lock);
> > +r = nvme_cmd_unmap_qiov(bs, &local_qiov);
> > +qemu_co_mutex_unlock(&s->dma_map_lock);
> > +if (r) {
> > +return r;
> > +}
> > +
> > +trace_nvme_dsm_done(s, offset, bytes, data.ret);
> > +
> > +qemu_iovec_destroy(&local_qiov);
> > +qemu_vfree(buf);
> > +return data.ret;
> > +
> > +}
> > +
> > +
> >  static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
> > BlockReopenQueue *queue, Error **errp)
> >  {
> > @@ -1250,6 +1329,7 @@ static BlockDriver bdrv_nvme = {
> >  .bdrv_co_pwritev  = nvme_co_pwritev,
> >  
> >  .bdrv_co_pwrite_zeroes= nvme_co_pwrite_zeroes,
> > +.bdrv_co_pdiscard = nvme_co_pdiscard,
> >  
> >  .bdrv_co_flush_to_disk= nvme_co_flush,
> >  .bdrv_reopen_prepare  = nvme_reopen_prepare,
> 

Re: [Qemu-block] [PATCH v4 1/9] configure: permit use of io_uring

2019-06-06 Thread Maxim Levitsky
On Mon, 2019-06-03 at 18:08 +0530, Aarushi Mehta wrote:
> Reviewed-by: Stefan Hajnoczi 
> Signed-off-by: Aarushi Mehta 
> ---
>  configure | 27 +++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/configure b/configure
> index 6cdcfb2dc3..cb49ef0fcc 100755
> --- a/configure
> +++ b/configure
> @@ -365,6 +365,7 @@ xen=""
>  xen_ctrl_version=""
>  xen_pci_passthrough=""
>  linux_aio=""
> +linux_io_uring=""
>  cap_ng=""
>  attr=""
>  libattr=""
> @@ -1266,6 +1267,10 @@ for opt do
>;;
>--enable-linux-aio) linux_aio="yes"
>;;
> +  --disable-linux-io-uring) linux_io_uring="no"
> +  ;;
> +  --enable-linux-io-uring) linux_io_uring="yes"
> +  ;;
>--disable-attr) attr="no"
>;;
>--enable-attr) attr="yes"
> @@ -1784,6 +1789,7 @@ disabled with --disable-FEATURE, default is enabled if 
> available:
>vde support for vde network
>netmap  support for netmap network
>linux-aio   Linux AIO support
> +  linux-io-uring  Linux io_uring support
>cap-ng  libcap-ng support
>attrattr and xattr support
>vhost-net   vhost-net kernel acceleration support
> @@ -3973,6 +3979,21 @@ EOF
>  linux_aio=no
>fi
>  fi
> +##
> +# linux-io-uring probe
> +
> +if test "$linux_io_uring" != "no" ; then
> +  if $pkg_config liburing; then
> +linux_io_uring_cflags=$($pkg_config --cflags liburing)
> +linux_io_uring_libs=$($pkg_config --libs liburing)
> +linux_io_uring=yes
> +  else
> +if test "$linux_io_uring" = "yes" ; then
> +  feature_not_found "linux io_uring" "Install liburing devel"
> +fi
> +linux_io_uring=no
> +  fi
> +fi
>  
>  ##
>  # TPM emulation is only on POSIX
> @@ -6396,6 +6417,7 @@ echo "PIE   $pie"
>  echo "vde support   $vde"
>  echo "netmap support$netmap"
>  echo "Linux AIO support $linux_aio"
> +echo "Linux io_uring support $linux_io_uring"
>  echo "ATTR/XATTR support $attr"
>  echo "Install blobs $blobs"
>  echo "KVM support   $kvm"
> @@ -6876,6 +6898,11 @@ fi
>  if test "$linux_aio" = "yes" ; then
>echo "CONFIG_LINUX_AIO=y" >> $config_host_mak
>  fi
> +if test "$linux_io_uring" = "yes" ; then
> +  echo "CONFIG_LINUX_IO_URING=y" >> $config_host_mak
> +  echo "LINUX_IO_URING_CFLAGS=$linux_io_uring_cflags" >> $config_host_mak
> +  echo "LINUX_IO_URING_LIBS=$linux_io_uring_libs" >> $config_host_mak
> +fi
>  if test "$attr" = "yes" ; then
>echo "CONFIG_ATTR=y" >> $config_host_mak
>  fi



Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH v4 2/9] qapi/block-core: add option for io_uring

2019-06-06 Thread Maxim Levitsky
On Wed, 2019-06-05 at 07:58 +0200, Markus Armbruster wrote:
> Aarushi Mehta  writes:
> 
> > Option only enumerates for hosts that support it.
> 
> Blank line here, please.  Same in other patches.
> 
> > Signed-off-by: Aarushi Mehta 
> > ---
> >  qapi/block-core.json | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > index 1defcde048..db7eedd058 100644
> > --- a/qapi/block-core.json
> > +++ b/qapi/block-core.json
> > @@ -2792,11 +2792,13 @@
> >  #
> >  # @threads: Use qemu's thread pool
> >  # @native:  Use native AIO backend (only Linux and Windows)
> > +# @io_uring:Use linux io_uring (since 4.1)
> >  #
> >  # Since: 2.9
> >  ##
> >  { 'enum': 'BlockdevAioOptions',
> > -  'data': [ 'threads', 'native' ] }
> > +  'data': [ 'threads', 'native',
> > +{ 'name': 'io_uring', 'if': 'defined(CONFIG_LINUX_IO_URING)' } 
> > ] }
> 
> We prefer '-' over '_' in the QAPI schema: 'io-uring' instead of
> 'io_uring'.  Exceptions can be made when existing siblings use '_' (not
> the case here), or to match how the thing is commonly spelled outside
> QEMU.  Up to the subject matter experts; I just want to make sure it's
> not accidental

I agree with that.
Other than that,

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v4 3/9] block/block: add BDRV flag for io_uring

2019-06-06 Thread Maxim Levitsky
On Mon, 2019-06-03 at 18:08 +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  include/block/block.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index 9b083e2bca..60f7c6c01c 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -121,6 +121,7 @@ typedef struct HDGeometry {
>ignoring the format layer */
>  #define BDRV_O_NO_IO   0x1 /* don't initialize for I/O */
>  #define BDRV_O_AUTO_RDONLY 0x2 /* degrade to read-only if opening 
> read-write fails */
> +#define BDRV_O_IO_URING0x4 /* use io_uring instead of the thread 
> pool */
>  
>  #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_NO_FLUSH)
>  

I had some fun learning why we need that flag.
A lot of code could be removed when we someday drop the -drive interface.


Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/5] block/nvme: don't flip CQ phase bits

2019-06-11 Thread Maxim Levitsky
On Fri, 2019-06-07 at 15:28 -0400, John Snow wrote:
> 
> On 6/7/19 7:08 AM, Paolo Bonzini wrote:
> > On 06/06/19 23:23, John Snow wrote:
> > > So: This looks right; does this fix a bug that can be observed? Do we
> > > have any regression tests for block/NVMe?
> > 
> > I don't think it fixes a bug; by the time the CQ entry is picked up by
> > QEMU, the device is not supposed to touch it anymore.
> > 
> > However, the idea behind the phase bits is that you can decide whether
> > the driver has placed a completion in the queue.  When we get here, we have
> > 
> > le16_to_cpu(c->status) & 0x1) == !q->cq_phase
> > 
> > On the next pass through the ring buffer q->cq_phase will be flipped,
> > and thus when we see this element we'll get
> > 
> > le16_to_cpu(c->status) & 0x1) == q->cq_phase
> > 
> > and not process it.  Since block/nvme.c flips the bit, this mechanism
> > does not work and the loop termination relies on the other part of the
> > condition, "if (!c->cid) break;".
> > 
> > So the patch is correct, but it would also be nice to also either remove
> > phase handling altogether, or check that the phase handling works
> > properly and drop the !c->cid test.
> > 
> > Paolo


I agree with that and I'll send an updated patch soon.

The driver should not touch the completion entries at all, but rather just scan 
for the entries whose
phase bit was flipped by the hardware.

In fact, I don't even think that 'c->cid' needs to be the exit condition; since
the device is not allowed to completely fill the completion queue (it must
always keep at least one free entry there), the end condition would still be
the check on the flipped phase bit.


I'll fix that to match the spec.

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH] nvme: do not advertise support for unsupported arbitration mechanism

2019-06-15 Thread Maxim Levitsky
On Fri, 2019-06-14 at 22:39 +0200, Max Reitz wrote:
> On 06.06.19 11:25, Klaus Birkelund Jensen wrote:
> > The device mistakenly reports that the Weighted Round Robin with Urgent
> > Priority Class arbitration mechanism is supported.
> > 
> > It is not.
> 
> I believe you based on the fact that there is no “weight” or “priority”
> anywhere in nvme.c, and that it does not evaluate the Arbitration
> Mechanism Selected field.
> 
> > Signed-off-by: Klaus Birkelund Jensen 
> > ---
> >  hw/block/nvme.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 30e50f7a3853..415b4641d6b4 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1383,7 +1383,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >  n->bar.cap = 0;
> >  NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
> >  NVME_CAP_SET_CQR(n->bar.cap, 1);
> > -NVME_CAP_SET_AMS(n->bar.cap, 1);
> 
> I suppose the better way would be to pass 0, so it is more explicit, I
> think.
> 
> (Just removing it looks like it may have just been forgotten.)
> 
> Max
> 
> >  NVME_CAP_SET_TO(n->bar.cap, 0xf);
> >  NVME_CAP_SET_CSS(n->bar.cap, 1);
> >  NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
> > 
> 
> 

Yeah, there is no way this driver supports WRRU, and I hadn't noticed it.
Just checked again to be sure.

To be honest, after a quick look,
this driver doesn't even really support regular Round-Robin, as it uses a
timer to process a specific queue,
kicked off by a doorbell write :-)


Acked-by: Maxim Levitsky 


Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring

2019-06-17 Thread Maxim Levitsky
default:
> +fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> +__func__, type);

Nitpick: don't we use some kind of error-printing function, like 'error_setg',
rather than fprintf?
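
(error_setg() needs an Error **errp to fill in, which this abort path doesn't
have; error_report() from qemu/error-report.h looks like the closer
replacement. Just a sketch, keeping the abort() below:)

    error_report("%s: invalid AIO request type 0x%x", __func__, type);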


> +abort();
> +}
> +io_uring_sqe_set_data(sqes, luringcb);
> +s->io_q.in_queue++;
> +
> +if (!s->io_q.blocked &&
> +(!s->io_q.plugged ||
> + s->io_q.in_flight + s->io_q.in_queue >= MAX_EVENTS)) {
> +return ioq_submit(s);
> +}
> +return 0;
> +}
> +
> +int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int 
> fd,
> +uint64_t offset, QEMUIOVector *qiov, int 
> type)
> +{
> +int ret;
> +LuringAIOCB luringcb = {
> +.co = qemu_coroutine_self(),
> +.ret= -EINPROGRESS,
> +.qiov   = qiov,
> +.is_read= (type == QEMU_AIO_READ),
> +};
> +
> +ret = luring_do_submit(fd, &luringcb, s, offset, type);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +if (luringcb.ret == -EINPROGRESS) {
> +qemu_coroutine_yield();
> +}
> +return luringcb.ret;
> +}
> +
> +void luring_detach_aio_context(LuringState *s, AioContext *old_context)
> +{
> +aio_set_fd_handler(old_context, s->ring.ring_fd, false, NULL, NULL, NULL,
> +   s);
> +qemu_bh_delete(s->completion_bh);
> +s->aio_context = NULL;
> +}
> +
> +void luring_attach_aio_context(LuringState *s, AioContext *new_context)
> +{
> +s->aio_context = new_context;
> +s->completion_bh = aio_bh_new(new_context, qemu_luring_completion_bh, s);
> +aio_set_fd_handler(s->aio_context, s->ring.ring_fd, false,
> +   qemu_luring_completion_cb, NULL, NULL, s);
> +}
> +
> +LuringState *luring_init(Error **errp)
> +{
> +int rc;
> +LuringState *s;
> +s = g_malloc0(sizeof(*s));
> +struct io_uring *ring = &s->ring;
> +rc =  io_uring_queue_init(MAX_EVENTS, ring, 0);
> +if (rc < 0) {
> +error_setg_errno(errp, errno, "failed to init linux io_uring ring");
> +g_free(s);
> +return NULL;
> +}
> +



> +ioq_init(&s->io_q);

Another nitpick: maybe inline that function, as it is used just here?
(That would also save the static declaration up front.)
Feel free to leave this as is if you think it is clearer this way.

> +return s;
> +
> +}
> +
> +void luring_cleanup(LuringState *s)
> +{
> +io_uring_queue_exit(&s->ring);
> +g_free(s);
> +}
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 0ca25dfec6..9da3fd9793 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -50,6 +50,7 @@ typedef void IOHandler(void *opaque);
>  struct Coroutine;
>  struct ThreadPool;
>  struct LinuxAioState;
> +struct LuringState;
>  
>  struct AioContext {
>  GSource source;
> @@ -118,11 +119,19 @@ struct AioContext {
>  struct ThreadPool *thread_pool;
>  
>  #ifdef CONFIG_LINUX_AIO
> -/* State for native Linux AIO.  Uses aio_context_acquire/release for
> +/*
> + * State for native Linux AIO.  Uses aio_context_acquire/release for
>   * locking.
>   */
>  struct LinuxAioState *linux_aio;
>  #endif
> +#ifdef CONFIG_LINUX_IO_URING
> +/*
> + * State for Linux io_uring.  Uses aio_context_acquire/release for
> + * locking.
> + */
> +struct LuringState *linux_io_uring;
> +#endif
>  
>  /* TimerLists for calling timers - one per clock type.  Has its own
>   * locking.
> @@ -387,6 +396,11 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext 
> *ctx, Error **errp);
>  /* Return the LinuxAioState bound to this AioContext */
>  struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
>  
> +/* Setup the LuringState bound to this AioContext */
> +struct LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
> +
> +/* Return the LuringState bound to this AioContext */
> +struct LuringState *aio_get_linux_io_uring(AioContext *ctx);
>  /**
>   * aio_timer_new_with_attrs:
>   * @ctx: the aio context
> diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> index 0cb7cc74a2..71d7d1395f 100644
> --- a/include/block/raw-aio.h
> +++ b/include/block/raw-aio.h
> @@ -55,6 +55,18 @@ void laio_attach_aio_context(LinuxAioState *s, AioContext 
> *new_context);
>  void laio_io_plug(BlockDriverState *bs, LinuxAioState *s);
>  void laio_io_unplug(BlockDriverState *bs, LinuxAioState *s);
>  #endif
> +/* io_uring.c - Linux io_uring implementation */
> +#ifdef CONFIG_LINUX_IO_URING
> +typedef struct LuringState LuringState;
> +LuringState *luring_init(Error **errp);
> +void luring_cleanup(LuringState *s);
> +int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int 
> fd,
> +uint64_t offset, QEMUIOVector *qiov, int 
> type);
> +void luring_detach_aio_context(LuringState *s, AioContext *old_context);
> +void luring_attach_aio_context(LuringState *s, AioContext *new_context);
> +void luring_io_plug(BlockDriverState *bs, LuringState *s);
> +void luring_io_unplug(BlockDriverState *bs, LuringState *s);
> +#endif
>  
>  #ifdef _WIN32
>  typedef struct QEMUWin32AIOState QEMUWin32AIOState;


I plan to do some benchmarks of the code this week or the next, and I will
share the results as soon as I have them.

Please pardon me if I made some mistakes in the review; most of QEMU is still
new to me, so I don't yet know much of the code here.

Best regards,
Maxim Levitsky






Re: [Qemu-block] [PATCH v5 05/12] stubs: add stubs for io_uring interface

2019-06-17 Thread Maxim Levitsky
On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  MAINTAINERS |  1 +
>  stubs/Makefile.objs |  1 +
>  stubs/io_uring.c| 32 
>  3 files changed, 34 insertions(+)
>  create mode 100644 stubs/io_uring.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 49f896796e..bc38175124 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2522,6 +2522,7 @@ R: Stefan Hajnoczi 
>  L: qemu-block@nongnu.org
>  S: Maintained
>  F: block/io_uring.c
> +F: stubs/io_uring.c
>  
>  qcow2
>  M: Kevin Wolf 
> diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
> index 9c7393b08c..5cf160a9c8 100644
> --- a/stubs/Makefile.objs
> +++ b/stubs/Makefile.objs
> @@ -13,6 +13,7 @@ stub-obj-y += iothread.o
>  stub-obj-y += iothread-lock.o
>  stub-obj-y += is-daemonized.o
>  stub-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> +stub-obj-$(CONFIG_LINUX_IO_URING) += io_uring.o
>  stub-obj-y += machine-init-done.o
>  stub-obj-y += migr-blocker.o
>  stub-obj-y += change-state-handler.o
> diff --git a/stubs/io_uring.c b/stubs/io_uring.c
> new file mode 100644
> index 00..622d1e4648
> --- /dev/null
> +++ b/stubs/io_uring.c
> @@ -0,0 +1,32 @@
> +/*
> + * Linux io_uring support.
> + *
> + * Copyright (C) 2009 IBM, Corp.
> + * Copyright (C) 2009 Red Hat, Inc.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include "block/aio.h"
> +#include "block/raw-aio.h"
> +
> +void luring_detach_aio_context(LuringState *s, AioContext *old_context)
> +{
> +abort();
> +}
> +
> +void luring_attach_aio_context(LuringState *s, AioContext *new_context)
> +{
> +abort();
> +}
> +
> +LuringState *luring_init(Error **errp)
> +{
> +abort();
> +}
> +
> +void luring_cleanup(LuringState *s)
> +{
> +abort();
> +}

I do wonder if there is any value in these stubs (and the linux-aio stubs as
well), since any attempt to use them will abort the test, even basic AIO
initialization.

I am not yet familiar with the QEMU unit tests, so I won't put an ack on this
patch yet.

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v5 06/12] util/async: add aio interfaces for io_uring

2019-06-17 Thread Maxim Levitsky
On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  util/async.c | 36 
>  1 file changed, 36 insertions(+)
> 
> diff --git a/util/async.c b/util/async.c
> index c10642a385..2709f0edc3 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -277,6 +277,14 @@ aio_ctx_finalize(GSource *source)
>  }
>  #endif
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +if (ctx->linux_io_uring) {
> +luring_detach_aio_context(ctx->linux_io_uring, ctx);
> +luring_cleanup(ctx->linux_io_uring);
> +ctx->linux_io_uring = NULL;
> +}
> +#endif
> +
>  assert(QSLIST_EMPTY(&ctx->scheduled_coroutines));
>  qemu_bh_delete(ctx->co_schedule_bh);
>  
> @@ -341,6 +349,29 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>  }
>  #endif
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp)
> +{
> +if (ctx->linux_io_uring) {
> +return ctx->linux_io_uring;
> +}
> +
> +ctx->linux_io_uring = luring_init(errp);
> +if (!ctx->linux_io_uring) {
> +return NULL;
> +}
> +
> +luring_attach_aio_context(ctx->linux_io_uring, ctx);
> +return ctx->linux_io_uring;
> +}
> +
> +LuringState *aio_get_linux_io_uring(AioContext *ctx)
> +{
> +assert(ctx->linux_io_uring);
> +return ctx->linux_io_uring;
> +}
> +#endif
> +
>  void aio_notify(AioContext *ctx)
>  {

Minor nitpick: maybe we can memset the whole private area of the AioContext to
0, and then set up only the fields that are not zero? That would remove most
of the code below.
This is an old habit from kernel code.

(I assume that g_source_new doesn't do this.)
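
Something like this right after g_source_new(), just as a sketch (GSource is
the first member of AioContext, so everything after it is the private area):

    memset((char *)ctx + sizeof(ctx->source), 0,
           sizeof(*ctx) - sizeof(ctx->source));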


>  /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> @@ -432,6 +463,11 @@ AioContext *aio_context_new(Error **errp)
>  #ifdef CONFIG_LINUX_AIO
>  ctx->linux_aio = NULL;
>  #endif
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +ctx->linux_io_uring = NULL;
> +#endif
> +
>  ctx->thread_pool = NULL;
>  qemu_rec_mutex_init(&ctx->lock);
>  timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);


Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v5 07/12] blockdev: accept io_uring as option

2019-06-17 Thread Maxim Levitsky
On Mon, 2019-06-10 at 19:19 +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  blockdev.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/blockdev.c b/blockdev.c
> index 3f44b891eb..a2a5b32604 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -386,6 +386,8 @@ static void extract_common_blockdev_options(QemuOpts 
> *opts, int *bdrv_flags,
>  if ((aio = qemu_opt_get(opts, "aio")) != NULL) {
>  if (!strcmp(aio, "native")) {
>  *bdrv_flags |= BDRV_O_NATIVE_AIO;
> +} else if (!strcmp(aio, "io_uring")) {
> +*bdrv_flags |= BDRV_O_IO_URING;
>  } else if (!strcmp(aio, "threads")) {
>  /* this is the default */
>  } else {
> @@ -4579,7 +4581,7 @@ QemuOptsList qemu_common_drive_opts = {
>  },{
>  .name = "aio",
>  .type = QEMU_OPT_STRING,
> -.help = "host AIO implementation (threads, native)",
> +.help = "host AIO implementation (threads, native, io_uring)",
>  },{
>  .name = BDRV_OPT_CACHE_WB,
>      .type = QEMU_OPT_BOOL,

Nitpick: maybe we should rename 'native' to 'libaio' (accept both, but give a
deprecation warning)?


Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v5 10/12] block/io_uring: adds userspace completion polling

2019-06-17 Thread Maxim Levitsky
On Tue, 2019-06-11 at 10:51 +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 10, 2019 at 07:19:03PM +0530, Aarushi Mehta wrote:
> > +static bool qemu_luring_poll_cb(void *opaque)
> > +{
> > +LuringState *s = opaque;
> > +struct io_uring_cqe *cqes;
> > +
> > +if (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
> > +if (!cqes) {
> > +qemu_luring_process_completions_and_submit(s);
> > +return true;
> > +}
> 
> Is this logic inverted?  We have a completion when cqes != NULL.

This indeed looks inverted to me.
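
I.e. something along these lines (just a sketch; the completion path should
only run when a cqe was actually peeked):

    if (io_uring_peek_cqe(&s->ring, &cqes) == 0 && cqes) {
        qemu_luring_process_completions_and_submit(s);
        return true;
    }
    return false;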

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v5 08/12] block/file-posix.c: extend to use io_uring

2019-06-17 Thread Maxim Levitsky
> -assert(qiov->size == bytes);
> -return laio_co_submit(bs, aio, s->fd, offset, qiov, type);
> +} else if (s->use_linux_aio && s->needs_alignment) {
You can probably drop the s->needs_alignment check here, as it is set iff the
file is opened with O_DIRECT, which is a requirement for linux-aio.

While at it, maybe also update the comment prior to the 's->needs_alignment
&& !bdrv_qiov_is_aligned(bs, qiov)' check, since it is somewhat outdated.

I think it should state:

"When using O_DIRECT, the request must be aligned to be able to use either 
libaio or io_uring interface.
If not fail back to regular thread pool read/write code which emulates this for 
us if we set QEMU_AIO_MISALIGNED"

Or something like that.



> +LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
> +assert(qiov->size == bytes);
> +return laio_co_submit(bs, aio, s->fd, offset, qiov, type);
>  #endif
> -}
>  }
>  
>  acb = (RawPosixAIOData) {
> @@ -1920,24 +1951,36 @@ static int coroutine_fn 
> raw_co_pwritev(BlockDriverState *bs, uint64_t offset,
>  
>  static void raw_aio_plug(BlockDriverState *bs)
>  {
> +BDRVRawState __attribute__((unused)) *s = bs->opaque;
>  #ifdef CONFIG_LINUX_AIO
> -BDRVRawState *s = bs->opaque;
>  if (s->use_linux_aio) {
>  LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
>  laio_io_plug(bs, aio);
>  }
>  #endif
> +#ifdef CONFIG_LINUX_IO_URING
> +if (s->use_linux_io_uring) {
> +LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
> +luring_io_plug(bs, aio);
> +}
> +#endif
>  }
>  
>  static void raw_aio_unplug(BlockDriverState *bs)
>  {
> +BDRVRawState __attribute__((unused)) *s = bs->opaque;
>  #ifdef CONFIG_LINUX_AIO
> -BDRVRawState *s = bs->opaque;
>  if (s->use_linux_aio) {
>  LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
>  laio_io_unplug(bs, aio);
>  }
>  #endif
> +#ifdef CONFIG_LINUX_IO_URING
> +if (s->use_linux_io_uring) {
> +LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
> +luring_io_unplug(bs, aio);
> +}
> +#endif
>  }
>  
>  static int raw_co_flush_to_disk(BlockDriverState *bs)
> @@ -1963,8 +2006,8 @@ static int raw_co_flush_to_disk(BlockDriverState *bs)
>  static void raw_aio_attach_aio_context(BlockDriverState *bs,
> AioContext *new_context)
>  {
> +BDRVRawState __attribute__((unused)) *s = bs->opaque;
>  #ifdef CONFIG_LINUX_AIO
> -BDRVRawState *s = bs->opaque;
>  if (s->use_linux_aio) {
>  Error *local_err;
>  if (!aio_setup_linux_aio(new_context, &local_err)) {
> @@ -1974,6 +2017,16 @@ static void 
> raw_aio_attach_aio_context(BlockDriverState *bs,
>  }
>  }
>  #endif
> +#ifdef CONFIG_LINUX_IO_URING
> +if (s->use_linux_io_uring) {
> +Error *local_err;
> +if (!aio_setup_linux_io_uring(new_context, &local_err)) {
> +error_reportf_err(local_err, "Unable to use linux io_uring, "
> + "falling back to thread pool: ");
> +s->use_linux_io_uring = false;
> +}
> +}
> +#endif
>  }
>  
>  static void raw_close(BlockDriverState *bs)


Other than these minor notes,
Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v5 12/12] qemu-iotests/087: checks for io_uring

2019-06-17 Thread Maxim Levitsky
On Mon, 2019-06-10 at 19:19 +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta 
> ---
>  tests/qemu-iotests/087 | 26 ++
>  tests/qemu-iotests/087.out | 10 ++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/tests/qemu-iotests/087 b/tests/qemu-iotests/087
> index d6c8613419..0cc7283ad8 100755
> --- a/tests/qemu-iotests/087
> +++ b/tests/qemu-iotests/087
> @@ -124,6 +124,32 @@ run_qemu_filter_aio <<EOF
>  { "execute": "quit" }
>  EOF
>  
> +echo
> +echo === aio=io_uring without O_DIRECT ===
> +echo
> +
> +# Skip this test if io_uring is not enabled in this build
> +run_qemu_filter_io_uring()
> +{
> +run_qemu "$@"
> +}
> +
> +run_qemu_filter_io_uring <<EOF
> +{ "execute": "qmp_capabilities" }
> +{ "execute": "blockdev-add",
> +  "arguments": {
> +  "driver": "$IMGFMT",
> +  "node-name": "disk",
> +  "file": {
> +  "driver": "file",
> +  "filename": "$TEST_IMG",
> +  "aio": "io_uring"
> +  }
> +}
> +  }
> +{ "execute": "quit" }
> +EOF
> +
>  echo
>  echo === Encrypted image QCow ===
>  echo
> diff --git a/tests/qemu-iotests/087.out b/tests/qemu-iotests/087.out
> index 2d92ea847b..f0557d425f 100644
> --- a/tests/qemu-iotests/087.out
> +++ b/tests/qemu-iotests/087.out
> @@ -32,6 +32,16 @@ QMP_VERSION
>  {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
> "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
>  
>  
> +=== aio=io_uring without O_DIRECT ===
> +
> +Testing:
> +QMP_VERSION
> +{"return": {}}
> +{"return": {}}
> +{"return": {}}
> +{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
> "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
> +
> +

It looks like a wrong copy&paste happened here:

The "aio=native without O_DIRECT" test is a negative test, trying to enable
libaio support without O_DIRECT, which QEMU checks for and fails on.

io_uring, however, can be enabled without O_DIRECT, so a negative test doesn't
make sense here; what is needed IMHO is a positive test that not only enables
this mode but actually does some I/O in it, to see that it works.



>  === Encrypted image QCow ===
>  
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 encryption=on 
> encrypt.key-secret=sec0


Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring

2019-06-19 Thread Maxim Levitsky
On Wed, 2019-06-19 at 11:14 +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 17, 2019 at 03:26:50PM +0300, Maxim Levitsky wrote:
> > On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> > > +if (!cqes) {
> > > +break;
> > > +}
> > > +LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> > > +ret = cqes->res;
> > > +
> > > +if (ret == luringcb->qiov->size) {
> > > +ret = 0;
> > > +} else if (ret >= 0) {
> > 
> > 
> > You should very carefully check the allowed return values here.
> > 
> > It looks like you can get '-EINTR' here, which would ask you to rerun the 
> > read operation, and otherwise
> > you will get the number of bytes read, which might be less that what was 
> > asked for, which implies that you
> > need to retry the read operation with the remainder of the buffer rather 
> > that zero the end of the buffer IMHO 
> > 
> > (0 is returned on EOF according to 'read' semantics, which I think are used 
> > here, thus a short read might not be an EOF)
> > 
> > 
> > Looking at linux-aio.c though I do see that it just passes through the 
> > returned value with no special treatments. 
> > including lack of check for -EINTR.
> > 
> > I assume that since aio is linux specific, and it only supports direct IO, 
> > it happens
> > to have assumption of no short reads/-EINTR (but since libaio has very 
> > sparse documentation I can't verify this)
> > 
> > On the other hand the aio=threads implementation actually does everything 
> > as specified on the 'write' manpage,
> > retrying the reads on -EINTR, and doing additional reads if less that 
> > required number of bytes were read.
> > 
> > Looking at io_uring implementation in the kernel I see that it does support 
> > synchronous (non O_DIRECT mode), 
> > and in this case, it goes through the same ->read_iter which is pretty much 
> > the same path that 
> > regular read() takes and so it might return short reads and or -EINTR.
> 
> Interesting point.  Investigating EINTR should at least be a TODO
> comment and needs to be resolved before io_uring lands in a QEMU
> release.
> 
> > > +static int ioq_submit(LuringState *s)
> > > +{
> > > +int ret = 0;
> > > +LuringAIOCB *luringcb, *luringcb_next;
> > > +
> > > +while (s->io_q.in_queue > 0) {
> > > +QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
> > > +  luringcb_next) {
> > 
> > I am torn about the 'sq_overflow' name. it seems to me that its not 
> > immediately clear that these
> > are the requests that are waiting because the io uring got full, but I 
> > can't now think of a better name.
> > 
> > Maybe add a comment here to explain what is going on here?
> 
> Hmm...I suggested this name because I thought it was clear.  But the
> fact that it puzzled you proves it wasn't clear :-).
> 
> Can anyone think of a better name?  It's the queue we keep in QEMU to
> hold requests while the io_uring sq ring is full.
> 
> > Also maybe we could somehow utilize the plug/unplug facility to avoid 
> > reaching that state in first place?
> > Maybe the block layer has some kind of 'max outstanding requests' limit 
> > that could be used?
> > 
> > In my nvme-mdev I opted to not process the input queues when such a 
> > condition is detected, but here you can't as the block layer
> > pretty much calls you to process the requests.
> 
> Block layer callers are allowed to submit as many I/O requests as they
> like and there is no feedback mechanism.  It's up to linux-aio.c and
> io_uring.c to handle the case where host kernel I/O submission resources
> are exhausted.
> 
> Plug/unplug is a batching performance optimization to reduce the number
> of io_uring_enter() calls but it does not stop the callers from
> submitting more I/O requests.  So plug/unplug isn't directly applicable
> here.

Thanks for the explanation! I guess we can leave the name as is, but add a
comment or so in the place where the queue is accessed.



> 
> > > +static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState 
> > > *s,
> > > +uint64_t offset, int type)
> > > +{
> > > +struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> > > +if (!sqes) {
> > > +sqes = &

Re: [Qemu-block] [PATCH v5 07/12] blockdev: accept io_uring as option

2019-06-19 Thread Maxim Levitsky
On Wed, 2019-06-19 at 11:24 +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 17, 2019 at 04:01:45PM +0300, Maxim Levitsky wrote:
> > On Mon, 2019-06-10 at 19:19 +0530, Aarushi Mehta wrote:
> > > Signed-off-by: Aarushi Mehta 
> > > Reviewed-by: Stefan Hajnoczi 
> > > ---
> > >  blockdev.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/blockdev.c b/blockdev.c
> > > index 3f44b891eb..a2a5b32604 100644
> > > --- a/blockdev.c
> > > +++ b/blockdev.c
> > > @@ -386,6 +386,8 @@ static void extract_common_blockdev_options(QemuOpts 
> > > *opts, int *bdrv_flags,
> > >  if ((aio = qemu_opt_get(opts, "aio")) != NULL) {
> > >  if (!strcmp(aio, "native")) {
> > >  *bdrv_flags |= BDRV_O_NATIVE_AIO;
> > > +} else if (!strcmp(aio, "io_uring")) {
> > > +*bdrv_flags |= BDRV_O_IO_URING;
> > >  } else if (!strcmp(aio, "threads")) {
> > >  /* this is the default */
> > >  } else {
> > > @@ -4579,7 +4581,7 @@ QemuOptsList qemu_common_drive_opts = {
> > >  },{
> > >  .name = "aio",
> > >  .type = QEMU_OPT_STRING,
> > > -.help = "host AIO implementation (threads, native)",
> > > +.help = "host AIO implementation (threads, native, 
> > > io_uring)",
> > >  },{
> > >  .name = BDRV_OPT_CACHE_WB,
> > >  .type = QEMU_OPT_BOOL,
> > 
> > Nitpick: Maybe we should rename the native to libaio (accept both but give 
> > an deprication warning)?
> 
> "libaio" is a clearer name but I'm afraid changing it or introducing a
> new name is not worth it with so many users, command-lines, scripts, and
> management tools that know about "native".  Having two names that mean
> the same thing might cause confusion.
> 
> Let's leave it as is.
> 
I won't argue about this; besides, this can also be done later.

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH] block/qcow: Improve error when opening qcow2 files as qcow

2019-06-27 Thread Maxim Levitsky
On Wed, 2019-06-26 at 17:53 -0400, John Snow wrote:
> Reported-by: radmehrsae...@gmail.com
> Fixes: https://bugs.launchpad.net/bugs/1832914
> Signed-off-by: John Snow 
> ---
>  block/qcow.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/block/qcow.c b/block/qcow.c
> index 6dee5bb792..a9cb6ae0bd 100644
> --- a/block/qcow.c
> +++ b/block/qcow.c
> @@ -156,7 +156,12 @@ static int qcow_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  goto fail;
>  }
>  if (header.version != QCOW_VERSION) {
> -error_setg(errp, "Unsupported qcow version %" PRIu32, 
> header.version);
> +error_setg(errp, "qcow (v%d) does not support qcow version %" PRIu32,
> +   QCOW_VERSION, header.version);
> +if (header.version == 2 || header.version == 3) {
> +error_append_hint(errp, "Try the 'qcow2' driver instead.");
> +}
> +
>  ret = -ENOTSUP;
>  goto fail;
>  }

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




[Qemu-block] [PATCH 0/1] RFC: don't obey the block device max transfer len / max segments for block devices

2019-06-30 Thread Maxim Levitsky
It looks like Linux block devices, even in O_DIRECT mode, don't expose any
user-visible limit on the transfer size / number of segments that the
underlying block device may have.
The kernel block layer takes care of enforcing these limits by splitting the
bios.

By imposing these limits ourselves, we force QEMU to do the splitting, which
introduces various overheads.
It is especially visible with the NBD server, where the low max transfer size
of the underlying device forces us to advertise it over NBD, which increases
the traffic overhead for image conversion, a workload that benefits from large
blocks.

More information can be found here:
https://bugzilla.redhat.com/show_bug.cgi?id=1647104

I tested this with qemu-img convert, both over NBD and natively, and to my
surprise even native I/O performance improved a bit.
(The device it was tested on is an Intel Optane DC P4800X, which has a 128k
max transfer size.)

The benchmark:

Images were created using:

Sparse image:  qemu-img create -f qcow2 /dev/nvme0n1p3 1G / 10G / 100G
Allocated image: qemu-img create -f qcow2 /dev/nvme0n1p3 -o 
preallocation=metadata  1G / 10G / 100G

The test was:

 echo "convert native:"
 rm -rf /dev/shm/disk.img
 time qemu-img convert -p -f qcow2 -O raw -T none $FILE /dev/shm/disk.img > 
/dev/zero

 echo "convert via nbd:"
 qemu-nbd -k /tmp/nbd.sock -v  -f qcow2 $FILE -x export --cache=none 
--aio=native --fork
 rm -rf /dev/shm/disk.img
 time qemu-img convert -p -f raw -O raw 
nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img > /dev/zero

The results:

=
1G sparse image:
 native:
before: 0.027s
after: 0.027s
 nbd:
before: 0.287s
after: 0.035s

=
100G sparse image:
 native:
before: 0.028s
after: 0.028s
 nbd:
before: 23.796s
after: 0.109s

=
1G preallocated image:
 native:
   before: 0.454s
   after: 0.427s
 nbd:
   before: 0.649s
   after: 0.546s

The block limits of max transfer size/max segment size are retained
for the SCSI passthrough because in this case the kernel passes the userspace 
request
directly to the kernel scsi driver, bypassing the block layer, and thus there 
is no code to split
such requests.

What do you think?

Fam, since you were the original author of the code that added
these limits, could you share your opinion on this?
Was there any reason besides SCSI passthrough?

Best regards,
    Maxim Levitsky

Maxim Levitsky (1):
  raw-posix.c - use max transfer length / max segment count only for
SCSI passthrough

 block/file-posix.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

-- 
2.17.2




[Qemu-block] [PATCH 1/1] raw-posix.c - use max transfer length / max segment count only for SCSI passthrough

2019-06-30 Thread Maxim Levitsky
Regular block devices (/dev/sda*, /dev/nvme*, etc) interface is not limited
by the underlying storage limits, but rather the kernel block layer
takes care to split the requests that are too large/fragmented.

Doing so allows us to have less overhead in qemu.

Signed-off-by: Maxim Levitsky 
---
 block/file-posix.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index ab05b51a66..66dad34f8a 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1038,15 +1038,13 @@ static void raw_reopen_abort(BDRVReopenState *state)
 s->reopen_state = NULL;
 }
 
-static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
+static int sg_get_max_transfer_length(BlockDriverState *bs, int fd)
 {
 #ifdef BLKSECTGET
 int max_bytes = 0;
-short max_sectors = 0;
-if (bs->sg && ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
+
+if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
 return max_bytes;
-} else if (!bs->sg && ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
-return max_sectors << BDRV_SECTOR_BITS;
 } else {
 return -errno;
 }
@@ -1055,7 +1053,7 @@ static int hdev_get_max_transfer_length(BlockDriverState 
*bs, int fd)
 #endif
 }
 
-static int hdev_get_max_segments(const struct stat *st)
+static int sg_get_max_segments(const struct stat *st)
 {
 #ifdef CONFIG_LINUX
 char buf[32];
@@ -1106,12 +1104,12 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 struct stat st;
 
 if (!fstat(s->fd, &st)) {
-if (S_ISBLK(st.st_mode) || S_ISCHR(st.st_mode)) {
-int ret = hdev_get_max_transfer_length(bs, s->fd);
+if (bs->sg) {
+int ret = sg_get_max_transfer_length(bs, s->fd);
 if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
 bs->bl.max_transfer = pow2floor(ret);
 }
-ret = hdev_get_max_segments(&st);
+ret = sg_get_max_segments(&st);
 if (ret > 0) {
 bs->bl.max_transfer = MIN(bs->bl.max_transfer,
   ret * getpagesize());
-- 
2.17.2




Re: [Qemu-block] [Qemu-devel] [PATCH 0/1] RFC: don't obey the block device max transfer len / max segments for block devices

2019-07-02 Thread Maxim Levitsky
On Sun, 2019-06-30 at 18:08 +0300, Maxim Levitsky wrote:
> It looks like Linux block devices, even in O_DIRECT mode, don't expose any
> user-visible limit on the transfer size / number of segments that the
> underlying block device may have.
> The kernel block layer takes care of enforcing these limits by splitting the
> bios.
> 
> By imposing these limits ourselves, we force QEMU to do the splitting, which
> introduces various overheads.
> It is especially visible with the NBD server, where the low max transfer size
> of the underlying device forces us to advertise it over NBD, which increases
> the traffic overhead for image conversion, a workload that benefits from
> large blocks.
> 
> More information can be found here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1647104
> 
> I tested this with qemu-img convert, both over NBD and natively, and to my
> surprise even native I/O performance improved a bit.
> (The device it was tested on is an Intel Optane DC P4800X, which has a 128k
> max transfer size.)
> 
> The benchmark:
> 
> Images were created using:
> 
> Sparse image:  qemu-img create -f qcow2 /dev/nvme0n1p3 1G / 10G / 100G
> Allocated image: qemu-img create -f qcow2 /dev/nvme0n1p3 -o 
> preallocation=metadata  1G / 10G / 100G
> 
> The test was:
> 
>  echo "convert native:"
>  rm -rf /dev/shm/disk.img
>  time qemu-img convert -p -f qcow2 -O raw -T none $FILE /dev/shm/disk.img > 
> /dev/zero
> 
>  echo "convert via nbd:"
>  qemu-nbd -k /tmp/nbd.sock -v  -f qcow2 $FILE -x export --cache=none 
> --aio=native --fork
>  rm -rf /dev/shm/disk.img
>  time qemu-img convert -p -f raw -O raw 
> nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img > /dev/zero
> 
> The results:
> 
> =
> 1G sparse image:
>  native:
>   before: 0.027s
>   after: 0.027s
>  nbd:
>   before: 0.287s
>   after: 0.035s
> 
> =
> 100G sparse image:
>  native:
>   before: 0.028s
>   after: 0.028s
>  nbd:
>   before: 23.796s
>   after: 0.109s
> 
> =
> 1G preallocated image:
>  native:
>before: 0.454s
>after: 0.427s
>  nbd:
>before: 0.649s
>after: 0.546s
> 
> The block limits of max transfer size/max segment size are retained
> for the SCSI passthrough because in this case the kernel passes the userspace 
> request
> directly to the kernel scsi driver, bypassing the block layer, and thus there 
> is no code to split
> such requests.
> 
> What do you think?
> 
> Fam, since you were the original author of the code that added
> these limits, could you share your opinion on this?
> Was there any reason besides SCSI passthrough?
> 
> Best regards,
>   Maxim Levitsky
> 
> Maxim Levitsky (1):
>   raw-posix.c - use max transfer length / max segment count only for
> SCSI passthrough
> 
>  block/file-posix.c | 16 +++-
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 


Ping

Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH 1/1] raw-posix.c - use max transfer length / max segment count only for SCSI passthrough

2019-07-03 Thread Maxim Levitsky
On Wed, 2019-07-03 at 09:50 -0500, Eric Blake wrote:
> On 6/30/19 10:08 AM, Maxim Levitsky wrote:
> > Regular block devices (/dev/sda*, /dev/nvme*, etc) interface is not limited
> 
> The regular block device interface is
> 
> or
> 
> Regular block devices interfaces are
> 
> > by the underlying storage limits, but rather the kernel block layer
> > takes care to split the requests that are too large/fragmented.
> > 
> > Doing so allows us to have less overhead in qemu.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/file-posix.c | 16 +++-
> >  1 file changed, 7 insertions(+), 9 deletions(-)
> > 
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index ab05b51a66..66dad34f8a 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1038,15 +1038,13 @@ static void raw_reopen_abort(BDRVReopenState *state)
> >  s->reopen_state = NULL;
> >  }
> >  
> > -static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
> > +static int sg_get_max_transfer_length(BlockDriverState *bs, int fd)
> >  {
> >  #ifdef BLKSECTGET
> >  int max_bytes = 0;
> > -short max_sectors = 0;
> > -if (bs->sg && ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
> > +
> > +if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
> >  return max_bytes;
> > -} else if (!bs->sg && ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
> > -return max_sectors << BDRV_SECTOR_BITS;
> >  } else {
> >  return -errno;
> >  }
> > @@ -1055,7 +1053,7 @@ static int 
> > hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
> >  #endif
> >  }
> >  
> > -static int hdev_get_max_segments(const struct stat *st)
> > +static int sg_get_max_segments(const struct stat *st)
> >  {
> >  #ifdef CONFIG_LINUX
> >  char buf[32];
> > @@ -1106,12 +1104,12 @@ static void raw_refresh_limits(BlockDriverState 
> > *bs, Error **errp)
> >  struct stat st;
> >  
> >  if (!fstat(s->fd, &st)) {
> > -if (S_ISBLK(st.st_mode) || S_ISCHR(st.st_mode)) {
> > -int ret = hdev_get_max_transfer_length(bs, s->fd);
> 
> Is it worth delaying the fstat()...
> 
> > +if (bs->sg) {
> > +int ret = sg_get_max_transfer_length(bs, s->fd);
> >  if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> >  bs->bl.max_transfer = pow2floor(ret);
> >  }
> > -ret = hdev_get_max_segments(&st);
> > +ret = sg_get_max_segments(&st);
> 
> ...until inside the if (bs->sg) condition, to avoid wasted work for
> other scenarios?
> 
> >  if (ret > 0) {
> >  bs->bl.max_transfer = MIN(bs->bl.max_transfer,
> >ret * getpagesize());
> > 
> 
> Reviewed-by: Eric Blake 
> 

Thank you very much for the review. I'll send a V2 soon.

Best regards,
Maxim Levitsky






[Qemu-block] [PATCH v3 1/6] block/nvme: don't touch the completion entries

2019-07-03 Thread Maxim Levitsky
Completion entries are meant to be only read by the host; they are written by
the device.
The driver is supposed to scan the completions from the point where it last
left off, until it sees a completion whose phase bit has not been flipped yet.


Signed-off-by: Maxim Levitsky 
---
 block/nvme.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 73ed5fa75f..6d4e7f3d83 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -315,7 +315,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
NVMeQueuePair *q)
 while (q->inflight) {
 int16_t cid;
 c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
-if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
+if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
 break;
 }
 q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
@@ -339,10 +339,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
NVMeQueuePair *q)
 qemu_mutex_unlock(&q->lock);
 req.cb(req.opaque, nvme_translate_error(c));
 qemu_mutex_lock(&q->lock);
-c->cid = cpu_to_le16(0);
 q->inflight--;
-/* Flip Phase Tag bit. */
-c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
 progress = true;
 }
 if (progress) {
-- 
2.17.2




[Qemu-block] [PATCH v3 2/6] block/nvme: fix doorbell stride

2019-07-03 Thread Maxim Levitsky
Fix the math involving a non-standard doorbell stride.
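
Per the NVMe spec, doorbell y lives at offset 0x1000 + ((2 * y + is_cq) *
(4 << CAP.DSTRD)), and doorbell_scale is that stride expressed in 32-bit
words. For example, with CAP.DSTRD = 1 (8-byte stride, doorbell_scale = 2) and
idx = 1, the CQ head doorbell is doorbells[(1 * 2 + 1) * 2] = doorbells[6],
while the old expression idx * 2 * doorbell_scale + 1 evaluates to
doorbells[5], i.e. the wrong register whenever the stride is not the default
4 bytes.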

Signed-off-by: Maxim Levitsky 
---
 block/nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index 6d4e7f3d83..52798081b2 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -217,7 +217,7 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 error_propagate(errp, local_err);
 goto fail;
 }
-q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
+q->cq.doorbell = &s->regs->doorbells[(idx * 2 + 1) * s->doorbell_scale];
 
 return q;
 fail:
-- 
2.17.2




[Qemu-block] [PATCH v3 5/6] block/nvme: add support for write zeros

2019-07-03 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 block/nvme.c | 69 +++-
 block/trace-events   |  1 +
 include/block/nvme.h | 19 +++-
 3 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 152d27b07f..02e0846643 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -110,6 +110,8 @@ typedef struct {
 uint64_t max_transfer;
 bool plugged;
 
+bool supports_write_zeros;
+
 CoMutex dma_map_lock;
 CoQueue dma_flush_queue;
 
@@ -457,6 +459,8 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 s->max_transfer = MIN_NON_ZERO(s->max_transfer,
   s->page_size / sizeof(uint64_t) * s->page_size);
 
+s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
+
 memset(resp, 0, 4096);
 
 cmd.cdw10 = 0;
@@ -469,6 +473,11 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 s->nsze = le64_to_cpu(idns->nsze);
 lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
 
+if (NVME_ID_NS_DLFEAT_WRITE_ZEROS(idns->dlfeat) &&
+NVME_ID_NS_DLFEAT_READ_BEHAVIOR(idns->dlfeat) ==
+NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROS)
+bs->supported_write_flags |= BDRV_REQ_MAY_UNMAP;
+
 if (lbaf->ms) {
 error_setg(errp, "Namespaces with metadata are not yet supported");
 goto out;
@@ -763,6 +772,8 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 int ret;
 BDRVNVMeState *s = bs->opaque;
 
+bs->supported_write_flags = BDRV_REQ_FUA;
+
 opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
 qemu_opts_absorb_qdict(opts, options, &error_abort);
 device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
@@ -791,7 +802,6 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 goto fail;
 }
 }
-bs->supported_write_flags = BDRV_REQ_FUA;
 return 0;
 fail:
 nvme_close(bs);
@@ -1085,6 +1095,60 @@ static coroutine_fn int nvme_co_flush(BlockDriverState 
*bs)
 }
 
 
+static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
+  int64_t offset,
+  int bytes,
+  BdrvRequestFlags flags)
+{
+BDRVNVMeState *s = bs->opaque;
+NVMeQueuePair *ioq = s->queues[1];
+NVMeRequest *req;
+
+if (!s->supports_write_zeros) {
+return -ENOTSUP;
+}
+
+uint32_t cdw12 = ((bytes >> s->blkshift) - 1) & 0xFFFF;
+
+NvmeCmd cmd = {
+.opcode = NVME_CMD_WRITE_ZEROS,
+.nsid = cpu_to_le32(s->nsid),
+.cdw10 = cpu_to_le32((offset >> s->blkshift) & 0xFFFFFFFF),
+.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0xFFFFFFFF),
+};
+
+NVMeCoData data = {
+.ctx = bdrv_get_aio_context(bs),
+.ret = -EINPROGRESS,
+};
+
+if (flags & BDRV_REQ_MAY_UNMAP) {
+cdw12 |= (1 << 25);
+}
+
+if (flags & BDRV_REQ_FUA) {
+cdw12 |= (1 << 30);
+}
+
+cmd.cdw12 = cpu_to_le32(cdw12);
+
+trace_nvme_write_zeros(s, offset, bytes, flags);
+assert(s->nr_queues > 1);
+req = nvme_get_free_req(ioq);
+assert(req);
+
+nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+
+data.co = qemu_coroutine_self();
+while (data.ret == -EINPROGRESS) {
+qemu_coroutine_yield();
+}
+
+trace_nvme_rw_done(s, true, offset, bytes, data.ret);
+return data.ret;
+}
+
+
 static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
BlockReopenQueue *queue, Error **errp)
 {
@@ -1297,6 +1361,9 @@ static BlockDriver bdrv_nvme = {
 
 .bdrv_co_preadv   = nvme_co_preadv,
 .bdrv_co_pwritev  = nvme_co_pwritev,
+
+.bdrv_co_pwrite_zeroes= nvme_co_pwrite_zeroes,
+
 .bdrv_co_flush_to_disk= nvme_co_flush,
 .bdrv_reopen_prepare  = nvme_reopen_prepare,
 
diff --git a/block/trace-events b/block/trace-events
index 9ccea755da..12f363bb44 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -148,6 +148,7 @@ nvme_submit_command_raw(int c0, int c1, int c2, int c3, int 
c4, int c5, int c6,
 nvme_handle_event(void *s) "s %p"
 nvme_poll_cb(void *s) "s %p"
 nvme_prw_aligned(void *s, int is_write, uint64_t offset, uint64_t bytes, int 
flags, int niov) "s %p is_write %d offset %"PRId64" bytes %"PRId64" flags %d 
niov %d"
+nvme_write_zeros(void *s, uint64_t offset, uint64_t bytes, int flags) "s %p 
offset %"PRId64" bytes %"PRId64" flags %d"
 nvme_qiov_unaligned(const void *qiov, int n, void *base, size_t size, int 
align) ...

[Qemu-block] [PATCH v3 4/6] block/nvme: add support for image creation

2019-07-03 Thread Maxim Levitsky
Tested on an NVMe device like this:

# create preallocated qcow2 image
$ qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o preallocation=metadata
Formatting 'nvme://:06:00.0/1', fmt=qcow2 size=10737418240 
cluster_size=65536 preallocation=metadata lazy_refcounts=off refcount_bits=16

# create an empty qcow2 image
$ qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o preallocation=off
Formatting 'nvme://:06:00.0/1', fmt=qcow2 size=10737418240 
cluster_size=65536 preallocation=off lazy_refcounts=off refcount_bits=16

Signed-off-by: Maxim Levitsky 
---
 block/nvme.c | 108 +++
 1 file changed, 108 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index 1f0d09349f..152d27b07f 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1148,6 +1148,90 @@ static void nvme_aio_unplug(BlockDriverState *bs)
 }
 }
 
+static int coroutine_fn nvme_co_create_opts(const char *filename,
+QemuOpts *opts, Error **errp)
+{
+
+int64_t total_size = 0;
+char *buf = NULL;
+BlockDriverState *bs;
+QEMUIOVector local_qiov;
+int ret = 0;
+int64_t blocksize;
+QDict *options;
+Error *local_err = NULL;
+PreallocMode prealloc;
+
+total_size = ROUND_UP(qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0),
+  BDRV_SECTOR_SIZE);
+
+buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
+prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
+  PREALLOC_MODE_OFF, &local_err);
+g_free(buf);
+
+if (prealloc != PREALLOC_MODE_OFF) {
+error_setg(errp, "Only prealloc=off is supported");
+return -EINVAL;
+}
+
+options = qdict_new();
+qdict_put_str(options, "driver", "nvme");
+nvme_parse_filename(filename, options, &local_err);
+
+if (local_err) {
+error_propagate(errp, local_err);
+qobject_unref(options);
+return -EINVAL;
+}
+
+bs = bdrv_open(NULL, NULL, options,
+   BDRV_O_RDWR | BDRV_O_RESIZE | BDRV_O_PROTOCOL, errp);
+if (bs == NULL) {
+return -EIO;
+}
+
+if (nvme_getlength(bs) < total_size) {
+error_setg(errp, "Device is too small");
+bdrv_unref(bs);
+qobject_unref(options);
+return -ENOSPC;
+}
+
+blocksize = nvme_get_blocksize(bs);
+buf = qemu_try_blockalign0(bs, blocksize);
+qemu_iovec_init(&local_qiov, 1);
+qemu_iovec_add(&local_qiov, buf, blocksize);
+
+ret = nvme_co_prw_aligned(bs, 0, blocksize,
+&local_qiov, true, BDRV_REQ_FUA);
+if (ret) {
+error_setg(errp, "Write error to sector 0");
+}
+
+qemu_vfree(buf);
+bdrv_unref(bs);
+return ret;
+}
+
+
+static int coroutine_fn nvme_co_truncate(BlockDriverState *bs, int64_t offset,
+PreallocMode prealloc, Error **errp)
+{
+if (prealloc != PREALLOC_MODE_OFF) {
+error_setg(errp, "Preallocation mode '%s' unsupported nvme devices",
+PreallocMode_str(prealloc));
+return -ENOTSUP;
+}
+
+if (offset > nvme_getlength(bs)) {
+error_setg(errp, "Cannot grow nvme devices");
+return -EINVAL;
+}
+
+return 0;
+}
+
 static void nvme_register_buf(BlockDriverState *bs, void *host, size_t size)
 {
 int ret;
@@ -1169,6 +1253,7 @@ static void nvme_unregister_buf(BlockDriverState *bs, 
void *host)
 qemu_vfio_dma_unmap(s->vfio, host);
 }
 
+
 static const char *const nvme_strong_runtime_opts[] = {
 NVME_BLOCK_OPT_DEVICE,
 NVME_BLOCK_OPT_NAMESPACE,
@@ -1176,6 +1261,25 @@ static const char *const nvme_strong_runtime_opts[] = {
 NULL
 };
 
+
+static QemuOptsList nvme_create_opts = {
+.name = "nvme-create-opts",
+.head = QTAILQ_HEAD_INITIALIZER(nvme_create_opts.head),
+.desc = {
+{
+.name = BLOCK_OPT_SIZE,
+.type = QEMU_OPT_SIZE,
+.help = "Virtual disk size"
+},
+{
+.name = BLOCK_OPT_PREALLOC,
+.type = QEMU_OPT_STRING,
+.help = "Preallocation mode (allowed values: off)",
+},
+{ /* end of list */ }
+}
+};
+
 static BlockDriver bdrv_nvme = {
 .format_name  = "nvme",
 .protocol_name= "nvme",
@@ -1187,6 +1291,10 @@ static BlockDriver bdrv_nvme = {
 .bdrv_getlength   = nvme_getlength,
 .bdrv_probe_blocksizes= nvme_probe_blocksizes,
 
+.bdrv_co_create_opts  = nvme_co_create_opts,
+.bdrv_co_truncate = nvme_co_truncate,
+.create_opts  = &nvme_create_opts,
+
 .bdrv_co_preadv   = nvme_co_preadv,
 .bdrv_co_pwritev  = nvme_co_pwritev,
 .bdrv_co_flush_to_disk= nvme_co_flush,
-- 
2.17.2




[Qemu-block] [PATCH v3 6/6] block/nvme: add support for discard

2019-07-03 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 block/nvme.c   | 81 ++
 block/trace-events |  2 ++
 2 files changed, 83 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index 02e0846643..f8bcf1ffb6 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -111,6 +111,7 @@ typedef struct {
 bool plugged;
 
 bool supports_write_zeros;
+bool supports_discard;
 
 CoMutex dma_map_lock;
 CoQueue dma_flush_queue;
@@ -460,6 +461,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
   s->page_size / sizeof(uint64_t) * s->page_size);
 
 s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
+s->supports_discard = (idctrl->oncs & NVME_ONCS_DSM) != 0;
 
 memset(resp, 0, 4096);
 
@@ -1149,6 +1151,84 @@ static coroutine_fn int 
nvme_co_pwrite_zeroes(BlockDriverState *bs,
 }
 
 
+static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
+ int64_t offset,
+ int bytes)
+{
+BDRVNVMeState *s = bs->opaque;
+NVMeQueuePair *ioq = s->queues[1];
+NVMeRequest *req;
+NvmeDsmRange *buf;
+QEMUIOVector local_qiov;
+int r;
+
+NvmeCmd cmd = {
+.opcode = NVME_CMD_DSM,
+.nsid = cpu_to_le32(s->nsid),
+.cdw10 = 0, /*number of ranges - 0 based*/
+.cdw11 = cpu_to_le32(1 << 2), /*deallocate bit*/
+};
+
+NVMeCoData data = {
+.ctx = bdrv_get_aio_context(bs),
+.ret = -EINPROGRESS,
+};
+
+if (!s->supports_discard) {
+return -ENOTSUP;
+}
+
+assert(s->nr_queues > 1);
+
+buf = qemu_try_blockalign0(bs, 4096);
+if (!buf) {
+return -ENOMEM;
+}
+
+buf->nlb = bytes >> s->blkshift;
+buf->slba = offset >> s->blkshift;
+buf->cattr = 0;
+
+qemu_iovec_init(&local_qiov, 1);
+qemu_iovec_add(&local_qiov, buf, 4096);
+
+req = nvme_get_free_req(ioq);
+assert(req);
+
+qemu_co_mutex_lock(&s->dma_map_lock);
+r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
+qemu_co_mutex_unlock(&s->dma_map_lock);
+
+if (r) {
+req->busy = false;
+return r;
+}
+
+trace_nvme_dsm(s, offset, bytes);
+
+nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+
+data.co = qemu_coroutine_self();
+while (data.ret == -EINPROGRESS) {
+qemu_coroutine_yield();
+}
+
+qemu_co_mutex_lock(&s->dma_map_lock);
+r = nvme_cmd_unmap_qiov(bs, &local_qiov);
+qemu_co_mutex_unlock(&s->dma_map_lock);
+if (r) {
+return r;
+}
+
+trace_nvme_dsm_done(s, offset, bytes, data.ret);
+
+qemu_iovec_destroy(&local_qiov);
+qemu_vfree(buf);
+return data.ret;
+
+}
+
+
 static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
BlockReopenQueue *queue, Error **errp)
 {
@@ -1363,6 +1443,7 @@ static BlockDriver bdrv_nvme = {
 .bdrv_co_pwritev  = nvme_co_pwritev,
 
 .bdrv_co_pwrite_zeroes= nvme_co_pwrite_zeroes,
+.bdrv_co_pdiscard = nvme_co_pdiscard,
 
 .bdrv_co_flush_to_disk= nvme_co_flush,
 .bdrv_reopen_prepare  = nvme_reopen_prepare,
diff --git a/block/trace-events b/block/trace-events
index 12f363bb44..f763f79d99 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -152,6 +152,8 @@ nvme_write_zeros(void *s, uint64_t offset, uint64_t bytes, 
int flags) "s %p offs
 nvme_qiov_unaligned(const void *qiov, int n, void *base, size_t size, int 
align) "qiov %p n %d base %p size 0x%zx align 0x%x"
 nvme_prw_buffered(void *s, uint64_t offset, uint64_t bytes, int niov, int 
is_write) "s %p offset %"PRId64" bytes %"PRId64" niov %d is_write %d"
 nvme_rw_done(void *s, int is_write, uint64_t offset, uint64_t bytes, int ret) 
"s %p is_write %d offset %"PRId64" bytes %"PRId64" ret %d"
+nvme_dsm(void *s, uint64_t offset, uint64_t bytes) "s %p offset %"PRId64" 
bytes %"PRId64""
+nvme_dsm_done(void *s, uint64_t offset, uint64_t bytes, int ret) "s %p offset 
%"PRId64" bytes %"PRId64" ret %d"
 nvme_dma_map_flush(void *s) "s %p"
 nvme_free_req_queue_wait(void *q) "q %p"
 nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s 
%p cmd %p req %p qiov %p entries %d"
-- 
2.17.2




[Qemu-block] [PATCH v3 0/6] Few fixes for userspace NVME driver

2019-07-03 Thread Maxim Levitsky
Compared to last submission, this series adds another patch,
which implements support for image creation over the nvme drive like that:

qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o preallocation=metadata

I also addressed the review comments.

Best regards,
Maxim Levitsky

Maxim Levitsky (6):
  block/nvme: don't touch the completion entries
  block/nvme: fix doorbell stride
  block/nvme: support larger that 512 bytes sector devices
  block/nvme: add support for image creation
  block/nvme: add support for write zeros
  block/nvme: add support for discard

 block/nvme.c | 310 +--
 block/trace-events   |   3 +
 include/block/nvme.h |  19 ++-
 3 files changed, 320 insertions(+), 12 deletions(-)

-- 
2.17.2




[Qemu-block] [PATCH v3 3/6] block/nvme: support larger that 512 bytes sector devices

2019-07-03 Thread Maxim Levitsky
Currently the driver hardcodes the sector size to 512,
and doesn't check the underlying device. Fix that.

Also fail if underlying nvme device is formatted with metadata
as this needs special support.

Signed-off-by: Maxim Levitsky 
---
 block/nvme.c | 45 -
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 52798081b2..1f0d09349f 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -102,8 +102,11 @@ typedef struct {
 size_t doorbell_scale;
 bool write_cache_supported;
 EventNotifier irq_notifier;
+
 uint64_t nsze; /* Namespace size reported by identify command */
 int nsid;  /* The namespace id to read/write data. */
+size_t blkshift;
+
 uint64_t max_transfer;
 bool plugged;
 
@@ -415,8 +418,9 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 BDRVNVMeState *s = bs->opaque;
 NvmeIdCtrl *idctrl;
 NvmeIdNs *idns;
+NvmeLBAF *lbaf;
 uint8_t *resp;
-int r;
+int r, hwsect_size;
 uint64_t iova;
 NvmeCmd cmd = {
 .opcode = NVME_ADM_CMD_IDENTIFY,
@@ -463,7 +467,22 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 }
 
 s->nsze = le64_to_cpu(idns->nsze);
+lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
+
+if (lbaf->ms) {
+error_setg(errp, "Namespaces with metadata are not yet supported");
+goto out;
+}
+
+hwsect_size = 1 << lbaf->ds;
+
+if (hwsect_size < BDRV_SECTOR_BITS || hwsect_size > s->page_size) {
+error_setg(errp, "Namespace has unsupported block size (%d)",
+hwsect_size);
+goto out;
+}
 
+s->blkshift = lbaf->ds;
 out:
 qemu_vfio_dma_unmap(s->vfio, resp);
 qemu_vfree(resp);
@@ -782,8 +801,22 @@ fail:
 static int64_t nvme_getlength(BlockDriverState *bs)
 {
 BDRVNVMeState *s = bs->opaque;
+return s->nsze << s->blkshift;
+}
 
-return s->nsze << BDRV_SECTOR_BITS;
+static int64_t nvme_get_blocksize(BlockDriverState *bs)
+{
+BDRVNVMeState *s = bs->opaque;
+assert(s->blkshift >= 9);
+return 1 << s->blkshift;
+}
+
+static int nvme_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
+{
+int64_t blocksize = nvme_get_blocksize(bs);
+bsz->phys = blocksize;
+bsz->log = blocksize;
+return 0;
 }
 
 /* Called with s->dma_map_lock */
@@ -914,13 +947,14 @@ static coroutine_fn int 
nvme_co_prw_aligned(BlockDriverState *bs,
 BDRVNVMeState *s = bs->opaque;
 NVMeQueuePair *ioq = s->queues[1];
 NVMeRequest *req;
-uint32_t cdw12 = (((bytes >> BDRV_SECTOR_BITS) - 1) & 0x) |
+
+uint32_t cdw12 = (((bytes >> s->blkshift) - 1) & 0x) |
(flags & BDRV_REQ_FUA ? 1 << 30 : 0);
 NvmeCmd cmd = {
 .opcode = is_write ? NVME_CMD_WRITE : NVME_CMD_READ,
 .nsid = cpu_to_le32(s->nsid),
-.cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0x),
-.cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) & 
0x),
+.cdw10 = cpu_to_le32((offset >> s->blkshift) & 0x),
+.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0x),
 .cdw12 = cpu_to_le32(cdw12),
 };
 NVMeCoData data = {
@@ -1151,6 +1185,7 @@ static BlockDriver bdrv_nvme = {
 .bdrv_file_open   = nvme_file_open,
 .bdrv_close   = nvme_close,
 .bdrv_getlength   = nvme_getlength,
+.bdrv_probe_blocksizes= nvme_probe_blocksizes,
 
 .bdrv_co_preadv   = nvme_co_preadv,
 .bdrv_co_pwritev  = nvme_co_pwritev,
-- 
2.17.2




[Qemu-block] [PATCH v4] block/nvme: add support for discard

2019-07-03 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 block/nvme.c   | 81 ++
 block/trace-events |  2 ++
 2 files changed, 83 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index 02e0846643..96a715dcc1 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -111,6 +111,7 @@ typedef struct {
 bool plugged;
 
 bool supports_write_zeros;
+bool supports_discard;
 
 CoMutex dma_map_lock;
 CoQueue dma_flush_queue;
@@ -460,6 +461,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
   s->page_size / sizeof(uint64_t) * s->page_size);
 
 s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
+s->supports_discard = (idctrl->oncs & NVME_ONCS_DSM) != 0;
 
 memset(resp, 0, 4096);
 
@@ -1149,6 +1151,84 @@ static coroutine_fn int 
nvme_co_pwrite_zeroes(BlockDriverState *bs,
 }
 
 
+static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
+ int64_t offset,
+ int bytes)
+{
+BDRVNVMeState *s = bs->opaque;
+NVMeQueuePair *ioq = s->queues[1];
+NVMeRequest *req;
+NvmeDsmRange *buf;
+QEMUIOVector local_qiov;
+int r;
+
+NvmeCmd cmd = {
+.opcode = NVME_CMD_DSM,
+.nsid = cpu_to_le32(s->nsid),
+.cdw10 = 0, /*number of ranges - 0 based*/
+.cdw11 = cpu_to_le32(1 << 2), /*deallocate bit*/
+};
+
+NVMeCoData data = {
+.ctx = bdrv_get_aio_context(bs),
+.ret = -EINPROGRESS,
+};
+
+if (!s->supports_discard) {
+return -ENOTSUP;
+}
+
+assert(s->nr_queues > 1);
+
+buf = qemu_try_blockalign0(bs, 4096);
+if (!buf) {
+return -ENOMEM;
+}
+
+buf->nlb = cpu_to_le32(bytes >> s->blkshift);
+buf->slba = cpu_to_le64(offset >> s->blkshift);
+buf->cattr = 0;
+
+qemu_iovec_init(&local_qiov, 1);
+qemu_iovec_add(&local_qiov, buf, 4096);
+
+req = nvme_get_free_req(ioq);
+assert(req);
+
+qemu_co_mutex_lock(&s->dma_map_lock);
+r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
+qemu_co_mutex_unlock(&s->dma_map_lock);
+
+if (r) {
+req->busy = false;
+return r;
+}
+
+trace_nvme_dsm(s, offset, bytes);
+
+nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+
+data.co = qemu_coroutine_self();
+while (data.ret == -EINPROGRESS) {
+qemu_coroutine_yield();
+}
+
+qemu_co_mutex_lock(&s->dma_map_lock);
+r = nvme_cmd_unmap_qiov(bs, &local_qiov);
+qemu_co_mutex_unlock(&s->dma_map_lock);
+if (r) {
+return r;
+}
+
+trace_nvme_dsm_done(s, offset, bytes, data.ret);
+
+qemu_iovec_destroy(&local_qiov);
+qemu_vfree(buf);
+return data.ret;
+
+}
+
+
 static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
BlockReopenQueue *queue, Error **errp)
 {
@@ -1363,6 +1443,7 @@ static BlockDriver bdrv_nvme = {
 .bdrv_co_pwritev  = nvme_co_pwritev,
 
 .bdrv_co_pwrite_zeroes= nvme_co_pwrite_zeroes,
+.bdrv_co_pdiscard = nvme_co_pdiscard,
 
 .bdrv_co_flush_to_disk= nvme_co_flush,
 .bdrv_reopen_prepare  = nvme_reopen_prepare,
diff --git a/block/trace-events b/block/trace-events
index 12f363bb44..f763f79d99 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -152,6 +152,8 @@ nvme_write_zeros(void *s, uint64_t offset, uint64_t bytes, 
int flags) "s %p offs
 nvme_qiov_unaligned(const void *qiov, int n, void *base, size_t size, int 
align) "qiov %p n %d base %p size 0x%zx align 0x%x"
 nvme_prw_buffered(void *s, uint64_t offset, uint64_t bytes, int niov, int 
is_write) "s %p offset %"PRId64" bytes %"PRId64" niov %d is_write %d"
 nvme_rw_done(void *s, int is_write, uint64_t offset, uint64_t bytes, int ret) 
"s %p is_write %d offset %"PRId64" bytes %"PRId64" ret %d"
+nvme_dsm(void *s, uint64_t offset, uint64_t bytes) "s %p offset %"PRId64" 
bytes %"PRId64""
+nvme_dsm_done(void *s, uint64_t offset, uint64_t bytes, int ret) "s %p offset 
%"PRId64" bytes %"PRId64" ret %d"
 nvme_dma_map_flush(void *s) "s %p"
 nvme_free_req_queue_wait(void *q) "q %p"
 nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s 
%p cmd %p req %p qiov %p entries %d"
-- 
2.17.2




[Qemu-block] [PATCH v2 0/1] Don't obey the kernel block device max transfer len / max segments for raw block devices

2019-07-04 Thread Maxim Levitsky
Linux block devices, even in O_DIRECT mode, don't expose to userspace any
limit on transfer size / number of segments that the underlying kernel block
device may have.
The kernel block layer takes care of enforcing these limits by splitting the
bios.

By limiting the transfer sizes, we force qemu to do the splitting itself which
introduces various overheads.
It is especially visible in the nbd server, where the low max transfer size of the
underlying device forces us to advertise it over NBD, thus increasing the
traffic overhead for image conversion, which benefits from large blocks.
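
To illustrate the kind of splitting that a small advertised max transfer size
pushes up to the client, here is a conceptual sketch (made-up function names,
neither QEMU nor kernel code):

    #include <stdint.h>

    /* Issue one logical request as ceil(len / max_transfer) submissions.
     * Over NBD every submission becomes a separate request/reply round trip,
     * which is where the nbd overhead in the numbers below comes from. */
    static int split_and_submit(uint64_t off, uint64_t len, uint64_t max_transfer,
                                int (*submit)(uint64_t off, uint64_t len))
    {
        while (len > 0) {
            uint64_t chunk = len < max_transfer ? len : max_transfer;
            int ret = submit(off, chunk);

            if (ret < 0) {
                return ret;
            }
            off += chunk;
            len -= chunk;
        }
        return 0;
    }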

More information can be found here:
https://bugzilla.redhat.com/show_bug.cgi?id=1647104

Tested this with qemu-img convert over nbd and natively and to my surprise,
even native IO performance improved a bit.

(The device on which it was tested is Intel Optane DC P4800X,
which has 128k max transfer size reported by the kernel)

The benchmark:

Images were created using:

Sparse image:  qemu-img create -f qcow2 /dev/nvme0n1p3 1G / 10G / 100G
Allocated image: qemu-img create -f qcow2 /dev/nvme0n1p3 -o 
preallocation=metadata  1G / 10G / 100G

The test was:

 echo "convert native:"
 rm -rf /dev/shm/disk.img
 time qemu-img convert -p -f qcow2 -O raw -T none $FILE /dev/shm/disk.img > 
/dev/zero

 echo "convert via nbd:"
 qemu-nbd -k /tmp/nbd.sock -v  -f qcow2 $FILE -x export --cache=none 
--aio=native --fork
 rm -rf /dev/shm/disk.img
 time qemu-img convert -p -f raw -O raw 
nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img > /dev/zero

The results:

=
1G sparse image:
 native:
before: 0.027s
after: 0.027s
 nbd:
before: 0.287s
after: 0.035s

=
100G sparse image:
 native:
before: 0.028s
after: 0.028s
 nbd:
before: 23.796s
after: 0.109s

=
1G preallocated image:
 native:
   before: 0.454s
   after: 0.427s
 nbd:
   before: 0.649s
   after: 0.546s

The block limits of max transfer size/max segment size are retained
for the SCSI passthrough because in this case the kernel passes the userspace 
request
directly to the kernel scsi driver, bypassing the block layer, and thus there 
is no code to split
such requests.

Fam, since you were the original author of the code that added
these limits, could you share your opinion on that?
What was the reason besides SCSI passthrough?

V2:

*  Manually tested to not break the scsi passthrough with a nested VM
*  As Eric suggested, refactored the area around the fstat.
*  Spelling/grammar fixes

Best regards,
    Maxim Levitsky

Maxim Levitsky (1):
  raw-posix.c - use max transfer length / max segement count only for
SCSI passthrough

 block/file-posix.c | 54 --
 1 file changed, 28 insertions(+), 26 deletions(-)

-- 
2.17.2




[Qemu-block] [PATCH v2 1/1] raw-posix.c - use max transfer length / max segement count only for SCSI passthrough

2019-07-04 Thread Maxim Levitsky
Regular kernel block devices (/dev/sda*, /dev/nvme*, etc) don't have
max segment size/max segment count hardware requirements exposed
to the userspace, but rather the kernel block layer
takes care to split the incoming requests that
violate these requirements.

Allowing the kernel to do the splitting allows qemu to avoid
various overheads that arise otherwise from this.

This is especially visible in the nbd server,
exposing a mostly empty qcow2 image as a raw file over the net.
In this case most of the reads by the remote user
won't even hit the underlying kernel block device,
and therefore most of the overhead will be in the
nbd traffic, which increases significantly with a lower max transfer size.

In addition to that, even for local block device
access the performance improves a bit due to less
traffic between qemu and the kernel when large
transfer sizes are used (e.g. for image conversion).

More info can be found at:
https://bugzilla.redhat.com/show_bug.cgi?id=1647104

Signed-off-by: Maxim Levitsky 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
---
 block/file-posix.c | 54 --
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index ab05b51a66..4479cc7ab4 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1038,15 +1038,13 @@ static void raw_reopen_abort(BDRVReopenState *state)
 s->reopen_state = NULL;
 }
 
-static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
+static int sg_get_max_transfer_length(int fd)
 {
 #ifdef BLKSECTGET
 int max_bytes = 0;
-short max_sectors = 0;
-if (bs->sg && ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
+
+if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
 return max_bytes;
-} else if (!bs->sg && ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
-return max_sectors << BDRV_SECTOR_BITS;
 } else {
 return -errno;
 }
@@ -1055,25 +1053,31 @@ static int 
hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
 #endif
 }
 
-static int hdev_get_max_segments(const struct stat *st)
+static int sg_get_max_segments(int fd)
 {
 #ifdef CONFIG_LINUX
 char buf[32];
 const char *end;
-char *sysfspath;
+char *sysfspath = NULL;
 int ret;
-int fd = -1;
+int sysfd = -1;
 long max_segments;
+struct stat st;
+
+if (fstat(fd, &st)) {
+ret = -errno;
+goto out;
+}
 
 sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-major(st->st_rdev), minor(st->st_rdev));
-fd = open(sysfspath, O_RDONLY);
-if (fd == -1) {
+major(st.st_rdev), minor(st.st_rdev));
+sysfd = open(sysfspath, O_RDONLY);
+if (sysfd == -1) {
 ret = -errno;
 goto out;
 }
 do {
-ret = read(fd, buf, sizeof(buf) - 1);
+ret = read(sysfd, buf, sizeof(buf) - 1);
 } while (ret == -1 && errno == EINTR);
 if (ret < 0) {
 ret = -errno;
@@ -1090,8 +1094,8 @@ static int hdev_get_max_segments(const struct stat *st)
 }
 
 out:
-if (fd != -1) {
-close(fd);
+if (sysfd != -1) {
+close(sysfd);
 }
 g_free(sysfspath);
 return ret;
@@ -1103,19 +1107,17 @@ out:
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
-struct stat st;
 
-if (!fstat(s->fd, &st)) {
-if (S_ISBLK(st.st_mode) || S_ISCHR(st.st_mode)) {
-int ret = hdev_get_max_transfer_length(bs, s->fd);
-if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
-bs->bl.max_transfer = pow2floor(ret);
-}
-ret = hdev_get_max_segments(&st);
-if (ret > 0) {
-bs->bl.max_transfer = MIN(bs->bl.max_transfer,
-  ret * getpagesize());
-}
+if (bs->sg) {
+int ret = sg_get_max_transfer_length(s->fd);
+
+if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
+bs->bl.max_transfer = pow2floor(ret);
+}
+
+ret = sg_get_max_segments(s->fd);
+if (ret > 0) {
+bs->bl.max_transfer = MIN(bs->bl.max_transfer, ret * 
getpagesize());
 }
 }
 
-- 
2.17.2




Re: [Qemu-block] [PATCH v3 1/6] block/nvme: don't touch the completion entries

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 13:03 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Completion entries are meant to be only read by the host and written by the 
> > device.
> > The driver is supposed to scan the completions from the last point where it 
> > left,
> > and until it sees a completion with non flipped phase bit.
> 
> (Disclaimer: This is the first time I read the nvme driver, or really
> something in the nvme spec.)
> 
> Well, no, completion entries are also meant to be initialized by the
> host.  To me it looks like this is the place where that happens:
> Everything that has been processed by the device is immediately being
> re-initialized.
> 
> Maybe we shouldn’t do that here but in nvme_submit_command().  But
> currently we don’t, and I don’t see any other place where we currently
> initialize the CQ entries.

Hi!
I couldn't find any place in the spec that says that completion entries should 
be initialized.
It is probably wise to initialize that area to 0 on driver initialization, but 
nothing beyond that.
In particular that is what the kernel nvme driver does. 
Other than allocating zeroed memory (and even that I am not sure it does), 
it doesn't write to the completion entries.
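
For reference, this is roughly how the phase-bit scan is meant to work; a
simplified sketch with made-up structures (not the actual QEMU code, and with
the endianness conversion omitted):

    #include <stdint.h>

    struct cqe {
        uint16_t cid;
        uint16_t status;            /* bit 0 is the phase tag, written by the device */
    };

    struct cq {
        volatile struct cqe *queue; /* zeroed once, before the controller is enabled */
        unsigned head;
        unsigned size;
        unsigned phase;             /* starts at 1, because the queue starts zeroed */
    };

    /* Consume entries until one still carries the stale phase value.
     * The host only reads the entries; it never writes them back. */
    static int cq_poll(struct cq *q, void (*complete)(uint16_t cid))
    {
        int progress = 0;

        while ((q->queue[q->head].status & 0x1) == q->phase) {
            complete(q->queue[q->head].cid);
            if (++q->head == q->size) {
                q->head = 0;
                q->phase ^= 1;      /* the device inverts the phase on every wrap */
            }
            progress++;
        }
        return progress;            /* caller rings the CQ head doorbell if nonzero */
    }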

Thanks for the very very good review btw. I will go over all patches now and 
fix things.

Best regards,
Maxim Levitsky

> 
> Max
> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 5 +
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 73ed5fa75f..6d4e7f3d83 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -315,7 +315,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
> > NVMeQueuePair *q)
> >  while (q->inflight) {
> >  int16_t cid;
> >  c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
> > -if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> > +if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> >  break;
> >  }
> >  q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
> > @@ -339,10 +339,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
> > NVMeQueuePair *q)
> >  qemu_mutex_unlock(&q->lock);
> >  req.cb(req.opaque, nvme_translate_error(c));
> >  qemu_mutex_lock(&q->lock);
> > -c->cid = cpu_to_le16(0);
> >  q->inflight--;
> > -/* Flip Phase Tag bit. */
> > -c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
> >  progress = true;
> >  }
> >  if (progress) {
> > 
> 
> 





Re: [Qemu-block] [PATCH v3 2/6] block/nvme: fix doorbell stride

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 13:10 +0200, Max Reitz wrote:
> On 05.07.19 13:09, Max Reitz wrote:
> > On 03.07.19 17:59, Maxim Levitsky wrote:
> > > Fix the math involving non standard doorbell stride
> > > 
> > > Signed-off-by: Maxim Levitsky 
> > > ---
> > >  block/nvme.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/block/nvme.c b/block/nvme.c
> > > index 6d4e7f3d83..52798081b2 100644
> > > --- a/block/nvme.c
> > > +++ b/block/nvme.c
> > > @@ -217,7 +217,7 @@ static NVMeQueuePair 
> > > *nvme_create_queue_pair(BlockDriverState *bs,
> > >  error_propagate(errp, local_err);
> > >  goto fail;
> > >  }
> > > -q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 
> > > 1];
> > > +q->cq.doorbell = &s->regs->doorbells[(idx * 2 + 1) * 
> > > s->doorbell_scale];
> > >  
> > >  return q;
> > >  fail:
> > 
> > Hm.  How has this ever worked?
> 
> (Ah, because CAP.DSTRD has probably been 0 in most devices.)
> 
Exactly, and I used a cache line stride in my nvme-mdev, which broke this, and I
spent an evening figuring out
what was going on. I was sure that there was some memory ordering bug or
something even weirder before (as usual)
finding out that it was a very simple bug.
I tested nvme-mdev pretty much with everything I could get my hands on, 
including this driver.
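
For the record, the offset math that the fix restores, as an illustrative
sketch of what the spec mandates (not the driver code; DSTRD comes from the
CAP register and the stride is 4 << DSTRD bytes):

    #include <stdint.h>
    #include <stdio.h>

    /* NVMe spec: SQ y tail doorbell at 0x1000 + (2y) * (4 << CAP.DSTRD),
     * CQ y head doorbell at 0x1000 + (2y + 1) * (4 << CAP.DSTRD). */
    static uint64_t sq_tail_db(unsigned qid, unsigned dstrd)
    {
        return 0x1000 + (2 * qid) * (4ULL << dstrd);
    }

    static uint64_t cq_head_db(unsigned qid, unsigned dstrd)
    {
        return 0x1000 + (2 * qid + 1) * (4ULL << dstrd);
    }

    int main(void)
    {
        /* With DSTRD == 0 (most consumer devices) the old and the fixed
         * index calculations happen to agree, which is why the bug went
         * unnoticed until a device with a larger stride was used. */
        for (unsigned dstrd = 0; dstrd <= 2; dstrd++) {
            printf("dstrd=%u queue 1: sq tail 0x%llx cq head 0x%llx\n", dstrd,
                   (unsigned long long)sq_tail_db(1, dstrd),
                   (unsigned long long)cq_head_db(1, dstrd));
        }
        return 0;
    }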


Best regards,
Maxim Levitsky





Re: [Qemu-block] [PATCH v3 3/6] block/nvme: support larger that 512 bytes sector devices

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 13:58 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Currently the driver hardcodes the sector size to 512,
> > and doesn't check the underlying device. Fix that.
> > 
> > Also fail if underlying nvme device is formatted with metadata
> > as this needs special support.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 45 -
> >  1 file changed, 40 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 52798081b2..1f0d09349f 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> 
> [...]
> 
> > @@ -463,7 +467,22 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >  }
> >  
> >  s->nsze = le64_to_cpu(idns->nsze);
> > +lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
> > +
> > +if (lbaf->ms) {
> > +error_setg(errp, "Namespaces with metadata are not yet supported");
> > +goto out;
> > +}
> > +
> > +hwsect_size = 1 << lbaf->ds;
> > +
> > +if (hwsect_size < BDRV_SECTOR_BITS || hwsect_size > s->page_size) {
> 
> s/BDRV_SECTOR_BITS/BDRV_SECTOR_SIZE/
Oops.

> 
> > +error_setg(errp, "Namespace has unsupported block size (%d)",
> > +hwsect_size);
> > +goto out;
> > +}
> >  
> > +s->blkshift = lbaf->ds;
> >  out:
> >  qemu_vfio_dma_unmap(s->vfio, resp);
> >  qemu_vfree(resp);
> > @@ -782,8 +801,22 @@ fail:
> >  static int64_t nvme_getlength(BlockDriverState *bs)
> >  {
> >  BDRVNVMeState *s = bs->opaque;
> > +return s->nsze << s->blkshift;
> > +}
> >  
> > -return s->nsze << BDRV_SECTOR_BITS;
> > +static int64_t nvme_get_blocksize(BlockDriverState *bs)
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +assert(s->blkshift >= 9);
> 
> I think BDRV_SECTOR_BITS is more correct here (this is about what the
> general block layer code expects).  Also, there’s no pain in doing so,
> as you did check against BDRV_SECTOR_SIZE in nvme_identify().
> Max
Of course, thanks!!

> 
> > +return 1 << s->blkshift;
> > +}
> > +
> > +static int nvme_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
> > +{
> > +int64_t blocksize = nvme_get_blocksize(bs);
> > +bsz->phys = blocksize;
> > +bsz->log = blocksize;
> > +return 0;
> >  }
> >  
> >  /* Called with s->dma_map_lock */
> 
> 

Thanks for the review,
Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v3 4/6] block/nvme: add support for image creation

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 14:09 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Tesed on a nvme device like that:
> > 
> > # create preallocated qcow2 image
> > $ qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o 
> > preallocation=metadata
> > Formatting 'nvme://:06:00.0/1', fmt=qcow2 size=10737418240 
> > cluster_size=65536 preallocation=metadata lazy_refcounts=off 
> > refcount_bits=16
> > 
> > # create an empty qcow2 image
> > $ qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o preallocation=off
> > Formatting 'nvme://:06:00.0/1', fmt=qcow2 size=10737418240 
> > cluster_size=65536 preallocation=off lazy_refcounts=off refcount_bits=16
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 108 +++
> >  1 file changed, 108 insertions(+)
> 
> Hm.  I’m not quite sure I like this, because this is not image creation.

I fully agree with you, and the whole thing did feel kind of wrong.
I kind of think that bdrv_co_create_opts is kind of outdated for the purpose, 
especially
with the nvme driver.
I think that it would be better if the bdrv_file_open just supported something 
like 'O_CREAT'.

I did this mostly the same way as file-posix does it on block
devices,
including that 'hack' of zeroing the first sector, for which I really don't
know if it is the right solution.



> 
> What we need is a general interface for formatting existing files.  I
> mean, we have that in QMP (blockdev-create), but the problem is that
> this doesn’t really translate to qemu-img create.
> 
> I wonder whether it’s best to hack something up that makes
> bdrv_create_file() a no-op, or whether we should expose blockdev-create
> over qemu-img.  I’ll see how difficult the latter is, it sounds fun
> (famous last words).
For existing images, the 'bdrv_create_file' is already kind of a nop, other
than zeroing the first sector,
which kind of makes sense, but is probably best done at a higher level than in each
driver.

So these are my thoughts about this, thanks for the review!

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v3 5/6] block/nvme: add support for write zeros

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 15:33 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 69 +++-
> >  block/trace-events   |  1 +
> >  include/block/nvme.h | 19 +++-
> >  3 files changed, 87 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 152d27b07f..02e0846643 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> 
> [...]
> 
> > @@ -469,6 +473,11 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >  s->nsze = le64_to_cpu(idns->nsze);
> >  lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
> >  
> > +if (NVME_ID_NS_DLFEAT_WRITE_ZEROS(idns->dlfeat) &&
> > +NVME_ID_NS_DLFEAT_READ_BEHAVIOR(idns->dlfeat) ==
> > +NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROS)
> > +bs->supported_write_flags |= BDRV_REQ_MAY_UNMAP;
> > +
> 
> This violates the coding style, there should be curly brackets here.
100% agree + I need to see if we can update the checkpatch.pl to catch this.


> 
> >  if (lbaf->ms) {
> >  error_setg(errp, "Namespaces with metadata are not yet supported");
> >  goto out;
> > @@ -763,6 +772,8 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
> > *options, int flags,
> >  int ret;
> >  BDRVNVMeState *s = bs->opaque;
> >  
> > +bs->supported_write_flags = BDRV_REQ_FUA;
> > +
> >  opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> >  qemu_opts_absorb_qdict(opts, options, &error_abort);
> >  device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
> > @@ -791,7 +802,6 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
> > *options, int flags,
> >  goto fail;
> >  }
> >  }
> > -bs->supported_write_flags = BDRV_REQ_FUA;
> 
> Any reason for this movement?

This is because nvme_identify checks whether the underlying namespace
supports 'discarded data reads back as zeros', in which case it sets
BDRV_REQ_MAY_UNMAP in bs->supported_write_flags, which later allows me to set
the 'deallocate' bit in the write zeros command, hinting the controller
to discard the area.

This was moved to avoid overwriting the value. I could have instead just ORed
the value,
but I think this way is a bit cleaner.
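
For reference, the DLFEAT decode boils down to something like this (a sketch
of the bit layout of Identify Namespace byte 33 per the NVMe 1.3 spec, not the
exact macros from include/block/nvme.h):

    #include <stdbool.h>
    #include <stdint.h>

    /* bits 2:0 - read behaviour of deallocated blocks (1 == reads return zeroes)
     * bit  3   - the Deallocate bit may be set in Write Zeroes for this namespace */
    #define DLFEAT_READ_BEHAVIOR(dlfeat)     ((dlfeat) & 0x7)
    #define DLFEAT_READ_BEHAVIOR_ZEROS       1
    #define DLFEAT_WRZ_DEALLOCATE(dlfeat)    (((dlfeat) >> 3) & 0x1)

    /* Only when both conditions hold is it safe to advertise that a
     * BDRV_REQ_MAY_UNMAP write zeroes request will read back as zeroes. */
    static bool may_unmap_on_write_zeroes(uint8_t dlfeat)
    {
        return DLFEAT_WRZ_DEALLOCATE(dlfeat) &&
               DLFEAT_READ_BEHAVIOR(dlfeat) == DLFEAT_READ_BEHAVIOR_ZEROS;
    }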



> 
> >  return 0;
> >  fail:
> >  nvme_close(bs);
> > @@ -1085,6 +1095,60 @@ static coroutine_fn int 
> > nvme_co_flush(BlockDriverState *bs)
> >  }
> >  
> >  
> > +static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
> > +  int64_t offset,
> > +  int bytes,
> > +  BdrvRequestFlags flags)
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +NVMeQueuePair *ioq = s->queues[1];
> > +NVMeRequest *req;
> > +
> > +if (!s->supports_write_zeros) {
> > +return -ENOTSUP;
> > +}
> > +
> > +uint32_t cdw12 = ((bytes >> s->blkshift) - 1) & 0xFFFF;
> 
> Another coding style violation: Variable declarations and other code may
> not be mixed.
Another bug in checkpatch.pl :-)

> 
> > +
> > +NvmeCmd cmd = {
> > +.opcode = NVME_CMD_WRITE_ZEROS,
> > +.nsid = cpu_to_le32(s->nsid),
> > +.cdw10 = cpu_to_le32((offset >> s->blkshift) & 0xFFFFFFFF),
> > +.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0xFFFFFFFF),
> > +};
> > +
> > +NVMeCoData data = {
> > +.ctx = bdrv_get_aio_context(bs),
> > +.ret = -EINPROGRESS,
> > +};
> 
> [...]
> 
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 3ec8efcc43..65eb65c740 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -653,12 +653,29 @@ typedef struct NvmeIdNs {
> >  uint8_t mc;
> >  uint8_t dpc;
> >  uint8_t dps;
> > -uint8_t res30[98];
> > +
> > +uint8_t nmic;
> > +uint8_t rescap;
> > +uint8_t fpi;
> > +uint8_t dlfeat;
> > +
> > +uint8_t res30[94];
> >  NvmeLBAFlbaf[16];
> >  uint8_t res192[192];
> >  uint8_t vs[3712];
>

Re: [Qemu-block] [PATCH v4] block/nvme: add support for discard

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 15:50 +0200, Max Reitz wrote:
> On 03.07.19 18:07, Maxim Levitsky wrote:
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c   | 81 ++
> >  block/trace-events |  2 ++
> >  2 files changed, 83 insertions(+)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 02e0846643..96a715dcc1 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> 
> [...]
> 
> > @@ -460,6 +461,7 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >s->page_size / sizeof(uint64_t) * s->page_size);
> >  
> >  s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
> > +s->supports_discard = (idctrl->oncs & NVME_ONCS_DSM) != 0;
> 
> Shouldn’t this be le16_to_cpu(idctrl->oncs)?  Same in the previous
> patch, now that I think about it.

This reminds me of how I basically scrubbed through nvme-mdev looking for
endianness bugs,
manually searching for every reference to a hardware controlled structure.
Thank you very much!!


> 
> >  
> >  memset(resp, 0, 4096);
> >  
> > @@ -1149,6 +1151,84 @@ static coroutine_fn int 
> > nvme_co_pwrite_zeroes(BlockDriverState *bs,
> >  }
> >  
> >  
> > +static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
> > + int64_t offset,
> > + int bytes)
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +NVMeQueuePair *ioq = s->queues[1];
> > +NVMeRequest *req;
> > +NvmeDsmRange *buf;
> > +QEMUIOVector local_qiov;
> > +int r;
> > +
> > +NvmeCmd cmd = {
> > +.opcode = NVME_CMD_DSM,
> > +.nsid = cpu_to_le32(s->nsid),
> > +.cdw10 = 0, /*number of ranges - 0 based*/
> 
> I’d make this cpu_to_le32(0).  Sure, there is no effect for 0, but in
> theory this is a variable value, so...
Let it be.

> 
> > +.cdw11 = cpu_to_le32(1 << 2), /*deallocate bit*/
> > +};
> > +
> > +NVMeCoData data = {
> > +.ctx = bdrv_get_aio_context(bs),
> > +.ret = -EINPROGRESS,
> > +};
> > +
> > +if (!s->supports_discard) {
> > +return -ENOTSUP;
> > +}
> > +
> > +assert(s->nr_queues > 1);
> > +
> > +buf = qemu_try_blockalign0(bs, 4096);
> 
> I’m not sure whether this needs to be 4096 or whether 16 would suffice,
>  but I suppose this gets us the least trouble.
Exactly. Now that I think about it, even better would be to use 's->page_size',
the device page size.
It is at least 4K (spec minimum).

Speaking of which, there is a theoretical bug there - the device in theory can
indicate that its minimal page size is larger than 4K.
The kernel currently rejects such devices, but here the driver just forces a 4K
page size in the CC register.
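
A sketch of the check that would close that hole (CAP.MPSMIN is bits 51:48 of
the controller capabilities register and the minimum supported page size is
2^(12 + MPSMIN); illustrative only, not the actual driver code):

    #include <stdint.h>
    #include <stddef.h>

    /* Smallest memory page size the controller accepts.  A driver that
     * always programs 4K into CC.MPS should refuse to attach when this
     * is larger than 4096, i.e. when MPSMIN != 0. */
    static size_t nvme_min_page_size(uint64_t cap)
    {
        unsigned mpsmin = (cap >> 48) & 0xf;

        return (size_t)4096 << mpsmin;
    }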

> 
> > +if (!buf) {
> > +return -ENOMEM;
> 
> Indentation is off.
True!

> 
> > +}
> > +
> > +buf->nlb = cpu_to_le32(bytes >> s->blkshift);
> > +buf->slba = cpu_to_le64(offset >> s->blkshift);
> > +buf->cattr = 0;
> > +
> > +qemu_iovec_init(&local_qiov, 1);
> > +qemu_iovec_add(&local_qiov, buf, 4096);
> > +
> > +req = nvme_get_free_req(ioq);
> > +assert(req);
> > +
> > +qemu_co_mutex_lock(&s->dma_map_lock);
> > +r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
> > +qemu_co_mutex_unlock(&s->dma_map_lock);
> > +
> > +if (r) {
> > +req->busy = false;
> > +return r;
> 
> Leaking buf and local_qiov here.
True, fixed.

> 
> > +}
> > +
> > +trace_nvme_dsm(s, offset, bytes);
> > +
> > +nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> > +
> > +data.co = qemu_coroutine_self();
> > +while (data.ret == -EINPROGRESS) {
> > +qemu_coroutine_yield();
> > +}
> > +
> > +    qemu_co_mutex_lock(&s->dma_map_lock);
> > +r = nvme_cmd_unmap_qiov(bs, &local_qiov);
> > +qemu_co_mutex_unlock(&s->dma_map_lock);
> > +if (r) {
> > +return r;
> 
> Leaking buf and local_qiov here, too.
True, fixed - next time will check error paths better.

> 
> Max
> 
> > +}
> > +
> > +trace_nvme_dsm_done(s, offset, bytes, data.ret);
> > +
> > +qemu_iovec_destroy(&local_qiov);
> > +qemu_vfree(buf);
> > +return data.ret;
> > +
> > +}
> > +
> > +
> >  static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
> > BlockReopenQueue *queue, Error **errp)
> >  {
> 
> 

Thanks for the review,
Best regards,
Maxim Levitsky





Re: [Qemu-block] [PATCH v3 1/6] block/nvme: don't touch the completion entries

2019-07-08 Thread Maxim Levitsky
On Mon, 2019-07-08 at 14:23 +0200, Max Reitz wrote:
> On 07.07.19 10:43, Maxim Levitsky wrote:
> > On Fri, 2019-07-05 at 13:03 +0200, Max Reitz wrote:
> > > On 03.07.19 17:59, Maxim Levitsky wrote:
> > > > Completion entries are meant to be only read by the host and written by 
> > > > the device.
> > > > The driver is supposed to scan the completions from the last point 
> > > > where it left,
> > > > and until it sees a completion with non flipped phase bit.
> > > 
> > > (Disclaimer: This is the first time I read the nvme driver, or really
> > > something in the nvme spec.)
> > > 
> > > Well, no, completion entries are also meant to be initialized by the
> > > host.  To me it looks like this is the place where that happens:
> > > Everything that has been processed by the device is immediately being
> > > re-initialized.
> > > 
> > > Maybe we shouldn’t do that here but in nvme_submit_command().  But
> > > currently we don’t, and I don’t see any other place where we currently
> > > initialize the CQ entries.
> > 
> > Hi!
> > I couldn't find any place in the spec that says that completion entries 
> > should be initialized.
> > It is probably wise to initialize that area to 0 on driver initialization, 
> > but nothing beyond that.
> 
> Ah, you’re right, I misread.  I didn’t pay as much attention to the
> “...prior to setting CC.EN to ‘1’” as I should have.  Yep, and that is
> done in nvme_init_queue().
> 
> OK, I cease my wrongful protest:
> 
> Reviewed-by: Max Reitz 
> 
> > 

Thank you very much!
BTW, the qemu driver does allocate zeroed memory (in nvme_init_queue,
"q->queue = qemu_try_blockalign0(bs, bytes);").

Thus I think this is all that is needed in that regard.

Note that this patch doesn't fix any real bug I know of,
but just makes the thing right with regard to the spec.
Also, racing with the hardware can in theory cause various memory ordering bugs,
although in this case the writes are done to
entries which the controller probably won't touch, but still.

TL;DR - no need for code which does nothing and might cause issues.

Do you want me to resend the series or shall I wait till we decide
what to do with the image creation support? I finished fixing all the
review comments long ago, I just didn't want to resend the series.
Or shall I drop that patch and resend?

From the urgency standpoint the only patch that really should
be merged ASAP is the one that adds support for block sizes,
because without it, the whole thing crashes and burns on 4K
nvme drives.

Best regards,
Maxim Levitsky







Re: [Qemu-block] [PATCH v3 1/6] block/nvme: don't touch the completion entries

2019-07-08 Thread Maxim Levitsky
On Mon, 2019-07-08 at 15:00 +0200, Max Reitz wrote:
> On 08.07.19 14:51, Maxim Levitsky wrote:
> > On Mon, 2019-07-08 at 14:23 +0200, Max Reitz wrote:
> > > On 07.07.19 10:43, Maxim Levitsky wrote:
> > > > On Fri, 2019-07-05 at 13:03 +0200, Max Reitz wrote:
> > > > > On 03.07.19 17:59, Maxim Levitsky wrote:
> > > > > > Completion entries are meant to be only read by the host and 
> > > > > > written by the device.
> > > > > > The driver is supposed to scan the completions from the last point 
> > > > > > where it left,
> > > > > > and until it sees a completion with non flipped phase bit.
> > > > > 
> > > > > (Disclaimer: This is the first time I read the nvme driver, or really
> > > > > something in the nvme spec.)
> > > > > 
> > > > > Well, no, completion entries are also meant to be initialized by the
> > > > > host.  To me it looks like this is the place where that happens:
> > > > > Everything that has been processed by the device is immediately being
> > > > > re-initialized.
> > > > > 
> > > > > Maybe we shouldn’t do that here but in nvme_submit_command().  But
> > > > > currently we don’t, and I don’t see any other place where we currently
> > > > > initialize the CQ entries.
> > > > 
> > > > Hi!
> > > > I couldn't find any place in the spec that says that completion entries 
> > > > should be initialized.
> > > > It is probably wise to initialize that area to 0 on driver 
> > > > initialization, but nothing beyond that.
> > > 
> > > Ah, you’re right, I misread.  I didn’t pay as much attention to the
> > > “...prior to setting CC.EN to ‘1’” as I should have.  Yep, and that is
> > > done in nvme_init_queue().
> > > 
> > > OK, I cease my wrongful protest:
> > > 
> > > Reviewed-by: Max Reitz 
> > > 
> > > > 
> > 
> > Thank you very much!
> > BTW, the qemu driver does allocate zeroed memory (in nvme_init_queue, 
> > "q->queue = qemu_try_blockalign0(bs, bytes);"
> 
> Yes, that’s what I was referring to above. :-)
> 
> > Thus I think this is all that is needed in that regard.
> > 
> > Note that this patch doesn't fix any real bug I know of, 
> > but just makes the thing right in regard to the spec.
> > Also racing with hardware in theory can have various memory ordering bugs,
> > although in this case the writes are done in 
> > entries which controller probably won't touch, but still.
> > 
> > TL;DR - no need in code which does nothing and might cause issues.
> > 
> > Do you want me to resend the series or shall I wait till we decide
> > what to do with the image creation support? I done fixing all the
> > review comments long ago, just didn't want to resend the series.
> > Or shall I drop that patch and resend?
> 
> I think I won’t apply the image creation patch now, so it’s probably
> better to just drop it for now.
> 
> > From the urgency standpoint the only patch that really should
> > be merged ASAP is the one that adds support for block sizes,
> > because without it, the whole thing crashes and burns on 4K
> > nvme drives.
> 
> By now we’re in softfreeze anyway, so unless write-zeroes/discard
> support is important now, it’s difficult to justify taking them for 4.1.
>  So for me it would be best if you put patches 1 through 3 into a
> for-4.1 series and move the rest to 4.2.  (I’d probably also split the
> creation patch off, because I don’t think I’m going to apply it before
> having experimented a bit with blockdev-create for qemu-img.)
> 
> If you think write-zeroes/discard support is important for 4.1, feel
> free to include them in the for-4.1 series along with an explanation as
> to why it’s important.

I don't think these are important either, so I split them as you say.

Best regards,
Maxim Levitsky






Re: [Qemu-block] question:about introduce a new feature named “I/O hang”

2019-07-08 Thread Maxim Levitsky
On Fri, 2019-07-05 at 09:50 +0200, Kevin Wolf wrote:
> Am 04.07.2019 um 17:16 hat wangjie (P) geschrieben:
> > Hi, everybody:
> > 
> > I developed a feature named "I/O hang",my intention is to solve the problem
> > like that:
> > If the backend storage media of VM disk is far-end storage like IPSAN or
> > FCSAN, storage net link will always disconnection and
> > make I/O requests return EIO to Guest, and the status of filesystem in Guest
> > will be read-only, even the link recovered
> > after a while, the status of filesystem in Guest will not recover.
> 
> The standard solution for this is configuring the guest device with
> werror=stop,rerror=stop so that the error is not delivered to the guest,
> but the VM is stopped. When you run 'cont', the request is then retried.
> 
> > So I developed a feature named "I/O hang" to solve this problem, the
> > solution like that:
> > when some I/O requests return EIO in backend, "I/O hang" will catch the
> > requests in qemu block layer and
> > insert the requests to a rehandle queue but not return EIO to Guest, the I/O
> > requests in Guest will hang but it does not lead
> > Guest filesystem to be read-only, then "I/O hang" will loop to rehandle the
> > requests for a period time(ex. 5 second) until the requests
> > not return EIO(when backend storage link recovered).
> 
> Letting requests hang without stopping the VM risks the guest running
> into timeouts and deciding that its disk is broken.
I came to say exactly this.
While developing the nvme-mdev I also had this problem: due to assumptions
built into the block layer,
you can't just let the guest wait forever for a request.

Note that Linux's nvme driver does know how to retry failed requests, including
those that timed out, if that helps in any way.

Best regards,
Maxim Levitsky


> 
> As you say your "hang" and retry logic sits in the block layer, what do
> you do when you encounter a bdrv_drain() request?
> 
> > In addition to the function as above, "I/O hang" also can sent event to
> > libvirt after backend storage status changed.
> > 
> > configure methods:
> > 1. "I/O hang" ability can be configured for each disk as a disk attribute.
> > 2. "I/O hang" timeout value also can be configured for each disk, when
> > storage link not recover in timeout value,
> >"I/O hang" will disable rehandle I/O requests and return EIO to Guest.
> > 
> > Are you interested in the feature?  I intend to push this feature to qemu
> > org, what's your opinion?
> 
> Were you aware of werror/rerror? Before we add another mechanism, we
> need to be sure how the features compare, that the new mechanism
> provides a significant advantage and that we keep code duplication as
> low as possible.
> 
> Kevin
> 





Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/1] raw-posix.c - use max transfer length / max segement count only for SCSI passthrough

2019-07-10 Thread Maxim Levitsky
On Thu, 2019-07-04 at 15:43 +0300, Maxim Levitsky wrote:
> Regular kernel block devices (/dev/sda*, /dev/nvme*, etc) don't have
> max segment size/max segment count hardware requirements exposed
> to the userspace, but rather the kernel block layer
> takes care to split the incoming requests that
> violate these requirements.
> 
> Allowing the kernel to do the splitting allows qemu to avoid
> various overheads that arise otherwise from this.
> 
> This is especially visible in nbd server,
> exposing as a raw file, a mostly empty qcow2 image over the net.
> In this case most of the reads by the remote user
> won't even hit the underlying kernel block device,
> and therefore most of the  overhead will be in the
> nbd traffic which increases significantly with lower max transfer size.
> 
> In addition to that even for local block device
> access the peformance improves a bit due to less
> traffic between qemu and the kernel when large
> transfer sizes are used (e.g for image conversion)
> 
> More info can be found at:
> https://bugzilla.redhat.com/show_bug.cgi?id=1647104
> 
> Signed-off-by: Maxim Levitsky 
> Reviewed-by: Stefan Hajnoczi 
> Reviewed-by: Eric Blake 
> ---
>  block/file-posix.c | 54 --
>  1 file changed, 28 insertions(+), 26 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index ab05b51a66..4479cc7ab4 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1038,15 +1038,13 @@ static void raw_reopen_abort(BDRVReopenState *state)
>  s->reopen_state = NULL;
>  }
>  
> -static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
> +static int sg_get_max_transfer_length(int fd)
>  {
>  #ifdef BLKSECTGET
>  int max_bytes = 0;
> -short max_sectors = 0;
> -if (bs->sg && ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
> +
> +if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
>  return max_bytes;
> -} else if (!bs->sg && ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
> -return max_sectors << BDRV_SECTOR_BITS;
>  } else {
>  return -errno;
>  }
> @@ -1055,25 +1053,31 @@ static int 
> hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
>  #endif
>  }
>  
> -static int hdev_get_max_segments(const struct stat *st)
> +static int sg_get_max_segments(int fd)
>  {
>  #ifdef CONFIG_LINUX
>  char buf[32];
>  const char *end;
> -char *sysfspath;
> +char *sysfspath = NULL;
>  int ret;
> -int fd = -1;
> +int sysfd = -1;
>  long max_segments;
> +struct stat st;
> +
> +if (fstat(fd, &st)) {
> +ret = -errno;
> +goto out;
> +}
>  
>  sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
> -major(st->st_rdev), minor(st->st_rdev));
> -fd = open(sysfspath, O_RDONLY);
> -if (fd == -1) {
> +major(st.st_rdev), minor(st.st_rdev));
> +sysfd = open(sysfspath, O_RDONLY);
> +if (sysfd == -1) {
>  ret = -errno;
>  goto out;
>  }
>  do {
> -ret = read(fd, buf, sizeof(buf) - 1);
> +ret = read(sysfd, buf, sizeof(buf) - 1);
>  } while (ret == -1 && errno == EINTR);
>  if (ret < 0) {
>  ret = -errno;
> @@ -1090,8 +1094,8 @@ static int hdev_get_max_segments(const struct stat *st)
>  }
>  
>  out:
> -if (fd != -1) {
> -close(fd);
> +if (sysfd != -1) {
> +close(sysfd);
>  }
>  g_free(sysfspath);
>  return ret;
> @@ -1103,19 +1107,17 @@ out:
>  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>  BDRVRawState *s = bs->opaque;
> -struct stat st;
>  
> -if (!fstat(s->fd, &st)) {
> -if (S_ISBLK(st.st_mode) || S_ISCHR(st.st_mode)) {
> -int ret = hdev_get_max_transfer_length(bs, s->fd);
> -if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> -bs->bl.max_transfer = pow2floor(ret);
> -}
> -ret = hdev_get_max_segments(&st);
> -if (ret > 0) {
> -bs->bl.max_transfer = MIN(bs->bl.max_transfer,
> -  ret * getpagesize());
> -}
> +if (bs->sg) {
> +int ret = sg_get_max_transfer_length(s->fd);
> +
> +if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> +bs->bl.max_transfer = pow2floor(ret);
> +}
> +
> +ret = sg_get_max_segments(s->fd);
> +if (ret > 0) {
> +bs->bl.max_transfer = MIN(bs->bl.max_transfer, ret * 
> getpagesize());
>  }
>  }
>  


Ping.

Best regards,
Maxim Levitsky




[Qemu-block] [PATCH] LUKS: support preallocation in qemu-img

2019-07-10 Thread Maxim Levitsky
preallocation=off and preallocation=metadata
both allocate the luks header only, and preallocation=falloc/full
is passed to the underlying file, with the given image size.

Note that the actual preallocated size is a bit smaller due
to the luks header.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951

Signed-off-by: Maxim Levitsky 
---
 block/crypto.c | 28 ++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/block/crypto.c b/block/crypto.c
index 8237424ae6..74b789d278 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -251,6 +251,7 @@ static int block_crypto_open_generic(QCryptoBlockFormat 
format,
 static int block_crypto_co_create_generic(BlockDriverState *bs,
   int64_t size,
   QCryptoBlockCreateOptions *opts,
+  PreallocMode prealloc,
   Error **errp)
 {
 int ret;
@@ -266,6 +267,13 @@ static int block_crypto_co_create_generic(BlockDriverState 
*bs,
 goto cleanup;
 }
 
+if (prealloc != PREALLOC_MODE_OFF) {
+ret = blk_truncate(blk, size, prealloc, errp);
+if (ret < 0) {
+goto cleanup;
+}
+}
+
 data = (struct BlockCryptoCreateData) {
 .blk = blk,
 .size = size,
@@ -516,7 +524,7 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
*create_options, Error **errp)
 };
 
 ret = block_crypto_co_create_generic(bs, luks_opts->size, &create_opts,
- errp);
+ PREALLOC_MODE_OFF, errp);
 if (ret < 0) {
 goto fail;
 }
@@ -534,12 +542,28 @@ static int coroutine_fn 
block_crypto_co_create_opts_luks(const char *filename,
 QCryptoBlockCreateOptions *create_opts = NULL;
 BlockDriverState *bs = NULL;
 QDict *cryptoopts;
+PreallocMode prealloc;
+char *buf = NULL;
 int64_t size;
 int ret;
+Error *local_err = NULL;
 
 /* Parse options */
 size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
 
+buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
+prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
+   PREALLOC_MODE_OFF, &local_err);
+g_free(buf);
+if (local_err) {
+error_propagate(errp, local_err);
+return -EINVAL;
+}
+
+if (prealloc == PREALLOC_MODE_METADATA) {
+prealloc  = PREALLOC_MODE_OFF;
+}
+
 cryptoopts = qemu_opts_to_qdict_filtered(opts, NULL,
  &block_crypto_create_opts_luks,
  true);
@@ -565,7 +589,7 @@ static int coroutine_fn 
block_crypto_co_create_opts_luks(const char *filename,
 }
 
 /* Create format layer */
-ret = block_crypto_co_create_generic(bs, size, create_opts, errp);
+ret = block_crypto_co_create_generic(bs, size, create_opts, prealloc, 
errp);
 if (ret < 0) {
 goto fail;
 }
-- 
2.17.2




Re: [Qemu-block] [Qemu-devel] [PATCH] nvme: Set number of queues later in nvme_init()

2019-07-10 Thread Maxim Levitsky
On Wed, 2019-07-10 at 17:34 +0200, Philippe Mathieu-Daudé wrote:
> On 7/10/19 4:57 PM, Michal Privoznik wrote:
> > When creating the admin queue in nvme_init() the variable that
> > holds the number of queues created is modified before actual
> > queue creation. This is a problem because if creating the queue
> > fails then the variable is left in inconsistent state. This was
> > actually observed when I tried to hotplug a nvme disk. The
> > control got to nvme_file_open() which called nvme_init() which
> > failed and thus nvme_close() was called which in turn called
> > nvme_free_queue_pair() with queue being NULL. This lead to an
> > instant crash:
> > 
> >   #0  0x55d9507ec211 in nvme_free_queue_pair (bs=0x55d952ddb880, q=0x0) 
> > at block/nvme.c:164
> >   #1  0x55d9507ee180 in nvme_close (bs=0x55d952ddb880) at 
> > block/nvme.c:729
> >   #2  0x55d9507ee3d5 in nvme_file_open (bs=0x55d952ddb880, 
> > options=0x55d952bb1410, flags=147456, errp=0x7ffd8e19e200) at 
> > block/nvme.c:781
> >   #3  0x55d9507629f3 in bdrv_open_driver (bs=0x55d952ddb880, 
> > drv=0x55d95109c1e0 , node_name=0x0, options=0x55d952bb1410, 
> > open_flags=147456, errp=0x7ffd8e19e310) at block.c:1291
> >   #4  0x55d9507633d6 in bdrv_open_common (bs=0x55d952ddb880, file=0x0, 
> > options=0x55d952bb1410, errp=0x7ffd8e19e310) at block.c:1551
> >   #5  0x55d950766881 in bdrv_open_inherit (filename=0x0, reference=0x0, 
> > options=0x55d952bb1410, flags=32768, parent=0x55d9538ce420, 
> > child_role=0x55d950eaade0 , errp=0x7ffd8e19e510)
> > at block.c:3063
> >   #6  0x55d950765ae4 in bdrv_open_child_bs (filename=0x0, 
> > options=0x55d9541cdff0, bdref_key=0x55d950af33aa "file", 
> > parent=0x55d9538ce420, child_role=0x55d950eaade0 ,
> > allow_none=true, errp=0x7ffd8e19e510) at block.c:2712
> >   #7  0x55d950766633 in bdrv_open_inherit (filename=0x0, reference=0x0, 
> > options=0x55d9541cdff0, flags=0, parent=0x0, child_role=0x0, 
> > errp=0x7ffd8e19e908) at block.c:3011
> >   #8  0x55d950766dba in bdrv_open (filename=0x0, reference=0x0, 
> > options=0x55d953d00390, flags=0, errp=0x7ffd8e19e908) at block.c:3156
> >   #9  0x55d9507cb635 in blk_new_open (filename=0x0, reference=0x0, 
> > options=0x55d953d00390, flags=0, errp=0x7ffd8e19e908) at 
> > block/block-backend.c:389
> >   #10 0x55d950465ec5 in blockdev_init (file=0x0, 
> > bs_opts=0x55d953d00390, errp=0x7ffd8e19e908) at blockdev.c:602
> > 
> 
> Fixes: bdd6a90a9e5
> 
> > Signed-off-by: Michal Privoznik 
> 
> Reviewed-by: Philippe Mathieu-Daudé 
> Tested-by: Philippe Mathieu-Daudé 
> 
> > ---
> >  block/nvme.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 73ed5fa75f..9896b7f7c6 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -613,12 +613,12 @@ static int nvme_init(BlockDriverState *bs, const char 
> > *device, int namespace,
> >  
> >  /* Set up admin queue. */
> >  s->queues = g_new(NVMeQueuePair *, 1);
> > -s->nr_queues = 1;
> >  s->queues[0] = nvme_create_queue_pair(bs, 0, NVME_QUEUE_SIZE, errp);
> >  if (!s->queues[0]) {
> >  ret = -EINVAL;
> >  goto out;
> >  }
> > +s->nr_queues = 1;
> >  QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
> >  s->regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
> >  s->regs->asq = cpu_to_le64(s->queues[0]->sq.iova);
> > 
> 
> 


Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH] LUKS: support preallocation in qemu-img

2019-07-11 Thread Maxim Levitsky
On Wed, 2019-07-10 at 23:52 +0200, Max Reitz wrote:
> On 10.07.19 23:24, Max Reitz wrote:
> > On 10.07.19 19:03, Maxim Levitsky wrote:
> > > preallocation=off and preallocation=metadata
> > > both allocate luks header only, and preallocation=falloc/full
> > > is passed to underlying file, with the given image size.
> > > 
> > > Note that the actual preallocated size is a bit smaller due
> > > to luks header.
> > 
> > Couldn’t you just preallocate it after creating the crypto header so
> > qcrypto_block_get_payload_offset(crypto->block) + size is the actual
> > file size?

I kind of thought of the same thing after I sent the patch. I'll see now if I 
can make it work.
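
Roughly like this, I think (just a rough, untested sketch of your suggestion;
'crypto', 'blk', 'size' and 'prealloc' are meant to be the locals of
block_crypto_co_create_generic(), so the exact call site is my assumption):

/* after qcrypto_block_create() has written the LUKS header */
if (prealloc != PREALLOC_MODE_OFF) {
    ret = blk_truncate(blk, qcrypto_block_get_payload_offset(crypto) + size,
                       prealloc, errp);
    if (ret < 0) {
        goto cleanup;
    }
}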


> > 
> > > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951
> > > 
> > > Signed-off-by: Maxim Levitsky 
> > > ---
> > >  block/crypto.c | 28 ++--
> > >  1 file changed, 26 insertions(+), 2 deletions(-)
> > 
> > Hm.  I would expect a preallocated image to read 0.  But if you just
> > pass this through to the protocol layer, it won’t read 0.
> > 
> > (In fact, I don’t even quite see the point of having LUKS as an own
> > format still.  It was useful when qcow2 didn’t have LUKS support, but
> > now it does, so...  I suppose everyone using the LUKS format should
> > actually be using qcow2 with LUKS?)
> 
> Kevin just pointed out to me that our LUKS format is compatible to the
> actual layout cryptsetup uses.  OK, that is an important use case.
> 
> Hm.  Unfortunately, that doesn’t really necessitate preallocation.
> 
> Well, whatever.  If it’s simple enough, that shouldn’t stop us from
> implementing preallocation anyway.
Exactly. Since I already know the qemu-img area relatively well, and
this bug is on my backlog, I thought: why not do it.


> 
> 
> Now I found that qapi/block-core.json defines PreallocMode’s falloc and
> full values as follows:
> 
> > # @falloc: like @full preallocation but allocate disk space by
> > #  posix_fallocate() rather than writing zeros.
> > # @full: preallocate all data by writing zeros to device to ensure disk
> > #space is really available. @full preallocation also sets up
> > #metadata correctly.
> 
> So it isn’t just me who expects these to pre-initialize the image to 0.
>  Hm, although...  I suppose @falloc technically does not specify whether
> the data reads as zeroes.  I kind of find it to be implied, but, well...

I personally don't really think that the zeros are important, but rather the level 
of allocation.
posix_fallocate probably won't write the data blocks, but rather only the inode 
metadata / used-block bitmap / etc.

On the other hand, writing zeros (or anything else) will force the block layer 
to actually write to the underlying storage, which could trigger lower-layer 
allocation if the underlying storage is thin-provisioned.

In fact, IMHO it would be better to write random garbage instead of zeros (or 
have that as an even 'fuller' preallocation mode), since the underlying storage 
might 'compress' the zeros.

In this version I do have the bug that I mentioned, about not preallocating some 
data at the end of the image, and I will fix it, so that the whole image is 
zeroed as expected.
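
To illustrate what I mean by the difference, here is a tiny standalone sketch
(plain POSIX C, not QEMU code, file name and sizes made up): falloc-style
reservation via posix_fallocate() versus full-style writing of zeros, which is
what actually forces a thin-provisioned lower layer to allocate space.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* "falloc" style: reserve blocks, typically without writing the data itself */
static int prealloc_falloc(int fd, off_t len)
{
    return posix_fallocate(fd, 0, len);
}

/* "full" style: write zeros so the storage below really has to allocate */
static int prealloc_full(int fd, off_t len)
{
    static char buf[64 * 1024];  /* static, so it is zero-filled */
    off_t done = 0;

    while (done < len) {
        size_t chunk = (len - done) < (off_t)sizeof(buf)
                       ? (size_t)(len - done) : sizeof(buf);
        ssize_t n = pwrite(fd, buf, chunk, done);
        if (n < 0) {
            return -1;
        }
        done += n;
    }
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <falloc|full>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t len = 64 * 1024 * 1024; /* 64 MiB, just for the example */
    int ret = strcmp(argv[2], "full") == 0 ? prealloc_full(fd, len)
                                           : prealloc_falloc(fd, len);
    close(fd);
    return ret == 0 ? 0 : 1;
}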

Best regards,
Maxim Levitsky


> 
> Max
> 





[Qemu-block] [PATCH v2] LUKS: support preallocation in qemu-img

2019-07-11 Thread Maxim Levitsky
preallocation=off and preallocation=metadata
both allocate luks header only, and preallocation=falloc/full
is passed to underlying file.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951

Signed-off-by: Maxim Levitsky 
---
 block/crypto.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/block/crypto.c b/block/crypto.c
index 8237424ae6..cbc291301e 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -74,6 +74,7 @@ static ssize_t block_crypto_read_func(QCryptoBlock *block,
 struct BlockCryptoCreateData {
 BlockBackend *blk;
 uint64_t size;
+PreallocMode prealloc;
 };
 
 
@@ -112,7 +113,7 @@ static ssize_t block_crypto_init_func(QCryptoBlock *block,
  * available to the guest, so we must take account of that
  * which will be used by the crypto header
  */
-return blk_truncate(data->blk, data->size + headerlen, PREALLOC_MODE_OFF,
+return blk_truncate(data->blk, data->size + headerlen, data->prealloc,
 errp);
 }
 
@@ -251,6 +252,7 @@ static int block_crypto_open_generic(QCryptoBlockFormat 
format,
 static int block_crypto_co_create_generic(BlockDriverState *bs,
   int64_t size,
   QCryptoBlockCreateOptions *opts,
+  PreallocMode prealloc,
   Error **errp)
 {
 int ret;
@@ -269,6 +271,7 @@ static int block_crypto_co_create_generic(BlockDriverState 
*bs,
 data = (struct BlockCryptoCreateData) {
 .blk = blk,
 .size = size,
+.prealloc = prealloc,
 };
 
 crypto = qcrypto_block_create(opts, NULL,
@@ -516,7 +519,7 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
*create_options, Error **errp)
 };
 
 ret = block_crypto_co_create_generic(bs, luks_opts->size, &create_opts,
- errp);
+ PREALLOC_MODE_OFF, errp);
 if (ret < 0) {
 goto fail;
 }
@@ -534,12 +537,28 @@ static int coroutine_fn 
block_crypto_co_create_opts_luks(const char *filename,
 QCryptoBlockCreateOptions *create_opts = NULL;
 BlockDriverState *bs = NULL;
 QDict *cryptoopts;
+PreallocMode prealloc;
+char *buf = NULL;
 int64_t size;
 int ret;
+Error *local_err = NULL;
 
 /* Parse options */
 size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
 
+buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
+prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
+   PREALLOC_MODE_OFF, &local_err);
+g_free(buf);
+if (local_err) {
+error_propagate(errp, local_err);
+return -EINVAL;
+}
+
+if (prealloc == PREALLOC_MODE_METADATA) {
+prealloc  = PREALLOC_MODE_OFF;
+}
+
 cryptoopts = qemu_opts_to_qdict_filtered(opts, NULL,
  &block_crypto_create_opts_luks,
  true);
@@ -565,7 +584,7 @@ static int coroutine_fn 
block_crypto_co_create_opts_luks(const char *filename,
 }
 
 /* Create format layer */
-ret = block_crypto_co_create_generic(bs, size, create_opts, errp);
+ret = block_crypto_co_create_generic(bs, size, create_opts, prealloc, 
errp);
 if (ret < 0) {
 goto fail;
 }
-- 
2.17.2




Re: [Qemu-block] [PATCH] doc: Preallocation does not require writing zeroes

2019-07-11 Thread Maxim Levitsky
On Thu, 2019-07-11 at 15:29 +0200, Max Reitz wrote:
> When preallocating an encrypted qcow2 image, it just lets the protocol
> driver write data and then does not mark the clusters as zero.
> Therefore, reading this image will yield effectively random data.
> 
> As such, we have not fulfilled the promise of always writing zeroes when
> preallocating an image in a while.  It seems that nobody has really
> cared, so change the documentation to conform to qemu's actual behavior.
> 
> Signed-off-by: Max Reitz 

I did a little grep of the qemu source tree for mentions of preallocation, 
and it looks like these are the only mentions that need to be fixed.
So, thank you very much, and

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v2] LUKS: support preallocation in qemu-img

2019-07-11 Thread Maxim Levitsky
On Thu, 2019-07-11 at 15:43 +0200, Max Reitz wrote:
> On 11.07.19 11:11, Maxim Levitsky wrote:
> > preallocation=off and preallocation=metadata
> > both allocate luks header only, and preallocation=falloc/full
> > is passed to underlying file.
> > 
> > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/crypto.c | 25 ++---
> >  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> FWIW, do you see the implementation of block_crypto_co_truncate()?
> Like, how it just passes preallocation requests through to the
> underlying layer?  How I said it shouldn’t be done?
> 
> Yes, that was me, in commit 7ea37c30660.
> 
> So, er, yeah.
> 
> > diff --git a/block/crypto.c b/block/crypto.c
> > index 8237424ae6..cbc291301e 100644
> > --- a/block/crypto.c
> > +++ b/block/crypto.c
> 
> [...]
> 
> > @@ -534,12 +537,28 @@ static int coroutine_fn 
> > block_crypto_co_create_opts_luks(const char *filename,
> >  QCryptoBlockCreateOptions *create_opts = NULL;
> >  BlockDriverState *bs = NULL;
> >  QDict *cryptoopts;
> > +PreallocMode prealloc;
> > +char *buf = NULL;
> >  int64_t size;
> >  int ret;
> > +Error *local_err = NULL;
> >  
> >  /* Parse options */
> >  size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
> >  
> > +buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
> > +prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
> > +   PREALLOC_MODE_OFF, &local_err);
> 
> Please align such lines to the opening parenthesis.
True - I really need to invest some time in bringing the checkpatch.pl
in the qemu source tree up to date, or find a way to use the kernel one;
it is so useful to let it catch these things instead of wasting your time.

> 
> > +g_free(buf);
> > +if (local_err) {
> > +error_propagate(errp, local_err);
> > +return -EINVAL;
> > +}
> > +
> > +if (prealloc == PREALLOC_MODE_METADATA) {
> > +prealloc  = PREALLOC_MODE_OFF;
> 
> There is one space too many here.
Oops, same thing as above.

> 
> > +}
> > +
> 
> I think you also need to add a @preallocation parameter to
> BlockdevCreateOptionsLUKS and handle it in block_crypto_co_create_luks().

I was under the impression that with the new QMP-based blockdev-create API, the user
should pretty much do the preallocation themselves on the underlying file,
and then create the LUKS layer on it.

However, I do see that qcow2 has a preallocation mode there, so I guess I was 
wrong.
Will do it now.

Best regards,
Maxim Levitsky







[Qemu-block] [PATCH v3] LUKS: support preallocation

2019-07-11 Thread Maxim Levitsky
preallocation=off and preallocation=metadata
both allocate luks header only, and preallocation=falloc/full
is passed to underlying file.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951

Signed-off-by: Maxim Levitsky 

---

Note that QMP support was only compile-tested, since I am still learning
how to use it.

If there is some library/script/etc. that makes it more high-level,
I would be more than glad to hear about it. So far I have used qmp-shell.

Also, can I use QMP's blockdev-create outside of a running VM?

 block/crypto.c   | 29 ++---
 qapi/block-core.json |  5 -
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/block/crypto.c b/block/crypto.c
index 8237424ae6..034a645652 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -74,6 +74,7 @@ static ssize_t block_crypto_read_func(QCryptoBlock *block,
 struct BlockCryptoCreateData {
 BlockBackend *blk;
 uint64_t size;
+PreallocMode prealloc;
 };
 
 
@@ -112,7 +113,7 @@ static ssize_t block_crypto_init_func(QCryptoBlock *block,
  * available to the guest, so we must take account of that
  * which will be used by the crypto header
  */
-return blk_truncate(data->blk, data->size + headerlen, PREALLOC_MODE_OFF,
+return blk_truncate(data->blk, data->size + headerlen, data->prealloc,
 errp);
 }
 
@@ -251,6 +252,7 @@ static int block_crypto_open_generic(QCryptoBlockFormat 
format,
 static int block_crypto_co_create_generic(BlockDriverState *bs,
   int64_t size,
   QCryptoBlockCreateOptions *opts,
+  PreallocMode prealloc,
   Error **errp)
 {
 int ret;
@@ -266,9 +268,14 @@ static int block_crypto_co_create_generic(BlockDriverState 
*bs,
 goto cleanup;
 }
 
+if (prealloc == PREALLOC_MODE_METADATA) {
+prealloc = PREALLOC_MODE_OFF;
+}
+
 data = (struct BlockCryptoCreateData) {
 .blk = blk,
 .size = size,
+.prealloc = prealloc,
 };
 
 crypto = qcrypto_block_create(opts, NULL,
@@ -500,6 +507,7 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
*create_options, Error **errp)
 BlockdevCreateOptionsLUKS *luks_opts;
 BlockDriverState *bs = NULL;
 QCryptoBlockCreateOptions create_opts;
+PreallocMode preallocation = PREALLOC_MODE_OFF;
 int ret;
 
 assert(create_options->driver == BLOCKDEV_DRIVER_LUKS);
@@ -515,8 +523,11 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
*create_options, Error **errp)
 .u.luks = *qapi_BlockdevCreateOptionsLUKS_base(luks_opts),
 };
 
+if (luks_opts->has_preallocation)
+preallocation = luks_opts->preallocation;
+
 ret = block_crypto_co_create_generic(bs, luks_opts->size, &create_opts,
- errp);
+ preallocation, errp);
 if (ret < 0) {
 goto fail;
 }
@@ -534,12 +545,24 @@ static int coroutine_fn 
block_crypto_co_create_opts_luks(const char *filename,
 QCryptoBlockCreateOptions *create_opts = NULL;
 BlockDriverState *bs = NULL;
 QDict *cryptoopts;
+PreallocMode prealloc;
+char *buf = NULL;
 int64_t size;
 int ret;
+Error *local_err = NULL;
 
 /* Parse options */
 size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
 
+buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
+prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
+   PREALLOC_MODE_OFF, &local_err);
+g_free(buf);
+if (local_err) {
+error_propagate(errp, local_err);
+return -EINVAL;
+}
+
 cryptoopts = qemu_opts_to_qdict_filtered(opts, NULL,
  &block_crypto_create_opts_luks,
  true);
@@ -565,7 +588,7 @@ static int coroutine_fn 
block_crypto_co_create_opts_luks(const char *filename,
 }
 
 /* Create format layer */
-ret = block_crypto_co_create_generic(bs, size, create_opts, errp);
+ret = block_crypto_co_create_generic(bs, size, create_opts, prealloc, 
errp);
 if (ret < 0) {
 goto fail;
 }
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 0d43d4f37c..ebcfc9f903 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4205,13 +4205,16 @@
 #
 # @file Node to create the image format on
 # @size Size of the virtual disk in bytes
+# @preallocation Preallocation mode for the new image (default: off;
+#   allowed values: off/falloc/full
 #
 # Since: 2.12
 ##
 { 'struct': 'BlockdevCreateOptionsLUKS',
   'base': 'QCryptoBlockCreateOptionsLUKS',
   'data': { 'file': 'BlockdevRef',
-  

Re: [Qemu-block] [PATCH v3] LUKS: support preallocation

2019-07-14 Thread Maxim Levitsky
On Thu, 2019-07-11 at 18:27 +0200, Stefano Garzarella wrote:
> On Thu, Jul 11, 2019 at 06:09:40PM +0300, Maxim Levitsky wrote:
> > preallocation=off and preallocation=metadata
> > both allocate luks header only, and preallocation=falloc/full
> > is passed to underlying file.
> > 
> > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951
> > 
> > Signed-off-by: Maxim Levitsky 
> > 
> > ---
> > 
> > Note that QMP support was only compile tested, since I am still learning
> > on how to use it.
> > 
> > If there is some library/script/etc which makes it more high level,
> > I would more that glad to hear about it. So far I used the qmp-shell
> > 
> > Also can I use qmp's blockdev-create outside a vm running?
> > 
> >  block/crypto.c   | 29 ++---
> >  qapi/block-core.json |  5 -
> >  2 files changed, 30 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/crypto.c b/block/crypto.c
> > index 8237424ae6..034a645652 100644
> > --- a/block/crypto.c
> > +++ b/block/crypto.c
> > @@ -74,6 +74,7 @@ static ssize_t block_crypto_read_func(QCryptoBlock *block,
> >  struct BlockCryptoCreateData {
> >  BlockBackend *blk;
> >  uint64_t size;
> > +PreallocMode prealloc;
> >  };
> >  
> >  
> > @@ -112,7 +113,7 @@ static ssize_t block_crypto_init_func(QCryptoBlock 
> > *block,
> >   * available to the guest, so we must take account of that
> >   * which will be used by the crypto header
> >   */
> > -return blk_truncate(data->blk, data->size + headerlen, 
> > PREALLOC_MODE_OFF,
> > +return blk_truncate(data->blk, data->size + headerlen, data->prealloc,
> >  errp);
> >  }
> >  
> > @@ -251,6 +252,7 @@ static int block_crypto_open_generic(QCryptoBlockFormat 
> > format,
> >  static int block_crypto_co_create_generic(BlockDriverState *bs,
> >int64_t size,
> >QCryptoBlockCreateOptions *opts,
> > +  PreallocMode prealloc,
> >Error **errp)
> >  {
> >  int ret;
> > @@ -266,9 +268,14 @@ static int 
> > block_crypto_co_create_generic(BlockDriverState *bs,
> >  goto cleanup;
> >  }
> >  
> > +if (prealloc == PREALLOC_MODE_METADATA) {
> > +prealloc = PREALLOC_MODE_OFF;
> > +}
> > +
> >  data = (struct BlockCryptoCreateData) {
> >  .blk = blk,
> >  .size = size,
> > +.prealloc = prealloc,
> >  };
> >  
> >  crypto = qcrypto_block_create(opts, NULL,
> > @@ -500,6 +507,7 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
> > *create_options, Error **errp)
> >  BlockdevCreateOptionsLUKS *luks_opts;
> >  BlockDriverState *bs = NULL;
> >  QCryptoBlockCreateOptions create_opts;
> > +PreallocMode preallocation = PREALLOC_MODE_OFF;
> >  int ret;
> >  
> >  assert(create_options->driver == BLOCKDEV_DRIVER_LUKS);
> > @@ -515,8 +523,11 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
> > *create_options, Error **errp)
> >  .u.luks = *qapi_BlockdevCreateOptionsLUKS_base(luks_opts),
> >  };
> >  
> > +if (luks_opts->has_preallocation)
> > +preallocation = luks_opts->preallocation;
> > +
> >  ret = block_crypto_co_create_generic(bs, luks_opts->size, &create_opts,
> > - errp);
> > + preallocation, errp);
> >  if (ret < 0) {
> >  goto fail;
> >  }
> > @@ -534,12 +545,24 @@ static int coroutine_fn 
> > block_crypto_co_create_opts_luks(const char *filename,
> >  QCryptoBlockCreateOptions *create_opts = NULL;
> >  BlockDriverState *bs = NULL;
> >  QDict *cryptoopts;
> > +PreallocMode prealloc;
> > +char *buf = NULL;
> >  int64_t size;
> >  int ret;
> > +Error *local_err = NULL;
> >  
> >  /* Parse options */
> >  size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
> >  
> > +buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
> > +prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
> > +   PREALLOC_MODE_OFF, &local_err);
> > +g_free(buf);
> > +if (local_err) {
> > +  

Re: [Qemu-block] [PATCH v3] LUKS: support preallocation

2019-07-15 Thread Maxim Levitsky
On Mon, 2019-07-15 at 10:30 +0200, Stefano Garzarella wrote:
> On Sun, Jul 14, 2019 at 05:51:51PM +0300, Maxim Levitsky wrote:
> > On Thu, 2019-07-11 at 18:27 +0200, Stefano Garzarella wrote:
> > > On Thu, Jul 11, 2019 at 06:09:40PM +0300, Maxim Levitsky wrote:
> > > > preallocation=off and preallocation=metadata
> > > > both allocate luks header only, and preallocation=falloc/full
> > > > is passed to underlying file.
> > > > 
> > > > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951
> > > > 
> > > > Signed-off-by: Maxim Levitsky 
> > > > 
> > > > ---
> > > > 
> > > > Note that QMP support was only compile tested, since I am still learning
> > > > on how to use it.
> > > > 
> > > > If there is some library/script/etc which makes it more high level,
> > > > I would more that glad to hear about it. So far I used the qmp-shell
> > > > 
> > > > Also can I use qmp's blockdev-create outside a vm running?
> > > > 
> > > >  block/crypto.c   | 29 ++---
> > > >  qapi/block-core.json |  5 -
> > > >  2 files changed, 30 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/block/crypto.c b/block/crypto.c
> > > > index 8237424ae6..034a645652 100644
> > > > --- a/block/crypto.c
> > > > +++ b/block/crypto.c
> > > > @@ -74,6 +74,7 @@ static ssize_t block_crypto_read_func(QCryptoBlock 
> > > > *block,
> > > >  struct BlockCryptoCreateData {
> > > >  BlockBackend *blk;
> > > >  uint64_t size;
> > > > +PreallocMode prealloc;
> > > >  };
> > > >  
> > > >  
> > > > @@ -112,7 +113,7 @@ static ssize_t block_crypto_init_func(QCryptoBlock 
> > > > *block,
> > > >   * available to the guest, so we must take account of that
> > > >   * which will be used by the crypto header
> > > >   */
> > > > -return blk_truncate(data->blk, data->size + headerlen, 
> > > > PREALLOC_MODE_OFF,
> > > > +return blk_truncate(data->blk, data->size + headerlen, 
> > > > data->prealloc,
> > > >  errp);
> > > >  }
> > > >  
> > > > @@ -251,6 +252,7 @@ static int 
> > > > block_crypto_open_generic(QCryptoBlockFormat format,
> > > >  static int block_crypto_co_create_generic(BlockDriverState *bs,
> > > >int64_t size,
> > > >QCryptoBlockCreateOptions 
> > > > *opts,
> > > > +  PreallocMode prealloc,
> > > >Error **errp)
> > > >  {
> > > >  int ret;
> > > > @@ -266,9 +268,14 @@ static int 
> > > > block_crypto_co_create_generic(BlockDriverState *bs,
> > > >  goto cleanup;
> > > >  }
> > > >  
> > > > +if (prealloc == PREALLOC_MODE_METADATA) {
> > > > +prealloc = PREALLOC_MODE_OFF;
> > > > +}
> > > > +
> > > >  data = (struct BlockCryptoCreateData) {
> > > >  .blk = blk,
> > > >  .size = size,
> > > > +.prealloc = prealloc,
> > > >  };
> > > >  
> > > >  crypto = qcrypto_block_create(opts, NULL,
> > > > @@ -500,6 +507,7 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
> > > > *create_options, Error **errp)
> > > >  BlockdevCreateOptionsLUKS *luks_opts;
> > > >  BlockDriverState *bs = NULL;
> > > >  QCryptoBlockCreateOptions create_opts;
> > > > +PreallocMode preallocation = PREALLOC_MODE_OFF;
> > > >  int ret;
> > > >  
> > > >  assert(create_options->driver == BLOCKDEV_DRIVER_LUKS);
> > > > @@ -515,8 +523,11 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
> > > > *create_options, Error **errp)
> > > >  .u.luks = *qapi_BlockdevCreateOptionsLUKS_base(luks_opts),
> > > >  };
> > > >  
> > > > +if (luks_opts->has_preallocation)
> > > > +preallocation = luks_opts->preallocation;
> > > > +
> > > >  ret = blo

[Qemu-block] [PATCH v4] LUKS: support preallocation

2019-07-16 Thread Maxim Levitsky
preallocation=off and preallocation=metadata
both allocate luks header only, and preallocation=falloc/full
is passed to underlying file.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951

Signed-off-by: Maxim Levitsky 
---
 block/crypto.c   | 29 ++---
 qapi/block-core.json |  5 -
 2 files changed, 30 insertions(+), 4 deletions(-)


Changes from V3: updated the blockdev-create description

diff --git a/block/crypto.c b/block/crypto.c
index 8237424ae6..034a645652 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -74,6 +74,7 @@ static ssize_t block_crypto_read_func(QCryptoBlock *block,
 struct BlockCryptoCreateData {
 BlockBackend *blk;
 uint64_t size;
+PreallocMode prealloc;
 };
 
 
@@ -112,7 +113,7 @@ static ssize_t block_crypto_init_func(QCryptoBlock *block,
  * available to the guest, so we must take account of that
  * which will be used by the crypto header
  */
-return blk_truncate(data->blk, data->size + headerlen, PREALLOC_MODE_OFF,
+return blk_truncate(data->blk, data->size + headerlen, data->prealloc,
 errp);
 }
 
@@ -251,6 +252,7 @@ static int block_crypto_open_generic(QCryptoBlockFormat 
format,
 static int block_crypto_co_create_generic(BlockDriverState *bs,
   int64_t size,
   QCryptoBlockCreateOptions *opts,
+  PreallocMode prealloc,
   Error **errp)
 {
 int ret;
@@ -266,9 +268,14 @@ static int block_crypto_co_create_generic(BlockDriverState 
*bs,
 goto cleanup;
 }
 
+if (prealloc == PREALLOC_MODE_METADATA) {
+prealloc = PREALLOC_MODE_OFF;
+}
+
 data = (struct BlockCryptoCreateData) {
 .blk = blk,
 .size = size,
+.prealloc = prealloc,
 };
 
 crypto = qcrypto_block_create(opts, NULL,
@@ -500,6 +507,7 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
*create_options, Error **errp)
 BlockdevCreateOptionsLUKS *luks_opts;
 BlockDriverState *bs = NULL;
 QCryptoBlockCreateOptions create_opts;
+PreallocMode preallocation = PREALLOC_MODE_OFF;
 int ret;
 
 assert(create_options->driver == BLOCKDEV_DRIVER_LUKS);
@@ -515,8 +523,11 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
*create_options, Error **errp)
 .u.luks = *qapi_BlockdevCreateOptionsLUKS_base(luks_opts),
 };
 
+if (luks_opts->has_preallocation)
+preallocation = luks_opts->preallocation;
+
 ret = block_crypto_co_create_generic(bs, luks_opts->size, &create_opts,
- errp);
+ preallocation, errp);
 if (ret < 0) {
 goto fail;
 }
@@ -534,12 +545,24 @@ static int coroutine_fn 
block_crypto_co_create_opts_luks(const char *filename,
 QCryptoBlockCreateOptions *create_opts = NULL;
 BlockDriverState *bs = NULL;
 QDict *cryptoopts;
+PreallocMode prealloc;
+char *buf = NULL;
 int64_t size;
 int ret;
+Error *local_err = NULL;
 
 /* Parse options */
 size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
 
+buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
+prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
+   PREALLOC_MODE_OFF, &local_err);
+g_free(buf);
+if (local_err) {
+error_propagate(errp, local_err);
+return -EINVAL;
+}
+
 cryptoopts = qemu_opts_to_qdict_filtered(opts, NULL,
  &block_crypto_create_opts_luks,
  true);
@@ -565,7 +588,7 @@ static int coroutine_fn 
block_crypto_co_create_opts_luks(const char *filename,
 }
 
 /* Create format layer */
-ret = block_crypto_co_create_generic(bs, size, create_opts, errp);
+ret = block_crypto_co_create_generic(bs, size, create_opts, prealloc, 
errp);
 if (ret < 0) {
 goto fail;
 }
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 0d43d4f37c..9c04d83fa2 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4205,13 +4205,16 @@
 #
 # @file Node to create the image format on
 # @size Size of the virtual disk in bytes
+# @preallocation Preallocation mode for the new image (default: off;
+#   allowed values: off/metadata/falloc/full (since: 4.2)
 #
 # Since: 2.12
 ##
 { 'struct': 'BlockdevCreateOptionsLUKS',
   'base': 'QCryptoBlockCreateOptionsLUKS',
   'data': { 'file': 'BlockdevRef',
-'size': 'size' } }
+'size': 'size',
+'*preallocation':   'PreallocMode' } }
 
 ##
 # @BlockdevCreateOptionsNfs:
-- 
2.17.2




Re: [Qemu-block] [PATCH v4] LUKS: support preallocation

2019-07-16 Thread Maxim Levitsky
On Tue, 2019-07-16 at 14:41 +0200, Max Reitz wrote:
> On 16.07.19 10:15, Maxim Levitsky wrote:
> > preallocation=off and preallocation=metadata
> > both allocate luks header only, and preallocation=falloc/full
> > is passed to underlying file.
> > 
> > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/crypto.c   | 29 ++---
> >  qapi/block-core.json |  5 -
> >  2 files changed, 30 insertions(+), 4 deletions(-)
> > 
> > 
> > Changes from V3: updated the blockdev-create description
> 
> Looks good functionally, but there is a syntax problem:
> 
> > diff --git a/block/crypto.c b/block/crypto.c
> > index 8237424ae6..034a645652 100644
> > --- a/block/crypto.c
> > +++ b/block/crypto.c
> 
> [...]
> 
> > @@ -515,8 +523,11 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
> > *create_options, Error **errp)
> >  .u.luks = *qapi_BlockdevCreateOptionsLUKS_base(luks_opts),
> >  };
> >  
> > +if (luks_opts->has_preallocation)
> > +preallocation = luks_opts->preallocation;
> 
> This lacks curly brackets.
In my defense, I am too used to this from kernel programming.
Eventually I will stop missing it.

> 
> > +
> >  ret = block_crypto_co_create_generic(bs, luks_opts->size, &create_opts,
> > - errp);
> > + preallocation, errp);
> >  if (ret < 0) {
> >  goto fail;
> >  }
> 
> [...]
> 
> > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > index 0d43d4f37c..9c04d83fa2 100644
> > --- a/qapi/block-core.json
> > +++ b/qapi/block-core.json
> > @@ -4205,13 +4205,16 @@
> >  #
> >  # @file Node to create the image format on
> >  # @size Size of the virtual disk in bytes
> > +# @preallocation Preallocation mode for the new image (default: off;
> > +#   allowed values: off/metadata/falloc/full (since: 4.2)
> 
> Also, this lacks a closing parenthesis somewhere.
True, I need more coffee.


> 
> Max
> 
> >  #
> >  # Since: 2.12
> >  ##
> >  { 'struct': 'BlockdevCreateOptionsLUKS',
> >'base': 'QCryptoBlockCreateOptionsLUKS',
> >'data': { 'file': 'BlockdevRef',
> > -'size': 'size' } }
> > +'size': 'size',
> > +'*preallocation':   'PreallocMode' } }
> >  
> >  ##
> >  # @BlockdevCreateOptionsNfs:
> > 
> 
> 


Best regards,
Maxim Levitsky





Re: [Qemu-block] [PATCH 6/7] iscsi: Drop iscsi_co_create_opts()

2019-07-16 Thread Maxim Levitsky
On Fri, 2019-07-12 at 19:35 +0200, Max Reitz wrote:
> The generic fallback implementation effectively does the same.
> 
> Signed-off-by: Max Reitz 
> ---
>  block/iscsi.c | 56 ---
>  1 file changed, 56 deletions(-)
> 
> diff --git a/block/iscsi.c b/block/iscsi.c
> index 267f160bf6..0e5729d335 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -2157,58 +2157,6 @@ static int coroutine_fn 
> iscsi_co_truncate(BlockDriverState *bs, int64_t offset,
>  return 0;
>  }
>  
> -static int coroutine_fn iscsi_co_create_opts(const char *filename, QemuOpts 
> *opts,
> - Error **errp)
> -{
> -int ret = 0;
> -int64_t total_size = 0;
> -BlockDriverState *bs;
> -IscsiLun *iscsilun = NULL;
> -QDict *bs_options;
> -Error *local_err = NULL;
> -
> -bs = bdrv_new();
> -
> -/* Read out options */
> -total_size = DIV_ROUND_UP(qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0),
> -  BDRV_SECTOR_SIZE);
> -bs->opaque = g_new0(struct IscsiLun, 1);
> -iscsilun = bs->opaque;
> -
> -bs_options = qdict_new();
> -iscsi_parse_filename(filename, bs_options, &local_err);
> -if (local_err) {
> -error_propagate(errp, local_err);
> -ret = -EINVAL;
> -} else {
> -ret = iscsi_open(bs, bs_options, 0, NULL);
> -}
> -qobject_unref(bs_options);
> -
> -if (ret != 0) {
> -goto out;
> -}
> -iscsi_detach_aio_context(bs);
> -if (iscsilun->type != TYPE_DISK) {
> -ret = -ENODEV;
> -goto out;
> -}
> -if (bs->total_sectors < total_size) {
> -ret = -ENOSPC;
> -goto out;
> -}
> -
> -ret = 0;
> -out:
> -if (iscsilun->iscsi != NULL) {
> -iscsi_destroy_context(iscsilun->iscsi);
> -}
> -g_free(bs->opaque);
> -bs->opaque = NULL;
> -bdrv_unref(bs);
> -return ret;
> -}
> -
>  static int iscsi_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>  {
>  IscsiLun *iscsilun = bs->opaque;
> @@ -2479,8 +2427,6 @@ static BlockDriver bdrv_iscsi = {
>  .bdrv_parse_filename= iscsi_parse_filename,
>  .bdrv_file_open = iscsi_open,
>  .bdrv_close = iscsi_close,
> -.bdrv_co_create_opts= iscsi_co_create_opts,
> -.create_opts= &iscsi_create_opts,
>  .bdrv_reopen_prepare= iscsi_reopen_prepare,
>  .bdrv_reopen_commit = iscsi_reopen_commit,
>  .bdrv_co_invalidate_cache = iscsi_co_invalidate_cache,
> @@ -2518,8 +2464,6 @@ static BlockDriver bdrv_iser = {
>  .bdrv_parse_filename= iscsi_parse_filename,
>  .bdrv_file_open = iscsi_open,
>  .bdrv_close = iscsi_close,
> -.bdrv_co_create_opts= iscsi_co_create_opts,
> -.create_opts= &iscsi_create_opts,
>  .bdrv_reopen_prepare= iscsi_reopen_prepare,
>  .bdrv_reopen_commit = iscsi_reopen_commit,
>  .bdrv_co_invalidate_cache  = iscsi_co_invalidate_cache,


Well, in theory the original code did not zero the first sector, as the 
generic code will now do,
but this is OK for the same reasons the original zeroing code was added.


Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky





Re: [Qemu-block] [PATCH 5/7] file-posix: Drop hdev_co_create_opts()

2019-07-16 Thread Maxim Levitsky
On Fri, 2019-07-12 at 19:35 +0200, Max Reitz wrote:
> The generic fallback implementation effectively does the same.
> 
> Signed-off-by: Max Reitz 
> ---
>  block/file-posix.c | 67 --
>  1 file changed, 67 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 4479cc7ab4..65bd6d 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -3325,67 +3325,6 @@ static coroutine_fn int 
> hdev_co_pwrite_zeroes(BlockDriverState *bs,
>  return raw_do_pwrite_zeroes(bs, offset, bytes, flags, true);
>  }
>  
> -static int coroutine_fn hdev_co_create_opts(const char *filename, QemuOpts 
> *opts,
> -Error **errp)
> -{
> -int fd;
> -int ret = 0;
> -struct stat stat_buf;
> -int64_t total_size = 0;
> -bool has_prefix;
> -
> -/* This function is used by both protocol block drivers and therefore 
> either
> - * of these prefixes may be given.
> - * The return value has to be stored somewhere, otherwise this is an 
> error
> - * due to -Werror=unused-value. */
> -has_prefix =
> -strstart(filename, "host_device:", &filename) ||
> -strstart(filename, "host_cdrom:" , &filename);
> -
> -(void)has_prefix;
> -
> -ret = raw_normalize_devicepath(&filename, errp);
> -if (ret < 0) {
> -return ret;
> -}
> -
> -/* Read out options */
> -total_size = ROUND_UP(qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0),
> -  BDRV_SECTOR_SIZE);
> -
> -fd = qemu_open(filename, O_WRONLY | O_BINARY);
> -if (fd < 0) {
> -ret = -errno;
> -error_setg_errno(errp, -ret, "Could not open device");
> -return ret;
> -}
> -
> -if (fstat(fd, &stat_buf) < 0) {
> -ret = -errno;
> -error_setg_errno(errp, -ret, "Could not stat device");
> -} else if (!S_ISBLK(stat_buf.st_mode) && !S_ISCHR(stat_buf.st_mode)) {
> -error_setg(errp,
> -   "The given file is neither a block nor a character 
> device");
> -ret = -ENODEV;
> -} else if (lseek(fd, 0, SEEK_END) < total_size) {
> -error_setg(errp, "Device is too small");
> -ret = -ENOSPC;
> -}
> -
> -if (!ret && total_size) {
> -uint8_t buf[BDRV_SECTOR_SIZE] = { 0 };
> -int64_t zero_size = MIN(BDRV_SECTOR_SIZE, total_size);
> -if (lseek(fd, 0, SEEK_SET) == -1) {
> -ret = -errno;
> -} else {
> -ret = qemu_write_full(fd, buf, zero_size);
> -ret = ret == zero_size ? 0 : -errno;
> -}
> -}
> -qemu_close(fd);
> -return ret;
> -}
> -
>  static BlockDriver bdrv_host_device = {
>  .format_name= "host_device",
>  .protocol_name= "host_device",
> @@ -3398,8 +3337,6 @@ static BlockDriver bdrv_host_device = {
>  .bdrv_reopen_prepare = raw_reopen_prepare,
>  .bdrv_reopen_commit  = raw_reopen_commit,
>  .bdrv_reopen_abort   = raw_reopen_abort,
> -.bdrv_co_create_opts = hdev_co_create_opts,
> -.create_opts = &raw_create_opts,
>  .mutable_opts= mutable_opts,
>  .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
>  .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes,
> @@ -3525,8 +3462,6 @@ static BlockDriver bdrv_host_cdrom = {
>  .bdrv_reopen_prepare = raw_reopen_prepare,
>  .bdrv_reopen_commit  = raw_reopen_commit,
>  .bdrv_reopen_abort   = raw_reopen_abort,
> -.bdrv_co_create_opts = hdev_co_create_opts,
> -.create_opts = &raw_create_opts,
>  .mutable_opts= mutable_opts,
>  .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
>  
> @@ -3659,8 +3594,6 @@ static BlockDriver bdrv_host_cdrom = {
>  .bdrv_reopen_prepare = raw_reopen_prepare,
>  .bdrv_reopen_commit  = raw_reopen_commit,
>  .bdrv_reopen_abort   = raw_reopen_abort,
> -.bdrv_co_create_opts = hdev_co_create_opts,
> -.create_opts= &raw_create_opts,
>  .mutable_opts   = mutable_opts,
>  
>  .bdrv_co_preadv = raw_co_preadv,


Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH 2/7] block: Add blk_truncate_for_formatting()

2019-07-16 Thread Maxim Levitsky
On Fri, 2019-07-12 at 19:35 +0200, Max Reitz wrote:
> Signed-off-by: Max Reitz 
> ---
>  include/sysemu/block-backend.h | 12 
>  block/block-backend.c  | 54 ++
>  2 files changed, 66 insertions(+)
> 
> diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
> index 733c4957eb..cd9ec8bf52 100644
> --- a/include/sysemu/block-backend.h
> +++ b/include/sysemu/block-backend.h
> @@ -236,6 +236,18 @@ int blk_pwrite_compressed(BlockBackend *blk, int64_t 
> offset, const void *buf,
>int bytes);
>  int blk_truncate(BlockBackend *blk, int64_t offset, PreallocMode prealloc,
>   Error **errp);
> +
> +/**
> + * Wrapper of blk_truncate() for format drivers that need to truncate
> + * their protocol node before formatting it.
> + * Invoke blk_truncate() to truncate the file to @offset; if that
> + * fails with -ENOTSUP (and the file is already big enough), try to
> + * overwrite the first sector with zeroes.  If that succeeds, return
> + * success.
> + */
> +int blk_truncate_for_formatting(BlockBackend *blk, int64_t offset,
> +Error **errp);
> +
>  int blk_pdiscard(BlockBackend *blk, int64_t offset, int bytes);
>  int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
>   int64_t pos, int size);
> diff --git a/block/block-backend.c b/block/block-backend.c
> index a8d160fd5d..c0e64b1ee1 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -2041,6 +2041,60 @@ int blk_truncate(BlockBackend *blk, int64_t offset, 
> PreallocMode prealloc,
>  return bdrv_truncate(blk->root, offset, prealloc, errp);
>  }
>  
> +int blk_truncate_for_formatting(BlockBackend *blk, int64_t offset, Error 
> **errp)
> +{
> +Error *local_err = NULL;
> +int64_t current_size;
> +int bytes_to_clear;
> +int ret;
> +
> +ret = blk_truncate(blk, offset, PREALLOC_MODE_OFF, &local_err);
> +if (ret < 0 && ret != -ENOTSUP) {
> +error_propagate(errp, local_err);
> +return ret;
> +} else if (ret >= 0) {
> +return ret;
> +}

What if the truncate does succeed? For example, the current implementation of 
raw_co_truncate returns zero when you truncate to less than the block device size 
(and this is kind of wrong, since you can't really change the block device size).

Even more, I see that in a later patch you call this with offset == 0,
which I think will always succeed on a raw block device, thus skipping the zeroing 
code.

How about just doing the zeroing in bdrv_create_file_fallback?


Another idea:

blk_truncate_for_formatting would first truncate the file to 0, then
check whether the size of the file actually became zero, in addition to checking
for a successful return value.

If the file size became zero, truncate the file to the requested size - this 
should make sure that the file is empty.
Otherwise, zero the first sector.

It might also be nice to check that if the size didn't become zero, it at least
remained the same, to avoid strange situations with a semi-broken truncate.
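
Roughly something like this (only a sketch of the idea, untested;
blk_pwrite_zeroes() and the error handling details are my guesses, not taken
from this series):

static int clear_image_for_formatting(BlockBackend *blk, int64_t offset,
                                      Error **errp)
{
    int64_t orig_size = blk_getlength(blk);
    int ret;

    if (orig_size < 0) {
        error_setg_errno(errp, -orig_size, "Failed to query image size");
        return orig_size;
    }

    ret = blk_truncate(blk, 0, PREALLOC_MODE_OFF, errp);
    if (ret < 0) {
        return ret;
    }

    if (blk_getlength(blk) == 0) {
        /* truncation really worked, now grow back to the requested size */
        return blk_truncate(blk, offset, PREALLOC_MODE_OFF, errp);
    }

    if (blk_getlength(blk) != orig_size) {
        /* semi-broken truncate: size is neither zero nor unchanged */
        error_setg(errp, "Image size changed unexpectedly while clearing it");
        return -EIO;
    }

    /* size unchanged (e.g. a raw block device): just clear the first sector */
    return blk_pwrite_zeroes(blk, 0, BDRV_SECTOR_SIZE, 0);
}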


Also, I would rename the function to something like blk_raw_format_file -
basically a function which tries its best to erase an existing file's contents.


Yet another idea would be to drop the lying in raw_co_truncate (on block 
devices) and always fail unless asked to truncate to the exact file size,
and let the callers deal with that.
Callers for which it is not critical that the truncate works can just ignore 
the failure.
That is probably hard to implement.

Or we can add a truncate 'mode' to .bdrv_co_truncate, which would let the 
caller indicate its intention,
that is, whether it must truncate to exactly that size or can accept the 
truncate ending up with a bigger file than it asked for.

As we once discussed on IRC, the fact that truncate on a block device 
'succeeds',
despite not really being able to change the block device size, causes other 
issues,
like not being able to use preallocation=full when creating a qcow2 image on a 
block device.

Best regards,
Maxim Levitsky

> +
> +current_size = blk_getlength(blk);
> +if (current_size < 0) {
> +error_free(local_err);
> +error_setg_errno(errp, -current_size,
> + "Failed to inquire new image file's current 
> length");
> +return current_size;
> +}
> +
> +if (current_size < offset) {
> +/* Need to grow the image, but we failed to do that */
> +error_propagate(errp, local_err);
> +return -ENOTSUP;
> +}
> +
> +error_free(local_err);
> +/*
> + * We can deal with images that are too big.  We just need to
> + * clear

Re: [Qemu-block] [PATCH 3/7] block: Use blk_truncate_for_formatting()

2019-07-16 Thread Maxim Levitsky
On Fri, 2019-07-12 at 19:35 +0200, Max Reitz wrote:
> Signed-off-by: Max Reitz 
> ---
>  block/parallels.c | 2 +-
>  block/qcow.c  | 2 +-
>  block/qcow2.c | 2 +-
>  block/qed.c   | 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/parallels.c b/block/parallels.c
> index 00fae125d1..a17b2d92f2 100644
> --- a/block/parallels.c
> +++ b/block/parallels.c
> @@ -563,7 +563,7 @@ static int coroutine_fn 
> parallels_co_create(BlockdevCreateOptions* opts,
>  blk_set_allow_write_beyond_eof(blk, true);
>  
>  /* Create image format */
> -ret = blk_truncate(blk, 0, PREALLOC_MODE_OFF, errp);
> +ret = blk_truncate_for_formatting(blk, 0, errp);
>  if (ret < 0) {
>  goto out;
>  }
> diff --git a/block/qcow.c b/block/qcow.c
> index 5bdf72ba33..86034135f9 100644
> --- a/block/qcow.c
> +++ b/block/qcow.c
> @@ -858,7 +858,7 @@ static int coroutine_fn 
> qcow_co_create(BlockdevCreateOptions *opts,
>  blk_set_allow_write_beyond_eof(qcow_blk, true);
>  
>  /* Create image format */
> -ret = blk_truncate(qcow_blk, 0, PREALLOC_MODE_OFF, errp);
> +ret = blk_truncate_for_formatting(qcow_blk, 0, errp);
>  if (ret < 0) {
>  goto exit;
>  }
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 039bdc2f7e..f3e53c781d 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -3184,7 +3184,7 @@ qcow2_co_create(BlockdevCreateOptions *create_options, 
> Error **errp)
>  blk_set_allow_write_beyond_eof(blk, true);
>  
>  /* Clear the protocol layer and preallocate it if necessary */
> -ret = blk_truncate(blk, 0, PREALLOC_MODE_OFF, errp);
> +ret = blk_truncate_for_formatting(blk, 0, errp);
>  if (ret < 0) {
>  goto out;
>  }
> diff --git a/block/qed.c b/block/qed.c
> index 77c7cef175..ec244158b5 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -673,7 +673,7 @@ static int coroutine_fn 
> bdrv_qed_co_create(BlockdevCreateOptions *opts,
>  l1_size = header.cluster_size * header.table_size;
>  
>  /* File must start empty and grow, check truncate is supported */
> -ret = blk_truncate(blk, 0, PREALLOC_MODE_OFF, errp);
> +ret = blk_truncate_for_formatting(blk, 0, errp);
>  if (ret < 0) {
>  goto out;
>  }


Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH 4/7] block: Generic file creation fallback

2019-07-16 Thread Maxim Levitsky
On Fri, 2019-07-12 at 19:35 +0200, Max Reitz wrote:
> If a protocol driver does not support image creation, we can see whether
> maybe the file exists already.  If so, just truncating it will be
> sufficient.
> 
> Signed-off-by: Max Reitz 
> ---
>  block.c | 83 -
>  1 file changed, 71 insertions(+), 12 deletions(-)
> 
> diff --git a/block.c b/block.c
> index c139540f2b..5466585501 100644
> --- a/block.c
> +++ b/block.c
> @@ -531,20 +531,63 @@ out:
>  return ret;
>  }
>  
> -int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp)
> +static int bdrv_create_file_fallback(const char *filename, BlockDriver *drv,
> + QemuOpts *opts, Error **errp)
>  {
> -BlockDriver *drv;
> +BlockBackend *blk;
> +QDict *options = qdict_new();
> +int64_t size = 0;
> +char *buf = NULL;
> +PreallocMode prealloc;
>  Error *local_err = NULL;
>  int ret;
>  
> +size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
> +buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
> +prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
> +   PREALLOC_MODE_OFF, &local_err);
> +g_free(buf);
> +if (local_err) {
> +error_propagate(errp, local_err);
> +return -EINVAL;
> +}
> +
> +if (prealloc != PREALLOC_MODE_OFF) {
> +error_setg(errp, "Unsupported preallocation mode '%s'",
> +   PreallocMode_str(prealloc));
> +return -ENOTSUP;
> +}
> +
> +qdict_put_str(options, "driver", drv->format_name);
> +
> +blk = blk_new_open(filename, NULL, options,
> +   BDRV_O_RDWR | BDRV_O_RESIZE, errp);
> +if (!blk) {
> +error_prepend(errp, "Protocol driver '%s' does not support image "
> +  "creation, and opening the image failed: ",
> +  drv->format_name);
> +return -EINVAL;
> +}
> +
> +ret = blk_truncate_for_formatting(blk, size, errp);
> +blk_unref(blk);
> +return ret;
> +}
> +
> +int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp)
> +{
> +BlockDriver *drv;
> +
>  drv = bdrv_find_protocol(filename, true, errp);
>  if (drv == NULL) {
>  return -ENOENT;
>  }
>  
> -ret = bdrv_create(drv, filename, opts, &local_err);
> -error_propagate(errp, local_err);
> -return ret;
> +if (drv->bdrv_co_create_opts) {
> +return bdrv_create(drv, filename, opts, errp);
> +} else {
> +return bdrv_create_file_fallback(filename, drv, opts, errp);
> +}
>  }
>  
>  /**
> @@ -1420,6 +1463,24 @@ QemuOptsList bdrv_runtime_opts = {
>  },
>  };
>  
> +static QemuOptsList fallback_create_opts = {
> +.name = "fallback-create-opts",
> +.head = QTAILQ_HEAD_INITIALIZER(fallback_create_opts.head),
> +.desc = {
> +{
> +.name = BLOCK_OPT_SIZE,
> +.type = QEMU_OPT_SIZE,
> +.help = "Virtual disk size"
> +},
> +{
> +.name = BLOCK_OPT_PREALLOC,
> +.type = QEMU_OPT_STRING,
> +.help = "Preallocation mode (allowed values: off)"
> +},
> +{ /* end of list */ }
> +}
> +};
> +
>  /*
>   * Common part for opening disk images and files
>   *
> @@ -5681,14 +5742,12 @@ void bdrv_img_create(const char *filename, const char 
> *fmt,
>  return;
>  }
>  
> -if (!proto_drv->create_opts) {
> -error_setg(errp, "Protocol driver '%s' does not support image 
> creation",
> -   proto_drv->format_name);
> -return;
> -}
> -
>  create_opts = qemu_opts_append(create_opts, drv->create_opts);
> -create_opts = qemu_opts_append(create_opts, proto_drv->create_opts);
> +if (proto_drv->create_opts) {
> +create_opts = qemu_opts_append(create_opts, proto_drv->create_opts);
> +} else {
> +create_opts = qemu_opts_append(create_opts, &fallback_create_opts);
> +}
>  
>  /* Create parameter list with default values */
>  opts = qemu_opts_create(create_opts, NULL, 0, &error_abort);

Looks good!


Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH 1/7] block/nbd: Fix hang in .bdrv_close()

2019-07-16 Thread Maxim Levitsky
On Fri, 2019-07-12 at 19:35 +0200, Max Reitz wrote:
> When nbd_close() is called from a coroutine, the connection_co never
> gets to run, and thus nbd_teardown_connection() hangs.
> 
> This is because aio_co_enter() only puts the connection_co into the main
> coroutine's wake-up queue, so this main coroutine needs to yield and
> wait for connection_co to terminate.

After diving into NBD's coroutines (this is the 2nd time I have done this) and 
speaking about this
with Max Reitz on IRC, could I suggest extending the explanation a bit, 
something like this:


When nbd_close() is called from a coroutine, the connection_co never
gets to run, and thus nbd_teardown_connection() hangs.

This happens because nbd_teardown_connection() wakes connection_co by closing 
the socket,
which unblocks the IO channel on which connection_co is waiting.

However, connection_co is woken via aio_co_wake(), which assumes that when the 
caller is already in
a coroutine, it does not switch to the woken coroutine immediately, but instead 
adds it to the list
of coroutines to be woken right after the current coroutine 
yields/terminates (self->co_queue_wakeup).
Since we instead do aio_poll(), that never happens.


The patch itself looks fine, so
Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky

> 
> Suggested-by: Kevin Wolf 
> Signed-off-by: Max Reitz 
> ---
>  block/nbd.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 81edabbf35..8f5ee86842 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -61,6 +61,7 @@ typedef struct BDRVNBDState {
>  CoMutex send_mutex;
>  CoQueue free_sema;
>  Coroutine *connection_co;
> +Coroutine *teardown_co;
>  int in_flight;
>  
>  NBDClientRequest requests[MAX_NBD_REQUESTS];
> @@ -135,7 +136,15 @@ static void nbd_teardown_connection(BlockDriverState *bs)
>  qio_channel_shutdown(s->ioc,
>   QIO_CHANNEL_SHUTDOWN_BOTH,
>   NULL);
> -BDRV_POLL_WHILE(bs, s->connection_co);
> +if (qemu_in_coroutine()) {
> +s->teardown_co = qemu_coroutine_self();
> +/* connection_co resumes us when it terminates */
> +qemu_coroutine_yield();
> +s->teardown_co = NULL;
> +} else {
> +BDRV_POLL_WHILE(bs, s->connection_co);
> +}
> +assert(!s->connection_co);
>  
>  nbd_client_detach_aio_context(bs);
>  object_unref(OBJECT(s->sioc));
> @@ -207,6 +216,9 @@ static coroutine_fn void nbd_connection_entry(void 
> *opaque)
>  bdrv_dec_in_flight(s->bs);
>  
>  s->connection_co = NULL;
> +if (s->teardown_co) {
> +aio_co_wake(s->teardown_co);
> +}
>  aio_wait_kick();
>  }
>  





Re: [Qemu-block] [PATCH 2/7] block: Add blk_truncate_for_formatting()

2019-07-16 Thread Maxim Levitsky
On Tue, 2019-07-16 at 16:08 +0300, Maxim Levitsky wrote:
> On Fri, 2019-07-12 at 19:35 +0200, Max Reitz wrote:
> > Signed-off-by: Max Reitz 
> > ---
> >  include/sysemu/block-backend.h | 12 
> >  block/block-backend.c  | 54 ++
> >  2 files changed, 66 insertions(+)
> > 
> > diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
> > index 733c4957eb..cd9ec8bf52 100644
> > --- a/include/sysemu/block-backend.h
> > +++ b/include/sysemu/block-backend.h
> > @@ -236,6 +236,18 @@ int blk_pwrite_compressed(BlockBackend *blk, int64_t 
> > offset, const void *buf,
> >int bytes);
> >  int blk_truncate(BlockBackend *blk, int64_t offset, PreallocMode prealloc,
> >   Error **errp);
> > +
> > +/**
> > + * Wrapper of blk_truncate() for format drivers that need to truncate
> > + * their protocol node before formatting it.
> > + * Invoke blk_truncate() to truncate the file to @offset; if that
> > + * fails with -ENOTSUP (and the file is already big enough), try to
> > + * overwrite the first sector with zeroes.  If that succeeds, return
> > + * success.
> > + */
> > +int blk_truncate_for_formatting(BlockBackend *blk, int64_t offset,
> > +Error **errp);
> > +
> >  int blk_pdiscard(BlockBackend *blk, int64_t offset, int bytes);
> >  int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
> >   int64_t pos, int size);
> > diff --git a/block/block-backend.c b/block/block-backend.c
> > index a8d160fd5d..c0e64b1ee1 100644
> > --- a/block/block-backend.c
> > +++ b/block/block-backend.c
> > @@ -2041,6 +2041,60 @@ int blk_truncate(BlockBackend *blk, int64_t offset, 
> > PreallocMode prealloc,
> >  return bdrv_truncate(blk->root, offset, prealloc, errp);
> >  }
> >  
> > +int blk_truncate_for_formatting(BlockBackend *blk, int64_t offset, Error 
> > **errp)
> > +{
> > +Error *local_err = NULL;
> > +int64_t current_size;
> > +int bytes_to_clear;
> > +int ret;
> > +
> > +ret = blk_truncate(blk, offset, PREALLOC_MODE_OFF, &local_err);
> > +if (ret < 0 && ret != -ENOTSUP) {
> > +error_propagate(errp, local_err);
> > +return ret;
> > +} else if (ret >= 0) {
> > +return ret;
> > +}
> 
> What if the truncate does succeed? For example the current implementation of 
> raw_co_truncate,
> does return zero when you truncate to less that block device size 
> (and this is kind of wrong since you can't really change the block device 
> size)
> 
> Even more, I see is that in the later patch, you call this with offset == 0 
> which
> I think will always succeed on a raw block device, thus skipping the zeroing 
> code.
> 
> How about just doing the zeroing in the bdrv_create_file_fallback?
> 
> 
> Another idea:
> 
> blk_truncate_for_formatting would first truncate the file to 0, then
> check if the size of the file became zero in addition to the successful 
> return value.
> 
> If the file size became zero, truncate the file to the requested size - this 
> should make sure that file is empty.
> Otherwise, zero the first sector.
> 
> It might also be nice to add a check that if the size didn't became zero, 
> that it remained the same
> to avoid strange situations of semi broken truncate.
> 
> 
> Also I would rename the function to something like blk_raw_format_file,
> basically a function which tries its best to erase an existing file contents
> 
> 
> Yet another idea would to drop the lying in the raw_co_truncate (on block 
> devices), and fail always,
> unless asked to truncate to the exact file size, and let the callers deal 
> with that.
> Callers where it is not critical for the truncate to work can just ignore 
> this failure.
> That is probably hard to implement 
> 
> Or we can add a truncate 'mode' to .bdrv_co_truncate, which would let the 
> caller indicate its intention,
> that is if the caller must truncate to that size or it can accept truncate 
> ending up in bigger file that it asked for. 
> 
> As we once discussed on IRC, the fact that truncate on a block device 
> 'succeeds',
> despite not really beeing able to change the block device size, causes other 
> issues,
> like not beeing able to use preallocation=full when creating a qcow2 image on 
> a block device.
> 
> Best regards,
>   Maxim Levitsky
> 
> > +
> > +current_size = blk_getlength(blk);
> 

[Qemu-block] [PATCH v5] LUKS: support preallocation

2019-07-16 Thread Maxim Levitsky
preallocation=off and preallocation=metadata
both allocate luks header only, and preallocation=falloc/full
is passed to underlying file.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951

Signed-off-by: Maxim Levitsky 
---

This is hopefully a revision without coding style violations.

Note that I still haven't tested the blockdev-create code, other than
compile testing it.

Best regards,
    Maxim Levitsky


 block/crypto.c   | 30 +++---
 qapi/block-core.json |  6 +-
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/block/crypto.c b/block/crypto.c
index 8237424ae6..7eb698774e 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -74,6 +74,7 @@ static ssize_t block_crypto_read_func(QCryptoBlock *block,
 struct BlockCryptoCreateData {
 BlockBackend *blk;
 uint64_t size;
+PreallocMode prealloc;
 };
 
 
@@ -112,7 +113,7 @@ static ssize_t block_crypto_init_func(QCryptoBlock *block,
  * available to the guest, so we must take account of that
  * which will be used by the crypto header
  */
-return blk_truncate(data->blk, data->size + headerlen, PREALLOC_MODE_OFF,
+return blk_truncate(data->blk, data->size + headerlen, data->prealloc,
 errp);
 }
 
@@ -251,6 +252,7 @@ static int block_crypto_open_generic(QCryptoBlockFormat 
format,
 static int block_crypto_co_create_generic(BlockDriverState *bs,
   int64_t size,
   QCryptoBlockCreateOptions *opts,
+  PreallocMode prealloc,
   Error **errp)
 {
 int ret;
@@ -266,9 +268,14 @@ static int block_crypto_co_create_generic(BlockDriverState 
*bs,
 goto cleanup;
 }
 
+if (prealloc == PREALLOC_MODE_METADATA) {
+prealloc = PREALLOC_MODE_OFF;
+}
+
 data = (struct BlockCryptoCreateData) {
 .blk = blk,
 .size = size,
+.prealloc = prealloc,
 };
 
 crypto = qcrypto_block_create(opts, NULL,
@@ -500,6 +507,7 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
*create_options, Error **errp)
 BlockdevCreateOptionsLUKS *luks_opts;
 BlockDriverState *bs = NULL;
 QCryptoBlockCreateOptions create_opts;
+PreallocMode preallocation = PREALLOC_MODE_OFF;
 int ret;
 
 assert(create_options->driver == BLOCKDEV_DRIVER_LUKS);
@@ -515,8 +523,12 @@ block_crypto_co_create_luks(BlockdevCreateOptions 
*create_options, Error **errp)
 .u.luks = *qapi_BlockdevCreateOptionsLUKS_base(luks_opts),
 };
 
+if (luks_opts->has_preallocation) {
+preallocation = luks_opts->preallocation;
+}
+
 ret = block_crypto_co_create_generic(bs, luks_opts->size, &create_opts,
- errp);
+ preallocation, errp);
 if (ret < 0) {
 goto fail;
 }
@@ -534,12 +546,24 @@ static int coroutine_fn 
block_crypto_co_create_opts_luks(const char *filename,
 QCryptoBlockCreateOptions *create_opts = NULL;
 BlockDriverState *bs = NULL;
 QDict *cryptoopts;
+PreallocMode prealloc;
+char *buf = NULL;
 int64_t size;
 int ret;
+Error *local_err = NULL;
 
 /* Parse options */
 size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
 
+buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
+prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
+   PREALLOC_MODE_OFF, &local_err);
+g_free(buf);
+if (local_err) {
+error_propagate(errp, local_err);
+return -EINVAL;
+}
+
 cryptoopts = qemu_opts_to_qdict_filtered(opts, NULL,
  &block_crypto_create_opts_luks,
  true);
@@ -565,7 +589,7 @@ static int coroutine_fn 
block_crypto_co_create_opts_luks(const char *filename,
 }
 
 /* Create format layer */
-ret = block_crypto_co_create_generic(bs, size, create_opts, errp);
+ret = block_crypto_co_create_generic(bs, size, create_opts, prealloc, 
errp);
 if (ret < 0) {
 goto fail;
 }
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 0d43d4f37c..3840c99cbe 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4205,13 +4205,17 @@
 #
 # @file Node to create the image format on
 # @size Size of the virtual disk in bytes
+# @preallocation   Preallocation mode for the new image
+#   (since: 4.2)
+#   (default: off; allowed values: off, metadata, falloc, full)
 #
 # Since: 2.12
 ##
 { 'struct': 'BlockdevCreateOptionsLUKS',
   'base': 'QCryptoBlockCreateOptionsLUKS',
   'data': { 'file': 'BlockdevRef',
-'size': 'size' } }
+'size': 'size',
+'*preallocation':   'PreallocMode' } }
 
 ##
 # @BlockdevCreateOptionsNfs:
-- 
2.17.2




[Qemu-block] [H v5 0/3] Few bugfixes for userspace nvme driver

2019-07-16 Thread Maxim Levitsky
This is a reduced version of the patch series for the userspace nvme driver,
which only includes the bugfixes I made.

Best regards,
Maxim Levitsky

Maxim Levitsky (3):
  block/nvme: fix doorbell stride
  block/nvme: support larger that 512 bytes sector devices
  block/nvme: don't touch the completion entries

 block/nvme.c | 52 ++--
 1 file changed, 42 insertions(+), 10 deletions(-)

-- 
2.17.2




[Qemu-block] [PATCH v4 0/3] Few bugfixes for userspace nvme driver

2019-07-16 Thread Maxim Levitsky
This is a reduced version of the patch series for the userspace nvme driver,
which only includes the bugfixes I made.

Best regards,
Maxim Levitsky

Maxim Levitsky (3):
  block/nvme: fix doorbell stride
  block/nvme: support larger that 512 bytes sector devices
  block/nvme: don't touch the completion entries

 block/nvme.c | 52 ++--
 1 file changed, 42 insertions(+), 10 deletions(-)

-- 
2.17.2




[Qemu-block] [PATCH v4 2/3] block/nvme: support larger that 512 bytes sector devices

2019-07-16 Thread Maxim Levitsky
Currently the driver hardcodes the sector size to 512,
and doesn't check the underlying device. Fix that.

Also fail if the underlying nvme device is formatted with metadata,
as this needs special support.

Signed-off-by: Maxim Levitsky 
---
 block/nvme.c | 45 -
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 82fdefccd6..35ce10dc79 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -102,8 +102,11 @@ typedef struct {
 size_t doorbell_scale;
 bool write_cache_supported;
 EventNotifier irq_notifier;
+
 uint64_t nsze; /* Namespace size reported by identify command */
 int nsid;  /* The namespace id to read/write data. */
+size_t blkshift;
+
 uint64_t max_transfer;
 bool plugged;
 
@@ -418,8 +421,9 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 BDRVNVMeState *s = bs->opaque;
 NvmeIdCtrl *idctrl;
 NvmeIdNs *idns;
+NvmeLBAF *lbaf;
 uint8_t *resp;
-int r;
+int r, hwsect_size;
 uint64_t iova;
 NvmeCmd cmd = {
 .opcode = NVME_ADM_CMD_IDENTIFY,
@@ -466,7 +470,22 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 }
 
 s->nsze = le64_to_cpu(idns->nsze);
+lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
+
+if (lbaf->ms) {
+error_setg(errp, "Namespaces with metadata are not yet supported");
+goto out;
+}
+
+hwsect_size = 1 << lbaf->ds;
+
+if (hwsect_size < BDRV_SECTOR_SIZE || hwsect_size > s->page_size) {
+error_setg(errp, "Namespace has unsupported block size (%d)",
+hwsect_size);
+goto out;
+}
 
+s->blkshift = lbaf->ds;
 out:
 qemu_vfio_dma_unmap(s->vfio, resp);
 qemu_vfree(resp);
@@ -785,8 +804,22 @@ fail:
 static int64_t nvme_getlength(BlockDriverState *bs)
 {
 BDRVNVMeState *s = bs->opaque;
+return s->nsze << s->blkshift;
+}
 
-return s->nsze << BDRV_SECTOR_BITS;
+static int64_t nvme_get_blocksize(BlockDriverState *bs)
+{
+BDRVNVMeState *s = bs->opaque;
+assert(s->blkshift >= BDRV_SECTOR_BITS);
+return 1 << s->blkshift;
+}
+
+static int nvme_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
+{
+int64_t blocksize = nvme_get_blocksize(bs);
+bsz->phys = blocksize;
+bsz->log = blocksize;
+return 0;
 }
 
 /* Called with s->dma_map_lock */
@@ -917,13 +950,14 @@ static coroutine_fn int 
nvme_co_prw_aligned(BlockDriverState *bs,
 BDRVNVMeState *s = bs->opaque;
 NVMeQueuePair *ioq = s->queues[1];
 NVMeRequest *req;
-uint32_t cdw12 = (((bytes >> BDRV_SECTOR_BITS) - 1) & 0xFFFF) |
+
+uint32_t cdw12 = (((bytes >> s->blkshift) - 1) & 0xFFFF) |
(flags & BDRV_REQ_FUA ? 1 << 30 : 0);
 NvmeCmd cmd = {
 .opcode = is_write ? NVME_CMD_WRITE : NVME_CMD_READ,
 .nsid = cpu_to_le32(s->nsid),
-.cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0xFFFFFFFF),
-.cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) & 0xFFFFFFFF),
+.cdw10 = cpu_to_le32((offset >> s->blkshift) & 0xFFFFFFFF),
+.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0xFFFFFFFF),
 .cdw12 = cpu_to_le32(cdw12),
 };
 NVMeCoData data = {
@@ -1154,6 +1188,7 @@ static BlockDriver bdrv_nvme = {
 .bdrv_file_open   = nvme_file_open,
 .bdrv_close   = nvme_close,
 .bdrv_getlength   = nvme_getlength,
+.bdrv_probe_blocksizes= nvme_probe_blocksizes,
 
 .bdrv_co_preadv   = nvme_co_preadv,
 .bdrv_co_pwritev  = nvme_co_pwritev,
-- 
2.17.2




[Qemu-block] [PATCH v4 1/3] block/nvme: fix doorbell stride

2019-07-16 Thread Maxim Levitsky
Fix the math involving a non-standard doorbell stride.
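As an aside, here is a minimal sketch (not the patch itself) of the
doorbell layout this fixes; the helper names are made up for
illustration and "scale" stands for the driver's doorbell_scale, i.e.
the stride expressed in 32-bit register slots:

#include <stddef.h>

/*
 * Per the NVMe spec, doorbells start at BAR offset 0x1000 and each one
 * occupies (4 << CAP.DSTRD) bytes.  Viewed as an array of 32-bit
 * registers, queue pair "idx" therefore uses these slots:
 */
static inline size_t sq_tail_doorbell_index(unsigned idx, unsigned scale)
{
    return (idx * 2) * scale;        /* submission queue tail doorbell */
}

static inline size_t cq_head_doorbell_index(unsigned idx, unsigned scale)
{
    return (idx * 2 + 1) * scale;    /* completion queue head doorbell */
}

The old expression, idx * 2 * scale + 1, only matches this when
scale == 1 (i.e. CAP.DSTRD == 0); with any larger stride it points one
dword into the SQ tail doorbell instead of at the CQ head doorbell.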

Signed-off-by: Maxim Levitsky 
Reviewed-by: Max Reitz 
---
 block/nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index 9896b7f7c6..82fdefccd6 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -217,7 +217,7 @@ static NVMeQueuePair 
*nvme_create_queue_pair(BlockDriverState *bs,
 error_propagate(errp, local_err);
 goto fail;
 }
-q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
+q->cq.doorbell = &s->regs->doorbells[(idx * 2 + 1) * s->doorbell_scale];
 
 return q;
 fail:
-- 
2.17.2




[Qemu-block] [PATCH v4 3/3] block/nvme: don't touch the completion entries

2019-07-16 Thread Maxim Levitsky
Completion entries are meant to be only read by the host and written by
the device.
The driver is supposed to scan the completions from the last point where
it left off, until it sees a completion whose phase bit has not yet been
flipped.
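For illustration only (this is a sketch of the protocol, not the exact
code in this patch), the host-side consumption loop looks roughly like
the following; NvmeCqe and le16_to_cpu are as in QEMU, while
handle_completion() is a made-up placeholder.  stale_phase starts at 0
because the freshly allocated ring is zero-filled and the controller
posts its first pass of completions with the phase bit set to 1:

static void consume_completions(NvmeCqe *cq, unsigned queue_size,
                                unsigned *head, int *stale_phase)
{
    while ((le16_to_cpu(cq[*head].status) & 0x1) != *stale_phase) {
        handle_completion(&cq[*head]);    /* read-only access to the entry */
        if (++*head == queue_size) {
            *head = 0;
            /* the controller inverts the phase on every pass of the ring */
            *stale_phase = !*stale_phase;
        }
    }
    /* afterwards, write *head to the CQ head doorbell */
}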


Signed-off-by: Maxim Levitsky 
Reviewed-by: Max Reitz 
---
 block/nvme.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 35ce10dc79..c28755cc31 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -318,7 +318,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
NVMeQueuePair *q)
 while (q->inflight) {
 int16_t cid;
 c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
-if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
+if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
 break;
 }
 q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
@@ -342,10 +342,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
NVMeQueuePair *q)
 qemu_mutex_unlock(&q->lock);
 req.cb(req.opaque, nvme_translate_error(c));
 qemu_mutex_lock(&q->lock);
-c->cid = cpu_to_le16(0);
 q->inflight--;
-/* Flip Phase Tag bit. */
-c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
 progress = true;
 }
 if (progress) {
-- 
2.17.2




Re: [Qemu-block] [H v5 0/3] Few bugfixes for userspace nvme driver

2019-07-16 Thread Maxim Levitsky
On Tue, 2019-07-16 at 19:29 +0300, Maxim Levitsky wrote:
> This is reduced version of patch series for userspace nvme driver,
> that only includes the bugfixes I made.
> 
> Best regards,
>   Maxim Levitsky
> 
> Maxim Levitsky (3):
>   block/nvme: fix doorbell stride
>   block/nvme: support larger that 512 bytes sector devices
>   block/nvme: don't touch the completion entries
> 
>  block/nvme.c | 52 ++--
>  1 file changed, 42 insertions(+), 10 deletions(-)
> 
Ignore that, I messed up the patch prefix here.
Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v4 0/3] Few bugfixes for userspace nvme driver

2019-07-21 Thread Maxim Levitsky
On Fri, 2019-07-19 at 11:51 +0200, Max Reitz wrote:
> On 16.07.19 18:30, Maxim Levitsky wrote:
> > This is reduced version of patch series for userspace nvme driver,
> > that only includes the bugfixes I made.
> > 
> > Best regards,
> >     Maxim Levitsky
> > 
> > Maxim Levitsky (3):
> >   block/nvme: fix doorbell stride
> >   block/nvme: support larger that 512 bytes sector devices
> >   block/nvme: don't touch the completion entries
> > 
> >  block/nvme.c | 52 ++--
> >  1 file changed, 42 insertions(+), 10 deletions(-)
> 
> Thanks, applied to my block branch:
> 
> https://git.xanclic.moe/XanClic/qemu/commits/branch/block
> 
> Max
> 
Thank you!!
Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v5] LUKS: support preallocation

2019-07-21 Thread Maxim Levitsky
On Fri, 2019-07-19 at 12:28 +0200, Max Reitz wrote:
> On 16.07.19 18:19, Maxim Levitsky wrote:
> > preallocation=off and preallocation=metadata
> > both allocate luks header only, and preallocation=falloc/full
> > is passed to underlying file.
> > 
> > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534951
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> > 
> > This is hopefully a revision without coding style violations.
> > 
> > Note that I still haven't tested the blockdev-create code, other that
> > compile testing it.
> > 
> > Best regards,
> > Maxim Levitsky
> > 
> > 
> >  block/crypto.c   | 30 +++---
> >  qapi/block-core.json |  6 +-
> >  2 files changed, 32 insertions(+), 4 deletions(-)
> 
> Thanks, applied to my block-next branch for 4.2:
> 
> https://git.xanclic.moe/XanClic/qemu/commits/branch/block-next
> 
> Max
> 
> (The Patchew warning doesn’t look like it’s caused by this patch.)
> 

Thank you!!
Best regards,
Maxim Levitsky




[Qemu-block] [PATCH 1/2] LUKS: better error message when creating too large files

2019-07-21 Thread Maxim Levitsky
Currently if you attempt to create a too large file with luks you
get the following error message:

Formatting 'test.luks', fmt=luks size=17592186044416 key-secret=sec0
qemu-img: test.luks: Could not resize file: File too large

While for raw format the error message is
qemu-img: test.img: The image size is too large for file format 'raw'


The reason for this is that qemu-img checks the errno of the failure,
and presents the latter error when it is -EFBIG.

However, the generic crypto code 'swallows' the errno and replaces it
with -EIO.

As an attempt to make it better, we can make the luks driver
detect -EFBIG and in this case present a better error message,
which is what this patch does.

The new error message is:

qemu-img: error creating test.luks: The requested file size is too large

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534898
Signed-off-by: Maxim Levitsky 
---
 block/crypto.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/block/crypto.c b/block/crypto.c
index 8237424ae6..73b1013fa1 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -102,18 +102,35 @@ static ssize_t block_crypto_init_func(QCryptoBlock *block,
   Error **errp)
 {
 struct BlockCryptoCreateData *data = opaque;
+Error *local_error = NULL;
+int ret;
 
 if (data->size > INT64_MAX || headerlen > INT64_MAX - data->size) {
-error_setg(errp, "The requested file size is too large");
-return -EFBIG;
+ret = -EFBIG;
+goto error;
 }
 
 /* User provided size should reflect amount of space made
  * available to the guest, so we must take account of that
  * which will be used by the crypto header
  */
-return blk_truncate(data->blk, data->size + headerlen, PREALLOC_MODE_OFF,
-errp);
+ret = blk_truncate(data->blk, data->size + headerlen, PREALLOC_MODE_OFF,
+   &local_error);
+
+if (ret >= 0) {
+return ret;
+}
+
+error:
+if (ret == -EFBIG) {
+/* Replace the error message with a better one */
+error_free(local_error);
+error_setg(errp, "The requested file size is too large");
+} else {
+error_propagate(errp, local_error);
+}
+
+return ret;
 }
 
 
-- 
2.17.2




[Qemu-block] [PATCH 2/2] qemu-img: better error message when opening a backing file fails

2019-07-21 Thread Maxim Levitsky
Currently we print a message like this:

"
new_file.qcow2 : error message
"

However the error could have come from opening the backing file (e.g. when
it is missing encryption keys), thus try to clarify this by using this
format:

"
qemu-img: error creating new_file.qcow2: base_file.qcow2: error message
Could not open backing image to determine size.
"


Test used:

qemu-img create -f qcow2 \
--object secret,id=sec0,data=hunter9 \
--object secret,id=sec1,data=my_new_secret_password \
-b 'json:{ "encrypt.key-secret": "sec1", "driver": "qcow2", "file": { 
"driver": "file", "filename": "base.qcow2" }}' \
-o encrypt.format=luks,encrypt.key-secret=sec1 \
sn.qcow2


Error message before:

qemu-img: sn.qcow2: Invalid password, cannot unlock any keyslot
Could not open backing image to determine size.


Error message after:

qemu-img: error creating sn.qcow2: \
json:{ "encrypt.key-secret": "sec1", "driver": "qcow2", "file": { 
"driver": "file", "filename": "base.qcow2" }}: \
Invalid password, cannot unlock any keyslot
Could not open backing image to determine size.

Signed-off-by: Maxim Levitsky 
---
 block.c| 1 +
 qemu-img.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 29e931e217..5eb47b2199 100644
--- a/block.c
+++ b/block.c
@@ -5790,6 +5790,7 @@ void bdrv_img_create(const char *filename, const char 
*fmt,
 "This may become an error in future versions.\n");
 local_err = NULL;
 } else if (!bs) {
+error_prepend(&local_err, "%s: ", backing_file);
 /* Couldn't open bs, do not have size */
 error_append_hint(&local_err,
   "Could not open backing image to determine 
size.\n");
diff --git a/qemu-img.c b/qemu-img.c
index 79983772de..134bf2fbe0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -545,7 +545,7 @@ static int img_create(int argc, char **argv)
 bdrv_img_create(filename, fmt, base_filename, base_fmt,
 options, img_size, flags, quiet, &local_err);
 if (local_err) {
-error_reportf_err(local_err, "%s: ", filename);
+error_reportf_err(local_err, "error creating %s: ", filename);
 goto fail;
 }
 
-- 
2.17.2




[Qemu-block] [PATCH 0/2] RFC: Trivial error message fixes for luks format

2019-07-21 Thread Maxim Levitsky
These are attempts to improve the error messages a bit,
based on a bunch of luks-related bugzillas assigned to me.
Feel free to reject these if you think they don't
make the messages better.

Best regards,
    Maxim Levitsky

Maxim Levitsky (2):
  LUKS: better error message when creating too large files
  qemu-img: better error message when opening a backing file fails

 block.c|  1 +
 block/crypto.c | 25 +
 qemu-img.c |  2 +-
 3 files changed, 23 insertions(+), 5 deletions(-)

-- 
2.17.2




Re: [Qemu-block] [PATCH 2/2] qemu-img: better error message when opening a backing file fails

2019-07-22 Thread Maxim Levitsky
On Mon, 2019-07-22 at 11:41 +0200, Kevin Wolf wrote:
> Am 21.07.2019 um 20:15 hat Maxim Levitsky geschrieben:
> > Currently we print message like that:
> > 
> > "
> > new_file.qcow2 : error message
> > "
> > 
> > However the error could have come from opening the backing file (e.g when 
> > it missing encryption keys),
> > thus try to clarify this by using this format:
> > 
> > "
> > qemu-img: error creating new_file.qcow2: base_file.qcow2: error message
> > Could not open backing image to determine size.
> > "
> 
> The old error message was just unspecific. Your new error message can be
> actively misleading because you just unconditionally print the filename
> of the direct backing file, even though the error could have occurred
> while opening the backing file of the backing file (or even further down
> the backing chain).
> 
> It's a common problem we have with backing files and error messages: We
> either don't print the filename where the error actually happened (like
> in this case), or we print all of the backing files in the chain (such
> as "Could not open top.qcow2: Could not open mid.qcow2: Could not open
> base.qcow2: Invalid something").
> 
> Ideally, we'd find a way to print only the backing filename in such
> cases ("Could not open base.qcow2: Invalid something"). I'd gladly
> accept a patch that fixes error messages in this way for both open and
> create, but I'm afraid that your approach in this patch is too
> simplistic and not an improvement

You raise a very good point, I didn't think about this.
Thanks,

Best regards,
Maxim Levitsky





Re: [Qemu-block] [Qemu-trivial] [Qemu-devel] [PATCH 2/2] qemu-img: better error message when opening a backing file fails

2019-07-22 Thread Maxim Levitsky
On Mon, 2019-07-22 at 10:15 +0100, Daniel P. Berrangé wrote:
> On Sun, Jul 21, 2019 at 09:15:08PM +0300, Maxim Levitsky wrote:
> > Currently we print message like that:
> > 
> > "
> > new_file.qcow2 : error message
> > "
> > 
> > However the error could have come from opening the backing file (e.g when 
> > it missing encryption keys),
> > thus try to clarify this by using this format:
> > 
> > "
> > qemu-img: error creating new_file.qcow2: base_file.qcow2: error message
> > Could not open backing image to determine size.
> > "
> > 
> > 
> > Test used:
> > 
> > qemu-img create -f qcow2 \
> > --object secret,id=sec0,data=hunter9 \
> > --object secret,id=sec1,data=my_new_secret_password \
> > -b 'json:{ "encrypt.key-secret": "sec1", "driver": "qcow2", "file": 
> > { "driver": "file", "filename": "base.qcow2" }}' \
> > -o encrypt.format=luks,encrypt.key-secret=sec1 \
> > sn.qcow2
> > 
> > 
> > Error message before:
> > 
> > qemu-img: sn.qcow2: Invalid password, cannot unlock any keyslot
> > Could not open backing image to determine size.
> > 
> > 
> > Error message after:
> > 
> > qemu-img: error creating sn.qcow2: \
> > json:{ "encrypt.key-secret": "sec1", "driver": "qcow2", "file": { 
> > "driver": "file", "filename": "base.qcow2" }}: \
> > Invalid password, cannot unlock any keyslot
> > Could not open backing image to determine size.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block.c| 1 +
> >  qemu-img.c | 2 +-
> >  2 files changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block.c b/block.c
> > index 29e931e217..5eb47b2199 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -5790,6 +5790,7 @@ void bdrv_img_create(const char *filename, const char 
> > *fmt,
> >  "This may become an error in future 
> > versions.\n");
> >  local_err = NULL;
> >  } else if (!bs) {
> > +error_prepend(&local_err, "%s: ", backing_file);
> >  /* Couldn't open bs, do not have size */
> >  error_append_hint(&local_err,
> >"Could not open backing image to determine 
> > size.\n");
> 
> I think it'd be better todo
> 
>   error_append_hint(&local_err,
> "Could not open backing image '%s' to 
> determine size.\n",
>  backing_file);
> 
> At least when backing_file isn't a horrible blob of JSON, the error
> message is easier to read this way IMHO.
I agree, but I guess I need to drop this patch because of possible nesting of 
the backing files,
as Kevin Wolf pointed out.

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v3] block/rbd: add preallocation support

2019-07-23 Thread Maxim Levitsky
ame_zero_supported(&s->cluster);
>  
>  /* rbd_open is always r/w */
>  r = rbd_open(s->io_ctx, s->image_name, &s->image, s->snap);
> @@ -1089,21 +1263,16 @@ static int coroutine_fn 
> qemu_rbd_co_truncate(BlockDriverState *bs,
>   PreallocMode prealloc,
>   Error **errp)
>  {
> -int r;
> -
> -if (prealloc != PREALLOC_MODE_OFF) {
> -error_setg(errp, "Unsupported preallocation mode '%s'",
> -   PreallocMode_str(prealloc));
> -return -ENOTSUP;
> -}
> +BDRVRBDState *s = bs->opaque;
> +int ret;
>  
> -r = qemu_rbd_resize(bs, offset);
> -if (r < 0) {
> -error_setg_errno(errp, -r, "Failed to resize file");
> -return r;
> +ret = qemu_rbd_do_truncate(s->cluster, s->image, offset, prealloc,
> +   s->ws_zero_supported, errp);
> +if (ret == 0) {
> +s->image_size = offset;
>  }
>  
> -return 0;
> +return ret;
>  }
>  
>  static int qemu_rbd_snap_create(BlockDriverState *bs,
> @@ -1256,6 +1425,11 @@ static QemuOptsList qemu_rbd_create_opts = {
>      .type = QEMU_OPT_SIZE,
>  .help = "RBD object size"
>  },
> +{
> +.name = BLOCK_OPT_PREALLOC,
> +.type = QEMU_OPT_STRING,
> +.help = "Preallocation mode (allowed values: off, full)"
> +},
>  {
>  .name = "password-secret",
>  .type = QEMU_OPT_STRING,
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 0d43d4f37c..ff55171f8d 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -4346,13 +4346,16 @@
>  #   point to a snapshot.
>  # @size Size of the virtual disk in bytes
>  # @cluster-size RBD object size
> +# @preallocation   Preallocation mode for the new image (since: 4.2)
> +#   (default: off; allowed values: off, full)
>  #
>  # Since: 2.12
>  ##
>  { 'struct': 'BlockdevCreateOptionsRbd',
>'data': { 'location': 'BlockdevOptionsRbd',
>  'size': 'size',
> -'*cluster-size' :   'size' } }
> +'*cluster-size' :   'size',
> +'*preallocation':   'PreallocMode' } }
>  
>  ##
>  # @BlockdevVmdkSubformat:

I think I don't see anything obviously wrong, but note that I don't know ceph 
yet,
thus I might have missed something.

So:
Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky





Re: [Qemu-block] [Qemu-devel] [PATCH v2 01/11] qemu-img: Fix bdrv_has_zero_init() use in convert

2019-07-25 Thread Maxim Levitsky
On Wed, 2019-07-24 at 19:12 +0200, Max Reitz wrote:
> bdrv_has_zero_init() only has meaning for newly created images or image
> areas.  If qemu-img convert did not create the image itself, it cannot
> rely on bdrv_has_zero_init()'s result to carry any meaning.
> 
> Signed-off-by: Max Reitz 
> ---
>  qemu-img.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index 79983772de..0f4be80c10 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1578,6 +1578,7 @@ typedef struct ImgConvertState {
>  bool has_zero_init;
>  bool compressed;
>  bool unallocated_blocks_are_zero;
> +bool target_is_new;
>  bool target_has_backing;
>  int64_t target_backing_sectors; /* negative if unknown */
>  bool wr_in_order;
> @@ -1975,9 +1976,11 @@ static int convert_do_copy(ImgConvertState *s)
>  int64_t sector_num = 0;
>  
>  /* Check whether we have zero initialisation or can get it efficiently */
> -s->has_zero_init = s->min_sparse && !s->target_has_backing
> - ? bdrv_has_zero_init(blk_bs(s->target))
> - : false;
> +if (s->target_is_new && s->min_sparse && !s->target_has_backing) {
> +s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
> +} else {
> +s->has_zero_init = false;
> +}
>  
>  if (!s->has_zero_init && !s->target_has_backing &&
>  bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)))
> @@ -2423,6 +2426,8 @@ static int img_convert(int argc, char **argv)
>  }
>  }
>  
> +s.target_is_new = !skip_create;
> +
>  flags = s.min_sparse ? (BDRV_O_RDWR | BDRV_O_UNMAP) : BDRV_O_RDWR;
>  ret = bdrv_parse_cache_mode(cache, &flags, &writethrough);
>  if (ret < 0) {


Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky





Re: [Qemu-block] [Qemu-devel] [PATCH v2 02/11] mirror: Fix bdrv_has_zero_init() use

2019-07-25 Thread Maxim Levitsky
   has_buf_size, buf_size,
> has_on_source_error, on_source_error,
> diff --git a/tests/test-block-iothread.c b/tests/test-block-iothread.c
> index 1949d5e61a..debfb69bfb 100644
> --- a/tests/test-block-iothread.c
> +++ b/tests/test-block-iothread.c
> @@ -611,7 +611,7 @@ static void test_propagate_mirror(void)
>  
>  /* Start a mirror job */
>  mirror_start("job0", src, target, NULL, JOB_DEFAULT, 0, 0, 0,
> - MIRROR_SYNC_MODE_NONE, MIRROR_OPEN_BACKING_CHAIN,
> + MIRROR_SYNC_MODE_NONE, MIRROR_OPEN_BACKING_CHAIN, false,
>   BLOCKDEV_ON_ERROR_REPORT, BLOCKDEV_ON_ERROR_REPORT,
>   false, "filter_node", MIRROR_COPY_MODE_BACKGROUND,
>   &error_abort);


From my limited understanding of this code, it looks ok to me.

Still, to be very sure, I suggest checking that nobody relies on target
zeroing when not in full sync mode, to avoid breaking the users.

For example, the QMP reference states that MIRROR_SYNC_MODE_TOP copies data
in the topmost image to the destination.
If there is only the topmost image, I could imagine the caller assuming that
the target is identical to the source.

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH v2 03/11] block: Add bdrv_has_zero_init_truncate()

2019-07-25 Thread Maxim Levitsky
On Wed, 2019-07-24 at 19:12 +0200, Max Reitz wrote:
> No .bdrv_has_zero_init() implementation returns 1 if growing the file
> would add non-zero areas (at least with PREALLOC_MODE_OFF), so using it
> in lieu of this new function was always safe.
> 
> But on the other hand, it is possible that growing an image that is not
> zero-initialized would still add a zero-initialized area, like when
> using nonpreallocating truncation on a preallocated image.  For callers
> that care only about truncation, not about creation with potential
> preallocation, this new function is useful.
> 
> Alternatively, we could have added a PreallocMode parameter to
> bdrv_has_zero_init().  But the only user would have been qemu-img
> convert, which does not have a plain PreallocMode value right now -- it
> would have to parse the creation option to obtain it.  Therefore, the
> simpler solution is to let bdrv_has_zero_init() inquire the
> preallocation status and add the new bdrv_has_zero_init_truncate() that
> presupposes PREALLOC_MODE_OFF.
> 
> Signed-off-by: Max Reitz 
> ---
>  include/block/block.h |  1 +
>  include/block/block_int.h |  7 +++
>  block.c   | 21 +
>  3 files changed, 29 insertions(+)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index 50a07c1c33..5321d8afdf 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -438,6 +438,7 @@ int bdrv_pdiscard(BdrvChild *child, int64_t offset, 
> int64_t bytes);
>  int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
>  int bdrv_has_zero_init_1(BlockDriverState *bs);
>  int bdrv_has_zero_init(BlockDriverState *bs);
> +int bdrv_has_zero_init_truncate(BlockDriverState *bs);
>  bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs);
>  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
>  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 6a0b1b5008..d7fc6b296b 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -420,9 +420,16 @@ struct BlockDriver {
>  /*
>   * Returns 1 if newly created images are guaranteed to contain only
>   * zeros, 0 otherwise.
> + * Must return 0 if .bdrv_has_zero_init_truncate() returns 0.
>   */
>  int (*bdrv_has_zero_init)(BlockDriverState *bs);
>  
> +/*
> + * Returns 1 if new areas added by growing the image with
> + * PREALLOC_MODE_OFF contain only zeros, 0 otherwise.
> + */
> +int (*bdrv_has_zero_init_truncate)(BlockDriverState *bs);
> +
>  /* Remove fd handlers, timers, and other event loop callbacks so the 
> event
>   * loop is no longer in use.  Called with no in-flight requests and in
>   * depth-first traversal order with parents before child nodes.
> diff --git a/block.c b/block.c
> index cbd8da5f3b..81ae44dcf3 100644
> --- a/block.c
> +++ b/block.c
> @@ -5066,6 +5066,27 @@ int bdrv_has_zero_init(BlockDriverState *bs)
>  return 0;
>  }
>  
> +int bdrv_has_zero_init_truncate(BlockDriverState *bs)
> +{
> +if (!bs->drv) {
> +return 0;
> +}
> +
> +if (bs->backing) {
> +/* Depends on the backing image length, but better safe than sorry */
> +return 0;
> +}
> +if (bs->drv->bdrv_has_zero_init_truncate) {
> +return bs->drv->bdrv_has_zero_init_truncate(bs);
> +}
> +if (bs->file && bs->drv->is_filter) {
> +return bdrv_has_zero_init_truncate(bs->file->bs);
> +}
> +
> +    /* safe default */
> +return 0;
> +}
> +
>  bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs)
>  {
>  BlockDriverInfo bdi;


This looks like a very correct change, even for the sake
of clarifying the scope of bdrv_has_zero_init

Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH v2 05/11] block: Use bdrv_has_zero_init_truncate()

2019-07-25 Thread Maxim Levitsky
On Wed, 2019-07-24 at 19:12 +0200, Max Reitz wrote:
> Signed-off-by: Max Reitz 
> ---
>  block/parallels.c | 2 +-
>  block/vhdx.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/parallels.c b/block/parallels.c
> index 00fae125d1..7cd2714b69 100644
> --- a/block/parallels.c
> +++ b/block/parallels.c
> @@ -835,7 +835,7 @@ static int parallels_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  goto fail_options;
>  }
>  
> -if (!bdrv_has_zero_init(bs->file->bs)) {
> +if (!bdrv_has_zero_init_truncate(bs->file->bs)) {
>  s->prealloc_mode = PRL_PREALLOC_MODE_FALLOCATE;
>  }
>  
> diff --git a/block/vhdx.c b/block/vhdx.c
> index d6070b6fa8..a02d1c99a7 100644
> --- a/block/vhdx.c
> +++ b/block/vhdx.c
> @@ -1282,7 +1282,7 @@ static coroutine_fn int vhdx_co_writev(BlockDriverState 
> *bs, int64_t sector_num,
>  /* Queue another write of zero buffers if the underlying file
>   * does not zero-fill on file extension */
>  
> -if (bdrv_has_zero_init(bs->file->bs) == 0) {
> +if (bdrv_has_zero_init_truncate(bs->file->bs) == 0) {
>  use_zero_buffers = true;
>  
>  /* zero fill the front, if any */

Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH v2 06/11] qcow2: Fix .bdrv_has_zero_init()

2019-07-25 Thread Maxim Levitsky
On Wed, 2019-07-24 at 19:12 +0200, Max Reitz wrote:
> If a qcow2 file is preallocated, it can no longer guarantee that it
> initially appears as filled with zeroes.
> 
> So implement .bdrv_has_zero_init() by checking whether the file is
> preallocated; if so, forward the call to the underlying storage node,
> except for when it is encrypted: Encrypted preallocated images always
> return effectively random data, so .bdrv_has_zero_init() must always
> return 0 for them.
> 
> .bdrv_has_zero_init_truncate() can remain bdrv_has_zero_init_1(),
> because it presupposes PREALLOC_MODE_OFF.
> 
> Reported-by: Stefano Garzarella 
> Signed-off-by: Max Reitz 
> ---
>  block/qcow2.c | 29 -
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 5c40f54d64..b4e73aa443 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -4631,6 +4631,33 @@ static ImageInfoSpecific 
> *qcow2_get_specific_info(BlockDriverState *bs,
>  return spec_info;
>  }
>  
> +static int qcow2_has_zero_init(BlockDriverState *bs)
> +{
> +BDRVQcow2State *s = bs->opaque;
> +bool preallocated;
> +
> +if (qemu_in_coroutine()) {
> +qemu_co_mutex_lock(&s->lock);
> +}
> +/*
> + * Check preallocation status: Preallocated images have all L2
> + * tables allocated, nonpreallocated images have none.  It is
> + * therefore enough to check the first one.
> + */
> +preallocated = s->l1_size > 0 && s->l1_table[0] != 0;
> +if (qemu_in_coroutine()) {
> +qemu_co_mutex_unlock(&s->lock);
> +}
> +
> +if (!preallocated) {
> +return 1;
> +} else if (bs->encrypted) {
> +return 0;
> +} else {
> +return bdrv_has_zero_init(s->data_file->bs);
> +}
> +}
> +
>  static int qcow2_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
>int64_t pos)
>  {
> @@ -5186,7 +5213,7 @@ BlockDriver bdrv_qcow2 = {
>  .bdrv_child_perm  = bdrv_format_default_perms,
>  .bdrv_co_create_opts  = qcow2_co_create_opts,
>  .bdrv_co_create   = qcow2_co_create,
> -.bdrv_has_zero_init = bdrv_has_zero_init_1,
> +    .bdrv_has_zero_init   = qcow2_has_zero_init,
>  .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
>  .bdrv_co_block_status = qcow2_co_block_status,
>  


Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky







Re: [Qemu-block] [Qemu-devel] [PATCH v2 04/11] block: Implement .bdrv_has_zero_init_truncate()

2019-07-25 Thread Maxim Levitsky
v,
> diff --git a/block/qed.c b/block/qed.c
> index 77c7cef175..daaedb6864 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -1668,6 +1668,7 @@ static BlockDriver bdrv_qed = {
>  .bdrv_co_create   = bdrv_qed_co_create,
>  .bdrv_co_create_opts  = bdrv_qed_co_create_opts,
>  .bdrv_has_zero_init   = bdrv_has_zero_init_1,
> +.bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
>  .bdrv_co_block_status = bdrv_qed_co_block_status,
>  .bdrv_co_readv= bdrv_qed_co_readv,
>  .bdrv_co_writev   = bdrv_qed_co_writev,
> diff --git a/block/raw-format.c b/block/raw-format.c
> index bffd424dd0..42c28cc29a 100644
> --- a/block/raw-format.c
> +++ b/block/raw-format.c
> @@ -413,6 +413,11 @@ static int raw_has_zero_init(BlockDriverState *bs)
>  return bdrv_has_zero_init(bs->file->bs);
>  }
>  
> +static int raw_has_zero_init_truncate(BlockDriverState *bs)
> +{
> +return bdrv_has_zero_init_truncate(bs->file->bs);
> +}
> +
>  static int coroutine_fn raw_co_create_opts(const char *filename, QemuOpts 
> *opts,
> Error **errp)
>  {
> @@ -572,6 +577,7 @@ BlockDriver bdrv_raw = {
>  .bdrv_co_ioctl= &raw_co_ioctl,
>  .create_opts  = &raw_create_opts,
>  .bdrv_has_zero_init   = &raw_has_zero_init,
> +.bdrv_has_zero_init_truncate = &raw_has_zero_init_truncate,
>  .strong_runtime_opts  = raw_strong_runtime_opts,
>  .mutable_opts = mutable_opts,
>  };
> diff --git a/block/rbd.c b/block/rbd.c
> index 59757b3120..057af43d48 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1288,6 +1288,7 @@ static BlockDriver bdrv_rbd = {
>  .bdrv_co_create = qemu_rbd_co_create,
>  .bdrv_co_create_opts= qemu_rbd_co_create_opts,
>  .bdrv_has_zero_init = bdrv_has_zero_init_1,
> +.bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
>  .bdrv_get_info  = qemu_rbd_getinfo,
>  .create_opts= &qemu_rbd_create_opts,
>  .bdrv_getlength = qemu_rbd_getlength,
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index 6f402e5d4d..a4e111f981 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -3228,6 +3228,7 @@ static BlockDriver bdrv_sheepdog = {
>  .bdrv_co_create   = sd_co_create,
>  .bdrv_co_create_opts  = sd_co_create_opts,
>  .bdrv_has_zero_init   = bdrv_has_zero_init_1,
> +.bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
>  .bdrv_getlength   = sd_getlength,
>  .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
>  .bdrv_co_truncate = sd_co_truncate,
> diff --git a/block/ssh.c b/block/ssh.c
> index 501933b855..84d01e892b 100644
> --- a/block/ssh.c
> +++ b/block/ssh.c
> @@ -1390,6 +1390,7 @@ static BlockDriver bdrv_ssh = {
>  .bdrv_co_create_opts  = ssh_co_create_opts,
>  .bdrv_close   = ssh_close,
>  .bdrv_has_zero_init   = ssh_has_zero_init,
> +.bdrv_has_zero_init_truncate  = ssh_has_zero_init,
>  .bdrv_co_readv= ssh_co_readv,
>  .bdrv_co_writev   = ssh_co_writev,
>  .bdrv_getlength   = ssh_getlength,

Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH v2 07/11] vdi: Fix .bdrv_has_zero_init()

2019-07-25 Thread Maxim Levitsky
On Wed, 2019-07-24 at 19:12 +0200, Max Reitz wrote:
> Static VDI images cannot guarantee to be zero-initialized.  If the image
> has been statically allocated, forward the call to the underlying
> storage node.
> 
> Reported-by: Stefano Garzarella 
> Signed-off-by: Max Reitz 
> Reviewed-by: Stefan Weil 
> Acked-by: Stefano Garzarella 
> Tested-by: Stefano Garzarella 
> ---
>  block/vdi.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/block/vdi.c b/block/vdi.c
> index b9845a4cbd..0caa3f281d 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -988,6 +988,17 @@ static void vdi_close(BlockDriverState *bs)
>  error_free(s->migration_blocker);
>  }
>  
> +static int vdi_has_zero_init(BlockDriverState *bs)
> +{
> +BDRVVdiState *s = bs->opaque;
> +
> +if (s->header.image_type == VDI_TYPE_STATIC) {
> +return bdrv_has_zero_init(bs->file->bs);
> +} else {
> +return 1;
> +}
> +}
> +
>  static QemuOptsList vdi_create_opts = {
>  .name = "vdi-create-opts",
>  .head = QTAILQ_HEAD_INITIALIZER(vdi_create_opts.head),
> @@ -1028,7 +1039,7 @@ static BlockDriver bdrv_vdi = {
>  .bdrv_child_perm  = bdrv_format_default_perms,
>  .bdrv_co_create  = vdi_co_create,
>  .bdrv_co_create_opts = vdi_co_create_opts,
> -.bdrv_has_zero_init = bdrv_has_zero_init_1,
> +.bdrv_has_zero_init  = vdi_has_zero_init,
>  .bdrv_co_block_status = vdi_co_block_status,
>  .bdrv_make_empty = vdi_make_empty,
>  


I am not familiar with the VDI format to be honest, but knowing that the
dynamic format allows for growing and static images are preallocated,
this makes sense.

I see that when the code allocates a new block at the end of the file, it
actually zeroes it out, so most likely this is right.


Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH v2 08/11] vhdx: Fix .bdrv_has_zero_init()

2019-07-25 Thread Maxim Levitsky
On Wed, 2019-07-24 at 19:12 +0200, Max Reitz wrote:
> Fixed VHDX images cannot guarantee to be zero-initialized.  If the image
> has the "fixed" subformat, forward the call to the underlying storage
> node.
> 
> Reported-by: Stefano Garzarella 
> Signed-off-by: Max Reitz 
> ---
>  block/vhdx.c | 26 +-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/block/vhdx.c b/block/vhdx.c
> index a02d1c99a7..6a09d0a55c 100644
> --- a/block/vhdx.c
> +++ b/block/vhdx.c
> @@ -2075,6 +2075,30 @@ static int coroutine_fn vhdx_co_check(BlockDriverState 
> *bs,
>  return 0;
>  }
>  
> +static int vhdx_has_zero_init(BlockDriverState *bs)
> +{
> +BDRVVHDXState *s = bs->opaque;
> +int state;
> +
> +/*
> + * Check the subformat: Fixed images have all BAT entries present,
> + * dynamic images have none (right after creation).  It is
> + * therefore enough to check the first BAT entry.
> + */
> +if (!s->bat_entries) {
> +return 1;
> +}
> +
> +state = s->bat[0] & VHDX_BAT_STATE_BIT_MASK;
> +if (state == PAYLOAD_BLOCK_FULLY_PRESENT) {
> +/* Fixed subformat */
> +return bdrv_has_zero_init(bs->file->bs);
> +}
> +
> +/* Dynamic subformat */
> +return 1;
> +}
> +
>  static QemuOptsList vhdx_create_opts = {
>  .name = "vhdx-create-opts",
>  .head = QTAILQ_HEAD_INITIALIZER(vhdx_create_opts.head),
> @@ -2128,7 +2152,7 @@ static BlockDriver bdrv_vhdx = {
>  .bdrv_co_create_opts= vhdx_co_create_opts,
>  .bdrv_get_info  = vhdx_get_info,
>  .bdrv_co_check  = vhdx_co_check,
> -.bdrv_has_zero_init = bdrv_has_zero_init_1,
> +.bdrv_has_zero_init = vhdx_has_zero_init,
>  
>  .create_opts= &vhdx_create_opts,
>  };

I am not familiar with the VHDX format either, but knowing that the dynamic
format allows for growing and static images are preallocated, this makes
sense.

It's a bit amusing and not surprising that the spec for this format is in
.docx.
I took a quick look to get a rough impression of the file format.


Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky






Re: [Qemu-block] [Qemu-devel] [PATCH v2 09/11] iotests: Convert to preallocated encrypted qcow2

2019-07-25 Thread Maxim Levitsky
On Wed, 2019-07-24 at 19:12 +0200, Max Reitz wrote:
> Add a test case for converting an empty image (which only returns zeroes
> when read) to a preallocated encrypted qcow2 image.
> qcow2_has_zero_init() should return 0 then, thus forcing qemu-img
> convert to create zero clusters.
> 
> Signed-off-by: Max Reitz 
> Acked-by: Stefano Garzarella 
> Tested-by: Stefano Garzarella 
> ---
>  tests/qemu-iotests/188 | 20 +++-
>  tests/qemu-iotests/188.out |  4 
>  2 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/tests/qemu-iotests/188 b/tests/qemu-iotests/188
> index be7278aa65..afca44df54 100755
> --- a/tests/qemu-iotests/188
> +++ b/tests/qemu-iotests/188
> @@ -48,7 +48,7 @@ SECRETALT="secret,id=sec0,data=platypus"
>  
>  _make_test_img --object $SECRET -o 
> "encrypt.format=luks,encrypt.key-secret=sec0,encrypt.iter-time=10" $size
>  
> -IMGSPEC="driver=$IMGFMT,file.filename=$TEST_IMG,encrypt.key-secret=sec0"
> +IMGSPEC="driver=$IMGFMT,encrypt.key-secret=sec0,file.filename=$TEST_IMG"
I think this change doesn't change anything.

>  
>  QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
>  
> @@ -68,6 +68,24 @@ echo
>  echo "== verify open failure with wrong password =="
>  $QEMU_IO --object $SECRETALT -c "read -P 0xa 0 $size" --image-opts $IMGSPEC 
> | _filter_qemu_io | _filter_testdir
>  
> +_cleanup_test_img
> +
> +echo
> +echo "== verify that has_zero_init returns false when preallocating =="
> +
> +# Empty source file
> +if [ -n "$TEST_IMG_FILE" ]; then
> +TEST_IMG_FILE="${TEST_IMG_FILE}.orig" _make_test_img $size
> +else
> +TEST_IMG="${TEST_IMG}.orig" _make_test_img $size
> +fi

I wonder why we have TEST_IMG_FILE and TEST_IMG; I don't know iotests well
enough.
From a quick look at the code, TEST_IMG_FILE is an actual file, while
TEST_IMG can be various URL-like addresses.

> +
> +$QEMU_IMG convert -O "$IMGFMT" --object $SECRET \
> +-o 
> "encrypt.format=luks,encrypt.key-secret=sec0,encrypt.iter-time=10,preallocation=metadata"
>  \
> +"${TEST_IMG}.orig" "$TEST_IMG"
> +
> +$QEMU_IMG compare --object $SECRET --image-opts "${IMGSPEC}.orig" "$IMGSPEC"
> +
>  
>  # success, all done
>  echo "*** done"
> diff --git a/tests/qemu-iotests/188.out b/tests/qemu-iotests/188.out
> index 97b1402671..c568ef3701 100644
> --- a/tests/qemu-iotests/188.out
> +++ b/tests/qemu-iotests/188.out
> @@ -15,4 +15,8 @@ read 16777216/16777216 bytes at offset 0
>  
>  == verify open failure with wrong password ==
>  qemu-io: can't open: Invalid password, cannot unlock any keyslot
> +
> +== verify that has_zero_init returns false when preallocating ==
> +Formatting 'TEST_DIR/t.IMGFMT.orig', fmt=IMGFMT size=16777216
> +Images are identical.
>  *** done

Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH v2 10/11] iotests: Test convert -n to pre-filled image

2019-07-25 Thread Maxim Levitsky
On Wed, 2019-07-24 at 19:12 +0200, Max Reitz wrote:
> Signed-off-by: Max Reitz 
> ---
>  tests/qemu-iotests/122 | 17 +
>  tests/qemu-iotests/122.out |  8 
>  2 files changed, 25 insertions(+)
> 
> diff --git a/tests/qemu-iotests/122 b/tests/qemu-iotests/122
> index 85c3a8d047..059011ebb1 100755
> --- a/tests/qemu-iotests/122
> +++ b/tests/qemu-iotests/122
> @@ -257,6 +257,23 @@ for min_sparse in 4k 8k; do
>  $QEMU_IMG map --output=json "$TEST_IMG".orig | _filter_qemu_img_map
>  done
>  
> +
> +echo
> +echo '=== -n to a non-zero image ==='
> +echo
> +
> +# Keep source zero
> +_make_test_img 64M
> +
> +# Output is not zero, but has bdrv_has_zero_init() == 1
> +TEST_IMG="$TEST_IMG".orig _make_test_img 64M
> +$QEMU_IO -c "write -P 42 0 64k" "$TEST_IMG".orig | _filter_qemu_io
> +
> +# Convert with -n, which should not assume that the target is zeroed
> +$QEMU_IMG convert -O $IMGFMT -n "$TEST_IMG" "$TEST_IMG".orig
> +
> +$QEMU_IMG compare "$TEST_IMG" "$TEST_IMG".orig
> +
>  # success, all done
>  echo '*** done'
>  rm -f $seq.full
> diff --git a/tests/qemu-iotests/122.out b/tests/qemu-iotests/122.out
> index c576705284..849b6cc2ef 100644
> --- a/tests/qemu-iotests/122.out
> +++ b/tests/qemu-iotests/122.out
> @@ -220,4 +220,12 @@ convert -c -S 8k
>  { "start": 9216, "length": 8192, "depth": 0, "zero": true, "data": false},
>  { "start": 17408, "length": 1024, "depth": 0, "zero": false, "data": true},
>  { "start": 18432, "length": 67090432, "depth": 0, "zero": true, "data": 
> false}]
> +
> +=== -n to a non-zero image ===
> +
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
> +Formatting 'TEST_DIR/t.IMGFMT.orig', fmt=IMGFMT size=67108864
> +wrote 65536/65536 bytes at offset 0
> +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +Images are identical.
>  *** done


Reviewed-by: Maxim Levitsky 
Best regards,
Maxim Levitsky





Re: [Qemu-block] [PULL 2/5] block/nvme: support larger that 512 bytes sector devices

2019-07-29 Thread Maxim Levitsky
On Mon, 2019-07-29 at 14:16 +0100, Peter Maydell wrote:
> On Mon, 22 Jul 2019 at 18:26, Max Reitz  wrote:
> > 
> > From: Maxim Levitsky 
> > 
> > Currently the driver hardcodes the sector size to 512,
> > and doesn't check the underlying device. Fix that.
> > 
> > Also fail if underlying nvme device is formatted with metadata
> > as this needs special support.
> > 
> > Signed-off-by: Maxim Levitsky 
> > Message-id: 20190716163020.13383-3-mlevi...@redhat.com
> > Signed-off-by: Max Reitz 
> > +static int64_t nvme_get_blocksize(BlockDriverState *bs)
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +assert(s->blkshift >= BDRV_SECTOR_BITS);
> > +return 1 << s->blkshift;
> > +}
> 
> Hi -- Coverity points out here that we calculate the
> "1 << s->blkshift" as a 32-bit shift, but then return an
> int64_t type (CID 1403771).
> 
> Can the blkshift ever really be 31 or more ?

In theory, in the spec it is an 8-bit field; in practice, it should not be
larger than 12, because at least Linux doesn't support block devices with a
block size larger than 4K at all.
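
To restate the shift-width point as a plain C illustration (hypothetical
helpers, not the QEMU code):

#include <stdint.h>

static int64_t blocksize_bad(unsigned blkshift)
{
    /* the shift happens in 32-bit int, so blkshift >= 31 misbehaves
     * before the result is ever widened to int64_t */
    return 1 << blkshift;
}

static int64_t blocksize_good(unsigned blkshift)
{
    return (int64_t)1 << blkshift;   /* widen first, then shift in 64 bits */
}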

Best regards,
Maxim Levitsky

> 
> The types here seem weird anyway -- we return an int64_t,
> but the only user of this is nvme_probe_blocksizes(),
> which uses the value only to set BlockSizes::phys and ::log,
> both of which are of type "uint32_t". That leads me to think
> that the right return type for the function is uint32_t.
> 
> PS: this is the only Coverity issue currently outstanding so
> if it's a trivial fix it might be nice to put it into rc3.
> 
> thanks
> -- PMM
> 





Re: [Qemu-block] [PULL 2/5] block/nvme: support larger that 512 bytes sector devices

2019-07-29 Thread Maxim Levitsky
On Mon, 2019-07-29 at 15:25 +0200, Max Reitz wrote:
> On 29.07.19 15:16, Peter Maydell wrote:
> > On Mon, 22 Jul 2019 at 18:26, Max Reitz  wrote:
> > > 
> > > From: Maxim Levitsky 
> > > 
> > > Currently the driver hardcodes the sector size to 512,
> > > and doesn't check the underlying device. Fix that.
> > > 
> > > Also fail if underlying nvme device is formatted with metadata
> > > as this needs special support.
> > > 
> > > Signed-off-by: Maxim Levitsky 
> > > Message-id: 20190716163020.13383-3-mlevi...@redhat.com
> > > Signed-off-by: Max Reitz 
> > > +static int64_t nvme_get_blocksize(BlockDriverState *bs)
> > > +{
> > > +BDRVNVMeState *s = bs->opaque;
> > > +assert(s->blkshift >= BDRV_SECTOR_BITS);
> > > +return 1 << s->blkshift;
> > > +}
> > 
> > Hi -- Coverity points out here that we calculate the
> > "1 << s->blkshift" as a 32-bit shift, but then return an
> > int64_t type (CID 1403771).
> > 
> > Can the blkshift ever really be 31 or more ?
> > 
> > The types here seem weird anyway -- we return an int64_t,
> > but the only user of this is nvme_probe_blocksizes(),
> > which uses the value only to set BlockSizes::phys and ::log,
> > both of which are of type "uint32_t". That leads me to think
> > that the right return type for the function is uint32_t.
> > 
> > PS: this is the only Coverity issue currently outstanding so
> > if it's a trivial fix it might be nice to put it into rc3.
> 
> Maxim, what do you think?

Fully agree with that.

> 
> How about we let nvme_identify() limit blkshift to something sane and
> then return a uint32_t here?
> 
> In theory it would be limited by page_size, and that has a maximum value
> of 2^27.  In practice, though, that limit is checked by another 32-bit
> shift...

2^27 is the maximum NVME page size, but in theory, and only in theory,
you can have blocks larger than a page size; you will just have to give the
controller more than one page, even when you read a single block.

But like I said in the other mail, Linux doesn't support at all a block size
larger than the cpu page size, which is almost always 4K,
thus 12 is the practical limit for the block size these days, and of course
both 27 and 31 are well above this.

So I'll send a patch to fix this today or tomorrow.

Best regards,
Maxim Levitsky

> Max
> 





Re: [Qemu-block] [PATCH for-4.1?] nvme: Limit blkshift to 12 (for 4 kB blocks)

2019-07-30 Thread Maxim Levitsky
On Tue, 2019-07-30 at 13:48 +0200, Max Reitz wrote:
> Linux does not support blocks greater than 4 kB anyway, so we might as
> well limit blkshift to 12 and thus save us from some potential trouble.

Well, in theory it's not 4K but PAGE_SIZE, so on some IBM machines, which I
heard have a 64K page size, that might work; but again, I don't think any
hardware vendor has yet dared to sell devices with a sector size > 4K.

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky

> 
> Reported-by: Peter Maydell 
> Suggested-by: Maxim Levitsky 
> Signed-off-by: Max Reitz 
> ---
> I won't be around for too long today, so I thought I'd just write a
> patch myself now.
> ---
>  block/nvme.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index c28755cc31..2c85713519 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -105,7 +105,7 @@ typedef struct {
>  
>  uint64_t nsze; /* Namespace size reported by identify command */
>  int nsid;  /* The namespace id to read/write data. */
> -size_t blkshift;
> +int blkshift;
>  
>  uint64_t max_transfer;
>  bool plugged;
> @@ -420,7 +420,7 @@ static void nvme_identify(BlockDriverState *bs, int 
> namespace, Error **errp)
>  NvmeIdNs *idns;
>  NvmeLBAF *lbaf;
>  uint8_t *resp;
> -int r, hwsect_size;
> +int r;
>  uint64_t iova;
>  NvmeCmd cmd = {
>  .opcode = NVME_ADM_CMD_IDENTIFY,
> @@ -474,11 +474,11 @@ static void nvme_identify(BlockDriverState *bs, int 
> namespace, Error **errp)
>  goto out;
>  }
>  
> -hwsect_size = 1 << lbaf->ds;
> -
> -if (hwsect_size < BDRV_SECTOR_SIZE || hwsect_size > s->page_size) {
> -error_setg(errp, "Namespace has unsupported block size (%d)",
> -hwsect_size);
> +if (lbaf->ds < BDRV_SECTOR_BITS || lbaf->ds > 12 ||
> +(1 << lbaf->ds) > s->page_size)
> +{
> +error_setg(errp, "Namespace has unsupported block size (2^%d)",
> +   lbaf->ds);
>  goto out;
>  }
>  
> @@ -804,16 +804,16 @@ static int64_t nvme_getlength(BlockDriverState *bs)
>  return s->nsze << s->blkshift;
>  }
>  
> -static int64_t nvme_get_blocksize(BlockDriverState *bs)
> +static uint32_t nvme_get_blocksize(BlockDriverState *bs)
>  {
>  BDRVNVMeState *s = bs->opaque;
> -assert(s->blkshift >= BDRV_SECTOR_BITS);
> -return 1 << s->blkshift;
> +assert(s->blkshift >= BDRV_SECTOR_BITS && s->blkshift <= 12);
> +return UINT32_C(1) << s->blkshift;
>  }
>  
>  static int nvme_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
>  {
> -int64_t blocksize = nvme_get_blocksize(bs);
> +uint32_t blocksize = nvme_get_blocksize(bs);
>  bsz->phys = blocksize;
>  bsz->log = blocksize;
>  return 0;





[Qemu-block] [PATCH 01/13] block-crypto: misc refactoring

2019-08-14 Thread Maxim Levitsky
* rename the write_func to create_write_func,
  and init_func to create_init_func;
  this is preparation for another write_func that will
  be used to update the encryption keys.

No functional changes

Signed-off-by: Maxim Levitsky 
---
 block/crypto.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/crypto.c b/block/crypto.c
index 8237424ae6..42a3f0898b 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -51,7 +51,6 @@ static int block_crypto_probe_generic(QCryptoBlockFormat 
format,
 }
 }
 
-
 static ssize_t block_crypto_read_func(QCryptoBlock *block,
   size_t offset,
   uint8_t *buf,
@@ -77,7 +76,7 @@ struct BlockCryptoCreateData {
 };
 
 
-static ssize_t block_crypto_write_func(QCryptoBlock *block,
+static ssize_t block_crypto_create_write_func(QCryptoBlock *block,
size_t offset,
const uint8_t *buf,
size_t buflen,
@@ -95,8 +94,7 @@ static ssize_t block_crypto_write_func(QCryptoBlock *block,
 return ret;
 }
 
-
-static ssize_t block_crypto_init_func(QCryptoBlock *block,
+static ssize_t block_crypto_create_init_func(QCryptoBlock *block,
   size_t headerlen,
   void *opaque,
   Error **errp)
@@ -108,7 +106,8 @@ static ssize_t block_crypto_init_func(QCryptoBlock *block,
 return -EFBIG;
 }
 
-/* User provided size should reflect amount of space made
+/*
+ * User provided size should reflect amount of space made
  * available to the guest, so we must take account of that
  * which will be used by the crypto header
  */
@@ -117,6 +116,8 @@ static ssize_t block_crypto_init_func(QCryptoBlock *block,
 }
 
 
+
+
 static QemuOptsList block_crypto_runtime_opts_luks = {
 .name = "crypto",
 .head = QTAILQ_HEAD_INITIALIZER(block_crypto_runtime_opts_luks.head),
@@ -272,8 +273,8 @@ static int block_crypto_co_create_generic(BlockDriverState 
*bs,
 };
 
 crypto = qcrypto_block_create(opts, NULL,
-  block_crypto_init_func,
-  block_crypto_write_func,
+  block_crypto_create_init_func,
+  block_crypto_create_write_func,
   &data,
   errp);
 
-- 
2.17.2




[Qemu-block] [PATCH 04/13] qcrypto-luks: refactoring: simplify the math used for keyslot locations

2019-08-14 Thread Maxim Levitsky
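As background for the refactored math below, here is a tiny standalone
example (not part of the patch) that computes the slot layout, assuming
the LUKS1 defaults QEMU uses (512-byte sectors, key material starting
at byte offset 4096, 4000 anti-forensic stripes, 8 key slots) plus a
64-byte master key, i.e. aes-256 in xts mode:

#include <stdio.h>

#define SECTOR       512u
#define SLOT_OFFSET  4096u   /* key material starts here, in bytes */
#define STRIPES      4000u   /* anti-forensic stripes per key slot */
#define NUM_SLOTS    8u
#define KEY_BYTES    64u     /* assumption: aes-256 in xts mode    */

int main(void)
{
    unsigned header_sectors = SLOT_OFFSET / SECTOR;                  /* 8   */
    unsigned splitkey_sectors =
        (KEY_BYTES * STRIPES + SECTOR - 1) / SECTOR;                 /* 500 */
    /* each slot's key material is aligned up to the header size */
    unsigned slot_sectors = ((splitkey_sectors + header_sectors - 1) /
                             header_sectors) * header_sectors;       /* 504 */
    unsigned payload = header_sectors + NUM_SLOTS * slot_sectors;    /* 4040 */

    printf("slot: %u sectors, payload offset: %u sectors\n",
           slot_sectors, payload);
    return 0;
}

The resulting layout is the same one the old open-coded expression
produced; only the derivation becomes easier to follow.
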
Signed-off-by: Maxim Levitsky 
---
 crypto/block-luks.c | 64 +++--
 1 file changed, 38 insertions(+), 26 deletions(-)

diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 6bb369f3b4..e1a4df94b7 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -417,6 +417,33 @@ static int masterkeylen(QCryptoBlockLUKS *luks)
 }
 
 
+/*
+ * Returns number of sectors needed to store the key material
+ * given number of anti forensic stripes
+ */
+static int splitkeylen_sectors(QCryptoBlockLUKS *luks, int stripes)
+
+{
+/*
+ * This calculation doesn't match that shown in the spec,
+ * but instead follows the cryptsetup implementation.
+ */
+
+size_t header_sectors = QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
+ QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;
+
+size_t splitkeylen = masterkeylen(luks) * stripes;
+
+/* First align the key material size to block size*/
+size_t splitkeylen_sectors =
+DIV_ROUND_UP(splitkeylen, QCRYPTO_BLOCK_LUKS_SECTOR_SIZE);
+
+/* Then also align the key material size to the size of the header */
+return ROUND_UP(splitkeylen_sectors, header_sectors);
+}
+
+
+
 /*
  * Stores the main LUKS header, taking care of endianess
  */
@@ -1169,7 +1196,7 @@ qcrypto_block_luks_create(QCryptoBlock *block,
 QCryptoBlockCreateOptionsLUKS luks_opts;
 Error *local_err = NULL;
 uint8_t *masterkey = NULL;
-size_t splitkeylen = 0;
+size_t next_sector;
 size_t i;
 char *password;
 const char *cipher_alg;
@@ -1388,23 +1415,16 @@ qcrypto_block_luks_create(QCryptoBlock *block,
 goto error;
 }
 
+/* start with the sector that follows the header*/
+next_sector = QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
+  QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;
 
-/* Although LUKS has multiple key slots, we're just going
- * to use the first key slot */
-splitkeylen = luks->header.key_bytes * QCRYPTO_BLOCK_LUKS_STRIPES;
 for (i = 0; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS; i++) {
-luks->header.key_slots[i].active = 
QCRYPTO_BLOCK_LUKS_KEY_SLOT_DISABLED;
-luks->header.key_slots[i].stripes = QCRYPTO_BLOCK_LUKS_STRIPES;
-
-/* This calculation doesn't match that shown in the spec,
- * but instead follows the cryptsetup implementation.
- */
-luks->header.key_slots[i].key_offset =
-(QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
- QCRYPTO_BLOCK_LUKS_SECTOR_SIZE) +
-(ROUND_UP(DIV_ROUND_UP(splitkeylen, 
QCRYPTO_BLOCK_LUKS_SECTOR_SIZE),
-  (QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
-   QCRYPTO_BLOCK_LUKS_SECTOR_SIZE)) * i);
+QCryptoBlockLUKSKeySlot *slot = &luks->header.key_slots[i];
+slot->active = QCRYPTO_BLOCK_LUKS_KEY_SLOT_DISABLED;
+slot->key_offset = next_sector;
+slot->stripes = QCRYPTO_BLOCK_LUKS_STRIPES;
+next_sector += splitkeylen_sectors(luks, QCRYPTO_BLOCK_LUKS_STRIPES);
 }
 
 
@@ -1412,17 +1432,9 @@ qcrypto_block_luks_create(QCryptoBlock *block,
  * slot headers, rounded up to the nearest sector, combined with
  * the size of each master key material region, also rounded up
  * to the nearest sector */
-luks->header.payload_offset =
-(QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
- QCRYPTO_BLOCK_LUKS_SECTOR_SIZE) +
-(ROUND_UP(DIV_ROUND_UP(splitkeylen, QCRYPTO_BLOCK_LUKS_SECTOR_SIZE),
-  (QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
-   QCRYPTO_BLOCK_LUKS_SECTOR_SIZE)) *
- QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS);
-
+luks->header.payload_offset = next_sector;
 block->sector_size = QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;
-block->payload_offset = luks->header.payload_offset *
-block->sector_size;
+block->payload_offset = luks->header.payload_offset * block->sector_size;
 
 /* Reserve header space to match payload offset */
 initfunc(block, block->payload_offset, opaque, &local_err);
-- 
2.17.2




[Qemu-block] [PATCH 05/13] qcrypto-luks: clear the masterkey and password before freeing them always

2019-08-14 Thread Maxim Levitsky
While there are other places where these are still stored in memory,
this is still one less key material area that can be sniffed with
various side channel attacks.
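
Since the same memset-then-free pattern now shows up in several cleanup
paths, a small helper along these lines could factor it out.  This is
only a sketch, not an existing QEMU function; note that a bare memset()
right before free() may be optimized away, so the call goes through a
volatile function pointer here:

#include <glib.h>
#include <string.h>

/* volatile function pointer so the compiler cannot elide the wipe */
static void *(*const volatile memset_v)(void *, int, size_t) = memset;

static void secure_gfree(void *buf, size_t len)
{
    if (buf) {
        memset_v(buf, 0, len);   /* wipe key material before releasing it */
        g_free(buf);
    }
}

With that, the cleanup paths would shrink to calls such as
secure_gfree(masterkey, masterkeylen(luks)) and
secure_gfree(password, strlen(password)).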



Signed-off-by: Maxim Levitsky 
---
 crypto/block-luks.c | 52 ++---
 1 file changed, 44 insertions(+), 8 deletions(-)

diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index e1a4df94b7..336e633df4 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -1023,8 +1023,18 @@ qcrypto_block_luks_load_key(QCryptoBlock *block,
  cleanup:
 qcrypto_ivgen_free(ivgen);
 qcrypto_cipher_free(cipher);
-g_free(splitkey);
-g_free(possiblekey);
+
+if (splitkey) {
+memset(splitkey, 0, splitkeylen);
+g_free(splitkey);
+}
+
+if (possiblekey) {
+memset(possiblekey, 0, masterkeylen(luks));
+g_free(possiblekey);
+
+}
+
 return ret;
 }
 
@@ -1161,16 +1171,34 @@ qcrypto_block_luks_open(QCryptoBlock *block,
 block->sector_size = QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;
 block->payload_offset = luks->header.payload_offset * block->sector_size;
 
-g_free(masterkey);
-g_free(password);
+if (masterkey) {
+memset(masterkey, 0, masterkeylen(luks));
+g_free(masterkey);
+}
+
+if (password) {
+memset(password, 0, strlen(password));
+g_free(password);
+}
+
 return 0;
 
  fail:
-g_free(masterkey);
+
+if (masterkey) {
+memset(masterkey, 0, masterkeylen(luks));
+g_free(masterkey);
+}
+
+if (password) {
+memset(password, 0, strlen(password));
+g_free(password);
+}
+
 qcrypto_block_free_cipher(block);
 qcrypto_ivgen_free(block->ivgen);
+
 g_free(luks);
-g_free(password);
 return ret;
 }
 
@@ -1459,7 +1487,10 @@ qcrypto_block_luks_create(QCryptoBlock *block,
 
 memset(masterkey, 0, luks->header.key_bytes);
 g_free(masterkey);
+
+memset(password, 0, strlen(password));
 g_free(password);
+
 g_free(cipher_mode_spec);
 
 return 0;
@@ -1467,9 +1498,14 @@ qcrypto_block_luks_create(QCryptoBlock *block,
  error:
 if (masterkey) {
 memset(masterkey, 0, luks->header.key_bytes);
+g_free(masterkey);
 }
-g_free(masterkey);
-g_free(password);
+
+if (password) {
+memset(password, 0, strlen(password));
+g_free(password);
+}
+
 g_free(cipher_mode_spec);
 
 qcrypto_block_free_cipher(block);
-- 
2.17.2




[Qemu-block] [PATCH 00/13] RFC: luks/encrypted qcow2 key management

2019-08-14 Thread Maxim Levitsky
Hi!

This patch series implements key management for luks-based encryption.
It supports both raw luks images and encrypted qcow2 images.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1731898

There are still several issues that need to be figured out, on which
feedback is very welcome, but other than that the code mostly works.

The main issues are:

1. Instead of the proposed blockdev-update-encryption/blockdev-erase-encryption
interface, it is probably better to implement 'blockdev-amend-options' in QMP,
and use this for both offline and online key updates (with some translation
layer to convert the qemu-img 'options' to QMP structures).

This kind of interface already exists for offline qcow2 format option updates.

This is an issue that was raised today on IRC with Kevin Wolf. Many thanks
for the idea!

We agreed that this new QMP interface should take the same options as
blockdev-create does; however, since we want to be able to edit the encryption
slots separately, this implies that we sort of need to allow that at creation
time as well.

Also, BlockdevCreateOptions is a union, specialized by the driver name,
which is great for creation; but for an update the driver name is already
known, and thus the user should not be forced to pass it again.
However, QMP doesn't seem to support guessing the union type from the actual
fields given (this might not be desired either), which complicates this
somewhat.

2. The 'crypto' driver (the raw luks block device/file) has special behavior
for share-rw=on. Write sharing is usually only allowed for raw files - files
whose contents qemu itself doesn't touch, only the guest does. For such
files, well-behaved guests can share the storage.

On the other hand, most format drivers need to store metadata, and we don't
have any format driver which implements some kind of synchronization with
other users of the same file, so write sharing is not allowed for them.

However, since for luks, which is technically a format driver, the metadata
is read-only, such write sharing was allowed until now, and for backward
compatibility it should still be allowed in the future.

This causes an issue with online updating of the keys; the solution suggested
by Kevin, which I implemented, is to request exclusive write access only for
the duration of the key update.
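
A rough sketch of what that could look like, assuming the block-layer
permission API of the time (bdrv_child_try_set_perm() and the BLK_PERM_*
flags); the function name and the exact permission masks here are
illustrative, not the code from this series:

    /* hedged sketch - assumes include/block/block_int.h */
    static int luks_keyslot_update_locked(BdrvChild *c, Error **errp)
    {
        int ret;

        /* temporarily stop sharing WRITE with other users of the image */
        ret = bdrv_child_try_set_perm(c,
                                      BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE,
                                      BLK_PERM_ALL & ~BLK_PERM_WRITE,
                                      errp);
        if (ret < 0) {
            return ret;    /* another user still needs write access */
        }

        /* ... rewrite the keyslot header and key material here ... */

        /* restore the permissive sharing the crypto driver normally allows */
        return bdrv_child_try_set_perm(c,
                                       BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE,
                                       BLK_PERM_ALL,
                                       errp);
    }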

Testing: this was lightly tested, manually and with a few iotests that I
prepared. I haven't yet fully tested the write-sharing behavior, nor did I
run the whole iotests suite to see if this code causes any regressions.
Since I will probably need to rewrite some chunks of it to switch to the
'amend' interface, I decided to post it now, to see if you have other
ideas/comments to add.

Best regards,
Maxim Levitsky

Maxim Levitsky (13):
  block-crypto: misc refactoring
  qcrypto-luks: misc refactoring
  qcrypto-luks: refactoring: extract load/store/check/parse header
functions
  qcrypto-luks: refactoring: simplify the math used for keyslot
locations
  qcrypto-luks: clear the masterkey and password before freeing them
always
  qcrypto-luks: implement more rigorous header checking
  block: add manage-encryption command (qmp and blockdev)
  qcrypto: add the plumbing for encryption management
  qcrypto-luks: implement the encryption key management
  block/crypto: implement the encryption key management
  block/qcow2: implement the encryption key management
  qemu-img: implement key management
  iotests : add tests for encryption key management

 block/block-backend.c|9 +
 block/crypto.c   |  127 ++-
 block/crypto.h   |3 +
 block/io.c   |   24 +
 block/qcow2.c|   27 +
 blockdev.c   |   40 +
 crypto/block-luks.c  | 1673 --
 crypto/block.c   |   29 +
 crypto/blockpriv.h   |9 +
 include/block/block.h|   12 +
 include/block/block_int.h|   11 +
 include/crypto/block.h   |   27 +
 include/sysemu/block-backend.h   |7 +
 qapi/block-core.json |   36 +
 qapi/crypto.json |   26 +
 qemu-img-cmds.hx |   13 +
 qemu-img.c   |  140 +++
 tests/qemu-iotests/257   |  197 
 tests/qemu-iotests/257.out   |   96 ++
 tests/qemu-iotests/258   |   95 ++
 tests/qemu-iotests/258.out   |   30 +
 tests/qemu-iotests/259   |  199 
 tests/qemu-iotests/259.out   |5 +
 tests/qemu-iotests/common.filter |5 +-
 tests/qemu-iotests/group |3 +
 25 files changed, 2286 insertions(+), 557 deletions(-)
 create mode 100755 tests/qemu-iotests/257
 create mode 100644 tests/qemu-iotests/257.out
 create mode 100755 tests/qemu-iotests/258
 create mode 100644 tests/qemu-iotests/258.out
 create mode 100644 tests/qemu-iotests/259
 create mode 100644 tests/qemu-iotests/259.out

-- 
2.17.2




[Qemu-block] [PATCH 06/13] qcrypto-luks: implement more rigorous header checking

2019-08-14 Thread Maxim Levitsky
Check that keyslots don't overlap with the data,
and check that keyslots don't overlap with each other.
(This is done using naive O(n^2) nested loops, but since there
are just 8 keyslots, this doesn't really matter.)
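
For reference, a tiny standalone illustration (not from the patch) of the
half-open interval test used below: two ranges [start1, start1+len1) and
[start2, start2+len2) intersect exactly when each one starts before the
other one ends.

    #include <stdbool.h>
    #include <stdint.h>
    #include <assert.h>

    /* true if [start1, start1+len1) and [start2, start2+len2) intersect */
    static bool ranges_overlap(uint64_t start1, uint64_t len1,
                               uint64_t start2, uint64_t len2)
    {
        return start1 + len1 > start2 && start2 + len2 > start1;
    }

    int main(void)
    {
        assert(ranges_overlap(8, 504, 500, 504));   /* second slot starts too early */
        assert(!ranges_overlap(8, 504, 512, 504));  /* back to back, no overlap */
        return 0;
    }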

Signed-off-by: Maxim Levitsky 
---
 crypto/block-luks.c | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 336e633df4..1997e92fe1 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -551,6 +551,8 @@ static int
 qcrypto_block_luks_check_header(QCryptoBlockLUKS *luks, Error **errp)
 {
 int ret;
+int i, j;
+
 
 if (memcmp(luks->header.magic, qcrypto_block_luks_magic,
QCRYPTO_BLOCK_LUKS_MAGIC_LEN) != 0) {
@@ -566,6 +568,46 @@ qcrypto_block_luks_check_header(QCryptoBlockLUKS *luks, 
Error **errp)
 goto fail;
 }
 
+/* Check all keyslots for corruption  */
+for (i = 0 ; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS ; i++) {
+
+QCryptoBlockLUKSKeySlot *slot1 = &luks->header.key_slots[i];
+uint start1 = slot1->key_offset;
+uint len1 = splitkeylen_sectors(luks, slot1->stripes);
+
+if (slot1->stripes == 0 ||
+(slot1->active != QCRYPTO_BLOCK_LUKS_KEY_SLOT_DISABLED &&
+slot1->active != QCRYPTO_BLOCK_LUKS_KEY_SLOT_ENABLED)) {
+
+error_setg(errp, "Keyslot %i is corrupted", i);
+ret = -EINVAL;
+goto fail;
+}
+
+if (start1 + len1 > luks->header.payload_offset) {
+error_setg(errp,
+   "Keyslot %i is overlapping with the encrypted payload",
+   i);
+ret = -EINVAL;
+goto fail;
+}
+
+for (j = i + 1 ; j < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS ; j++) {
+
+QCryptoBlockLUKSKeySlot *slot2 = &luks->header.key_slots[j];
+uint start2 = slot2->key_offset;
+uint len2 = splitkeylen_sectors(luks, slot2->stripes);
+
+if (start1 + len1 > start2 && start2 + len2 > start1) {
+error_setg(errp,
+   "Keyslots %i and %i are overlapping in the header",
+   i, j);
+ret = -EINVAL;
+goto fail;
+}
+}
+
+}
 return 0;
 fail:
 return ret;
-- 
2.17.2




[Qemu-block] [PATCH 02/13] qcrypto-luks: misc refactoring

2019-08-14 Thread Maxim Levitsky
This is also a preparation for the key read/write/erase functions:

* use the master key length from the header
* prefer the crypto params stored in QCryptoBlockLUKS
  over passing them as function arguments
* define QCRYPTO_BLOCK_LUKS_DEFAULT_ITER_TIME
* add comments to the various crypto parameters in QCryptoBlockLUKS

Signed-off-by: Maxim Levitsky 
---
 crypto/block-luks.c | 213 ++--
 1 file changed, 105 insertions(+), 108 deletions(-)

diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 409ab50f20..48213abde7 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -70,6 +70,8 @@ typedef struct QCryptoBlockLUKSKeySlot 
QCryptoBlockLUKSKeySlot;
 
 #define QCRYPTO_BLOCK_LUKS_SECTOR_SIZE 512LL
 
+#define QCRYPTO_BLOCK_LUKS_DEFAULT_ITER_TIME 2000
+
 static const char qcrypto_block_luks_magic[QCRYPTO_BLOCK_LUKS_MAGIC_LEN] = {
 'L', 'U', 'K', 'S', 0xBA, 0xBE
 };
@@ -199,13 +201,25 @@ QEMU_BUILD_BUG_ON(sizeof(struct QCryptoBlockLUKSHeader) 
!= 592);
 struct QCryptoBlockLUKS {
 QCryptoBlockLUKSHeader header;
 
-/* Cache parsed versions of what's in header fields,
- * as we can't rely on QCryptoBlock.cipher being
- * non-NULL */
+/* Main encryption algorithm used for encryption*/
 QCryptoCipherAlgorithm cipher_alg;
+
+/* Mode of encryption for the selected encryption algorithm */
 QCryptoCipherMode cipher_mode;
+
+/* Initialization vector generation algorithm */
 QCryptoIVGenAlgorithm ivgen_alg;
+
+/* Hash algorithm used for IV generation*/
 QCryptoHashAlgorithm ivgen_hash_alg;
+
+/*
+ * Encryption algorithm used for IV generation.
+ * Usually the same as main encryption algorithm
+ */
+QCryptoCipherAlgorithm ivgen_cipher_alg;
+
+/* Hash algorithm used in pbkdf2 function */
 QCryptoHashAlgorithm hash_alg;
 };
 
@@ -397,6 +411,12 @@ qcrypto_block_luks_essiv_cipher(QCryptoCipherAlgorithm 
cipher,
 }
 }
 
+static int masterkeylen(QCryptoBlockLUKS *luks)
+{
+return luks->header.key_bytes;
+}
+
+
 /*
  * Given a key slot, and user password, this will attempt to unlock
  * the master encryption key from the key slot.
@@ -410,21 +430,15 @@ qcrypto_block_luks_essiv_cipher(QCryptoCipherAlgorithm 
cipher,
  */
 static int
 qcrypto_block_luks_load_key(QCryptoBlock *block,
-QCryptoBlockLUKSKeySlot *slot,
+uint slot_idx,
 const char *password,
-QCryptoCipherAlgorithm cipheralg,
-QCryptoCipherMode ciphermode,
-QCryptoHashAlgorithm hash,
-QCryptoIVGenAlgorithm ivalg,
-QCryptoCipherAlgorithm ivcipheralg,
-QCryptoHashAlgorithm ivhash,
 uint8_t *masterkey,
-size_t masterkeylen,
 QCryptoBlockReadFunc readfunc,
 void *opaque,
 Error **errp)
 {
 QCryptoBlockLUKS *luks = block->opaque;
+QCryptoBlockLUKSKeySlot *slot = &luks->header.key_slots[slot_idx];
 uint8_t *splitkey;
 size_t splitkeylen;
 uint8_t *possiblekey;
@@ -439,9 +453,9 @@ qcrypto_block_luks_load_key(QCryptoBlock *block,
 return 0;
 }
 
-splitkeylen = masterkeylen * slot->stripes;
+splitkeylen = masterkeylen(luks) * slot->stripes;
 splitkey = g_new0(uint8_t, splitkeylen);
-possiblekey = g_new0(uint8_t, masterkeylen);
+possiblekey = g_new0(uint8_t, masterkeylen(luks));
 
 /*
  * The user password is used to generate a (possible)
@@ -450,11 +464,11 @@ qcrypto_block_luks_load_key(QCryptoBlock *block,
  * the key is correct and validate the results of
  * decryption later.
  */
-if (qcrypto_pbkdf2(hash,
+if (qcrypto_pbkdf2(luks->hash_alg,
(const uint8_t *)password, strlen(password),
slot->salt, QCRYPTO_BLOCK_LUKS_SALT_LEN,
slot->iterations,
-   possiblekey, masterkeylen,
+   possiblekey, masterkeylen(luks),
errp) < 0) {
 goto cleanup;
 }
@@ -478,19 +492,19 @@ qcrypto_block_luks_load_key(QCryptoBlock *block,
 
 /* Setup the cipher/ivgen that we'll use to try to decrypt
  * the split master key material */
-cipher = qcrypto_cipher_new(cipheralg, ciphermode,
-possiblekey, masterkeylen,
+cipher = qcrypto_cipher_new(luks->cipher_alg, luks->cipher_mode,
+possiblekey, masterkeylen(luks),
 errp);
 if (!cipher) {
 goto cleanup;
 }
 
-niv = qcrypto_cipher_get_iv_len(cipheralg,
-c

[Qemu-block] [PATCH 03/13] qcrypto-luks: refactoring: extract load/store/check/parse header functions

2019-08-14 Thread Maxim Levitsky
With the upcoming key management, the header will
need to be stored again after the image is created.

Extracting the header-load code isn't strictly needed,
but do it anyway for symmetry.

I also extracted a function that does basic sanity
checks on the just-read header, and a function that
parses the crypto format options, to make the code
a bit more readable. In addition, the code no longer
destroys the in-header cipher-mode string, so the
header can now be stored many times, which is needed
for the key management.

This also keeps the endianness conversions contained
in these functions alone.

The header is no longer endian-swapped in place, to
prevent (mostly theoretical, I think) races where
someone could see the header in the process of being
byte-swapped.
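
A minimal illustration (not from the patch) of why the store path below
swaps a copy rather than the live header: on a little-endian host an
in-place swap would leave the in-memory header holding big-endian values,
so any later use of a field, such as turning the payload offset into a byte
offset, would silently compute garbage. htonl() stands in for cpu_to_be32()
here.

    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    int main(void)
    {
        uint32_t payload_offset = 4040;            /* sectors, native value */
        uint32_t on_disk = htonl(payload_offset);  /* big-endian copy for the write */

        printf("correct byte offset: %llu\n",
               (unsigned long long)payload_offset * 512ULL);
        printf("byte offset if the live field had been swapped in place: %llu\n",
               (unsigned long long)on_disk * 512ULL);
        return 0;
    }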

Signed-off-by: Maxim Levitsky 
---
 crypto/block-luks.c | 756 ++--
 1 file changed, 440 insertions(+), 316 deletions(-)

diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 48213abde7..6bb369f3b4 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -417,6 +417,427 @@ static int masterkeylen(QCryptoBlockLUKS *luks)
 }
 
 
+/*
+ * Stores the main LUKS header, taking care of endianess
+ */
+static int
+qcrypto_block_luks_store_header(QCryptoBlock *block,
+QCryptoBlockWriteFunc writefunc,
+void *opaque,
+Error **errp)
+{
+QCryptoBlockLUKS *luks = block->opaque;
+Error *local_err = NULL;
+size_t i;
+QCryptoBlockLUKSHeader *hdr_copy;
+
+/* Create a copy of the header */
+hdr_copy = g_new0(QCryptoBlockLUKSHeader, 1);
+memcpy(hdr_copy, &luks->header, sizeof(QCryptoBlockLUKSHeader));
+
+/*
+ * Everything on disk uses Big Endian (tm), so flip header fields
+ * before writing them
+ */
+cpu_to_be16s(&hdr_copy->version);
+cpu_to_be32s(&hdr_copy->payload_offset);
+cpu_to_be32s(&hdr_copy->key_bytes);
+cpu_to_be32s(&hdr_copy->master_key_iterations);
+
+for (i = 0; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS; i++) {
+cpu_to_be32s(&hdr_copy->key_slots[i].active);
+cpu_to_be32s(&hdr_copy->key_slots[i].iterations);
+cpu_to_be32s(&hdr_copy->key_slots[i].key_offset);
+cpu_to_be32s(&hdr_copy->key_slots[i].stripes);
+}
+
+/* Write out the partition header and key slot headers */
+writefunc(block, 0, (const uint8_t *)hdr_copy, sizeof(*hdr_copy),
+  opaque, &local_err);
+
+g_free(hdr_copy);
+
+if (local_err) {
+error_propagate(errp, local_err);
+return -1;
+}
+return 0;
+}
+
+/*
+ * Loads the main LUKS header,and byteswaps it to native endianess
+ * And run basic sanity checks on it
+ */
+static int
+qcrypto_block_luks_load_header(QCryptoBlock *block,
+QCryptoBlockReadFunc readfunc,
+void *opaque,
+Error **errp)
+{
+ssize_t rv;
+size_t i;
+int ret = 0;
+QCryptoBlockLUKS *luks = block->opaque;
+
+/*
+ * Read the entire LUKS header, minus the key material from
+ * the underlying device
+ */
+
+rv = readfunc(block, 0,
+  (uint8_t *)&luks->header,
+  sizeof(luks->header),
+  opaque,
+  errp);
+if (rv < 0) {
+ret = rv;
+goto fail;
+}
+
+/*
+ * The header is always stored in big-endian format, so
+ * convert everything to native
+ */
+be16_to_cpus(&luks->header.version);
+be32_to_cpus(&luks->header.payload_offset);
+be32_to_cpus(&luks->header.key_bytes);
+be32_to_cpus(&luks->header.master_key_iterations);
+
+for (i = 0; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS; i++) {
+be32_to_cpus(&luks->header.key_slots[i].active);
+be32_to_cpus(&luks->header.key_slots[i].iterations);
+be32_to_cpus(&luks->header.key_slots[i].key_offset);
+be32_to_cpus(&luks->header.key_slots[i].stripes);
+}
+
+
+return 0;
+fail:
+return ret;
+}
+
+
+/*
+ * Does basic sanity checks on the LUKS header
+ */
+static int
+qcrypto_block_luks_check_header(QCryptoBlockLUKS *luks, Error **errp)
+{
+int ret;
+
+if (memcmp(luks->header.magic, qcrypto_block_luks_magic,
+   QCRYPTO_BLOCK_LUKS_MAGIC_LEN) != 0) {
+error_setg(errp, "Volume is not in LUKS format");
+ret = -EINVAL;
+goto fail;
+}
+
+if (luks->header.version != QCRYPTO_BLOCK_LUKS_VERSION) {
+error_setg(errp, "LUKS version %" PRIu32 " is not supported",
+   luks->header.version);
+ret = -ENOTSUP;
+goto fail;
+}
+
+return 0;
+fail:
+return ret;
+}
+
+
+/*
+ * Parses the crypto parameters tha

[Qemu-block] [PATCH 09/13] qcrypto-luks: implement the encryption key management

2019-08-14 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 crypto/block-luks.c | 374 +++-
 1 file changed, 373 insertions(+), 1 deletion(-)

diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 1997e92fe1..2c33643b52 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -72,6 +72,8 @@ typedef struct QCryptoBlockLUKSKeySlot 
QCryptoBlockLUKSKeySlot;
 
 #define QCRYPTO_BLOCK_LUKS_DEFAULT_ITER_TIME 2000
 
+#define QCRYPTO_BLOCK_LUKS_ERASE_ITERATIONS 40
+
 static const char qcrypto_block_luks_magic[QCRYPTO_BLOCK_LUKS_MAGIC_LEN] = {
 'L', 'U', 'K', 'S', 0xBA, 0xBE
 };
@@ -221,6 +223,9 @@ struct QCryptoBlockLUKS {
 
 /* Hash algorithm used in pbkdf2 function */
 QCryptoHashAlgorithm hash_alg;
+
+/* Name of the secret that was used to open the image */
+char *secret;
 };
 
 
@@ -1121,6 +1126,194 @@ qcrypto_block_luks_find_key(QCryptoBlock *block,
 }
 
 
+
+/*
+ * Returns true if a slot i is marked as containing as active
+ * (contains encrypted copy of the master key)
+ */
+
+static bool
+qcrypto_block_luks_slot_active(QCryptoBlockLUKS *luks, int slot_idx)
+{
+uint32_t val = luks->header.key_slots[slot_idx].active;
+return val ==  QCRYPTO_BLOCK_LUKS_KEY_SLOT_ENABLED;
+}
+
+/*
+ * Returns the number of slots that are marked as active
+ * (contains encrypted copy of the master key)
+ */
+
+static int
+qcrypto_block_luks_count_active_slots(QCryptoBlockLUKS *luks)
+{
+int i, ret = 0;
+
+for (i = 0; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS; i++) {
+if (qcrypto_block_luks_slot_active(luks, i)) {
+ret++;
+}
+}
+return ret;
+}
+
+
+/*
+ * Finds first key slot which is not active
+ * Returns the key slot index, or -1 if doesn't exist
+ */
+
+static int
+qcrypto_block_luks_find_free_keyslot(QCryptoBlockLUKS *luks)
+{
+uint i;
+
+for (i = 0; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS; i++) {
+if (!qcrypto_block_luks_slot_active(luks, i)) {
+return i;
+}
+}
+return -1;
+
+}
+
+/*
+ * Erases an keyslot given its index
+ *
+ * Returns:
+ *0 if the keyslot was erased successfully
+ *   -1 if a error occurred while erasing the keyslot
+ *
+ */
+
+static int
+qcrypto_block_luks_erase_key(QCryptoBlock *block,
+ uint slot_idx,
+ QCryptoBlockWriteFunc writefunc,
+ void *opaque,
+ Error **errp)
+{
+QCryptoBlockLUKS *luks = block->opaque;
+QCryptoBlockLUKSKeySlot *slot = &luks->header.key_slots[slot_idx];
+uint8_t *garbagekey = NULL;
+size_t splitkeylen = masterkeylen(luks) * slot->stripes;
+int i;
+int ret = -1;
+
+assert(slot_idx < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS);
+assert(splitkeylen > 0);
+
+garbagekey = g_malloc0(splitkeylen);
+
+/* Reset the key slot header */
+memset(slot->salt, 0, QCRYPTO_BLOCK_LUKS_SALT_LEN);
+slot->iterations = 0;
+slot->active = QCRYPTO_BLOCK_LUKS_KEY_SLOT_DISABLED;
+
+qcrypto_block_luks_store_header(block,  writefunc, opaque, errp);
+
+/*
+ * Now try to erase the key material, even if the header
+ * update failed
+ */
+
+for (i = 0 ; i < QCRYPTO_BLOCK_LUKS_ERASE_ITERATIONS ; i++) {
+if (qcrypto_random_bytes(garbagekey, splitkeylen, errp) < 0) {
+
+/*
+ * If we failed to get the random data, still write
+ * *something* to the key slot at least once
+ */
+
+if (i > 0) {
+goto cleanup;
+}
+}
+
+if (writefunc(block, slot->key_offset * QCRYPTO_BLOCK_LUKS_SECTOR_SIZE,
+  garbagekey,
+  splitkeylen,
+  opaque,
+  errp) != splitkeylen) {
+goto cleanup;
+}
+}
+
+ret = 0;
+cleanup:
+g_free(garbagekey);
+return ret;
+}
+
+
+/*
+ * Erase all the keys that match the given password
+ * Will stop when only one keyslot is remaining
+ * Returns 0 is some keys were erased or -1 on failure
+ */
+
+static int
+qcrypto_block_luks_erase_matching_keys(QCryptoBlock *block,
+ const char *password,
+ QCryptoBlockReadFunc readfunc,
+ QCryptoBlockWriteFunc writefunc,
+ void *opaque,
+ bool force,
+ Error **errp)
+{
+QCryptoBlockLUKS *luks = block->opaque;
+uint i;
+int rv, ret = -1;
+uint8_t *masterkey;
+uint erased_count = 0;
+uint active_slot_count = qcrypto_block_luks_count_active_slots(luks);
+
+masterkey = g_new0(uint8_t, masterkeylen(luks));
+
+for (i = 0; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS; i++) {
+
+/* refuse to erase last key if not for

[Qemu-block] [PATCH 11/13] block/qcow2: implement the encryption key management

2019-08-14 Thread Maxim Levitsky
This is the main purpose of the patch set: to enable
us to manage the luks-like header embedded in the qcow2
image, which the standard cryptsetup tools don't support.

Signed-off-by: Maxim Levitsky 
---
 block/qcow2.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 039bdc2f7e..a87e58f36a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -5086,6 +5086,31 @@ void qcow2_signal_corruption(BlockDriverState *bs, bool 
fatal, int64_t offset,
 s->signaled_corruption = true;
 }
 
+
+static int qcow2_setup_encryption(BlockDriverState *bs,
+  enum BlkSetupEncryptionAction action,
+  QCryptoEncryptionSetupOptions *options,
+  bool force,
+  Error **errp)
+{
+BDRVQcow2State *s = bs->opaque;
+
+if (!s->crypto) {
+error_setg(errp, "Can't manage encryption - image is not encrypted");
+return -EINVAL;
+}
+
+return qcrypto_block_setup_encryption(s->crypto,
+  qcow2_crypto_hdr_read_func,
+  qcow2_crypto_hdr_write_func,
+  bs,
+  action,
+  options,
+  force,
+  errp);
+}
+
+
 static QemuOptsList qcow2_create_opts = {
 .name = "qcow2-create-opts",
 .head = QTAILQ_HEAD_INITIALIZER(qcow2_create_opts.head),
@@ -5232,6 +5257,8 @@ BlockDriver bdrv_qcow2 = {
 .bdrv_reopen_bitmaps_rw = qcow2_reopen_bitmaps_rw,
 .bdrv_can_store_new_dirty_bitmap = qcow2_can_store_new_dirty_bitmap,
 .bdrv_remove_persistent_dirty_bitmap = 
qcow2_remove_persistent_dirty_bitmap,
+
+.bdrv_setup_encryption = qcow2_setup_encryption,
 };
 
 static void bdrv_qcow2_init(void)
-- 
2.17.2




[Qemu-block] [PATCH 07/13] block: add manage-encryption command (qmp and blockdev)

2019-08-14 Thread Maxim Levitsky
This adds:

* x-blockdev-update-encryption and x-blockdev-erase-encryption qmp commands
  Both commands take QCryptoEncryptionSetupOptions.
  x-blockdev-update-encryption is meant for non-destructive addition
  of key slots / whatever the encryption driver supports in the future.

  x-blockdev-erase-encryption is meant for destructive encryption key erase,
  in some cases even without a way to recover the data.


* bdrv_setup_encryption callback in the block driver
  This callback implements both of the above functions, selected by the
  'action' parameter.

* QCryptoEncryptionSetupOptions with the set of options that drivers can use
  for encryption management
  Currently it has all the options that LUKS needs, and later it can be
  extended (via a union) to support more encryption drivers if needed.

* blk_setup_encryption / bdrv_setup_encryption - the usual block layer wrappers.
  Note that bdrv_setup_encryption takes BlockDriverState and not BdrvChild,
  for ease of use from the qmp code. It is not expected that this function
  will be used by anything but the qmp and qemu-img code.


Signed-off-by: Maxim Levitsky 
---
 block/block-backend.c  |  9 
 block/io.c | 24 
 blockdev.c | 40 ++
 include/block/block.h  | 12 ++
 include/block/block_int.h  | 11 ++
 include/sysemu/block-backend.h |  7 ++
 qapi/block-core.json   | 36 ++
 qapi/crypto.json   | 26 ++
 8 files changed, 165 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 0056b526b8..1b75f28d0c 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2284,3 +2284,12 @@ const BdrvChild *blk_root(BlockBackend *blk)
 {
 return blk->root;
 }
+
+int blk_setup_encryption(BlockBackend *blk,
+ enum BlkSetupEncryptionAction action,
+ QCryptoEncryptionSetupOptions *options,
+ bool force,
+ Error **errp)
+{
+return bdrv_setup_encryption(blk->root->bs, action, options, force, errp);
+}
diff --git a/block/io.c b/block/io.c
index 06305c6ea6..50090afe68 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3256,3 +3256,27 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, 
PreallocMode prealloc,
 
 return tco.ret;
 }
+
+
+int bdrv_setup_encryption(BlockDriverState *bs,
+  enum BlkSetupEncryptionAction action,
+  QCryptoEncryptionSetupOptions *options,
+  bool force,
+  Error **errp)
+{
+Error *local_err = NULL;
+int ret;
+
+if (!(bs->open_flags & BDRV_O_RDWR)) {
+error_setg(errp, "Can't do key management on read only block device");
+return -ENOTSUP;
+}
+
+ret = bs->drv->bdrv_setup_encryption(bs, action, options, force,
+ &local_err);
+if (ret) {
+error_propagate(errp, local_err);
+return ret;
+}
+return 0;
+}
diff --git a/blockdev.c b/blockdev.c
index 4d141e9a1f..27be251656 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -4563,6 +4563,46 @@ void qmp_block_latency_histogram_set(
 }
 }
 
+void qmp_x_blockdev_update_encryption(const char *node_name,
+  bool has_force, bool force,
+  QCryptoEncryptionSetupOptions *options,
+  Error **errp)
+{
+BlockDriverState *bs = bdrv_find_node(node_name);
+Error *local_error = NULL;
+
+if (!bs) {
+error_setg(errp, "Cannot find node %s", node_name);
+return;
+}
+
+if (bdrv_setup_encryption(bs, BLK_UPDATE_ENCRYPTION, options,
+  has_force ? force : false, &local_error)) {
+error_propagate(errp, local_error);
+}
+}
+
+
+void qmp_x_blockdev_erase_encryption(const char *node_name,
+ bool has_force, bool force,
+ QCryptoEncryptionSetupOptions *options,
+ Error **errp)
+{
+BlockDriverState *bs = bdrv_find_node(node_name);
+Error *local_error = NULL;
+
+if (!bs) {
+error_setg(errp, "Cannot find node %s", node_name);
+return;
+}
+
+if (bdrv_setup_encryption(bs, BLK_ERASE_ENCRYPTION, options,
+  has_force ? force : false, &local_error)) {
+error_propagate(errp, local_error);
+}
+}
+
+
 QemuOptsList qemu_common_drive_opts = {
 .name = "drive",
 .head = QTAILQ_HEAD_INITIALIZER(qemu_common_drive_opts.head),
diff --git a/include/block/block.h b/include/block/block.h
index 50a07c1c33..b55ef4c416 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -276,6 +276,12 @@ enum {

[Qemu-block] [PATCH 08/13] qcrypto: add the plumbing for encryption management

2019-08-14 Thread Maxim Levitsky
This adds qcrypto_block_setup_encryption, which is a thin wrapper
around the setup_encryption callback of the crypto driver, which is
also added here.

Signed-off-by: Maxim Levitsky 
---
 crypto/block.c | 29 +
 crypto/blockpriv.h |  9 +
 include/crypto/block.h | 27 +++
 3 files changed, 65 insertions(+)

diff --git a/crypto/block.c b/crypto/block.c
index ee96759f7d..5916e49aba 100644
--- a/crypto/block.c
+++ b/crypto/block.c
@@ -20,6 +20,7 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+
 #include "blockpriv.h"
 #include "block-qcow.h"
 #include "block-luks.h"
@@ -282,6 +283,34 @@ void qcrypto_block_free(QCryptoBlock *block)
 }
 
 
+int qcrypto_block_setup_encryption(QCryptoBlock *block,
+   QCryptoBlockReadFunc readfunc,
+   QCryptoBlockWriteFunc writefunc,
+   void *opaque,
+   enum BlkSetupEncryptionAction action,
+   QCryptoEncryptionSetupOptions *options,
+   bool force,
+   Error **errp)
+{
+if (!block->driver->setup_encryption) {
+error_setg(errp,
+"Crypto format %s doesn't support management of encryption 
keys",
+QCryptoBlockFormat_str(block->format));
+return -1;
+}
+
+return block->driver->setup_encryption(block,
+   readfunc,
+   writefunc,
+   opaque,
+   action,
+   options,
+   force,
+   errp);
+}
+
+
+
 typedef int (*QCryptoCipherEncDecFunc)(QCryptoCipher *cipher,
 const void *in,
 void *out,
diff --git a/crypto/blockpriv.h b/crypto/blockpriv.h
index 71c59cb542..804965dca3 100644
--- a/crypto/blockpriv.h
+++ b/crypto/blockpriv.h
@@ -81,6 +81,15 @@ struct QCryptoBlockDriver {
 
 bool (*has_format)(const uint8_t *buf,
size_t buflen);
+
+int (*setup_encryption)(QCryptoBlock *block,
+QCryptoBlockReadFunc readfunc,
+QCryptoBlockWriteFunc writefunc,
+void *opaque,
+enum BlkSetupEncryptionAction action,
+QCryptoEncryptionSetupOptions *options,
+bool force,
+Error **errp);
 };
 
 
diff --git a/include/crypto/block.h b/include/crypto/block.h
index fe12899831..60d46e3efc 100644
--- a/include/crypto/block.h
+++ b/include/crypto/block.h
@@ -23,6 +23,7 @@
 
 #include "crypto/cipher.h"
 #include "crypto/ivgen.h"
+#include "block/block.h"
 
 typedef struct QCryptoBlock QCryptoBlock;
 
@@ -268,4 +269,30 @@ uint64_t qcrypto_block_get_sector_size(QCryptoBlock 
*block);
  */
 void qcrypto_block_free(QCryptoBlock *block);
 
+
+/**
+ * qcrypto_block_setup_encryption:
+ * @block: the block encryption object
+ *
+ * @readfunc: callback for reading data from the volume header
+ * @writefunc: callback for writing data to the volume header
+ * @opaque: data to pass to @readfunc and @writefunc
+ * @action: tell the driver the setup action (add/erase currently)
+ * @options: driver specific options, that specify
+ *   what encryption settings to manage
+ * @force: hint for the driver to allow unsafe operation
+ * @errp: error pointer
+ *
+ * Adds/Erases a new encryption key using @options
+ *
+ */
+int qcrypto_block_setup_encryption(QCryptoBlock *block,
+   QCryptoBlockReadFunc readfunc,
+   QCryptoBlockWriteFunc writefunc,
+   void *opaque,
+   enum BlkSetupEncryptionAction action,
+   QCryptoEncryptionSetupOptions *options,
+   bool force,
+   Error **errp);
+
 #endif /* QCRYPTO_BLOCK_H */
-- 
2.17.2



