Re: [Qemu-block] [PATCH v6 38/42] iotests: Let complete_and_wait() work with commit

2019-08-22 Thread Vladimir Sementsov-Ogievskiy
09.08.2019 19:14, Max Reitz wrote:
> complete_and_wait() and wait_ready() currently only work for mirror
> jobs.  Let them work for active commit jobs, too.
> 
> Signed-off-by: Max Reitz 
> ---
>   tests/qemu-iotests/iotests.py | 10 +++---
>   1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
> index 84438e837c..3ef846d1dc 100644
> --- a/tests/qemu-iotests/iotests.py
> +++ b/tests/qemu-iotests/iotests.py
> @@ -761,8 +761,12 @@ class QMPTestCase(unittest.TestCase):
>   
>   def wait_ready(self, drive='drive0'):
>   '''Wait until a block job BLOCK_JOB_READY event'''
> -f = {'data': {'type': 'mirror', 'device': drive } }
> -event = self.vm.event_wait(name='BLOCK_JOB_READY', match=f)
> +event = self.vm.events_wait([
> +('BLOCK_JOB_READY',
> + {'data': {'type': 'mirror', 'device': drive } }),
> +('BLOCK_JOB_READY',
> + {'data': {'type': 'commit', 'device': drive } })
> +])
>   
>   def wait_ready_and_cancel(self, drive='drive0'):
>   self.wait_ready(drive=drive)
> @@ -780,7 +784,7 @@ class QMPTestCase(unittest.TestCase):
>   self.assert_qmp(result, 'return', {})
>   
>   event = self.wait_until_completed(drive=drive)
> -self.assert_qmp(event, 'data/type', 'mirror')
> +self.assertTrue(event['data']['type'] in ['mirror', 'commit'])
>   
>   def pause_wait(self, job_id='job0'):
>   with Timeout(1, "Timeout waiting for job to pause"):
> 


Reviewed-by: Vladimir Sementsov-Ogievskiy 


-- 
Best regards,
Vladimir


Re: [Qemu-block] [PATCH 1/2] vhost-user-blk: prevent using uninitialized vqs

2019-08-22 Thread yuchenlin via Qemu-block
Raphael Norwitz  於 2019-08-23 04:16 寫道: > > Same 
rational as: e6cc11d64fc998c11a4dfcde8fda3fc33a74d844 > > Of the 3 virtqueues, 
seabios only sets cmd, leaving ctrl > and event without a physical address. 
This can cause > vhost_verify_ring_part_mapping to return ENOMEM, causing > the 
following logs: > > qemu-system-x86_64: Unable to map available ring for ring 0 
> qemu-system-x86_64: Verify ring failure on region 0 > > This has already been 
fixed for vhost scsi devices and was > recently vhost-user scsi devices. This 
commit fixes it for > vhost-user-blk devices. > > Suggested-by: Phillippe 
Mathieu-Daude  > Signed-off-by: Raphael Norwitz 
 Reviewed-by: yuchenlin  
Thanks. > > > --- > hw/block/vhost-user-blk.c | 2 +- > 1 file changed, 1 
insertion(+), 1 deletion(-) > > diff --git a/hw/block/vhost-user-blk.c 
b/hw/block/vhost-user-blk.c > index 0b8c5df..63da9bb 100644 > --- 
a/hw/block/vhost-user-blk.c > +++ b/hw/block/vhost-user-blk.c > @@ -421,7 
+421,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error 
**errp) > } > > s->inflight = g_new0(struct vhost_inflight, 1); > - s->vqs = 
g_new(struct vhost_virtqueue, s->num_queues); > + s->vqs = g_new0(struct 
vhost_virtqueue, s->num_queues); > s->watch = 0; > s->connected = false; > > -- 
> 1.9.4 > >


Re: [Qemu-block] [Qemu-devel] [PULL 2/3] tests: Run the iotests during "make check" again

2019-08-22 Thread Paolo Bonzini
On 17/08/19 10:54, Thomas Huth wrote:
> People often forget to run the iotests before submitting patches or pull
> requests - this is likely due to the fact that we do not run the tests
> during our mandatory "make check" tests yet. Now that we've got a proper
> "auto" group of iotests that should be fine to run in every environment,
> we can enable the iotests during "make check" again by running the "auto"
> tests by default from the check-block.sh script.
> 
> Some cases still need to be checked first, though: iotests need bash and
> GNU sed (otherwise they fail), and if gprof is enabled, it spoils the
> output of some test cases causing them to fail. So if we detect that one
> of the required programs is missing or that gprof is enabled, we still
> have to skip the iotests to avoid failures.
> 
> And finally, since we are using check-block.sh now again, this patch also
> removes the qemu-iotests-quick.sh script since we do not need that anymore
> (and having two shell wrapper scripts around the block tests seems rather
> confusing than helpful).
> 
> Message-Id: <20190717111947.30356-4-th...@redhat.com>
> Signed-off-by: Thomas Huth 
> [AJB: -makecheck to check-block.sh, move check-block to start and gate it]
> Signed-off-by: Alex Bennée 

This breaks when sanitizers are enabled.  There are leaks reported,
though I'm not sure if they are real, and in additions the warning lines
break qemu-iotests' output comparison.

Paolo



Re: [Qemu-block] [PATCH] block: workaround for unaligned byte range in fallocate()

2019-08-22 Thread Eric Blake
On 8/22/19 1:31 PM, Andrey Shinkevich wrote:
> Revert the commit 118f99442d 'block/io.c: fix for the allocation failure'
> and make better error handling for the file systems that do not support

s/make/use/

> fallocate() for the unaligned byte range. Allow falling back to pwrite

s/the/an/

> in case fallocate() returns EINVAL.
> 
> Suggested-by: Kevin Wolf 
> Suggested-by: Eric Blake 
> Signed-off-by: Andrey Shinkevich 
> ---
> Discussed in email thread with the message ID
> <1554474244-553661-1-git-send-email-andrey.shinkev...@virtuozzo.com>
> 
>  block/file-posix.c | 7 +++
>  block/io.c | 2 +-
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index fbeb006..2c254ff 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1588,6 +1588,13 @@ static int handle_aiocb_write_zeroes(void *opaque)
>  if (s->has_write_zeroes) {
>  int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
> aiocb->aio_offset, aiocb->aio_nbytes);
> +if (ret == -EINVAL) {
> +/*
> + * Allow falling back to pwrite for file systems that
> + * do not support fallocate() for unaligned byte range.

s/for/for an/

> + */
> +return -ENOTSUP;
> +}
>  if (ret == 0 || ret != -ENOTSUP) {
>  return ret;
>  }
> diff --git a/block/io.c b/block/io.c
> index 56bbf19..58f08cd 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1558,7 +1558,7 @@ static int coroutine_fn 
> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>  assert(!bs->supported_zero_flags);
>  }
>  
> -if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
> +if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
>  /* Fall back to bounce buffer if write zeroes is unsupported */
>  BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>  
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


[Qemu-block] [PATCH 1/2] vhost-user-blk: prevent using uninitialized vqs

2019-08-22 Thread Raphael Norwitz
Same rational as: e6cc11d64fc998c11a4dfcde8fda3fc33a74d844

Of the 3 virtqueues, seabios only sets cmd, leaving ctrl
and event without a physical address. This can cause
vhost_verify_ring_part_mapping to return ENOMEM, causing
the following logs:

qemu-system-x86_64: Unable to map available ring for ring 0
qemu-system-x86_64: Verify ring failure on region 0

This has already been fixed for vhost scsi devices and was
recently vhost-user scsi devices. This commit fixes it for
vhost-user-blk devices.

Suggested-by: Phillippe Mathieu-Daude 
Signed-off-by: Raphael Norwitz 
---
 hw/block/vhost-user-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 0b8c5df..63da9bb 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -421,7 +421,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, 
Error **errp)
 }
 
 s->inflight = g_new0(struct vhost_inflight, 1);
-s->vqs = g_new(struct vhost_virtqueue, s->num_queues);
+s->vqs = g_new0(struct vhost_virtqueue, s->num_queues);
 s->watch = 0;
 s->connected = false;
 
-- 
1.9.4




Re: [Qemu-block] [PATCH] block: workaround for unaligned byte range in fallocate()

2019-08-22 Thread Vladimir Sementsov-Ogievskiy
22.08.2019 21:55, Vladimir Sementsov-Ogievskiy wrote:
> 22.08.2019 21:31, Andrey Shinkevich wrote:
>> Revert the commit 118f99442d 'block/io.c: fix for the allocation failure'
>> and make better error handling for the file systems that do not support
>> fallocate() for the unaligned byte range. Allow falling back to pwrite
>> in case fallocate() returns EINVAL.
>>
>> Suggested-by: Kevin Wolf 
>> Suggested-by: Eric Blake 
>> Signed-off-by: Andrey Shinkevich 
>> ---
>> Discussed in email thread with the message ID
>> <1554474244-553661-1-git-send-email-andrey.shinkev...@virtuozzo.com>
>>
>>   block/file-posix.c | 7 +++
>>   block/io.c | 2 +-
>>   2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index fbeb006..2c254ff 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -1588,6 +1588,13 @@ static int j(void *opaque)
>>   if (s->has_write_zeroes) {
>>   int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
>>  aiocb->aio_offset, aiocb->aio_nbytes);
>> +    if (ret == -EINVAL) {
>> +    /*
>> + * Allow falling back to pwrite for file systems that
>> + * do not support fallocate() for unaligned byte range.
>> + */
>> +    return -ENOTSUP;
>> +    }
>>   if (ret == 0 || ret != -ENOTSUP) {
>>   return ret;
>>   }
> 
> Hmm stop, you've done exactly what Den was afraid of:
> 
> the next line
>    s->has_write_zeroes = false;
> 
> will disable write_zeroes forever.
> 
> Something like
> 
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1588,10 +1588,12 @@ static int handle_aiocb_write_zeroes(void *opaque)
>   if (s->has_write_zeroes) {
>   int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
>      aiocb->aio_offset, aiocb->aio_nbytes);
> -    if (ret == 0 || ret != -ENOTSUP) {
> +    if (ret == 0 || (ret != -ENOTSUP && ret != -EINVAL)) {
>   return ret;
>   }
> -    s->has_write_zeroes = false;
> +    if (ret == -ENOTSUP) {
> +    s->has_write_zeroes = false;
> +    }
>   }
>   #endif
> 
> 
> will work better. So, handle ENOTSUP as "disable write_zeros forever", and 
> EINVAL as
> "don't disable, but fallback to writing zeros". And we need same handling for 
> following do_fallocate() calls
> too (otherwise they again fails with EINVAL which will break the whole thing).
> 

Oops, sorry, I misread your patch, it's OK.

Still we may want to handle other do_fallocate() calls in same manner, or may 
be just:

@@ -1558,7 +1558,13 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
  assert(!bs->supported_zero_flags);
  }

-if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
+/*
+ * We are sure that our arguments make sense, so consider "invalid
+ * argument" in same manner as "not supported".
+ */
+if ((ret == -ENOTSUP || ret == -EINVAL) &&
+!(flags & BDRV_REQ_NO_FALLBACK))
+{
  /* Fall back to bounce buffer if write zeroes is unsupported */
  BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;




-- 
Best regards,
Vladimir


Re: [Qemu-block] [PATCH] block: gluster: Probe alignment limits

2019-08-22 Thread Nir Soffer
On Thu, Aug 22, 2019 at 10:03 AM Niels de Vos  wrote:

> On Wed, Aug 21, 2019 at 07:04:17PM +0200, Max Reitz wrote:
> > On 17.08.19 23:21, Nir Soffer wrote:
> > > Implement alignment probing similar to file-posix, by reading from the
> > > first 4k of the image.
> > >
> > > Before this change, provisioning a VM on storage with sector size of
> > > 4096 bytes would fail when the installer try to create filesystems.
> Here
> > > is an example command that reproduces this issue:
> > >
> > > $ qemu-system-x86_64 -accel kvm -m 2048 -smp 2 \
> > > -drive
> file=gluster://gluster1/gv0/fedora29.raw,format=raw,cache=none \
> > > -cdrom Fedora-Server-dvd-x86_64-29-1.2.iso
> > >
> > > The installer fails in few seconds when trying to create filesystem on
> > > /dev/mapper/fedora-root. In error report we can see that it failed with
> > > EINVAL (I could not extract the error from guest).
> > >
> > > Copying disk fails with EINVAL:
> > >
> > > $ qemu-img convert -p -f raw -O raw -t none -T none \
> > > gluster://gluster1/gv0/fedora29.raw \
> > > gluster://gluster1/gv0/fedora29-clone.raw
> > > qemu-img: error while writing sector 4190208: Invalid argument
> > >
> > > This is a fix to same issue fixed in commit a6b257a08e3d (file-posix:
> > > Handle undetectable alignment) for gluster:// images.
> > >
> > > This fix has the same limit, that the first block of the image should
> be
> > > allocated, otherwise we cannot detect the alignment and fallback to a
> > > safe value (4096) even when using storage with sector size of 512
> bytes.
> > >
> > > Signed-off-by: Nir Soffer 
> > > ---
> > >  block/gluster.c | 47 +++
> > >  1 file changed, 47 insertions(+)
> > >
> > > diff --git a/block/gluster.c b/block/gluster.c
> > > index f64dc5b01e..d936240b72 100644
> > > --- a/block/gluster.c
> > > +++ b/block/gluster.c
> > > @@ -52,6 +52,9 @@
> > >
> > >  #define GERR_INDEX_HINT "hint: check in 'server' array index '%d'\n"
> > >
> > > +/* The value is known only on the server side. */
> > > +#define MAX_ALIGN 4096
> > > +
> > >  typedef struct GlusterAIOCB {
> > >  int64_t size;
> > >  int ret;
> > > @@ -902,8 +905,52 @@ out:
> > >  return ret;
> > >  }
> > >
> > > +/*
> > > + * Check if read is allowed with given memory buffer and length.
> > > + *
> > > + * This function is used to check O_DIRECT request alignment.
> > > + */
> > > +static bool gluster_is_io_aligned(struct glfs_fd *fd, void *buf,
> size_t len)
> > > +{
> > > +ssize_t ret = glfs_pread(fd, buf, len, 0, 0, NULL);
> > > +return ret >= 0 || errno != EINVAL;
> >
> > Is glfs_pread() guaranteed to return EINVAL on invalid alignment?
> > file-posix says this is only the case on Linux (for normal files).  Now
> > I also don’t know whether the gluster driver works on anything but Linux
> > anyway.
>
> The behaviour depends on the filesystem used by the Gluster backend. XFS
> is the recommendation, but in the end it is up to the users. The Gluster
> server is known to work on Linux, NetBSD and FreeBSD, the vast majority
> of users runs it on Linux.
>
> I do not think there is a strong guarantee EINVAL is always correct. How
> about only checking 'ret > 0'?
>

Looks like we don't have a choice.

>
> > > +}
> > > +
> > > +static void gluster_probe_alignment(BlockDriverState *bs, struct
> glfs_fd *fd,
> > > +Error **errp)
> > > +{
> > > +char *buf;
> > > +size_t alignments[] = {1, 512, 1024, 2048, 4096};
> > > +size_t align;
> > > +int i;
> > > +
> > > +buf = qemu_memalign(MAX_ALIGN, MAX_ALIGN);
> > > +
> > > +for (i = 0; i < ARRAY_SIZE(alignments); i++) {
> > > +align = alignments[i];
> > > +if (gluster_is_io_aligned(fd, buf, align)) {
> > > +/* Fallback to safe value. */
> > > +bs->bl.request_alignment = (align != 1) ? align :
> MAX_ALIGN;
> > > +break;
> > > +}
> > > +}
> >
> > I don’t like the fact that the last element of alignments[] should be
> > the same as MAX_ALIGN, without that ever having been made explicit
> anywhere.
> >
> > It’s a bit worse in the file-posix patch, because if getpagesize() is
> > greater than 4k, max_align will be greater than 4k.  But MAX_BLOCKSIZE
> > is 4k, too, so I suppose we wouldn’t support any block size beyond that
> > anyway, which makes the error message appropriate still.
> >
> > > +
> > > +qemu_vfree(buf);
> > > +
> > > +if (!bs->bl.request_alignment) {
> > > +error_setg(errp, "Could not find working O_DIRECT alignment");
> > > +error_append_hint(errp, "Try cache.direct=off\n");
> > > +}
> > > +}
> > > +
> > >  static void qemu_gluster_refresh_limits(BlockDriverState *bs, Error
> **errp)
> > >  {
> > > +BDRVGlusterState *s = bs->opaque;
> > > +
> > > +gluster_probe_alignment(bs, s->fd, errp);
> > > +
> > > +bs->bl.min_mem_alignment = bs->bl.request_alignment;
> >
> > 

Re: [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-22 Thread Nir Soffer
On Thu, Aug 22, 2019 at 9:11 PM Max Reitz  wrote:

> On 22.08.19 18:39, Nir Soffer wrote:
> > On Thu, Aug 22, 2019 at 5:28 PM Max Reitz  > > wrote:
> >
> > On 16.08.19 23:21, Nir Soffer wrote:
> > > When creating an image with preallocation "off" or "falloc", the
> first
> > > block of the image is typically not allocated. When using Gluster
> > > storage backed by XFS filesystem, reading this block using direct
> I/O
> > > succeeds regardless of request length, fooling alignment detection.
> > >
> > > In this case we fallback to a safe value (4096) instead of the
> optimal
> > > value (512), which may lead to unneeded data copying when aligning
> > > requests.  Allocating the first block avoids the fallback.
> > >
> > > When using preallocation=off, we always allocate at least one
> > filesystem
> > > block:
> > >
> > > $ ./qemu-img create -f raw test.raw 1g
> > > Formatting 'test.raw', fmt=raw size=1073741824
> > >
> > > $ ls -lhs test.raw
> > > 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> > >
> > > I did quick performance tests for these flows:
> > > - Provisioning a VM with a new raw image.
> > > - Copying disks with qemu-img convert to new raw target image
> > >
> > > I installed Fedora 29 server on raw sparse image, measuring the
> time
> > > from clicking "Begin installation" until the "Reboot" button
> appears:
> > >
> > > Before(s)  After(s) Diff(%)
> > > ---
> > >  356389+8.4
> > >
> > > I ran this only once, so we cannot tell much from these results.
> >
> > So you’d expect it to be fast but it was slower?  Well, you only ran
> it
> > once and it isn’t really a precise benchmark...
> >
> > > The second test was cloning the installation image with qemu-img
> > > convert, doing 10 runs:
> > >
> > > for i in $(seq 10); do
> > > rm -f dst.raw
> > > sleep 10
> > > time ./qemu-img convert -f raw -O raw -t none -T none
> > src.raw dst.raw
> > > done
> > >
> > > Here is a table comparing the total time spent:
> > >
> > > TypeBefore(s)   After(s)Diff(%)
> > > ---
> > > real  530.028469.123  -11.4
> > > user   17.204 10.768  -37.4
> > > sys17.881  7.011  -60.7
> > >
> > > Here we see very clear improvement in CPU usage.
> > >
> > > Signed-off-by: Nir Soffer  > >
> > > ---
> > >  block/file-posix.c | 25 +
> > >  tests/qemu-iotests/150.out |  1 +
> > >  tests/qemu-iotests/160 |  4 
> > >  tests/qemu-iotests/175 | 19 +--
> > >  tests/qemu-iotests/175.out |  8 
> > >  tests/qemu-iotests/221.out | 12 
> > >  tests/qemu-iotests/253.out | 12 
> > >  7 files changed, 63 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/block/file-posix.c b/block/file-posix.c
> > > index b9c33c8f6c..3964dd2021 100644
> > > --- a/block/file-posix.c
> > > +++ b/block/file-posix.c
> > > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void
> *opaque)
> > >  return ret;
> > >  }
> > >
> > > +/*
> > > + * Help alignment detection by allocating the first block.
> > > + *
> > > + * When reading with direct I/O from unallocated area on Gluster
> > backed by XFS,
> > > + * reading succeeds regardless of request length. In this case we
> > fallback to
> > > + * safe aligment which is not optimal. Allocating the first block
> > avoids this
> > > + * fallback.
> > > + *
> > > + * Returns: 0 on success, -errno on failure.
> > > + */
> > > +static int allocate_first_block(int fd)
> > > +{
> > > +ssize_t n;
> > > +
> > > +do {
> > > +n = pwrite(fd, "\0", 1, 0);
> >
> > This breaks when fd has been opened with O_DIRECT.
> >
> >
> > It seems that we always open images without O_DIRECT when creating an
> image
> > in qemu-img create, or when creating a target image in qemu-img convert.
>
> Yes.  But you don’t call this function directly from image creation code
> but instead from the truncation function.  (The former also calls the
> latter, but truncating is also an operation on its own.)
>
> [...]
>
> > (Which happens when you open some file with cache.direct=on, and then
> > use e.g. QMP’s block_resize.)
> >
> >
> > What would be a command triggering this? I can add a test.
>
> block_resize, as I’ve said:
>
> $ ./qemu-img create -f raw empty.img 0
>

This is extreme edge case - why would someone create such image?


> $ x86_64-softmmu/qemu-system-x86_64 \
> -qmp stdio \
> -blockdev 

Re: [Qemu-block] [PATCH] block: workaround for unaligned byte range in fallocate()

2019-08-22 Thread Vladimir Sementsov-Ogievskiy
22.08.2019 21:31, Andrey Shinkevich wrote:
> Revert the commit 118f99442d 'block/io.c: fix for the allocation failure'
> and make better error handling for the file systems that do not support
> fallocate() for the unaligned byte range. Allow falling back to pwrite
> in case fallocate() returns EINVAL.
> 
> Suggested-by: Kevin Wolf 
> Suggested-by: Eric Blake 
> Signed-off-by: Andrey Shinkevich 
> ---
> Discussed in email thread with the message ID
> <1554474244-553661-1-git-send-email-andrey.shinkev...@virtuozzo.com>
> 
>   block/file-posix.c | 7 +++
>   block/io.c | 2 +-
>   2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index fbeb006..2c254ff 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1588,6 +1588,13 @@ static int handle_aiocb_write_zeroes(void *opaque)
>   if (s->has_write_zeroes) {
>   int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
>  aiocb->aio_offset, aiocb->aio_nbytes);
> +if (ret == -EINVAL) {
> +/*
> + * Allow falling back to pwrite for file systems that
> + * do not support fallocate() for unaligned byte range.
> + */
> +return -ENOTSUP;
> +}
>   if (ret == 0 || ret != -ENOTSUP) {
>   return ret;
>   }

Hmm stop, you've done exactly what Den was afraid of:

the next line
   s->has_write_zeroes = false;

will disable write_zeroes forever.

Something like

--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1588,10 +1588,12 @@ static int handle_aiocb_write_zeroes(void *opaque)
  if (s->has_write_zeroes) {
  int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
 aiocb->aio_offset, aiocb->aio_nbytes);
-if (ret == 0 || ret != -ENOTSUP) {
+if (ret == 0 || (ret != -ENOTSUP && ret != -EINVAL)) {
  return ret;
  }
-s->has_write_zeroes = false;
+if (ret == -ENOTSUP) {
+s->has_write_zeroes = false;
+}
  }
  #endif


will work better. So, handle ENOTSUP as "disable write_zeros forever", and 
EINVAL as
"don't disable, but fallback to writing zeros". And we need same handling for 
following do_fallocate() calls
too (otherwise they again fails with EINVAL which will break the whole thing).

> diff --git a/block/io.c b/block/io.c
> index 56bbf19..58f08cd 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1558,7 +1558,7 @@ static int coroutine_fn 
> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>   assert(!bs->supported_zero_flags);
>   }
>   
> -if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
> +if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
>   /* Fall back to bounce buffer if write zeroes is unsupported */
>   BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>   
> 


-- 
Best regards,
Vladimir


[Qemu-block] [PATCH] block: workaround for unaligned byte range in fallocate()

2019-08-22 Thread Andrey Shinkevich
Revert the commit 118f99442d 'block/io.c: fix for the allocation failure'
and make better error handling for the file systems that do not support
fallocate() for the unaligned byte range. Allow falling back to pwrite
in case fallocate() returns EINVAL.

Suggested-by: Kevin Wolf 
Suggested-by: Eric Blake 
Signed-off-by: Andrey Shinkevich 
---
Discussed in email thread with the message ID
<1554474244-553661-1-git-send-email-andrey.shinkev...@virtuozzo.com>

 block/file-posix.c | 7 +++
 block/io.c | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index fbeb006..2c254ff 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1588,6 +1588,13 @@ static int handle_aiocb_write_zeroes(void *opaque)
 if (s->has_write_zeroes) {
 int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
aiocb->aio_offset, aiocb->aio_nbytes);
+if (ret == -EINVAL) {
+/*
+ * Allow falling back to pwrite for file systems that
+ * do not support fallocate() for unaligned byte range.
+ */
+return -ENOTSUP;
+}
 if (ret == 0 || ret != -ENOTSUP) {
 return ret;
 }
diff --git a/block/io.c b/block/io.c
index 56bbf19..58f08cd 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1558,7 +1558,7 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 assert(!bs->supported_zero_flags);
 }
 
-if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
+if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
 /* Fall back to bounce buffer if write zeroes is unsupported */
 BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
 
-- 
1.8.3.1




Re: [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-22 Thread Max Reitz
On 22.08.19 18:39, Nir Soffer wrote:
> On Thu, Aug 22, 2019 at 5:28 PM Max Reitz  > wrote:
> 
> On 16.08.19 23:21, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests.  Allocating the first block avoids the fallback.
> >
> > When using preallocation=off, we always allocate at least one
> filesystem
> > block:
> >
> >     $ ./qemu-img create -f raw test.raw 1g
> >     Formatting 'test.raw', fmt=raw size=1073741824
> >
> >     $ ls -lhs test.raw
> >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >
> > I did quick performance tests for these flows:
> > - Provisioning a VM with a new raw image.
> > - Copying disks with qemu-img convert to new raw target image
> >
> > I installed Fedora 29 server on raw sparse image, measuring the time
> > from clicking "Begin installation" until the "Reboot" button appears:
> >
> > Before(s)  After(s)     Diff(%)
> > ---
> >      356        389        +8.4
> >
> > I ran this only once, so we cannot tell much from these results.
> 
> So you’d expect it to be fast but it was slower?  Well, you only ran it
> once and it isn’t really a precise benchmark...
> 
> > The second test was cloning the installation image with qemu-img
> > convert, doing 10 runs:
> >
> >     for i in $(seq 10); do
> >         rm -f dst.raw
> >         sleep 10
> >         time ./qemu-img convert -f raw -O raw -t none -T none
> src.raw dst.raw
> >     done
> >
> > Here is a table comparing the total time spent:
> >
> > Type    Before(s)   After(s)    Diff(%)
> > ---
> > real      530.028    469.123      -11.4
> > user       17.204     10.768      -37.4
> > sys        17.881      7.011      -60.7
> >
> > Here we see very clear improvement in CPU usage.
> >
> > Signed-off-by: Nir Soffer  >
> > ---
> >  block/file-posix.c         | 25 +
> >  tests/qemu-iotests/150.out |  1 +
> >  tests/qemu-iotests/160     |  4 
> >  tests/qemu-iotests/175     | 19 +--
> >  tests/qemu-iotests/175.out |  8 
> >  tests/qemu-iotests/221.out | 12 
> >  tests/qemu-iotests/253.out | 12 
> >  7 files changed, 63 insertions(+), 18 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index b9c33c8f6c..3964dd2021 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
> >      return ret;
> >  }
> > 
> > +/*
> > + * Help alignment detection by allocating the first block.
> > + *
> > + * When reading with direct I/O from unallocated area on Gluster
> backed by XFS,
> > + * reading succeeds regardless of request length. In this case we
> fallback to
> > + * safe aligment which is not optimal. Allocating the first block
> avoids this
> > + * fallback.
> > + *
> > + * Returns: 0 on success, -errno on failure.
> > + */
> > +static int allocate_first_block(int fd)
> > +{
> > +    ssize_t n;
> > +
> > +    do {
> > +        n = pwrite(fd, "\0", 1, 0);
> 
> This breaks when fd has been opened with O_DIRECT.
> 
> 
> It seems that we always open images without O_DIRECT when creating an image
> in qemu-img create, or when creating a target image in qemu-img convert.

Yes.  But you don’t call this function directly from image creation code
but instead from the truncation function.  (The former also calls the
latter, but truncating is also an operation on its own.)

[...]

> (Which happens when you open some file with cache.direct=on, and then
> use e.g. QMP’s block_resize.)
> 
> 
> What would be a command triggering this? I can add a test.

block_resize, as I’ve said:

$ ./qemu-img create -f raw empty.img 0
$ x86_64-softmmu/qemu-system-x86_64 \
-qmp stdio \
-blockdev file,node-name=file,filename=empty.img,cache.direct=on \
 < 
> It isn’t that bad because eventually you simply ignore the error.  But
> it still makes me wonder whether we shouldn’t write like the biggest
> power of two that does not exceed the new file length or MAX_BLOCKSIZE.
> 
> 
> It makes sense if there is a way to cause qemu-img to use O_DIRECT when
> 

Re: [Qemu-block] [Qemu-devel] [PATCH v2 00/12] block: qiov_offset parameter for io

2019-08-22 Thread Vladimir Sementsov-Ogievskiy
22.08.2019 20:39, Vladimir Sementsov-Ogievskiy wrote:
> 22.08.2019 20:24, Vladimir Sementsov-Ogievskiy wrote:
>> 22.08.2019 18:50, Stefan Hajnoczi wrote:
>>> On Tue, Jun 04, 2019 at 07:15:02PM +0300, Vladimir Sementsov-Ogievskiy 
>>> wrote:
 Hi all!

 Here is new parameter qiov_offset for io path, to avoid
 a lot of places with same pattern of creating local_qiov or hd_qiov
 variables.

 These series also includes my
 "[Qemu-devel] [PATCH 0/2] block/io: refactor padding"
 with some changes [described in 01 and 03 emails]

 Vladimir Sementsov-Ogievskiy (12):
    util/iov: introduce qemu_iovec_init_extended
    util/iov: improve qemu_iovec_is_zero
    block/io: refactor padding
    block: define .*_part io handlers in BlockDriver
    block/io: bdrv_co_do_copy_on_readv: use and support qiov_offset
    block/io: bdrv_co_do_copy_on_readv: lazy allocation
    block/io: bdrv_aligned_preadv: use and support qiov_offset
    block/io: bdrv_aligned_pwritev: use and support qiov_offset
    block/io: introduce bdrv_co_p{read,write}v_part
    block/qcow2: refactor qcow2_co_preadv to use buffer-based io
    block/qcow2: implement .bdrv_co_preadv_part
    block/qcow2: implement .bdrv_co_pwritev(_compressed)_part

   block/qcow2.h |   1 +
   include/block/block_int.h |  21 ++
   include/qemu/iov.h    |  10 +-
   block/backup.c    |   2 +-
   block/io.c    | 532 ++
   block/qcow2-cluster.c |  14 +-
   block/qcow2.c | 131 +-
   qemu-img.c    |   4 +-
   util/iov.c    | 153 +--
   9 files changed, 559 insertions(+), 309 deletions(-)
>>>
>>> qemu-iotests 077 fails after I apply this series (including your
>>> uninitialized variable fix).  I'm afraid I can't include it in the block
>>> pull request, sorry!
>>>
>>> Stefan
>>>
>>
>> Hmm, 77 don't work on master for me:
>> 077  fail   [20:23:57] [20:23:59]    output mismatch 
>> (see 077.out.bad)
>> --- /work/src/qemu/up-block-drop-hd-qiov/tests/qemu-iotests/077.out 
>> 2019-04-22 15:06:56.162045432 +0300
>> +++ /work/src/qemu/up-block-drop-hd-qiov/tests/qemu-iotests/077.out.bad 
>> 2019-08-22 20:23:59.124122307 +0300
>> @@ -1,7 +1,15 @@
>>   QA output created by 077
>> +==117186==WARNING: ASan doesn't fully support makecontext/swapcontext 
>> functions and may produce false positives in some cases!
>> +==117186==WARNING: ASan is ignoring requested __asan_handle_no_return: 
>> stack top: 0x7ffefc071000; bottom 0x7fad7277b000; size: 0x0051898f6000 
>> (350200225792)
>> +False positive error reports may follow
>> +For details see 
>> http://code.google.com/p/address-sanitizer/issues/detail?id=189
>>   Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
>>
>>   == Some concurrent requests involving RMW ==
>> +==117197==WARNING: ASan doesn't fully support makecontext/swapcontext 
>> functions and may produce false positives in some cases!
>> +==117197==WARNING: ASan is ignoring requested __asan_handle_no_return: 
>> stack top: 0x7fffa4fcc000; bottom 0x7fa93a2c1000; size: 0x00566ad0b000 
>> (371159248896)
>> +False positive error reports may follow
>> +For details see 
>> http://code.google.com/p/address-sanitizer/issues/detail?id=189
>>   wrote XXX/XXX bytes at offset XXX
>>   XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>   wrote XXX/XXX bytes at offset XXX
>> @@ -66,6 +74,10 @@
>>   XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>
>>   == Verify image content ==
>> +==117219==WARNING: ASan doesn't fully support makecontext/swapcontext 
>> functions and may produce false positives in some cases!
>> +==117219==WARNING: ASan is ignoring requested __asan_handle_no_return: 
>> stack top: 0x7ffc722de000; bottom 0x7f0848232000; size: 0x00f42a0ac000 
>> (1048677367808)
>> +False positive error reports may follow
>> +For details see 
>> http://code.google.com/p/address-sanitizer/issues/detail?id=189
>>   read 512/512 bytes at offset 0
>>   512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>   read 512/512 bytes at offset 512
>> @@ -156,5 +168,9 @@
>>   1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>   read 2048/2048 bytes at offset 71680
>>   2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>> +==117229==WARNING: ASan doesn't fully support makecontext/swapcontext 
>> functions and may produce false positives in some cases!
>> +==117229==WARNING: ASan is ignoring requested __asan_handle_no_return: 
>> stack top: 0x7fffa3342000; bottom 0x7fd85a275000; size: 0x0027490cd000 
>> (168729300992)
>> +False positive error reports may follow
>> +For details see 
>> http://code.google.com/p/address-sanitizer/issues/detail?id=189
>>   No errors were found on the image.
>>   *** done
>> Failures: 077
>> Failed 1 of 1 tests
>>
>>
>>
> 
> But after "block/io: 

Re: [Qemu-block] [Qemu-devel] [PATCH v2 00/12] block: qiov_offset parameter for io

2019-08-22 Thread Vladimir Sementsov-Ogievskiy
22.08.2019 20:24, Vladimir Sementsov-Ogievskiy wrote:
> 22.08.2019 18:50, Stefan Hajnoczi wrote:
>> On Tue, Jun 04, 2019 at 07:15:02PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> Here is new parameter qiov_offset for io path, to avoid
>>> a lot of places with same pattern of creating local_qiov or hd_qiov
>>> variables.
>>>
>>> These series also includes my
>>> "[Qemu-devel] [PATCH 0/2] block/io: refactor padding"
>>> with some changes [described in 01 and 03 emails]
>>>
>>> Vladimir Sementsov-Ogievskiy (12):
>>>    util/iov: introduce qemu_iovec_init_extended
>>>    util/iov: improve qemu_iovec_is_zero
>>>    block/io: refactor padding
>>>    block: define .*_part io handlers in BlockDriver
>>>    block/io: bdrv_co_do_copy_on_readv: use and support qiov_offset
>>>    block/io: bdrv_co_do_copy_on_readv: lazy allocation
>>>    block/io: bdrv_aligned_preadv: use and support qiov_offset
>>>    block/io: bdrv_aligned_pwritev: use and support qiov_offset
>>>    block/io: introduce bdrv_co_p{read,write}v_part
>>>    block/qcow2: refactor qcow2_co_preadv to use buffer-based io
>>>    block/qcow2: implement .bdrv_co_preadv_part
>>>    block/qcow2: implement .bdrv_co_pwritev(_compressed)_part
>>>
>>>   block/qcow2.h |   1 +
>>>   include/block/block_int.h |  21 ++
>>>   include/qemu/iov.h    |  10 +-
>>>   block/backup.c    |   2 +-
>>>   block/io.c    | 532 ++
>>>   block/qcow2-cluster.c |  14 +-
>>>   block/qcow2.c | 131 +-
>>>   qemu-img.c    |   4 +-
>>>   util/iov.c    | 153 +--
>>>   9 files changed, 559 insertions(+), 309 deletions(-)
>>
>> qemu-iotests 077 fails after I apply this series (including your
>> uninitialized variable fix).  I'm afraid I can't include it in the block
>> pull request, sorry!
>>
>> Stefan
>>
> 
> Hmm, 77 don't work on master for me:
> 077  fail   [20:23:57] [20:23:59]    output mismatch 
> (see 077.out.bad)
> --- /work/src/qemu/up-block-drop-hd-qiov/tests/qemu-iotests/077.out 
> 2019-04-22 15:06:56.162045432 +0300
> +++ /work/src/qemu/up-block-drop-hd-qiov/tests/qemu-iotests/077.out.bad 
> 2019-08-22 20:23:59.124122307 +0300
> @@ -1,7 +1,15 @@
>   QA output created by 077
> +==117186==WARNING: ASan doesn't fully support makecontext/swapcontext 
> functions and may produce false positives in some cases!
> +==117186==WARNING: ASan is ignoring requested __asan_handle_no_return: stack 
> top: 0x7ffefc071000; bottom 0x7fad7277b000; size: 0x0051898f6000 
> (350200225792)
> +False positive error reports may follow
> +For details see 
> http://code.google.com/p/address-sanitizer/issues/detail?id=189
>   Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
> 
>   == Some concurrent requests involving RMW ==
> +==117197==WARNING: ASan doesn't fully support makecontext/swapcontext 
> functions and may produce false positives in some cases!
> +==117197==WARNING: ASan is ignoring requested __asan_handle_no_return: stack 
> top: 0x7fffa4fcc000; bottom 0x7fa93a2c1000; size: 0x00566ad0b000 
> (371159248896)
> +False positive error reports may follow
> +For details see 
> http://code.google.com/p/address-sanitizer/issues/detail?id=189
>   wrote XXX/XXX bytes at offset XXX
>   XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   wrote XXX/XXX bytes at offset XXX
> @@ -66,6 +74,10 @@
>   XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> 
>   == Verify image content ==
> +==117219==WARNING: ASan doesn't fully support makecontext/swapcontext 
> functions and may produce false positives in some cases!
> +==117219==WARNING: ASan is ignoring requested __asan_handle_no_return: stack 
> top: 0x7ffc722de000; bottom 0x7f0848232000; size: 0x00f42a0ac000 
> (1048677367808)
> +False positive error reports may follow
> +For details see 
> http://code.google.com/p/address-sanitizer/issues/detail?id=189
>   read 512/512 bytes at offset 0
>   512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   read 512/512 bytes at offset 512
> @@ -156,5 +168,9 @@
>   1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   read 2048/2048 bytes at offset 71680
>   2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +==117229==WARNING: ASan doesn't fully support makecontext/swapcontext 
> functions and may produce false positives in some cases!
> +==117229==WARNING: ASan is ignoring requested __asan_handle_no_return: stack 
> top: 0x7fffa3342000; bottom 0x7fd85a275000; size: 0x0027490cd000 
> (168729300992)
> +False positive error reports may follow
> +For details see 
> http://code.google.com/p/address-sanitizer/issues/detail?id=189
>   No errors were found on the image.
>   *** done
> Failures: 077
> Failed 1 of 1 tests
> 
> 
> 

But after "block/io: refactor padding" it hangs instead of this fail.. This is 
not good


-- 
Best regards,
Vladimir



Re: [Qemu-block] [Qemu-devel] [PATCH v2 00/12] block: qiov_offset parameter for io

2019-08-22 Thread Vladimir Sementsov-Ogievskiy
22.08.2019 18:50, Stefan Hajnoczi wrote:
> On Tue, Jun 04, 2019 at 07:15:02PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> Here is new parameter qiov_offset for io path, to avoid
>> a lot of places with same pattern of creating local_qiov or hd_qiov
>> variables.
>>
>> These series also includes my
>> "[Qemu-devel] [PATCH 0/2] block/io: refactor padding"
>> with some changes [described in 01 and 03 emails]
>>
>> Vladimir Sementsov-Ogievskiy (12):
>>util/iov: introduce qemu_iovec_init_extended
>>util/iov: improve qemu_iovec_is_zero
>>block/io: refactor padding
>>block: define .*_part io handlers in BlockDriver
>>block/io: bdrv_co_do_copy_on_readv: use and support qiov_offset
>>block/io: bdrv_co_do_copy_on_readv: lazy allocation
>>block/io: bdrv_aligned_preadv: use and support qiov_offset
>>block/io: bdrv_aligned_pwritev: use and support qiov_offset
>>block/io: introduce bdrv_co_p{read,write}v_part
>>block/qcow2: refactor qcow2_co_preadv to use buffer-based io
>>block/qcow2: implement .bdrv_co_preadv_part
>>block/qcow2: implement .bdrv_co_pwritev(_compressed)_part
>>
>>   block/qcow2.h |   1 +
>>   include/block/block_int.h |  21 ++
>>   include/qemu/iov.h|  10 +-
>>   block/backup.c|   2 +-
>>   block/io.c| 532 ++
>>   block/qcow2-cluster.c |  14 +-
>>   block/qcow2.c | 131 +-
>>   qemu-img.c|   4 +-
>>   util/iov.c| 153 +--
>>   9 files changed, 559 insertions(+), 309 deletions(-)
> 
> qemu-iotests 077 fails after I apply this series (including your
> uninitialized variable fix).  I'm afraid I can't include it in the block
> pull request, sorry!
> 
> Stefan
> 

Hmm, 77 don't work on master for me:
077  fail   [20:23:57] [20:23:59]output mismatch 
(see 077.out.bad)
--- /work/src/qemu/up-block-drop-hd-qiov/tests/qemu-iotests/077.out 
2019-04-22 15:06:56.162045432 +0300
+++ /work/src/qemu/up-block-drop-hd-qiov/tests/qemu-iotests/077.out.bad 
2019-08-22 20:23:59.124122307 +0300
@@ -1,7 +1,15 @@
  QA output created by 077
+==117186==WARNING: ASan doesn't fully support makecontext/swapcontext 
functions and may produce false positives in some cases!
+==117186==WARNING: ASan is ignoring requested __asan_handle_no_return: stack 
top: 0x7ffefc071000; bottom 0x7fad7277b000; size: 0x0051898f6000 (350200225792)
+False positive error reports may follow
+For details see http://code.google.com/p/address-sanitizer/issues/detail?id=189
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728

  == Some concurrent requests involving RMW ==
+==117197==WARNING: ASan doesn't fully support makecontext/swapcontext 
functions and may produce false positives in some cases!
+==117197==WARNING: ASan is ignoring requested __asan_handle_no_return: stack 
top: 0x7fffa4fcc000; bottom 0x7fa93a2c1000; size: 0x00566ad0b000 (371159248896)
+False positive error reports may follow
+For details see http://code.google.com/p/address-sanitizer/issues/detail?id=189
  wrote XXX/XXX bytes at offset XXX
  XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote XXX/XXX bytes at offset XXX
@@ -66,6 +74,10 @@
  XXX bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)

  == Verify image content ==
+==117219==WARNING: ASan doesn't fully support makecontext/swapcontext 
functions and may produce false positives in some cases!
+==117219==WARNING: ASan is ignoring requested __asan_handle_no_return: stack 
top: 0x7ffc722de000; bottom 0x7f0848232000; size: 0x00f42a0ac000 (1048677367808)
+False positive error reports may follow
+For details see http://code.google.com/p/address-sanitizer/issues/detail?id=189
  read 512/512 bytes at offset 0
  512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  read 512/512 bytes at offset 512
@@ -156,5 +168,9 @@
  1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  read 2048/2048 bytes at offset 71680
  2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+==117229==WARNING: ASan doesn't fully support makecontext/swapcontext 
functions and may produce false positives in some cases!
+==117229==WARNING: ASan is ignoring requested __asan_handle_no_return: stack 
top: 0x7fffa3342000; bottom 0x7fd85a275000; size: 0x0027490cd000 (168729300992)
+False positive error reports may follow
+For details see http://code.google.com/p/address-sanitizer/issues/detail?id=189
  No errors were found on the image.
  *** done
Failures: 077
Failed 1 of 1 tests



-- 
Best regards,
Vladimir



Re: [Qemu-block] [Qemu-devel] [PATCH 0/2] block/file-posix: Fix xfs_write_zeroes()

2019-08-22 Thread Max Reitz
On 22.08.19 18:53, Paolo Bonzini wrote:
> On 22/08/19 18:26, Max Reitz wrote:
>> Lukàš ran over a nasty regression in our xfs_write_zeroes() function
>> (sorry, my fault) made apparent by a recent patch from Anton that makes
>> qcow2 images heavily exercise the offending code path.
>>
>> This series fixes the bug and adds a test to prevent it from
>> reoccurring.
>>
>>
>> Max Reitz (2):
>>   block/file-posix: Fix xfs_write_zeroes()
>>   iotests: Test reverse sub-cluster qcow2 writes
>>
>>  block/file-posix.c | 16 ++---
>>  tests/qemu-iotests/265 | 67 ++
>>  tests/qemu-iotests/265.out |  6 
>>  tests/qemu-iotests/group   |  1 +
>>  4 files changed, 85 insertions(+), 5 deletions(-)
>>  create mode 100755 tests/qemu-iotests/265
>>  create mode 100644 tests/qemu-iotests/265.out
>>
> 
> What about just killing libxfs support and only use fallocate?
> FALLOC_FL_ZERO_RANGE was added in Linux 3.15 (2014) and the only
> platform we probably support with such an old kernel is of course
> RHEL/CentOS 7 which has had it backported.

Works just as well for me. :-)

Max



Re: [Qemu-block] [Qemu-devel] [PATCH 0/2] block/file-posix: Fix xfs_write_zeroes()

2019-08-22 Thread Paolo Bonzini
On 22/08/19 18:26, Max Reitz wrote:
> Lukàš ran over a nasty regression in our xfs_write_zeroes() function
> (sorry, my fault) made apparent by a recent patch from Anton that makes
> qcow2 images heavily exercise the offending code path.
> 
> This series fixes the bug and adds a test to prevent it from
> reoccurring.
> 
> 
> Max Reitz (2):
>   block/file-posix: Fix xfs_write_zeroes()
>   iotests: Test reverse sub-cluster qcow2 writes
> 
>  block/file-posix.c | 16 ++---
>  tests/qemu-iotests/265 | 67 ++
>  tests/qemu-iotests/265.out |  6 
>  tests/qemu-iotests/group   |  1 +
>  4 files changed, 85 insertions(+), 5 deletions(-)
>  create mode 100755 tests/qemu-iotests/265
>  create mode 100644 tests/qemu-iotests/265.out
> 

What about just killing libxfs support and only use fallocate?
FALLOC_FL_ZERO_RANGE was added in Linux 3.15 (2014) and the only
platform we probably support with such an old kernel is of course
RHEL/CentOS 7 which has had it backported.

Paolo



Re: [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-22 Thread Nir Soffer
On Thu, Aug 22, 2019 at 5:28 PM Max Reitz  wrote:

> On 16.08.19 23:21, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests.  Allocating the first block avoids the fallback.
> >
> > When using preallocation=off, we always allocate at least one filesystem
> > block:
> >
> > $ ./qemu-img create -f raw test.raw 1g
> > Formatting 'test.raw', fmt=raw size=1073741824
> >
> > $ ls -lhs test.raw
> > 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >
> > I did quick performance tests for these flows:
> > - Provisioning a VM with a new raw image.
> > - Copying disks with qemu-img convert to new raw target image
> >
> > I installed Fedora 29 server on raw sparse image, measuring the time
> > from clicking "Begin installation" until the "Reboot" button appears:
> >
> > Before(s)  After(s) Diff(%)
> > ---
> >  356389+8.4
> >
> > I ran this only once, so we cannot tell much from these results.
>
> So you’d expect it to be fast but it was slower?  Well, you only ran it
> once and it isn’t really a precise benchmark...
>
> > The second test was cloning the installation image with qemu-img
> > convert, doing 10 runs:
> >
> > for i in $(seq 10); do
> > rm -f dst.raw
> > sleep 10
> > time ./qemu-img convert -f raw -O raw -t none -T none src.raw
> dst.raw
> > done
> >
> > Here is a table comparing the total time spent:
> >
> > TypeBefore(s)   After(s)Diff(%)
> > ---
> > real  530.028469.123  -11.4
> > user   17.204 10.768  -37.4
> > sys17.881  7.011  -60.7
> >
> > Here we see very clear improvement in CPU usage.
> >
> > Signed-off-by: Nir Soffer 
> > ---
> >  block/file-posix.c | 25 +
> >  tests/qemu-iotests/150.out |  1 +
> >  tests/qemu-iotests/160 |  4 
> >  tests/qemu-iotests/175 | 19 +--
> >  tests/qemu-iotests/175.out |  8 
> >  tests/qemu-iotests/221.out | 12 
> >  tests/qemu-iotests/253.out | 12 
> >  7 files changed, 63 insertions(+), 18 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index b9c33c8f6c..3964dd2021 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
> >  return ret;
> >  }
> >
> > +/*
> > + * Help alignment detection by allocating the first block.
> > + *
> > + * When reading with direct I/O from unallocated area on Gluster backed
> by XFS,
> > + * reading succeeds regardless of request length. In this case we
> fallback to
> > + * safe aligment which is not optimal. Allocating the first block
> avoids this
> > + * fallback.
> > + *
> > + * Returns: 0 on success, -errno on failure.
> > + */
> > +static int allocate_first_block(int fd)
> > +{
> > +ssize_t n;
> > +
> > +do {
> > +n = pwrite(fd, "\0", 1, 0);
>
> This breaks when fd has been opened with O_DIRECT.
>

It seems that we always open images without O_DIRECT when creating an image
in qemu-img create, or when creating a target image in qemu-img convert.

Here is a trace of qemu-img create:

$ strace -f -tt -o /tmp/create.trace ./qemu-img create -f raw -o
preallocation=falloc /tmp/gv1/src.raw 1g
Formatting '/tmp/gv1/src.raw', fmt=raw size=1073741824 preallocation=falloc

1. open #1

17686 18:58:23.681921 openat(AT_FDCWD, "/tmp/gv1/src.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9
17686 18:58:23.683753 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
17686 18:58:23.683829 close(9)  = 0

2. open #2

17686 18:58:23.684146 openat(AT_FDCWD, "/tmp/gv1/src.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9
17686 18:58:23.684227 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
17686 18:58:23.684256 close(9)  = 0

3. open #3

17686 18:58:23.684339 openat(AT_FDCWD, "/tmp/gv1/src.raw",
O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 9
...
17688 18:58:23.690178 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
17688 18:58:23.690217 ftruncate(9, 0 
...
17688 18:58:23.700266 <... ftruncate resumed>) = 0
...
17688 18:58:23.700595 fstat(9,  
...
17688 18:58:23.700619 <... fstat resumed>{st_mode=S_IFREG|0600, st_size=0,
...}) = 0
...
17688 18:58:23.700651 fallocate(9, 0, 0, 1073741824 
...
17688 18:58:23.728141 <... fallocate resumed>) = 0
...
17688 18:58:23.728196 pwrite64(9, "\0", 1, 0) = 1
...
17686 18:58:23.738391 close(9)  = 0

Here is convert trace:

$ strace -f -tt -o /tmp/convert.trace 

Re: [Qemu-block] [PATCH] pr-manager: Fix invalid g_free() crash bug

2019-08-22 Thread Paolo Bonzini
On 22/08/19 15:38, Markus Armbruster wrote:
> pr_manager_worker() passes its @opaque argument to g_free().  Wrong;
> it points to pr_manager_worker()'s automatic @data.  Broken when
> commit 2f3a7ab39be converted @data from heap- to stack-allocated.  Fix
> by deleting the g_free().
> 
> Fixes: 2f3a7ab39bec4ba8022dc4d42ea641165b004e3e
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Markus Armbruster 
> ---
>  scsi/pr-manager.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/scsi/pr-manager.c b/scsi/pr-manager.c
> index ee43663576..0c866e8698 100644
> --- a/scsi/pr-manager.c
> +++ b/scsi/pr-manager.c
> @@ -39,7 +39,6 @@ static int pr_manager_worker(void *opaque)
>  int fd = data->fd;
>  int r;
>  
> -g_free(data);
>  trace_pr_manager_run(fd, hdr->cmdp[0], hdr->cmdp[1]);
>  
>  /* The reference was taken in pr_manager_execute.  */
> 

Acked-by: Paolo Bonzini 

Since I am disappearing soon, I wouldn't mind if the block layer people
picked this up.

Paolo



Re: [Qemu-block] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-22 Thread Max Reitz
On 22.08.19 17:40, Max Reitz wrote:
> On 22.08.19 17:25, Max Reitz wrote:
>> On 22.08.19 14:09, Max Reitz wrote:
>>> (CC-ing Paolo because of the XFS connection, and Stefan because why not.)
>>>
>>> On 22.08.19 13:27, Lukáš Doktor wrote:
 Dne 21. 08. 19 v 19:51 Max Reitz napsal(a):
> On 21.08.19 16:14, Lukáš Doktor wrote:
>> Hello guys,
>>
>> First attempt was rejected due to zip attachment, let's try it again 
>> with just Avocado-vt debug.log and serial console log files attached.
>>
>> I bisected a regression on aarch64 all the way to this commit: "qcow2: 
>> skip writing zero buffers to empty COW areas" 
>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look 
>> at it?
>
> I think I can see the issue on my x64 system (I don’t see the XFS
> corruption, but the installation fails because of some segfaults).
>
> I haven’t found a simpler way to reproduce the problem yet, though,
> which is a pain... :-/
>
> It looks like the problem disappears when I configure qemu with
> “--disable-xfsctl”.  Can you try that?
>
> Max
>

 Hello Max,

 yes, I'm getting the same behavior. With "--disable-xfsctl" it works well. 
 Also looking at the option I understand why it only failed on aarch64 for 
 me, I don't have libs installed on the other machines, therefor it was 
 disabled by "./configure" there. Anyway I guess disabling it in my builds 
 won't really fix the issue, right? :-)
>>>
>>> Thanks!
>>>
>>> No, it won’t, but it means the actual root of the problem is probably
>>> rather in some XFS-related code (be it because qemu uses it the wrong
>>> way or because of XFS kernel code) than in the pure qcow2 commit that
>>> made the problem surface by exercising it heavily.  (Or in an
>>> interaction between the two.)
>>
>> OK, I got a simpler reproducer now:
>>
>> $ ./qemu-img create -f qcow2 test.qcow2 1M
>> $ (for i in $(seq 15 -1 0); do \
>>echo "aio_write -P 42 $((i * 64 + 1))k 62k"; \
>>done) \
>>   | ./qemu-io test.qcow2
>> $ for i in $(seq 0 15); do \
>>   echo $i; \
>>   ofs=$((i * 64)); \
>>   ./qemu-io -c "read -P 0 ${ofs}k 1k" \
>> -c "read -P 42 $((ofs + 1))k 62k" \
>> -c "read -P 0 $((ofs + 63))k 1k" \
>> test.qcow2 \
>>   | grep 'verification'; \
>>   done
>>
>> On XFS with --enable-xfsctl, this basically always gives me some
>> verification failure somewhere.  (On tmpfs or with --disable-xfsctl, it
>> never fails.)
>>
>> So it seems to be related to I/O from back to front.
>>
>> (You can also reproduce it with a plain “qemu-img bench” invocation,
>> like “./qemu-img bench -w --pattern=42 -o 1k -S 64k -s 62k test.qcow2”
>> (on, say, a 4 GB image), but then the failure appears much later in the
>> image, because you have to wait from some requests to come in reverse
>> (by chance) first.)
> 
> The problem is the ftruncate() in xfs_write_zeroes().  It is possible
> for it to yield, then other requests come in, and the data they write
> may get discarded once the ftruncate() settles.

I’ve just sent a patch: “block/file-posix: Fix xfs_write_zeroes()”,
Message-ID <20190822162618.27670-1-mre...@redhat.com>:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg01148.html

Max



[Qemu-block] [PATCH 2/2] iotests: Test reverse sub-cluster qcow2 writes

2019-08-22 Thread Max Reitz
This exercises the regression introduced in commit
50ba5b2d994853b38fed10e0841b119da0f8b8e5.  On my machine, it has close
to a 50 % false-negative rate, but that should still be sufficient to
test the fix.

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/265 | 67 ++
 tests/qemu-iotests/265.out |  6 
 tests/qemu-iotests/group   |  1 +
 3 files changed, 74 insertions(+)
 create mode 100755 tests/qemu-iotests/265
 create mode 100644 tests/qemu-iotests/265.out

diff --git a/tests/qemu-iotests/265 b/tests/qemu-iotests/265
new file mode 100755
index 00..dce6f77be3
--- /dev/null
+++ b/tests/qemu-iotests/265
@@ -0,0 +1,67 @@
+#!/usr/bin/env bash
+#
+# Test reverse-ordered qcow2 writes on a sub-cluster level
+#
+# Copyright (C) 2019 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+seq=$(basename $0)
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+_cleanup()
+{
+_cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+# qcow2-specific test
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+echo '--- Writing to the image ---'
+
+# Reduce cluster size so we get more and quicker I/O
+IMGOPTS='cluster_size=4096' _make_test_img 1M
+(for ((kb = 1024 - 4; kb >= 0; kb -= 4)); do \
+ echo "aio_write -P 42 $((kb + 1))k 2k"; \
+ done) \
+ | $QEMU_IO "$TEST_IMG" > /dev/null
+
+echo '--- Verifying its content ---'
+
+(for ((kb = 0; kb < 1024; kb += 4)); do \
+echo "read -P 0 ${kb}k 1k"; \
+echo "read -P 42 $((kb + 1))k 2k"; \
+echo "read -P 0 $((kb + 3))k 1k"; \
+ done) \
+ | $QEMU_IO "$TEST_IMG" | _filter_qemu_io | grep 'verification'
+
+# Status of qemu-io
+if [ ${PIPESTATUS[1]} = 0 ]; then
+echo 'Content verified.'
+fi
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/265.out b/tests/qemu-iotests/265.out
new file mode 100644
index 00..6eac620f25
--- /dev/null
+++ b/tests/qemu-iotests/265.out
@@ -0,0 +1,6 @@
+QA output created by 265
+--- Writing to the image ---
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
+--- Verifying its content ---
+Content verified.
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index d95d556414..0c129c1644 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -274,3 +274,4 @@
 257 rw
 258 rw quick
 262 rw quick migration
+265 rw auto quick
-- 
2.21.0




[Qemu-block] [PATCH 0/2] block/file-posix: Fix xfs_write_zeroes()

2019-08-22 Thread Max Reitz
Lukàš ran over a nasty regression in our xfs_write_zeroes() function
(sorry, my fault) made apparent by a recent patch from Anton that makes
qcow2 images heavily exercise the offending code path.

This series fixes the bug and adds a test to prevent it from
reoccurring.


Max Reitz (2):
  block/file-posix: Fix xfs_write_zeroes()
  iotests: Test reverse sub-cluster qcow2 writes

 block/file-posix.c | 16 ++---
 tests/qemu-iotests/265 | 67 ++
 tests/qemu-iotests/265.out |  6 
 tests/qemu-iotests/group   |  1 +
 4 files changed, 85 insertions(+), 5 deletions(-)
 create mode 100755 tests/qemu-iotests/265
 create mode 100644 tests/qemu-iotests/265.out

-- 
2.21.0




[Qemu-block] [PATCH 1/2] block/file-posix: Fix xfs_write_zeroes()

2019-08-22 Thread Max Reitz
Calling ftruncate() in xfs_write_zeroes() is dangerous because it may
yield and then discard data that parallel write requests have written
past the old EOF in the meantime.  We must not use it here.

Instead, return -ENOTSUP and let the more generic fallocate code handle
writing zeroes past the EOF.

Reported-by: Lukáš Doktor 
Fixes: 50ba5b2d994853b38fed10e0841b119da0f8b8e5
Cc: qemu-sta...@nongnu.org
Signed-off-by: Max Reitz 
---
 block/file-posix.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index fbeb0068db..b49e0784a4 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1472,10 +1472,13 @@ static int xfs_write_zeroes(BDRVRawState *s, int64_t 
offset, uint64_t bytes)
 }
 
 if (offset + bytes > len) {
-/* XFS_IOC_ZERO_RANGE does not increase the file length */
-if (ftruncate(s->fd, offset + bytes) < 0) {
-return -errno;
-}
+/*
+ * XFS_IOC_ZERO_RANGE does not increase the file length, but
+ * the caller probably wants us to.
+ * Calling ftruncate() would not be safe, so let the generic
+ * implementation handle this case.
+ */
+return -ENOTSUP;
 }
 
 memset(, 0, sizeof(fl));
@@ -1580,7 +1583,10 @@ static int handle_aiocb_write_zeroes(void *opaque)
 
 #ifdef CONFIG_XFS
 if (s->is_xfs) {
-return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+int ret = xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+if (ret != -ENOTSUP) {
+return ret;
+}
 }
 #endif
 
-- 
2.21.0




[Qemu-block] [PULL 1/2] util/async: hold AioContext ref to prevent use-after-free

2019-08-22 Thread Stefan Hajnoczi
The tests/test-bdrv-drain /bdrv-drain/iothread/drain test case does the
following:

1. The preadv coroutine calls aio_bh_schedule_oneshot() and then yields.
2. The one-shot BH executes in another AioContext.  All it does is call
   aio_co_wakeup(preadv_co).
3. The preadv coroutine is re-entered and returns.

There is a race condition in aio_co_wake() where the preadv coroutine
returns and the test case destroys the preadv IOThread.  aio_co_wake()
can still be running in the other AioContext and it performs an access
to the freed IOThread AioContext.

Here is the race in aio_co_schedule():

  QSLIST_INSERT_HEAD_ATOMIC(>scheduled_coroutines,
co, co_scheduled_next);
  <-- race: co may execute before we invoke qemu_bh_schedule()!
  qemu_bh_schedule(ctx->co_schedule_bh);

So if co causes ctx to be freed then we're in trouble.  Fix this problem
by holding a reference to ctx.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Paolo Bonzini 
Message-id: 20190723190623.21537-1-stefa...@redhat.com
Message-Id: <20190723190623.21537-1-stefa...@redhat.com>
Signed-off-by: Stefan Hajnoczi 
---
 util/async.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/util/async.c b/util/async.c
index 8d2105729c..4e4c7af51e 100644
--- a/util/async.c
+++ b/util/async.c
@@ -459,9 +459,17 @@ void aio_co_schedule(AioContext *ctx, Coroutine *co)
 abort();
 }
 
+/* The coroutine might run and release the last ctx reference before we
+ * invoke qemu_bh_schedule().  Take a reference to keep ctx alive until
+ * we're done.
+ */
+aio_context_ref(ctx);
+
 QSLIST_INSERT_HEAD_ATOMIC(>scheduled_coroutines,
   co, co_scheduled_next);
 qemu_bh_schedule(ctx->co_schedule_bh);
+
+aio_context_unref(ctx);
 }
 
 void aio_co_wake(struct Coroutine *co)
-- 
2.21.0




[Qemu-block] [PULL 2/2] vhost-user-scsi: prevent using uninitialized vqs

2019-08-22 Thread Stefan Hajnoczi
From: Raphael Norwitz 

Of the 3 virtqueues, seabios only sets cmd, leaving ctrl
and event without a physical address. This can cause
vhost_verify_ring_part_mapping to return ENOMEM, causing
the following logs:

qemu-system-x86_64: Unable to map available ring for ring 0
qemu-system-x86_64: Verify ring failure on region 0

The qemu commit e6cc11d64fc998c11a4dfcde8fda3fc33a74d844
has already resolved the issue for vhost scsi devices but
the fix was never applied to vhost-user scsi devices.

Signed-off-by: Raphael Norwitz 
Reviewed-by: Stefan Hajnoczi 
Message-id: 1560299717-177734-1-git-send-email-raphael.norw...@nutanix.com
Message-Id: <1560299717-177734-1-git-send-email-raphael.norw...@nutanix.com>
Signed-off-by: Stefan Hajnoczi 
---
 hw/scsi/vhost-user-scsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index 31c9d34637..6a6c15dd32 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -93,7 +93,7 @@ static void vhost_user_scsi_realize(DeviceState *dev, Error 
**errp)
 }
 
 vsc->dev.nvqs = 2 + vs->conf.num_queues;
-vsc->dev.vqs = g_new(struct vhost_virtqueue, vsc->dev.nvqs);
+vsc->dev.vqs = g_new0(struct vhost_virtqueue, vsc->dev.nvqs);
 vsc->dev.vq_index = 0;
 vsc->dev.backend_features = 0;
 vqs = vsc->dev.vqs;
-- 
2.21.0




[Qemu-block] [PULL 0/2] Block patches

2019-08-22 Thread Stefan Hajnoczi
The following changes since commit 33f18cf7dca7741d3647d514040904ce83edd73d:

  Merge remote-tracking branch 
'remotes/kraxel/tags/audio-20190821-pull-request' into staging (2019-08-21 
15:18:50 +0100)

are available in the Git repository at:

  https://github.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 5d4c1ed3d46d7e2010b389fe5f3376f605182ab0:

  vhost-user-scsi: prevent using uninitialized vqs (2019-08-22 16:52:23 +0100)


Pull request



Raphael Norwitz (1):
  vhost-user-scsi: prevent using uninitialized vqs

Stefan Hajnoczi (1):
  util/async: hold AioContext ref to prevent use-after-free

 hw/scsi/vhost-user-scsi.c | 2 +-
 util/async.c  | 8 
 2 files changed, 9 insertions(+), 1 deletion(-)

-- 
2.21.0




Re: [Qemu-block] [Qemu-devel] [PATCH v2 00/12] block: qiov_offset parameter for io

2019-08-22 Thread Stefan Hajnoczi
On Tue, Jun 04, 2019 at 07:15:02PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> Here is new parameter qiov_offset for io path, to avoid
> a lot of places with same pattern of creating local_qiov or hd_qiov
> variables.
> 
> These series also includes my
> "[Qemu-devel] [PATCH 0/2] block/io: refactor padding"
> with some changes [described in 01 and 03 emails]
> 
> Vladimir Sementsov-Ogievskiy (12):
>   util/iov: introduce qemu_iovec_init_extended
>   util/iov: improve qemu_iovec_is_zero
>   block/io: refactor padding
>   block: define .*_part io handlers in BlockDriver
>   block/io: bdrv_co_do_copy_on_readv: use and support qiov_offset
>   block/io: bdrv_co_do_copy_on_readv: lazy allocation
>   block/io: bdrv_aligned_preadv: use and support qiov_offset
>   block/io: bdrv_aligned_pwritev: use and support qiov_offset
>   block/io: introduce bdrv_co_p{read,write}v_part
>   block/qcow2: refactor qcow2_co_preadv to use buffer-based io
>   block/qcow2: implement .bdrv_co_preadv_part
>   block/qcow2: implement .bdrv_co_pwritev(_compressed)_part
> 
>  block/qcow2.h |   1 +
>  include/block/block_int.h |  21 ++
>  include/qemu/iov.h|  10 +-
>  block/backup.c|   2 +-
>  block/io.c| 532 ++
>  block/qcow2-cluster.c |  14 +-
>  block/qcow2.c | 131 +-
>  qemu-img.c|   4 +-
>  util/iov.c| 153 +--
>  9 files changed, 559 insertions(+), 309 deletions(-)

qemu-iotests 077 fails after I apply this series (including your
uninitialized variable fix).  I'm afraid I can't include it in the block
pull request, sorry!

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-22 Thread Max Reitz
On 22.08.19 17:25, Max Reitz wrote:
> On 22.08.19 14:09, Max Reitz wrote:
>> (CC-ing Paolo because of the XFS connection, and Stefan because why not.)
>>
>> On 22.08.19 13:27, Lukáš Doktor wrote:
>>> Dne 21. 08. 19 v 19:51 Max Reitz napsal(a):
 On 21.08.19 16:14, Lukáš Doktor wrote:
> Hello guys,
>
> First attempt was rejected due to zip attachment, let's try it again with 
> just Avocado-vt debug.log and serial console log files attached.
>
> I bisected a regression on aarch64 all the way to this commit: "qcow2: 
> skip writing zero buffers to empty COW areas" 
> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at 
> it?

 I think I can see the issue on my x64 system (I don’t see the XFS
 corruption, but the installation fails because of some segfaults).

 I haven’t found a simpler way to reproduce the problem yet, though,
 which is a pain... :-/

 It looks like the problem disappears when I configure qemu with
 “--disable-xfsctl”.  Can you try that?

 Max

>>>
>>> Hello Max,
>>>
>>> yes, I'm getting the same behavior. With "--disable-xfsctl" it works well. 
>>> Also looking at the option I understand why it only failed on aarch64 for 
>>> me, I don't have libs installed on the other machines, therefor it was 
>>> disabled by "./configure" there. Anyway I guess disabling it in my builds 
>>> won't really fix the issue, right? :-)
>>
>> Thanks!
>>
>> No, it won’t, but it means the actual root of the problem is probably
>> rather in some XFS-related code (be it because qemu uses it the wrong
>> way or because of XFS kernel code) than in the pure qcow2 commit that
>> made the problem surface by exercising it heavily.  (Or in an
>> interaction between the two.)
> 
> OK, I got a simpler reproducer now:
> 
> $ ./qemu-img create -f qcow2 test.qcow2 1M
> $ (for i in $(seq 15 -1 0); do \
>echo "aio_write -P 42 $((i * 64 + 1))k 62k"; \
>done) \
>   | ./qemu-io test.qcow2
> $ for i in $(seq 0 15); do \
>   echo $i; \
>   ofs=$((i * 64)); \
>   ./qemu-io -c "read -P 0 ${ofs}k 1k" \
> -c "read -P 42 $((ofs + 1))k 62k" \
> -c "read -P 0 $((ofs + 63))k 1k" \
> test.qcow2 \
>   | grep 'verification'; \
>   done
> 
> On XFS with --enable-xfsctl, this basically always gives me some
> verification failure somewhere.  (On tmpfs or with --disable-xfsctl, it
> never fails.)
> 
> So it seems to be related to I/O from back to front.
> 
> (You can also reproduce it with a plain “qemu-img bench” invocation,
> like “./qemu-img bench -w --pattern=42 -o 1k -S 64k -s 62k test.qcow2”
> (on, say, a 4 GB image), but then the failure appears much later in the
> image, because you have to wait from some requests to come in reverse
> (by chance) first.)

The problem is the ftruncate() in xfs_write_zeroes().  It is possible
for it to yield, then other requests come in, and the data they write
may get discarded once the ftruncate() settles.

Max



Re: [Qemu-block] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-22 Thread Max Reitz
On 22.08.19 14:09, Max Reitz wrote:
> (CC-ing Paolo because of the XFS connection, and Stefan because why not.)
> 
> On 22.08.19 13:27, Lukáš Doktor wrote:
>> Dne 21. 08. 19 v 19:51 Max Reitz napsal(a):
>>> On 21.08.19 16:14, Lukáš Doktor wrote:
 Hello guys,

 First attempt was rejected due to zip attachment, let's try it again with 
 just Avocado-vt debug.log and serial console log files attached.

 I bisected a regression on aarch64 all the way to this commit: "qcow2: 
 skip writing zero buffers to empty COW areas" 
 c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at 
 it?
>>>
>>> I think I can see the issue on my x64 system (I don’t see the XFS
>>> corruption, but the installation fails because of some segfaults).
>>>
>>> I haven’t found a simpler way to reproduce the problem yet, though,
>>> which is a pain... :-/
>>>
>>> It looks like the problem disappears when I configure qemu with
>>> “--disable-xfsctl”.  Can you try that?
>>>
>>> Max
>>>
>>
>> Hello Max,
>>
>> yes, I'm getting the same behavior. With "--disable-xfsctl" it works well. 
>> Also looking at the option I understand why it only failed on aarch64 for 
>> me, I don't have libs installed on the other machines, therefor it was 
>> disabled by "./configure" there. Anyway I guess disabling it in my builds 
>> won't really fix the issue, right? :-)
> 
> Thanks!
> 
> No, it won’t, but it means the actual root of the problem is probably
> rather in some XFS-related code (be it because qemu uses it the wrong
> way or because of XFS kernel code) than in the pure qcow2 commit that
> made the problem surface by exercising it heavily.  (Or in an
> interaction between the two.)

OK, I got a simpler reproducer now:

$ ./qemu-img create -f qcow2 test.qcow2 1M
$ (for i in $(seq 15 -1 0); do \
   echo "aio_write -P 42 $((i * 64 + 1))k 62k"; \
   done) \
  | ./qemu-io test.qcow2
$ for i in $(seq 0 15); do \
  echo $i; \
  ofs=$((i * 64)); \
  ./qemu-io -c "read -P 0 ${ofs}k 1k" \
-c "read -P 42 $((ofs + 1))k 62k" \
-c "read -P 0 $((ofs + 63))k 1k" \
test.qcow2 \
  | grep 'verification'; \
  done

On XFS with --enable-xfsctl, this basically always gives me some
verification failure somewhere.  (On tmpfs or with --disable-xfsctl, it
never fails.)

So it seems to be related to I/O from back to front.

(You can also reproduce it with a plain “qemu-img bench” invocation,
like “./qemu-img bench -w --pattern=42 -o 1k -S 64k -s 62k test.qcow2”
(on, say, a 4 GB image), but then the failure appears much later in the
image, because you have to wait from some requests to come in reverse
(by chance) first.)

Max



Re: [Qemu-block] [Qemu-devel] [QEMU] [PATCH v5 3/8] bootdevice: Add interface to gather LCHS

2019-08-22 Thread Sam Eiderman via Qemu-block
> I’ve got a couple of “undelivered mail returned to sender” mails for Sam
> recently, but anyway...

- shmuel.eider...@oracle.com
+ sam...@google.com

> It doesn’t look like any caller actually passes a NULL @dev, so why not
> drop the @suffix part?

Just copied it from the bootindex implementation.
I think the suffix part there was implemented specifically for fdc since
the same device can have two suffixes (A and B).
This is not relavant here, but I think we still need the suffix to
create the device name for seabios to find.

Sam




Re: [Qemu-block] [Qemu-devel] [PATCH 01/13] block-crypto: misc refactoring

2019-08-22 Thread Maxim Levitsky
On Thu, 2019-08-22 at 16:34 +0200, Max Reitz wrote:
> On 22.08.19 02:05, Maxim Levitsky wrote:
> > On Tue, 2019-08-20 at 18:38 +0200, Max Reitz wrote:
> > > On 14.08.19 22:22, Maxim Levitsky wrote:
> > > > * rename the write_func to create_write_func,
> > > >   and init_func to create_init_func
> > > >   this is  preparation for other write_func that will
> > > >   be used to update the encryption keys.
> > > > 
> > > > No functional changes
> > > > 
> > > > Signed-off-by: Maxim Levitsky 
> > > > ---
> > > >  block/crypto.c | 15 ---
> > > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > > > 
> > > 
> > > I’m not quite sure why you remove or add blank lines seemingly at 
> > > random...
> > 
> > Basically to have consistent two space separation between all functions.
> > A bit of OCD I confess :-)
> 
> Well, it didn’t work because in one place you added two empty lines
> where we already had two, so there are four now.
Exactly :-) 

While the reason I sometimes add/remove black lines between functions
is this, this time this was just a leftover from some stuff I removed and forget
to remove the black lines. Usually prior to sending the patches I 'polish' very
carefully such stuff, but this time since I send up the RFC, I didn't do that 
that well thus various issues like that poped up.


Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH] virtio-blk: Cancel the pending BH when the dataplane is reset

2019-08-22 Thread Philippe Mathieu-Daudé
On 8/22/19 3:01 PM, Stefan Hajnoczi wrote:
> On Fri, Aug 16, 2019 at 07:15:03PM +0200, Philippe Mathieu-Daudé wrote:
>> When 'system_reset' is called, the main loop clear the memory
>> region cache before the BH has a chance to execute. Later when
>> the deferred function is called, some assumptions that were
>> made when scheduling them are no longer true when they actually
>> execute.
>>
>> This is what happens using a virtio-blk device (fresh RHEL7.8 install):
>>
>>  $ (sleep 12.3; echo system_reset; sleep 12.3; echo system_reset; sleep 1; 
>> echo q) \
>>| qemu-system-x86_64 -m 4G -smp 8 -boot menu=on \
>>  -device virtio-blk-pci,id=image1,drive=drive_image1 \
>>  -drive 
>> file=/var/lib/libvirt/images/rhel78.qcow2,if=none,id=drive_image1,format=qcow2,cache=none
>>  \
>>  -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c4:e7:84 \
>>  -netdev tap,id=net0,script=/bin/true,downscript=/bin/true,vhost=on \
>>  -monitor stdio -serial null -nographic
>>   (qemu) system_reset
>>   (qemu) system_reset
>>   (qemu) qemu-system-x86_64: hw/virtio/virtio.c:225: 
>> vring_get_region_caches: Assertion `caches != NULL' failed.
>>   Aborted
>>
>>   (gdb) bt
>>   Thread 1 (Thread 0x7f109c17b680 (LWP 10939)):
>>   #0  0x5604083296d1 in vring_get_region_caches (vq=0x56040a24bdd0) at 
>> hw/virtio/virtio.c:227
>>   #1  0x56040832972b in vring_avail_flags (vq=0x56040a24bdd0) at 
>> hw/virtio/virtio.c:235
>>   #2  0x56040832d13d in virtio_should_notify (vdev=0x56040a240630, 
>> vq=0x56040a24bdd0) at hw/virtio/virtio.c:1648
>>   #3  0x56040832d1f8 in virtio_notify_irqfd (vdev=0x56040a240630, 
>> vq=0x56040a24bdd0) at hw/virtio/virtio.c:1662
>>   #4  0x5604082d213d in notify_guest_bh (opaque=0x56040a243ec0) at 
>> hw/block/dataplane/virtio-blk.c:75
>>   #5  0x56040883dc35 in aio_bh_call (bh=0x56040a243f10) at 
>> util/async.c:90
>>   #6  0x56040883dccd in aio_bh_poll (ctx=0x560409161980) at 
>> util/async.c:118
>>   #7  0x560408842af7 in aio_dispatch (ctx=0x560409161980) at 
>> util/aio-posix.c:460
>>   #8  0x56040883e068 in aio_ctx_dispatch (source=0x560409161980, 
>> callback=0x0, user_data=0x0) at util/async.c:261
>>   #9  0x7f10a8fca06d in g_main_context_dispatch () at 
>> /lib64/libglib-2.0.so.0
>>   #10 0x560408841445 in glib_pollfds_poll () at util/main-loop.c:215
>>   #11 0x5604088414bf in os_host_main_loop_wait (timeout=0) at 
>> util/main-loop.c:238
>>   #12 0x5604088415c4 in main_loop_wait (nonblocking=0) at 
>> util/main-loop.c:514
>>   #13 0x560408416b1e in main_loop () at vl.c:1923
>>   #14 0x56040841e0e8 in main (argc=20, argv=0x7ffc2c3f9c58, 
>> envp=0x7ffc2c3f9d00) at vl.c:4578
>>
>> Fix this by cancelling the BH when the virtio dataplane is stopped.
>>
>> Reported-by: Yihuang Yu 
>> Suggested-by: Stefan Hajnoczi 
>> Fixes: https://bugs.launchpad.net/qemu/+bug/1839428
>> Signed-off-by: Philippe Mathieu-Daudé 
>> ---
>>  hw/block/dataplane/virtio-blk.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/hw/block/dataplane/virtio-blk.c 
>> b/hw/block/dataplane/virtio-blk.c
>> index 9299a1a7c2..4030faa21d 100644
>> --- a/hw/block/dataplane/virtio-blk.c
>> +++ b/hw/block/dataplane/virtio-blk.c
>> @@ -301,6 +301,8 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
>>  /* Clean up guest notifier (irq) */
>>  k->set_guest_notifiers(qbus->parent, nvqs, false);
>>  
>> +qemu_bh_cancel(s->bh);
>> +
>>  vblk->dataplane_started = false;
>>  s->stopping = false;
>>  }
>> -- 
>> 2.20.1
>>
> 
> Along the lines of what John said:
> 
> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> index 9299a1a7c2..4030faa21d 100644
> --- a/hw/block/dataplane/virtio-blk.c
> +++ b/hw/block/dataplane/virtio-blk.c
> @@ -301,6 +301,8 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
> +qemu_bh_cancel(s->bh);
> +notify_guest_bh(s); /* final chance to notify guest */
> +
>  /* Clean up guest notifier (irq) */
>  k->set_guest_notifiers(qbus->parent, nvqs, false);
>  
>  vblk->dataplane_started = false;
>  s->stopping = false;
>  }
> 
> Let's notify the guest if any batched notifications are waiting.  This
> ensures that no notifications are lost when dataplane is stopped.

OK, works for me, thanks!



Re: [Qemu-block] [Qemu-devel] [PATCH] pr-manager: Fix invalid g_free() crash bug

2019-08-22 Thread Philippe Mathieu-Daudé
On 8/22/19 3:38 PM, Markus Armbruster wrote:
> pr_manager_worker() passes its @opaque argument to g_free().  Wrong;
> it points to pr_manager_worker()'s automatic @data.  Broken when
> commit 2f3a7ab39be converted @data from heap- to stack-allocated.  Fix
> by deleting the g_free().
> 
> Fixes: 2f3a7ab39bec4ba8022dc4d42ea641165b004e3e
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Markus Armbruster 
> ---
>  scsi/pr-manager.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/scsi/pr-manager.c b/scsi/pr-manager.c
> index ee43663576..0c866e8698 100644
> --- a/scsi/pr-manager.c
> +++ b/scsi/pr-manager.c
> @@ -39,7 +39,6 @@ static int pr_manager_worker(void *opaque)
>  int fd = data->fd;
>  int r;
>  
> -g_free(data);
>  trace_pr_manager_run(fd, hdr->cmdp[0], hdr->cmdp[1]);
>  
>  /* The reference was taken in pr_manager_execute.  */
> 

Reviewed-by: Philippe Mathieu-Daudé 



Re: [Qemu-block] [PATCH v2 00/12] block: qiov_offset parameter for io

2019-08-22 Thread Stefan Hajnoczi
On Thu, Aug 15, 2019 at 11:12:35AM +, Vladimir Sementsov-Ogievskiy wrote:
> 29.07.2019 18:24, Stefan Hajnoczi wrote:
> > On Tue, Jun 04, 2019 at 07:15:02PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> >> Hi all!
> >>
> >> Here is new parameter qiov_offset for io path, to avoid
> >> a lot of places with same pattern of creating local_qiov or hd_qiov
> >> variables.
> >>
> >> These series also includes my
> >> "[Qemu-devel] [PATCH 0/2] block/io: refactor padding"
> >> with some changes [described in 01 and 03 emails]
> >>
> >> Vladimir Sementsov-Ogievskiy (12):
> >>util/iov: introduce qemu_iovec_init_extended
> >>util/iov: improve qemu_iovec_is_zero
> >>block/io: refactor padding
> >>block: define .*_part io handlers in BlockDriver
> >>block/io: bdrv_co_do_copy_on_readv: use and support qiov_offset
> >>block/io: bdrv_co_do_copy_on_readv: lazy allocation
> >>block/io: bdrv_aligned_preadv: use and support qiov_offset
> >>block/io: bdrv_aligned_pwritev: use and support qiov_offset
> >>block/io: introduce bdrv_co_p{read,write}v_part
> >>block/qcow2: refactor qcow2_co_preadv to use buffer-based io
> >>block/qcow2: implement .bdrv_co_preadv_part
> >>block/qcow2: implement .bdrv_co_pwritev(_compressed)_part
> >>
> >>   block/qcow2.h |   1 +
> >>   include/block/block_int.h |  21 ++
> >>   include/qemu/iov.h|  10 +-
> >>   block/backup.c|   2 +-
> >>   block/io.c| 532 ++
> >>   block/qcow2-cluster.c |  14 +-
> >>   block/qcow2.c | 131 +-
> >>   qemu-img.c|   4 +-
> >>   util/iov.c| 153 +--
> >>   9 files changed, 559 insertions(+), 309 deletions(-)
> >>
> >> -- 
> >> 2.18.0
> > 
> > Thanks, applied to my block tree:
> > https://github.com/stefanha/qemu/commits/block
> > 
> > Stefan
> > 
> 
> Could you please squash this into 01:
> 
> diff --git a/util/iov.c b/util/iov.c
> index 0ed75e764c..5059e10431 100644
> --- a/util/iov.c
> +++ b/util/iov.c
> @@ -422,7 +422,7 @@ void qemu_iovec_init_extended(
>   void *tail_buf, size_t tail_len)
>   {
>   size_t mid_head, mid_tail;
> -int total_niov, mid_niov;
> +int total_niov, mid_niov = 0;
>   struct iovec *p, *mid_iov;
> 
>   if (mid_len) {

Done.  Sending the pull request today.

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] [PATCH 12/13] qemu-img: implement key management

2019-08-22 Thread Max Reitz
On 22.08.19 13:32, Daniel P. Berrangé wrote:
> On Tue, Aug 20, 2019 at 08:29:55PM +0200, Max Reitz wrote:
>> On 14.08.19 22:22, Maxim Levitsky wrote:
>>> Signed-off-by: Maxim Levitsky 
>>> ---
>>>  block/crypto.c   |  16 ++
>>>  block/crypto.h   |   3 +
>>>  qemu-img-cmds.hx |  13 +
>>>  qemu-img.c   | 140 +++
>>>  4 files changed, 172 insertions(+)
>>
>> Yes, this seems a bit weird.  Putting it under amend seems like the
>> natural thing if that works; if not, I think it should be a single
>> qemu-img subcommand instead of two.
> 
> I'm not convinced by overloading two distinct operations on to one
> sub-command - doesn't seem to give an obvious benefit to overload
> them & IME experiance overloading results in harder to understand
> commands due to having distinct args to each command.

Because it suits the qemu-img interface we currently have.  For example,
we have a single subcommand for internal snapshot management (“qemu-img
snapshot”), so I think it makes sense to have a single subcommand for
encrypted image management.

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-08-22 Thread Ross Lagerwall

On 7/5/19 8:23 AM, Klaus Birkelund Jensen wrote:

This adds support for multiple namespaces by introducing a new 'nvme-ns'
device model. The nvme device creates a bus named from the device name
('id'). The nvme-ns devices then connect to this and registers
themselves with the nvme device.

This changes how an nvme device is created. Example with two namespaces:

   -drive file=nvme0n1.img,if=none,id=disk1
   -drive file=nvme0n2.img,if=none,id=disk2
   -device nvme,serial=deadbeef,id=nvme0
   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2

A maximum of 256 namespaces can be configured.


snip

  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
  {
+NvmeNamespace *ns = req->ns;
+
  uint32_t dw10 = le32_to_cpu(cmd->cdw10);
  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
  uint32_t result;
@@ -1464,7 +1474,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
  result = cpu_to_le32(n->features.err_rec);
  break;
  case NVME_VOLATILE_WRITE_CACHE:
-result = blk_enable_write_cache(n->conf.blk);
+result = blk_enable_write_cache(ns->conf.blk);
  trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
  break;


I tried this patch series by installing Windows with a single NVME 
controller having two namespaces. QEMU crashed in get_feature / 
NVME_VOLATILE_WRITE_CACHE because req->ns was NULL.


nvme_get_feature / nvme_set_feature look wrong to me since I can't see 
how req->ns would have been set. Should they have similar code to 
nvme_io_cmd to set req->ns from cmd->nsid?


After working around this issue everything else seemed to be working 
well. Thanks for your work on this patch series.


Thanks,
--
Ross Lagerwall



Re: [Qemu-block] [Qemu-devel] [PATCH 01/13] block-crypto: misc refactoring

2019-08-22 Thread Max Reitz
On 22.08.19 02:05, Maxim Levitsky wrote:
> On Tue, 2019-08-20 at 18:38 +0200, Max Reitz wrote:
>> On 14.08.19 22:22, Maxim Levitsky wrote:
>>> * rename the write_func to create_write_func,
>>>   and init_func to create_init_func
>>>   this is  preparation for other write_func that will
>>>   be used to update the encryption keys.
>>>
>>> No functional changes
>>>
>>> Signed-off-by: Maxim Levitsky 
>>> ---
>>>  block/crypto.c | 15 ---
>>>  1 file changed, 8 insertions(+), 7 deletions(-)
>>>
>>
>> I’m not quite sure why you remove or add blank lines seemingly at random...
> 
> Basically to have consistent two space separation between all functions.
> A bit of OCD I confess :-)

Well, it didn’t work because in one place you added two empty lines
where we already had two, so there are four now.

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [PATCH 02/13] qcrypto-luks: misc refactoring

2019-08-22 Thread Max Reitz
On 22.08.19 01:59, Maxim Levitsky wrote:
> On Tue, 2019-08-20 at 19:36 +0200, Max Reitz wrote:
>> On 14.08.19 22:22, Maxim Levitsky wrote:
>>> This is also a preparation for key read/write/erase functions
>>>
>>> * use master key len from the header
>>> * prefer to use crypto params in the QCryptoBlockLUKS
>>>   over passing them as function arguments
>>> * define QCRYPTO_BLOCK_LUKS_DEFAULT_ITER_TIME
>>> * Add comments to various crypto parameters in the QCryptoBlockLUKS
>>>
>>> Signed-off-by: Maxim Levitsky 
>>> ---
>>>  crypto/block-luks.c | 213 ++--
>>>  1 file changed, 105 insertions(+), 108 deletions(-)


[...]

>>> @@ -410,21 +430,15 @@ 
>>> qcrypto_block_luks_essiv_cipher(QCryptoCipherAlgorithm cipher,
>>>   */
>>>  static int
>>>  qcrypto_block_luks_load_key(QCryptoBlock *block,
>>> -QCryptoBlockLUKSKeySlot *slot,
>>> +uint slot_idx,
>>
>> Did you use uint on purpose or do you mean a plain “unsigned”?
> Well there are just 8 slots, but yea I don't mind to make this an unsigned 
> int.

My point was that “uint” is not a standard C type.

[...]

>>> @@ -930,6 +922,17 @@ qcrypto_block_luks_create(QCryptoBlock *block,
>>>  luks_opts.has_ivgen_hash_alg = true;
>>>  }
>>>  }
>>> +
>>> +luks = g_new0(QCryptoBlockLUKS, 1);
>>> +block->opaque = luks;
>>> +
>>> +luks->cipher_alg = luks_opts.cipher_alg;
>>> +luks->cipher_mode = luks_opts.cipher_mode;
>>> +luks->ivgen_alg = luks_opts.ivgen_alg;
>>> +luks->ivgen_hash_alg = luks_opts.ivgen_hash_alg;
>>> +luks->hash_alg = luks_opts.hash_alg;
>>> +
>>> +
>>
>> Why did you pull this up?  Now @luks is leaked in both of the next error
>> paths.
> 
> This is because the purpose of these fields changed. As Daniel explained to me
> they were initially added after the fact to serve as a cache of into to 
> present in qemu-img info callback.
> But now I use these everywhere in the code so I won't them to be correct as 
> soon as possible to minimize
> the risk of calling some function that uses these and would read garbage.

I get that, but I was wondering why you pulled the allocation of @luks
up above the next two conditional blocks.  Allocating and initializing
there should have worked just fine.

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-22 Thread Max Reitz
On 16.08.19 23:21, Nir Soffer wrote:
> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
> 
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
> 
> When using preallocation=off, we always allocate at least one filesystem
> block:
> 
> $ ./qemu-img create -f raw test.raw 1g
> Formatting 'test.raw', fmt=raw size=1073741824
> 
> $ ls -lhs test.raw
> 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> 
> I did quick performance tests for these flows:
> - Provisioning a VM with a new raw image.
> - Copying disks with qemu-img convert to new raw target image
> 
> I installed Fedora 29 server on raw sparse image, measuring the time
> from clicking "Begin installation" until the "Reboot" button appears:
> 
> Before(s)  After(s) Diff(%)
> ---
>  356389+8.4
> 
> I ran this only once, so we cannot tell much from these results.

So you’d expect it to be fast but it was slower?  Well, you only ran it
once and it isn’t really a precise benchmark...

> The second test was cloning the installation image with qemu-img
> convert, doing 10 runs:
> 
> for i in $(seq 10); do
> rm -f dst.raw
> sleep 10
> time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
> done
> 
> Here is a table comparing the total time spent:
> 
> TypeBefore(s)   After(s)Diff(%)
> ---
> real  530.028469.123  -11.4
> user   17.204 10.768  -37.4
> sys17.881  7.011  -60.7
> 
> Here we see very clear improvement in CPU usage.
> 
> Signed-off-by: Nir Soffer 
> ---
>  block/file-posix.c | 25 +
>  tests/qemu-iotests/150.out |  1 +
>  tests/qemu-iotests/160 |  4 
>  tests/qemu-iotests/175 | 19 +--
>  tests/qemu-iotests/175.out |  8 
>  tests/qemu-iotests/221.out | 12 
>  tests/qemu-iotests/253.out | 12 
>  7 files changed, 63 insertions(+), 18 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index b9c33c8f6c..3964dd2021 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
>  return ret;
>  }
>  
> +/*
> + * Help alignment detection by allocating the first block.
> + *
> + * When reading with direct I/O from unallocated area on Gluster backed by 
> XFS,
> + * reading succeeds regardless of request length. In this case we fallback to
> + * safe aligment which is not optimal. Allocating the first block avoids this
> + * fallback.
> + *
> + * Returns: 0 on success, -errno on failure.
> + */
> +static int allocate_first_block(int fd)
> +{
> +ssize_t n;
> +
> +do {
> +n = pwrite(fd, "\0", 1, 0);

This breaks when fd has been opened with O_DIRECT.

(Which happens when you open some file with cache.direct=on, and then
use e.g. QMP’s block_resize.)

It isn’t that bad because eventually you simply ignore the error.  But
it still makes me wonder whether we shouldn’t write like the biggest
power of two that does not exceed the new file length or MAX_BLOCKSIZE.

> +} while (n == -1 && errno == EINTR);
> +
> +return (n == -1) ? -errno : 0;
> +}
> +
>  static int handle_aiocb_truncate(void *opaque)
>  {
>  RawPosixAIOData *aiocb = opaque;
> @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
>  /* posix_fallocate() doesn't set errno. */
>  error_setg_errno(errp, -result,
>   "Could not preallocate new data");
> +} else if (current_length == 0) {
> +allocate_first_block(fd);

Should posix_fallocate() not take care of precisely this?

>  }
>  } else {
>  result = 0;

[...]

> diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> index df89d3864b..ad2d054a47 100755
> --- a/tests/qemu-iotests/160
> +++ b/tests/qemu-iotests/160
> @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
>  $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O "$IMGFMT" 
> \
>  2> /dev/null
>  TEST_IMG="$TEST_IMG.out" _check_test_img
> +
> +# We always write the first byte of an image.
> +printf "\0" > "$TEST_IMG.out.dd"
> +
>  dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none

Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given
the lack of conv=notrunc)?

>  
>  echo
> diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> index 51e62c8276..c6a3a7bb1e 100755
> --- 

Re: [Qemu-block] [Qemu-devel] [PATCH v3 1/5] socket: Add backlog parameter to socket_listen

2019-08-22 Thread Daniel P . Berrangé
On Tue, Aug 20, 2019 at 01:17:59PM +0200, Juan Quintela wrote:
> Daniel P. Berrangé  wrote:
> > On Tue, Aug 20, 2019 at 12:48:32PM +0200, Juan Quintela wrote:
> >> Current parameter was always one.  We continue with that value for now
> >> in all callers.
> >> 
> >> Signed-off-by: Juan Quintela 
> >> 
> >> ---
> >> Moved trace to socket_listen
> >> ---
> >>  include/qemu/sockets.h|  2 +-
> >>  io/channel-socket.c   |  2 +-
> >>  qga/channel-posix.c   |  2 +-
> >>  tests/test-util-sockets.c | 12 ++--
> >>  util/qemu-sockets.c   | 33 ++---
> >>  util/trace-events |  3 +++
> >>  6 files changed, 34 insertions(+), 20 deletions(-)
> >
> > Reviewed-by: Daniel P. Berrangé 
> 
> Hi
> 
> Everything is reviewed by you, and it is mostly socket code.  Should I
> do the pull request, or are you doing it?

I'm fine if you want to send a PR, since this is ultimately about
fixing migration problems.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [Qemu-devel] [PATCH 07/13] block: add manage-encryption command (qmp and blockdev)

2019-08-22 Thread Markus Armbruster
Maxim Levitsky  writes:

> On Wed, 2019-08-21 at 13:47 +0200, Markus Armbruster wrote:
>> Maxim Levitsky  writes:
>> 
>> > This adds:
>> > 
>> > * x-blockdev-update-encryption and x-blockdev-erase-encryption qmp commands
>> >   Both commands take the QCryptoKeyManageOptions
>> >   the x-blockdev-update-encryption is meant for non destructive addition
>> >   of key slots / whatever the encryption driver supports in the future
>> > 
>> >   x-blockdev-erase-encryption is meant for destructive encryption key 
>> > erase,
>> >   in some cases even without way to recover the data.
>> > 
>> > 
>> > * bdrv_setup_encryption callback in the block driver
>> >   This callback does both the above functions with 'action' parameter
>> > 
>> > * QCryptoKeyManageOptions with set of options that drivers can use for 
>> > encryption managment
>> >   Currently it has all the options that LUKS needs, and later it can be 
>> > extended
>> >   (via union) to support more encryption drivers if needed
>> > 
>> > * blk_setup_encryption / bdrv_setup_encryption - the usual block layer 
>> > wrappers.
>> >   Note that bdrv_setup_encryption takes BlockDriverState and not BdrvChild,
>> >   for the ease of use from the qmp code. It is not expected that this 
>> > function
>> >   will be used by anything but qmp and qemu-img code
>> > 
>> > 
>> > Signed-off-by: Maxim Levitsky 
>> 
>> [...]
>> > diff --git a/qapi/block-core.json b/qapi/block-core.json
>> > index 0d43d4f37c..53ed411eed 100644
>> > --- a/qapi/block-core.json
>> > +++ b/qapi/block-core.json
>> > @@ -5327,3 +5327,39 @@
>> >'data' : { 'node-name': 'str',
>> >   'iothread': 'StrOrNull',
>> >   '*force': 'bool' } }
>> > +
>> > +
>> > +##
>> > +# @x-blockdev-update-encryption:
>> > +#
>> > +# Update the encryption keys for an encrypted block device
>> > +#
>> > +# @node-name:   Name of the blockdev to operate on
>> > +# @force: Disable safety checks (use with care)
>> 
>> What checks excactly are disabled?
> Ability to overwrite an used slot with a different password. 
> If overwrite fails, the image won't be recoverable.
>
> The safe way is to add a new slot, then erase the old
> one, but this changes the slot where the password
> is stored, unless this procedure is used twice

Would this be a useful addition to the doc comment?

>> > +# @options:   Driver specific options
>> > +#
>> > +
>> > +# Since: 4.2
>> > +##
>> > +{ 'command': 'x-blockdev-update-encryption',
>> > +  'data': { 'node-name' : 'str',
>> > +'*force' : 'bool',
>> > +'options': 'QCryptoEncryptionSetupOptions' } }
>> > +
>> > +##
>> > +# @x-blockdev-erase-encryption:
>> > +#
>> > +# Erase the encryption keys for an encrypted block device
>> > +#
>> > +# @node-name:   Name of the blockdev to operate on
>> > +# @force: Disable safety checks (use with care)
>> 
>> Likewise.
> 1. Erase a slot which is already marked as
> erased. Mostly harmless but pointless as well.
>
> 2. Erase last keyslot. This irreversibly destroys
> any ability to read the data from the device,
> unless a backup of the header and the key material is
> done prior. Still can be useful when it is desired to
> erase the data fast.

Would this be a useful addition to the doc comment?

>> > +# @options:   Driver specific options
>> > +#
>> > +# Returns: @QCryptoKeyManageResult
>> 
>> Doc comment claims the command returns something, even though it
>> doesn't.  Please fix.  Sadly, the doc generator fails to flag that.
> This is leftover, fixed now although most likely this interface will die.
> I was initially planning to return
> information on which slot was allocated when user left that
> decision to the driver.
>
>> 
>> > +#
>> > +# Since: 4.2
>> > +##
>> > +{ 'command': 'x-blockdev-erase-encryption',
>> > +  'data': { 'node-name' : 'str',
>> > +'*force' : 'bool',
>> > +'options': 'QCryptoEncryptionSetupOptions' } }

Hmm, all members of @options are optional.  If I don't want to specify
any of them, I still have to say "options": {}.  Should @options be
optional, too?

Question is not relevant for x-blockdev-update-encryption, because
there, options.key-secret isn't actually optional.  Correct?

>> > diff --git a/qapi/crypto.json b/qapi/crypto.json
>> > index b2a4cff683..69e8b086db 100644
>> > --- a/qapi/crypto.json
>> > +++ b/qapi/crypto.json
>> > @@ -309,3 +309,29 @@
>> >'base': 'QCryptoBlockInfoBase',
>> >'discriminator': 'format',
>> >'data': { 'luks': 'QCryptoBlockInfoLUKS' } }
>> > +
>> > +
>> > +##
>> > +# @QCryptoEncryptionSetupOptions:
>> > +#
>> > +# Driver specific options for encryption key management.
>> 
>> Specific to which driver?
>
> This is the same issue, of not beeing able to detect an union.
>
> I was planning to have an union here where we could add
> add the driver specific options if we need to have another crypto driver,
> however since I discovered that union needs user to pass the driver name,
> I just 

[Qemu-block] [PATCH] pr-manager: Fix invalid g_free() crash bug

2019-08-22 Thread Markus Armbruster
pr_manager_worker() passes its @opaque argument to g_free().  Wrong;
it points to pr_manager_worker()'s automatic @data.  Broken when
commit 2f3a7ab39be converted @data from heap- to stack-allocated.  Fix
by deleting the g_free().

Fixes: 2f3a7ab39bec4ba8022dc4d42ea641165b004e3e
Cc: qemu-sta...@nongnu.org
Signed-off-by: Markus Armbruster 
---
 scsi/pr-manager.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/scsi/pr-manager.c b/scsi/pr-manager.c
index ee43663576..0c866e8698 100644
--- a/scsi/pr-manager.c
+++ b/scsi/pr-manager.c
@@ -39,7 +39,6 @@ static int pr_manager_worker(void *opaque)
 int fd = data->fd;
 int r;
 
-g_free(data);
 trace_pr_manager_run(fd, hdr->cmdp[0], hdr->cmdp[1]);
 
 /* The reference was taken in pr_manager_execute.  */
-- 
2.21.0




Re: [Qemu-block] [PATCH] virtio-blk: Cancel the pending BH when the dataplane is reset

2019-08-22 Thread Stefan Hajnoczi
On Fri, Aug 16, 2019 at 02:35:56PM -0400, John Snow wrote:
> 
> 
> On 8/16/19 1:15 PM, Philippe Mathieu-Daudé wrote:
> > When 'system_reset' is called, the main loop clear the memory
> > region cache before the BH has a chance to execute. Later when
> > the deferred function is called, some assumptions that were
> > made when scheduling them are no longer true when they actually
> > execute.
> > 
> > This is what happens using a virtio-blk device (fresh RHEL7.8 install):
> > 
> >  $ (sleep 12.3; echo system_reset; sleep 12.3; echo system_reset; sleep 1; 
> > echo q) \
> >| qemu-system-x86_64 -m 4G -smp 8 -boot menu=on \
> >  -device virtio-blk-pci,id=image1,drive=drive_image1 \
> >  -drive 
> > file=/var/lib/libvirt/images/rhel78.qcow2,if=none,id=drive_image1,format=qcow2,cache=none
> >  \
> >  -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c4:e7:84 \
> >  -netdev tap,id=net0,script=/bin/true,downscript=/bin/true,vhost=on \
> >  -monitor stdio -serial null -nographic
> >   (qemu) system_reset
> >   (qemu) system_reset
> >   (qemu) qemu-system-x86_64: hw/virtio/virtio.c:225: 
> > vring_get_region_caches: Assertion `caches != NULL' failed.
> >   Aborted
> > 
> >   (gdb) bt
> >   Thread 1 (Thread 0x7f109c17b680 (LWP 10939)):
> >   #0  0x5604083296d1 in vring_get_region_caches (vq=0x56040a24bdd0) at 
> > hw/virtio/virtio.c:227
> >   #1  0x56040832972b in vring_avail_flags (vq=0x56040a24bdd0) at 
> > hw/virtio/virtio.c:235
> >   #2  0x56040832d13d in virtio_should_notify (vdev=0x56040a240630, 
> > vq=0x56040a24bdd0) at hw/virtio/virtio.c:1648
> >   #3  0x56040832d1f8 in virtio_notify_irqfd (vdev=0x56040a240630, 
> > vq=0x56040a24bdd0) at hw/virtio/virtio.c:1662
> >   #4  0x5604082d213d in notify_guest_bh (opaque=0x56040a243ec0) at 
> > hw/block/dataplane/virtio-blk.c:75
> >   #5  0x56040883dc35 in aio_bh_call (bh=0x56040a243f10) at 
> > util/async.c:90
> >   #6  0x56040883dccd in aio_bh_poll (ctx=0x560409161980) at 
> > util/async.c:118
> >   #7  0x560408842af7 in aio_dispatch (ctx=0x560409161980) at 
> > util/aio-posix.c:460
> >   #8  0x56040883e068 in aio_ctx_dispatch (source=0x560409161980, 
> > callback=0x0, user_data=0x0) at util/async.c:261
> >   #9  0x7f10a8fca06d in g_main_context_dispatch () at 
> > /lib64/libglib-2.0.so.0
> >   #10 0x560408841445 in glib_pollfds_poll () at util/main-loop.c:215
> >   #11 0x5604088414bf in os_host_main_loop_wait (timeout=0) at 
> > util/main-loop.c:238
> >   #12 0x5604088415c4 in main_loop_wait (nonblocking=0) at 
> > util/main-loop.c:514
> >   #13 0x560408416b1e in main_loop () at vl.c:1923
> >   #14 0x56040841e0e8 in main (argc=20, argv=0x7ffc2c3f9c58, 
> > envp=0x7ffc2c3f9d00) at vl.c:4578
> > 
> > Fix this by cancelling the BH when the virtio dataplane is stopped.
> > 
> > Reported-by: Yihuang Yu 
> > Suggested-by: Stefan Hajnoczi 
> > Fixes: https://bugs.launchpad.net/qemu/+bug/1839428
> > Signed-off-by: Philippe Mathieu-Daudé 
> > ---
> >  hw/block/dataplane/virtio-blk.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/hw/block/dataplane/virtio-blk.c 
> > b/hw/block/dataplane/virtio-blk.c
> > index 9299a1a7c2..4030faa21d 100644
> > --- a/hw/block/dataplane/virtio-blk.c
> > +++ b/hw/block/dataplane/virtio-blk.c
> > @@ -301,6 +301,8 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
> >  /* Clean up guest notifier (irq) */
> >  k->set_guest_notifiers(qbus->parent, nvqs, false);
> >  
> > +qemu_bh_cancel(s->bh);
> > +
> >  vblk->dataplane_started = false;
> >  s->stopping = false;
> >  }
> > 
> 
> Naive question:
> 
> Since we're canceling the BH here and we're stopping the device; do we
> need to do anything like clear out batch_notify_vqs? I assume in
> system_reset contexts that's going to be handled anyway, are there
> non-reset contexts where it matters?

Spurious guest notifications aren't a problem but missing notifications
can hang the guest.  I have proposed a modified version of this code
that ensures pending batched notifications are sent.

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] [PATCH] virtio-blk: Cancel the pending BH when the dataplane is reset

2019-08-22 Thread Stefan Hajnoczi
On Fri, Aug 16, 2019 at 07:15:03PM +0200, Philippe Mathieu-Daudé wrote:
> When 'system_reset' is called, the main loop clear the memory
> region cache before the BH has a chance to execute. Later when
> the deferred function is called, some assumptions that were
> made when scheduling them are no longer true when they actually
> execute.
> 
> This is what happens using a virtio-blk device (fresh RHEL7.8 install):
> 
>  $ (sleep 12.3; echo system_reset; sleep 12.3; echo system_reset; sleep 1; 
> echo q) \
>| qemu-system-x86_64 -m 4G -smp 8 -boot menu=on \
>  -device virtio-blk-pci,id=image1,drive=drive_image1 \
>  -drive 
> file=/var/lib/libvirt/images/rhel78.qcow2,if=none,id=drive_image1,format=qcow2,cache=none
>  \
>  -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c4:e7:84 \
>  -netdev tap,id=net0,script=/bin/true,downscript=/bin/true,vhost=on \
>  -monitor stdio -serial null -nographic
>   (qemu) system_reset
>   (qemu) system_reset
>   (qemu) qemu-system-x86_64: hw/virtio/virtio.c:225: vring_get_region_caches: 
> Assertion `caches != NULL' failed.
>   Aborted
> 
>   (gdb) bt
>   Thread 1 (Thread 0x7f109c17b680 (LWP 10939)):
>   #0  0x5604083296d1 in vring_get_region_caches (vq=0x56040a24bdd0) at 
> hw/virtio/virtio.c:227
>   #1  0x56040832972b in vring_avail_flags (vq=0x56040a24bdd0) at 
> hw/virtio/virtio.c:235
>   #2  0x56040832d13d in virtio_should_notify (vdev=0x56040a240630, 
> vq=0x56040a24bdd0) at hw/virtio/virtio.c:1648
>   #3  0x56040832d1f8 in virtio_notify_irqfd (vdev=0x56040a240630, 
> vq=0x56040a24bdd0) at hw/virtio/virtio.c:1662
>   #4  0x5604082d213d in notify_guest_bh (opaque=0x56040a243ec0) at 
> hw/block/dataplane/virtio-blk.c:75
>   #5  0x56040883dc35 in aio_bh_call (bh=0x56040a243f10) at util/async.c:90
>   #6  0x56040883dccd in aio_bh_poll (ctx=0x560409161980) at 
> util/async.c:118
>   #7  0x560408842af7 in aio_dispatch (ctx=0x560409161980) at 
> util/aio-posix.c:460
>   #8  0x56040883e068 in aio_ctx_dispatch (source=0x560409161980, 
> callback=0x0, user_data=0x0) at util/async.c:261
>   #9  0x7f10a8fca06d in g_main_context_dispatch () at 
> /lib64/libglib-2.0.so.0
>   #10 0x560408841445 in glib_pollfds_poll () at util/main-loop.c:215
>   #11 0x5604088414bf in os_host_main_loop_wait (timeout=0) at 
> util/main-loop.c:238
>   #12 0x5604088415c4 in main_loop_wait (nonblocking=0) at 
> util/main-loop.c:514
>   #13 0x560408416b1e in main_loop () at vl.c:1923
>   #14 0x56040841e0e8 in main (argc=20, argv=0x7ffc2c3f9c58, 
> envp=0x7ffc2c3f9d00) at vl.c:4578
> 
> Fix this by cancelling the BH when the virtio dataplane is stopped.
> 
> Reported-by: Yihuang Yu 
> Suggested-by: Stefan Hajnoczi 
> Fixes: https://bugs.launchpad.net/qemu/+bug/1839428
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/block/dataplane/virtio-blk.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> index 9299a1a7c2..4030faa21d 100644
> --- a/hw/block/dataplane/virtio-blk.c
> +++ b/hw/block/dataplane/virtio-blk.c
> @@ -301,6 +301,8 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
>  /* Clean up guest notifier (irq) */
>  k->set_guest_notifiers(qbus->parent, nvqs, false);
>  
> +qemu_bh_cancel(s->bh);
> +
>  vblk->dataplane_started = false;
>  s->stopping = false;
>  }
> -- 
> 2.20.1
> 

Along the lines of what John said:

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 9299a1a7c2..4030faa21d 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -301,6 +301,8 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
+qemu_bh_cancel(s->bh);
+notify_guest_bh(s); /* final chance to notify guest */
+
 /* Clean up guest notifier (irq) */
 k->set_guest_notifiers(qbus->parent, nvqs, false);
 
 vblk->dataplane_started = false;
 s->stopping = false;
 }

Let's notify the guest if any batched notifications are waiting.  This
ensures that no notifications are lost when dataplane is stopped.


signature.asc
Description: PGP signature


Re: [Qemu-block] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-22 Thread Max Reitz
(CC-ing Paolo because of the XFS connection, and Stefan because why not.)

On 22.08.19 13:27, Lukáš Doktor wrote:
> Dne 21. 08. 19 v 19:51 Max Reitz napsal(a):
>> On 21.08.19 16:14, Lukáš Doktor wrote:
>>> Hello guys,
>>>
>>> First attempt was rejected due to zip attachment, let's try it again with 
>>> just Avocado-vt debug.log and serial console log files attached.
>>>
>>> I bisected a regression on aarch64 all the way to this commit: "qcow2: skip 
>>> writing zero buffers to empty COW areas" 
>>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at 
>>> it?
>>
>> I think I can see the issue on my x64 system (I don’t see the XFS
>> corruption, but the installation fails because of some segfaults).
>>
>> I haven’t found a simpler way to reproduce the problem yet, though,
>> which is a pain... :-/
>>
>> It looks like the problem disappears when I configure qemu with
>> “--disable-xfsctl”.  Can you try that?
>>
>> Max
>>
> 
> Hello Max,
> 
> yes, I'm getting the same behavior. With "--disable-xfsctl" it works well. 
> Also looking at the option I understand why it only failed on aarch64 for me, 
> I don't have libs installed on the other machines, therefor it was disabled 
> by "./configure" there. Anyway I guess disabling it in my builds won't really 
> fix the issue, right? :-)

Thanks!

No, it won’t, but it means the actual root of the problem is probably
rather in some XFS-related code (be it because qemu uses it the wrong
way or because of XFS kernel code) than in the pure qcow2 commit that
made the problem surface by exercising it heavily.  (Or in an
interaction between the two.)

Max



Re: [Qemu-block] [PATCH 08/13] qcrypto: add the plumbing for encryption management

2019-08-22 Thread Daniel P . Berrangé
On Thu, Aug 22, 2019 at 02:47:03PM +0300, Maxim Levitsky wrote:
> On Thu, 2019-08-22 at 12:16 +0100, Daniel P. Berrangé wrote:
> > On Wed, Aug 14, 2019 at 11:22:14PM +0300, Maxim Levitsky wrote:
> > > This adds qcrypto_block_manage_encryption, which
> > >  is thin wrapper around manage_encryption of the crypto driver
> > >  which is also added
> > > 
> > > Signed-off-by: Maxim Levitsky 
> > > ---
> > >  crypto/block.c | 29 +
> > >  crypto/blockpriv.h |  9 +
> > >  include/crypto/block.h | 27 +++
> > >  3 files changed, 65 insertions(+)
> > > 
> > > diff --git a/crypto/block.c b/crypto/block.c
> > > index ee96759f7d..5916e49aba 100644
> > > --- a/crypto/block.c
> > > +++ b/crypto/block.c
> > > @@ -20,6 +20,7 @@
> > >  
> > >  #include "qemu/osdep.h"
> > >  #include "qapi/error.h"
> > > +
> > >  #include "blockpriv.h"
> > >  #include "block-qcow.h"
> > >  #include "block-luks.h"
> > > @@ -282,6 +283,34 @@ void qcrypto_block_free(QCryptoBlock *block)
> > >  }
> > >  
> > >  
> > > +int qcrypto_block_setup_encryption(QCryptoBlock *block,
> > > +   QCryptoBlockReadFunc readfunc,
> > > +   QCryptoBlockWriteFunc writefunc,
> > > +   void *opaque,
> > > +   enum BlkSetupEncryptionAction action,
> > > +   QCryptoEncryptionSetupOptions 
> > > *options,
> > > +   bool force,
> > > +   Error **errp)
> > > +{
> > > +if (!block->driver->setup_encryption) {
> > > +error_setg(errp,
> > > +"Crypto format %s doesn't support management of 
> > > encryption keys",
> > > +QCryptoBlockFormat_str(block->format));
> > > +return -1;
> > > +}
> > > +
> > > +return block->driver->setup_encryption(block,
> > > +   readfunc,
> > > +   writefunc,
> > > +   opaque,
> > > +   action,
> > > +   options,
> > > +   force,
> > > +   errp);
> > > +}
> > > +
> > > +
> > > +
> > >  typedef int (*QCryptoCipherEncDecFunc)(QCryptoCipher *cipher,
> > >  const void *in,
> > >  void *out,
> > > diff --git a/crypto/blockpriv.h b/crypto/blockpriv.h
> > > index 71c59cb542..804965dca3 100644
> > > --- a/crypto/blockpriv.h
> > > +++ b/crypto/blockpriv.h
> > > @@ -81,6 +81,15 @@ struct QCryptoBlockDriver {
> > >  
> > >  bool (*has_format)(const uint8_t *buf,
> > > size_t buflen);
> > > +
> > > +int (*setup_encryption)(QCryptoBlock *block,
> > > +QCryptoBlockReadFunc readfunc,
> > > +QCryptoBlockWriteFunc writefunc,
> > > +void *opaque,
> > > +enum BlkSetupEncryptionAction action,
> > > +QCryptoEncryptionSetupOptions *options,
> > > +bool force,
> > > +Error **errp);
> > >  };
> > >  
> > >  
> > > diff --git a/include/crypto/block.h b/include/crypto/block.h
> > > index fe12899831..60d46e3efc 100644
> > > --- a/include/crypto/block.h
> > > +++ b/include/crypto/block.h
> > > @@ -23,6 +23,7 @@
> > >  
> > >  #include "crypto/cipher.h"
> > >  #include "crypto/ivgen.h"
> > > +#include "block/block.h"
> > >  
> > >  typedef struct QCryptoBlock QCryptoBlock;
> > >  
> > > @@ -268,4 +269,30 @@ uint64_t qcrypto_block_get_sector_size(QCryptoBlock 
> > > *block);
> > >   */
> > >  void qcrypto_block_free(QCryptoBlock *block);
> > >  
> > > +
> > > +/**
> > > + * qcrypto_block_setup_encryption:
> > > + * @block: the block encryption object
> > > + *
> > > + * @readfunc: callback for reading data from the volume header
> > > + * @writefunc: callback for writing data to the volume header
> > > + * @opaque: data to pass to @readfunc and @writefunc
> > > + * @action: tell the driver the setup action (add/erase currently)
> > > + * @options: driver specific options, that specify
> > > + *   what encryption settings to manage
> > > + * @force: hint for the driver to allow unsafe operation
> > > + * @errp: error pointer
> > > + *
> > > + * Adds/Erases a new encryption key using @options
> > 
> > I'd prefer to see separate APIs for add + erase instead
> > of overloading. It'll lead to a clearer API from callers
> > POV to see exactly which parameters are for each action.
> 
> 
> The downside of this is some duplication of code in the middle layers.
> I don't mind doing that if you are OK with that.
> 
> Also note that if we go with amend 

Re: [Qemu-block] [PATCH v8 2/3] block/nbd: nbd reconnect

2019-08-22 Thread Vladimir Sementsov-Ogievskiy
21.08.2019 20:35, Eric Blake wrote:
> On 8/21/19 11:52 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Implement reconnect. To achieve this:
>>
>> 1. add new modes:
>> connecting-wait: means, that reconnecting is in progress, and there
>>   were small number of reconnect attempts, so all requests are
>>   waiting for the connection.
>> connecting-nowait: reconnecting is in progress, there were a lot of
>>   attempts of reconnect, all requests will return errors.
>>
>> two old modes are used too:
>> connected: normal state
>> quit: exiting after fatal error or on close
>>
>> Possible transitions are:
>>
>> * -> quit
>> connecting-* -> connected
>> connecting-wait -> connecting-nowait (transition is done after
>>reconnect-delay seconds in connecting-wait mode)
>> connected -> connecting-wait
>>
>> 2. Implement reconnect in connection_co. So, in connecting-* mode,
>>  connection_co, tries to reconnect unlimited times.
>>
>> 3. Retry nbd queries on channel error, if we are in connecting-wait
>>  state.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>> ---
> 
>> +static bool nbd_client_connecting(BDRVNBDState *s)
>> +{
>> +return s->state == NBD_CLIENT_CONNECTING_WAIT ||
>> +s->state == NBD_CLIENT_CONNECTING_NOWAIT;
> 
> 
> Indentation looks unusual. I might have done:
> 
>  return (s->state == NBD_CLIENT_CONNECTING_WAIT ||
>  s->state == NBD_CLIENT_CONNECTING_NOWAIT);
> 
> Or even exploit the enum encoding:
> 
>  return s->state <= NBD_CLIENT_CONNECTING_NOWAIT
> 
> Is s->state updated atomically, or do we risk the case where we might
> see two different values of s->state across the || sequence point?  Does
> that matter?

I hope it all happens in one aio context so state change should not intersects 
with this
function as it doesn't yield.

> 
>> +}
>> +
>> +static bool nbd_client_connecting_wait(BDRVNBDState *s)
>> +{
>> +return s->state == NBD_CLIENT_CONNECTING_WAIT;
>> +}
>> +
>> +static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
>> +{
>> +Error *local_err = NULL;
>> +
>> +if (!nbd_client_connecting(s)) {
>> +return;
>> +}
>> +assert(nbd_client_connecting(s));
> 
> This assert adds nothing given the condition beforehand.
> 
>> +
>> +/* Wait for completion of all in-flight requests */
>> +
>> +qemu_co_mutex_lock(>send_mutex);
>> +
>> +while (s->in_flight > 0) {
>> +qemu_co_mutex_unlock(>send_mutex);
>> +nbd_recv_coroutines_wake_all(s);
>> +s->wait_in_flight = true;
>> +qemu_coroutine_yield();
>> +s->wait_in_flight = false;
>> +qemu_co_mutex_lock(>send_mutex);
>> +}
>> +
>> +qemu_co_mutex_unlock(>send_mutex);
>> +
>> +if (!nbd_client_connecting(s)) {
>> +return;
>> +}
>> +
>> +/*
>> + * Now we are sure that nobody is accessing the channel, and no one will
>> + * try until we set the state to CONNECTED.
>> + */
>> +
>> +/* Finalize previous connection if any */
>> +if (s->ioc) {
>> +nbd_client_detach_aio_context(s->bs);
>> +object_unref(OBJECT(s->sioc));
>> +s->sioc = NULL;
>> +object_unref(OBJECT(s->ioc));
>> +s->ioc = NULL;
>> +}
>> +
>> +s->connect_status = nbd_client_connect(s->bs, _err);
>> +error_free(s->connect_err);
>> +s->connect_err = NULL;
>> +error_propagate(>connect_err, local_err);
>> +local_err = NULL;
>> +
>> +if (s->connect_status < 0) {
>> +/* failed attempt */
>> +return;
>> +}
>> +
>> +/* successfully connected */
>> +s->state = NBD_CLIENT_CONNECTED;
>> +qemu_co_queue_restart_all(>free_sema);
>> +}
>> +
>> +static coroutine_fn void nbd_reconnect_loop(BDRVNBDState *s)
>> +{
>> +uint64_t start_time_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>> +uint64_t delay_ns = s->reconnect_delay * NANOSECONDS_PER_SECOND;
>> +uint64_t timeout = 1 * NANOSECONDS_PER_SECOND;
>> +uint64_t max_timeout = 16 * NANOSECONDS_PER_SECOND;
>> +
>> +nbd_reconnect_attempt(s);
>> +
>> +while (nbd_client_connecting(s)) {
>> +if (s->state == NBD_CLIENT_CONNECTING_WAIT &&
>> +qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time_ns > 
>> delay_ns)
>> +{
>> +s->state = NBD_CLIENT_CONNECTING_NOWAIT;
>> +qemu_co_queue_restart_all(>free_sema);
>> +}
>> +
>> +qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, timeout,
>> + >connection_co_sleep_ns_state);
>> +if (s->drained) {
>> +bdrv_dec_in_flight(s->bs);
>> +s->wait_drained_end = true;
>> +while (s->drained) {
>> +/*
>> + * We may be entered once from 
>> nbd_client_attach_aio_context_bh
>> + * and then from nbd_client_co_drain_end. So here is a loop.
>> + */
>> +qemu_coroutine_yield();
>> + 

Re: [Qemu-block] [PATCH 08/13] qcrypto: add the plumbing for encryption management

2019-08-22 Thread Maxim Levitsky
On Thu, 2019-08-22 at 12:16 +0100, Daniel P. Berrangé wrote:
> On Wed, Aug 14, 2019 at 11:22:14PM +0300, Maxim Levitsky wrote:
> > This adds qcrypto_block_manage_encryption, which
> >  is thin wrapper around manage_encryption of the crypto driver
> >  which is also added
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  crypto/block.c | 29 +
> >  crypto/blockpriv.h |  9 +
> >  include/crypto/block.h | 27 +++
> >  3 files changed, 65 insertions(+)
> > 
> > diff --git a/crypto/block.c b/crypto/block.c
> > index ee96759f7d..5916e49aba 100644
> > --- a/crypto/block.c
> > +++ b/crypto/block.c
> > @@ -20,6 +20,7 @@
> >  
> >  #include "qemu/osdep.h"
> >  #include "qapi/error.h"
> > +
> >  #include "blockpriv.h"
> >  #include "block-qcow.h"
> >  #include "block-luks.h"
> > @@ -282,6 +283,34 @@ void qcrypto_block_free(QCryptoBlock *block)
> >  }
> >  
> >  
> > +int qcrypto_block_setup_encryption(QCryptoBlock *block,
> > +   QCryptoBlockReadFunc readfunc,
> > +   QCryptoBlockWriteFunc writefunc,
> > +   void *opaque,
> > +   enum BlkSetupEncryptionAction action,
> > +   QCryptoEncryptionSetupOptions *options,
> > +   bool force,
> > +   Error **errp)
> > +{
> > +if (!block->driver->setup_encryption) {
> > +error_setg(errp,
> > +"Crypto format %s doesn't support management of encryption 
> > keys",
> > +QCryptoBlockFormat_str(block->format));
> > +return -1;
> > +}
> > +
> > +return block->driver->setup_encryption(block,
> > +   readfunc,
> > +   writefunc,
> > +   opaque,
> > +   action,
> > +   options,
> > +   force,
> > +   errp);
> > +}
> > +
> > +
> > +
> >  typedef int (*QCryptoCipherEncDecFunc)(QCryptoCipher *cipher,
> >  const void *in,
> >  void *out,
> > diff --git a/crypto/blockpriv.h b/crypto/blockpriv.h
> > index 71c59cb542..804965dca3 100644
> > --- a/crypto/blockpriv.h
> > +++ b/crypto/blockpriv.h
> > @@ -81,6 +81,15 @@ struct QCryptoBlockDriver {
> >  
> >  bool (*has_format)(const uint8_t *buf,
> > size_t buflen);
> > +
> > +int (*setup_encryption)(QCryptoBlock *block,
> > +QCryptoBlockReadFunc readfunc,
> > +QCryptoBlockWriteFunc writefunc,
> > +void *opaque,
> > +enum BlkSetupEncryptionAction action,
> > +QCryptoEncryptionSetupOptions *options,
> > +bool force,
> > +Error **errp);
> >  };
> >  
> >  
> > diff --git a/include/crypto/block.h b/include/crypto/block.h
> > index fe12899831..60d46e3efc 100644
> > --- a/include/crypto/block.h
> > +++ b/include/crypto/block.h
> > @@ -23,6 +23,7 @@
> >  
> >  #include "crypto/cipher.h"
> >  #include "crypto/ivgen.h"
> > +#include "block/block.h"
> >  
> >  typedef struct QCryptoBlock QCryptoBlock;
> >  
> > @@ -268,4 +269,30 @@ uint64_t qcrypto_block_get_sector_size(QCryptoBlock 
> > *block);
> >   */
> >  void qcrypto_block_free(QCryptoBlock *block);
> >  
> > +
> > +/**
> > + * qcrypto_block_setup_encryption:
> > + * @block: the block encryption object
> > + *
> > + * @readfunc: callback for reading data from the volume header
> > + * @writefunc: callback for writing data to the volume header
> > + * @opaque: data to pass to @readfunc and @writefunc
> > + * @action: tell the driver the setup action (add/erase currently)
> > + * @options: driver specific options, that specify
> > + *   what encryption settings to manage
> > + * @force: hint for the driver to allow unsafe operation
> > + * @errp: error pointer
> > + *
> > + * Adds/Erases a new encryption key using @options
> 
> I'd prefer to see separate APIs for add + erase instead
> of overloading. It'll lead to a clearer API from callers
> POV to see exactly which parameters are for each action.


The downside of this is some duplication of code in the middle layers.
I don't mind doing that if you are OK with that.

Also note that if we go with amend options, we will have to use the single
API.


Best regards,
Maxim Levitsky





Re: [Qemu-block] [PATCH 10/13] block/crypto: implement the encryption key management

2019-08-22 Thread Maxim Levitsky
On Thu, 2019-08-22 at 12:29 +0100, Daniel P. Berrangé wrote:
> On Wed, Aug 14, 2019 at 11:22:16PM +0300, Maxim Levitsky wrote:
> > This implements the encryption key management
> > using the generic code in qcrypto layer
> > 
> > This code adds another 'write_func' because the initialization
> > write_func works directly on the underlying file,
> > because during the creation, there is no open instance
> > of the luks driver, but during regular use, we have it,
> > and should use it instead.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/crypto.c | 96 --
> >  1 file changed, 93 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/crypto.c b/block/crypto.c
> > index 42a3f0898b..415b6db041 100644
> > --- a/block/crypto.c
> > +++ b/block/crypto.c
> > @@ -36,6 +36,7 @@ typedef struct BlockCrypto BlockCrypto;
> >  
> >  struct BlockCrypto {
> >  QCryptoBlock *block;
> > +bool updating_keys;
> >  };
> >  
> >  
> > @@ -69,6 +70,24 @@ static ssize_t block_crypto_read_func(QCryptoBlock 
> > *block,
> >  return ret;
> >  }
> >  
> > +static ssize_t block_crypto_write_func(QCryptoBlock *block,
> > +  size_t offset,
> > +  const uint8_t *buf,
> > +  size_t buflen,
> > +  void *opaque,
> > +  Error **errp)
> > +{
> > +BlockDriverState *bs = opaque;
> > +ssize_t ret;
> > +
> > +ret = bdrv_pwrite(bs->file, offset, buf, buflen);
> > +if (ret < 0) {
> > +error_setg_errno(errp, -ret, "Could not write encryption header");
> > +return ret;
> > +}
> > +return ret;
> > +}
> > +
> >  
> >  struct BlockCryptoCreateData {
> >  BlockBackend *blk;
> > @@ -622,6 +641,78 @@ block_crypto_get_specific_info_luks(BlockDriverState 
> > *bs, Error **errp)
> >  return spec_info;
> >  }
> >  
> > +
> > +static int
> > +block_crypto_setup_encryption(BlockDriverState *bs,
> > +  enum BlkSetupEncryptionAction action,
> > +  QCryptoEncryptionSetupOptions *options,
> > +  bool force,
> > +  Error **errp)
> > +{
> > +BlockCrypto *crypto = bs->opaque;
> > +int ret;
> > +
> > +assert(crypto);
> > +assert(crypto->block);
> > +
> > +crypto->updating_keys = true;
> > +
> > +ret = bdrv_child_refresh_perms(bs, bs->file, errp);
> > +
> > +if (ret) {
> > +crypto->updating_keys = false;
> > +return ret;
> > +}
> > +
> > +ret = qcrypto_block_setup_encryption(crypto->block,
> > +  block_crypto_read_func,
> > +  block_crypto_write_func,
> > +  bs,
> > +  action,
> > +  options,
> > +  force,
> > +  errp);
> > +
> > +crypto->updating_keys = false;
> > +bdrv_child_refresh_perms(bs, bs->file, errp);
> > +
> > +
> > +return ret;
> > +
> > +}
> > +
> > +
> > +static void
> > +block_crypto_child_perms(BlockDriverState *bs, BdrvChild *c,
> > + const BdrvChildRole *role,
> > + BlockReopenQueue *reopen_queue,
> > + uint64_t perm, uint64_t shared,
> > + uint64_t *nperm, uint64_t *nshared)
> > +{
> > +
> > +BlockCrypto *crypto = bs->opaque;
> > +
> > +/*
> > + * This driver doesn't modify LUKS metadata except
> > + * when updating the encryption slots.
> > + * Allow share-rw=on as a special case.
> > + *
> > + * Encryption update will set the crypto->updating_keys
> > + * during that period and refresh permissions
> > + *
> > + * */
> > +
> > +if (crypto->updating_keys) {
> > +/*need exclusive write access for header update  */
> > +perm |= BLK_PERM_WRITE;
> > +shared &= ~BLK_PERM_WRITE;
> > +}
> 
> So if 2 QEMU's have the same LUKS image open, this means that
> if one tries to update the header, it will fail to upgrade
> its lock & thus be blocked from updating header ?
> 
I guess so. 

That what I understood from our talk with Kevin and then also from
my own understanding of the permission code.
I absolutely don't like the 'global' variable 'crypto->updating_keys'
but we kind of agreed that this must be done like that.
In the defense of this, this code is only needed for backward compatibility,
so maybe this is the right solution after all.

> > +
> > +bdrv_filter_default_perms(bs, c, role, reopen_queue,
> > +perm, shared, nperm, nshared);
> > 

Best regards,
Maxim Levitsky





Re: [Qemu-block] QEMU bitmap backup usability FAQ

2019-08-22 Thread Vladimir Sementsov-Ogievskiy
22.08.2019 0:19, John Snow wrote:
> 
> 
> On 8/21/19 10:21 AM, Vladimir Sementsov-Ogievskiy wrote:
>> [CC Nikolay]
>>
>> 21.08.2019 1:25, John Snow wrote:
>>> Hi, downstream here at Red Hat I've been fielding some questions about
>>> the usability and feature readiness of Bitmaps (and related features) in
>>> QEMU.
>>>
>>> Here are some questions I answered internally that I am copying to the
>>> list for two reasons:
>>>
>>> (1) To make sure my answers are actually correct, and
>>> (2) To share this pseudo-reference with the community at large.
>>>
>>> This is long, and mostly for reference. There's a summary at the bottom
>>> with some todo items and observations about the usability of the feature
>>> as it exists in QEMU.
>>>
>>> Before too long, I intend to send a more summarized "roadmap" mail which
>>> details all of the current and remaining work to be done in and around
>>> the bitmaps feature in QEMU.
>>>
>>>
>>> Questions:
>>>
 "What format(s) is/are required for this functionality?"
>>>
>>>   From the QEMU API, any format can be used to create and author
>>> incremental backups. The only known format limitations are:
>>>
>>> 1. Persistent bitmaps cannot be created on any format except qcow2,
>>> although there are hooks to add support to other formats at a later date
>>> if desired.
>>>
>>> DANGER CAVEAT #1: Adding bitmaps to QEMU by default creates transient
>>> bitmaps instead of persistent ones.
>>>
>>> Possible TODO: Allow users to 'upgrade' transient bitmaps to persistent
>>> ones in case they made a mistake.
>>
>> I doubt, as in my opinion real users of Qemu are not people but libvirt, 
>> which
>> should never make such mistake.
>>
> 
> Right, that's largely been the consensus here; but there is some concern
> that libvirt might not be the only user of QEMU, so I felt it was worth
> documenting some of the crucial moments where it was "easy" to get it wrong.
> 
> I think like it or not, the API that QEMU presents has to be considered
> on its own without libvirt because it's not a given that libvirt will
> forever and always be the only user of QEMU.
> 
> I do think that any problems of this kind that can be solved in libvirt
> are not immediate, crucial action items. libvirt WILL be the major user
> of these features.
> 
> However, try as we might, releasing a set of primitive operations that
> offer 998 ways to corrupt your data and 2 ways to manage it correctly
> are going to provoke some questions from people who are trying to work
> with that API, including from libvirt developers.
> 
> It might be the conclusion that it's libvirt's job to safeguard the user
> from themselves, but we at least need to present consistent and clear
> information about the way we expect/anticipate people to use the APIs,
> because people DO keep asking me about several of these issues and the
> usability problems they perceive with the QEMU API.
> 
> So this thread was largely in attempt to explore what some "solutions"
> to perceived problems look like, mostly to come to the conclusion that
> the actual "must-haves" list in QEMU is not very long compared to the
> "nice-to-haves?" list.
> 
>>>
>>>
>>> 2. When using push backups (blockdev-backup, drive-backup), you may use
>>> any format as a target format. >
>>> DANGER CAVEAT #2: without backing file and/or filesystem-less sparse
>>> support, these images will be unusable.
>>
>> You mean incremental backups of course, as the whole document is about 
>> bitmaps.
>>
> 
> Ah, yes, incremental push backups. Full backups are of course not a
> problem. :)
> 
>>>
>>> EXAMPLE: Backing up to a raw file loses allocation information, so we
>>> can no longer distinguish between zeroes and unallocated regions. The
>>> cluster size is also lost. This file will not be usable without
>>> additional metadata recorded elsewhere.*
>>>
>>> (* This is complicated, but it is in theory possible to do a push backup
>>> to e.g. an NBD target with custom server code that saves allocation
>>> information to a metadata file, which would allow you to reconstruct
>>> backups. For instance, recording in a .json file which extents were
>>> written out would allow you to -- with a custom binary -- write this
>>> information on top of a base file to reconstruct a backup.)
>>>
>>>
>>> 3. Any format can be used for either shared storage or live storage
>>> migrations. There are TWO distinct mechanisms for migrating bitmaps:
>>>
>>> A) The bitmap is flushed to storage and re-opened on the destination.
>>> This is only supported for qcow2 and shared-storage migrations.
>>
>> cons: flushing/reopening is done during migration downtime, so if you have
>> a lot of bitmap data (for example, 64k granulared bitmap for 16tb disk is
>> ~30MB, and there may be several bitmaps) downtime will become long.
>>
> 
> Worth documenting the drawback, yes.
> 
>>>
>>> B) The bitmap is live-migrated to the destination. This is supported for
>>> any format and can be used for either shared storage or live 

Re: [Qemu-block] [PATCH 00/13] RFC: luks/encrypted qcow2 key management

2019-08-22 Thread Daniel P . Berrangé
On Wed, Aug 14, 2019 at 11:22:06PM +0300, Maxim Levitsky wrote:
> Hi!
> 
> This patch series implements key management for luks based encryption
> It supports both raw luks images and qcow2 encrypted images.
> 
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1731898
> 
> There are still several issues that need to be figured out,
> on which the feedback is very welcome, but other than that the code mostly 
> works.
> 
> The main issues are:
> 
> 1. Instead of the proposed 
> blockdev-update-encryption/blockdev-erase-encryption
> interface, it is probably better to implement 'blockdev-amend-options' in qmp,
> and use this both for offline and online key update (with some translation
> layer to convert the qemu-img 'options' to qmp structures)
> 
> This interface already exists for offline qcow2 format options update/
> 
> This is an issue that was raised today on IRC with Kevin Wolf. Really thanks
> for the idea!
> 
> We agreed that this new qmp interface should take the same options as
> blockdev-create does, however since we want to be able to edit the encryption
> slots separately, this implies that we sort of need to allow this on creation
> time as well.
> 
> Also the BlockdevCreateOptions is a union, which is specialized by the driver 
> name
> which is great for creation, but for update, the driver name is already known,
> and thus the user should not be forced to pass it again.
> However qmp doesn't seem to support union type guessing based on actual fields
> given (this might not be desired either), which complicates this somewhat.

Given this design question around the integration into blockdev, I'd
suggest splitting the series into two parts.

One series should do all the work in crypto/ code to support adding
and erasing key slots.

One series should focus on block/ layer QMP/qemu-img integration.

The block layer QAPI stuff shouldn't leak into the crypto/ code.

So this will let us get on with reviewing & unit testing the
crypto code, while we discuss block layer design options in more
detail.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH 12/13] qemu-img: implement key management

2019-08-22 Thread Daniel P . Berrangé
On Tue, Aug 20, 2019 at 08:29:55PM +0200, Max Reitz wrote:
> On 14.08.19 22:22, Maxim Levitsky wrote:
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/crypto.c   |  16 ++
> >  block/crypto.h   |   3 +
> >  qemu-img-cmds.hx |  13 +
> >  qemu-img.c   | 140 +++
> >  4 files changed, 172 insertions(+)
> 
> Yes, this seems a bit weird.  Putting it under amend seems like the
> natural thing if that works; if not, I think it should be a single
> qemu-img subcommand instead of two.

I'm not convinced by overloading two distinct operations on to one
sub-command - doesn't seem to give an obvious benefit to overload
them & IME experiance overloading results in harder to understand
commands due to having distinct args to each command.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-22 Thread Nir Soffer
Max, did you have time to look at this?

On Sat, Aug 17, 2019 at 12:21 AM Nir Soffer  wrote:

> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
>
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
>
> When using preallocation=off, we always allocate at least one filesystem
> block:
>
> $ ./qemu-img create -f raw test.raw 1g
> Formatting 'test.raw', fmt=raw size=1073741824
>
> $ ls -lhs test.raw
> 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
>
> I did quick performance tests for these flows:
> - Provisioning a VM with a new raw image.
> - Copying disks with qemu-img convert to new raw target image
>
> I installed Fedora 29 server on raw sparse image, measuring the time
> from clicking "Begin installation" until the "Reboot" button appears:
>
> Before(s)  After(s) Diff(%)
> ---
>  356389+8.4
>
> I ran this only once, so we cannot tell much from these results.
>
> The second test was cloning the installation image with qemu-img
> convert, doing 10 runs:
>
> for i in $(seq 10); do
> rm -f dst.raw
> sleep 10
> time ./qemu-img convert -f raw -O raw -t none -T none src.raw
> dst.raw
> done
>
> Here is a table comparing the total time spent:
>
> TypeBefore(s)   After(s)Diff(%)
> ---
> real  530.028469.123  -11.4
> user   17.204 10.768  -37.4
> sys17.881  7.011  -60.7
>
> Here we see very clear improvement in CPU usage.
>
> Signed-off-by: Nir Soffer 
> ---
>  block/file-posix.c | 25 +
>  tests/qemu-iotests/150.out |  1 +
>  tests/qemu-iotests/160 |  4 
>  tests/qemu-iotests/175 | 19 +--
>  tests/qemu-iotests/175.out |  8 
>  tests/qemu-iotests/221.out | 12 
>  tests/qemu-iotests/253.out | 12 
>  7 files changed, 63 insertions(+), 18 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index b9c33c8f6c..3964dd2021 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
>  return ret;
>  }
>
> +/*
> + * Help alignment detection by allocating the first block.
> + *
> + * When reading with direct I/O from unallocated area on Gluster backed
> by XFS,
> + * reading succeeds regardless of request length. In this case we
> fallback to
> + * safe aligment which is not optimal. Allocating the first block avoids
> this
> + * fallback.
> + *
> + * Returns: 0 on success, -errno on failure.
> + */
> +static int allocate_first_block(int fd)
> +{
> +ssize_t n;
> +
> +do {
> +n = pwrite(fd, "\0", 1, 0);
> +} while (n == -1 && errno == EINTR);
> +
> +return (n == -1) ? -errno : 0;
> +}
> +
>  static int handle_aiocb_truncate(void *opaque)
>  {
>  RawPosixAIOData *aiocb = opaque;
> @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
>  /* posix_fallocate() doesn't set errno. */
>  error_setg_errno(errp, -result,
>   "Could not preallocate new data");
> +} else if (current_length == 0) {
> +allocate_first_block(fd);
>  }
>  } else {
>  result = 0;
> @@ -1855,6 +1878,8 @@ static int handle_aiocb_truncate(void *opaque)
>  if (ftruncate(fd, offset) != 0) {
>  result = -errno;
>  error_setg_errno(errp, -result, "Could not resize file");
> +} else if (current_length == 0 && offset > current_length) {
> +allocate_first_block(fd);
>  }
>  return result;
>  default:
> diff --git a/tests/qemu-iotests/150.out b/tests/qemu-iotests/150.out
> index 2a54e8dcfa..3cdc7727a5 100644
> --- a/tests/qemu-iotests/150.out
> +++ b/tests/qemu-iotests/150.out
> @@ -3,6 +3,7 @@ QA output created by 150
>  === Mapping sparse conversion ===
>
>  Offset  Length  File
> +0   0x1000  TEST_DIR/t.IMGFMT
>
>  === Mapping non-sparse conversion ===
>
> diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> index df89d3864b..ad2d054a47 100755
> --- a/tests/qemu-iotests/160
> +++ b/tests/qemu-iotests/160
> @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
>  $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O
> "$IMGFMT" \
>  2> /dev/null
>  TEST_IMG="$TEST_IMG.out" _check_test_img
> +
> +# We always write the first byte of an image.
> +printf "\0" > "$TEST_IMG.out.dd"
> +
>  dd 

Re: [Qemu-block] [PATCH 10/13] block/crypto: implement the encryption key management

2019-08-22 Thread Daniel P . Berrangé
On Wed, Aug 14, 2019 at 11:22:16PM +0300, Maxim Levitsky wrote:
> This implements the encryption key management
> using the generic code in qcrypto layer
> 
> This code adds another 'write_func' because the initialization
> write_func works directly on the underlying file,
> because during the creation, there is no open instance
> of the luks driver, but during regular use, we have it,
> and should use it instead.
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  block/crypto.c | 96 --
>  1 file changed, 93 insertions(+), 3 deletions(-)
> 
> diff --git a/block/crypto.c b/block/crypto.c
> index 42a3f0898b..415b6db041 100644
> --- a/block/crypto.c
> +++ b/block/crypto.c
> @@ -36,6 +36,7 @@ typedef struct BlockCrypto BlockCrypto;
>  
>  struct BlockCrypto {
>  QCryptoBlock *block;
> +bool updating_keys;
>  };
>  
>  
> @@ -69,6 +70,24 @@ static ssize_t block_crypto_read_func(QCryptoBlock *block,
>  return ret;
>  }
>  
> +static ssize_t block_crypto_write_func(QCryptoBlock *block,
> +  size_t offset,
> +  const uint8_t *buf,
> +  size_t buflen,
> +  void *opaque,
> +  Error **errp)
> +{
> +BlockDriverState *bs = opaque;
> +ssize_t ret;
> +
> +ret = bdrv_pwrite(bs->file, offset, buf, buflen);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "Could not write encryption header");
> +return ret;
> +}
> +return ret;
> +}
> +
>  
>  struct BlockCryptoCreateData {
>  BlockBackend *blk;
> @@ -622,6 +641,78 @@ block_crypto_get_specific_info_luks(BlockDriverState 
> *bs, Error **errp)
>  return spec_info;
>  }
>  
> +
> +static int
> +block_crypto_setup_encryption(BlockDriverState *bs,
> +  enum BlkSetupEncryptionAction action,
> +  QCryptoEncryptionSetupOptions *options,
> +  bool force,
> +  Error **errp)
> +{
> +BlockCrypto *crypto = bs->opaque;
> +int ret;
> +
> +assert(crypto);
> +assert(crypto->block);
> +
> +crypto->updating_keys = true;
> +
> +ret = bdrv_child_refresh_perms(bs, bs->file, errp);
> +
> +if (ret) {
> +crypto->updating_keys = false;
> +return ret;
> +}
> +
> +ret = qcrypto_block_setup_encryption(crypto->block,
> +  block_crypto_read_func,
> +  block_crypto_write_func,
> +  bs,
> +  action,
> +  options,
> +  force,
> +  errp);
> +
> +crypto->updating_keys = false;
> +bdrv_child_refresh_perms(bs, bs->file, errp);
> +
> +
> +return ret;
> +
> +}
> +
> +
> +static void
> +block_crypto_child_perms(BlockDriverState *bs, BdrvChild *c,
> + const BdrvChildRole *role,
> + BlockReopenQueue *reopen_queue,
> + uint64_t perm, uint64_t shared,
> + uint64_t *nperm, uint64_t *nshared)
> +{
> +
> +BlockCrypto *crypto = bs->opaque;
> +
> +/*
> + * This driver doesn't modify LUKS metadata except
> + * when updating the encryption slots.
> + * Allow share-rw=on as a special case.
> + *
> + * Encryption update will set the crypto->updating_keys
> + * during that period and refresh permissions
> + *
> + * */
> +
> +if (crypto->updating_keys) {
> +/*need exclusive write access for header update  */
> +perm |= BLK_PERM_WRITE;
> +shared &= ~BLK_PERM_WRITE;
> +}

So if 2 QEMU's have the same LUKS image open, this means that
if one tries to update the header, it will fail to upgrade
its lock & thus be blocked from updating header ?

> +
> +bdrv_filter_default_perms(bs, c, role, reopen_queue,
> +perm, shared, nperm, nshared);
> +}

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH 09/13] qcrypto-luks: implement the encryption key management

2019-08-22 Thread Daniel P . Berrangé
On Wed, Aug 14, 2019 at 11:22:15PM +0300, Maxim Levitsky wrote:
> Signed-off-by: Maxim Levitsky 
> ---
>  crypto/block-luks.c | 374 +++-
>  1 file changed, 373 insertions(+), 1 deletion(-)
> 
> diff --git a/crypto/block-luks.c b/crypto/block-luks.c
> index 1997e92fe1..2c33643b52 100644
> --- a/crypto/block-luks.c
> +++ b/crypto/block-luks.c
> @@ -72,6 +72,8 @@ typedef struct QCryptoBlockLUKSKeySlot 
> QCryptoBlockLUKSKeySlot;
>  
>  #define QCRYPTO_BLOCK_LUKS_DEFAULT_ITER_TIME 2000
>  
> +#define QCRYPTO_BLOCK_LUKS_ERASE_ITERATIONS 40
> +
>  static const char qcrypto_block_luks_magic[QCRYPTO_BLOCK_LUKS_MAGIC_LEN] = {
>  'L', 'U', 'K', 'S', 0xBA, 0xBE
>  };
> @@ -221,6 +223,9 @@ struct QCryptoBlockLUKS {
>  
>  /* Hash algorithm used in pbkdf2 function */
>  QCryptoHashAlgorithm hash_alg;
> +
> +/* Name of the secret that was used to open the image */
> +char *secret;
>  };
>  
>  
> @@ -1121,6 +1126,194 @@ qcrypto_block_luks_find_key(QCryptoBlock *block,
>  }
>  
>  
> +
> +/*
> + * Returns true if a slot i is marked as containing as active

s/as containing//

> + * (contains encrypted copy of the master key)
> + */
> +
> +static bool
> +qcrypto_block_luks_slot_active(QCryptoBlockLUKS *luks, int slot_idx)
> +{
> +uint32_t val = luks->header.key_slots[slot_idx].active;
> +return val ==  QCRYPTO_BLOCK_LUKS_KEY_SLOT_ENABLED;
> +}
> +
> +/*
> + * Returns the number of slots that are marked as active
> + * (contains encrypted copy of the master key)
> + */
> +
> +static int
> +qcrypto_block_luks_count_active_slots(QCryptoBlockLUKS *luks)
> +{
> +int i, ret = 0;

I prefer to see 'size_t' for loop iterators 

> +
> +for (i = 0; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS; i++) {
> +if (qcrypto_block_luks_slot_active(luks, i)) {
> +ret++;
> +}
> +}
> +return ret;
> +}
> +
> +
> +/*
> + * Finds first key slot which is not active
> + * Returns the key slot index, or -1 if doesn't exist
> + */
> +
> +static int
> +qcrypto_block_luks_find_free_keyslot(QCryptoBlockLUKS *luks)
> +{
> +uint i;
> +
> +for (i = 0; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS; i++) {
> +if (!qcrypto_block_luks_slot_active(luks, i)) {
> +return i;
> +}
> +}
> +return -1;
> +
> +}
> +
> +/*
> + * Erases an keyslot given its index
> + *
> + * Returns:
> + *0 if the keyslot was erased successfully
> + *   -1 if a error occurred while erasing the keyslot
> + *
> + */
> +

Redundant blank line

> +static int
> +qcrypto_block_luks_erase_key(QCryptoBlock *block,
> + uint slot_idx,
> + QCryptoBlockWriteFunc writefunc,
> + void *opaque,
> + Error **errp)
> +{
> +QCryptoBlockLUKS *luks = block->opaque;
> +QCryptoBlockLUKSKeySlot *slot = >header.key_slots[slot_idx];
> +uint8_t *garbagekey = NULL;
> +size_t splitkeylen = masterkeylen(luks) * slot->stripes;
> +int i;
> +int ret = -1;
> +
> +assert(slot_idx < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS);
> +assert(splitkeylen > 0);
> +
> +garbagekey = g_malloc0(splitkeylen);
> +
> +/* Reset the key slot header */
> +memset(slot->salt, 0, QCRYPTO_BLOCK_LUKS_SALT_LEN);
> +slot->iterations = 0;
> +slot->active = QCRYPTO_BLOCK_LUKS_KEY_SLOT_DISABLED;
> +
> +qcrypto_block_luks_store_header(block,  writefunc, opaque, errp);
> +
> +/*
> + * Now try to erase the key material, even if the header
> + * update failed
> + */
> +

Redundant blank line

> +for (i = 0 ; i < QCRYPTO_BLOCK_LUKS_ERASE_ITERATIONS ; i++) {
> +if (qcrypto_random_bytes(garbagekey, splitkeylen, errp) < 0) {
> +

Again, many more times beelow.

> +/*
> + * If we failed to get the random data, still write
> + * *something* to the key slot at least once
> + */

Specificially  we write all-zeros, since garbagekey was allocated
with g_malloc0

> +
> +if (i > 0) {
> +goto cleanup;
> +}
> +}
> +
> +if (writefunc(block, slot->key_offset * 
> QCRYPTO_BLOCK_LUKS_SECTOR_SIZE,
> +  garbagekey,
> +  splitkeylen,
> +  opaque,
> +  errp) != splitkeylen) {

Indent is off - align with '('

> +goto cleanup;
> +}
> +}
> +
> +ret = 0;
> +cleanup:
> +g_free(garbagekey);
> +return ret;
> +}
> +
> +
> +/*
> + * Erase all the keys that match the given password
> + * Will stop when only one keyslot is remaining
> + * Returns 0 is some keys were erased or -1 on failure
> + */
> +
> +static int
> +qcrypto_block_luks_erase_matching_keys(QCryptoBlock *block,
> + const char *password,
> + QCryptoBlockReadFunc readfunc,
> + 

Re: [Qemu-block] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-22 Thread Lukáš Doktor
Dne 21. 08. 19 v 19:51 Max Reitz napsal(a):
> On 21.08.19 16:14, Lukáš Doktor wrote:
>> Hello guys,
>>
>> First attempt was rejected due to zip attachment, let's try it again with 
>> just Avocado-vt debug.log and serial console log files attached.
>>
>> I bisected a regression on aarch64 all the way to this commit: "qcow2: skip 
>> writing zero buffers to empty COW areas" 
>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at it?
> 
> I think I can see the issue on my x64 system (I don’t see the XFS
> corruption, but the installation fails because of some segfaults).
> 
> I haven’t found a simpler way to reproduce the problem yet, though,
> which is a pain... :-/
> 
> It looks like the problem disappears when I configure qemu with
> “--disable-xfsctl”.  Can you try that?
> 
> Max
> 

Hello Max,

yes, I'm getting the same behavior. With "--disable-xfsctl" it works well. Also 
looking at the option I understand why it only failed on aarch64 for me, I 
don't have libs installed on the other machines, therefor it was disabled by 
"./configure" there. Anyway I guess disabling it in my builds won't really fix 
the issue, right? :-)

Regards,
Lukáš



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH 08/13] qcrypto: add the plumbing for encryption management

2019-08-22 Thread Daniel P . Berrangé
On Wed, Aug 14, 2019 at 11:22:14PM +0300, Maxim Levitsky wrote:
> This adds qcrypto_block_manage_encryption, which
>  is thin wrapper around manage_encryption of the crypto driver
>  which is also added
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  crypto/block.c | 29 +
>  crypto/blockpriv.h |  9 +
>  include/crypto/block.h | 27 +++
>  3 files changed, 65 insertions(+)
> 
> diff --git a/crypto/block.c b/crypto/block.c
> index ee96759f7d..5916e49aba 100644
> --- a/crypto/block.c
> +++ b/crypto/block.c
> @@ -20,6 +20,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qapi/error.h"
> +
>  #include "blockpriv.h"
>  #include "block-qcow.h"
>  #include "block-luks.h"
> @@ -282,6 +283,34 @@ void qcrypto_block_free(QCryptoBlock *block)
>  }
>  
>  
> +int qcrypto_block_setup_encryption(QCryptoBlock *block,
> +   QCryptoBlockReadFunc readfunc,
> +   QCryptoBlockWriteFunc writefunc,
> +   void *opaque,
> +   enum BlkSetupEncryptionAction action,
> +   QCryptoEncryptionSetupOptions *options,
> +   bool force,
> +   Error **errp)
> +{
> +if (!block->driver->setup_encryption) {
> +error_setg(errp,
> +"Crypto format %s doesn't support management of encryption 
> keys",
> +QCryptoBlockFormat_str(block->format));
> +return -1;
> +}
> +
> +return block->driver->setup_encryption(block,
> +   readfunc,
> +   writefunc,
> +   opaque,
> +   action,
> +   options,
> +   force,
> +   errp);
> +}
> +
> +
> +
>  typedef int (*QCryptoCipherEncDecFunc)(QCryptoCipher *cipher,
>  const void *in,
>  void *out,
> diff --git a/crypto/blockpriv.h b/crypto/blockpriv.h
> index 71c59cb542..804965dca3 100644
> --- a/crypto/blockpriv.h
> +++ b/crypto/blockpriv.h
> @@ -81,6 +81,15 @@ struct QCryptoBlockDriver {
>  
>  bool (*has_format)(const uint8_t *buf,
> size_t buflen);
> +
> +int (*setup_encryption)(QCryptoBlock *block,
> +QCryptoBlockReadFunc readfunc,
> +QCryptoBlockWriteFunc writefunc,
> +void *opaque,
> +enum BlkSetupEncryptionAction action,
> +QCryptoEncryptionSetupOptions *options,
> +bool force,
> +Error **errp);
>  };
>  
>  
> diff --git a/include/crypto/block.h b/include/crypto/block.h
> index fe12899831..60d46e3efc 100644
> --- a/include/crypto/block.h
> +++ b/include/crypto/block.h
> @@ -23,6 +23,7 @@
>  
>  #include "crypto/cipher.h"
>  #include "crypto/ivgen.h"
> +#include "block/block.h"
>  
>  typedef struct QCryptoBlock QCryptoBlock;
>  
> @@ -268,4 +269,30 @@ uint64_t qcrypto_block_get_sector_size(QCryptoBlock 
> *block);
>   */
>  void qcrypto_block_free(QCryptoBlock *block);
>  
> +
> +/**
> + * qcrypto_block_setup_encryption:
> + * @block: the block encryption object
> + *
> + * @readfunc: callback for reading data from the volume header
> + * @writefunc: callback for writing data to the volume header
> + * @opaque: data to pass to @readfunc and @writefunc
> + * @action: tell the driver the setup action (add/erase currently)
> + * @options: driver specific options, that specify
> + *   what encryption settings to manage
> + * @force: hint for the driver to allow unsafe operation
> + * @errp: error pointer
> + *
> + * Adds/Erases a new encryption key using @options

I'd prefer to see separate APIs for add + erase instead
of overloading. It'll lead to a clearer API from callers
POV to see exactly which parameters are for each action.

> + *
> + */
> +int qcrypto_block_setup_encryption(QCryptoBlock *block,
> +   QCryptoBlockReadFunc readfunc,
> +   QCryptoBlockWriteFunc writefunc,
> +   void *opaque,
> +   enum BlkSetupEncryptionAction action,
> +   QCryptoEncryptionSetupOptions *options,
> +   bool force,
> +   Error **errp);
> +

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-

Re: [Qemu-block] [PATCH 07/13] block: add manage-encryption command (qmp and blockdev)

2019-08-22 Thread Daniel P . Berrangé
On Tue, Aug 20, 2019 at 08:27:48PM +0200, Max Reitz wrote:
> On 14.08.19 22:22, Maxim Levitsky wrote:
> > This adds:
> > 
> > * x-blockdev-update-encryption and x-blockdev-erase-encryption qmp commands
> >   Both commands take the QCryptoKeyManageOptions
> >   the x-blockdev-update-encryption is meant for non destructive addition
> >   of key slots / whatever the encryption driver supports in the future
> > 
> >   x-blockdev-erase-encryption is meant for destructive encryption key erase,
> >   in some cases even without way to recover the data.
> > 
> > 
> > * bdrv_setup_encryption callback in the block driver
> >   This callback does both the above functions with 'action' parameter
> > 
> > * QCryptoKeyManageOptions with set of options that drivers can use for 
> > encryption managment
> >   Currently it has all the options that LUKS needs, and later it can be 
> > extended
> >   (via union) to support more encryption drivers if needed
> > 
> > * blk_setup_encryption / bdrv_setup_encryption - the usual block layer 
> > wrappers.
> >   Note that bdrv_setup_encryption takes BlockDriverState and not BdrvChild,
> >   for the ease of use from the qmp code. It is not expected that this 
> > function
> >   will be used by anything but qmp and qemu-img code
> > 
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/block-backend.c  |  9 
> >  block/io.c | 24 
> >  blockdev.c | 40 ++
> >  include/block/block.h  | 12 ++
> >  include/block/block_int.h  | 11 ++
> >  include/sysemu/block-backend.h |  7 ++
> >  qapi/block-core.json   | 36 ++
> >  qapi/crypto.json   | 26 ++
> >  8 files changed, 165 insertions(+)
> 
> Now I don’t know whether you want to keep this interface at all, because
> the cover letter seemed to imply you’d prefer a QMP amend.  But let it
> be said that a QMP amend is no trivial task.  I think the most difficult
> bit is that the qcow2 implementation currently is inherently an offline
> operation.  It isn’t a good idea to use it on a live image.  (Maybe it
> works, but it’s definitely not what I had in mind when I wrote it.)

For key mgmt we definitely need to have an online capability.

If layering it into a general purpose format "amend' option
is too complex, then we should keep it separate rather than
making our lives uncecessarily difficult IMHO.

> > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > index 0d43d4f37c..53ed411eed 100644
> > --- a/qapi/block-core.json
> > +++ b/qapi/block-core.json
> > @@ -5327,3 +5327,39 @@
> >'data' : { 'node-name': 'str',
> >   'iothread': 'StrOrNull',
> >   '*force': 'bool' } }
> > +
> > +
> > +##
> > +# @x-blockdev-update-encryption:
> > +#
> > +# Update the encryption keys for an encrypted block device
> > +#
> > +# @node-name:Name of the blockdev to operate on
> > +# @force: Disable safety checks (use with care)
> > +# @options:   Driver specific options
> > +#
> > +
> > +# Since: 4.2
> > +##
> > +{ 'command': 'x-blockdev-update-encryption',
> > +  'data': { 'node-name' : 'str',
> > +'*force' : 'bool',
> > +'options': 'QCryptoEncryptionSetupOptions' } }
> > +
> > +##
> > +# @x-blockdev-erase-encryption:
> > +#
> > +# Erase the encryption keys for an encrypted block device
> > +#
> > +# @node-name:Name of the blockdev to operate on
> 
> Why the tab?
> 
> > +# @force: Disable safety checks (use with care)
> 
> I think being a bit more verbose wouldn’t hurt.
> 
> (Same above.)
> 
> > +# @options:   Driver specific options
> > +#
> > +# Returns: @QCryptoKeyManageResult
> > +#
> > +# Since: 4.2
> > +##
> > +{ 'command': 'x-blockdev-erase-encryption',
> > +  'data': { 'node-name' : 'str',
> > +'*force' : 'bool',
> > +'options': 'QCryptoEncryptionSetupOptions' } }
> > diff --git a/qapi/crypto.json b/qapi/crypto.json
> > index b2a4cff683..69e8b086db 100644
> > --- a/qapi/crypto.json
> > +++ b/qapi/crypto.json
> > @@ -309,3 +309,29 @@
> >'base': 'QCryptoBlockInfoBase',
> >'discriminator': 'format',
> >'data': { 'luks': 'QCryptoBlockInfoLUKS' } }
> > +
> > +
> > +##
> > +# @QCryptoEncryptionSetupOptions:
> > +#
> > +# Driver specific options for encryption key management.
> 
> The options do seem LUKS-specific, but the name of this structure does not.
> 
> > +# @key-secret: the ID of a QCryptoSecret object providing the password
> > +#  to add or to erase (optional for erase)
> > +#
> > +# @old-key-secret: the ID of a QCryptoSecret object providing the password
> > +#  that can currently unlock the image
> > +#
> > +# @slot: Key slot to update/erase
> > +#(optional, for update will select a free slot,
> > +#for erase will erase all slots that match the password)
> > +#
> > 

Re: [Qemu-block] [Qemu-devel] [PATCH 02/13] qcrypto-luks: misc refactoring

2019-08-22 Thread Maxim Levitsky
On Thu, 2019-08-22 at 12:10 +0100, Daniel P. Berrangé wrote:
> On Thu, Aug 22, 2019 at 02:04:28PM +0300, Maxim Levitsky wrote:
> > On Thu, 2019-08-22 at 11:29 +0100, Daniel P. Berrangé wrote:
> > > On Thu, Aug 15, 2019 at 05:40:11PM -0400, John Snow wrote:
> > > > 
> > > > 
> > > > On 8/14/19 4:22 PM, Maxim Levitsky wrote:
> > > > > This is also a preparation for key read/write/erase functions
> > > > > 
> > > > 
> > > > This is a matter of taste and I am not usually reviewing LUKS patches
> > > > (So don't take me too seriously), but I would prefer not to have "misc"
> > > > patches and instead split things out by individual changes along with a
> > > > nice commit message for each change.
> > > > 
> > > > > * use master key len from the header
> > > > 
> > > > This touches enough lines that you could make it its own patch, I think.
> > > > 
> > > > > * prefer to use crypto params in the QCryptoBlockLUKS
> > > > >   over passing them as function arguments
> > > > 
> > > > I think the same is true here, and highlighting which variables you are
> > > > sticking into state instead of leaving as functional parameters would be
> > > > nice to see without all the other changes.
> > > > 
> > > > > * define QCRYPTO_BLOCK_LUKS_DEFAULT_ITER_TIME
> > > > 
> > > > This can likely be squashed with whichever patch of yours first needs to
> > > > use it, because it's so short.
> > > > 
> > > > > * Add comments to various crypto parameters in the QCryptoBlockLUKS
> > > > > 
> > > > 
> > > > Can probably be squashed with item #2.
> > > 
> > > Agreed, with all these points  - it is too hard to review this
> > > for correctness with everything merged in one commit, so I'll
> > > wait for v2 before reviewing much more.
> > > 
> > > > > @@ -397,6 +411,12 @@ 
> > > > > qcrypto_block_luks_essiv_cipher(QCryptoCipherAlgorithm cipher,
> > > > >  }
> > > > >  }
> > > > >  
> > > > > +static int masterkeylen(QCryptoBlockLUKS *luks)
> > > > > +{
> > > > > +return luks->header.key_bytes;
> > > > > +}
> > > > > +
> > > > > +
> > > > 
> > > > generally QEMU uses snake_case_names; please spell as "master_key_len".
> > > 
> > > Also naming convention in this file expects "qcrypto_block_luks_" prefix
> > > for all methods
> > > 
> > 
> > The point of adding the masterkeylen was that IMHO it wasn't clear that
> > luks->header.key_bytes is the master key length.
> > 
> > However defining something like 
> > qcrypto_block_luks_master_key_len(QCryptoBlockLUKS *luks) 
> > is way longer, and will force me to wrap just too many places in the code 
> > to keep 80
> > character limit.
> > 
> > Now I am thinking of other ways to make this thing better:
> > 
> > 1. How about adding luks->masterkeylen and using it. luks state already has
> > several parsed values from the header, so using another one wouldn't hurt?
> 
> With those the parsed values are actually a different format from the
> header values, so it makes sense to have duplication.  Duplication
> just for sake of having a different name will just be confusing
> with some code using one field & some code using the other field
> when they are identical.
> 
> > 2. how about renaming the luks->header.key_bytes to 
> > luks->header->master_key_len?
> 
> This is fine.

Roger that!
Best regards,
Maxim Levitsky


> 
> Regards,
> Daniel





Re: [Qemu-block] [Qemu-devel] [PATCH 02/13] qcrypto-luks: misc refactoring

2019-08-22 Thread Daniel P . Berrangé
On Thu, Aug 22, 2019 at 02:04:28PM +0300, Maxim Levitsky wrote:
> On Thu, 2019-08-22 at 11:29 +0100, Daniel P. Berrangé wrote:
> > On Thu, Aug 15, 2019 at 05:40:11PM -0400, John Snow wrote:
> > > 
> > > 
> > > On 8/14/19 4:22 PM, Maxim Levitsky wrote:
> > > > This is also a preparation for key read/write/erase functions
> > > > 
> > > 
> > > This is a matter of taste and I am not usually reviewing LUKS patches
> > > (So don't take me too seriously), but I would prefer not to have "misc"
> > > patches and instead split things out by individual changes along with a
> > > nice commit message for each change.
> > > 
> > > > * use master key len from the header
> > > 
> > > This touches enough lines that you could make it its own patch, I think.
> > > 
> > > > * prefer to use crypto params in the QCryptoBlockLUKS
> > > >   over passing them as function arguments
> > > 
> > > I think the same is true here, and highlighting which variables you are
> > > sticking into state instead of leaving as functional parameters would be
> > > nice to see without all the other changes.
> > > 
> > > > * define QCRYPTO_BLOCK_LUKS_DEFAULT_ITER_TIME
> > > 
> > > This can likely be squashed with whichever patch of yours first needs to
> > > use it, because it's so short.
> > > 
> > > > * Add comments to various crypto parameters in the QCryptoBlockLUKS
> > > > 
> > > 
> > > Can probably be squashed with item #2.
> > 
> > Agreed, with all these points  - it is too hard to review this
> > for correctness with everything merged in one commit, so I'll
> > wait for v2 before reviewing much more.
> > 
> > > > @@ -397,6 +411,12 @@ 
> > > > qcrypto_block_luks_essiv_cipher(QCryptoCipherAlgorithm cipher,
> > > >  }
> > > >  }
> > > >  
> > > > +static int masterkeylen(QCryptoBlockLUKS *luks)
> > > > +{
> > > > +return luks->header.key_bytes;
> > > > +}
> > > > +
> > > > +
> > > 
> > > generally QEMU uses snake_case_names; please spell as "master_key_len".
> > 
> > Also naming convention in this file expects "qcrypto_block_luks_" prefix
> > for all methods
> > 
> The point of adding the masterkeylen was that IMHO it wasn't clear that
> luks->header.key_bytes is the master key length.
> 
> However defining something like 
> qcrypto_block_luks_master_key_len(QCryptoBlockLUKS *luks) 
> is way longer, and will force me to wrap just too many places in the code to 
> keep 80
> character limit.
> 
> Now I am thinking of other ways to make this thing better:
> 
> 1. How about adding luks->masterkeylen and using it. luks state already has
> several parsed values from the header, so using another one wouldn't hurt?

With those the parsed values are actually a different format from the
header values, so it makes sense to have duplication.  Duplication
just for sake of having a different name will just be confusing
with some code using one field & some code using the other field
when they are identical.

> 2. how about renaming the luks->header.key_bytes to 
> luks->header->master_key_len?

This is fine.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH 06/13] qcrypto-luks: implement more rigorous header checking

2019-08-22 Thread Daniel P . Berrangé
On Wed, Aug 14, 2019 at 11:22:12PM +0300, Maxim Levitsky wrote:
> Check that keyslots don't overlap with the data,
> and check that keyslots don't overlap with each other.
> (this is done using naive O(n^2) nested loops,
> but since there are just 8 keyslots, this doens't really matter.
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  crypto/block-luks.c | 42 ++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/crypto/block-luks.c b/crypto/block-luks.c
> index 336e633df4..1997e92fe1 100644
> --- a/crypto/block-luks.c
> +++ b/crypto/block-luks.c
> @@ -551,6 +551,8 @@ static int
>  qcrypto_block_luks_check_header(QCryptoBlockLUKS *luks, Error **errp)
>  {
>  int ret;
> +int i, j;
> +
>  
>  if (memcmp(luks->header.magic, qcrypto_block_luks_magic,
> QCRYPTO_BLOCK_LUKS_MAGIC_LEN) != 0) {
> @@ -566,6 +568,46 @@ qcrypto_block_luks_check_header(QCryptoBlockLUKS *luks, 
> Error **errp)
>  goto fail;
>  }
>  
> +/* Check all keyslots for corruption  */
> +for (i = 0 ; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS ; i++) {
> +
> +QCryptoBlockLUKSKeySlot *slot1 = >header.key_slots[i];
> +uint start1 = slot1->key_offset;
> +uint len1 = splitkeylen_sectors(luks, slot1->stripes);

Using 'uint' is not normal QEMU style.

Either use 'unsigned int'  or if a specific size is needed
then one of the 'guintNN' types from glib.

This applies elsewhere in this patch series too, but
I'll only comment here & let you find the other cases.

> +
> +if (slot1->stripes == 0 ||
> +(slot1->active != QCRYPTO_BLOCK_LUKS_KEY_SLOT_DISABLED &&
> +slot1->active != QCRYPTO_BLOCK_LUKS_KEY_SLOT_ENABLED)) {
> +

Redundant blank line

> +error_setg(errp, "Keyslot %i is corrupted", i);

I'd do a separate check for stripes and active fields, and then give a
specific error message for each. That way if this does ever trigger
in practice will immediately understand which check failed.

Also using '%d' rather than '%i' is more common convention


> +ret = -EINVAL;
> +goto fail;
> +}
> +
> +if (start1 + len1 > luks->header.payload_offset) {
> +error_setg(errp,
> +   "Keyslot %i is overlapping with the encrypted 
> payload",
> +   i);
> +ret = -EINVAL;
> +goto fail;
> +}
> +
> +for (j = i + 1 ; j < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS ; j++) {
> +

Redundant blank

> +QCryptoBlockLUKSKeySlot *slot2 = >header.key_slots[j];
> +uint start2 = slot2->key_offset;
> +uint len2 = splitkeylen_sectors(luks, slot2->stripes);
> +
> +if (start1 + len1 > start2 && start2 + len2 > start1) {
> +error_setg(errp,
> +   "Keyslots %i and %i are overlapping in the 
> header",

%d

> +   i, j);
> +ret = -EINVAL;
> +goto fail;
> +}
> +}
> +
> +}
>  return 0;
>  fail:
>  return ret;
> -- 
> 2.17.2
> 

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH 02/13] qcrypto-luks: misc refactoring

2019-08-22 Thread Maxim Levitsky
On Thu, 2019-08-22 at 11:29 +0100, Daniel P. Berrangé wrote:
> On Thu, Aug 15, 2019 at 05:40:11PM -0400, John Snow wrote:
> > 
> > 
> > On 8/14/19 4:22 PM, Maxim Levitsky wrote:
> > > This is also a preparation for key read/write/erase functions
> > > 
> > 
> > This is a matter of taste and I am not usually reviewing LUKS patches
> > (So don't take me too seriously), but I would prefer not to have "misc"
> > patches and instead split things out by individual changes along with a
> > nice commit message for each change.
> > 
> > > * use master key len from the header
> > 
> > This touches enough lines that you could make it its own patch, I think.
> > 
> > > * prefer to use crypto params in the QCryptoBlockLUKS
> > >   over passing them as function arguments
> > 
> > I think the same is true here, and highlighting which variables you are
> > sticking into state instead of leaving as functional parameters would be
> > nice to see without all the other changes.
> > 
> > > * define QCRYPTO_BLOCK_LUKS_DEFAULT_ITER_TIME
> > 
> > This can likely be squashed with whichever patch of yours first needs to
> > use it, because it's so short.
> > 
> > > * Add comments to various crypto parameters in the QCryptoBlockLUKS
> > > 
> > 
> > Can probably be squashed with item #2.
> 
> Agreed, with all these points  - it is too hard to review this
> for correctness with everything merged in one commit, so I'll
> wait for v2 before reviewing much more.
> 
> > > @@ -397,6 +411,12 @@ 
> > > qcrypto_block_luks_essiv_cipher(QCryptoCipherAlgorithm cipher,
> > >  }
> > >  }
> > >  
> > > +static int masterkeylen(QCryptoBlockLUKS *luks)
> > > +{
> > > +return luks->header.key_bytes;
> > > +}
> > > +
> > > +
> > 
> > generally QEMU uses snake_case_names; please spell as "master_key_len".
> 
> Also naming convention in this file expects "qcrypto_block_luks_" prefix
> for all methods
> 
The point of adding the masterkeylen was that IMHO it wasn't clear that
luks->header.key_bytes is the master key length.

However defining something like 
qcrypto_block_luks_master_key_len(QCryptoBlockLUKS *luks) 
is way longer, and will force me to wrap just too many places in the code to 
keep 80
character limit.

Now I am thinking of other ways to make this thing better:

1. How about adding luks->masterkeylen and using it. luks state already has
several parsed values from the header, so using another one wouldn't hurt?


2. how about renaming the luks->header.key_bytes to 
luks->header->master_key_len?

Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH 03/13] qcrypto-luks: refactoring: extract load/store/check/parse header functions

2019-08-22 Thread Maxim Levitsky
On Thu, 2019-08-22 at 11:32 +0100, Daniel P. Berrangé wrote:
> On Thu, Aug 22, 2019 at 01:43:05AM +0300, Maxim Levitsky wrote:
> > On Tue, 2019-08-20 at 20:01 +0200, Max Reitz wrote:
> > > On 14.08.19 22:22, Maxim Levitsky wrote:
> > > > With upcoming key management, the header will
> > > > need to be stored after the image is created.
> > > > 
> > > > Extracting load header isn't strictly needed, but
> > > > do this anyway for the symmetry.
> > > > 
> > > > Also I extracted a function that does basic sanity
> > > > checks on the just read header, and a function
> > > > which parses all the crypto format to make the
> > > > code a bit more readable, plus now the code
> > > > doesn't destruct the in-header cipher-mode string,
> > > > so that the header now can be stored many times,
> > > > which is needed for the key management.
> > > > 
> > > > Also this allows to contain the endianess conversions
> > > > in these functions alone
> > > > 
> > > > The header is no longer endian swapped in place,
> > > > to prevent (mostly theoretical races I think)
> > > > races where someone could see the header in the
> > > > process of beeing byteswapped.
> > > 
> > > The formatting looks weird, it doesn’t look quite 72 characters wide...
> > >  (what commit messages normally use)
> > 
> > Could you elaborate on that? I thought that code should not
> > exceed 80 character limit.
> 
> Code itself should be wrapped at 80 chars, but it is common for
> commit messages to use 72 chars wrapping.
> 
> 
> Regards,
> Daniel
All right, will do this.
Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH 05/13] qcrypto-luks: clear the masterkey and password before freeing them always

2019-08-22 Thread Maxim Levitsky
On Thu, 2019-08-22 at 11:49 +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 20, 2019 at 08:12:51PM +0200, Max Reitz wrote:
> > On 14.08.19 22:22, Maxim Levitsky wrote:
> > > While there are other places where these are still stored in memory,
> > > this is still one less key material area that can be sniffed with
> > > various side channel attacks
> > > 
> > > 
> > > 
> > 
> > (Many empty lines here)
> > 
> > > Signed-off-by: Maxim Levitsky 
> > > ---
> > >  crypto/block-luks.c | 52 ++---
> > >  1 file changed, 44 insertions(+), 8 deletions(-)
> > 
> > Wouldn’t it make sense to introduce a dedicated function for this?
> 
> Yes, it would.
> 
> In fact I have a series pending which bumps min glib and introduces
> use of auto-free functions in this code.
> 
> It would be desirable to have a autp-free func for memset+free
> so we can just declare the variable
> 
>q_autowipefree char *password = NULL;
> 
> and have it result in memset+free
> 

That is perfect.
When do you think you could post the series so that I could rebase
on top of it?

Best regards,
Maxim Levitsky




Re: [Qemu-block] [Qemu-devel] [PATCH 05/13] qcrypto-luks: clear the masterkey and password before freeing them always

2019-08-22 Thread Daniel P . Berrangé
On Tue, Aug 20, 2019 at 08:12:51PM +0200, Max Reitz wrote:
> On 14.08.19 22:22, Maxim Levitsky wrote:
> > While there are other places where these are still stored in memory,
> > this is still one less key material area that can be sniffed with
> > various side channel attacks
> > 
> > 
> > 
> 
> (Many empty lines here)
> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  crypto/block-luks.c | 52 ++---
> >  1 file changed, 44 insertions(+), 8 deletions(-)
> 
> Wouldn’t it make sense to introduce a dedicated function for this?

Yes, it would.

In fact I have a series pending which bumps min glib and introduces
use of auto-free functions in this code.

It would be desirable to have a autp-free func for memset+free
so we can just declare the variable

   q_autowipefree char *password = NULL;

and have it result in memset+free

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH 04/13] qcrypto-luks: refactoring: simplify the math used for keyslot locations

2019-08-22 Thread Daniel P . Berrangé
On Wed, Aug 14, 2019 at 11:22:10PM +0300, Maxim Levitsky wrote:
> Signed-off-by: Maxim Levitsky 
> ---
>  crypto/block-luks.c | 64 +++--
>  1 file changed, 38 insertions(+), 26 deletions(-)
> 
> diff --git a/crypto/block-luks.c b/crypto/block-luks.c
> index 6bb369f3b4..e1a4df94b7 100644
> --- a/crypto/block-luks.c
> +++ b/crypto/block-luks.c
> @@ -417,6 +417,33 @@ static int masterkeylen(QCryptoBlockLUKS *luks)
>  }
>  
>  
> +/*
> + * Returns number of sectors needed to store the key material
> + * given number of anti forensic stripes
> + */
> +static int splitkeylen_sectors(QCryptoBlockLUKS *luks, int stripes)

Needs a qcrypto_block_luks_ prefix on method name.

I'd also put 'static int' on a separate line from method name
to reduce too long lines.

> +
> +{
> +/*
> + * This calculation doesn't match that shown in the spec,
> + * but instead follows the cryptsetup implementation.
> + */
> +
> +size_t header_sectors = QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
> + QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;

Following line indent should only be 4 spaces

> +
> +size_t splitkeylen = masterkeylen(luks) * stripes;
> +
> +/* First align the key material size to block size*/
> +size_t splitkeylen_sectors =
> +DIV_ROUND_UP(splitkeylen, QCRYPTO_BLOCK_LUKS_SECTOR_SIZE);

Again 4 space indent.

> +
> +/* Then also align the key material size to the size of the header */
> +return ROUND_UP(splitkeylen_sectors, header_sectors);
> +}
> +
> +
> +
>  /*
>   * Stores the main LUKS header, taking care of endianess
>   */
> @@ -1169,7 +1196,7 @@ qcrypto_block_luks_create(QCryptoBlock *block,
>  QCryptoBlockCreateOptionsLUKS luks_opts;
>  Error *local_err = NULL;
>  uint8_t *masterkey = NULL;
> -size_t splitkeylen = 0;
> +size_t next_sector;
>  size_t i;
>  char *password;
>  const char *cipher_alg;
> @@ -1388,23 +1415,16 @@ qcrypto_block_luks_create(QCryptoBlock *block,
>  goto error;
>  }
>  
> +/* start with the sector that follows the header*/
> +next_sector = QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
> +  QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;

I'd suggest 'post_header_sector'

>  
> -/* Although LUKS has multiple key slots, we're just going
> - * to use the first key slot */
> -splitkeylen = luks->header.key_bytes * QCRYPTO_BLOCK_LUKS_STRIPES;
>  for (i = 0; i < QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS; i++) {
> -luks->header.key_slots[i].active = 
> QCRYPTO_BLOCK_LUKS_KEY_SLOT_DISABLED;
> -luks->header.key_slots[i].stripes = QCRYPTO_BLOCK_LUKS_STRIPES;
> -
> -/* This calculation doesn't match that shown in the spec,
> - * but instead follows the cryptsetup implementation.
> - */
> -luks->header.key_slots[i].key_offset =
> -(QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
> - QCRYPTO_BLOCK_LUKS_SECTOR_SIZE) +
> -(ROUND_UP(DIV_ROUND_UP(splitkeylen, 
> QCRYPTO_BLOCK_LUKS_SECTOR_SIZE),
> -  (QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
> -   QCRYPTO_BLOCK_LUKS_SECTOR_SIZE)) * i);
> +QCryptoBlockLUKSKeySlot *slot = >header.key_slots[i];
> +slot->active = QCRYPTO_BLOCK_LUKS_KEY_SLOT_DISABLED;
> +slot->key_offset = next_sector;
> +slot->stripes = QCRYPTO_BLOCK_LUKS_STRIPES;
> +next_sector += splitkeylen_sectors(luks, QCRYPTO_BLOCK_LUKS_STRIPES);

I'm not a fan of the next_sector accumulator .

I'd prefer to see the '* i' part done in splitkeylen_sectors, so that
we have

  slot->key_offset = post_header_sector +
splitkeylen_sectors(luks, QCRYPTO_BLOCK_LUKS_STRIPES, i);

> @@ -1412,17 +1432,9 @@ qcrypto_block_luks_create(QCryptoBlock *block,
>   * slot headers, rounded up to the nearest sector, combined with
>   * the size of each master key material region, also rounded up
>   * to the nearest sector */
> -luks->header.payload_offset =
> -(QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
> - QCRYPTO_BLOCK_LUKS_SECTOR_SIZE) +
> -(ROUND_UP(DIV_ROUND_UP(splitkeylen, QCRYPTO_BLOCK_LUKS_SECTOR_SIZE),
> -  (QCRYPTO_BLOCK_LUKS_KEY_SLOT_OFFSET /
> -   QCRYPTO_BLOCK_LUKS_SECTOR_SIZE)) *
> - QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS);
> -
> +luks->header.payload_offset = next_sector;

  luks->header.payload_offset = post_header_sector +
splitkeylen_sectors(luks, QCRYPTO_BLOCK_LUKS_STRIPES,
QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS);


>  block->sector_size = QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;
> -block->payload_offset = luks->header.payload_offset *
> -block->sector_size;
> +block->payload_offset = luks->header.payload_offset * block->sector_size;


This is reverting a whitspace change done in previous method

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: 

Re: [Qemu-block] [PATCH 03/13] qcrypto-luks: refactoring: extract load/store/check/parse header functions

2019-08-22 Thread Daniel P . Berrangé
On Wed, Aug 14, 2019 at 11:22:09PM +0300, Maxim Levitsky wrote:
> With upcoming key management, the header will
> need to be stored after the image is created.
> 
> Extracting load header isn't strictly needed, but
> do this anyway for the symmetry.
> 
> Also I extracted a function that does basic sanity
> checks on the just read header, and a function
> which parses all the crypto format to make the
> code a bit more readable, plus now the code
> doesn't destruct the in-header cipher-mode string,
> so that the header now can be stored many times,
> which is needed for the key management.
> 
> Also this allows to contain the endianess conversions
> in these functions alone
> 
> The header is no longer endian swapped in place,
> to prevent (mostly theoretical races I think)
> races where someone could see the header in the
> process of beeing byteswapped.
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  crypto/block-luks.c | 756 ++--
>  1 file changed, 440 insertions(+), 316 deletions(-)

>  if (!(flags & QCRYPTO_BLOCK_OPEN_NO_IO)) {
>  /* Try to find which key slot our password is valid for
>   * and unlock the master key from that slot.
>   */
> -
>  masterkey = g_new0(uint8_t, masterkeylen(luks));
>  
>  if (qcrypto_block_luks_find_key(block,
> @@ -845,12 +1132,10 @@ qcrypto_block_luks_open(QCryptoBlock *block,
>  }
>  
>  block->sector_size = QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;
> -block->payload_offset = luks->header.payload_offset *
> -block->sector_size;
> +block->payload_offset = luks->header.payload_offset * block->sector_size;
>  
>  g_free(masterkey);
>  g_free(password);
> -
>  return 0;

Smoe unrelated whitespace changes here.


> +/* populate the slot 0 with the password encrypted master key*/
> +/* This will also store the header */
> +if (qcrypto_block_luks_store_key(block,
> + 0,
> + password,
> + masterkey,
> + luks_opts.iter_time,
> + writefunc,
> + opaque,
> + errp)) {
>  goto error;
> -}
> + }

Indent is off by 1


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [Qemu-devel] [PATCH 03/13] qcrypto-luks: refactoring: extract load/store/check/parse header functions

2019-08-22 Thread Daniel P . Berrangé
On Thu, Aug 22, 2019 at 01:43:05AM +0300, Maxim Levitsky wrote:
> On Tue, 2019-08-20 at 20:01 +0200, Max Reitz wrote:
> > On 14.08.19 22:22, Maxim Levitsky wrote:
> > > With upcoming key management, the header will
> > > need to be stored after the image is created.
> > > 
> > > Extracting load header isn't strictly needed, but
> > > do this anyway for the symmetry.
> > > 
> > > Also I extracted a function that does basic sanity
> > > checks on the just read header, and a function
> > > which parses all the crypto format to make the
> > > code a bit more readable, plus now the code
> > > doesn't destruct the in-header cipher-mode string,
> > > so that the header now can be stored many times,
> > > which is needed for the key management.
> > > 
> > > Also this allows to contain the endianess conversions
> > > in these functions alone
> > > 
> > > The header is no longer endian swapped in place,
> > > to prevent (mostly theoretical races I think)
> > > races where someone could see the header in the
> > > process of beeing byteswapped.
> > 
> > The formatting looks weird, it doesn’t look quite 72 characters wide...
> >  (what commit messages normally use)
> Could you elaborate on that? I thought that code should not
> exceed 80 character limit.
> 
> > 
> > > Signed-off-by: Maxim Levitsky 
> > > ---
> > >  crypto/block-luks.c | 756 ++--
> > >  1 file changed, 440 insertions(+), 316 deletions(-)
> > 
> > Also, this commit is just too big.
> 
> Yea, but it has no functional changes.
> I can split it further, but that won't help much IMHO.

I'd find it easier to review if each newly introduced method was a
separate patch. It makes it easier to see which bit of removed
code was added to which method.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [Qemu-devel] [PATCH 03/13] qcrypto-luks: refactoring: extract load/store/check/parse header functions

2019-08-22 Thread Daniel P . Berrangé
On Thu, Aug 22, 2019 at 01:43:05AM +0300, Maxim Levitsky wrote:
> On Tue, 2019-08-20 at 20:01 +0200, Max Reitz wrote:
> > On 14.08.19 22:22, Maxim Levitsky wrote:
> > > With upcoming key management, the header will
> > > need to be stored after the image is created.
> > > 
> > > Extracting load header isn't strictly needed, but
> > > do this anyway for the symmetry.
> > > 
> > > Also I extracted a function that does basic sanity
> > > checks on the just read header, and a function
> > > which parses all the crypto format to make the
> > > code a bit more readable, plus now the code
> > > doesn't destruct the in-header cipher-mode string,
> > > so that the header now can be stored many times,
> > > which is needed for the key management.
> > > 
> > > Also this allows to contain the endianess conversions
> > > in these functions alone
> > > 
> > > The header is no longer endian swapped in place,
> > > to prevent (mostly theoretical races I think)
> > > races where someone could see the header in the
> > > process of beeing byteswapped.
> > 
> > The formatting looks weird, it doesn’t look quite 72 characters wide...
> >  (what commit messages normally use)
> Could you elaborate on that? I thought that code should not
> exceed 80 character limit.

Code itself should be wrapped at 80 chars, but it is common for
commit messages to use 72 chars wrapping.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH 02/13] qcrypto-luks: misc refactoring

2019-08-22 Thread Daniel P . Berrangé
On Thu, Aug 15, 2019 at 05:40:11PM -0400, John Snow wrote:
> 
> 
> On 8/14/19 4:22 PM, Maxim Levitsky wrote:
> > This is also a preparation for key read/write/erase functions
> > 
> 
> This is a matter of taste and I am not usually reviewing LUKS patches
> (So don't take me too seriously), but I would prefer not to have "misc"
> patches and instead split things out by individual changes along with a
> nice commit message for each change.
> 
> > * use master key len from the header
> 
> This touches enough lines that you could make it its own patch, I think.
> 
> > * prefer to use crypto params in the QCryptoBlockLUKS
> >   over passing them as function arguments
> 
> I think the same is true here, and highlighting which variables you are
> sticking into state instead of leaving as functional parameters would be
> nice to see without all the other changes.
> 
> > * define QCRYPTO_BLOCK_LUKS_DEFAULT_ITER_TIME
> 
> This can likely be squashed with whichever patch of yours first needs to
> use it, because it's so short.
> 
> > * Add comments to various crypto parameters in the QCryptoBlockLUKS
> > 
> 
> Can probably be squashed with item #2.

Agreed, with all these points  - it is too hard to review this
for correctness with everything merged in one commit, so I'll
wait for v2 before reviewing much more.

> > @@ -397,6 +411,12 @@ qcrypto_block_luks_essiv_cipher(QCryptoCipherAlgorithm 
> > cipher,
> >  }
> >  }
> >  
> > +static int masterkeylen(QCryptoBlockLUKS *luks)
> > +{
> > +return luks->header.key_bytes;
> > +}
> > +
> > +
> 
> generally QEMU uses snake_case_names; please spell as "master_key_len".

Also naming convention in this file expects "qcrypto_block_luks_" prefix
for all methods


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads

2019-08-22 Thread Stefan Hajnoczi
On Wed, Aug 21, 2019 at 06:20:39PM -0400, John Snow wrote:
> 
> 
> On 7/15/19 4:19 PM, Stefan Hajnoczi wrote:
> > Short reads are possible with cache=writeback (see Patch 3 for details).
> > Handle this by resubmitting requests until the read is completed.
> > 
> > Patch 1 adds trace events useful for debugging io_uring.
> > 
> > Patch 2 fixes EINTR.  This lays the groundwork for resubmitting requests in
> > Patch 3.
> > 
> > Aarushi: Feel free to squash this into your patch series if you are happy 
> > with
> > the code, I don't mind if the authorship information is lost.  After 
> > applying
> > these patches I can successfully boot a Fedora 30 guest qcow2 file with
> > cache=writeback.
> > 
> > Based-on: <20190610134905.22294-1-mehta.aar...@gmail.com>
> > 
> > Stefan Hajnoczi (3):
> >   block/io_uring: add submission and completion trace events
> >   block/io_uring: fix EINTR request resubmission
> >   block/io_uring: resubmit short buffered reads
> > 
> >  block/io_uring.c   | 136 ++---
> >  block/trace-events |   6 +-
> >  2 files changed, 108 insertions(+), 34 deletions(-)
> > 
> 
> Since this is over the 30 days mark, I'm going to assume this WAS
> squashed into Aarushi's patchset, and it's safe to drop this from the
> review queue for now?

Yes, Aarushi included in the io_uring patch series.

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] [PATCH] block: gluster: Probe alignment limits

2019-08-22 Thread Niels de Vos
On Wed, Aug 21, 2019 at 07:04:17PM +0200, Max Reitz wrote:
> On 17.08.19 23:21, Nir Soffer wrote:
> > Implement alignment probing similar to file-posix, by reading from the
> > first 4k of the image.
> > 
> > Before this change, provisioning a VM on storage with sector size of
> > 4096 bytes would fail when the installer try to create filesystems. Here
> > is an example command that reproduces this issue:
> > 
> > $ qemu-system-x86_64 -accel kvm -m 2048 -smp 2 \
> > -drive 
> > file=gluster://gluster1/gv0/fedora29.raw,format=raw,cache=none \
> > -cdrom Fedora-Server-dvd-x86_64-29-1.2.iso
> > 
> > The installer fails in few seconds when trying to create filesystem on
> > /dev/mapper/fedora-root. In error report we can see that it failed with
> > EINVAL (I could not extract the error from guest).
> > 
> > Copying disk fails with EINVAL:
> > 
> > $ qemu-img convert -p -f raw -O raw -t none -T none \
> > gluster://gluster1/gv0/fedora29.raw \
> > gluster://gluster1/gv0/fedora29-clone.raw
> > qemu-img: error while writing sector 4190208: Invalid argument
> > 
> > This is a fix to same issue fixed in commit a6b257a08e3d (file-posix:
> > Handle undetectable alignment) for gluster:// images.
> > 
> > This fix has the same limit, that the first block of the image should be
> > allocated, otherwise we cannot detect the alignment and fallback to a
> > safe value (4096) even when using storage with sector size of 512 bytes.
> > 
> > Signed-off-by: Nir Soffer 
> > ---
> >  block/gluster.c | 47 +++
> >  1 file changed, 47 insertions(+)
> > 
> > diff --git a/block/gluster.c b/block/gluster.c
> > index f64dc5b01e..d936240b72 100644
> > --- a/block/gluster.c
> > +++ b/block/gluster.c
> > @@ -52,6 +52,9 @@
> >  
> >  #define GERR_INDEX_HINT "hint: check in 'server' array index '%d'\n"
> >  
> > +/* The value is known only on the server side. */
> > +#define MAX_ALIGN 4096
> > +
> >  typedef struct GlusterAIOCB {
> >  int64_t size;
> >  int ret;
> > @@ -902,8 +905,52 @@ out:
> >  return ret;
> >  }
> >  
> > +/*
> > + * Check if read is allowed with given memory buffer and length.
> > + *
> > + * This function is used to check O_DIRECT request alignment.
> > + */
> > +static bool gluster_is_io_aligned(struct glfs_fd *fd, void *buf, size_t 
> > len)
> > +{
> > +ssize_t ret = glfs_pread(fd, buf, len, 0, 0, NULL);
> > +return ret >= 0 || errno != EINVAL;
> 
> Is glfs_pread() guaranteed to return EINVAL on invalid alignment?
> file-posix says this is only the case on Linux (for normal files).  Now
> I also don’t know whether the gluster driver works on anything but Linux
> anyway.

The behaviour depends on the filesystem used by the Gluster backend. XFS
is the recommendation, but in the end it is up to the users. The Gluster
server is known to work on Linux, NetBSD and FreeBSD, the vast majority
of users runs it on Linux.

I do not think there is a strong guarantee EINVAL is always correct. How
about only checking 'ret > 0'?

> 
> > +}
> > +
> > +static void gluster_probe_alignment(BlockDriverState *bs, struct glfs_fd 
> > *fd,
> > +Error **errp)
> > +{
> > +char *buf;
> > +size_t alignments[] = {1, 512, 1024, 2048, 4096};
> > +size_t align;
> > +int i;
> > +
> > +buf = qemu_memalign(MAX_ALIGN, MAX_ALIGN);
> > +
> > +for (i = 0; i < ARRAY_SIZE(alignments); i++) {
> > +align = alignments[i];
> > +if (gluster_is_io_aligned(fd, buf, align)) {
> > +/* Fallback to safe value. */
> > +bs->bl.request_alignment = (align != 1) ? align : MAX_ALIGN;
> > +break;
> > +}
> > +}
> 
> I don’t like the fact that the last element of alignments[] should be
> the same as MAX_ALIGN, without that ever having been made explicit anywhere.
> 
> It’s a bit worse in the file-posix patch, because if getpagesize() is
> greater than 4k, max_align will be greater than 4k.  But MAX_BLOCKSIZE
> is 4k, too, so I suppose we wouldn’t support any block size beyond that
> anyway, which makes the error message appropriate still.
> 
> > +
> > +qemu_vfree(buf);
> > +
> > +if (!bs->bl.request_alignment) {
> > +error_setg(errp, "Could not find working O_DIRECT alignment");
> > +error_append_hint(errp, "Try cache.direct=off\n");
> > +}
> > +}
> > +
> >  static void qemu_gluster_refresh_limits(BlockDriverState *bs, Error **errp)
> >  {
> > +BDRVGlusterState *s = bs->opaque;
> > +
> > +gluster_probe_alignment(bs, s->fd, errp);
> > +
> > +bs->bl.min_mem_alignment = bs->bl.request_alignment;
> 
> Well, I’ll just trust you that there is no weird system where the memory
> alignment is greater than the request alignment.
> 
> > +bs->bl.opt_mem_alignment = MAX(bs->bl.request_alignment, MAX_ALIGN);
> 
> Isn’t request_alignment guaranteed to not exceed MAX_ALIGN, i.e. isn’t
> this always MAX_ALIGN?
> 
> >