Re: [Qemu-devel] [PATCH v2 1/2] block: posix: Always allocate the first block

2019-08-26 Thread Max Reitz
On 26.08.19 17:41, Nir Soffer wrote:
> On Mon, Aug 26, 2019 at 3:31 PM Max Reitz  wrote:
>>
>> On 26.08.19 00:03, Nir Soffer wrote:
> ...
>>> +/*
>>> + * Help alignment probing by allocating the first block.
>>> + *
>>> + * When reading with direct I/O from unallocated area on Gluster backed by 
>>> XFS,
>>> + * reading succeeds regardless of request length. In this case we fallback 
>>> to
>>> + * safe alignment which is not optimal. Allocating the first block avoids 
>>> this
>>> + * fallback.
>>> + *
>>> + * fd may be opened with O_DIRECT, but we don't know the buffer alignment 
>>> or
>>> + * request alignment, so we use safe values.
>>> + *
>>> + * Returns: 0 on success, -errno on failure. Since this is an optimization,
>>> + * caller may ignore failures.
>>> + */
>>> +static int allocate_first_block(int fd, size_t max_size)
>>> +{
>>> +size_t write_size = MIN(MAX_BLOCKSIZE, max_size);
>>
>> Hm, well, there was a reason why I proposed rounding this down to the
>> next power of two.  If max_size is not a power of two but below
>> MAX_BLOCKSIZE, write_size will not be a power of two, and thus the write
>> below may fail even if write_size exceeds the physical block size.
>>
>> You can see that in the test case you add by using e.g. 768 as the
>> destination size (provided your test filesystem has a block size of 512).
>>
>> Now I would like to say that it’s stupid to resize an O_DIRECT file to a
>> size that is not a multiple of the block size; but I’ve had a bug
>> assigned to me before because that didn’t work.
>>
>> But maybe it’s actually better if it doesn’t work.  I don’t know.
> 
> I tried to avoid complexity that is unlikely to help anyone, but we
> can make the (typical)
> case of 512 bytes sector size work with this:
> 
> size_t write_size = (max_size < MAX_BLOCKSIZE)
> ? BDRV_SECTOR_SIZE
> : MAX_BLOCKSIZE;
> 
> Unfortunately testing max_size < 4096 will not be reliable since we don't know
> that underlying storage sector size.

Hm, well, why not, actually.  That’s simple enough and it should work in
all common configurations.

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH v2 1/2] block: posix: Always allocate the first block

2019-08-26 Thread Nir Soffer
On Mon, Aug 26, 2019 at 3:31 PM Max Reitz  wrote:
>
> On 26.08.19 00:03, Nir Soffer wrote:
...
> > +/*
> > + * Help alignment probing by allocating the first block.
> > + *
> > + * When reading with direct I/O from unallocated area on Gluster backed by 
> > XFS,
> > + * reading succeeds regardless of request length. In this case we fallback 
> > to
> > + * safe alignment which is not optimal. Allocating the first block avoids 
> > this
> > + * fallback.
> > + *
> > + * fd may be opened with O_DIRECT, but we don't know the buffer alignment 
> > or
> > + * request alignment, so we use safe values.
> > + *
> > + * Returns: 0 on success, -errno on failure. Since this is an optimization,
> > + * caller may ignore failures.
> > + */
> > +static int allocate_first_block(int fd, size_t max_size)
> > +{
> > +size_t write_size = MIN(MAX_BLOCKSIZE, max_size);
>
> Hm, well, there was a reason why I proposed rounding this down to the
> next power of two.  If max_size is not a power of two but below
> MAX_BLOCKSIZE, write_size will not be a power of two, and thus the write
> below may fail even if write_size exceeds the physical block size.
>
> You can see that in the test case you add by using e.g. 768 as the
> destination size (provided your test filesystem has a block size of 512).
>
> Now I would like to say that it’s stupid to resize an O_DIRECT file to a
> size that is not a multiple of the block size; but I’ve had a bug
> assigned to me before because that didn’t work.
>
> But maybe it’s actually better if it doesn’t work.  I don’t know.

I tried to avoid complexity that is unlikely to help anyone, but we
can make the (typical)
case of 512 bytes sector size work with this:

size_t write_size = (max_size < MAX_BLOCKSIZE)
? BDRV_SECTOR_SIZE
: MAX_BLOCKSIZE;

Unfortunately testing max_size < 4096 will not be reliable since we don't know
that underlying storage sector size.

...



Re: [Qemu-devel] [PATCH v2 1/2] block: posix: Always allocate the first block

2019-08-26 Thread Nir Soffer
On Mon, Aug 26, 2019 at 4:49 PM Eric Blake  wrote:
>
> On 8/26/19 7:31 AM, Max Reitz wrote:
>
> >>  # the file size.  This function hides the resulting difference in the
> >>  # stat -c '%b' output.
> >>  # Parameter 1: Number of blocks an empty file occupies
> >> -# Parameter 2: Image size in bytes
> >> +# Parameter 2: Minimal number of blocks in an image
> >> +# Parameter 3: Image size in bytes
> >>  _filter_blocks()
> >>  {
> >>  extra_blocks=$1
> >> -img_size=$2
> >> +min_blocks=$2
> >> +img_size=$3
> >>
> >> -sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
> >> --e "s/blocks=$((extra_blocks + img_size / 
> >> 512))\\(\$\\|[^0-9]\\)/everything allocated/"
> >> +sed -e "s/blocks=$((min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
> >
> > Superfluous parentheses ($(())), but not wrong.
>
> Note that $((..)) has a purpose: it can convert any variable content
> into decimal.  I can write min_blocks=0x1000, and $((min_blocks))
> results in 4096 while $min_blocks is still 0x1000.  But I'd need more
> context as to what the callers expect to pass as to whether the $((...))
> is superfluous here.

In this case min_blocks is computed and always use base 10, so we don't
need the $(()).

Nir



Re: [Qemu-devel] [PATCH v2 1/2] block: posix: Always allocate the first block

2019-08-26 Thread Nir Soffer
On Mon, Aug 26, 2019 at 4:46 PM Eric Blake  wrote:
>
> On 8/25/19 5:03 PM, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests.  Allocating the first block avoids the fallback.
> >
>
> > Here is a table comparing the total time spent:
> >
> > TypeBefore(s)   After(s)Diff(%)
> > ---
> > real  530.028469.123  -11.4
> > user   17.204 10.768  -37.4
> > sys17.881  7.011  -60.7
> >
> > We can see very clear improvement in CPU usage.
>
> Nice justification.
>
>
> > +/*
> > + * Help alignment probing by allocating the first block.
> > + *
>
> > +do {
> > +n = pwrite(fd, buf, write_size, 0);
> > +} while (n == -1 && errno == EINTR);
> > +
> > +qemu_vfree(buf);
>
> qemu_vfree() can corrupt errno...
>
> > +
> > +return (n == -1) ? -errno : 0;
>
> ...which means you may be returning an unexpected value here.
>
> Either we should patch qemu_vfree() to guarantee that errno is
> preserved, or you locally capture errno before calling it here.

qemu_vfree() returns void like free(), so changing errno is unexpected, but
other code using it should not be effected, so preserving errno here seems
like a better change.

Thanks!

Nir



Re: [Qemu-devel] [PATCH v2 1/2] block: posix: Always allocate the first block

2019-08-26 Thread Eric Blake
On 8/26/19 7:31 AM, Max Reitz wrote:

>>  # the file size.  This function hides the resulting difference in the
>>  # stat -c '%b' output.
>>  # Parameter 1: Number of blocks an empty file occupies
>> -# Parameter 2: Image size in bytes
>> +# Parameter 2: Minimal number of blocks in an image
>> +# Parameter 3: Image size in bytes
>>  _filter_blocks()
>>  {
>>  extra_blocks=$1
>> -img_size=$2
>> +min_blocks=$2
>> +img_size=$3
>>  
>> -sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
>> --e "s/blocks=$((extra_blocks + img_size / 
>> 512))\\(\$\\|[^0-9]\\)/everything allocated/"
>> +sed -e "s/blocks=$((min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
> 
> Superfluous parentheses ($(())), but not wrong.

Note that $((..)) has a purpose: it can convert any variable content
into decimal.  I can write min_blocks=0x1000, and $((min_blocks))
results in 4096 while $min_blocks is still 0x1000.  But I'd need more
context as to what the callers expect to pass as to whether the $((...))
is superfluous here.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH v2 1/2] block: posix: Always allocate the first block

2019-08-26 Thread Eric Blake
On 8/25/19 5:03 PM, Nir Soffer wrote:
> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
> 
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
> 

> Here is a table comparing the total time spent:
> 
> TypeBefore(s)   After(s)Diff(%)
> ---
> real  530.028469.123  -11.4
> user   17.204 10.768  -37.4
> sys17.881  7.011  -60.7
> 
> We can see very clear improvement in CPU usage.

Nice justification.


> +/*
> + * Help alignment probing by allocating the first block.
> + *

> +do {
> +n = pwrite(fd, buf, write_size, 0);
> +} while (n == -1 && errno == EINTR);
> +
> +qemu_vfree(buf);

qemu_vfree() can corrupt errno...

> +
> +return (n == -1) ? -errno : 0;

...which means you may be returning an unexpected value here.

Either we should patch qemu_vfree() to guarantee that errno is
preserved, or you locally capture errno before calling it here.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH v2 1/2] block: posix: Always allocate the first block

2019-08-26 Thread Max Reitz
On 26.08.19 00:03, Nir Soffer wrote:
> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
> 
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
> 
> Since we allocate the first block even with preallocation=off, we no
> longer create images with zero disk size:
> 
> $ ./qemu-img create -f raw test.raw 1g
> Formatting 'test.raw', fmt=raw size=1073741824
> 
> $ ls -lhs test.raw
> 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> 
> And converting the image requires additional cluster:
> 
> $ ./qemu-img measure -f raw -O qcow2 test.raw
> required size: 458752
> fully allocated size: 1074135040
> 
> I did quick performance test for copying disks with qemu-img convert to
> new raw target image to Gluster storage with sector size of 512 bytes:
> 
> for i in $(seq 10); do
> rm -f dst.raw
> sleep 10
> time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
> done
> 
> Here is a table comparing the total time spent:
> 
> TypeBefore(s)   After(s)Diff(%)
> ---
> real  530.028469.123  -11.4
> user   17.204 10.768  -37.4
> sys17.881  7.011  -60.7
> 
> We can see very clear improvement in CPU usage.
> 
> Signed-off-by: Nir Soffer 
> ---
>  block/file-posix.c| 43 +++
>  tests/qemu-iotests/{150.out => 150.out.qcow2} |  0
>  tests/qemu-iotests/150.out.raw| 12 ++
>  tests/qemu-iotests/175| 19 +---
>  tests/qemu-iotests/175.out|  8 ++--
>  tests/qemu-iotests/178.out.qcow2  |  4 +-
>  tests/qemu-iotests/221.out| 12 --
>  tests/qemu-iotests/253.out| 12 --
>  8 files changed, 90 insertions(+), 20 deletions(-)
>  rename tests/qemu-iotests/{150.out => 150.out.qcow2} (100%)
>  create mode 100644 tests/qemu-iotests/150.out.raw
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index fbeb0068db..51688ae3fc 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1749,6 +1749,39 @@ static int handle_aiocb_discard(void *opaque)
>  return ret;
>  }
>  
> +/*
> + * Help alignment probing by allocating the first block.
> + *
> + * When reading with direct I/O from unallocated area on Gluster backed by 
> XFS,
> + * reading succeeds regardless of request length. In this case we fallback to
> + * safe alignment which is not optimal. Allocating the first block avoids 
> this
> + * fallback.
> + *
> + * fd may be opened with O_DIRECT, but we don't know the buffer alignment or
> + * request alignment, so we use safe values.
> + *
> + * Returns: 0 on success, -errno on failure. Since this is an optimization,
> + * caller may ignore failures.
> + */
> +static int allocate_first_block(int fd, size_t max_size)
> +{
> +size_t write_size = MIN(MAX_BLOCKSIZE, max_size);

Hm, well, there was a reason why I proposed rounding this down to the
next power of two.  If max_size is not a power of two but below
MAX_BLOCKSIZE, write_size will not be a power of two, and thus the write
below may fail even if write_size exceeds the physical block size.

You can see that in the test case you add by using e.g. 768 as the
destination size (provided your test filesystem has a block size of 512).

Now I would like to say that it’s stupid to resize an O_DIRECT file to a
size that is not a multiple of the block size; but I’ve had a bug
assigned to me before because that didn’t work.

But maybe it’s actually better if it doesn’t work.  I don’t know.

> +size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
> +void *buf;
> +ssize_t n;
> +
> +buf = qemu_memalign(max_align, write_size);
> +memset(buf, 0, write_size);
> +
> +do {
> +n = pwrite(fd, buf, write_size, 0);
> +} while (n == -1 && errno == EINTR);
> +
> +qemu_vfree(buf);
> +
> +return (n == -1) ? -errno : 0;
> +}
> +
>  static int handle_aiocb_truncate(void *opaque)
>  {
>  RawPosixAIOData *aiocb = opaque;
> @@ -1788,6 +1821,13 @@ static int handle_aiocb_truncate(void *opaque)
>  /* posix_fallocate() doesn't set errno. */
>  error_setg_errno(errp, -result,
>   "Could not preallocate new data");
> +} else if (current_length == 0) {
> +/*
> + * Needed only if posix_fallocate() used fallocate(), but we
> + * don't have a way to detect that.

This sounds a bit weird because fallocate() is what we call