Re: [Qemu-devel] [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-25 Thread Maxim Levitsky
On Sun, 2019-08-25 at 22:51 +0300, Nir Soffer wrote:
> On Sun, Aug 25, 2019 at 10:44 AM Maxim Levitsky  wrote:
> > On Sat, 2019-08-17 at 00:21 +0300, Nir Soffer wrote:
> > > When creating an image with preallocation "off" or "falloc", the first
> > > block of the image is typically not allocated. When using Gluster
> > > storage backed by XFS filesystem, reading this block using direct I/O
> > > succeeds regardless of request length, fooling alignment detection.
> > > 
> > > In this case we fallback to a safe value (4096) instead of the optimal
> > > value (512), which may lead to unneeded data copying when aligning
> > > requests.  Allocating the first block avoids the fallback.
> > > 
> > > When using preallocation=off, we always allocate at least one filesystem
> > > block:
> > > 
> > > $ ./qemu-img create -f raw test.raw 1g
> > > Formatting 'test.raw', fmt=raw size=1073741824
> > > 
> > > $ ls -lhs test.raw
> > > 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> > 
> > Are you sure about this?
> 
> This is the new behaviour with this change...
> 
> > [mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ qemu-img create -f 
> > raw test.raw 1g -o preallocation=off
> > Formatting 'test.raw', fmt=raw size=1073741824 preallocation=off
> > [mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ls -lhs ./test.raw 
> > 0 -rw-r--r--. 1 mlevitsk mlevitsk 1.0G Aug 25 10:38 ./test.raw
> > 
> > ext4, tested on qemu-4.0.0 and qemu git master.
> 
> And this is the old behavior. I guess the commit message does not make it 
> clear.

Ah, thanks!


> > From what I remember, the only case when posix-raw touches the first block 
> > is to zero it out
> > when running on top of kernel block device, to erase whatever header might 
> > be there, and this
> > is also kind of a backward compat hack which might be one day removed.
> 
> This change is only for file, on block storage we use BLKSSZGET.
>  
> > [...]
> > 
> > Best regards,
> > Maxim Levitsky
> > 
> > 

Best regards,
Maxim Levitsky





Re: [Qemu-devel] [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-25 Thread Nir Soffer
On Sun, Aug 25, 2019 at 10:44 AM Maxim Levitsky  wrote:

> On Sat, 2019-08-17 at 00:21 +0300, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests.  Allocating the first block avoids the fallback.
> >
> > When using preallocation=off, we always allocate at least one filesystem
> > block:
> >
> > $ ./qemu-img create -f raw test.raw 1g
> > Formatting 'test.raw', fmt=raw size=1073741824
> >
> > $ ls -lhs test.raw
> > 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
>
> Are you sure about this?
>

This is the new behaviour with this change...

[mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ qemu-img create -f
> raw test.raw 1g -o preallocation=off
> Formatting 'test.raw', fmt=raw size=1073741824 preallocation=off
> [mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ls -lhs ./test.raw
> 0 -rw-r--r--. 1 mlevitsk mlevitsk 1.0G Aug 25 10:38 ./test.raw
>
> ext4, tested on qemu-4.0.0 and qemu git master.
>

And this is the old behavior. I guess the commit message does not make it
clear.

>From what I remember, the only case when posix-raw touches the first block
> is to zero it out
> when running on top of kernel block device, to erase whatever header might
> be there, and this
> is also kind of a backward compat hack which might be one day removed.
>

This change is only for file, on block storage we use BLKSSZGET.


>
> [...]
>
> Best regards,
> Maxim Levitsky
>
>
>


Re: [Qemu-devel] [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-25 Thread Maxim Levitsky
On Sat, 2019-08-17 at 00:21 +0300, Nir Soffer wrote:
> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
> 
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
> 
> When using preallocation=off, we always allocate at least one filesystem
> block:
> 
> $ ./qemu-img create -f raw test.raw 1g
> Formatting 'test.raw', fmt=raw size=1073741824
> 
> $ ls -lhs test.raw
> 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw

Are you sure about this?

[mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ qemu-img create -f raw 
test.raw 1g -o preallocation=off
Formatting 'test.raw', fmt=raw size=1073741824 preallocation=off
[mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ls -lhs ./test.raw 
0 -rw-r--r--. 1 mlevitsk mlevitsk 1.0G Aug 25 10:38 ./test.raw

ext4, tested on qemu-4.0.0 and qemu git master.


>From what I remember, the only case when posix-raw touches the first block is 
>to zero it out
when running on top of kernel block device, to erase whatever header might be 
there, and this
is also kind of a backward compat hack which might be one day removed.

[...]

Best regards,
Maxim Levitsky




Re: [Qemu-devel] [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-16 Thread John Snow



On 8/16/19 6:45 PM, Nir Soffer wrote:
> On Sat, Aug 17, 2019 at 12:57 AM John Snow  > wrote:
> 
> On 8/16/19 5:21 PM, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests.  Allocating the first block avoids the fallback.
> >
> 
> Where does this detection/fallback happen? (Can it be improved?)
> 
> 
> In raw_probe_alignment().
> 
> This patch explain the issues:
> https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00568.html
> 
> Here Kevin and me discussed ways to improve it:
> https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00426.html
> 

Thanks for the reading!
That does help explain this patch better.

> > When using preallocation=off, we always allocate at least one
> filesystem
> > block:
> >
> >     $ ./qemu-img create -f raw test.raw 1g
> >     Formatting 'test.raw', fmt=raw size=1073741824
> >
> >     $ ls -lhs test.raw
> >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >
> > I did quick performance tests for these flows:
> > - Provisioning a VM with a new raw image.
> > - Copying disks with qemu-img convert to new raw target image
> >
> > I installed Fedora 29 server on raw sparse image, measuring the time
> > from clicking "Begin installation" until the "Reboot" button appears:
> >
> > Before(s)  After(s)     Diff(%)
> > ---
> >      356        389        +8.4
> >
> > I ran this only once, so we cannot tell much from these results.
> >
> 
> That seems like a pretty big difference for just having pre-allocated a
> single block. What was the actual command line / block graph for
> that test?
> 
> 
> Having the first block allocated changes the alignment.
> 
> Before this patch, we detect request_alignment=1, so we fallback to 4096.
> Then we detect buf_align=1, so we fallback to value of request alignment.
> 
> The guest see a disk with:
> logical_block_size = 512
> physical_block_size = 512
> 
> But qemu uses:
> request_alignment = 4096
> buf_align = 4096
> 
> storage uses:
> logical_block_size = 512
> physical_block_size = 512
> 
> If the guest does direct I/O using 512 bytes aligment, qemu has to copy
> the buffer to align them to 4096 bytes.
> 
> After this patch, qemu detects the alignment correctly, so we have:
> 
> guest
> logical_block_size = 512
> physical_block_size = 512
> 
> qemu
> request_alignment = 512
> buf_align = 512
> 
> storage:
> logical_block_size = 512
> physical_block_size = 512
> 
> We expect this to be more efficient because qemu does not have to emulate
> anything.
> 
> Was this over a network that could explain the variance?
> 
> 
> Maybe, this is complete install of Fedora 29 server, I'm not sure if the
> installation 
> access the network.
> 
> > The second test was cloning the installation image with qemu-img
> > convert, doing 10 runs:
> >
> >     for i in $(seq 10); do
> >         rm -f dst.raw
> >         sleep 10
> >         time ./qemu-img convert -f raw -O raw -t none -T none
> src.raw dst.raw
> >     done
> >
> > Here is a table comparing the total time spent:
> >
> > Type    Before(s)   After(s)    Diff(%)
> > ---
> > real      530.028    469.123      -11.4
> > user       17.204     10.768      -37.4
> > sys        17.881      7.011      -60.7
> >
> > Here we see very clear improvement in CPU usage.
> >
> 
> Hard to argue much with that. I feel a little strange trying to force
> the allocation of the first block, but I suppose in practice "almost no
> preallocation" is indistinguishable from "exactly no preallocation" if
> you squint.
> 
> 
> Right.
> 
> The real issue is that filesystems and block devices do not expose the
> alignment
> requirement for direct I/O, so we need to use these hacks and assumptions.
> 
> With local XFS we use xfsctl(XFS_IOC_DIOINFO) to get request_alignment,
> but this does
> not help for XFS filesystem used by Gluster on the server side.
> 
> I hope that Niels is working on adding similar ioctl for Glsuter, os it
> can expose the properties
> of the remote filesystem.
> 
> Nir

That sounds quite a bit less hacky, but I agree we still have to do what
we can in the meantime.

(It looks like you've been hashing this out with Kevin for a while, so
I'm going to sheepishly defer to his judgment on this patch. While I
think it's probably a fine 

Re: [Qemu-devel] [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-16 Thread Nir Soffer
On Sat, Aug 17, 2019 at 12:57 AM John Snow  wrote:

> On 8/16/19 5:21 PM, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests.  Allocating the first block avoids the fallback.
> >
>
> Where does this detection/fallback happen? (Can it be improved?)
>

In raw_probe_alignment().

This patch explain the issues:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00568.html

Here Kevin and me discussed ways to improve it:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00426.html

> When using preallocation=off, we always allocate at least one filesystem
> > block:
> >
> > $ ./qemu-img create -f raw test.raw 1g
> > Formatting 'test.raw', fmt=raw size=1073741824
> >
> > $ ls -lhs test.raw
> > 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >
> > I did quick performance tests for these flows:
> > - Provisioning a VM with a new raw image.
> > - Copying disks with qemu-img convert to new raw target image
> >
> > I installed Fedora 29 server on raw sparse image, measuring the time
> > from clicking "Begin installation" until the "Reboot" button appears:
> >
> > Before(s)  After(s) Diff(%)
> > ---
> >  356389+8.4
> >
> > I ran this only once, so we cannot tell much from these results.
> >
>
> That seems like a pretty big difference for just having pre-allocated a
> single block. What was the actual command line / block graph for that test?
>

Having the first block allocated changes the alignment.

Before this patch, we detect request_alignment=1, so we fallback to 4096.
Then we detect buf_align=1, so we fallback to value of request alignment.

The guest see a disk with:
logical_block_size = 512
physical_block_size = 512

But qemu uses:
request_alignment = 4096
buf_align = 4096

storage uses:
logical_block_size = 512
physical_block_size = 512

If the guest does direct I/O using 512 bytes aligment, qemu has to copy
the buffer to align them to 4096 bytes.

After this patch, qemu detects the alignment correctly, so we have:

guest
logical_block_size = 512
physical_block_size = 512

qemu
request_alignment = 512
buf_align = 512

storage:
logical_block_size = 512
physical_block_size = 512

We expect this to be more efficient because qemu does not have to emulate
anything.

Was this over a network that could explain the variance?
>

Maybe, this is complete install of Fedora 29 server, I'm not sure if the
installation
access the network.

> The second test was cloning the installation image with qemu-img
> > convert, doing 10 runs:
> >
> > for i in $(seq 10); do
> > rm -f dst.raw
> > sleep 10
> > time ./qemu-img convert -f raw -O raw -t none -T none src.raw
> dst.raw
> > done
> >
> > Here is a table comparing the total time spent:
> >
> > TypeBefore(s)   After(s)Diff(%)
> > ---
> > real  530.028469.123  -11.4
> > user   17.204 10.768  -37.4
> > sys17.881  7.011  -60.7
> >
> > Here we see very clear improvement in CPU usage.
> >
>
> Hard to argue much with that. I feel a little strange trying to force
> the allocation of the first block, but I suppose in practice "almost no
> preallocation" is indistinguishable from "exactly no preallocation" if
> you squint.
>

Right.

The real issue is that filesystems and block devices do not expose the
alignment
requirement for direct I/O, so we need to use these hacks and assumptions.

With local XFS we use xfsctl(XFS_IOC_DIOINFO) to get request_alignment, but
this does
not help for XFS filesystem used by Gluster on the server side.

I hope that Niels is working on adding similar ioctl for Glsuter, os it can
expose the properties
of the remote filesystem.

Nir


Re: [Qemu-devel] [Qemu-block] [PATCH] block: posix: Always allocate the first block

2019-08-16 Thread John Snow



On 8/16/19 5:21 PM, Nir Soffer wrote:
> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
> 
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
> 

Where does this detection/fallback happen? (Can it be improved?)

> When using preallocation=off, we always allocate at least one filesystem
> block:
> 
> $ ./qemu-img create -f raw test.raw 1g
> Formatting 'test.raw', fmt=raw size=1073741824
> 
> $ ls -lhs test.raw
> 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> 
> I did quick performance tests for these flows:
> - Provisioning a VM with a new raw image.
> - Copying disks with qemu-img convert to new raw target image
> 
> I installed Fedora 29 server on raw sparse image, measuring the time
> from clicking "Begin installation" until the "Reboot" button appears:
> 
> Before(s)  After(s) Diff(%)
> ---
>  356389+8.4
> 
> I ran this only once, so we cannot tell much from these results.
> 

That seems like a pretty big difference for just having pre-allocated a
single block. What was the actual command line / block graph for that test?

Was this over a network that could explain the variance?

> The second test was cloning the installation image with qemu-img
> convert, doing 10 runs:
> 
> for i in $(seq 10); do
> rm -f dst.raw
> sleep 10
> time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
> done
> 
> Here is a table comparing the total time spent:
> 
> TypeBefore(s)   After(s)Diff(%)
> ---
> real  530.028469.123  -11.4
> user   17.204 10.768  -37.4
> sys17.881  7.011  -60.7
> 
> Here we see very clear improvement in CPU usage.
> 

Hard to argue much with that. I feel a little strange trying to force
the allocation of the first block, but I suppose in practice "almost no
preallocation" is indistinguishable from "exactly no preallocation" if
you squint.

> Signed-off-by: Nir Soffer 
> ---
>  block/file-posix.c | 25 +
>  tests/qemu-iotests/150.out |  1 +
>  tests/qemu-iotests/160 |  4 
>  tests/qemu-iotests/175 | 19 +--
>  tests/qemu-iotests/175.out |  8 
>  tests/qemu-iotests/221.out | 12 
>  tests/qemu-iotests/253.out | 12 
>  7 files changed, 63 insertions(+), 18 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index b9c33c8f6c..3964dd2021 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
>  return ret;
>  }
>  
> +/*
> + * Help alignment detection by allocating the first block.
> + *
> + * When reading with direct I/O from unallocated area on Gluster backed by 
> XFS,
> + * reading succeeds regardless of request length. In this case we fallback to
> + * safe aligment which is not optimal. Allocating the first block avoids this
> + * fallback.
> + *
> + * Returns: 0 on success, -errno on failure.
> + */
> +static int allocate_first_block(int fd)
> +{
> +ssize_t n;
> +
> +do {
> +n = pwrite(fd, "\0", 1, 0);
> +} while (n == -1 && errno == EINTR);
> +
> +return (n == -1) ? -errno : 0;
> +}
> +
>  static int handle_aiocb_truncate(void *opaque)
>  {
>  RawPosixAIOData *aiocb = opaque;
> @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
>  /* posix_fallocate() doesn't set errno. */
>  error_setg_errno(errp, -result,
>   "Could not preallocate new data");
> +} else if (current_length == 0) {
> +allocate_first_block(fd);
>  }
>  } else {
>  result = 0;
> @@ -1855,6 +1878,8 @@ static int handle_aiocb_truncate(void *opaque)
>  if (ftruncate(fd, offset) != 0) {
>  result = -errno;
>  error_setg_errno(errp, -result, "Could not resize file");
> +} else if (current_length == 0 && offset > current_length) {
> +allocate_first_block(fd);
>  }
>  return result;
>  default:
> diff --git a/tests/qemu-iotests/150.out b/tests/qemu-iotests/150.out
> index 2a54e8dcfa..3cdc7727a5 100644
> --- a/tests/qemu-iotests/150.out
> +++ b/tests/qemu-iotests/150.out
> @@ -3,6 +3,7 @@ QA output created by 150
>  === Mapping sparse conversion ===
>  
>  Offset  Length  File
> +0   0x1000  TEST_DIR/t.IMGFMT
>  
>  === Mapping non-sparse conversion ===
>  
> diff --git