Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-27 Thread Brian Foster
On Wed, Aug 26, 2020 at 08:34:32PM +0200, Alberto Garcia wrote:
> On Tue 25 Aug 2020 09:47:24 PM CEST, Brian Foster  wrote:
> > My fio fallocates the entire file by default with this command. Is that
> > the intent of this particular test? I added --fallocate=none to my test
> > runs to incorporate the allocation cost in the I/Os.
> 
> That wasn't intentional, you're right, it should use --fallocate=none (I
> don't see a big difference in my test anyway).
> 
> >> The Linux version is 4.19.132-1 from Debian.
> >
> > Thanks. I don't have LUKS in the mix on my box, but I was running on a
> > more recent kernel (Fedora 5.7.15-100). I threw v4.19 on the box and
> > saw a bit more of a delta between XFS (~14k iops) and ext4 (~24k). The
> > same test shows ~17k iops for XFS and ~19k iops for ext4 on v5.7. If I
> > increase the size of the LVM volume from 126G to >1TB, ext4 runs at
> > roughly the same rate and XFS closes the gap to around ~19k iops as
> > well. I'm not sure what might have changed since v4.19, but care to
> > see if this is still an issue on a more recent kernel?
> 
> Ok, I gave 5.7.10-1 a try but I still get similar numbers.
> 

Strange.

> Perhaps with a larger filesystem there would be a difference? I don't
> know.
> 

Perhaps. I believe Dave mentioned earlier how log size might affect
things.

I created a 125GB lvm volume and see slight deltas in iops going from
testing directly on the block device, to a fully allocated file on
XFS/ext4 and then to a preallocated file on XFS/ext4. In both cases the
numbers are comparable between XFS and ext4. On XFS, I can reproduce a
serious drop in iops if I reduce the default ~64MB log down to 8MB.
Perhaps you could try increasing your log ('-lsize=...' at mkfs time)
and see if that changes anything?
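For reference, that would be something along the lines of (device path and size here are only placeholders, not your actual setup):

  mkfs.xfs -f -l size=256m /dev/vg/test
  xfs_info /mnt/test | grep '^log'   # confirm the new log size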

Beyond that, I'd probably try to normalize and simplify your storage
stack if you wanted to narrow it down further. E.g., clean format the
same bdev for XFS and ext4 and pull out things like LUKS just to rule
out any poor interactions.

Brian

> Berto
> 




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-26 Thread Alberto Garcia
On Tue 25 Aug 2020 09:47:24 PM CEST, Brian Foster  wrote:
> My fio fallocates the entire file by default with this command. Is that
> the intent of this particular test? I added --fallocate=none to my test
> runs to incorporate the allocation cost in the I/Os.

That wasn't intentional, you're right, it should use --fallocate=none (I
don't see a big difference in my test anyway).

>> The Linux version is 4.19.132-1 from Debian.
>
> Thanks. I don't have LUKS in the mix on my box, but I was running on a
> more recent kernel (Fedora 5.7.15-100). I threw v4.19 on the box and
> saw a bit more of a delta between XFS (~14k iops) and ext4 (~24k). The
> same test shows ~17k iops for XFS and ~19k iops for ext4 on v5.7. If I
> increase the size of the LVM volume from 126G to >1TB, ext4 runs at
> roughly the same rate and XFS closes the gap to around ~19k iops as
> well. I'm not sure what might have changed since v4.19, but care to
> see if this is still an issue on a more recent kernel?

Ok, I gave 5.7.10-1 a try but I still get similar numbers.

Perhaps with a larger filesystem there would be a difference? I don't
know.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-25 Thread Brian Foster
On Tue, Aug 25, 2020 at 07:18:19PM +0200, Alberto Garcia wrote:
> On Tue 25 Aug 2020 06:54:15 PM CEST, Brian Foster wrote:
> > If I compare this 5m fio test between XFS and ext4 on a couple of my
> > systems (with either no prealloc or full file prealloc), I end up seeing
> > ext4 run slightly faster on my vm and XFS slightly faster on bare metal.
> > Either way, I don't see that huge disparity where ext4 is 5-6 times
> > faster than XFS. Can you describe the test, filesystem and storage in
> > detail where you observe such a discrepancy?
> 
> Here's the test:
> 
> fio --filename=/path/to/file.raw --direct=1 --randrepeat=1 \
> --eta=always --ioengine=libaio --iodepth=32 --numjobs=1 \
> --name=test --size=25G --io_limit=25G --ramp_time=0 \
> --rw=randwrite --bs=4k --runtime=300 --time_based=1
> 

My fio fallocates the entire file by default with this command. Is that
the intent of this particular test? I added --fallocate=none to my test
runs to incorporate the allocation cost in the I/Os.
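That is, roughly the same command line with one extra flag (spelled out here only for clarity):

  fio --filename=/path/to/file.raw --direct=1 --randrepeat=1 \
  --eta=always --ioengine=libaio --iodepth=32 --numjobs=1 \
  --name=test --size=25G --io_limit=25G --ramp_time=0 \
  --rw=randwrite --bs=4k --runtime=300 --time_based=1 \
  --fallocate=none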

> The size of the XFS filesystem is 126 GB and it's almost empty, here's
> the xfs_info output:
> 
> meta-data=/dev/vg/test           isize=512    agcount=4, agsize=8248576 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=32994304, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=16110, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> The size of the ext4 filesystem is 99GB, of which 49GB are free (that
> is, without the file used in this test). The filesystem uses 4KB
> blocks, a 128M journal and these features:
> 
> Filesystem revision #:1 (dynamic)
> Filesystem features:  has_journal ext_attr resize_inode dir_index
>   filetype needs_recovery extent flex_bg
>   sparse_super large_file huge_file uninit_bg
>   dir_nlink extra_isize
> Filesystem flags: signed_directory_hash
> Default mount options:user_xattr acl
> 
> In both cases I'm using LVM on top of LUKS and the hard drive is a
> Samsung SSD 850 PRO 1TB.
> 
> The Linux version is 4.19.132-1 from Debian.
> 

Thanks. I don't have LUKS in the mix on my box, but I was running on a
more recent kernel (Fedora 5.7.15-100). I threw v4.19 on the box and saw
a bit more of a delta between XFS (~14k iops) and ext4 (~24k). The same
test shows ~17k iops for XFS and ~19k iops for ext4 on v5.7. If I
increase the size of the LVM volume from 126G to >1TB, ext4 runs at
roughly the same rate and XFS closes the gap to around ~19k iops as
well. I'm not sure what might have changed since v4.19, but care to see
if this is still an issue on a more recent kernel?

Brian

> Berto
> 




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-25 Thread Alberto Garcia
On Tue 25 Aug 2020 06:54:15 PM CEST, Brian Foster wrote:
> If I compare this 5m fio test between XFS and ext4 on a couple of my
> systems (with either no prealloc or full file prealloc), I end up seeing
> ext4 run slightly faster on my vm and XFS slightly faster on bare metal.
> Either way, I don't see that huge disparity where ext4 is 5-6 times
> faster than XFS. Can you describe the test, filesystem and storage in
> detail where you observe such a discrepancy?

Here's the test:

fio --filename=/path/to/file.raw --direct=1 --randrepeat=1 \
--eta=always --ioengine=libaio --iodepth=32 --numjobs=1 \
--name=test --size=25G --io_limit=25G --ramp_time=0 \
--rw=randwrite --bs=4k --runtime=300 --time_based=1

The size of the XFS filesystem is 126 GB and it's almost empty, here's
the xfs_info output:

meta-data=/dev/vg/test           isize=512    agcount=4, agsize=8248576 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=32994304, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16110, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The size of the ext4 filesystem is 99GB, of which 49GB are free (that
is, without the file used in this test). The filesystem uses 4KB
blocks, a 128M journal and these features:

Filesystem revision #:1 (dynamic)
Filesystem features:  has_journal ext_attr resize_inode dir_index
  filetype needs_recovery extent flex_bg
  sparse_super large_file huge_file uninit_bg
  dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options:user_xattr acl

In both cases I'm using LVM on top of LUKS and the hard drive is a
Samsung SSD 850 PRO 1TB.

The Linux version is 4.19.132-1 from Debian.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-25 Thread Brian Foster
On Tue, Aug 25, 2020 at 02:24:58PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 07:02:32 PM CEST, Brian Foster wrote:
> >> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> >> of data in order to let performance settle, but if I remove that I can
> >> see the effect more clearly. I can observe it with raw files (in 'off'
> >> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
> >> preallocation=off the performance is stable during the whole test.
> >
> > That's interesting. I ran your fio command (without --ramp_time and
> > with --runtime=5m) against a file on XFS (so no qcow2, no zero_range)
> > once with sparse file with a 64k extent size hint and again with a
> > fully preallocated 25GB file and I saw similar results in terms of the
> > delta.  This was just against an SSD backed vdisk in my local dev VM,
> > but I saw ~5800 iops for the full preallocation test and ~6200 iops
> > with the extent size hint.
> >
> > I do notice an initial iops burst as described for both tests, so I
> > switched to use a 60s ramp time and 60s runtime. With that longer ramp
> > up time, I see ~5000 iops with the 64k extent size hint and ~5500 iops
> > with the full 25GB prealloc. Perhaps the unexpected performance delta
> > with qcow2 is similarly transient towards the start of the test and
> > the runtime is short enough that it skews the final results..?
> 
> I also tried running directly against a file on xfs (no qcow2, no VMs)
> but it doesn't really matter whether I use --ramp_time=5 or 60.
> 
> Here are the results:
> 
> |---+---+---|
> | preallocation |   xfs |  ext4 |
> |---+---+---|
> | off   |  7277 | 43260 |
> | fallocate |  7299 | 42810 |
> | full  | 88404 | 83197 |
> |---+---+---|
> 
> I ran the first case (no preallocation) for 5 minutes and I said there's
> a peak during the first 5 seconds, but then the number remains under 10k
> IOPS for the rest of the 5 minutes.
> 

I don't think we're talking about the same thing. I was referring to the
difference between full file preallocation and the extent size hint in
XFS, and how the latter was faster with the shorter ramp time but that
swapped around when the test ramped up for longer. Here, it looks like
you're comparing XFS to ext4 writing direct to a file..

If I compare this 5m fio test between XFS and ext4 on a couple of my
systems (with either no prealloc or full file prealloc), I end up seeing
ext4 run slightly faster on my vm and XFS slightly faster on bare metal.
Either way, I don't see that huge disparity where ext4 is 5-6 times
faster than XFS. Can you describe the test, filesystem and storage in
detail where you observe such a discrepancy?

Brian




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-25 Thread Alberto Garcia
On Fri 21 Aug 2020 07:02:32 PM CEST, Brian Foster wrote:
>> I was running fio with --ramp_time=5 which ignores the first 5 seconds
>> of data in order to let performance settle, but if I remove that I can
>> see the effect more clearly. I can observe it with raw files (in 'off'
>> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
>> preallocation=off the performance is stable during the whole test.
>
> That's interesting. I ran your fio command (without --ramp_time and
> with --runtime=5m) against a file on XFS (so no qcow2, no zero_range)
> once with sparse file with a 64k extent size hint and again with a
> fully preallocated 25GB file and I saw similar results in terms of the
> delta.  This was just against an SSD backed vdisk in my local dev VM,
> but I saw ~5800 iops for the full preallocation test and ~6200 iops
> with the extent size hint.
>
> I do notice an initial iops burst as described for both tests, so I
> switched to use a 60s ramp time and 60s runtime. With that longer ramp
> up time, I see ~5000 iops with the 64k extent size hint and ~5500 iops
> with the full 25GB prealloc. Perhaps the unexpected performance delta
> with qcow2 is similarly transient towards the start of the test and
> the runtime is short enough that it skews the final results..?

I also tried running directly against a file on xfs (no qcow2, no VMs)
but it doesn't really matter whether I use --ramp_time=5 or 60.

Here are the results:

|---------------+-------+-------|
| preallocation |   xfs |  ext4 |
|---------------+-------+-------|
| off           |  7277 | 43260 |
| fallocate     |  7299 | 42810 |
| full          | 88404 | 83197 |
|---------------+-------+-------|

I ran the first case (no preallocation) for 5 minutes and I said there's
a peak during the first 5 seconds, but then the number remains under 10k
IOPS for the rest of the 5 minutes.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-24 Thread Alberto Garcia
On Sun 23 Aug 2020 11:59:07 PM CEST, Dave Chinner wrote:
>> >> Option 4 is described above as initial file preallocation whereas
>> >> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
>> >> is reporting that the initial file preallocation mode is slower than
>> >> the per cluster prealloc mode. Berto, am I following that right?
>> 
>> After looking more closely at the data I can see that there is a peak of
>> ~30K IOPS during the first 5 or 6 seconds and then it suddenly drops to
>> ~7K for the rest of the test.
>
> How big is the filesystem, how big is the log? (xfs_info output,
> please!)

The size of the filesystem is 126GB and here's the output of xfs_info:

meta-data=/dev/vg/test           isize=512    agcount=4, agsize=8248576 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=32994304, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16110, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

>> I was running fio with --ramp_time=5 which ignores the first 5 seconds
>> of data in order to let performance settle, but if I remove that I can
>> see the effect more clearly. I can observe it with raw files (in 'off'
>> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
>> preallocation=off the performance is stable during the whole test.
>
> What does "preallocation=off" mean again? Is that using
> fallocate(ZERO_RANGE) prior to the data write rather than
> preallocating the metadata/entire file?

Exactly, it means that. One fallocate() call before each data write
(unless the area has been allocated by a previous write).
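If it helps to look at that pattern outside of QEMU, something like this xfs_io sequence approximates the first write into a cluster (path and offsets are only illustrative):

  # zero one 64k cluster via fallocate(FALLOC_FL_ZERO_RANGE), then write 4k of data into it
  xfs_io -fd -c "fzero 0 64k" -c "pwrite 0 4k" /mnt/test/cluster-demo.img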

> If so, I would expect the limiting factor is the rate at which IO can
> be issued because of the fallocate() triggered pipeline bubbles. That
> leaves idle device time so you're not pushing the limits of the
> hardware and hence none of the behaviours above will be evident...

The thing is that with raw (i.e. non-qcow2) images the number of IOPS is
similar, but in that case there are no fallocate() calls, only the data
writes.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-23 Thread Dave Chinner
On Fri, Aug 21, 2020 at 08:59:44AM -0400, Brian Foster wrote:
> On Fri, Aug 21, 2020 at 01:42:52PM +0200, Alberto Garcia wrote:
> > On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster  
> > wrote:
> > And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
> > more IOPS.
> > 
> > I just ran the tests with aio=native and with a raw image instead of
> > qcow2, here are the results:
> > 
> > qcow2:
> > |--+-+|
> > | preallocation| aio=threads | aio=native |
> > |--+-+|
> > | off  |8139 |   7649 |
> > | off (w/o ZERO_RANGE) |2965 |   2779 |
> > | metadata |7768 |   8265 |
> > | falloc   |7742 |   7956 |
> > | full |   41389 |  56668 |
> > |--+-+|
> > 
> 
> So this seems like Dave's suggestion to use native aio produced more
> predictable results with full file prealloc being a bit faster than per
> cluster prealloc. Not sure why that isn't the case with aio=threads. I

That will be the context switch overhead with aio=threads becoming a
performance limiting factor at higher IOPS. The "full" workload
there is probably running at 80-120k context switches/s while the
aio=native one is probably under 10k ctxsw/s because it doesn't switch
threads for every IO that has to be submitted/completed.
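Easy enough to check while the benchmark runs, e.g. with something like the following (the exact commands are just an illustration):

  vmstat 1              # system-wide context switch rate ("cs" column)
  pidstat -w -C qemu 1  # per-process voluntary/involuntary context switches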

For all the other results, I'd consider the difference to be noise -
it's just not significant enough to draw any conclusions from at
all.

FWIW, the other thing that aio=native gives us is plugging across
batch IO submission. This allows bio merging before dispatch and
that can greatly increase performance of AIO when the IO being
submitted has some mergeable submissions. That's not the case for
pure random IO like this, but there are relatively few pure random
IO workloads out there... :P

> was wondering if perhaps the threading affects something indirectly like
> the qcow2 metadata allocation itself, but I guess that would be
> inconsistent with ext4 showing a notable jump from (1) to (4) (assuming
> the previous ext4 numbers were with aio=threads).

> > raw:
> > |---+-+|
> > | preallocation | aio=threads | aio=native |
> > |---+-+|
> > | off   |7647 |   7928 |
> > | falloc|7662 |   7856 |
> > | full  |   45224 |  58627 |
> > |---+-+|
> > 
> > A qcow2 file with preallocation=metadata is more or less similar to a
> > sparse raw file (and the numbers are indeed similar).
> > 
> > preallocation=off on qcow2 does not have an equivalent on raw files.
> > 
> 
> It sounds like preallocation=off for qcow2 would be roughly equivalent
> to a raw file with a 64k extent size hint (on XFS).

Yes, the effect should be close to identical; the only difference is
that qcow2 adds new clusters to the end of the file (i.e. the file
itself is not sparse), while the extent size hint will just add 64kB
extents into the file around the write offset. That demonstrates another
behavioural advantage of extent size hints: they avoid needing to
extend the file, which is yet another way to serialise concurrent IO
and create IO pipeline stalls...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-23 Thread Dave Chinner
On Fri, Aug 21, 2020 at 02:12:32PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:42:52 PM CEST, Alberto Garcia wrote:
> > On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster  
> > wrote:
> >>> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >>> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >>> > 
> >>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >>> > of the cluster with zeroes.
> >>> > 
> >>> > 3) metadata: all clusters were allocated when the image was created
> >>> > but they are sparse, QEMU only writes the 4KB of data.
> >>> > 
> >>> > 4) falloc: all clusters were allocated with fallocate() when the image
> >>> > was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > 5) full: all clusters were allocated by writing zeroes to all of them
> >>> > when the image was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > As I said in a previous message I'm not familiar with xfs, but the
> >>> > parts that I don't understand are
> >>> > 
> >>> >- Why is (4) slower than (1)?
> >>> 
> >>> Because fallocate() is a full IO serialisation barrier at the
> >>> filesystem level. If you do:
> >>> 
> >>> fallocate(whole file)
> >>> 
> >>> 
> >>> 
> >>> .
> >>> 
> >>> The IO can run concurrent and does not serialise against anything in
> >>> the filesystem except unwritten extent conversions at IO completion
> >>> (see answer to next question!)
> >>> 
> >>> However, if you just use (4) you get:
> >>> 
> >>> falloc(64k)
> >>>   
> >>>   
> >>> <4k io>
> >>>   
> >>> falloc(64k)
> >>>   
> >>>   
> >>>   <4k IO completes, converts 4k to written>
> >>>   
> >>> <4k io>
> >>> falloc(64k)
> >>>   
> >>>   
> >>>   <4k IO completes, converts 4k to written>
> >>>   
> >>> <4k io>
> >>>   
> >>> 
> >>
> >> Option 4 is described above as initial file preallocation whereas
> >> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> >> is reporting that the initial file preallocation mode is slower than
> >> the per cluster prealloc mode. Berto, am I following that right?
> 
> After looking more closely at the data I can see that there is a peak of
> ~30K IOPS during the first 5 or 6 seconds and then it suddenly drops to
> ~7K for the rest of the test.

How big is the filesystem, how big is the log? (xfs_info output,
please!)

In general, there are three typical causes of this. The first is
typical of the initial burst of allocations running on an empty
journal, then allocation transactions getting throttled back to the
speed at which metadata can be flushed once the journal fills up. If
you have a small filesystem and a default sized log, this is quite
likely to happen.

The second is that you have large logs and you are running on hardware
where device cache flushes and FUA writes hammer overall device
performance. Hence when the CIL initially fills up and starts
flushing (journal writes are pre-flush + FUA so do both) device
performance goes way down because now it has to write its cached
data to physical media rather than just cache it in volatile device
RAM. IOWs, journal writes end up forcing all volatile data to stable
media and so that can slow the device down. Also, cache flushes
might not be queued commands, hence journal writes will also create IO
pipeline stalls...

The third is the hardware capability.  Consumer hardware is designed
to have extremely fast bursty behaviour, but then steady state
performance is much lower (think "SLC" burst caches in TLC SSDs). I
have some consumer SSDs here that can sustain 400MB/s random 4kB
write for about 10-15s, then they drop to about 50MB/s once the
burst buffer is full. OTOH, I have enterprise SSDs that will sustain
a _much_ higher rate of random 4kB writes indefinitely than the
consumer SSDs burst at.  However, most consumer workloads don't move
this sort of data around, so this sort of design tradeoff is fine
for that market (Benchmarketing 101 stuff :).

IOWs, this behaviour could be filesystem config, it could be cache
flush behaviour, it could simply be storage device design
capability. Or it could be a combination of all three things.
Watching a set of fast sampling metrics that tell you what the
device and filesystem are doing in real time (e.g. I use PCP for this
and visualise the behaviour in real time via pmchart) gives a lot
of insight into exactly what is changing during transient workload
changes like starting a benchmark...
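Even a crude view helps if PCP isn't set up, e.g. something along the lines of (device name is just a placeholder):

  iostat -xm 1 /dev/sdX   # per-second device utilisation, write iops and latencies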

> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> of data in order to let performance settle, but if I remove that I can
> see the effect more clearly. I can observe it with raw files (in 'off'
> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
> preallocation=off the performance is stable during the whole test.

What does "preallocation=off" mean again? Is that using
fallocate(ZERO_RANGE) prior to the data write rather than

Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-21 Thread Brian Foster
On Fri, Aug 21, 2020 at 02:12:32PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:42:52 PM CEST, Alberto Garcia wrote:
> > On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster  
> > wrote:
> >>> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >>> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >>> > 
> >>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >>> > of the cluster with zeroes.
> >>> > 
> >>> > 3) metadata: all clusters were allocated when the image was created
> >>> > but they are sparse, QEMU only writes the 4KB of data.
> >>> > 
> >>> > 4) falloc: all clusters were allocated with fallocate() when the image
> >>> > was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > 5) full: all clusters were allocated by writing zeroes to all of them
> >>> > when the image was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > As I said in a previous message I'm not familiar with xfs, but the
> >>> > parts that I don't understand are
> >>> > 
> >>> >- Why is (4) slower than (1)?
> >>> 
> >>> Because fallocate() is a full IO serialisation barrier at the
> >>> filesystem level. If you do:
> >>> 
> >>> fallocate(whole file)
> >>> 
> >>> 
> >>> 
> >>> .
> >>> 
> >>> The IO can run concurrent and does not serialise against anything in
> >>> the filesystem except unwritten extent conversions at IO completion
> >>> (see answer to next question!)
> >>> 
> >>> However, if you just use (4) you get:
> >>> 
> >>> falloc(64k)
> >>>   
> >>>   
> >>> <4k io>
> >>>   
> >>> falloc(64k)
> >>>   
> >>>   
> >>>   <4k IO completes, converts 4k to written>
> >>>   
> >>> <4k io>
> >>> falloc(64k)
> >>>   
> >>>   
> >>>   <4k IO completes, converts 4k to written>
> >>>   
> >>> <4k io>
> >>>   
> >>> 
> >>
> >> Option 4 is described above as initial file preallocation whereas
> >> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> >> is reporting that the initial file preallocation mode is slower than
> >> the per cluster prealloc mode. Berto, am I following that right?
> 
> After looking more closely at the data I can see that there is a peak of
> ~30K IOPS during the first 5 or 6 seconds and then it suddenly drops to
> ~7K for the rest of the test.
> 
> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> of data in order to let performance settle, but if I remove that I can
> see the effect more clearly. I can observe it with raw files (in 'off'
> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
> preallocation=off the performance is stable during the whole test.
> 

That's interesting. I ran your fio command (without --ramp_time and with
--runtime=5m) against a file on XFS (so no qcow2, no zero_range) once
with sparse file with a 64k extent size hint and again with a fully
preallocated 25GB file and I saw similar results in terms of the delta.
This was just against an SSD backed vdisk in my local dev VM, but I saw
~5800 iops for the full preallocation test and ~6200 iops with the
extent size hint.
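Roughly the sort of file setup I mean, in case it's useful to reproduce (paths and the exact commands are only illustrative):

  # sparse 25G file with a 64k extent size hint
  xfs_io -fc "extsize 64k" -c "truncate 25g" /mnt/scratch/hint.raw
  # fully preallocated 25G file
  xfs_io -fc "falloc 0 25g" /mnt/scratch/prealloc.raw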

I do notice an initial iops burst as described for both tests, so I
switched to use a 60s ramp time and 60s runtime. With that longer ramp
up time, I see ~5000 iops with the 64k extent size hint and ~5500 iops
with the full 25GB prealloc. Perhaps the unexpected performance delta
with qcow2 is similarly transient towards the start of the test and the
runtime is short enough that it skews the final results..?

Brian

> Berto
> 




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-21 Thread Alberto Garcia
On Thu 20 Aug 2020 11:58:11 PM CEST, Dave Chinner wrote:
>> The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
>> the host (on an xfs or ext4 filesystem as the table above shows), and
>> it is attached to QEMU using a virtio-blk-pci device:
>> 
>>-drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M
>
> You're not using AIO on this image file, so it can't do
> concurrent IO? what happens when you add "aio=native" to this?

I sent the results in a reply to Brian.

>> cache=none means that the image is opened with O_DIRECT and
>> l2-cache-size is large enough so QEMU is able to cache all the
>> relevant qcow2 metadata in memory.
>
> What happens when you just use a sparse file (i.e. a raw image) with
> aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
> sparse files so using qcow2 to provide sparse image file support is
> largely an unnecessary layer of indirection and overhead...
>
> And with XFS, you don't need qcow2 for snapshots either because you
> can use reflink copies to take an atomic copy-on-write snapshot of the
> raw image file... (assuming you made the xfs filesystem with reflink
> support (which is the TOT default now)).

To be clear, I'm not trying to advocate for or against qcow2 on xfs; we
were just analyzing different allocation strategies for qcow2 and we
came across these results which we don't quite understand.

>> 1) off: for every write request QEMU initializes the cluster (64KB)
>> with fallocate(ZERO_RANGE) and then writes the 4KB of data.
>> 
>> 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
>> of the cluster with zeroes.
>> 
>> 3) metadata: all clusters were allocated when the image was created
>> but they are sparse, QEMU only writes the 4KB of data.
>> 
>> 4) falloc: all clusters were allocated with fallocate() when the image
>> was created, QEMU only writes 4KB of data.
>> 
>> 5) full: all clusters were allocated by writing zeroes to all of them
>> when the image was created, QEMU only writes 4KB of data.
>> 
>> As I said in a previous message I'm not familiar with xfs, but the
>> parts that I don't understand are
>> 
>>- Why is (4) slower than (1)?
>
> Because fallocate() is a full IO serialisation barrier at the
> filesystem level. If you do:
>
> fallocate(whole file)
> 
> 
> 
> .
>
> The IO can run concurrent and does not serialise against anything in
> the filesystem except unwritten extent conversions at IO completion
> (see answer to next question!)
>
> However, if you just use (4) you get:
>
> falloc(64k)
>   
>   
> <4k io>
>   
> falloc(64k)
>   
>   
>   <4k IO completes, converts 4k to written>
>   
> <4k io>

I think Brian pointed it out already, but scenario (4) is rather
falloc(25GB), then QEMU is launched and the actual 4k IO requests start
to happen.

So I would expect that after falloc(25GB) all clusters are initialized
and the end result would be closer to a full preallocation (i.e. writing
25GB worth of zeroes to disk).

> IOWs, typical "write once" benchmark testing indicates the *worst*
> performance you are going to see. As the guest filesystem ages and
> initialises more of the underlying image file, it will get faster, not
> slower.

Yes, that's clear: once everything is allocated then it is fast (and
really much faster in the case of xfs vs ext4). What we try to optimize
in qcow2 is precisely the allocation of new clusters.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-21 Thread Alberto Garcia
On Fri 21 Aug 2020 02:59:44 PM CEST, Brian Foster wrote:
>> > Option 4 is described above as initial file preallocation whereas
>> > option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
>> > is reporting that the initial file preallocation mode is slower than
>> > the per cluster prealloc mode. Berto, am I following that right?
>> 
>> Option (1) means that no qcow2 cluster is allocated at the beginning of
>> the test so, apart from updating the relevant qcow2 metadata, each write
>> request clears the cluster first (with fallocate(ZERO_RANGE)) then
>> writes the requested 4KB of data. Further writes to the same cluster
>> don't need changes on the qcow2 metadata so they go directly to the area
>> that was cleared with fallocate().
>> 
>> Option (4) means that all clusters are allocated when the image is
>> created and they are initialized with fallocate() (actually with
>> posix_fallocate() now that I read the code, I suppose it's the same for
>> xfs?). Only after that the test starts. All write requests are simply
>> forwarded to the disk, there is no need to touch any qcow2 metadata nor
>> do anything else.
>> 
>
> Ok, I think that's consistent with what I described above (sorry, I find
> the preallocation mode names rather confusing so I was trying to avoid
> using them). Have you confirmed that posix_fallocate() in this case
> translates directly to fallocate()? I suppose that's most likely the
> case, otherwise you'd see numbers more like with preallocation=full
> (file preallocated via writing zeroes).

Yes, it seems to be:

   
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/posix_fallocate.c;h=7238b000383af2f3878a9daf8528819645b6aa31;hb=HEAD

And that's also what the posix_fallocate() manual page says.
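It's also easy to confirm at runtime; e.g. something like the following (illustrative invocation) shows the fallocate() calls as the image is created:

  strace -f -e trace=fallocate qemu-img create -f qcow2 -o preallocation=falloc image.qcow2 25G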

>> And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
>> more IOPS.
>> 
>> I just ran the tests with aio=native and with a raw image instead of
>> qcow2, here are the results:
>> 
>> qcow2:
>> |--+-+|
>> | preallocation| aio=threads | aio=native |
>> |--+-+|
>> | off  |8139 |   7649 |
>> | off (w/o ZERO_RANGE) |2965 |   2779 |
>> | metadata |7768 |   8265 |
>> | falloc   |7742 |   7956 |
>> | full |   41389 |  56668 |
>> |--+-+|
>> 
>
> So this seems like Dave's suggestion to use native aio produced more
> predictable results with full file prealloc being a bit faster than per
> cluster prealloc. Not sure why that isn't the case with aio=threads. I
> was wondering if perhaps the threading affects something indirectly like
> the qcow2 metadata allocation itself, but I guess that would be
> inconsistent with ext4 showing a notable jump from (1) to (4) (assuming
> the previous ext4 numbers were with aio=threads).

Yes, I took the ext4 numbers with aio=threads

>> raw:
>> |---+-+|
>> | preallocation | aio=threads | aio=native |
>> |---+-+|
>> | off   |7647 |   7928 |
>> | falloc|7662 |   7856 |
>> | full  |   45224 |  58627 |
>> |---+-+|
>> 
>> A qcow2 file with preallocation=metadata is more or less similar to a
>> sparse raw file (and the numbers are indeed similar).
>> 
>> preallocation=off on qcow2 does not have an equivalent on raw files.
>
> It sounds like preallocation=off for qcow2 would be roughly equivalent
> to a raw file with a 64k extent size hint (on XFS).

There's the overhead of handling the qcow2 metadata but QEMU keeps a
memory cache so it should not be too big.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-21 Thread Brian Foster
On Fri, Aug 21, 2020 at 01:42:52PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster  wrote:
> >> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >> > 
> >> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >> > of the cluster with zeroes.
> >> > 
> >> > 3) metadata: all clusters were allocated when the image was created
> >> > but they are sparse, QEMU only writes the 4KB of data.
> >> > 
> >> > 4) falloc: all clusters were allocated with fallocate() when the image
> >> > was created, QEMU only writes 4KB of data.
> >> > 
> >> > 5) full: all clusters were allocated by writing zeroes to all of them
> >> > when the image was created, QEMU only writes 4KB of data.
> >> > 
> >> > As I said in a previous message I'm not familiar with xfs, but the
> >> > parts that I don't understand are
> >> > 
> >> >- Why is (4) slower than (1)?
> >> 
> >> Because fallocate() is a full IO serialisation barrier at the
> >> filesystem level. If you do:
> >> 
> >> fallocate(whole file)
> >> 
> >> 
> >> 
> >> .
> >> 
> >> The IO can run concurrent and does not serialise against anything in
> >> the filesystem except unwritten extent conversions at IO completion
> >> (see answer to next question!)
> >> 
> >> However, if you just use (4) you get:
> >> 
> >> falloc(64k)
> >>   
> >>   
> >> <4k io>
> >>   
> >> falloc(64k)
> >>   
> >>   
> >>   <4k IO completes, converts 4k to written>
> >>   
> >> <4k io>
> >> falloc(64k)
> >>   
> >>   
> >>   <4k IO completes, converts 4k to written>
> >>   
> >> <4k io>
> >>   
> >> 
> >
> > Option 4 is described above as initial file preallocation whereas
> > option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> > is reporting that the initial file preallocation mode is slower than
> > the per cluster prealloc mode. Berto, am I following that right?
> 
> Option (1) means that no qcow2 cluster is allocated at the beginning of
> the test so, apart from updating the relevant qcow2 metadata, each write
> request clears the cluster first (with fallocate(ZERO_RANGE)) then
> writes the requested 4KB of data. Further writes to the same cluster
> don't need changes on the qcow2 metadata so they go directly to the area
> that was cleared with fallocate().
> 
> Option (4) means that all clusters are allocated when the image is
> created and they are initialized with fallocate() (actually with
> posix_fallocate() now that I read the code, I suppose it's the same for
> xfs?). Only after that the test starts. All write requests are simply
> forwarded to the disk, there is no need to touch any qcow2 metadata nor
> do anything else.
> 

Ok, I think that's consistent with what I described above (sorry, I find
the preallocation mode names rather confusing so I was trying to avoid
using them). Have you confirmed that posix_fallocate() in this case
translates directly to fallocate()? I suppose that's most likely the
case, otherwise you'd see numbers more like with preallocation=full
(file preallocated via writing zeroes).

> And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
> more IOPS.
> 
> I just ran the tests with aio=native and with a raw image instead of
> qcow2, here are the results:
> 
> qcow2:
> |--+-+|
> | preallocation| aio=threads | aio=native |
> |--+-+|
> | off  |8139 |   7649 |
> | off (w/o ZERO_RANGE) |2965 |   2779 |
> | metadata |7768 |   8265 |
> | falloc   |7742 |   7956 |
> | full |   41389 |  56668 |
> |--+-+|
> 

So this seems like Dave's suggestion to use native aio produced more
predictable results with full file prealloc being a bit faster than per
cluster prealloc. Not sure why that isn't the case with aio=threads. I
was wondering if perhaps the threading affects something indirectly like
the qcow2 metadata allocation itself, but I guess that would be
inconsistent with ext4 showing a notable jump from (1) to (4) (assuming
the previous ext4 numbers were with aio=threads).

> raw:
> |---+-+|
> | preallocation | aio=threads | aio=native |
> |---+-+|
> | off   |7647 |   7928 |
> | falloc|7662 |   7856 |
> | full  |   45224 |  58627 |
> |---+-+|
> 
> A qcow2 file with preallocation=metadata is more or less similar to a
> sparse raw file (and the numbers are indeed similar).
> 
> preallocation=off on qcow2 does not have an equivalent on raw files.
> 

It sounds like preallocation=off for qcow2 would be roughly equivalent
to a raw file with a 64k extent size hint (on XFS).

Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-21 Thread Alberto Garcia
On Fri 21 Aug 2020 01:42:52 PM CEST, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster  wrote:
>>> > 1) off: for every write request QEMU initializes the cluster (64KB)
>>> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
>>> > 
>>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
>>> > of the cluster with zeroes.
>>> > 
>>> > 3) metadata: all clusters were allocated when the image was created
>>> > but they are sparse, QEMU only writes the 4KB of data.
>>> > 
>>> > 4) falloc: all clusters were allocated with fallocate() when the image
>>> > was created, QEMU only writes 4KB of data.
>>> > 
>>> > 5) full: all clusters were allocated by writing zeroes to all of them
>>> > when the image was created, QEMU only writes 4KB of data.
>>> > 
>>> > As I said in a previous message I'm not familiar with xfs, but the
>>> > parts that I don't understand are
>>> > 
>>> >- Why is (4) slower than (1)?
>>> 
>>> Because fallocate() is a full IO serialisation barrier at the
>>> filesystem level. If you do:
>>> 
>>> fallocate(whole file)
>>> 
>>> 
>>> 
>>> .
>>> 
>>> The IO can run concurrent and does not serialise against anything in
>>> the filesystem except unwritten extent conversions at IO completion
>>> (see answer to next question!)
>>> 
>>> However, if you just use (4) you get:
>>> 
>>> falloc(64k)
>>>   
>>>   
>>> <4k io>
>>>   
>>> falloc(64k)
>>>   
>>>   
>>>   <4k IO completes, converts 4k to written>
>>>   
>>> <4k io>
>>> falloc(64k)
>>>   
>>>   
>>>   <4k IO completes, converts 4k to written>
>>>   
>>> <4k io>
>>>   
>>> 
>>
>> Option 4 is described above as initial file preallocation whereas
>> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
>> is reporting that the initial file preallocation mode is slower than
>> the per cluster prealloc mode. Berto, am I following that right?

After looking more closely at the data I can see that there is a peak of
~30K IOPS during the first 5 or 6 seconds and then it suddenly drops to
~7K for the rest of the test.

I was running fio with --ramp_time=5 which ignores the first 5 seconds
of data in order to let performance settle, but if I remove that I can
see the effect more clearly. I can observe it with raw files (in 'off'
and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
preallocation=off the performance is stable during the whole test.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-21 Thread Alberto Garcia
On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster  wrote:
>> > 1) off: for every write request QEMU initializes the cluster (64KB)
>> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
>> > 
>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
>> > of the cluster with zeroes.
>> > 
>> > 3) metadata: all clusters were allocated when the image was created
>> > but they are sparse, QEMU only writes the 4KB of data.
>> > 
>> > 4) falloc: all clusters were allocated with fallocate() when the image
>> > was created, QEMU only writes 4KB of data.
>> > 
>> > 5) full: all clusters were allocated by writing zeroes to all of them
>> > when the image was created, QEMU only writes 4KB of data.
>> > 
>> > As I said in a previous message I'm not familiar with xfs, but the
>> > parts that I don't understand are
>> > 
>> >- Why is (4) slower than (1)?
>> 
>> Because fallocate() is a full IO serialisation barrier at the
>> filesystem level. If you do:
>> 
>> fallocate(whole file)
>> 
>> 
>> 
>> .
>> 
>> The IO can run concurrent and does not serialise against anything in
>> the filesystem except unwritten extent conversions at IO completion
>> (see answer to next question!)
>> 
>> However, if you just use (4) you get:
>> 
>> falloc(64k)
>>   
>>   
>> <4k io>
>>   
>> falloc(64k)
>>   
>>   
>>   <4k IO completes, converts 4k to written>
>>   
>> <4k io>
>> falloc(64k)
>>   
>>   
>>   <4k IO completes, converts 4k to written>
>>   
>> <4k io>
>>   
>> 
>
> Option 4 is described above as initial file preallocation whereas
> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> is reporting that the initial file preallocation mode is slower than
> the per cluster prealloc mode. Berto, am I following that right?

Option (1) means that no qcow2 cluster is allocated at the beginning of
the test so, apart from updating the relevant qcow2 metadata, each write
request clears the cluster first (with fallocate(ZERO_RANGE)) then
writes the requested 4KB of data. Further writes to the same cluster
don't need changes on the qcow2 metadata so they go directly to the area
that was cleared with fallocate().

Option (4) means that all clusters are allocated when the image is
created and they are initialized with fallocate() (actually with
posix_fallocate() now that I read the code, I suppose it's the same for
xfs?). Only after that the test starts. All write requests are simply
forwarded to the disk, there is no need to touch any qcow2 metadata nor
do anything else.

And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
more IOPS.

I just ran the tests with aio=native and with a raw image instead of
qcow2, here are the results:

qcow2:
|----------------------+-------------+------------|
| preallocation        | aio=threads | aio=native |
|----------------------+-------------+------------|
| off                  |        8139 |       7649 |
| off (w/o ZERO_RANGE) |        2965 |       2779 |
| metadata             |        7768 |       8265 |
| falloc               |        7742 |       7956 |
| full                 |       41389 |      56668 |
|----------------------+-------------+------------|

raw:
|---------------+-------------+------------|
| preallocation | aio=threads | aio=native |
|---------------+-------------+------------|
| off           |        7647 |       7928 |
| falloc        |        7662 |       7856 |
| full          |       45224 |      58627 |
|---------------+-------------+------------|

A qcow2 file with preallocation=metadata is more or less similar to a
sparse raw file (and the numbers are indeed similar).

preallocation=off on qcow2 does not have an equivalent on raw files.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-21 Thread Brian Foster
On Fri, Aug 21, 2020 at 07:58:11AM +1000, Dave Chinner wrote:
> On Thu, Aug 20, 2020 at 10:03:10PM +0200, Alberto Garcia wrote:
> > Cc: linux-xfs
> > 
> > On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> > > In any event, if you're seeing unclear or unexpected performance
> > > deltas between certain XFS configurations or other fs', I think the
> > > best thing to do is post a more complete description of the workload,
> > > filesystem/storage setup, and test results to the linux-xfs mailing
> > > list (feel free to cc me as well). As it is, aside from the questions
> > > above, it's not really clear to me what the storage stack looks like
> > > for this test, if/how qcow2 is involved, what the various
> > > 'preallocation=' modes actually mean, etc.
> > 
> > (see [1] for a bit of context)
> > 
> > I repeated the tests with a larger (125GB) filesystem. Things are a bit
> > faster but not radically different, here are the new numbers:
> > 
> > |--+---+---|
> > | preallocation mode   |   xfs |  ext4 |
> > |--+---+---|
> > | off  |  8139 | 11688 |
> > | off (w/o ZERO_RANGE) |  2965 |  2780 |
> > | metadata |  7768 |  9132 |
> > | falloc   |  7742 | 13108 |
> > | full | 41389 | 16351 |
> > |--+---+---|
> > 
> > The numbers are I/O operations per second as reported by fio, running
> > inside a VM.
> > 
> > The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
> > 2.16-1. I'm using QEMU 5.1.0.
> > 
> > fio is sending random 4KB write requests to a 25GB virtual drive, this
> > is the full command line:
> > 
> > fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
> > --ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
> > --io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
> >   
> > The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
> > the host (on an xfs or ext4 filesystem as the table above shows), and
> > it is attached to QEMU using a virtio-blk-pci device:
> > 
> >-drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M
> 
> You're not using AIO on this image file, so it can't do
> concurrent IO? what happens when you add "aio=native" to this?
> 
> > cache=none means that the image is opened with O_DIRECT and
> > l2-cache-size is large enough so QEMU is able to cache all the
> > relevant qcow2 metadata in memory.
> 
> What happens when you just use a sparse file (i.e. a raw image) with
> aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
> sparse files so using qcow2 to provide sparse image file support is
> largely an unnecessary layer of indirection and overhead...
> 
> And with XFS, you don't need qcow2 for snapshots either because you
> can use reflink copies to take an atomic copy-on-write snapshot of
> the raw image file... (assuming you made the xfs filesystem with
> reflink support (which is the TOT default now)).
> 
> I've been using raw sparse files on XFS for all my VMs for over a
> decade now, and using reflink to create COW copies of golden
> image files when deploying new VMs for a couple of years now...
> 
> > The host is running Linux 4.19.132 and has an SSD drive.
> > 
> > About the preallocation modes: a qcow2 file is divided into clusters
> > of the same size (64KB in this case). That is the minimum unit of
> > allocation, so when writing 4KB to an unallocated cluster QEMU needs
> > to fill the other 60KB with zeroes. So here's what happens with the
> > different modes:
> 
> Which is something that sparse files on filesystems do not need to
> do. If, on XFS, you really want 64kB allocation clusters, use an
> extent size hint of 64kB. Though for image files, I highly recommend
> using 1MB or larger extent size hints.
> 
> 
> > 1) off: for every write request QEMU initializes the cluster (64KB)
> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> > 
> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> > of the cluster with zeroes.
> > 
> > 3) metadata: all clusters were allocated when the image was created
> > but they are sparse, QEMU only writes the 4KB of data.
> > 
> > 4) falloc: all clusters were allocated with fallocate() when the image
> > was created, QEMU only writes 4KB of data.
> > 
> > 5) full: all clusters were allocated by writing zeroes to all of them
> > when the image was created, QEMU only writes 4KB of data.
> > 
> > As I said in a previous message I'm not familiar with xfs, but the
> > parts that I don't understand are
> > 
> >- Why is (4) slower than (1)?
> 
> Because fallocate() is a full IO serialisation barrier at the
> filesystem level. If you do:
> 
> fallocate(whole file)
> 
> 
> 
> .
> 
> The IO can run concurrent and does not serialise against anything in
> the filesystem except unwritten extent conversions at IO completion

Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-20 Thread Dave Chinner
On Thu, Aug 20, 2020 at 10:03:10PM +0200, Alberto Garcia wrote:
> Cc: linux-xfs
> 
> On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> > In any event, if you're seeing unclear or unexpected performance
> > deltas between certain XFS configurations or other fs', I think the
> > best thing to do is post a more complete description of the workload,
> > filesystem/storage setup, and test results to the linux-xfs mailing
> > list (feel free to cc me as well). As it is, aside from the questions
> > above, it's not really clear to me what the storage stack looks like
> > for this test, if/how qcow2 is involved, what the various
> > 'preallocation=' modes actually mean, etc.
> 
> (see [1] for a bit of context)
> 
> I repeated the tests with a larger (125GB) filesystem. Things are a bit
> faster but not radically different, here are the new numbers:
> 
> |--+---+---|
> | preallocation mode   |   xfs |  ext4 |
> |--+---+---|
> | off  |  8139 | 11688 |
> | off (w/o ZERO_RANGE) |  2965 |  2780 |
> | metadata |  7768 |  9132 |
> | falloc   |  7742 | 13108 |
> | full | 41389 | 16351 |
> |--+---+---|
> 
> The numbers are I/O operations per second as reported by fio, running
> inside a VM.
> 
> The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
> 2.16-1. I'm using QEMU 5.1.0.
> 
> fio is sending random 4KB write requests to a 25GB virtual drive, this
> is the full command line:
> 
> fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
> --ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
> --io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
>   
> The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
> the host (on an xfs or ext4 filesystem as the table above shows), and
> it is attached to QEMU using a virtio-blk-pci device:
> 
>-drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M

You're not using AIO on this image file, so it can't do
concurrent IO? What happens when you add "aio=native" to this?
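i.e. keeping everything else the same, something like:

   -drive if=virtio,file=image.qcow2,cache=none,aio=native,l2-cache-size=200M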

> cache=none means that the image is opened with O_DIRECT and
> l2-cache-size is large enough so QEMU is able to cache all the
> relevant qcow2 metadata in memory.

What happens when you just use a sparse file (i.e. a raw image) with
aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
sparse files so using qcow2 to provide sparse image file support is
largely an unnecessary layer of indirection and overhead...

And with XFS, you don't need qcow2 for snapshots either because you
can use reflink copies to take an atomic copy-on-write snapshot of
the raw image file... (assuming you made the xfs filesystem with
reflink support (which is the TOT default now)).

I've been using raw sparse files on XFS for all my VMs for over a
decade now, and using reflink to create COW copies of golden
image files when deploying new VMs for a couple of years now...
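e.g. cloning a golden image is just something like (file names are placeholders):

  cp --reflink=always golden-image.raw new-vm.raw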

> The host is running Linux 4.19.132 and has an SSD drive.
> 
> About the preallocation modes: a qcow2 file is divided into clusters
> of the same size (64KB in this case). That is the minimum unit of
> allocation, so when writing 4KB to an unallocated cluster QEMU needs
> to fill the other 60KB with zeroes. So here's what happens with the
> different modes:

Which is something that sparse files on filesystems do not need to
do. If, on XFS, you really want 64kB allocation clusters, use an
extent size hint of 64kB. Though for image files, I highly recommend
using 1MB or larger extent size hints.
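Setting the hint on a new (still empty) image file is a one-liner, e.g.:

  xfs_io -fc "extsize 1m" /path/to/image.raw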


> 1) off: for every write request QEMU initializes the cluster (64KB)
> with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> 
> 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> of the cluster with zeroes.
> 
> 3) metadata: all clusters were allocated when the image was created
> but they are sparse, QEMU only writes the 4KB of data.
> 
> 4) falloc: all clusters were allocated with fallocate() when the image
> was created, QEMU only writes 4KB of data.
> 
> 5) full: all clusters were allocated by writing zeroes to all of them
> when the image was created, QEMU only writes 4KB of data.
> 
> As I said in a previous message I'm not familiar with xfs, but the
> parts that I don't understand are
> 
>- Why is (4) slower than (1)?

Because fallocate() is a full IO serialisation barrier at the
filesystem level. If you do:

fallocate(whole file)



.

The IO can run concurrent and does not serialise against anything in
the filesystem except unwritten extent conversions at IO completion
(see answer to next question!)

However, if you just use (4) you get:

falloc(64k)
  
  
<4k io>
  
falloc(64k)
  
  
  <4k IO completes, converts 4k to written>
  
<4k io>
falloc(64k)
  
  
  <4k IO completes, converts 4k to written>
  
<4k io>
  

until all the 

Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-20 Thread Alberto Garcia
Cc: linux-xfs

On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> In any event, if you're seeing unclear or unexpected performance
> deltas between certain XFS configurations or other fs', I think the
> best thing to do is post a more complete description of the workload,
> filesystem/storage setup, and test results to the linux-xfs mailing
> list (feel free to cc me as well). As it is, aside from the questions
> above, it's not really clear to me what the storage stack looks like
> for this test, if/how qcow2 is involved, what the various
> 'preallocation=' modes actually mean, etc.

(see [1] for a bit of context)

I repeated the tests with a larger (125GB) filesystem. Things are a bit
faster but not radically different, here are the new numbers:

|----------------------+-------+-------|
| preallocation mode   |   xfs |  ext4 |
|----------------------+-------+-------|
| off                  |  8139 | 11688 |
| off (w/o ZERO_RANGE) |  2965 |  2780 |
| metadata             |  7768 |  9132 |
| falloc               |  7742 | 13108 |
| full                 | 41389 | 16351 |
|----------------------+-------+-------|

The numbers are I/O operations per second as reported by fio, running
inside a VM.

The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
2.16-1. I'm using QEMU 5.1.0.

fio is sending random 4KB write requests to a 25GB virtual drive, this
is the full command line:

fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
--ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
--io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
  
The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
the host (on an xfs or ext4 filesystem as the table above shows), and
it is attached to QEMU using a virtio-blk-pci device:

   -drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M

cache=none means that the image is opened with O_DIRECT and
l2-cache-size is large enough so QEMU is able to cache all the
relevant qcow2 metadata in memory.

The host is running Linux 4.19.132 and has an SSD drive.

About the preallocation modes: a qcow2 file is divided into clusters
of the same size (64KB in this case). That is the minimum unit of
allocation, so when writing 4KB to an unallocated cluster QEMU needs
to fill the other 60KB with zeroes. So here's what happens with the
different modes:

1) off: for every write request QEMU initializes the cluster (64KB)
with fallocate(ZERO_RANGE) and then writes the 4KB of data (modes 1
and 2 are sketched in C right after this list).

2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
of the cluster with zeroes.

3) metadata: all clusters were allocated when the image was created
but they are sparse, QEMU only writes the 4KB of data.

4) falloc: all clusters were allocated with fallocate() when the image
was created, QEMU only writes 4KB of data.

5) full: all clusters were allocated by writing zeroes to all of them
when the image was created, QEMU only writes 4KB of data.
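
Here is a minimal C sketch of what modes (1) and (2) turn a single 4KB
guest write into at the filesystem level (illustration only, simplified
from what QEMU actually does; the function and variable names are made
up, and O_DIRECT alignment handling is ignored):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CLUSTER_SIZE (64 * 1024)

/* Mode (1), "off": zero the whole cluster range with fallocate(), then
 * write only the 4KB of guest data. */
static void write_mode_off(int fd, off_t cluster_off, off_t guest_off,
                           const void *data)
{
    fallocate(fd, FALLOC_FL_ZERO_RANGE, cluster_off, CLUSTER_SIZE);
    pwrite(fd, data, 4096, guest_off);
}

/* Mode (2), "off w/o ZERO_RANGE": no fallocate(), so the rest of the
 * cluster has to be filled with explicit zero buffers (simplified here
 * to one head write and one tail write around the data). */
static void write_mode_off_no_zero_range(int fd, off_t cluster_off,
                                         off_t guest_off, const void *data)
{
    static const char zeroes[CLUSTER_SIZE];   /* zero-initialised */

    pwrite(fd, zeroes, guest_off - cluster_off, cluster_off);
    pwrite(fd, data, 4096, guest_off);
    pwrite(fd, zeroes, cluster_off + CLUSTER_SIZE - (guest_off + 4096),
           guest_off + 4096);
}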

As I said in a previous message I'm not familiar with xfs, but the
parts that I don't understand are

   - Why is (4) slower than (1)?
   - Why is (5) so much faster than everything else?

I hope I didn't forget anything, tell me if you have questions.

Berto

[1] https://lists.gnu.org/archive/html/qemu-block/2020-08/msg00481.html



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-19 Thread Brian Foster
On Wed, Aug 19, 2020 at 05:07:11PM +0200, Kevin Wolf wrote:
> Am 19.08.2020 um 16:25 hat Alberto Garcia geschrieben:
> > On Mon 17 Aug 2020 05:53:07 PM CEST, Kevin Wolf wrote:
> > >> > Or are you saying that ZERO_RANGE + pwrite on a sparse file (=
> > >> > cluster allocation) is faster for you than just the pwrite alone (=
> > >> > writing to already allocated cluster)?
> > >> 
> > >> Yes, 20% faster in my tests (4KB random writes), but in the latter
> > >> case the cluster is already allocated only at the qcow2 level, not on
> > >> the filesystem. preallocation=falloc is faster than
> > >> preallocation=metadata (preallocation=off sits in the middle).
> > >
> > > Hm, this feels wrong. Doing more operations should never be faster
> > > than doing less operations.
> > >
> > > Maybe the difference is in allocating 64k at once instead of doing a
> > > separate allocation for every 4k block? But with the extent size hint
> > > patches to file-posix, we should allocate 1 MB at once by default now
> > > (if your test image was newly created). Can you check whether this is
> > > in effect for your image file?
> > 
> > I checked with xfs on my computer. I'm not very familiar with that
> > filesystem so I was using the default options and I didn't tune
> > anything.
> > 
> > What I got with my tests (using fio):
> > 
> > - Using extent_size_hint didn't make any difference in my test case (I
> >   do see a clear difference however with the test case described in
> >   commit ffa244c84a).
> 
> Hm, interesting. What is your exact fio configuration? Specifically,
> which iodepth are you using? I guess with a low iodepth (and O_DIRECT),
> the effect of draining the queue might not be as visible.
> 
> > - preallocation=off is still faster than preallocation=metadata.
> 
> Brian, can you help us here with some input?
> 
> Essentially what we're having here is a sparse image file on XFS that is
> opened with O_DIRECT (presumably - Berto, is this right?), and Berto is
> seeing cases where a random write benchmark is faster if we're doing the
> 64k ZERO_RANGE + 4k pwrite when touching a 64k cluster for the first
> time compared to always just doing the 4k pwrite. This is with a 1 MB
> extent size hint.
> 

Which is with the 1MB extent size hint? Both, or just the non-ZERO_RANGE
test? A quick test on a vm shows that a 1MB extent size hint widens a
smaller zero range request to the hint. Just based on that, I guess I
wouldn't expect much difference between the tests in the former case
(extra syscall overhead perhaps?) since they'd both be doing 1MB extent
allocs and 4k dio writes. If the hint is only active in the latter case,
then I suppose you'd be comparing 64k unwritten allocs + 4k writes vs.
1MB unwritten allocs + 4k writes.
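
If it helps with checking the hint, here is a small C sketch of reading
(and optionally setting) it through the fsxattr ioctls, which as far as
I can tell is the same interface the file-posix extent size hint code
uses; the path and the 1 MB value are only examples:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>          /* struct fsxattr, FS_IOC_FSGETXATTR */

int main(void)
{
    int fd = open("image.qcow2", O_RDWR);
    struct fsxattr fsx;

    if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return 1;

    printf("extsize hint: %u bytes, EXTSIZE flag %s\n", fsx.fsx_extsize,
           (fsx.fsx_xflags & FS_XFLAG_EXTSIZE) ? "set" : "not set");

    /* Setting a 1 MB hint explicitly (XFS may refuse this once the
     * file already has extents allocated): */
    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
    fsx.fsx_extsize = 1024 * 1024;
    ioctl(fd, FS_IOC_FSSETXATTR, &fsx);

    close(fd);
    return 0;
}

The same information is also available via 'xfs_io -c extsize <file>'.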

I also see that Berto noted in a followup email that the XFS filesystem
is close to full, which can have a significant effect on block
allocation performance. I'd strongly recommend not testing for
performance under low free space conditions.


> From the discussions we had the other day [1][2] I took away that your
> suggestion is that we should not try to optimise things with
> fallocate(), but just write the areas we really want to write and let
> the filesystem deal with the sparse parts. Especially with the extent
> size hint that we're now setting, I'm surprised to hear that doing a
> ZERO_RANGE first still seems to improve the performance.
> 
> Do you have any idea why this is happening and what we should be doing
> with this?
> 

Note that I'm just returning from several weeks of leave so my memory is
a bit fuzzy, but I thought the previous issues were around performance
associated with fragmentation caused by doing such small allocations
over time, not necessarily finding the most performant configuration
according to a particular I/O benchmark.

In any event, if you're seeing unclear or unexpected performance deltas
between certain XFS configurations or other fs', I think the best thing
to do is post a more complete description of the workload,
filesystem/storage setup, and test results to the linux-xfs mailing list
(feel free to cc me as well). As it is, aside from the questions above,
it's not really clear to me what the storage stack looks like for this
test, if/how qcow2 is involved, what the various 'preallocation=' modes
actually mean, etc.

Brian

> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1850660
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666864
> 
> >   If I disable handle_alloc_space() (so there is no ZERO_RANGE used)
> >   then it is much slower.
> 
> This makes some sense because then we're falling back to writing
> explicit zero buffers (unless you disabled that, too).
> 
> > - With preallocation=falloc I get the same results as with
> >   preallocation=metadata.
> 
> Interesting, this means that the fallocate() call costs you basically no
> time. I would have expected preallocation=falloc to be a little faster.
> 
> > - preallocation=full is the fastest by far.

Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-19 Thread Alberto Garcia
On Wed 19 Aug 2020 05:37:12 PM CEST, Alberto Garcia wrote:
> I ran the test again on a newly created filesystem just to make sure,
> here are the full results (numbers are IOPS):
>
> |----------------------+-------+-------|
> | preallocation        |  ext4 |   xfs |
> |----------------------+-------+-------|
> | off                  | 11688 |  6981 |
> | off (w/o ZERO_RANGE) |  2780 |  3196 |
> | metadata             |  9132 |  5764 |
> | falloc               | 13108 |  5727 |
> | full                 | 16351 | 40759 |
> |----------------------+-------+-------|

Oh, and this is probably relevant, but the ext4 fs has much more free
space than the xfs one (which is almost full with the fully allocated
image). I'll try to run the tests again tomorrow with a larger
filesystem.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-19 Thread Alberto Garcia
On Wed 19 Aug 2020 05:07:11 PM CEST, Kevin Wolf wrote:
>> I checked with xfs on my computer. I'm not very familiar with that
>> filesystem so I was using the default options and I didn't tune
>> anything.
>> 
>> What I got with my tests (using fio):
>> 
>> - Using extent_size_hint didn't make any difference in my test case (I
>>   do see a clear difference however with the test case described in
>>   commit ffa244c84a).
>
> Hm, interesting. What is your exact fio configuration? Specifically,
> which iodepth are you using? I guess with a low iodepth (and O_DIRECT),
> the effect of draining the queue might not be as visible.

fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
--ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
--io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60

>> - preallocation=off is still faster than preallocation=metadata.
>
> Brian, can you help us here with some input?
>
> Essentially what we're having here is a sparse image file on XFS that
> is opened with O_DIRECT (presumably - Berto, is this right?), and
> Berto is seeing cases where a random write benchmark is faster if
> we're doing the 64k ZERO_RANGE + 4k pwrite when touching a 64k cluster
> for the first time compared to always just doing the 4k pwrite. This
> is with a 1 MB extent size hint.

A couple of notes:

- Yes, it's O_DIRECT (the image is opened with cache=none and fio uses
  --direct=1).

- The extent size hint is the default one, I didn't change or set
  anything for this test (or should I have?).

> From the discussions we had the other day [1][2] I took away that your
> suggestion is that we should not try to optimise things with
> fallocate(), but just write the areas we really want to write and let
> the filesystem deal with the sparse parts. Especially with the extent
> size hint that we're now setting, I'm surprised to hear that doing a
> ZERO_RANGE first still seems to improve the performance.
>
> Do you have any idea why this is happening and what we should be doing
> with this?
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1850660
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666864
>
>>   If I disable handle_alloc_space() (so there is no ZERO_RANGE used)
>>   then it is much slower.
>
> This makes some sense because then we're falling back to writing
> explicit zero buffers (unless you disabled that, too).

Exactly, this happens on both ext4 and xfs.

>> - With preallocation=falloc I get the same results as with
>>   preallocation=metadata.
>
> Interesting, this means that the fallocate() call costs you basically
> no time. I would have expected preallocation=falloc to be a little
> faster.

I would expect preallocation=falloc to be at least as fast as
preallocation=off (and it is, on ext4). However on xfs it seems to be
slower (?). It doesn't make sense to me.

>> - preallocation=full is the fastest by far.
>
> I guess this saves the conversion of unwritten extents to fully
> allocated ones?

However it is *much* *much* faster. I assume I must be missing something
about how the filesystem works.
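
At the file level the two creation-time strategies differ roughly as in
this sketch (illustrative only, not QEMU code; fd and size are
assumptions), which is where the unwritten extent conversion mentioned
above comes in:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void prealloc_falloc(int fd, off_t size)
{
    /* One large unwritten extent: later 4k writes still have to convert
     * their range to "written" at IO completion. */
    fallocate(fd, 0, 0, size);
}

static void prealloc_full(int fd, off_t size)
{
    /* Real zeroes on disk: the extents are already written, so later 4k
     * overwrites need no conversion at all. */
    static const char zeroes[64 * 1024];
    for (off_t off = 0; off < size; off += sizeof(zeroes))
        pwrite(fd, zeroes, sizeof(zeroes), off);
}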

I ran the test again on a newly created filesystem just to make sure,
here are the full results (numbers are IOPS):

|----------------------+-------+-------|
| preallocation        |  ext4 |   xfs |
|----------------------+-------+-------|
| off                  | 11688 |  6981 |
| off (w/o ZERO_RANGE) |  2780 |  3196 |
| metadata             |  9132 |  5764 |
| falloc               | 13108 |  5727 |
| full                 | 16351 | 40759 |
|----------------------+-------+-------|

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-19 Thread Kevin Wolf
Am 19.08.2020 um 16:25 hat Alberto Garcia geschrieben:
> On Mon 17 Aug 2020 05:53:07 PM CEST, Kevin Wolf wrote:
> >> > Or are you saying that ZERO_RANGE + pwrite on a sparse file (=
> >> > cluster allocation) is faster for you than just the pwrite alone (=
> >> > writing to already allocated cluster)?
> >> 
> >> Yes, 20% faster in my tests (4KB random writes), but in the latter
> >> case the cluster is already allocated only at the qcow2 level, not on
> >> the filesystem. preallocation=falloc is faster than
> >> preallocation=metadata (preallocation=off sits in the middle).
> >
> > Hm, this feels wrong. Doing more operations should never be faster
> > than doing less operations.
> >
> > Maybe the difference is in allocating 64k at once instead of doing a
> > separate allocation for every 4k block? But with the extent size hint
> > patches to file-posix, we should allocate 1 MB at once by default now
> > (if your test image was newly created). Can you check whether this is
> > in effect for your image file?
> 
> I checked with xfs on my computer. I'm not very familiar with that
> filesystem so I was using the default options and I didn't tune
> anything.
> 
> What I got with my tests (using fio):
> 
> - Using extent_size_hint didn't make any difference in my test case (I
>   do see a clear difference however with the test case described in
>   commit ffa244c84a).

Hm, interesting. What is your exact fio configuration? Specifically,
which iodepth are you using? I guess with a low iodepth (and O_DIRECT),
the effect of draining the queue might not be as visible.

> - preallocation=off is still faster than preallocation=metadata.

Brian, can you help us here with some input?

Essentially what we're having here is a sparse image file on XFS that is
opened with O_DIRECT (presumably - Berto, is this right?), and Berto is
seeing cases where a random write benchmark is faster if we're doing the
64k ZERO_RANGE + 4k pwrite when touching a 64k cluster for the first
time compared to always just doing the 4k pwrite. This is with a 1 MB
extent size hint.

From the discussions we had the other day [1][2] I took away that your
suggestion is that we should not try to optimise things with
fallocate(), but just write the areas we really want to write and let
the filesystem deal with the sparse parts. Especially with the extent
size hint that we're now setting, I'm surprised to hear that doing a
ZERO_RANGE first still seems to improve the performance.

Do you have any idea why this is happening and what we should be doing
with this?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1850660
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1666864

>   If I disable handle_alloc_space() (so there is no ZERO_RANGE used)
>   then it is much slower.

This makes some sense because then we're falling back to writing
explicit zero buffers (unless you disabled that, too).

> - With preallocation=falloc I get the same results as with
>   preallocation=metadata.

Interesting, this means that the fallocate() call costs you basically no
time. I would have expected preallocation=falloc to be a little faster.

> - preallocation=full is the fastest by far.

I guess this saves the conversion of unwritten extents to fully
allocated ones?

As the extent size hint doesn't seem to influence your test case anyway,
can I assume that ext4 behaves similarly to XFS in all four cases?

Kevin




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-19 Thread Alberto Garcia
On Mon 17 Aug 2020 05:53:07 PM CEST, Kevin Wolf wrote:
>> > Or are you saying that ZERO_RANGE + pwrite on a sparse file (=
>> > cluster allocation) is faster for you than just the pwrite alone (=
>> > writing to already allocated cluster)?
>> 
>> Yes, 20% faster in my tests (4KB random writes), but in the latter
>> case the cluster is already allocated only at the qcow2 level, not on
>> the filesystem. preallocation=falloc is faster than
>> preallocation=metadata (preallocation=off sits in the middle).
>
> Hm, this feels wrong. Doing more operations should never be faster
> than doing less operations.
>
> Maybe the difference is in allocating 64k at once instead of doing a
> separate allocation for every 4k block? But with the extent size hint
> patches to file-posix, we should allocate 1 MB at once by default now
> (if your test image was newly created). Can you check whether this is
> in effect for your image file?

I checked with xfs on my computer. I'm not very familiar with that
filesystem so I was using the default options and I didn't tune
anything.

What I got with my tests (using fio):

- Using extent_size_hint didn't make any difference in my test case (I
  do see a clear difference however with the test case described in
  commit ffa244c84a).

- preallocation=off is still faster than preallocation=metadata. If I
  disable handle_alloc_space() (so there is no ZERO_RANGE used) then it
  is much slower.

- With preallocation=falloc I get the same results as with
  preallocation=metadata.

- preallocation=full is the fastest by far.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-18 Thread Kevin Wolf
Am 17.08.2020 um 20:18 hat Alberto Garcia geschrieben:
> On Mon 17 Aug 2020 05:53:07 PM CEST, Kevin Wolf wrote:
> > Maybe the difference is in allocating 64k at once instead of doing a
> > separate allocation for every 4k block? But with the extent size hint
> > patches to file-posix, we should allocate 1 MB at once by default now
> > (if your test image was newly created). Can you check whether this is
> > in effect for your image file?
> 
> Ehmm... is that hint supported in ext4 or only in xfs?

Hm, I had understood that ext4 supports this, but looking at the kernel
code, it doesn't look like it. :-(

Kevin




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-17 Thread Alberto Garcia
On Mon 17 Aug 2020 05:53:07 PM CEST, Kevin Wolf wrote:
> Maybe the difference is in allocating 64k at once instead of doing a
> separate allocation for every 4k block? But with the extent size hint
> patches to file-posix, we should allocate 1 MB at once by default now
> (if your test image was newly created). Can you check whether this is
> in effect for your image file?

Ehmm... is that hint supported in ext4 or only in xfs?

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-17 Thread Kevin Wolf
Am 17.08.2020 um 17:31 hat Alberto Garcia geschrieben:
> On Mon 17 Aug 2020 12:10:19 PM CEST, Kevin Wolf wrote:
> >> Since commit c8bb23cbdbe / QEMU 4.1.0 (and if the storage backend
> >> allows it) writing to an image created with preallocation=metadata
> >> can be slower (20% in my tests) than writing to an image with no
> >> preallocation at all.
> >
> > A while ago we had a case where commit c8bb23cbdbe was actually
> > reported as a major performance regression, so it's a big "it
> > depends".
> >
> > XFS people told me that they consider this code a bad idea. Just
> > because it's a specialised "write zeroes" operation, it's not
> > necessarily fast on filesystems. In particular, on XFS, ZERO_RANGE
> > causes a queue drain with O_DIRECT (probably hurts cases with high
> > queue depths) and additionally even a page cache flush without
> > O_DIRECT.
> >
> > So in a way this whole thing is a two-edged sword.
> 
> I see... on ext4 the improvements are clearly visible. Are we not
> detecting this for xfs? We do have an s->is_xfs flag.

My understanding is that XFS and ext4 behave very similar in this
respect. It's not a clear loss on XFS either, some cases are improved.
But cases that get a performance regression exist, too. It's a question
of the workload, the file system state (e.g. fragmentation of the image
file) and the storage.

So I don't think checking for a specific filesystem is going to improve
things.

> >> a) shall we include a warning in the documentation ("note that this
> >> preallocation mode can result in worse performance")?
> >
> > To be honest, I don't really understand this case yet. With metadata
> > preallocation, the clusters are already marked as allocated, so why
> > would handle_alloc_space() even be called? We're not allocating new
> > clusters after all?
> 
> It's not called, what happens is what you say below:
> 
> > Or are you saying that ZERO_RANGE + pwrite on a sparse file (= cluster
> > allocation) is faster for you than just the pwrite alone (= writing to
> > already allocated cluster)?
> 
> Yes, 20% faster in my tests (4KB random writes), but in the latter case
> the cluster is already allocated only at the qcow2 level, not on the
> filesystem. preallocation=falloc is faster than preallocation=metadata
> (preallocation=off sits in the middle).

Hm, this feels wrong. Doing more operations should never be faster than
doing less operations.

Maybe the difference is in allocating 64k at once instead of doing a
separate allocation for every 4k block? But with the extent size hint
patches to file-posix, we should allocate 1 MB at once by default now
(if your test image was newly created). Can you check whether this is in
effect for your image file?

Kevin




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-17 Thread Alberto Garcia
On Mon 17 Aug 2020 12:10:19 PM CEST, Kevin Wolf wrote:
>> Since commit c8bb23cbdbe / QEMU 4.1.0 (and if the storage backend
>> allows it) writing to an image created with preallocation=metadata
>> can be slower (20% in my tests) than writing to an image with no
>> preallocation at all.
>
> A while ago we had a case where commit c8bb23cbdbe was actually
> reported as a major performance regression, so it's a big "it
> depends".
>
> XFS people told me that they consider this code a bad idea. Just
> because it's a specialised "write zeroes" operation, it's not
> necessarily fast on filesystems. In particular, on XFS, ZERO_RANGE
> causes a queue drain with O_DIRECT (probably hurts cases with high
> queue depths) and additionally even a page cache flush without
> O_DIRECT.
>
> So in a way this whole thing is a two-edged sword.

I see... on ext4 the improvements are clearly visible. Are we not
detecting this for xfs? We do have an s->is_xfs flag.

>> a) shall we include a warning in the documentation ("note that this
>> preallocation mode can result in worse performance")?
>
> To be honest, I don't really understand this case yet. With metadata
> preallocation, the clusters are already marked as allocated, so why
> would handle_alloc_space() even be called? We're not allocating new
> clusters after all?

It's not called, what happens is what you say below:

> Or are you saying that ZERO_RANGE + pwrite on a sparse file (= cluster
> allocation) is faster for you than just the pwrite alone (= writing to
> already allocated cluster)?

Yes, 20% faster in my tests (4KB random writes), but in the latter case
the cluster is already allocated only at the qcow2 level, not on the
filesystem. preallocation=falloc is faster than preallocation=metadata
(preallocation=off sits in the middle).

>> b) why don't we also initialize preallocated clusters with
>>QCOW_OFLAG_ZERO? (at least when there's no subclusters involved,
>>i.e. no backing file). This would make reading from them (and
>>writing to them, after this patch) faster.
>
> Because the idea with metadata preallocation is that you don't have to
> perform any COW and update any metadata because everything is already
> allocated. If you set the zero flag, you get cluster allocations with
> COW again, defeating the whole purpose of the preallocation.

Fair enough.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-17 Thread Alberto Garcia
On Mon 17 Aug 2020 05:53:07 PM CEST, Kevin Wolf wrote:
>> > Or are you saying that ZERO_RANGE + pwrite on a sparse file (=
>> > cluster allocation) is faster for you than just the pwrite alone (=
>> > writing to already allocated cluster)?
>> 
>> Yes, 20% faster in my tests (4KB random writes), but in the latter
>> case the cluster is already allocated only at the qcow2 level, not on
>> the filesystem. preallocation=falloc is faster than
>> preallocation=metadata (preallocation=off sits in the middle).
>
> Hm, this feels wrong. Doing more operations should never be faster
> than doing less operations.
>
> Maybe the difference is in allocating 64k at once instead of doing a
> separate allocation for every 4k block?

That's what I imagine, yes. I'll have a look at your patches and tell
you.

Berto



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-17 Thread Kevin Wolf
Am 14.08.2020 um 16:57 hat Alberto Garcia geschrieben:
> Hi,
> 
> the patch is self-explanatory, but I'm using the cover letter to raise
> a couple of related questions.
> 
> Since commit c8bb23cbdbe / QEMU 4.1.0 (and if the storage backend
> allows it) writing to an image created with preallocation=metadata can
> be slower (20% in my tests) than writing to an image with no
> preallocation at all.

A while ago we had a case where commit c8bb23cbdbe was actually reported
as a major performance regression, so it's a big "it depends".

XFS people told me that they consider this code a bad idea. Just because
it's a specialised "write zeroes" operation, it's not necessarily fast
on filesystems. In particular, on XFS, ZERO_RANGE causes a queue drain
with O_DIRECT (probably hurts cases with high queue depths) and
additionally even a page cache flush without O_DIRECT.

So in a way this whole thing is a two-edged sword.

> So:
> 
> a) shall we include a warning in the documentation ("note that this
>preallocation mode can result in worse performance")?

To be honest, I don't really understand this case yet. With metadata
preallocation, the clusters are already marked as allocated, so why
would handle_alloc_space() even be called? We're not allocating new
clusters after all?

Or are you saying that ZERO_RANGE + pwrite on a sparse file (= cluster
allocation) is faster for you than just the pwrite alone (= writing to
already allocated cluster)?

> b) why don't we also initialize preallocated clusters with
>QCOW_OFLAG_ZERO? (at least when there's no subclusters involved,
>i.e. no backing file). This would make reading from them (and
>writing to them, after this patch) faster.

Because the idea with metadata preallocation is that you don't have to
perform any COW and update any metadata because everything is already
allocated. If you set the zero flag, you get cluster allocations with
COW again, defeating the whole purpose of the preallocation.

Kevin




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-14 Thread Vladimir Sementsov-Ogievskiy

Hi!

14.08.2020 17:57, Alberto Garcia wrote:

> Hi,
>
> the patch is self-explanatory, but I'm using the cover letter to raise
> a couple of related questions.
>
> Since commit c8bb23cbdbe / QEMU 4.1.0 (and if the storage backend
> allows it) writing to an image created with preallocation=metadata can
> be slower (20% in my tests) than writing to an image with no
> preallocation at all.
>
> So:
>
> a) shall we include a warning in the documentation ("note that this
> preallocation mode can result in worse performance")?


I think the best thing to do is to make it work fast in all cases if
possible (I assume that would be your patch plus a positive answer to
[b]? Or not?) :)

Andrey recently added a benchmark with some cases where c8bb23cbdbe
brings benefits:
[PATCH v6] scripts/simplebench: compare write request performance
<1594741846-475697-1-git-send-email-andrey.shinkev...@virtuozzo.com>
queued in Eduardo's python-next: 
https://github.com/ehabkost/qemu/commit/9519f87d900b0ef30075c749fa097bd93471553f

So, as a first step, could you post your tests so we can add them to this
benchmark? Or post a patch to simplebench on top of Eduardo's python-next.



> b) why don't we also initialize preallocated clusters with
> QCOW_OFLAG_ZERO? (at least when there's no subclusters involved,
> i.e. no backing file). This would make reading from them (and
> writing to them, after this patch) faster.


Probably they are not guaranteed to be zero on all filesystems? But I
think at least in some cases (99% :) we can mark them as ZERO. Honestly,
I may not be aware of the actual reasons.



> Berto
>
> Alberto Garcia (1):
>    qcow2: Skip copy-on-write when allocating a zero cluster
>
>   include/block/block.h |  2 +-
>   block/commit.c|  2 +-
>   block/io.c| 20 +---
>   block/mirror.c|  3 ++-
>   block/qcow2.c | 26 --
>   block/replication.c   |  2 +-
>   block/stream.c|  2 +-
>   qemu-img.c|  2 +-
>   8 files changed, 40 insertions(+), 19 deletions(-)




--
Best regards,
Vladimir



[PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-14 Thread Alberto Garcia
Hi,

the patch is self-explanatory, but I'm using the cover letter to raise
a couple of related questions.

Since commit c8bb23cbdbe / QEMU 4.1.0 (and if the storage backend
allows it) writing to an image created with preallocation=metadata can
be slower (20% in my tests) than writing to an image with no
preallocation at all.

So:

a) shall we include a warning in the documentation ("note that this
   preallocation mode can result in worse performance")?

b) why don't we also initialize preallocated clusters with
   QCOW_OFLAG_ZERO? (at least when there's no subclusters involved,
   i.e. no backing file). This would make reading from them (and
   writing to them, after this patch) faster.

Berto

Alberto Garcia (1):
  qcow2: Skip copy-on-write when allocating a zero cluster

 include/block/block.h |  2 +-
 block/commit.c|  2 +-
 block/io.c| 20 +---
 block/mirror.c|  3 ++-
 block/qcow2.c | 26 --
 block/replication.c   |  2 +-
 block/stream.c|  2 +-
 qemu-img.c|  2 +-
 8 files changed, 40 insertions(+), 19 deletions(-)

-- 
2.20.1