Re: [Qemu-devel] [PATCH v3 0/4] qcow2: async handling of fragmented io

2019-08-16 Thread Vladimir Sementsov-Ogievskiy
15.08.2019 18:39, Vladimir Sementsov-Ogievskiy wrote:
> 15.08.2019 17:09, Max Reitz wrote:
>> On 15.08.19 14:10, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> Here is an asynchronous scheme for handling fragmented qcow2
>>> reads and writes. Both the qcow2 read and write functions loop through
>>> sequential portions of data. The series aims to parallelize these
>>> loop iterations.
>>> It improves performance for fragmented qcow2 images; I've tested it
>>> as described below.
>>
>> Looks good to me, but I can’t take it yet because I need to wait for
>> Stefan’s branch to be merged, of course.
>>
>> Speaking of which, why didn’t you add any tests for the *_part()
>> methods?  I find it a bit unsettling that nothing would have caught the
>> bug you had in v2 in patch 3.
>>
> 
> Hmm, any test with a write to a fragmented area should have caught it. OK,
> I'll think of something.
> 
> 

And now I see that it's not trivial to make such a test:

1. qcow2's write is broken when it is given a nonzero qiov_offset, but only
qcow2_write calls bdrv_co_pwritev_part, so we need a test where a qcow2 image
is the file child of another qcow2 image.

2. Then, the bug leads to the beginning of the qiov being written to all
parts. But our testing tool qemu-io only has the "write -P" command, whose
buffer is filled with the same single byte, so we can't catch it (see the
illustration below).
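
To illustrate point 2, here is a small standalone Python sketch (not QEMU code;
sizes and names are made up) that simulates the v2 bug, i.e. the first part of
the qiov being written to every part: a buffer filled with one repeated byte,
as "write -P" produces, still verifies fine, while a position-dependent pattern
would expose the corruption.

# Hypothetical illustration, not QEMU code: why a single-byte pattern
# cannot detect "beginning of the qiov written to all parts".
PART = 4096     # size of one sequential part of the request
NPARTS = 4      # number of parts the request is split into

def buggy_write(qiov):
    """Simulate the v2 bug: every part receives the data of the first part."""
    return qiov[:PART] * NPARTS

uniform = bytes([0xaa]) * (PART * NPARTS)                   # what "write -P 0xaa" sends
patterned = bytes(i // PART for i in range(PART * NPARTS))  # byte value = part index

assert buggy_write(uniform) == uniform        # bug is invisible: check passes
assert buggy_write(patterned) != patterned    # a varied pattern would catch it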


-- 
Best regards,
Vladimir


Re: [Qemu-devel] [PATCH v3 0/4] qcow2: async handling of fragmented io

2019-08-15 Thread Vladimir Sementsov-Ogievskiy
15.08.2019 17:09, Max Reitz wrote:
> On 15.08.19 14:10, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> Here is an asynchronous scheme for handling fragmented qcow2
>> reads and writes. Both the qcow2 read and write functions loop through
>> sequential portions of data. The series aims to parallelize these
>> loop iterations.
>> It improves performance for fragmented qcow2 images; I've tested it
>> as described below.
> 
> Looks good to me, but I can’t take it yet because I need to wait for
> Stefan’s branch to be merged, of course.
> 
> Speaking of which, why didn’t you add any tests for the *_part()
> methods?  I find it a bit unsettling that nothing would have caught the
> bug you had in v2 in patch 3.
> 

Hmm, any test with a write to a fragmented area should have caught it. OK,
I'll think of something.


-- 
Best regards,
Vladimir


Re: [Qemu-devel] [PATCH v3 0/4] qcow2: async handling of fragmented io

2019-08-15 Thread Vladimir Sementsov-Ogievskiy
15.08.2019 16:21, Max Reitz wrote:
> On 15.08.19 14:10, Vladimir Sementsov-Ogievskiy wrote:
>> 01: - use coroutine_fn where appropriate !!!
> 
> :-)
> 


Ahahaha, I'll explain:

When comparing v2 vs v3 and writing this difference script, I noticed
that I had added the coroutine_fn marks only to the .h files and not to the
.c files. So I added some "!!!" so as not to forget to fix it.


-- 
Best regards,
Vladimir


Re: [Qemu-devel] [PATCH v3 0/4] qcow2: async handling of fragmented io

2019-08-15 Thread Max Reitz
On 15.08.19 14:10, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> Here is an asynchronous scheme for handling fragmented qcow2
> reads and writes. Both the qcow2 read and write functions loop through
> sequential portions of data. The series aims to parallelize these
> loop iterations.
> It improves performance for fragmented qcow2 images; I've tested it
> as described below.

Looks good to me, but I can’t take it yet because I need to wait for
Stefan’s branch to be merged, of course.

Speaking of which, why didn’t you add any tests for the *_part()
methods?  I find it a bit unsettling that nothing would have caught the
bug you had in v2 in patch 3.

Max





Re: [Qemu-devel] [PATCH v3 0/4] qcow2: async handling of fragmented io

2019-08-15 Thread Max Reitz
On 15.08.19 14:10, Vladimir Sementsov-Ogievskiy wrote:
> 01: - use coroutine_fn where appropriate !!!

:-)





[Qemu-devel] [PATCH v3 0/4] qcow2: async handling of fragmented io

2019-08-15 Thread Vladimir Sementsov-Ogievskiy
Hi all!

Here is an asynchronous scheme for handling fragmented qcow2
reads and writes. Both the qcow2 read and write functions loop through
sequential portions of data. The series aims to parallelize these
loop iterations.
It improves performance for fragmented qcow2 images; I've tested it
as described below.
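
As a rough sketch of the idea (a Python/asyncio analogy, not the actual C
implementation; the part size, worker limit and function names below are
illustrative), each request is split into its sequential parts and up to a
fixed number of parts (cf. QCOW2_MAX_WORKERS) are handled concurrently instead
of strictly one after another:

# Analogy only: the series does this in C with coroutines and an AIO task pool.
import asyncio

MAX_WORKERS = 8      # cf. the fixed worker limit (QCOW2_MAX_WORKERS)
PART = 64 * 1024     # one sequential portion of a fragmented request

async def handle_part(offset, length):
    """Stand-in for reading/writing one sequential portion."""
    await asyncio.sleep(0)   # the real code would issue the actual I/O here
    return offset, length

async def handle_request(offset, length):
    sem = asyncio.Semaphore(MAX_WORKERS)   # cap the number of in-flight parts

    async def run(part_off, part_len):
        async with sem:
            return await handle_part(part_off, part_len)

    # Instead of awaiting each part in turn (the old sequential loop),
    # start them all and wait for the whole set to finish.
    parts = [run(o, min(PART, offset + length - o))
             for o in range(offset, offset + length, PART)]
    return await asyncio.gather(*parts)

if __name__ == '__main__':
    asyncio.run(handle_request(0, 1024 * 1024))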

v3 (based on Max's comments) [perf results not updated]:

01: - use coroutine_fn where appropriate !!!
- add aio_task_pool_free
- add some comments
- move header to include/block
- s/wait_done/waiting
02: - Rewrite note about decryption in guest buffers [thx to Eric]
- separate g_assert_not_reached for QCOW2_CLUSTER_ZERO_*
- drop return after g_assert_not_reached
03: - drop bytes_done and correctly use qiov_offset
- fix comment
04: - move QCOW2_MAX_WORKERS to block/qcow2.h
- initialize ret in qcow2_co_preadv_part
Based-on: https://github.com/stefanha/qemu/commits/block


v2: changed a lot, as
 1. a lot of preparation around locks, hd_qiovs and threads for encryption
has been done
 2. I decided to create a separate file with an async request handling API, to
reuse it for backup, stream and copy-on-read to improve their performance
too. Mirror and qemu-img convert have their own async request handling;
maybe we'll finally be able to merge all this similar code into one
feature.
Note that not all of the API calls are used in qcow2 yet; some will be needed
in following steps for parallelizing other I/O loops.

About testing:

I have four 4G qcow2 images (with the default 64k cluster size) on my SSD disk:
t-seq.qcow2 - sequentially written qcow2 image
t-reverse.qcow2 - filled by writing 64k portions from the end to the start
t-rand.qcow2 - filled by writing 64k portions (aligned) in random order
t-part-rand.qcow2 - filled by shuffling the order of the 64k writes within 1m chunks
(see the image generation source code at the end for details)

and I've done several runs like the following (sequential I/O in 1m chunks):

out=/tmp/block; echo > $out
cat /tmp/files | while read file; do
    for wr in {"","-w"}; do
        echo "$file" $wr
        ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $wr "$file" \
            | grep 'Run completed in' | awk '{print $4}' >> $out
    done
done


short info about parameters:
  -w - do writes (otherwise do reads)
  -c - count of blocks
  -s - block size
  -t none - disable cache
  -n - native aio
  -d 1 - don't use parallel requests provided by qemu-img bench itself
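
For convenience, a rough Python equivalent of the shell loop above (a sketch
only; it assumes ./qemu-img in the current directory and the image list in
/tmp/files, exactly as the original command does):

#!/usr/bin/env python
import subprocess

with open('/tmp/files') as f:
    files = [line.strip() for line in f if line.strip()]

for path in files:
    for wr in ([], ['-w']):
        cmd = (['./qemu-img', 'bench', '-c', '4096', '-d', '1', '-f', 'qcow2',
                '-n', '-s', '1m', '-t', 'none'] + wr + [path])
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        for line in out.splitlines():
            if line.startswith('Run completed in'):
                # "Run completed in 14.671 seconds." -> print "14.671"
                print(path, ' '.join(wr), line.split()[3])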

results (run time in seconds, lower is better):

+---+-+-+
| file  | master  | async   |
+---+-+-+
| /ssd/t-part-rand.qcow2| 14.671  | 9.193   |
+---+-+-+
| /ssd/t-part-rand.qcow2 -w | 11.434  | 8.621   |
+---+-+-+
| /ssd/t-rand.qcow2 | 20.421  | 10.05   |
+---+-+-+
| /ssd/t-rand.qcow2 -w  | 11.097  | 8.915   |
+---+-+-+
| /ssd/t-reverse.qcow2  | 17.515  | 9.407   |
+---+-+-+
| /ssd/t-reverse.qcow2 -w   | 11.255  | 8.649   |
+---+-+-+
| /ssd/t-seq.qcow2  | 9.081   | 9.072   |
+---+-+-+
| /ssd/t-seq.qcow2 -w   | 8.761   | 8.747   |
+---+-+-+
| /tmp/t-part-rand.qcow2| 41.179  | 41.37   |
+---+-+-+
| /tmp/t-part-rand.qcow2 -w | 54.097  | 55.323  |
+---+-+-+
| /tmp/t-rand.qcow2 | 711.899 | 514.339 |
+---+-+-+
| /tmp/t-rand.qcow2 -w  | 546.259 | 642.114 |
+---+-+-+
| /tmp/t-reverse.qcow2  | 86.065  | 96.522  |
+---+-+-+
| /tmp/t-reverse.qcow2 -w   | 46.557  | 48.499  |
+---+-+-+
| /tmp/t-seq.qcow2  | 33.804  | 33.862  |
+---+-+-+
| /tmp/t-seq.qcow2 -w   | 34.299  | 34.233  |
+---+-+-+


The performance gain is obvious, especially for reads and especially on SSD.
For HDD there is a degradation in the reverse case, but this is the most
unlikely case in practice and does not seem critical.

How images are generated:

== gen-writes ==
#!/usr/bin/env python
import random
import sys

size = 4 * 1024 * 1024 * 1024
block = 64 * 1024
block2 = 1024 * 1024

arg = sys.argv[1]

if arg in ('rand', 'reverse', 'seq'):
    writes = list(range(0, size, block))

    if arg == 'rand':
        random.shuffle(writes)
    elif arg == 'reverse':
        writes.reverse()
elif arg == 'part-rand':
    writes = []