Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO

2018-03-07 Thread Peter Lieven
Am 07.03.2018 um 10:47 schrieb Stefan Hajnoczi:
> On Wed, Mar 7, 2018 at 7:55 AM, Peter Lieven  wrote:
>> Am 06.03.2018 um 17:35 schrieb Peter Lieven:
>>> Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
 On Mon, Mar 05, 2018 at 02:52:16PM +, Dr. David Alan Gilbert wrote:
> * Peter Lieven (p...@kamp.de) wrote:
>> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
>>> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
 I stumbled across the MAX_INFLIGHT_IO constant that was introduced in
 2015 and was curious why 512MB was chosen as the readahead limit. I ask
 because I found that the source VM becomes very unresponsive I/O-wise
 while the initial 512MB are read and furthermore seems to stay
 unresponsive if we choose a high migration speed and have fast storage
 on the destination VM.

 In our environment I changed this value to 16MB, which seems to work
 much more smoothly. I wonder if we should make this user-configurable,
 or at least define a separate rate limit for the block transfer in the
 bulk stage?
>>> I don't know if benchmarks were run when choosing the value.  From the
>>> commit description it sounds like the main purpose was to limit the
>>> amount of memory that can be consumed.
>>>
>>> 16 MB also fulfills that criterion :), but why is the source VM more
>>> responsive with a lower value?
>>>
>>> Perhaps the issue is queue depth on the storage device - the block
>>> migration code enqueues up to 512 MB worth of reads, and guest I/O has
>>> to wait?
>> That is my guess. Especially if the destination storage is faster, we
>> basically always have 512 I/Os in flight on the source storage.
>>
>> Does anyone mind if we reduce that value to 16MB, or do we need a better
>> mechanism?
> We've got migration-parameters these days; you could connect it to one
> of those fairly easily I think.
> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
> already there.
> Then you can set it to whatever you like.
 It would be nice to solve the performance problem without adding a
 tuneable.

 On the other hand, QEMU has no idea what the queue depth of the device
 is.  Therefore it cannot prioritize guest I/O over block migration I/O.

 512 parallel requests is much too high.  Most parallel I/O benchmarking
 is done at 32-64 queue depth.

 I think that 16 parallel requests is a reasonable maximum number for a
 background job.

 We need to be clear though that the purpose of this change is unrelated
 to the original 512 MB memory footprint goal.  It just happens to touch
 the same constant but the goal is now to submit at most 16 I/O requests
 in parallel to avoid monopolizing the I/O device.
>>> I think we should really look at this. The variables that control whether
>>> we stay in the while loop are incremented and decremented in the following
>>> places:
>>>
>>> mig_save_device_dirty:
>>> mig_save_device_bulk:
>>> block_mig_state.submitted++;
>>>
>>> blk_mig_read_cb:
>>> block_mig_state.submitted--;
>>> block_mig_state.read_done++;
>>>
>>> flush_blks:
>>> block_mig_state.read_done--;
>>>
>>> The condition of the while loop is:
>>> (block_mig_state.submitted + block_mig_state.read_done) * BLOCK_SIZE <
>>>     qemu_file_get_rate_limit(f) &&
>>> (block_mig_state.submitted + block_mig_state.read_done) <
>>>     MAX_INFLIGHT_IO
>>>
>>> First, I wonder whether we ever reach the rate limit, because we only put
>>> the read buffers onto f AFTER we exit the while loop.
>>>
>>> And even if we reach the limit, we constantly maintain 512 I/Os in parallel,
>>> because we immediately decrement read_done when we put the buffers onto f in
>>> flush_blks. In the next iteration of the while loop we then read again until
>>> we have 512 in-flight I/Os.
>>>
>>> And shouldn't we have a time limit on how long we stay in the while loop? I
>>> think we artificially delay sending data to f.
>> Thinking about it for a while I would propose the following:
>>
>> a) rename MAX_INFLIGHT_IO to MAX_IO_BUFFERS
>> b) add MAX_PARALLEL_IO with a value of 16
>> c) compare qemu_file_get_rate_limit only with block_mig_state.read_done
>>
>> This would yield the following condition for the while loop:
>>
>> (block_mig_state.read_done * BLOCK_SIZE < qemu_file_get_rate_limit(f) &&
>>  (block_mig_state.submitted + block_mig_state.read_done) < MAX_IO_BUFFERS &&
>>  block_mig_state.submitted < MAX_PARALLEL_IO)
>>
>> Does that sound like a plan?
> That sounds good to me.

I will prepare patches for this.

Peter





Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO

2018-03-07 Thread Stefan Hajnoczi
On Wed, Mar 7, 2018 at 7:55 AM, Peter Lieven  wrote:
> Am 06.03.2018 um 17:35 schrieb Peter Lieven:
>> Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
>>> On Mon, Mar 05, 2018 at 02:52:16PM +, Dr. David Alan Gilbert wrote:
 * Peter Lieven (p...@kamp.de) wrote:
> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
>> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>>> I stumbled across the MAX_INFLIGHT_IO constant that was introduced in
>>> 2015 and was curious why 512MB was chosen as the readahead limit. I ask
>>> because I found that the source VM becomes very unresponsive I/O-wise
>>> while the initial 512MB are read and furthermore seems to stay
>>> unresponsive if we choose a high migration speed and have fast storage
>>> on the destination VM.
>>>
>>> In our environment I changed this value to 16MB, which seems to work
>>> much more smoothly. I wonder if we should make this user-configurable,
>>> or at least define a separate rate limit for the block transfer in the
>>> bulk stage?
>> I don't know if benchmarks were run when choosing the value.  From the
>> commit description it sounds like the main purpose was to limit the
>> amount of memory that can be consumed.
>>
>> 16 MB also fulfills that criterion :), but why is the source VM more
>> responsive with a lower value?
>>
>> Perhaps the issue is queue depth on the storage device - the block
>> migration code enqueues up to 512 MB worth of reads, and guest I/O has
>> to wait?
> That is my guess. Especially if the destination storage is faster, we
> basically always have 512 I/Os in flight on the source storage.
>
> Does anyone mind if we reduce that value to 16MB, or do we need a better
> mechanism?
 We've got migration-parameters these days; you could connect it to one
 of those fairly easily I think.
 Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
 already there.
 Then you can set it to whatever you like.
>>> It would be nice to solve the performance problem without adding a
>>> tuneable.
>>>
>>> On the other hand, QEMU has no idea what the queue depth of the device
>>> is.  Therefore it cannot prioritize guest I/O over block migration I/O.
>>>
>>> 512 parallel requests is much too high.  Most parallel I/O benchmarking
>>> is done at 32-64 queue depth.
>>>
>>> I think that 16 parallel requests is a reasonable maximum number for a
>>> background job.
>>>
>>> We need to be clear though that the purpose of this change is unrelated
>>> to the original 512 MB memory footprint goal.  It just happens to touch
>>> the same constant but the goal is now to submit at most 16 I/O requests
>>> in parallel to avoid monopolizing the I/O device.
>> I think we should really look at this. The variables that control whether
>> we stay in the while loop are incremented and decremented in the following
>> places:
>>
>> mig_save_device_dirty:
>> mig_save_device_bulk:
>> block_mig_state.submitted++;
>>
>> blk_mig_read_cb:
>> block_mig_state.submitted--;
>> block_mig_state.read_done++;
>>
>> flush_blks:
>> block_mig_state.read_done--;
>>
>> The condition of the while loop is:
>> (block_mig_state.submitted + block_mig_state.read_done) * BLOCK_SIZE <
>>     qemu_file_get_rate_limit(f) &&
>> (block_mig_state.submitted + block_mig_state.read_done) <
>>     MAX_INFLIGHT_IO
>>
>> First, I wonder whether we ever reach the rate limit, because we only put
>> the read buffers onto f AFTER we exit the while loop.
>>
>> And even if we reach the limit, we constantly maintain 512 I/Os in parallel,
>> because we immediately decrement read_done when we put the buffers onto f in
>> flush_blks. In the next iteration of the while loop we then read again until
>> we have 512 in-flight I/Os.
>>
>> And shouldn't we have a time limit on how long we stay in the while loop? I
>> think we artificially delay sending data to f.
>
> Thinking about it for a while I would propose the following:
>
> a) rename MAX_INFLIGHT_IO to MAX_IO_BUFFERS
> b) add MAX_PARALLEL_IO with a value of 16
> c) compare qemu_file_get_rate_limit only with block_mig_state.read_done
>
> This would yield the following condition for the while loop:
>
> (block_mig_state.read_done * BLOCK_SIZE < qemu_file_get_rate_limit(f) &&
>  (block_mig_state.submitted + block_mig_state.read_done) < MAX_IO_BUFFERS &&
>  block_mig_state.submitted < MAX_PARALLEL_IO)
>
> Does that sound like a plan?

That sounds good to me.

Stefan



Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO

2018-03-06 Thread Peter Lieven
Am 06.03.2018 um 17:35 schrieb Peter Lieven:
> Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
>> On Mon, Mar 05, 2018 at 02:52:16PM +, Dr. David Alan Gilbert wrote:
>>> * Peter Lieven (p...@kamp.de) wrote:
 Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>> I stumbled across the MAX_INFLIGHT_IO constant that was introduced in
>> 2015 and was curious why 512MB was chosen as the readahead limit. I ask
>> because I found that the source VM becomes very unresponsive I/O-wise
>> while the initial 512MB are read and furthermore seems to stay
>> unresponsive if we choose a high migration speed and have fast storage
>> on the destination VM.
>>
>> In our environment I changed this value to 16MB, which seems to work
>> much more smoothly. I wonder if we should make this user-configurable,
>> or at least define a separate rate limit for the block transfer in the
>> bulk stage?
> I don't know if benchmarks were run when choosing the value.  From the
> commit description it sounds like the main purpose was to limit the
> amount of memory that can be consumed.
>
> 16 MB also fulfills that criterion :), but why is the source VM more
> responsive with a lower value?
>
> Perhaps the issue is queue depth on the storage device - the block
> migration code enqueues up to 512 MB worth of reads, and guest I/O has
> to wait?
 That is my guess. Especially if the destination storage is faster, we
 basically always have 512 I/Os in flight on the source storage.

 Does anyone mind if we reduce that value to 16MB, or do we need a better
 mechanism?
>>> We've got migration-parameters these days; you could connect it to one
>>> of those fairly easily I think.
>>> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
>>> already there.
>>> Then you can set it to whatever you like.
>> It would be nice to solve the performance problem without adding a
>> tuneable.
>>
>> On the other hand, QEMU has no idea what the queue depth of the device
>> is.  Therefore it cannot prioritize guest I/O over block migration I/O.
>>
>> 512 parallel requests is much too high.  Most parallel I/O benchmarking
>> is done at 32-64 queue depth.
>>
>> I think that 16 parallel requests is a reasonable maximum number for a
>> background job.
>>
>> We need to be clear though that the purpose of this change is unrelated
>> to the original 512 MB memory footprint goal.  It just happens to touch
>> the same constant but the goal is now to submit at most 16 I/O requests
>> in parallel to avoid monopolizing the I/O device.
> I think we should really look at this. The variables that control whether
> we stay in the while loop are incremented and decremented in the following
> places:
>
> mig_save_device_dirty:
> mig_save_device_bulk:
> block_mig_state.submitted++;
>
> blk_mig_read_cb:
> block_mig_state.submitted--;
> block_mig_state.read_done++;
>
> flush_blks:
> block_mig_state.read_done--;
>
> The condition of the while loop is:
> (block_mig_state.submitted + block_mig_state.read_done) * BLOCK_SIZE <
>     qemu_file_get_rate_limit(f) &&
> (block_mig_state.submitted + block_mig_state.read_done) <
>     MAX_INFLIGHT_IO
>
> First, I wonder whether we ever reach the rate limit, because we only put
> the read buffers onto f AFTER we exit the while loop.
>
> And even if we reach the limit, we constantly maintain 512 I/Os in parallel,
> because we immediately decrement read_done when we put the buffers onto f in
> flush_blks. In the next iteration of the while loop we then read again until
> we have 512 in-flight I/Os.
>
> And shouldn't we have a time limit on how long we stay in the while loop? I
> think we artificially delay sending data to f.

Thinking about it for a while I would propose the following:

a) rename MAX_INFLIGHT_IO to MAX_IO_BUFFERS
b) add MAX_PARALLEL_IO with a value of 16
c) compare qemu_file_get_rate_limit only with block_mig_state.read_done

This would yield the following condition for the while loop:

(block_mig_state.read_done * BLOCK_SIZE < qemu_file_get_rate_limit(f) &&
 (block_mig_state.submitted + block_mig_state.read_done) < MAX_IO_BUFFERS &&
 block_mig_state.submitted < MAX_PARALLEL_IO)

Does that sound like a plan?
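
For concreteness, the loop head might then look roughly like this (an untested
sketch; MAX_IO_BUFFERS keeps the current 512-buffer cap and MAX_PARALLEL_IO is
the new value proposed above):

#define MAX_IO_BUFFERS   512   /* was MAX_INFLIGHT_IO: bounds buffered data  */
#define MAX_PARALLEL_IO   16   /* new: bounds reads in flight on the source  */

while (block_mig_state.read_done * BLOCK_SIZE <
           qemu_file_get_rate_limit(f) &&
       (block_mig_state.submitted + block_mig_state.read_done) <
           MAX_IO_BUFFERS &&
       block_mig_state.submitted < MAX_PARALLEL_IO) {
    /* ... submit the next bulk/dirty read as before ... */
}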

Peter





Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO

2018-03-06 Thread Peter Lieven
Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
> On Mon, Mar 05, 2018 at 02:52:16PM +, Dr. David Alan Gilbert wrote:
>> * Peter Lieven (p...@kamp.de) wrote:
>>> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
 On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
> I stumbled across the MAX_INFLIGHT_IO constant that was introduced in
> 2015 and was curious why 512MB was chosen as the readahead limit. I ask
> because I found that the source VM becomes very unresponsive I/O-wise
> while the initial 512MB are read and furthermore seems to stay
> unresponsive if we choose a high migration speed and have fast storage
> on the destination VM.
>
> In our environment I changed this value to 16MB, which seems to work
> much more smoothly. I wonder if we should make this user-configurable,
> or at least define a separate rate limit for the block transfer in the
> bulk stage?
 I don't know if benchmarks were run when choosing the value.  From the
 commit description it sounds like the main purpose was to limit the
 amount of memory that can be consumed.

 16 MB also fulfills that criterion :), but why is the source VM more
 responsive with a lower value?

 Perhaps the issue is queue depth on the storage device - the block
 migration code enqueues up to 512 MB worth of reads, and guest I/O has
 to wait?
>>> That is my guess. Especially if the destination storage is faster, we
>>> basically always have 512 I/Os in flight on the source storage.
>>>
>>> Does anyone mind if we reduce that value to 16MB, or do we need a better
>>> mechanism?
>> We've got migration-parameters these days; you could connect it to one
>> of those fairly easily I think.
>> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
>> already there.
>> Then you can set it to whatever you like.
> It would be nice to solve the performance problem without adding a
> tuneable.
>
> On the other hand, QEMU has no idea what the queue depth of the device
> is.  Therefore it cannot prioritize guest I/O over block migration I/O.
>
> 512 parallel requests is much too high.  Most parallel I/O benchmarking
> is done at 32-64 queue depth.
>
> I think that 16 parallel requests is a reasonable maximum number for a
> background job.
>
> We need to be clear though that the purpose of this change is unrelated
> to the original 512 MB memory footprint goal.  It just happens to touch
> the same constant but the goal is now to submit at most 16 I/O requests
> in parallel to avoid monopolizing the I/O device.

I think we should really look at this. The variables that control whether we
stay in the while loop are incremented and decremented in the following places:

mig_save_device_dirty:
mig_save_device_bulk:
block_mig_state.submitted++;

blk_mig_read_cb:
block_mig_state.submitted--;
block_mig_state.read_done++;

flush_blks:
block_mig_state.read_done--;

The condition of the while loop is:
(block_mig_state.submitted + block_mig_state.read_done) * BLOCK_SIZE <
    qemu_file_get_rate_limit(f) &&
(block_mig_state.submitted + block_mig_state.read_done) <
    MAX_INFLIGHT_IO

First, I wonder whether we ever reach the rate limit, because we only put
the read buffers onto f AFTER we exit the while loop.

And even if we reach the limit, we constantly maintain 512 I/Os in parallel,
because we immediately decrement read_done when we put the buffers onto f in
flush_blks. In the next iteration of the while loop we then read again until
we have 512 in-flight I/Os.

And shouldn't we have a time limit on how long we stay in the while loop? I
think we artificially delay sending data to f.
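
To make this concrete, here is a toy model of just the counter updates
described above (it is not the real migration/block.c code, only an
illustration of the pattern):

#include <stdio.h>

#define MAX_INFLIGHT_IO 512

static int submitted;   /* reads issued, completion pending (mig_save_device_*) */
static int read_done;   /* reads completed, waiting to be flushed to f          */

int main(void)
{
    /* the while loop fills up to the cap ... */
    while (submitted + read_done < MAX_INFLIGHT_IO) {
        submitted++;                  /* mig_save_device_bulk / _dirty */
    }

    /* ... one read completes and its buffer is flushed to f ... */
    submitted--; read_done++;         /* blk_mig_read_cb */
    read_done--;                      /* flush_blks */

    /* ... and the next loop iteration immediately refills to the cap, so the
     * source device effectively sees ~512 requests in flight all the time. */
    printf("in flight after one flush: %d of %d\n",
           submitted + read_done, MAX_INFLIGHT_IO);
    return 0;
}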

Peter





Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO

2018-03-06 Thread Stefan Hajnoczi
On Mon, Mar 05, 2018 at 02:52:16PM +, Dr. David Alan Gilbert wrote:
> * Peter Lieven (p...@kamp.de) wrote:
> > Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
> > > On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
> > >> I stumbled across the MAX_INFLIGHT_IO constant that was introduced in
> > >> 2015 and was curious why 512MB was chosen as the readahead limit. I ask
> > >> because I found that the source VM becomes very unresponsive I/O-wise
> > >> while the initial 512MB are read and furthermore seems to stay
> > >> unresponsive if we choose a high migration speed and have fast storage
> > >> on the destination VM.
> > >>
> > >> In our environment I changed this value to 16MB, which seems to work
> > >> much more smoothly. I wonder if we should make this user-configurable,
> > >> or at least define a separate rate limit for the block transfer in the
> > >> bulk stage?
> > > I don't know if benchmarks were run when choosing the value.  From the
> > > commit description it sounds like the main purpose was to limit the
> > > amount of memory that can be consumed.
> > >
> > > 16 MB also fulfills that criterion :), but why is the source VM more
> > > responsive with a lower value?
> > >
> > > Perhaps the issue is queue depth on the storage device - the block
> > > migration code enqueues up to 512 MB worth of reads, and guest I/O has
> > > to wait?
> > 
> > That is my guess. Especially if the destination storage is faster, we
> > basically always have 512 I/Os in flight on the source storage.
> >
> > Does anyone mind if we reduce that value to 16MB, or do we need a better
> > mechanism?
> 
> We've got migration-parameters these days; you could connect it to one
> of those fairly easily I think.
> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
> already there.
> Then you can set it to whatever you like.

It would be nice to solve the performance problem without adding a
tuneable.

On the other hand, QEMU has no idea what the queue depth of the device
is.  Therefore it cannot prioritize guest I/O over block migration I/O.

512 parallel requests is much too high.  Most parallel I/O benchmarking
is done at 32-64 queue depth.

I think that 16 parallel requests is a reasonable maximum number for a
background job.

We need to be clear though that the purpose of this change is unrelated
to the original 512 MB memory footprint goal.  It just happens to touch
the same constant but the goal is now to submit at most 16 I/O requests
in parallel to avoid monopolizing the I/O device.
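
For scale, assuming BLOCK_SIZE in migration/block.c is still 1 MiB (worth
double-checking), the current cap works out to:

    512 requests * 1 MiB/request = 512 MiB of buffered reads,
    i.e. a source-side queue depth of 512, roughly 8-16x the 32-64
    typically used for parallel I/O benchmarking.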

Stefan

