Re: [Qemu-devel] Fwd: Re: Tunneled Migration with Non-Shared Storage

2014-11-20 Thread Gary R Hook

On 11/20/14 3:54 AM, Dr. David Alan Gilbert wrote:

* Gary R Hook (grhookatw...@gmail.com) wrote:

Ugh, I wish I could teach Thunderbird to understand how to reply to a
newsgroup.

Apologies to Paolo for the direct note.

On 11/19/14 4:19 AM, Paolo Bonzini wrote:



On 19/11/2014 10:35, Dr. David Alan Gilbert wrote:

* Paolo Bonzini (pbonz...@redhat.com) wrote:



On 18/11/2014 21:28, Dr. David Alan Gilbert wrote:

This seems odd, since as far as I know the tunneling code is quite separate
to the migration code; I thought the only thing that the migration
code sees different is the file descriptors it gets passed.
(Having said that, again I don't know storage stuff, so if this
is a storage special there may be something there...)


Tunnelled migration uses the old block-migration.c code.  Non-tunnelled
migration uses the NBD server and block/mirror.c.


OK, that explains that.  Is that because the tunneling code can't
deal with tunneling the NBD server connection?
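
For reference, the non-tunnelled path described here is driven from outside QEMU
with a handful of QMP commands, roughly what libvirt does: start an NBD server on
the destination, export the pre-created empty disk, drive-mirror the source disk
into that export, then migrate the RAM/CPU state as usual. The sketch below only
illustrates that sequence; the socket paths, the device name "drive-virtio-disk0",
host names and ports are assumptions, and error handling (including waiting for
the BLOCK_JOB_READY event) is omitted.

import json
import socket

def qmp_command(path, cmd, **args):
    """Send a single QMP command over a UNIX monitor socket and return the reply."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    f = s.makefile("rw")
    f.readline()                                               # QMP greeting
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    f.readline()                                               # capabilities ack
    f.write(json.dumps({"execute": cmd, "arguments": args}) + "\n")
    f.flush()
    return json.loads(f.readline())

SRC = "/var/run/qemu-src.qmp"   # source VM monitor socket (path assumed)
DST = "/var/run/qemu-dst.qmp"   # destination VM monitor socket (started with -incoming)

# 1. Destination: export the (pre-created, empty) target disk over NBD.
qmp_command(DST, "nbd-server-start",
            addr={"type": "inet", "data": {"host": "0.0.0.0", "port": "10809"}})
qmp_command(DST, "nbd-server-add", device="drive-virtio-disk0", writable=True)

# 2. Source: mirror the whole disk into the destination's NBD export.
qmp_command(SRC, "drive-mirror",
            device="drive-virtio-disk0",
            target="nbd:dest-host:10809:exportname=drive-virtio-disk0",
            sync="full", mode="existing")

# 3. After the mirror reports BLOCK_JOB_READY, migrate the device/RAM state.
qmp_command(SRC, "migrate", uri="tcp:dest-host:4444")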


The main problem with
the old code is that it uses a possibly unbounded amount of memory in
mig_save_device_dirty and can have huge jitter if any serious workload
is running in the guest.


So that's sending dirty blocks iteratively? Not that I can see
when the allocations get freed; but is the amount allocated there
related to total disk size (as Gary suggested) or to the amount
of dirty blocks?


It should be related to the maximum rate limit (which can be set to
arbitrarily high values, however).


This makes no sense. The code in block_save_iterate() specifically
attempts to control the rate of transfer. But when
qemu_file_get_rate_limit() returns a number like 922337203685372723
(0xCCB) I'm under the impression that no bandwidth
constraints are being imposed at this layer. Why, then, would that
transfer be occurring at 20MB/sec (simple, under-utilized 1 gigE
connection) with no clear bottleneck in CPU or network? What other
relation might exist?
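
To make the numbers concrete: block_save_iterate() keeps queueing block reads
while the data submitted or already read, counted in BLOCK_SIZE (1 MiB) chunks,
stays below qemu_file_get_rate_limit(). The arithmetic below is a simplification
of that guard, not the actual code, but it shows why a limit that large never
bites, and therefore why the observed 20 MB/sec cannot be coming from this check.

BLOCK_SIZE = 1 << 20                  # 1 MiB chunks, as in block-migration.c

rate_limit = 922337203685372723       # the value observed above, ~9.2e17 bytes
disk_size = 25 * (1 << 30)            # the 25GB test disk mentioned in this thread

# Rough form of the loop guard in block_save_iterate():
#   (submitted + read_done) * BLOCK_SIZE < qemu_file_get_rate_limit(f)
blocks_allowed = rate_limit // BLOCK_SIZE
blocks_in_disk = disk_size // BLOCK_SIZE

print(f"blocks the guard would allow per iteration: {blocks_allowed:,}")   # ~880 billion
print(f"blocks in the whole 25GB disk:              {blocks_in_disk:,}")   # 25,600
# The guard never triggers, so nothing in this layer throttles the transfer
# or bounds how many read buffers can pile up in memory at once.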


Disk IO on the disk that you're trying to transfer?


Well, non-tunneled runs fast enough (120 MB/s) to saturate the network 
pipe, so it's evident to me that the blocks can come screaming from the 
disk plenty fast. And there's no CPU bottleneck; the VM is really not 
doing much of anything at all. So I'll say no. I shall continue my 
investigation.



The reads are started, then the ones that are ready are sent and the
blocks are freed in flush_blks.  The jitter happens when the guest reads
a lot but only writes a few blocks.  In that case, the bdrv_drain_all in
mig_save_device_dirty can be called relatively often and it can be
expensive because it also waits for all guest-initiated reads to complete.


Pardon my ignorance, but this does not match my observations. What I am
seeing is the process size of the source qemu grow steadily until the
COR completes; during this time the backing file on the destination
system does not change/grow at all, which implies that no blocks are
being transferred. (I have tested this with a 25GB VM disk, and larger;
no network activity occurs during this period.) Once the COR is done and
the in-memory copy is ready (marked by a "Completed 100%" message from
blk_mig_save_bulked_block()), the transfer begins. At an abysmally slow
rate, I'll add, per the above. Another problem to be investigated.


Odd thought; can you try dropping your migration bandwidth limit
(migrate_set_speed) - try something low, like 10M - does the behaviour
stay the same, or does it start transmitting disk data before it's read
the lot?


Interesting idea. I shall attempt that.
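
For anyone reproducing this: the limit Dave suggests can be set from HMP as
"migrate_set_speed 10m", or over QMP where the value is given in bytes per
second. A minimal example, reusing the qmp_command() helper sketched earlier
and the same assumed socket path:

# Cap migration (including old-style block migration) at ~10 MB/s.
qmp_command("/var/run/qemu-src.qmp", "migrate_set_speed", value=10 * 1024 * 1024)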


The bulk phase is similar, just with different functions (the reads are
done in mig_save_device_bulk).  With a high rate limit, the total
allocated memory can reach a few gigabytes indeed.


Much, much more than that. It's definitely dependent upon the disk file
size. Tiny VM disks are a nit; big VM disks are a problem.


Well, if as you say it's not starting transmitting for some reason until
it's read the lot then that would make sense.


Right. I'm just saying that I don't think this works the way people 
think it works.



Depending on the scenario, a possible disadvantage of NBD migration is
that it can only throttle each disk separately, while the old code will
apply a single limit to all migrations.


How about no throttling at all? And just to be very clear, the goal is
fast (NBD-based) migrations of VMs using non-shared storage over an
encrypted channel. Safest, worst-case scenario. Aside from gaining an
understanding of this code.
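
As an aside, the two throttles being contrasted here live in different places:
the old block-migration path obeys the global migrate_set_speed value, while the
NBD/drive-mirror path is throttled per block job. A hedged sketch of the per-disk
form, again reusing the qmp_command() helper and the assumed device name from the
earlier sketch; a speed of 0 removes the limit entirely:

# Per-disk throttle on the drive-mirror job feeding the NBD export
# (speed is in bytes per second; 0 means unthrottled).
qmp_command("/var/run/qemu-src.qmp", "block-job-set-speed",
            device="drive-virtio-disk0", speed=0)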


There are vague plans to add TLS support for encrypting these streams
internally to qemu; but they're just thoughts at the moment.


:-(

--
Gary R Hook
Senior Kernel Engineer
NIMBOXX, Inc



Re: [Qemu-devel] Fwd: Re: Tunneled Migration with Non-Shared Storage

2014-11-20 Thread Dr. David Alan Gilbert
* Gary R Hook (grhookatw...@gmail.com) wrote:
> Ugh, I wish I could teach Thunderbird to understand how to reply to a
> newsgroup.
> 
> Apologies to Paolo for the direct note.
> 
> On 11/19/14 4:19 AM, Paolo Bonzini wrote:
> >
> >
> >On 19/11/2014 10:35, Dr. David Alan Gilbert wrote:
> >>* Paolo Bonzini (pbonz...@redhat.com) wrote:
> >>>
> >>>
> >>>On 18/11/2014 21:28, Dr. David Alan Gilbert wrote:
> This seems odd, since as far as I know the tunneling code is quite 
> separate
> to the migration code; I thought the only thing that the migration
> code sees different is the file descriptors it gets passed.
> (Having said that, again I don't know storage stuff, so if this
> is a storage special there may be something there...)
> >>>
> >>>Tunnelled migration uses the old block-migration.c code.  Non-tunnelled
> >>>migration uses the NBD server and block/mirror.c.
> >>
> >>OK, that explains that.  Is that because the tunneling code can't
> >>deal with tunneling the NBD server connection?
> >>
> >>>The main problem with
> >>>the old code is that it uses a possibly unbounded amount of memory in
> >>>mig_save_device_dirty and can have huge jitter if any serious workload
> >>>is running in the guest.
> >>
> >>So that's sending dirty blocks iteratively? Not that I can see
> >>when the allocations get freed; but is the amount allocated there
> >>related to total disk size (as Gary suggested) or to the amount
> >>of dirty blocks?
> >
> >It should be related to the maximum rate limit (which can be set to
> >arbitrarily high values, however).
> 
> This makes no sense. The code in block_save_iterate() specifically
> attempts to control the rate of transfer. But when
> qemu_file_get_rate_limit() returns a number like 922337203685372723
> (0xCCB) I'm under the impression that no bandwidth
> constraints are being imposed at this layer. Why, then, would that
> transfer be occurring at 20MB/sec (simple, under-utilized 1 gigE
> connection) with no clear bottleneck in CPU or network? What other
> relation might exist?

Disk IO on the disk that you're trying to transfer?

> >The reads are started, then the ones that are ready are sent and the
> >blocks are freed in flush_blks.  The jitter happens when the guest reads
> >a lot but only writes a few blocks.  In that case, the bdrv_drain_all in
> >mig_save_device_dirty can be called relatively often and it can be
> >expensive because it also waits for all guest-initiated reads to complete.
> 
> Pardon my ignorance, but this does not match my observations. What I am
> seeing is the process size of the source qemu grow steadily until the
> COR completes; during this time the backing file on the destination
> system does not change/grow at all, which implies that no blocks are
> being transferred. (I have tested this with a 25GB VM disk, and larger;
> no network activity occurs during this period.) Once the COR is done and
> the in-memory copy is ready (marked by a "Completed 100%" message from
> blk_mig_save_bulked_block()), the transfer begins. At an abysmally slow
> rate, I'll add, per the above. Another problem to be investigated.

Odd thought; can you try dropping your migration bandwidth limit
(migrate_set_speed) - try something low, like 10M - does the behaviour
stay the same, or does it start transmitting disk data before it's read
the lot?

> >The bulk phase is similar, just with different functions (the reads are
> >done in mig_save_device_bulk).  With a high rate limit, the total
> >allocated memory can reach a few gigabytes indeed.
> 
> Much, much more than that. It's definitely dependent upon the disk file
> size. Tiny VM disks are a nit; big VM disks are a problem.

Well, if as you say it's not starting transmitting for some reason until
it's read the lot then that would make sense.

> >Depending on the scenario, a possible disadvantage of NBD migration is
> >that it can only throttle each disk separately, while the old code will
> >apply a single limit to all migrations.
> 
> How about no throttling at all? And just to be very clear, the goal is
> fast (NBD-based) migrations of VMs using non-shared storage over an
> encrypted channel. Safest, worst-case scenario. Aside from gaining an
> understanding of this code.

There are vague plans to add TLS support for encrypting these streams
internally to qemu; but they're just thoughts at the moment.

> Thank you for your attention.

Dave

> 
> -- 
> Gary R Hook
> Senior Kernel Engineer
> NIMBOXX, Inc
> 
> 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



[Qemu-devel] Fwd: Re: Tunneled Migration with Non-Shared Storage

2014-11-19 Thread Gary R Hook
Ugh, I wish I could teach Thunderbird to understand how to reply to a 
newsgroup.


Apologies to Paolo for the direct note.

On 11/19/14 4:19 AM, Paolo Bonzini wrote:



On 19/11/2014 10:35, Dr. David Alan Gilbert wrote:

* Paolo Bonzini (pbonz...@redhat.com) wrote:



On 18/11/2014 21:28, Dr. David Alan Gilbert wrote:

This seems odd, since as far as I know the tunneling code is quite separate
to the migration code; I thought the only thing that the migration
code sees different is the file descriptors it gets passed.
(Having said that, again I don't know storage stuff, so if this
is a storage special there may be something there...)


Tunnelled migration uses the old block-migration.c code.  Non-tunnelled
migration uses the NBD server and block/mirror.c.


OK, that explains that.  Is that because the tunneling code can't
deal with tunneling the NBD server connection?


The main problem with
the old code is that it uses a possibly unbounded amount of memory in
mig_save_device_dirty and can have huge jitter if any serious workload
is running in the guest.


So that's sending dirty blocks iteratively? Not that I can see
when the allocations get freed; but is the amount allocated there
related to total disk size (as Gary suggested) or to the amount
of dirty blocks?


It should be related to the maximum rate limit (which can be set to
arbitrarily high values, however).


This makes no sense. The code in block_save_iterate() specifically
attempts to control the rate of transfer. But when
qemu_file_get_rate_limit() returns a number like 922337203685372723
(0xCCB) I'm under the impression that no bandwidth
constraints are being imposed at this layer. Why, then, would that
transfer be occurring at 20MB/sec (simple, under-utilized 1 gigE
connection) with no clear bottleneck in CPU or network? What other
relation might exist?



The reads are started, then the ones that are ready are sent and the
blocks are freed in flush_blks.  The jitter happens when the guest reads
a lot but only writes a few blocks.  In that case, the bdrv_drain_all in
mig_save_device_dirty can be called relatively often and it can be
expensive because it also waits for all guest-initiated reads to complete.


Pardon my ignorance, but this does not match my observations. What I am
seeing is the process size of the source qemu grow steadily until the
COR completes; during this time the backing file on the destination
system does not change/grow at all, which implies that no blocks are
being transferred. (I have tested this with a 25GB VM disk, and larger;
no network activity occurs during this period.) Once the COR is done and
the in-memory copy is ready (marked by a "Completed 100%" message from
blk_mig_save_bulked_block()), the transfer begins. At an abysmally slow
rate, I'll add, per the above. Another problem to be investigated.



The bulk phase is similar, just with different functions (the reads are
done in mig_save_device_bulk).  With a high rate limit, the total
allocated memory can reach a few gigabytes indeed.


Much, much more than that. It's definitely dependent upon the disk file
size. Tiny VM disks are a nit; big VM disks are a problem.


Depending on the scenario, a possible disadvantage of NBD migration is
that it can only throttle each disk separately, while the old code will
apply a single limit to all migrations.


How about no throttling at all? And just to be very clear, the goal is
fast (NBD-based) migrations of VMs using non-shared storage over an
encrypted channel. Safest, worst-case scenario. Aside from gaining an
understanding of this code.

Thank you for your attention.

--
Gary R Hook
Senior Kernel Engineer
NIMBOXX, Inc