Re: [Qemu-devel] live migration vs device assignment (motivation)

2016-01-03 Thread Lan Tianyu
On 2015年12月30日 00:46, Michael S. Tsirkin wrote:
> Interesting. So you are saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?
> 
> If so that is wrong - clearly should reconfigure the old IP
> back without playing with dhcp. For testing, just set up
> a static IP.

The MAC and IP are migrated with the VM to the target machine, so there is no need
to reconfigure the IP after migration.

From my test results, ixgbevf_down() consumes 35ms and ixgbevf_up()
consumes 55ms during migration.
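
For reference, the numbers above can be taken with a wrapper along the
lines of the sketch below; the wrapper itself is illustrative only (it
assumes the driver's private ixgbevf.h for struct ixgbevf_adapter) and is
not part of the posted patches.

/*
 * Minimal timing sketch: wrap the real ixgbevf_down()/ixgbevf_up() calls
 * with ktime_get() to reproduce the 35ms/55ms numbers above.
 * Assumes the ixgbevf private header for struct ixgbevf_adapter.
 */
#include <linux/ktime.h>
#include <linux/printk.h>

static void ixgbevf_time_down_up(struct ixgbevf_adapter *adapter)
{
    ktime_t t0, t1, t2;

    t0 = ktime_get();
    ixgbevf_down(adapter);
    t1 = ktime_get();

    ixgbevf_up(adapter);
    t2 = ktime_get();

    pr_info("ixgbevf_down: %lld us, ixgbevf_up: %lld us\n",
            ktime_to_us(ktime_sub(t1, t0)),
            ktime_to_us(ktime_sub(t2, t1)));
}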

-- 
Best regards
Tianyu Lan


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-29 Thread Alexander Duyck
On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin  wrote:
> On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>> >As long as you keep up this vague talk about performance during
>> >migration, without even bothering with any measurements, this patchset
>> >will keep going nowhere.
>> >
>>
>> I measured network service downtime for "keep device alive"(RFC patch V1
>> presented) and "put down and up network interface"(RFC patch V2 presented)
>> during migration with some optimizations.
>>
>> The former is around 140ms and the latter is around 240ms.
>>
>> My patchset relies on the mailbox irq which doesn't work in the suspend state
>> and so can't get downtime for suspend/resume cases. Will try to get the
>> result later.
>
>
> Interesting. So you are saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?


Actually it wouldn't surprise me if that is due to the reset logic in
the driver.  For starters there is a 10 msec delay in the call
ixgbevf_reset_hw_vf which I believe is present to allow the PF time to
clear registers after the VF has requested a reset.  There is also a
10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
were disabled.  That is in addition to the fact that the function that
disables the queues does so serially and polls each queue until the
hardware acknowledges that the queues are actually disabled.  The
driver also does the serial enable with poll logic on re-enabling the
queues which likely doesn't help things.

Really this driver is probably in need of a refactor to clean the
cruft out of the reset and initialization logic.  I suspect we have
far more delays than we really need and that is the source of much of
the slow down.
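
For readers not familiar with the driver, the serial disable-then-poll
pattern being described looks roughly like the sketch below. It is
reconstructed from memory, so the register and structure names should be
treated as approximate rather than as the actual ixgbevf code; it also
assumes the driver's private headers and <linux/delay.h>.

static void disable_rx_queues_serially(struct ixgbevf_adapter *adapter)
{
    struct ixgbe_hw *hw = &adapter->hw;
    int i, wait;
    u32 rxdctl;

    for (i = 0; i < adapter->num_rx_queues; i++) {
        u8 reg_idx = adapter->rx_ring[i]->reg_idx;

        /* Clear the queue enable bit ... */
        rxdctl = IXGBE_READ_REG(hw, IXGBE_VFRXDCTL(reg_idx));
        IXGBE_WRITE_REG(hw, IXGBE_VFRXDCTL(reg_idx),
                        rxdctl & ~IXGBE_RXDCTL_ENABLE);

        /* ... then poll until the hardware acknowledges it before
         * moving on to the next queue. */
        wait = 10;
        do {
            udelay(10);
            rxdctl = IXGBE_READ_REG(hw, IXGBE_VFRXDCTL(reg_idx));
        } while (--wait && (rxdctl & IXGBE_RXDCTL_ENABLE));
    }

    /* Plus the 10-20 msec settle time mentioned above. */
    usleep_range(10000, 20000);
}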

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-29 Thread Michael S. Tsirkin
On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> >As long as you keep up this vague talk about performance during
> >migration, without even bothering with any measurements, this patchset
> >will keep going nowhere.
> >
> 
> I measured network service downtime for "keep device alive"(RFC patch V1
> presented) and "put down and up network interface"(RFC patch V2 presented)
> during migration with some optimizations.
> 
> The former is around 140ms and the latter is around 240ms.
> 
> My patchset relies on the mailbox irq which doesn't work in the suspend state
> and so can't get downtime for suspend/resume cases. Will try to get the
> result later.


Interesting. So you are saying merely ifdown/ifup is 100ms?
This does not sound reasonable.
Is there a chance you are e.g. getting IP from dhcp?

If so that is wrong - clearly should reconfigure the old IP
back without playing with dhcp. For testing, just set up
a static IP.

> >
> >
> >
> >There's Alex's patch that tracks memory changes during migration.  It
> >needs some simple enhancements to be useful in production (e.g. add a
> >host/guest handshake to both enable tracking in guest and to detect the
> >support in host), then it can allow starting migration with an assigned
> >device, by invoking hot-unplug after most of memory has been migrated.
> >
> >Please implement this in qemu and measure the speed.
> 
> Sure. Will do that.
> 
> >I will not be surprised if destroying/creating netdev in linux
> >turns out to take too long, but before anyone bothered
> >checking, it does not make sense to discuss further enhancements.


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-29 Thread Alexander Duyck
On Tue, Dec 29, 2015 at 9:15 AM, Michael S. Tsirkin  wrote:
> On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote:
>> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin  wrote:
>> > On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>> >>
>> >>
>> >> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>> >> >As long as you keep up this vague talk about performance during
>> >> >migration, without even bothering with any measurements, this patchset
>> >> >will keep going nowhere.
>> >> >
>> >>
>> >> I measured network service downtime for "keep device alive"(RFC patch V1
>> >> presented) and "put down and up network interface"(RFC patch V2 presented)
>> >> during migration with some optimizations.
>> >>
>> >> The former is around 140ms and the latter is around 240ms.
>> >>
>> >> My patchset relies on the mailbox irq which doesn't work in the suspend
>> >> state
>> >> and so can't get downtime for suspend/resume cases. Will try to get the
>> >> result later.
>> >
>> >
>> > Interesting. So you are saying merely ifdown/ifup is 100ms?
>> > This does not sound reasonable.
>> > Is there a chance you are e.g. getting IP from dhcp?
>>
>>
>> Actually it wouldn't surprise me if that is due to the reset logic in
>> the driver.  For starters there is a 10 msec delay in the call
>> ixgbevf_reset_hw_vf which I believe is present to allow the PF time to
>> clear registers after the VF has requested a reset.  There is also a
>> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
>> were disabled.  That is in addition to the fact that the function that
>> disables the queues does so serially and polls each queue until the
>> hardware acknowledges that the queues are actually disabled.  The
>> driver also does the serial enable with poll logic on re-enabling the
>> queues which likely doesn't help things.
>>
>> Really this driver is probably in need of a refactor to clean the
>> cruft out of the reset and initialization logic.  I suspect we have
>> far more delays than we really need and that is the source of much of
>> the slow down.
>
> For ifdown, why is there any need to reset the device at all?
> Is it so buffers can be reclaimed?
>

I believe it is mostly historical.  All the Intel drivers are derived
from e1000.  The e1000 has a 10ms sleep to allow outstanding PCI
transactions to complete before resetting and it looks like they ended
up inheriting that in the ixgbevf driver.  I suppose it does allow for
the buffers to be reclaimed which is something we may need, though the
VF driver should have already verified that it disabled the queues
when it was doing the polling on the bits being cleared in the
individual queue control registers.  Likely the 10ms sleep is
redundant as a result.
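
For context, the reset path being discussed boils down to roughly the
following sketch (again reconstructed from memory, so the register writes
and constants are illustrative rather than the actual
ixgbevf_reset_hw_vf() code; it assumes the driver's private headers and
<linux/delay.h>).

static void vf_request_reset_sketch(struct ixgbe_hw *hw)
{
    /* Ask the hardware to reset this VF function ... */
    IXGBE_WRITE_REG(hw, IXGBE_VFCTRL, IXGBE_CTRL_RST);
    IXGBE_WRITE_FLUSH(hw);

    /*
     * ... then wait for the PF to finish clearing the VF registers.
     * This is the 10 msec cost called out above, apparently inherited
     * from e1000, and likely redundant once the queues have already
     * been verified as disabled via the per-queue polling.
     */
    msleep(10);

    /* The real code then re-negotiates the mailbox API and reads back
     * the MAC address from the PF. */
}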

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-29 Thread Michael S. Tsirkin
On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote:
> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin  wrote:
> > On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
> >>
> >>
> >> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> >> >As long as you keep up this vague talk about performance during
> >> >migration, without even bothering with any measurements, this patchset
> >> >will keep going nowhere.
> >> >
> >>
> >> I measured network service downtime for "keep device alive"(RFC patch V1
> >> presented) and "put down and up network interface"(RFC patch V2 presented)
> >> during migration with some optimizations.
> >>
> >> The former is around 140ms and the latter is around 240ms.
> >>
> >> My patchset relies on the mailbox irq which doesn't work in the suspend
> >> state
> >> and so can't get downtime for suspend/resume cases. Will try to get the
> >> result later.
> >
> >
> > Interesting. So you are saying merely ifdown/ifup is 100ms?
> > This does not sound reasonable.
> > Is there a chance you are e.g. getting IP from dhcp?
> 
> 
> Actually it wouldn't surprise me if that is due to the reset logic in
> the driver.  For starters there is a 10 msec delay in the call
> ixgbevf_reset_hw_vf which I believe is present to allow the PF time to
> clear registers after the VF has requested a reset.  There is also a
> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
> were disabled.  That is in addition to the fact that the function that
> disables the queues does so serially and polls each queue until the
> hardware acknowledges that the queues are actually disabled.  The
> driver also does the serial enable with poll logic on re-enabling the
> queues which likely doesn't help things.
> 
> Really this driver is probably in need of a refactor to clean the
> cruft out of the reset and initialization logic.  I suspect we have
> far more delays than we really need and that is the source of much of
> the slow down.
> 
> - Alex

For ifdown, why is there any need to reset the device at all?
Is it so buffers can be reclaimed?

-- 
MST


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-28 Thread Michael S. Tsirkin
On Mon, Dec 28, 2015 at 03:20:10AM +, Dong, Eddie wrote:
> > >
> > > Even if the device driver doesn't support migration, you still want to
> > > migrate the VM? That may be risky, and we should add the "bad path" for the
> > > driver at least.
> > 
> > At a minimum we should have support for hot-plug if we are expecting to
> > support migration.  You would simply have to hot-plug the device before you
> > start migration and then return it after.  That is how the current bonding
> > approach for this works if I am not mistaken.
> 
> Hotplug is good to eliminate the device-specific state clone, but the
> bonding approach is very network specific; it doesn't work for other
> devices such as FPGA devices, QAT devices, and GPU devices, which we plan
> to support gradually :)

Alexander didn't say do bonding. He just said bonding uses hot-unplug.

Gradual and generic is the correct approach. So focus on splitting the
work into manageable pieces which are also useful by themselves, and
generally reusable by different devices.

So leave the pausing alone for a moment.

Start from Alexander's patchset for tracking dirty memory, add a way to
control and detect it from userspace (and maybe from host), and a way to
start migration while the device is attached, removing it at the last
possible moment.

That will be a nice first step.
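
As a strawman for the "control and detect it from userspace" piece,
something as small as a module parameter gating the guest-side hook would
do. The parameter and helper below are assumptions for illustration, not
part of any posted patch; host-side detection could then reuse the
capability/status bits discussed elsewhere in this thread.

#include <linux/compiler.h>
#include <linux/module.h>
#include <linux/moduleparam.h>

/* Guest-side switch for DMA page dirtying; hypothetical, for illustration. */
static bool dma_page_dirtying_enabled;
module_param(dma_page_dirtying_enabled, bool, 0644);
MODULE_PARM_DESC(dma_page_dirtying_enabled,
                 "Dirty pages written by device DMA so migration can track them");

/* Called from the DMA sync/unmap path before touching any page. */
bool dma_page_dirtying_active(void)
{
    return READ_ONCE(dma_page_dirtying_enabled);
}
EXPORT_SYMBOL_GPL(dma_page_dirtying_active);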


> > 
> > The advantage we are looking to gain is to avoid removing/disabling the
> > device for as long as possible.  Ideally we want to keep the device active
> > through the warm-up period, but if the guest doesn't do that we should still
> > be able to fall back on the older approaches if needed.
> > 


RE: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-28 Thread Pavel Fedin
 Hello!

> A dedicated IRQ per device for something that is a system wide event
> sounds like a waste.  I don't understand why a spec change is strictly
> required, we only need to support this with the specific virtual bridge
> used by QEMU, so I think that a vendor specific capability will do.
> Once this works well in the field, a PCI spec ECN might make sense
> to standardise the capability.

 Keeping track of your discussion for some time, decided to jump in...
 So far, we want to have some kind of mailbox to notify the guest about
migration. So what about a dedicated "pci device" for
this purpose? Some kind of "migration controller". This is:
a) perhaps easier to implement than a capability; we don't need to push anything
to the PCI spec.
b) could easily get along with Windows, because it means that no bus
code has to be touched at all. It would rely only on
drivers' ability to communicate with each other (I guess that should be possible
in Windows, shouldn't it?)
c) does not need to steal resources (BARs, IRQs, etc.) from the actual devices.
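
To make the idea concrete, the guest side of such a device could be as
small as the stub below. The vendor/device IDs, BAR layout and register
offsets are invented for illustration; no such device exists in QEMU
today.

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/io.h>

#define MIGCTL_VENDOR_ID   0x1af4  /* assumption: reuse the Red Hat vendor ID */
#define MIGCTL_DEVICE_ID   0x10f0  /* assumption: currently unassigned ID     */
#define MIGCTL_REG_STATUS  0x00    /* assumption: "migration pending" flag    */

static irqreturn_t migctl_irq(int irq, void *data)
{
    void __iomem *regs = data;

    if (readl(regs + MIGCTL_REG_STATUS))
        pr_info("migration controller: migration event signalled\n");
    /* Here the driver would notify registered VF drivers (e.g. via a
     * notifier chain) so they can quiesce or re-initialise. */
    return IRQ_HANDLED;
}

static int migctl_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    void __iomem *regs;
    int ret;

    ret = pcim_enable_device(pdev);
    if (ret)
        return ret;

    regs = pcim_iomap(pdev, 0, 0);
    if (!regs)
        return -ENOMEM;

    return devm_request_irq(&pdev->dev, pdev->irq, migctl_irq, IRQF_SHARED,
                            "migration-controller", regs);
}

static const struct pci_device_id migctl_ids[] = {
    { PCI_DEVICE(MIGCTL_VENDOR_ID, MIGCTL_DEVICE_ID) },
    { }
};
MODULE_DEVICE_TABLE(pci, migctl_ids);

static struct pci_driver migctl_driver = {
    .name     = "migration-controller",
    .id_table = migctl_ids,
    .probe    = migctl_probe,
};
module_pci_driver(migctl_driver);
MODULE_LICENSE("GPL");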

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-28 Thread Michael S. Tsirkin
On Sun, Dec 27, 2015 at 01:45:15PM -0800, Alexander Duyck wrote:
> On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin  wrote:
> > On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
> >> The PCI hot-plug specification calls out that the OS can optionally
> >> implement a "pause" mechanism which is meant to be used for high
> >> availability type environments.  What I am proposing is basically
> >> extending the standard SHPC capable PCI bridge so that we can support
> >> the DMA page dirtying for everything hosted on it, add a vendor
> >> specific block to the config space so that the guest can notify the
> >> host that it will do page dirtying, and add a mechanism to indicate
> >> that all hot-plug events during the warm-up phase of the migration are
> >> pause events instead of full removals.
> >
> > Two comments:
> >
> > 1. A vendor specific capability will always be problematic.
> > Better to register a capability id with pci sig.
> >
> > 2. There are actually several capabilities:
> >
> > A. support for memory dirtying
> > if not supported, we must stop device before migration
> >
> > This is supported by core guest OS code,
> > using patches similar to those you posted.
> >
> >
> > B. support for device replacement
> > This is a faster form of hotplug, where device is removed and
> > later another device using same driver is inserted in the same slot.
> >
> > This is a possible optimization, but I am convinced
> > (A) should be implemented independently of (B).
> >
> 
> My thought on this was that we don't need much to really implement
> either feature.  Really only a bit or two for either one.  I had
> thought about extending the PCI Advanced Features, but for now it
> might make more sense to just implement it as a vendor capability for
> the QEMU based bridges instead of trying to make this a true PCI
> capability since I am not sure if this in any way would apply to
> physical hardware.  The fact is the PCI Advanced Features capability
> is essentially just a vendor specific capability with a different ID

Interesting. I see it more as a backport of pci express
features to pci.

> so if we were to use 2 bits that are currently reserved in the
> capability we could later merge the functionality without much
> overhead.

Don't do this. You must not touch reserved bits.

> I fully agree that the two implementations should be separate but
> nothing says we have to implement them completely different.  If we
> are just using 3 bits for capability, status, and control of each
> feature there is no reason for them to need to be stored in separate
> locations.

True.

> >> I've been poking around in the kernel and QEMU code and the part I
> >> have been trying to sort out is how to get QEMU based pci-bridge to
> >> use the SHPC driver because from what I can tell the driver never
> >> actually gets loaded on the device as it is left in the control of
> >> ACPI hot-plug.
> >
> > There are ways, but you can just use pci express, it's easier.
> 
> That's true.  I should probably just give up on trying to do an
> implementation that works with the i440fx implementation.  I could
> probably move over to the q35 and once that is done then we could look
> at something like the PCI Advanced Features solution for something
> like the PCI-bridge drivers.
> 
> - Alex

Once we have a decent idea of what's required, I can write
an ECN for pci code and id assignment specification.
That's cleaner than vendor specific stuff that's tied to
a specific device/vendor ID.

-- 
MST


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-28 Thread Michael S. Tsirkin
On Mon, Dec 28, 2015 at 11:52:43AM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > A dedicated IRQ per device for something that is a system wide event
> > sounds like a waste.  I don't understand why a spec change is strictly
> > required, we only need to support this with the specific virtual bridge
> > used by QEMU, so I think that a vendor specific capability will do.
> > Once this works well in the field, a PCI spec ECN might make sense
> > to standardise the capability.
> 
>  Keeping track of your discussion for some time, decided to jump in...
>  So far, we want to have some kind of mailbox to notify the guest about
> migration. So what about a dedicated "pci device" for
> this purpose? Some kind of "migration controller". This is:
> a) perhaps easier to implement than a capability; we don't need to push
> anything to the PCI spec.
> b) could easily get along with Windows, because it means that no bus
> code has to be touched at all. It would rely only on
> drivers' ability to communicate with each other (I guess that should be
> possible in Windows, shouldn't it?)
> c) does not need to steal resources (BARs, IRQs, etc.) from the actual devices.
> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 

Sure, or we can use an ACPI device.  It doesn't really matter what we do
for the mailbox. Whoever writes this first will get to select a
mechanism.

-- 
MST


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-28 Thread Lan, Tianyu



On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:

As long as you keep up this vague talk about performance during
migration, without even bothering with any measurements, this patchset
will keep going nowhere.



I measured network service downtime for "keep device alive" (RFC patch V1)
and "put down and up network interface" (RFC patch V2)
during migration with some optimizations.


The former is around 140ms and the latter is around 240ms.

My patchset relies on the mailbox irq, which doesn't work in the suspend
state, so I can't get downtime numbers for the suspend/resume case. I will
try to get the result later.






There's Alex's patch that tracks memory changes during migration.  It
needs some simple enhancements to be useful in production (e.g. add a
host/guest handshake to both enable tracking in guest and to detect the
support in host), then it can allow starting migration with an assigned
device, by invoking hot-unplug after most of memory has been migrated.

Please implement this in qemu and measure the speed.


Sure. Will do that.


I will not be surprised if destroying/creating netdev in linux
turns out to take too long, but before anyone bothered
checking, it does not make sense to discuss further enhancements.



Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-27 Thread Michael S. Tsirkin
On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
> The PCI hot-plug specification calls out that the OS can optionally
> implement a "pause" mechanism which is meant to be used for high
> availability type environments.  What I am proposing is basically
> extending the standard SHPC capable PCI bridge so that we can support
> the DMA page dirtying for everything hosted on it, add a vendor
> specific block to the config space so that the guest can notify the
> host that it will do page dirtying, and add a mechanism to indicate
> that all hot-plug events during the warm-up phase of the migration are
> pause events instead of full removals.

Two comments:

1. A vendor specific capability will always be problematic.
Better to register a capability id with pci sig.

2. There are actually several capabilities:

A. support for memory dirtying
if not supported, we must stop device before migration

This is supported by core guest OS code,
using patches similar to those you posted.


B. support for device replacement
This is a faster form of hotplug, where device is removed and
later another device using same driver is inserted in the same slot.

This is a possible optimization, but I am convinced
(A) should be implemented independently of (B).




> I've been poking around in the kernel and QEMU code and the part I
> have been trying to sort out is how to get QEMU based pci-bridge to
> use the SHPC driver because from what I can tell the driver never
> actually gets loaded on the device as it is left in the control of
> ACPI hot-plug.
> 
> - Alex

There are ways, but you can just use pci express, it's easier.

-- 
MST


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-27 Thread Alexander Duyck
On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin  wrote:
> On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
>> The PCI hot-plug specification calls out that the OS can optionally
>> implement a "pause" mechanism which is meant to be used for high
>> availability type environments.  What I am proposing is basically
>> extending the standard SHPC capable PCI bridge so that we can support
>> the DMA page dirtying for everything hosted on it, add a vendor
>> specific block to the config space so that the guest can notify the
>> host that it will do page dirtying, and add a mechanism to indicate
>> that all hot-plug events during the warm-up phase of the migration are
>> pause events instead of full removals.
>
> Two comments:
>
> 1. A vendor specific capability will always be problematic.
> Better to register a capability id with pci sig.
>
> 2. There are actually several capabilities:
>
> A. support for memory dirtying
> if not supported, we must stop device before migration
>
> This is supported by core guest OS code,
> using patches similar to those you posted.
>
>
> B. support for device replacement
> This is a faster form of hotplug, where device is removed and
> later another device using same driver is inserted in the same slot.
>
> This is a possible optimization, but I am convinced
> (A) should be implemented independently of (B).
>

My thought on this was that we don't need much to really implement
either feature.  Really only a bit or two for either one.  I had
thought about extending the PCI Advanced Features, but for now it
might make more sense to just implement it as a vendor capability for
the QEMU based bridges instead of trying to make this a true PCI
capability since I am not sure if this in any way would apply to
physical hardware.  The fact is the PCI Advanced Features capability
is essentially just a vendor specific capability with a different ID
so if we were to use 2 bits that are currently reserved in the
capability we could later merge the functionality without much
overhead.

I fully agree that the two implementations should be separate but
nothing says we have to implement them completely different.  If we
are just using 3 bits for capability, status, and control of each
feature there is no reason for them to need to be stored in separate
locations.
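
For the sake of discussion, the register block being talked about could be
as simple as the sketch below. The layout and bit assignments are invented
here; nothing like this is defined by the PCI-SIG or implemented in QEMU.

#include <linux/types.h>

/* Illustrative bit assignments only. */
#define MIG_CAP_DIRTY_TRACK  0x01  /* host can track DMA page dirtying   */
#define MIG_CAP_PAUSE        0x02  /* host supports "pause" hot-plug     */
#define MIG_STA_MIGRATING    0x01  /* host: migration warm-up has begun  */
#define MIG_CTL_DIRTY_TRACK  0x01  /* guest: dirty tracking is enabled   */
#define MIG_CTL_PAUSE_OK     0x02  /* guest: OS implements "pause"       */

struct mig_vendor_cap {
    u8 cap_id;       /* PCI_CAP_ID_VNDR (0x09)               */
    u8 cap_next;
    u8 cap_len;
    u8 capability;   /* MIG_CAP_* bits, read-only            */
    u8 status;       /* MIG_STA_* bits, written by the host  */
    u8 control;      /* MIG_CTL_* bits, written by the guest */
    u8 reserved[2];
} __packed;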

>> I've been poking around in the kernel and QEMU code and the part I
>> have been trying to sort out is how to get QEMU based pci-bridge to
>> use the SHPC driver because from what I can tell the driver never
>> actually gets loaded on the device as it is left in the control of
>> ACPI hot-plug.
>
> There are ways, but you can just use pci express, it's easier.

That's true.  I should probably just give up on trying to do an
implementation that works with the i440fx implementation.  I could
probably move over to the q35 and once that is done then we could look
at something like the PCI Advanced Features solution for something
like the PCI-bridge drivers.

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-27 Thread Alexander Duyck
On Sun, Dec 27, 2015 at 7:20 PM, Dong, Eddie  wrote:
>> >
>> > Even if the device driver doesn't support migration, you still want to
>> > migrate the VM? That may be risky, and we should add the "bad path" for the
>> > driver at least.
>>
>> At a minimum we should have support for hot-plug if we are expecting to
>> support migration.  You would simply have to hot-plug the device before you
>> start migration and then return it after.  That is how the current bonding
>> approach for this works if I am not mistaken.
>
> Hotplug is good to eliminate the device-specific state clone, but the bonding
> approach is very network specific; it doesn't work for other devices such as
> FPGA devices, QAT devices, and GPU devices, which we plan to support gradually :)

Hotplug would be usable for that assuming the guest supports the
optional "pause" implementation as called out in the PCI hotplug spec.
With that the device can maintain state for some period of time after
the hotplug remove event has occurred.

The problem is that you have to get the device to quiesce at some
point as you cannot complete the migration with the device still
active.  The way you were doing it was using the per-device
configuration space mechanism.  That doesn't scale when you have to
implement it for each and every driver for each and every OS you have
to support.  Using the "pause" implementation for hot-plug would have
a much greater likelihood of scaling as you could either take the fast
path approach of "pausing" the device to resume it when migration has
completed, or you could just remove the device and restart the driver
on the other side if the pause support is not yet implemented.  You
would lose the state under such a migration but it is much more
practical than having to implement a per device solution.
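
A sketch of the two paths described above, with an invented helper name
for the pause case and locking (pci_lock_rescan_remove) omitted:

#include <linux/pci.h>

/* Hypothetical prototype for the "pause" fast path discussed above. */
int pause_assigned_function(struct pci_dev *pdev);

static void handle_migration_hotplug_event(struct pci_dev *pdev,
                                           bool guest_supports_pause)
{
    if (guest_supports_pause) {
        /* Fast path: quiesce the device but keep the driver bound so
         * it can be resumed once migration completes. */
        pause_assigned_function(pdev);
    } else {
        /* Slow path: full hot-unplug; device state is lost and the
         * driver restarts on the other side. */
        pci_stop_and_remove_bus_device(pdev);
    }
}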

- Alex


RE: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-27 Thread Dong, Eddie
> >
> > Even if the device driver doesn't support migration, you still want to
> > migrate the VM? That may be risky, and we should add the "bad path" for the
> > driver at least.
> 
> At a minimum we should have support for hot-plug if we are expecting to
> support migration.  You would simply have to hot-plug the device before you
> start migration and then return it after.  That is how the current bonding
> approach for this works if I am not mistaken.

Hotplug is good to eliminate the device-specific state clone, but the bonding
approach is very network specific; it doesn't work for other devices such as
FPGA devices, QAT devices, and GPU devices, which we plan to support gradually :)

> 
> The advantage we are looking to gain is to avoid removing/disabling the
> device for as long as possible.  Ideally we want to keep the device active
> through the warm-up period, but if the guest doesn't do that we should still
> be able to fall back on the older approaches if needed.
> 

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-25 Thread Michael S. Tsirkin
On Fri, Dec 25, 2015 at 03:03:47PM +0800, Lan Tianyu wrote:
> Merry Christmas.
> Sorry for the late response due to a personal affair.
> 
> On 2015年12月14日 03:30, Alexander Duyck wrote:
> >> > These sounds we need to add a faked bridge for migration and adding a
> >> > driver in the guest for it. It also needs to extend PCI bus/hotplug
> >> > driver to do pause/resume other devices, right?
> >> >
> My concern is still whether we can change the PCI bus/hotplug like that
> without a spec change.
> >> >
> >> > IRQ should be general for any devices and we may extend it for
> >> > migration. Device driver also can make decision to support migration
> >> > or not.
> > The device should have no say in the matter.  Either we are going to
> > migrate or we will not.  This is why I have suggested my approach as
> > it allows for the least amount of driver intrusion while providing the
> > maximum number of ways to still perform migration even if the device
> > doesn't support it.
> 
> Even if the device driver doesn't support migration, you still want to
> migrate the VM? That may be risky, and we should add the "bad path" for the
> driver at least.
> 
> > 
> > The solution I have proposed is simple:
> > 
> > 1.  Extend swiotlb to allow for a page dirtying functionality.
> > 
> >  This part is pretty straight forward.  I'll submit a few patches
> > later today as RFC that can provide the minimal functionality needed
> > for this.
> 
> That would be very much appreciated.
> 
> > 
> > 2.  Provide a vendor specific configuration space option on the QEMU
> > implementation of a PCI bridge to act as a bridge between direct
> > assigned devices and the host bridge.
> > 
> >  My thought was to add some vendor specific block that includes a
> > capabilities, status, and control register so you could go through and
> > synchronize things like the DMA page dirtying feature.  The bridge
> > itself could manage the migration capable bit inside QEMU for all
> > devices assigned to it.  So if you added a VF to the bridge it would
> > flag that you can support migration in QEMU, while the bridge would
> > indicate you cannot until the DMA page dirtying control bit is set by
> > the guest.
> > 
> >  We could also go through and optimize the DMA page dirtying after
> > this is added so that we can narrow down the scope of use, and as a
> > result improve the performance for other devices that don't need to
> > support migration.  It would then be a matter of adding an interrupt
> > in the device to handle an event such as the DMA page dirtying status
> > bit being set in the config space status register, while the bit is
> > not set in the control register.  If it doesn't get set then we would
> > have to evict the devices before the warm-up phase of the migration,
> > otherwise we can defer it until the end of the warm-up phase.
> > 
> > 3.  Extend existing shpc driver to support the optional "pause"
> > functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> > hot-plug specification.
> 
> Since your solution has added a fake PCI bridge, why not notify the
> bridge directly during migration via an irq and call the device driver's
> callback in the new bridge driver?
> 
> Otherwise, the new bridge driver also can check whether the device
> driver provides migration callbacks and, if so, call them to improve the
> passthrough device's performance during migration.

As long as you keep up this vague talk about performance during
migration, without even bothering with any measurements, this patchset
will keep going nowhere.




There's Alex's patch that tracks memory changes during migration.  It
needs some simple enhancements to be useful in production (e.g. add a
host/guest handshake to both enable tracking in guest and to detect the
support in host), then it can allow starting migration with an assigned
device, by invoking hot-unplug after most of memory has been migrated.

Please implement this in qemu and measure the speed.
I will not be surprised if destroying/creating netdev in linux
turns out to take too long, but before anyone bothered
checking, it does not make sense to discuss further enhancements.



> > 
> >  Note I call out "extend" here instead of saying to add this.
> > Basically what we should do is provide a means of quiescing the device
> > without unloading the driver.  This is called out as something the OS
> > vendor can optionally implement in the PCI hot-plug specification.  On
> > OSes that wouldn't support this it would just be treated as a standard
> > hot-plug event.   We could add a capability, status, and control bit
> > in the vendor specific configuration block for this as well and if we
> > set the status bit would indicate the host wants to pause instead of
> > remove and the control bit would indicate the guest supports "pause"
> > in the OS.  We then could optionally disable guest migration while the
> > VF is present and pause is not supported.
> > 
> >  To support this we would need to add a timer and if a new device is
> > not inserted in some period of time (60 seconds for example), or if a
> > different device is inserted, we need to unload the original driver
> > from the device.

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-25 Thread Alexander Duyck
On Thu, Dec 24, 2015 at 11:03 PM, Lan Tianyu  wrote:
> Merry Christmas.
> Sorry for the late response due to a personal affair.
>
> On 2015年12月14日 03:30, Alexander Duyck wrote:
>>> > These sounds we need to add a faked bridge for migration and adding a
>>> > driver in the guest for it. It also needs to extend PCI bus/hotplug
>>> > driver to do pause/resume other devices, right?
>>> >
>>> > My concern is still whether we can change the PCI bus/hotplug like that
>>> > without a spec change.
>>> >
>>> > IRQ should be general for any devices and we may extend it for
>>> > migration. Device driver also can make decision to support migration
>>> > or not.
>> The device should have no say in the matter.  Either we are going to
>> migrate or we will not.  This is why I have suggested my approach as
>> it allows for the least amount of driver intrusion while providing the
>> maximum number of ways to still perform migration even if the device
>> doesn't support it.
>
> Even if the device driver doesn't support migration, you still want to
> migrate the VM? That may be risky, and we should add the "bad path" for the
> driver at least.

At a minimum we should have support for hot-plug if we are expecting
to support migration.  You would simply have to hot-plug the device
before you start migration and then return it after.  That is how the
current bonding approach for this works if I am not mistaken.

The advantage we are looking to gain is to avoid removing/disabling
the device for as long as possible.  Ideally we want to keep the
device active through the warm-up period, but if the guest doesn't do
that we should still be able to fall back on the older approaches if
needed.

>>
>> The solution I have proposed is simple:
>>
>> 1.  Extend swiotlb to allow for a page dirtying functionality.
>>
>>  This part is pretty straight forward.  I'll submit a few patches
>> later today as RFC that can provide the minimal functionality needed
>> for this.
>
> That would be very much appreciated.
>
>>
>> 2.  Provide a vendor specific configuration space option on the QEMU
>> implementation of a PCI bridge to act as a bridge between direct
>> assigned devices and the host bridge.
>>
>>  My thought was to add some vendor specific block that includes a
>> capabilities, status, and control register so you could go through and
>> synchronize things like the DMA page dirtying feature.  The bridge
>> itself could manage the migration capable bit inside QEMU for all
>> devices assigned to it.  So if you added a VF to the bridge it would
>> flag that you can support migration in QEMU, while the bridge would
>> indicate you cannot until the DMA page dirtying control bit is set by
>> the guest.
>>
>>  We could also go through and optimize the DMA page dirtying after
>> this is added so that we can narrow down the scope of use, and as a
>> result improve the performance for other devices that don't need to
>> support migration.  It would then be a matter of adding an interrupt
>> in the device to handle an event such as the DMA page dirtying status
>> bit being set in the config space status register, while the bit is
>> not set in the control register.  If it doesn't get set then we would
>> have to evict the devices before the warm-up phase of the migration,
>> otherwise we can defer it until the end of the warm-up phase.
>>
>> 3.  Extend existing shpc driver to support the optional "pause"
>> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
>> hot-plug specification.
>
> Since your solution has added a fake PCI bridge, why not notify the
> bridge directly during migration via an irq and call the device driver's
> callback in the new bridge driver?
>
> Otherwise, the new bridge driver also can check whether the device
> driver provides migration callbacks and, if so, call them to improve the
> passthrough device's performance during migration.

This is basically what I had in mind.  Though I would take things one
step further.  You don't need to add any new call-backs if you make
use of the existing suspend/resume logic.  For a VF this does exactly
what you would need since the VFs don't support wake on LAN so it will
simply clear the bus master enable and put the netdev in a suspended
state until the resume can be called.
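
A sketch of what reusing the existing suspend/resume logic could look like
from a hypothetical bridge/hot-plug pause handler. Calling the PM
callbacks directly like this bypasses the PM core and is for illustration
only; pause_assigned_function()/resume_assigned_function() are not
existing kernel APIs.

#include <linux/pci.h>
#include <linux/pm.h>

static int pause_assigned_function(struct pci_dev *pdev)
{
    const struct dev_pm_ops *pm =
        pdev->dev.driver ? pdev->dev.driver->pm : NULL;

    if (!pm || !pm->suspend)
        return -EOPNOTSUPP;  /* fall back to a full hot-unplug */

    /* Same effect as a system suspend: stop DMA, mask IRQs, keep state. */
    return pm->suspend(&pdev->dev);
}

static int resume_assigned_function(struct pci_dev *pdev)
{
    const struct dev_pm_ops *pm =
        pdev->dev.driver ? pdev->dev.driver->pm : NULL;

    if (!pm || !pm->resume)
        return -EOPNOTSUPP;

    return pm->resume(&pdev->dev);
}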

The PCI hot-plug specification calls out that the OS can optionally
implement a "pause" mechanism which is meant to be used for high
availability type environments.  What I am proposing is basically
extending the standard SHPC capable PCI bridge so that we can support
the DMA page dirtying for everything hosted on it, add a vendor
specific block to the config space so that the guest can notify the
host that it will do page dirtying, and add a mechanism to indicate
that all hot-plug events during the warm-up phase of the migration are
pause events instead of full removals.

I've been poking around in the kernel and QEMU code and the part I
have been trying to sort out is how to get QEMU based pci-bridge to
use the SHPC driver because from what I can tell the driver never
actually gets loaded on the device as it is left in the control of
ACPI hot-plug.

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-24 Thread Lan Tianyu
Merry Christmas.
Sorry for the late response due to a personal affair.

On 2015年12月14日 03:30, Alexander Duyck wrote:
>> > These sounds we need to add a faked bridge for migration and adding a
>> > driver in the guest for it. It also needs to extend PCI bus/hotplug
>> > driver to do pause/resume other devices, right?
>> >
>> > My concern is still whether we can change the PCI bus/hotplug like that
>> > without a spec change.
>> >
>> > IRQ should be general for any devices and we may extend it for
>> > migration. Device driver also can make decision to support migration
>> > or not.
> The device should have no say in the matter.  Either we are going to
> migrate or we will not.  This is why I have suggested my approach as
> it allows for the least amount of driver intrusion while providing the
> maximum number of ways to still perform migration even if the device
> doesn't support it.

Even if the device driver doesn't support migration, you still want to
migrate the VM? That may be risky, and we should add the "bad path" for the
driver at least.

> 
> The solution I have proposed is simple:
> 
> 1.  Extend swiotlb to allow for a page dirtying functionality.
> 
>  This part is pretty straight forward.  I'll submit a few patches
> later today as RFC that can provide the minimal functionality needed
> for this.

That would be very much appreciated.
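
For readers following along, the core of the swiotlb idea is tiny: touch
each page the device has written so that the hypervisor's dirty logging
sees it as changed. A minimal sketch is below; the helper name and hook
point (the unmap/sync-for-cpu path for DMA_FROM_DEVICE buffers) are
assumptions, and the actual RFC differs in detail.

#include <linux/compiler.h>
#include <linux/mm.h>

/* Dirty every guest page covered by a DMA buffer the device has written. */
static void dma_mark_pages_dirty(void *addr, size_t size)
{
    char *p = addr;
    char *end = p + size;

    while (p < end) {
        /* A one-byte read-modify-write is enough for EPT/dirty-log
         * based tracking to flag the whole page as dirty. */
        WRITE_ONCE(*p, READ_ONCE(*p));
        p = (char *)(((unsigned long)p & PAGE_MASK) + PAGE_SIZE);
    }
}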

> 
> 2.  Provide a vendor specific configuration space option on the QEMU
> implementation of a PCI bridge to act as a bridge between direct
> assigned devices and the host bridge.
> 
>  My thought was to add some vendor specific block that includes a
> capabilities, status, and control register so you could go through and
> synchronize things like the DMA page dirtying feature.  The bridge
> itself could manage the migration capable bit inside QEMU for all
> devices assigned to it.  So if you added a VF to the bridge it would
> flag that you can support migration in QEMU, while the bridge would
> indicate you cannot until the DMA page dirtying control bit is set by
> the guest.
> 
>  We could also go through and optimize the DMA page dirtying after
> this is added so that we can narrow down the scope of use, and as a
> result improve the performance for other devices that don't need to
> support migration.  It would then be a matter of adding an interrupt
> in the device to handle an event such as the DMA page dirtying status
> bit being set in the config space status register, while the bit is
> not set in the control register.  If it doesn't get set then we would
> have to evict the devices before the warm-up phase of the migration,
> otherwise we can defer it until the end of the warm-up phase.
> 
> 3.  Extend existing shpc driver to support the optional "pause"
> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> hot-plug specification.

Since your solution has added a fake PCI bridge, why not notify the
bridge directly during migration via an irq and call the device driver's
callback in the new bridge driver?

Otherwise, the new bridge driver also can check whether the device
driver provides migration callbacks and, if so, call them to improve the
passthrough device's performance during migration.

> 
>  Note I call out "extend" here instead of saying to add this.
> Basically what we should do is provide a means of quiescing the device
> without unloading the driver.  This is called out as something the OS
> vendor can optionally implement in the PCI hot-plug specification.  On
> OSes that wouldn't support this it would just be treated as a standard
> hot-plug event.   We could add a capability, status, and control bit
> in the vendor specific configuration block for this as well and if we
> set the status bit would indicate the host wants to pause instead of
> remove and the control bit would indicate the guest supports "pause"
> in the OS.  We then could optionally disable guest migration while the
> VF is present and pause is not supported.
> 
>  To support this we would need to add a timer and if a new device
> is not inserted in some period of time (60 seconds for example), or if
> a different device is inserted,
> we need to unload the original driver
> from the device.  In addition we would need to verify if drivers can
> call the remove function after having called suspend without resume.
> If not, we could look at adding a recovery function to remove the
> driver from the device in the case of a suspend with either a failed
> resume or no resume call.  Once again it would probably be useful to
> have for those cases where power management suspend/resume runs into
> an issue like somebody causing a surprise removal while a device was
> suspended.


-- 
Best regards
Tianyu Lan


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-14 Thread Michael S. Tsirkin
On Sun, Dec 13, 2015 at 11:47:44PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/11/2015 1:16 AM, Alexander Duyck wrote:
> >On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu  wrote:
> >>
> >>
> >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> 
> Ideally, it is able to leave the guest driver unmodified but it requires the
> >hypervisor or qemu to be aware of the device which means we may need a driver in
> >hypervisor or qemu to handle the device on behalf of the guest driver.
> >>>
> >>>Can you answer the question of when do you use your code -
> >>> at the start of migration or
> >>> just before the end?
> >>
> >>
> >>Just before stopping VCPU in this version and inject VF mailbox irq to
> >>notify the driver if the irq handler is installed.
> >>Qemu side also will check this via the faked PCI migration capability
> >>and driver will set the status during device open() or resume() callback.
> >
> >The VF mailbox interrupt is a very bad idea.  Really the device should
> >be in a reset state on the other side of a migration.  It doesn't make
> >sense to have the interrupt firing if the device is not configured.
> >This is one of the things that is preventing you from being able to
> >migrate the device while the interface is administratively down or the
> >VF driver is not loaded.
> 
> In my opinion, if the VF driver is not loaded and the hardware hasn't started
> to work, the device state doesn't need to be migrated.
> 
> We may add a flag for the driver to check whether migration happened during its
> down time, and reinitialize the hardware and clear the flag when the system
> tries to put it up.
> 
> We may add a migration core in the Linux kernel and provide some helper
> functions to facilitate adding migration support for drivers.
> Migration core is in charge of syncing status with Qemu.
> 
> Example.
> migration_register()
> Driver provides
> - Callbacks to be called before and after migration or for bad path
> - Its irq which it prefers to deal with migration event.
> 
> migration_event_check()
> Driver calls it in the irq handler. Migration core code will check
> migration status and call its callbacks when migration happens.
> 
> 
> >
> >My thought on all this is that it might make sense to move this
> >functionality into a PCI-to-PCI bridge device and make it a
> >requirement that all direct-assigned devices have to exist behind that
> >device in order to support migration.  That way you would be working
> >with a directly emulated device that would likely already be
> >supporting hot-plug anyway.  Then it would just be a matter of coming
> >up with a few Qemu specific extensions that you would need to add to
> >the device itself.  The same approach would likely be portable enough
> >that you could achieve it with PCIe as well via the same configuration
> >space being present on the upstream side of a PCIe port or maybe a
> >PCIe switch of some sort.
> >
> >It would then be possible to signal via your vendor-specific PCI
> >capability on that device that all devices behind this bridge require
> >DMA page dirtying, you could use the configuration in addition to the
> >interrupt already provided for hot-plug to signal things like when you
> >are starting migration, and possibly even just extend the shpc
> >functionality so that if this capability is present you have the
> >option to pause/resume instead of remove/probe the device in the case
> >of certain hot-plug events.  The fact is there may be some use for a
> >pause/resume type approach for PCIe hot-plug in the near future
> >anyway.  From the sounds of it Apple has required it for all
> >Thunderbolt device drivers so that they can halt the device in order
> >to shuffle resources around, perhaps we should look at something
> >similar for Linux.
> >
> >The other advantage behind grouping functions on one bridge is things
> >like reset domains.  The PCI error handling logic will want to be able
> >to reset any devices that experienced an error in the event of
> >something such as a surprise removal.  By grouping all of the devices
> >you could disable/reset/enable them as one logical group in the event
> >of something such as the "bad path" approach Michael has mentioned.
> >
> 
> These sounds we need to add a faked bridge for migration and adding a
> driver in the guest for it. It also needs to extend PCI bus/hotplug
> driver to do pause/resume other devices, right?
> 
> My concern is still whether we can change the PCI bus/hotplug like that
> without a spec change.
> 
> IRQ should be general for any devices and we may extend it for
> migration. Device driver also can make decision to support migration
> or not.

A dedicated IRQ per device for something that is a system wide event
sounds like a waste.  I don't understand why a spec change is strictly
required, we only need to support this with the specific virtual bridge
used by QEMU, so I think that a vendor specific capability will do.
Once this works well in the field, a PCI spec ECN might make sense
to standardise the capability.

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-14 Thread Michael S. Tsirkin
On Fri, Dec 11, 2015 at 03:32:04PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:
> >On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
> >>
> >>
> >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> Ideally, it is able to leave the guest driver unmodified but it requires the
> >hypervisor or qemu to be aware of the device which means we may need a driver in
> >hypervisor or qemu to handle the device on behalf of the guest driver.
> >>>Can you answer the question of when do you use your code -
> >>>at the start of migration or
> >>>just before the end?
> >>
> >>Just before stopping VCPU in this version and inject VF mailbox irq to
> >>notify the driver if the irq handler is installed.
> >>Qemu side also will check this via the faked PCI migration capability
> >>and driver will set the status during device open() or resume() callback.
> >
> >Right, this is the "good path" optimization. Whether this buys anything
> >as compared to just sending reset to the device when VCPU is stopped
> >needs to be measured. In any case, we probably do need a way to
> >interrupt driver on destination to make it reconfigure the device -
> >otherwise it might take seconds for it to notice.  And a way to make
> >sure driver can handle this surprise reset so we can block migration if
> >it can't.
> >
> 
> Yes, we need such a way to notify driver about migration status and do
> reset or restore operation on the destination machine. My original
> design is to take advantage of device's irq to do that. Driver can tell
> Qemu that which irq it prefers to handle such task and whether the irq
> is enabled or bound with handler. We may discuss the detail in the other
> thread.
> 
> >>>
> >>>It would be great if we could avoid changing the guest; but at least 
> >>>your guest
> >>>driver changes don't actually seem to be that hardware specific; could 
> >>>your
> >>>changes actually be moved to generic PCI level so they could be made
> >>>to work for lots of drivers?
> >
> >It is impossible to use one common solution for all devices unless the PCIe
> >spec documents it clearly, and I think one day it will be there. But before
> >that, we need some workarounds in the guest driver to make it work even if it
> >looks ugly.
> >>
> >>Yes, so far there is no hardware migration support
> >
> >VT-D supports setting dirty bit in the PTE in hardware.
> 
> Actually, this isn't supported in current hardware.
> The VT-d spec documents the dirty bit for first-level translation, which
> requires devices to support DMA requests with PASID (process
> address space identifier). Most devices don't support that feature.

True, I missed this.  It's generally unfortunate that first level
translation only applies to requests with PASID.  All other features
limited to requests with PASID like nested translation would be very
useful for all requests, not just requests with PASID.


> >
> >>and it's hard to modify
> >>bus level code.
> >
> >Why is it hard?
> 
> As Yang said, the concern is that the PCI spec doesn't document how to do
> migration.

We can submit a PCI spec ECN documenting a new capability.

I think for existing devices which lack it, adding this capability to
the bridge to which the device is attached is preferable to trying to
add it to the device itself.

> >
> >>It also will block implementation on Windows.
> >
> >Implementation of what?  We are discussing motivation here, not
> >implementation.  E.g. windows drivers typically support surprise
> >removal, should you use that, you get some working code for free.  Just
> >stop worrying about it.  Make it work, worry about closed source
> >software later.
> >
> >>>Dave
> >>>


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-13 Thread Lan, Tianyu



On 12/11/2015 1:16 AM, Alexander Duyck wrote:

On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu  wrote:



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:


Ideally, it is able to leave the guest driver unmodified but it requires the
hypervisor or qemu to be aware of the device, which means we may need a driver in
the hypervisor or qemu to handle the device on behalf of the guest driver.


Can you answer the question of when do you use your code -
 at the start of migration or
 just before the end?



Just before stopping VCPU in this version and inject VF mailbox irq to
notify the driver if the irq handler is installed.
Qemu side also will check this via the faked PCI migration capability
and driver will set the status during device open() or resume() callback.


The VF mailbox interrupt is a very bad idea.  Really the device should
be in a reset state on the other side of a migration.  It doesn't make
sense to have the interrupt firing if the device is not configured.
This is one of the things that is preventing you from being able to
migrate the device while the interface is administratively down or the
VF driver is not loaded.


In my opinion, if the VF driver is not loaded and the hardware hasn't started
to work, the device state doesn't need to be migrated.

We may add a flag for the driver to check whether migration happened during
its down time, and reinitialize the hardware and clear the flag when the
system tries to put it up.


We may add a migration core in the Linux kernel and provide some helper
functions to facilitate adding migration support for drivers.

Migration core is in charge of syncing status with Qemu.

Example.
migration_register()
Driver provides
- Callbacks to be called before and after migration or for bad path
- Its irq which it prefers to deal with migration event.

migration_event_check()
Driver calls it in the irq handler. Migration core code will check
migration status and call its callbacks when migration happens.
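
To make the proposed interface concrete, a sketch of what such a
registration API might look like is below. All of the names are
hypothetical; nothing like this exists in the kernel today.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical migration-core interface, as proposed above. */
struct migration_client {
    /* Called before the final stop-and-copy phase on the source. */
    int (*pre_migrate)(struct migration_client *mc);
    /* Called on the destination once the VM is running again. */
    int (*post_migrate)(struct migration_client *mc);
    /* Called if migration is aborted ("bad path"). */
    void (*abort)(struct migration_client *mc);
    int irq;                 /* IRQ the driver wants migration events on */
    struct list_head node;
};

static LIST_HEAD(migration_clients);
static DEFINE_SPINLOCK(migration_lock);

int migration_register(struct migration_client *mc)
{
    spin_lock(&migration_lock);
    list_add_tail(&mc->node, &migration_clients);
    spin_unlock(&migration_lock);
    return 0;
}

/*
 * Called by the driver from its irq handler; the core would check the
 * migration status word shared with Qemu and invoke the registered
 * callbacks as appropriate.
 */
void migration_event_check(struct migration_client *mc)
{
    /* if (qemu_says_migrating()) mc->pre_migrate(mc); ... */
}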




My thought on all this is that it might make sense to move this
functionality into a PCI-to-PCI bridge device and make it a
requirement that all direct-assigned devices have to exist behind that
device in order to support migration.  That way you would be working
with a directly emulated device that would likely already be
supporting hot-plug anyway.  Then it would just be a matter of coming
up with a few Qemu specific extensions that you would need to add to
the device itself.  The same approach would likely be portable enough
that you could achieve it with PCIe as well via the same configuration
space being present on the upstream side of a PCIe port or maybe a
PCIe switch of some sort.

It would then be possible to signal via your vendor-specific PCI
capability on that device that all devices behind this bridge require
DMA page dirtying, you could use the configuration in addition to the
interrupt already provided for hot-plug to signal things like when you
are starting migration, and possibly even just extend the shpc
functionality so that if this capability is present you have the
option to pause/resume instead of remove/probe the device in the case
of certain hot-plug events.  The fact is there may be some use for a
pause/resume type approach for PCIe hot-plug in the near future
anyway.  From the sounds of it Apple has required it for all
Thunderbolt device drivers so that they can halt the device in order
to shuffle resources around, perhaps we should look at something
similar for Linux.

The other advantage behind grouping functions on one bridge is things
like reset domains.  The PCI error handling logic will want to be able
to reset any devices that experienced an error in the event of
something such as a surprise removal.  By grouping all of the devices
you could disable/reset/enable them as one logical group in the event
of something such as the "bad path" approach Michael has mentioned.



This sounds like we would need to add a faked bridge for migration and a
driver for it in the guest. It would also mean extending the PCI
bus/hotplug driver to pause/resume the other devices, right?

My concern is still whether we can change the PCI bus/hotplug code like
that without a spec change.

An IRQ is generic to any device, and we could extend it for migration.
The device driver can also decide whether or not to support migration.






It would be great if we could avoid changing the guest; but at least
your guest driver changes don't actually seem to be that hardware
specific; could your changes actually be moved to generic PCI level so
they could be made to work for lots of drivers?




It is impossible to use one common solution for all devices unless the
PCIe spec documents it clearly, and I think one day it will be there. But
before that, we need some workarounds in the guest driver to make it
work, even if it looks ugly.



Yes, so far there is no hardware migration support, and it's hard to
modify bus-level code. It would also block an implementation on Windows.


Please don't 

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-13 Thread Alexander Duyck
On Sun, Dec 13, 2015 at 7:47 AM, Lan, Tianyu  wrote:
>
>
> On 12/11/2015 1:16 AM, Alexander Duyck wrote:
>>
>> On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu  wrote:
>>>
>>>
>>>
>>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>
>
> Ideally, it is able to leave guest driver unmodified but it requires
> the
>>
>> hypervisor or qemu to aware the device which means we may need a
>> driver
>> in
>> hypervisor or qemu to handle the device on behalf of guest driver.


 Can you answer the question of when do you use your code -
  at the start of migration or
  just before the end?
>>>
>>>
>>>
>>> Just before stopping VCPU in this version and inject VF mailbox irq to
>>> notify the driver if the irq handler is installed.
>>> Qemu side also will check this via the faked PCI migration capability
>>> and driver will set the status during device open() or resume() callback.
>>
>>
>> The VF mailbox interrupt is a very bad idea.  Really the device should
>> be in a reset state on the other side of a migration.  It doesn't make
>> sense to have the interrupt firing if the device is not configured.
>> This is one of the things that is preventing you from being able to
>> migrate the device while the interface is administratively down or the
>> VF driver is not loaded.
>
>
> From my opinion, if VF driver is not loaded and hardware doesn't start
> to work, the device state doesn't need to be migrated.
>
> We may add a flag for driver to check whether migration happened during it's
> down and reinitialize the hardware and clear the flag when system try to put
> it up.
>
> We may add migration core in the Linux kernel and provide some helps
> functions to facilitate to add migration support for drivers.
> Migration core is in charge to sync status with Qemu.
>
> Example.
> migration_register()
> Driver provides
> - Callbacks to be called before and after migration or for bad path
> - Its irq which it prefers to deal with migration event.

You would be better off just using function pointers in the pci_driver
struct and let the PCI driver registration take care of all that.
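
For what it's worth, the AER error-recovery path is an existing example of
events being delivered through callbacks supplied at pci_driver
registration time, so a migration notification could plausibly ride the
same mechanism. Below is a minimal sketch that uses only fields which
exist today; the driver itself is fictitious, and no migration-specific
hook exists upstream.

#include <linux/module.h>
#include <linux/pci.h>

/* Fictitious VF driver, used purely to illustrate the registration pattern. */
static const struct pci_device_id demo_vf_ids[] = {
	{ PCI_DEVICE(0x8086, 0x10ed) },	/* example: an 82599 VF */
	{ }
};

static int demo_vf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	return pci_enable_device(pdev);
}

static void demo_vf_remove(struct pci_dev *pdev)
{
	pci_disable_device(pdev);
}

/* The error-recovery core already calls back into the driver like this. */
static pci_ers_result_t demo_vf_error_detected(struct pci_dev *pdev,
					       pci_channel_state_t state)
{
	/* stop I/O and ask the core for a reset */
	return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t demo_vf_slot_reset(struct pci_dev *pdev)
{
	/* the device came back from reset: reinitialize it from scratch */
	return PCI_ERS_RESULT_RECOVERED;
}

static const struct pci_error_handlers demo_vf_err_handler = {
	.error_detected	= demo_vf_error_detected,
	.slot_reset	= demo_vf_slot_reset,
};

static struct pci_driver demo_vf_driver = {
	.name		= "demo_vf",
	.id_table	= demo_vf_ids,
	.probe		= demo_vf_probe,
	.remove		= demo_vf_remove,
	.err_handler	= &demo_vf_err_handler,
	/* migration hooks, if ever added, could be wired up the same way */
};

module_pci_driver(demo_vf_driver);
MODULE_LICENSE("GPL");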

> migration_event_check()
> Driver calls it in the irq handler. Migration core code will check
> migration status and call its callbacks when migration happens.

No, this is still a bad idea.  You haven't addressed what you do when
the device has had interrupts disabled such as being in the down
state.

This is the biggest issue I see with your whole patch set.  It
requires the driver to contain certain changes and to be in a certain
state.  You cannot put those expectations on the guest.  You really
need to try to move as much of this out to existing functionality as
possible.

>>
>> My thought on all this is that it might make sense to move this
>> functionality into a PCI-to-PCI bridge device and make it a
>> requirement that all direct-assigned devices have to exist behind that
>> device in order to support migration.  That way you would be working
>> with a directly emulated device that would likely already be
>> supporting hot-plug anyway.  Then it would just be a matter of coming
>> up with a few Qemu specific extensions that you would need to add to
>> the device itself.  The same approach would likely be portable enough
>> that you could achieve it with PCIe as well via the same configuration
>> space being present on the upstream side of a PCIe port or maybe a
>> PCIe switch of some sort.
>>
>> It would then be possible to signal via your vendor-specific PCI
>> capability on that device that all devices behind this bridge require
>> DMA page dirtying, you could use the configuration in addition to the
>> interrupt already provided for hot-plug to signal things like when you
>> are starting migration, and possibly even just extend the shpc
>> functionality so that if this capability is present you have the
>> option to pause/resume instead of remove/probe the device in the case
>> of certain hot-plug events.  The fact is there may be some use for a
>> pause/resume type approach for PCIe hot-plug in the near future
>> anyway.  From the sounds of it Apple has required it for all
>> Thunderbolt device drivers so that they can halt the device in order
>> to shuffle resources around, perhaps we should look at something
>> similar for Linux.
>>
>> The other advantage behind grouping functions on one bridge is things
>> like reset domains.  The PCI error handling logic will want to be able
>> to reset any devices that experienced an error in the event of
>> something such as a surprise removal.  By grouping all of the devices
>> you could disable/reset/enable them as one logical group in the event
>> of something such as the "bad path" approach Michael has mentioned.
>>
>
> These sounds we need to add a faked bridge for migration and adding a
> driver in the guest for it. It also needs to extend PCI bus/hotplug
> driver to do pause/resume 

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Alexander Duyck
On Thu, Dec 10, 2015 at 8:11 AM, Michael S. Tsirkin  wrote:
> On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>> >>Ideally, it is able to leave guest driver unmodified but it requires the
>> >>>hypervisor or qemu to aware the device which means we may need a driver in
>> >>>hypervisor or qemu to handle the device on behalf of guest driver.
>> >Can you answer the question of when do you use your code -
>> >at the start of migration or
>> >just before the end?
>>
>> Just before stopping VCPU in this version and inject VF mailbox irq to
>> notify the driver if the irq handler is installed.
>> Qemu side also will check this via the faked PCI migration capability
>> and driver will set the status during device open() or resume() callback.
>
> Right, this is the "good path" optimization. Whether this buys anything
> as compared to just sending reset to the device when VCPU is stopped
> needs to be measured. In any case, we probably do need a way to
> interrupt driver on destination to make it reconfigure the device -
> otherwise it might take seconds for it to notice.  And a way to make
> sure driver can handle this surprise reset so we can block migration if
> it can't.

The question is how do we handle the "bad path"?  From what I can tell
it seems like we would have to have the dirty page tracking for DMA
handled in the host in order to support that.  Otherwise we risk
corrupting the memory in the guest as there are going to be a few
stale pages that end up being in the guest.

The easiest way to probably flag a "bad path" migration would be to
emulate a Manually-operated Retention Latch being opened and closed on
the device.  It may even allow us to work with the desire to support a
means for doing a pause/resume as that would be a hot-plug event where
the latch was never actually opened.  Basically if the retention latch
is released and then re-closed it can be assumed that the device has
lost power and as a result been reset.  As such a normal hot-plug
controller would have to reconfigure the device in such an event.  The
key bit being that with the power being cycled on the port the
assumption is that the device has lost any existing state, and we
should emulate that as well by clearing any state Qemu might be
carrying such as the shadow of the MSI-X table.  In addition we could
also signal if the host supports the dirty page tracking via the IOMMU
so if needed the guest could trigger some sort of memory exception
handling due to the risk of memory corruption.
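
For illustration, an emulated latch event would show up in the downstream
port's Slot Status register, using the standard bit definitions from
linux/pci_regs.h. A minimal sketch of the check, where slot_status stands
for the value of that register (how it gets read is left out here):

#include <stdbool.h>
#include <stdint.h>
#include <linux/pci_regs.h>

/*
 * If an MRL sensor change was latched and the latch currently reads
 * closed, treat the slot as having been power cycled: the device behind
 * it is assumed to have lost all state and must be reconfigured just as
 * after a normal hot-add.
 */
static bool slot_was_power_cycled(uint16_t slot_status)
{
	return (slot_status & PCI_EXP_SLTSTA_MRLSC) &&	/* MRL Sensor Changed */
	      !(slot_status & PCI_EXP_SLTSTA_MRLSS);	/* MRL now closed */
}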

I would argue that we don't necessarily have to provide a means to
guarantee the driver can support a surprise removal/reset.  Worst case
scenario is that it would be equivalent to somebody pulling the plug
on an externally connected PCIe cage in a physical host.  I know the
Intel Ethernet drivers have already had to add support for surprise
removal due to the fact that such a scenario can occur on Thunderbolt
enabled platforms.  Since it is acceptable for physical hosts to have
such an event occur I think we could support the same type of failure
for direct assigned devices in guests.  That would be the one spot
where I would say it is up to the drivers to figure out how they are
going to deal with it since this is something that can occur for any
given driver on any given OS assuming it can be plugged into an
externally removable cage.
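
For reference, the usual way a driver notices that kind of surprise
removal is that MMIO reads start returning all ones once the device is
gone; a minimal sketch of the common check, with hw_addr standing in for
the driver's mapped BAR:

#include <linux/io.h>
#include <linux/types.h>

/*
 * Classic surprise-removal check: once the device has been yanked (or the
 * slot has been power cycled), MMIO reads return all ones, so drivers test
 * a register that can never legitimately read as ~0.
 */
static bool device_is_gone(void __iomem *hw_addr)
{
	u32 status = readl(hw_addr);	/* e.g. a device status register */

	return status == 0xffffffff;
}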

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Dr. David Alan Gilbert
* Yang Zhang (yang.zhang...@gmail.com) wrote:
> On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
> >* Lan, Tianyu (tianyu@intel.com) wrote:
> >>On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >>>I thought about what this is doing at the high level, and I do have some
> >>>value in what you are trying to do, but I also think we need to clarify
> >>>the motivation a bit more.  What you are saying is not really what the
> >>>patches are doing.
> >>>
> >>>And with that clearer understanding of the motivation in mind (assuming
> >>>it actually captures a real need), I would also like to suggest some
> >>>changes.
> >>
> >>Motivation:
> >>Most current solutions for migration with passthough device are based on
> >>the PCI hotplug but it has side affect and can't work for all device.
> >>
> >>For NIC device:
> >>PCI hotplug solution can work around Network device migration
> >>via switching VF and PF.
> >>
> >>But switching network interface will introduce service down time.
> >>
> >>I tested the service down time via putting VF and PV interface
> >>into a bonded interface and ping the bonded interface during plug
> >>and unplug VF.
> >>1) About 100ms when add VF
> >>2) About 30ms when del VF
> >>
> >>It also requires guest to do switch configuration. These are hard to
> >>manage and deploy from our customers. To maintain PV performance during
> >>migration, host side also needs to assign a VF to PV device. This
> >>affects scalability.
> >>
> >>These factors block SRIOV NIC passthough usage in the cloud service and
> >>OPNFV which require network high performance and stability a lot.
> >
> >Right, that I'll agree it's hard to do migration of a VM which uses
> >an SRIOV device; and while I think it should be possible to bond a virtio 
> >device
> >to a VF for networking and then hotplug the SR-IOV device I agree it's hard 
> >to manage.
> >
> >>For other kind of devices, it's hard to work.
> >>We are also adding migration support for QAT(QuickAssist Technology) device.
> >>
> >>QAT device user case introduction.
> >>Server, networking, big data, and storage applications use QuickAssist
> >>Technology to offload servers from handling compute-intensive operations,
> >>such as:
> >>1) Symmetric cryptography functions including cipher operations and
> >>authentication operations
> >>2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> >>cryptography
> >>3) Compression and decompression functions including DEFLATE and LZS
> >>
> >>PCI hotplug will not work for such devices during migration and these
> >>operations will fail when unplug device.
> >
> >I don't understand that QAT argument; if the device is purely an offload
> >engine for performance, then why can't you fall back to doing the
> >same operations in the VM or in QEMU if the card is unavailable?
> >The tricky bit is dealing with outstanding operations.
> >
> >>So we are trying implementing a new solution which really migrates
> >>device state to target machine and won't affect user during migration
> >>with low service down time.
> >
> >Right, that's a good aim - the only question is how to do it.
> >
> >It looks like this is always going to need some device-specific code;
> >the question I see is whether that's in:
> > 1) qemu
> > 2) the host kernel
> > 3) the guest kernel driver
> >
> >The objections to this series seem to be that it needs changes to (3);
> >I can see the worry that the guest kernel driver might not get a chance
> >to run during the right time in migration and it's painful having to
> >change every guest driver (although your change is small).
> >
> >My question is what stage of the migration process do you expect to tell
> >the guest kernel driver to do this?
> >
> > If you do it at the start of the migration, and quiesce the device,
> > the migration might take a long time (say 30 minutes) - are you
> > intending the device to be quiesced for this long? And where are
> > you going to send the traffic?
> > If you are, then do you need to do it via this PCI trick, or could
> > you just do it via something higher level to quiesce the device.
> >
> > Or are you intending to do it just near the end of the migration?
> > But then how do we know how long it will take the guest driver to
> > respond?
> 
> Ideally, it is able to leave guest driver unmodified but it requires the
> hypervisor or qemu to aware the device which means we may need a driver in
> hypervisor or qemu to handle the device on behalf of guest driver.

Can you answer the question of when do you use your code -
   at the start of migration or
   just before the end?

> >It would be great if we could avoid changing the guest; but at least your 
> >guest
> >driver changes don't actually seem to be that hardware specific; could your
> >changes actually be moved to generic PCI level so they could be made
> >to work for lots of drivers?
> 
> It is impossible to use one common solution for all devices 

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Dr. David Alan Gilbert
* Lan, Tianyu (tianyu@intel.com) wrote:
> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >I thought about what this is doing at the high level, and I do have some
> >value in what you are trying to do, but I also think we need to clarify
> >the motivation a bit more.  What you are saying is not really what the
> >patches are doing.
> >
> >And with that clearer understanding of the motivation in mind (assuming
> >it actually captures a real need), I would also like to suggest some
> >changes.
> 
> Motivation:
> Most current solutions for migration with passthough device are based on
> the PCI hotplug but it has side affect and can't work for all device.
> 
> For NIC device:
> PCI hotplug solution can work around Network device migration
> via switching VF and PF.
> 
> But switching network interface will introduce service down time.
> 
> I tested the service down time via putting VF and PV interface
> into a bonded interface and ping the bonded interface during plug
> and unplug VF.
> 1) About 100ms when add VF
> 2) About 30ms when del VF
> 
> It also requires guest to do switch configuration. These are hard to
> manage and deploy from our customers. To maintain PV performance during
> migration, host side also needs to assign a VF to PV device. This
> affects scalability.
> 
> These factors block SRIOV NIC passthough usage in the cloud service and
> OPNFV which require network high performance and stability a lot.

Right, that I'll agree it's hard to do migration of a VM which uses
an SRIOV device; and while I think it should be possible to bond a virtio device
to a VF for networking and then hotplug the SR-IOV device I agree it's hard to 
manage.

> For other kind of devices, it's hard to work.
> We are also adding migration support for QAT(QuickAssist Technology) device.
> 
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
> 
> PCI hotplug will not work for such devices during migration and these
> operations will fail when unplug device.

I don't understand that QAT argument; if the device is purely an offload
engine for performance, then why can't you fall back to doing the
same operations in the VM or in QEMU if the card is unavailable?
The tricky bit is dealing with outstanding operations.

> So we are trying implementing a new solution which really migrates
> device state to target machine and won't affect user during migration
> with low service down time.

Right, that's a good aim - the only question is how to do it.

It looks like this is always going to need some device-specific code;
the question I see is whether that's in:
1) qemu
2) the host kernel
3) the guest kernel driver

The objections to this series seem to be that it needs changes to (3);
I can see the worry that the guest kernel driver might not get a chance
to run during the right time in migration and it's painful having to
change every guest driver (although your change is small).

My question is what stage of the migration process do you expect to tell
the guest kernel driver to do this?

If you do it at the start of the migration, and quiesce the device,
the migration might take a long time (say 30 minutes) - are you
intending the device to be quiesced for this long? And where are
you going to send the traffic?
If you are, then do you need to do it via this PCI trick, or could
you just do it via something higher level to quiesce the device.

Or are you intending to do it just near the end of the migration?
But then how do we know how long it will take the guest driver to
respond?

It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Yang Zhang

On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:

* Lan, Tianyu (tianyu@intel.com) wrote:

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.


Motivation:
Most current solutions for migration with passthough device are based on
the PCI hotplug but it has side affect and can't work for all device.

For NIC device:
PCI hotplug solution can work around Network device migration
via switching VF and PF.

But switching network interface will introduce service down time.

I tested the service down time via putting VF and PV interface
into a bonded interface and ping the bonded interface during plug
and unplug VF.
1) About 100ms when add VF
2) About 30ms when del VF

It also requires guest to do switch configuration. These are hard to
manage and deploy from our customers. To maintain PV performance during
migration, host side also needs to assign a VF to PV device. This
affects scalability.

These factors block SRIOV NIC passthough usage in the cloud service and
OPNFV which require network high performance and stability a lot.


Right, that I'll agree it's hard to do migration of a VM which uses
an SRIOV device; and while I think it should be possible to bond a virtio device
to a VF for networking and then hotplug the SR-IOV device I agree it's hard to 
manage.


For other kind of devices, it's hard to work.
We are also adding migration support for QAT(QuickAssist Technology) device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive operations,
such as:
1) Symmetric cryptography functions including cipher operations and
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration and these
operations will fail when unplug device.


I don't understand that QAT argument; if the device is purely an offload
engine for performance, then why can't you fall back to doing the
same operations in the VM or in QEMU if the card is unavailable?
The tricky bit is dealing with outstanding operations.


So we are trying implementing a new solution which really migrates
device state to target machine and won't affect user during migration
with low service down time.


Right, that's a good aim - the only question is how to do it.

It looks like this is always going to need some device-specific code;
the question I see is whether that's in:
 1) qemu
 2) the host kernel
 3) the guest kernel driver

The objections to this series seem to be that it needs changes to (3);
I can see the worry that the guest kernel driver might not get a chance
to run during the right time in migration and it's painful having to
change every guest driver (although your change is small).

My question is what stage of the migration process do you expect to tell
the guest kernel driver to do this?

 If you do it at the start of the migration, and quiesce the device,
 the migration might take a long time (say 30 minutes) - are you
 intending the device to be quiesced for this long? And where are
 you going to send the traffic?
 If you are, then do you need to do it via this PCI trick, or could
 you just do it via something higher level to quiesce the device.

 Or are you intending to do it just near the end of the migration?
 But then how do we know how long it will take the guest driver to
 respond?


Ideally, it would be possible to leave the guest driver unmodified, but
that requires the hypervisor or qemu to be aware of the device, which
means we may need a driver in the hypervisor or qemu to handle the
device on behalf of the guest driver.




It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?


It is impossible to use one common solution for all devices unless the
PCIe spec documents it clearly, and I think one day it will be there. But
before that, we need some workarounds in the guest driver to make it
work, even if it looks ugly.



--
best regards
yang


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Alexander Duyck
On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu  wrote:
>
>
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>
>>> Ideally, it is able to leave guest driver unmodified but it requires the
>>> >hypervisor or qemu to aware the device which means we may need a driver
>>> > in
>>> >hypervisor or qemu to handle the device on behalf of guest driver.
>>
>> Can you answer the question of when do you use your code -
>> at the start of migration or
>> just before the end?
>
>
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

The VF mailbox interrupt is a very bad idea.  Really the device should
be in a reset state on the other side of a migration.  It doesn't make
sense to have the interrupt firing if the device is not configured.
This is one of the things that is preventing you from being able to
migrate the device while the interface is administratively down or the
VF driver is not loaded.

My thought on all this is that it might make sense to move this
functionality into a PCI-to-PCI bridge device and make it a
requirement that all direct-assigned devices have to exist behind that
device in order to support migration.  That way you would be working
with a directly emulated device that would likely already be
supporting hot-plug anyway.  Then it would just be a matter of coming
up with a few Qemu specific extensions that you would need to add to
the device itself.  The same approach would likely be portable enough
that you could achieve it with PCIe as well via the same configuration
space being present on the upstream side of a PCIe port or maybe a
PCIe switch of some sort.

It would then be possible to signal via your vendor-specific PCI
capability on that device that all devices behind this bridge require
DMA page dirtying, you could use the configuration in addition to the
interrupt already provided for hot-plug to signal things like when you
are starting migration, and possibly even just extend the shpc
functionality so that if this capability is present you have the
option to pause/resume instead of remove/probe the device in the case
of certain hot-plug events.  The fact is there may be some use for a
pause/resume type approach for PCIe hot-plug in the near future
anyway.  From the sounds of it Apple has required it for all
Thunderbolt device drivers so that they can halt the device in order
to shuffle resources around, perhaps we should look at something
similar for Linux.
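
As a rough illustration of what such a Qemu-specific extension might
carry, a vendor-specific capability (cap ID 0x09) on the bridge could look
something like the layout below. Only the standard three-byte capability
header is fixed by the PCI spec; every field after it is invented here
purely for the sake of the example.

#include <stdint.h>

/* Illustrative only: a made-up vendor-specific capability on the bridge. */
struct qemu_migration_cap {
	uint8_t cap_id;		/* 0x09, i.e. PCI_CAP_ID_VNDR */
	uint8_t cap_next;	/* offset of the next capability */
	uint8_t cap_len;	/* total length of this capability */
	uint8_t flags;		/* e.g. bit 0: devices behind this bridge
				 * require DMA page dirtying */
	uint8_t migration_state;	/* e.g. 0 = idle, 1 = starting,
					 * 2 = migrating, 3 = completed */
	uint8_t reserved[3];	/* pad the capability to 8 bytes */
};

Qemu could update these fields and raise the interrupt already used for
hot-plug, and a guest-side bridge driver could then read the capability to
find out what is happening.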

The other advantage behind grouping functions on one bridge is things
like reset domains.  The PCI error handling logic will want to be able
to reset any devices that experienced an error in the event of
something such as a surprise removal.  By grouping all of the devices
you could disable/reset/enable them as one logical group in the event
of something such as the "bad path" approach Michael has mentioned.

>>
 > >It would be great if we could avoid changing the guest; but at least
 > > your guest
 > >driver changes don't actually seem to be that hardware specific;
 > > could your
 > >changes actually be moved to generic PCI level so they could be made
 > >to work for lots of drivers?
>>>
>>> >
>>> >It is impossible to use one common solution for all devices unless the
>>> > PCIE
>>> >spec documents it clearly and i think one day it will be there. But
>>> > before
>>> >that, we need some workarounds on guest driver to make it work even it
>>> > looks
>>> >ugly.
>
>
> Yes, so far there is not hardware migration support and it's hard to modify
> bus level code. It also will block implementation on the Windows.

Please don't assume things.  Unless you have hard data from Microsoft
that says they want it this way lets just try to figure out what works
best for us for now and then we can start worrying about third party
implementations after we have figured out a solution that actually
works.

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Yang Zhang

On 2015/12/10 19:41, Dr. David Alan Gilbert wrote:

* Yang Zhang (yang.zhang...@gmail.com) wrote:

On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:

* Lan, Tianyu (tianyu@intel.com) wrote:

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.


Motivation:
Most current solutions for migration with passthough device are based on
the PCI hotplug but it has side affect and can't work for all device.

For NIC device:
PCI hotplug solution can work around Network device migration
via switching VF and PF.

But switching network interface will introduce service down time.

I tested the service down time via putting VF and PV interface
into a bonded interface and ping the bonded interface during plug
and unplug VF.
1) About 100ms when add VF
2) About 30ms when del VF

It also requires guest to do switch configuration. These are hard to
manage and deploy from our customers. To maintain PV performance during
migration, host side also needs to assign a VF to PV device. This
affects scalability.

These factors block SRIOV NIC passthough usage in the cloud service and
OPNFV which require network high performance and stability a lot.


Right, that I'll agree it's hard to do migration of a VM which uses
an SRIOV device; and while I think it should be possible to bond a virtio device
to a VF for networking and then hotplug the SR-IOV device I agree it's hard to 
manage.


For other kind of devices, it's hard to work.
We are also adding migration support for QAT(QuickAssist Technology) device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive operations,
such as:
1) Symmetric cryptography functions including cipher operations and
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration and these
operations will fail when unplug device.


I don't understand that QAT argument; if the device is purely an offload
engine for performance, then why can't you fall back to doing the
same operations in the VM or in QEMU if the card is unavailable?
The tricky bit is dealing with outstanding operations.


So we are trying implementing a new solution which really migrates
device state to target machine and won't affect user during migration
with low service down time.


Right, that's a good aim - the only question is how to do it.

It looks like this is always going to need some device-specific code;
the question I see is whether that's in:
 1) qemu
 2) the host kernel
 3) the guest kernel driver

The objections to this series seem to be that it needs changes to (3);
I can see the worry that the guest kernel driver might not get a chance
to run during the right time in migration and it's painful having to
change every guest driver (although your change is small).

My question is what stage of the migration process do you expect to tell
the guest kernel driver to do this?

 If you do it at the start of the migration, and quiesce the device,
 the migration might take a long time (say 30 minutes) - are you
 intending the device to be quiesced for this long? And where are
 you going to send the traffic?
 If you are, then do you need to do it via this PCI trick, or could
 you just do it via something higher level to quiesce the device.

 Or are you intending to do it just near the end of the migration?
 But then how do we know how long it will take the guest driver to
 respond?


Ideally, it is able to leave guest driver unmodified but it requires the
hypervisor or qemu to aware the device which means we may need a driver in
hypervisor or qemu to handle the device on behalf of guest driver.


Can you answer the question of when do you use your code -
at the start of migration or
just before the end?


Tianyu can answer this question. In my initial design, I preferred to
put most of the modifications in the hypervisor and Qemu, and the only
involvement from the guest driver is how to restore the state after
migration. But I don't know the later implementation, since I have left
Intel.





It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?


It is impossible to use one common solution for all devices unless the PCIE
spec 

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu



On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:

On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:

Ideally, it is able to leave guest driver unmodified but it requires the

hypervisor or qemu to aware the device which means we may need a driver in
hypervisor or qemu to handle the device on behalf of guest driver.

Can you answer the question of when do you use your code -
at the start of migration or
just before the end?


Just before stopping VCPU in this version and inject VF mailbox irq to
notify the driver if the irq handler is installed.
Qemu side also will check this via the faked PCI migration capability
and driver will set the status during device open() or resume() callback.


Right, this is the "good path" optimization. Whether this buys anything
as compared to just sending reset to the device when VCPU is stopped
needs to be measured. In any case, we probably do need a way to
interrupt driver on destination to make it reconfigure the device -
otherwise it might take seconds for it to notice.  And a way to make
sure driver can handle this surprise reset so we can block migration if
it can't.



Yes, we need such a way to notify the driver about the migration status
and to do the reset or restore operation on the destination machine. My
original design is to take advantage of the device's irq to do that. The
driver can tell Qemu which irq it prefers to use for such tasks and
whether that irq is enabled or bound to a handler. We can discuss the
details in the other thread.




It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?


It is impossible to use one common solution for all devices unless the PCIE
spec documents it clearly and i think one day it will be there. But before
that, we need some workarounds on guest driver to make it work even it looks
ugly.


Yes, so far there is not hardware migration support


VT-D supports setting dirty bit in the PTE in hardware.


Actually, this isn't supported by current hardware.
The VT-d spec documents the dirty bit only for first-level translation,
which requires devices to support DMA requests with a PASID (process
address space identifier). Most devices don't support that feature.




and it's hard to modify
bus level code.


Why is it hard?


As Yang said, the concern is that the PCI spec doesn't document how to
do migration.





It also will block implementation on the Windows.


Implementation of what?  We are discussing motivation here, not
implementation.  E.g. windows drivers typically support surprise
removal, should you use that, you get some working code for free.  Just
stop worrying about it.  Make it work, worry about closed source
software later.


Dave




Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:

Ideally, it is able to leave guest driver unmodified but it requires the
>hypervisor or qemu to aware the device which means we may need a driver in
>hypervisor or qemu to handle the device on behalf of guest driver.

Can you answer the question of when do you use your code -
at the start of migration or
just before the end?


Just before stopping the VCPU in this version; a VF mailbox irq is
injected to notify the driver if the irq handler is installed.
The Qemu side will also check this via the faked PCI migration capability,
and the driver will set the status during the device open() or resume()
callback.




> >It would be great if we could avoid changing the guest; but at least your 
guest
> >driver changes don't actually seem to be that hardware specific; could your
> >changes actually be moved to generic PCI level so they could be made
> >to work for lots of drivers?

>
>It is impossible to use one common solution for all devices unless the PCIE
>spec documents it clearly and i think one day it will be there. But before
>that, we need some workarounds on guest driver to make it work even it looks
>ugly.


Yes, so far there is no hardware migration support, and it's hard to
modify bus-level code. It would also block an implementation on Windows.



Dave




Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>hypervisor or qemu to aware the device which means we may need a driver in
> >>>hypervisor or qemu to handle the device on behalf of guest driver.
> >Can you answer the question of when do you use your code -
> >at the start of migration or
> >just before the end?
> 
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

Right, this is the "good path" optimization. Whether this buys anything
as compared to just sending reset to the device when VCPU is stopped
needs to be measured. In any case, we probably do need a way to
interrupt driver on destination to make it reconfigure the device -
otherwise it might take seconds for it to notice.  And a way to make
sure driver can handle this surprise reset so we can block migration if
it can't.

> >
>  >It would be great if we could avoid changing the guest; but at least 
>  >your guest
>  >driver changes don't actually seem to be that hardware specific; could 
>  >your
>  >changes actually be moved to generic PCI level so they could be made
>  >to work for lots of drivers?
> >>>
> >>>It is impossible to use one common solution for all devices unless the PCIE
> >>>spec documents it clearly and i think one day it will be there. But before
> >>>that, we need some workarounds on guest driver to make it work even it 
> >>>looks
> >>>ugly.
> 
> Yes, so far there is not hardware migration support

VT-D supports setting dirty bit in the PTE in hardware.

> and it's hard to modify
> bus level code.

Why is it hard?

> It also will block implementation on the Windows.

Implementation of what?  We are discussing motivation here, not
implementation.  E.g. windows drivers typically support surprise
removal, should you use that, you get some working code for free.  Just
stop worrying about it.  Make it work, worry about closed source
software later.

> >Dave
> >


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Dr. David Alan Gilbert
* Lan, Tianyu (tianyu@intel.com) wrote:
> 
> 
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>hypervisor or qemu to aware the device which means we may need a driver in
> >>>hypervisor or qemu to handle the device on behalf of guest driver.
> >Can you answer the question of when do you use your code -
> >at the start of migration or
> >just before the end?
> 
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

OK, hmm - I can see that would work in some cases; but:
   a) It wouldn't work if the guest was paused, the management can pause it
      before starting migration or during migration - so you might need to
      hook the pause as well; so that's a bit complicated.

   b) How long does qemu wait for the guest to respond, and what does it do
      if the guest doesn't respond?  How do we recover?

   c) How much work does the guest need to do at this point?

   d) It would be great if we could find a more generic way of telling the
      guest it's about to migrate rather than via the PCI registers of one
      device; imagine what happens if you have a few different devices using
      SR-IOV, we'd have to tell them all with separate interrupts.  Perhaps
      we could use a virtio channel or an ACPI event or something?

>  >It would be great if we could avoid changing the guest; but at least 
>  >your guest
>  >driver changes don't actually seem to be that hardware specific; could 
>  >your
>  >changes actually be moved to generic PCI level so they could be made
>  >to work for lots of drivers?
> >>>
> >>>It is impossible to use one common solution for all devices unless the PCIE
> >>>spec documents it clearly and i think one day it will be there. But before
> >>>that, we need some workarounds on guest driver to make it work even it 
> >>>looks
> >>>ugly.
> 
> Yes, so far there is not hardware migration support and it's hard to modify
> bus level code. It also will block implementation on the Windows.

Well, there was agraf's trick, although that's a lot more complicated at
the qemu level, but it should work with no guest modifications.  Michael's
point about dirty page tracking is neat, I think that simplifies it a bit
if it can track dirty pages.

Dave

> >Dave
> >
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK