Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Alexander Duyck
On Tue, Jan 5, 2016 at 1:40 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
>> >> The two mechanisms referenced above would likely require coordination with
>> >> QEMU and as such are open to discussion.  I haven't attempted to address
>> >> them as I am not sure there is a consensus as of yet.  My personal
>> >> preference would be to add a vendor-specific configuration block to the
>> >> emulated pci-bridge interfaces created by QEMU that would allow us to
>> >> essentially extend shpc to support guest live migration with pass-through
>> >> devices.
>> >
>> > shpc?
>>
>> That is kind of what I was thinking.  We basically need some mechanism
>> to allow for the host to ask the device to quiesce.  It has been
>> proposed to possibly even look at something like an ACPI interface
>> since I know ACPI is used by QEMU to manage hot-plug in the standard
>> case.
>>
>> - Alex
>
>
> Start by using hot-unplug for this!
>
> Really use your patch guest side, and write host side
> to allow starting migration with the device, but
> defer completing it.

Yeah, I'm fully on board with this idea, though I'm not really working
on this right now since, last I knew, the folks from Intel on this
thread were working on it.  My patches were mostly meant to be a nudge
in this direction so that we could get away from the driver-specific
code.

> So
>
> 1.- host tells guest to start tracking memory writes
> 2.- guest acks
> 3.- migration starts
> 4.- most memory is migrated
> 5.- host tells guest to eject device
> 6.- guest acks
> 7.- stop vm and migrate rest of state
>

Sounds about right.  The only way this differs from what I see as the
final solution for this is that instead of fully ejecting the device
in step 5 the driver would instead pause the device and give the host
something like 10 seconds to stop the VM and resume with the same
device connected if it is available.  We would probably also need to
look at a solution that would force the device to be ejected or abort
prior to starting the migration if it doesn't give us the ack in step
2.
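A hedged sketch of the host/guest handshake implied by the steps above,
with the pause-with-timeout variant folded in; every name here is
hypothetical, not an existing QEMU or kernel interface:

/* Hypothetical handshake states for the sequence above; illustrative
 * names only, not from any existing interface. */
enum migration_handshake_state {
	MIG_IDLE,
	MIG_TRACK_REQUESTED,	/* 1: host asks guest to track DMA writes */
	MIG_TRACK_ACKED,	/* 2: guest acks; 3-4: warm-up proceeds   */
	MIG_QUIESCE_REQUESTED,	/* 5: host asks guest to pause or eject   */
	MIG_QUIESCE_ACKED,	/* 6: guest acks; 7: stop-and-copy        */
	MIG_ABORTED,		/* no ack at step 2: eject or abort early */
};

/* If the guest never acks the quiesce request, the host would fall
 * back to a forced eject after roughly the 10 second window above. */
#define MIG_QUIESCE_TIMEOUT_MS	10000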

> It will already be a win since hot unplug after migration starts and
> most memory has been migrated is better than hot unplug before migration
> starts.

Right.  Generally the longer the VF can be maintained as a part of the
guest the longer the network performance is improved versus using a
purely virtual interface.

> Then measure downtime and profile. Then we can look at ways
> to quiesce device faster which really means step 5 is replaced
> with "host tells guest to quiesce device and dirty (or just unmap!)
> all memory mapped for write by device".

Step 5 will be the spot where we really need to start modifying
drivers.  Specifically we probably need to go through and clean-up
things so that we can reduce as many of the delays in the driver
suspend/resume path as possible.  I suspect there is quite a bit that
can be done there that would probably also improve boot and shutdown
times since those are also impacted by the devices.

- Alex


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-04 Thread Alexander Duyck
On Mon, Jan 4, 2016 at 12:41 PM, Konrad Rzeszutek Wilk
<konrad.w...@oracle.com> wrote:
> On Sun, Dec 13, 2015 at 01:28:09PM -0800, Alexander Duyck wrote:
>> This patch set is meant to be the guest side code for a proof of concept
>> involving leaving pass-through devices in the guest during the warm-up
>> phase of guest live migration.  In order to accomplish this I have added a
>
> What does that mean? 'warm-up-phase'?

It is the first phase in a pre-copy migration.
https://en.wikipedia.org/wiki/Live_migration

Basically in this phase all the memory is marked as dirty and then
copied.  Any memory that changes gets marked as dirty as well.
Currently DMA circumvents this as the user space dirty page tracking
isn't able to track DMA.
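For orientation, here is a minimal pseudocode sketch of that pre-copy
loop as seen from the hypervisor side; every type and helper below is
hypothetical, not QEMU's actual implementation:

struct vm;
extern void mark_all_pages_dirty(struct vm *vm);
extern unsigned long dirty_page_count(struct vm *vm);
extern void copy_dirty_pages(struct vm *vm);
extern void stop_vm(struct vm *vm);
extern void copy_remaining_state(struct vm *vm);

/* Hypothetical pre-copy loop; helper names are illustrative only. */
static void precopy_migrate(struct vm *vm, unsigned long threshold)
{
	/* Warm-up begins: treat every guest page as dirty. */
	mark_all_pages_dirty(vm);

	/* Iterate until the dirty set is small enough.  CPU writes
	 * re-dirty pages via write faults; device DMA writes do not,
	 * which is the gap dma_mark_dirty is meant to close. */
	while (dirty_page_count(vm) > threshold)
		copy_dirty_pages(vm);

	/* Stop-and-copy: halt the guest, transfer what remains. */
	stop_vm(vm);
	copy_dirty_pages(vm);
	copy_remaining_state(vm);
}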

>> new function called dma_mark_dirty that will mark the pages associated with
>> the DMA transaction as dirty in the case of either an unmap or a
>> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
>> DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
>> the stop-and-copy phase, however allowing the device to be present should
>> significantly improve the performance of the guest during the warm-up
>> period.
>
> .. if the warm-up phase is short, I presume? If the warm-up phase takes
> a long time (a busy guest that is 1TB in size) it wouldn't help much, as the
> tracking of these DMAs may take quite a while?
>
>>
>> This current implementation is very preliminary and there are a number of
>> items still missing.  Specifically, in order to make this a more complete
>> solution we need to support:
>> 1.  Notifying hypervisor that drivers are dirtying DMA pages received
>
> .. And somehow giving the hypervisor the GPFN so it can retain the PFN in
> the VT-d as long as possible.

Yes, what has happened is that the host went through and marked all
memory as read-only.  So trying to do any operation that requires
write access triggers a page fault which is then used by the host to
track pages that were dirtied.

>> 2.  Bypassing page dirtying when it is not needed.
>
> How would this work with a device doing DMA operations _after_ the
> migration?
> That is, the driver submits a DMA READ.. migrates away, device is unplugged,
> VT-d context is torn down - the device does the DMA READ and gets a VT-d error...
>
> and what then? How should the device on the other host replay the DMA READ?

The device has to quiesce before the migration can occur.  We cannot
have any DMA mappings still open when we reach the stop-and-copy phase
of the migration.  The solution I have proposed here works for
streaming mappings but doesn't solve the case for things like
dma_alloc_coherent where a bidirectional mapping is maintained between
the CPU and the device.
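To make the distinction concrete, here is a hedged sketch of the two
mapping styles; the streaming case gives dma_mark_dirty a natural hook
at unmap/sync time, while the coherent case has no such event.  The
driver shape is illustrative, not taken from a real driver:

#include <linux/dma-mapping.h>

/* Streaming mapping: device ownership ends at unmap (or at
 * dma_sync_single_for_cpu), which is where the proposed
 * dma_mark_dirty hook can run.  Illustrative only. */
static void rx_complete(struct device *dev, dma_addr_t handle, size_t len)
{
	dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
	/* with this patch set, the pages backing the buffer are
	 * dirtied here */
}

/* Coherent mapping: CPU and device share the buffer for its whole
 * lifetime, so there is no per-transaction event to hook. */
static void *alloc_ring(struct device *dev, size_t len, dma_addr_t *handle)
{
	return dma_alloc_coherent(dev, len, handle, GFP_KERNEL);
}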

>>
>> The two mechanisms referenced above would likely require coordination with
>> QEMU and as such are open to discussion.  I haven't attempted to address
>> them as I am not sure there is a consensus as of yet.  My personal
>> preference would be to add a vendor-specific configuration block to the
>> emulated pci-bridge interfaces created by QEMU that would allow us to
>> essentially extend shpc to support guest live migration with pass-through
>> devices.
>
> shpc?

That is kind of what I was thinking.  We basically need some mechanism
to allow for the host to ask the device to quiesce.  It has been
proposed to possibly even look at something like an ACPI interface
since I know ACPI is used by QEMU to manage hot-plug in the standard
case.

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-29 Thread Alexander Duyck
On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin  wrote:
> On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>> >As long as you keep up this vague talk about performance during
>> >migration, without even bothering with any measurements, this patchset
>> >will keep going nowhere.
>> >
>>
>> I measured network service downtime for "keep device alive"(RFC patch V1
>> presented) and "put down and up network interface"(RFC patch V2 presented)
>> during migration with some optimizations.
>>
>> The former is around 140ms and the latter is around 240ms.
>>
>> My patchset relies on the mailbox irq which doesn't work in the suspend state
>> and so can't get downtime for suspend/resume cases. Will try to get the
>> result later.
>
>
> Interesting. So you are saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from DHCP?


Actually it wouldn't surprise me if that is due to the reset logic in
the driver.  For starters there is a 10 msec delay in the call to
ixgbevf_reset_hw_vf, which I believe is present to allow the PF time to
clear registers after the VF has requested a reset.  There is also a
10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
are disabled.  That is in addition to the fact that the function that
disables the queues does so serially and polls each queue until the
hardware acknowledges that the queues are actually disabled.  The
driver also does the serial enable-with-poll logic when re-enabling the
queues, which likely doesn't help things.

Really this driver is probably in need of a refactor to clean the
cruft out of the reset and initialization logic.  I suspect we have
far more delays than we really need, and that is the source of much of
the slowdown.
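A rough sketch of the serial disable-and-poll pattern described above;
all names, offsets, and bits here are made up for illustration, this is
not the actual ixgbevf code:

#include <linux/delay.h>
#include <linux/types.h>

struct vf_adapter { int num_rx_queues; };
extern u32 vf_read_reg(struct vf_adapter *a, u32 reg);
extern void vf_write_reg(struct vf_adapter *a, u32 reg, u32 val);
#define VF_RXDCTL(i)		(0x1000 + 0x40 * (i))	/* made-up offset */
#define VF_RXDCTL_ENABLE	0x02000000		/* made-up bit */

static void vf_disable_rx_queues(struct vf_adapter *adapter)
{
	int i;

	/* each queue is disabled serially, with a poll per queue */
	for (i = 0; i < adapter->num_rx_queues; i++) {
		vf_write_reg(adapter, VF_RXDCTL(i), 0);
		while (vf_read_reg(adapter, VF_RXDCTL(i)) & VF_RXDCTL_ENABLE)
			usleep_range(100, 200);
	}

	/* followed by the blanket 10-20 msec sleep called out above */
	usleep_range(10000, 20000);
}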

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-29 Thread Alexander Duyck
On Tue, Dec 29, 2015 at 9:15 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote:
>> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>> >>
>> >>
>> >> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>> >> >As long as you keep up this vague talk about performance during
>> >> >migration, without even bothering with any measurements, this patchset
>> >> >will keep going nowhere.
>> >> >
>> >>
>> >> I measured network service downtime for "keep device alive"(RFC patch V1
>> >> presented) and "put down and up network interface"(RFC patch V2 presented)
>> >> during migration with some optimizations.
>> >>
>> >> The former is around 140ms and the latter is around 240ms.
>> >>
>> >> My patchset relies on the mailbox irq which doesn't work in the suspend
>> >> state
>> >> and so can't get downtime for suspend/resume cases. Will try to get the
>> >> result later.
>> >
>> >
>> > Interesting. So you are saying merely ifdown/ifup is 100ms?
>> > This does not sound reasonable.
>> > Is there a chance you are e.g. getting IP from DHCP?
>>
>>
>> Actually it wouldn't surprise me if that is due to the reset logic in
>> the driver.  For starters there is a 10 msec delay in the call to
>> ixgbevf_reset_hw_vf, which I believe is present to allow the PF time to
>> clear registers after the VF has requested a reset.  There is also a
>> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
>> are disabled.  That is in addition to the fact that the function that
>> disables the queues does so serially and polls each queue until the
>> hardware acknowledges that the queues are actually disabled.  The
>> driver also does the serial enable-with-poll logic when re-enabling the
>> queues, which likely doesn't help things.
>>
>> Really this driver is probably in need of a refactor to clean the
>> cruft out of the reset and initialization logic.  I suspect we have
>> far more delays than we really need, and that is the source of much of
>> the slowdown.
>
> For ifdown, why is there any need to reset the device at all?
> Is it so buffers can be reclaimed?
>

I believe it is mostly historical.  All the Intel drivers are derived
from e1000.  The e1000 has a 10ms sleep to allow outstanding PCI
transactions to complete before resetting and it looks like they ended
up inheriting that in the ixgbevf driver.  I suppose it does allow for
the buffers to be reclaimed which is something we may need, though the
VF driver should have already verified that it disabled the queues
when it was doing the polling on the bits being cleared in the
individual queue control registers.  Likely the 10ms sleep is
redundant as a result.

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-27 Thread Alexander Duyck
On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
>> The PCI hot-plug specification calls out that the OS can optionally
>> implement a "pause" mechanism which is meant to be used for high
>> availability type environments.  What I am proposing is basically
>> extending the standard SHPC capable PCI bridge so that we can support
>> the DMA page dirtying for everything hosted on it, add a vendor
>> specific block to the config space so that the guest can notify the
>> host that it will do page dirtying, and add a mechanism to indicate
>> that all hot-plug events during the warm-up phase of the migration are
>> pause events instead of full removals.
>
> Two comments:
>
> 1. A vendor specific capability will always be problematic.
> Better to register a capability id with pci sig.
>
> 2. There are actually several capabilities:
>
> A. support for memory dirtying
> if not supported, we must stop device before migration
>
> This is supported by core guest OS code,
> using patches similar to posted by you.
>
>
> B. support for device replacement
> This is a faster form of hotplug, where device is removed and
> later another device using same driver is inserted in the same slot.
>
> This is a possible optimization, but I am convinced
> (A) should be implemented independently of (B).
>

My thought on this was that we don't need much to really implement
either feature, really only a bit or two for each one.  I had
thought about extending the PCI Advanced Features capability, but for now
it might make more sense to just implement it as a vendor capability for
the QEMU based bridges instead of trying to make this a true PCI
capability, since I am not sure if this would in any way apply to
physical hardware.  The fact is the PCI Advanced Features capability
is essentially just a vendor specific capability with a different ID,
so if we were to use 2 bits that are currently reserved in the
capability we could later merge the functionality without much
overhead.

I fully agree that the two implementations should be separate, but
nothing says we have to implement them completely differently.  If we
are just using 3 bits for capability, status, and control of each
feature there is no reason for them to be stored in separate
locations.
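As an illustration, the 3 bits per feature could be packed along these
lines; the layout below is purely hypothetical, not a defined QEMU or
PCI-SIG format, and only the general capability/status/control shape
comes from the discussion above:

#include <linux/types.h>

/* Hypothetical vendor-specific capability layout. */
#define MIG_FEAT_DIRTY	0x01	/* (A) DMA page dirtying supported */
#define MIG_FEAT_PAUSE	0x02	/* (B) device replacement/pause supported */

struct mig_vendor_cap {
	u8 cap_id;	/* PCI_CAP_ID_VNDR, 0x09 */
	u8 cap_next;
	u8 cap_len;
	u8 capability;	/* feature bits the bridge supports */
	u8 control;	/* guest sets bits here to enable a feature */
	u8 status;	/* guest reports readiness back to the host */
};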

>> I've been poking around in the kernel and QEMU code and the part I
>> have been trying to sort out is how to get QEMU based pci-bridge to
>> use the SHPC driver because from what I can tell the driver never
>> actually gets loaded on the device as it is left in the control of
>> ACPI hot-plug.
>
> There are ways, but you can just use pci express, it's easier.

That's true.  I should probably just give up on trying to do an
implementation that works with i440fx.  I could probably move over to
q35, and once that is done we could look at something like the PCI
Advanced Features solution for the PCI-bridge drivers.

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-27 Thread Alexander Duyck
On Sun, Dec 27, 2015 at 7:20 PM, Dong, Eddie  wrote:
>> >
>> > Even if the device driver doesn't support migration, you still want to
>> > migrate the VM? That may be risky, and we should add the "bad path" for
>> > the driver at least.
>>
>> At a minimum we should have support for hot-plug if we are expecting to
>> support migration.  You would simply have to hot-plug the device before you
>> start migration and then return it after.  That is how the current bonding
>> approach for this works if I am not mistaken.
>
> Hotplug is good to eliminate the device specific state clone, but the bonding
> approach is very network specific; it doesn't work for other devices such as
> FPGA devices, QAT devices & GPU devices, which we plan to support gradually :)

Hotplug would be usable for that assuming the guest supports the
optional "pause" implementation as called out in the PCI hotplug spec.
With that the device can maintain state for some period of time after
the hotplug remove event has occurred.

The problem is that you have to get the device to quiesce at some
point as you cannot complete the migration with the device still
active.  The way you were doing it was using the per-device
configuration space mechanism.  That doesn't scale when you have to
implement it for each and every driver for each and every OS you have
to support.  Using the "pause" implementation for hot-plug would have
a much greater likelihood of scaling as you could either take the fast
path approach of "pausing" the device to resume it when migration has
completed, or you could just remove the device and restart the driver
on the other side if the pause support is not yet implemented.  You
would lose the state under such a migration but it is much more
practical than having to implement a per device solution.

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-25 Thread Alexander Duyck
On Thu, Dec 24, 2015 at 11:03 PM, Lan Tianyu <tianyu@intel.com> wrote:
> Merry Christmas.
> Sorry for the late response due to a personal affair.
>
> On 2015-12-14 03:30, Alexander Duyck wrote:
>>> > This sounds like we need to add a faked bridge for migration and add a
>>> > driver in the guest for it.  It also needs to extend the PCI bus/hotplug
>>> > driver to pause/resume other devices, right?
>>> >
>>> > My concern is still whether we can change PCI bus/hotplug like that
>>> > without a spec change.
>>> >
>>> > The IRQ should be general for any device and we may extend it for
>>> > migration.  The device driver can also make the decision to support
>>> > migration or not.
>> The device should have no say in the matter.  Either we are going to
>> migrate or we will not.  This is why I have suggested my approach as
>> it allows for the least amount of driver intrusion while providing the
>> maximum number of ways to still perform migration even if the device
>> doesn't support it.
>
> Even if the device driver doesn't support migration, you still want to
> migrate the VM? That may be risky, and we should add the "bad path" for the
> driver at least.

At a minimum we should have support for hot-plug if we are expecting
to support migration.  You would simply have to hot-plug the device
before you start migration and then return it after.  That is how the
current bonding approach for this works if I am not mistaken.

The advantage we are looking to gain is to avoid removing/disabling
the device for as long as possible.  Ideally we want to keep the
device active through the warm-up period, but if the guest doesn't do
that we should still be able to fall back on the older approaches if
needed.

>>
>> The solution I have proposed is simple:
>>
>> 1.  Extend swiotlb to allow for a page dirtying functionality.
>>
>>  This part is pretty straightforward.  I'll submit a few patches
>> later today as RFC that provide the minimal functionality needed
>> for this.
>
> Very appreciate to do that.
>
>>
>> 2.  Provide a vendor specific configuration space option on the QEMU
>> implementation of a PCI bridge to act as a bridge between direct
>> assigned devices and the host bridge.
>>
>>  My thought was to add some vendor specific block that includes a
>> capabilities, status, and control register so you could go through and
>> synchronize things like the DMA page dirtying feature.  The bridge
>> itself could manage the migration capable bit inside QEMU for all
>> devices assigned to it.  So if you added a VF to the bridge it would
>> flag that you can support migration in QEMU, while the bridge would
>> indicate you cannot until the DMA page dirtying control bit is set by
>> the guest.
>>
>>  We could also go through and optimize the DMA page dirtying after
>> this is added so that we can narrow down the scope of use, and as a
>> result improve the performance for other devices that don't need to
>> support migration.  It would then be a matter of adding an interrupt
>> in the device to handle an event such as the DMA page dirtying status
>> bit being set in the config space status register, while the bit is
>> not set in the control register.  If it doesn't get set then we would
>> have to evict the devices before the warm-up phase of the migration,
>> otherwise we can defer it until the end of the warm-up phase.
>>
>> 3.  Extend existing shpc driver to support the optional "pause"
>> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
>> hot-plug specification.
>
> Since your solution has added a faked PCI bridge, why not notify the
> bridge directly during migration via an irq and call the device driver's
> callback in the new bridge driver?
>
> Otherwise, the new bridge driver can also check whether the device
> driver provides a migration callback or not and call it to improve the
> passthrough device's performance during migration.

This is basically what I had in mind, though I would take things one
step further.  You don't need to add any new call-backs if you make
use of the existing suspend/resume logic.  For a VF this does exactly
what you would need since VFs don't support wake-on-LAN, so it will
simply clear the bus master enable and put the netdev in a suspended
state until resume can be called.
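In code terms, a minimal sketch of that reuse, assuming a hypothetical
hot-plug "pause" event simply invokes the legacy PM hooks the VF driver
already provides (the driver below is not a real one):

#include <linux/pci.h>

static int my_vf_suspend(struct pci_dev *pdev, pm_message_t state)
{
	/* the driver quiesces its queues here, then... */
	pci_save_state(pdev);
	pci_disable_device(pdev);	/* clears bus master enable */
	return 0;
}

static int my_vf_resume(struct pci_dev *pdev)
{
	pci_restore_state(pdev);
	return pci_enable_device(pdev);
}

static struct pci_driver my_vf_driver = {
	.name		= "my_vf",		/* hypothetical */
	.suspend	= my_vf_suspend,	/* legacy PM hooks of the era */
	.resume		= my_vf_resume,
};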

The PCI hot-plug specification calls out that the OS can optionally
implement a "pause" mechanism which is meant to be used for high
availability type environments.  What I am proposing is basically
extending the standard SHPC capable PCI bridge so that we can support
the DMA page dirtying for everything hosted on it, add a vendor
specific block to the config space so that the guest can notify the
host that it will do page dirtying, and add a mechanism to indicate
that all hot-plug events during the warm-up phase of the migration are
pause events instead of full removals.

Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Alexander Duyck
On Mon, Dec 14, 2015 at 6:00 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Sun, Dec 13, 2015 at 01:28:31PM -0800, Alexander Duyck wrote:
>> This patch is meant to provide the guest with a way of flagging DMA pages
>> as being dirty to the host when using a direct-assign device within a
>> guest.  The advantage to this approach is that it is fairly simple; however,
>> it currently has a significant impact on device performance in all the
>> scenarios where it isn't needed.
>>
>> As such this is really meant only as a proof of concept and to get the ball
>> rolling in terms of figuring out how best to approach the issue of dirty
>> page tracking for a guest that is using a direct assigned device.  In
>> addition with just this patch it should be possible to modify current
>> migration approaches so that instead of having to hot-remove the device
>> before starting the migration this can instead be delayed until the period
>> before the final stop and copy.
>>
>> Signed-off-by: Alexander Duyck <adu...@mirantis.com>
>> ---
>>  arch/arm/include/asm/dma-mapping.h   |3 ++-
>>  arch/arm64/include/asm/dma-mapping.h |5 ++---
>>  arch/ia64/include/asm/dma.h  |1 +
>>  arch/mips/include/asm/dma-mapping.h  |1 +
>>  arch/powerpc/include/asm/swiotlb.h   |1 +
>>  arch/tile/include/asm/dma-mapping.h  |1 +
>>  arch/unicore32/include/asm/dma-mapping.h |1 +
>>  arch/x86/Kconfig |   11 +++
>>  arch/x86/include/asm/swiotlb.h   |   26 ++
>>  drivers/xen/swiotlb-xen.c|6 ++
>>  lib/swiotlb.c|6 ++
>>  11 files changed, 58 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/dma-mapping.h 
>> b/arch/arm/include/asm/dma-mapping.h
>> index ccb3aa64640d..1962d7b471c7 100644
>> --- a/arch/arm/include/asm/dma-mapping.h
>> +++ b/arch/arm/include/asm/dma-mapping.h
>> @@ -167,7 +167,8 @@ static inline bool dma_capable(struct device *dev, 
>> dma_addr_t addr, size_t size)
>>   return 1;
>>  }
>>
>> -static inline void dma_mark_clean(void *addr, size_t size) { }
>> +static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  extern int arm_dma_set_mask(struct device *dev, u64 dma_mask);
>>
>> diff --git a/arch/arm64/include/asm/dma-mapping.h 
>> b/arch/arm64/include/asm/dma-mapping.h
>> index 61e08f360e31..8d24fe11c8a3 100644
>> --- a/arch/arm64/include/asm/dma-mapping.h
>> +++ b/arch/arm64/include/asm/dma-mapping.h
>> @@ -84,9 +84,8 @@ static inline bool dma_capable(struct device *dev, 
>> dma_addr_t addr, size_t size)
>>   return addr + size - 1 <= *dev->dma_mask;
>>  }
>>
>> -static inline void dma_mark_clean(void *addr, size_t size)
>> -{
>> -}
>> +static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #endif   /* __KERNEL__ */
>>  #endif   /* __ASM_DMA_MAPPING_H */
>> diff --git a/arch/ia64/include/asm/dma.h b/arch/ia64/include/asm/dma.h
>> index 4d97f60f1ef5..d92ebeb2758e 100644
>> --- a/arch/ia64/include/asm/dma.h
>> +++ b/arch/ia64/include/asm/dma.h
>> @@ -20,5 +20,6 @@ extern unsigned long MAX_DMA_ADDRESS;
>>  #define free_dma(x)
>>
>>  void dma_mark_clean(void *addr, size_t size);
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #endif /* _ASM_IA64_DMA_H */
>> diff --git a/arch/mips/include/asm/dma-mapping.h 
>> b/arch/mips/include/asm/dma-mapping.h
>> index e604f760c4a0..567f6e03e337 100644
>> --- a/arch/mips/include/asm/dma-mapping.h
>> +++ b/arch/mips/include/asm/dma-mapping.h
>> @@ -28,6 +28,7 @@ static inline bool dma_capable(struct device *dev, 
>> dma_addr_t addr, size_t size)
>>  }
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #include 
>>
>> diff --git a/arch/powerpc/include/asm/swiotlb.h 
>> b/arch/powerpc/include/asm/swiotlb.h
>> index de99d6e29430..b694e8399e28 100644
>> --- a/arch/powerpc/include/asm/swiotlb.h
>> +++ b/arch/powerpc/include/asm/swiotlb.h
>> @@ -16,6 +16,7 @@
>>  extern struct dma_map_ops swiotlb_dma_ops;
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}

Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Alexander Duyck
On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
>> > This way distro can use a guest agent to disable
>> > dirtying until before migration starts.
>>
>> Right.  For a v2 version I would definitely want to have some way to
>> limit the scope of this.  My main reason for putting this out here is
>> to start altering the course of discussions since it seemed like we
>> weren't getting anywhere with the ixgbevf migration changes that were
>> being proposed.
>
> Absolutely, thanks for working on this.
>
>> >> + unsigned long pg_addr, start;
>> >> +
>> >> + start = (unsigned long)addr;
>> >> + pg_addr = PAGE_ALIGN(start + size);
>> >> + start &= ~(sizeof(atomic_t) - 1);
>> >> +
>> >> + /* trigger a write fault on each page, excluding first page */
>> >> + while ((pg_addr -= PAGE_SIZE) > start)
>> >> + atomic_add(0, (atomic_t *)pg_addr);
>> >> +
>> >> + /* trigger a write fault on first word of DMA */
>> >> + atomic_add(0, (atomic_t *)start);
>> >
>> > start might not be aligned correctly for a cast to atomic_t.
>> > It's harmless to do this for any memory, so I think you should
>> > just do this for 1st byte of all pages including the first one.
>>
>> You may not have noticed it but I actually aligned start in the line
>> after pg_addr.
>
> Yes you did. alignof would make it a bit more noticeable.
>
>>  However instead of aligning to the start of the next
>> atomic_t I just masked off the lower bits so that we start at the
>> DWORD that contains the first byte of the starting address.  The
>> assumption here is that I cannot trigger any sort of fault since if I
>> have access to a given byte within a DWORD I will have access to the
>> entire DWORD.
>
> I'm curious where does this come from.  Isn't it true that access is
> controlled at page granularity normally, so you can touch beginning of
> page just as well?

Yeah, I am pretty sure it probably is page granularity.  However my
thought was to try and stick to the start of the DMA as the last
access.  That way we don't pull in any more cache lines than we need
to in order to dirty the pages.  Usually the start of the DMA region
will contain some sort of headers or something that needs to be
accessed with the highest priority so I wanted to make certain that we
were forcing usable data into the L1 cache rather than just the first
cache line of the page where the DMA started.  If, however, the start of
a DMA is the start of the page, there is nothing there to prevent
that.

>>  I coded this up so that the spots where we touch the
>> memory should match up with addresses provided by the hardware to
>> perform the DMA over the PCI bus.
>
> Yes but there's no requirement to do it like this from
> virt POV. You just need to touch each page.

I know, but at the same time if we match up with the DMA then it is
more likely that we avoid grabbing unneeded cache lines.  In the case
of most drivers the data for headers and start is at the start of the
DMA.  So if we dirty the cache line associated with the start of the
DMA it will be pulled into the L1 cache and there is a greater chance
that it may already be prefetched as well.

>> Also I intentionally ran from highest address to lowest since that way
>> we don't risk pushing the first cache line of the DMA buffer out of
>> the L1 cache due to the PAGE_SIZE stride.
>
> Interesting. How does order of access help with this?

If you use a PAGE_SIZE stride you will start evicting things from the L1
cache after something like 8 accesses on an x86 processor, as most of
the recent ones have a 32K 8-way associative L1 cache.  So if I go
from back to front I evict the stuff that would likely be in the
data portion of a buffer instead of the headers, which are usually
located at the front.
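Spelling out that arithmetic, taking the 32K 8-way cache with 64-byte
lines as given above:

    32768 B / 64 B per line     = 512 lines
    512 lines / 8 ways          = 64 sets
    PAGE_SIZE stride: 4096 / 64 = 64 sets advanced per access,
    a full wrap, so every access lands in the same set
    => the 9th access at that stride must evict one of the first 8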

> By the way, if you are into these micro-optimizations you might want to
> limit prefetch, to this end you want to access the last line of the
> page.  And it's probably worth benchmarking a bit and not doing it all just
> based on theory, keep code simple in v1 otherwise.

My main goal for now is functional code over high performance code.
That is why I have kept this code fairly simple.  I might have done
some optimization but it was as much about the optimization as keeping
the code simple.  For example by using the start of the page instead
of the end I could easily do the comparison against start and avoid
doing more than one write per page.

The issue for me doing performance testing is that I don't have
anyth

Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Alexander Duyck
On Mon, Dec 14, 2015 at 12:52 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Dec 14, 2015 at 09:59:13AM -0800, Alexander Duyck wrote:
>> On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
>> >> > This way distro can use a guest agent to disable
>> >> > dirtying until before migration starts.
>> >>
>> >> Right.  For a v2 version I would definitely want to have some way to
>> >> limit the scope of this.  My main reason for putting this out here is
>> >> to start altering the course of discussions since it seemed like we
>> >> weren't getting anywhere with the ixgbevf migration changes that were
>> >> being proposed.
>> >
>> > Absolutely, thanks for working on this.
>> >
>> >> >> + unsigned long pg_addr, start;
>> >> >> +
>> >> >> + start = (unsigned long)addr;
>> >> >> + pg_addr = PAGE_ALIGN(start + size);
>> >> >> + start &= ~(sizeof(atomic_t) - 1);
>> >> >> +
>> >> >> + /* trigger a write fault on each page, excluding first page */
>> >> >> + while ((pg_addr -= PAGE_SIZE) > start)
>> >> >> + atomic_add(0, (atomic_t *)pg_addr);
>> >> >> +
>> >> >> + /* trigger a write fault on first word of DMA */
>> >> >> + atomic_add(0, (atomic_t *)start);
>
> Actually, I have second thoughts about using atomic_add here,
> especially for _sync.
>
> Many architectures do
>
> #define ATOMIC_OP_RETURN(op, c_op)  \
> static inline int atomic_##op##_return(int i, atomic_t *v)  \
> {   \
> unsigned long flags;\
> int ret;\
> \
> raw_local_irq_save(flags);  \
> ret = (v->counter = v->counter c_op i); \
> raw_local_irq_restore(flags);   \
> \
> return ret; \
> }
>
> and this is not safe if device is still doing DMA to/from
> this memory.
>
> Generally, atomic_t is there for SMP effects, not for sync
> with devices.
>
> This is why I said you should do
> cmpxchg(pg_addr, 0xdead, 0xdead);
>
> Yes, we probably never actually want to run m68k within a VM,
> but let's not misuse interfaces like this.

Right now this implementation is for x86 only.  Any other architecture
currently reports dma_mark_dirty as an empty inline function.  The
reason why I chose the atomic_add for x86 is simply because it is
guaranteed to dirty the cache line with relatively few instructions and
operands as all I have to have is the pointer and 0.

For the m68k we could implement it as a cmpxchg instead.  The general
thought here is that each architecture is probably going to have to do
it a little bit differently.
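For architectures where atomic_add is implemented with irq-save rather
than a true atomic RMW, a cmpxchg-based variant along the lines Michael
suggests might look like this; an untested sketch for illustration, not
a definitive implementation:

/* Sketch only.  A cmpxchg with old == new leaves memory unchanged, so
 * it is safe against concurrent device DMA; on x86 the locked cmpxchg
 * still performs a write cycle and so dirties the page either way. */
static inline void dma_mark_dirty(void *addr, size_t size)
{
	unsigned long pg_addr, start;

	start = (unsigned long)addr;
	pg_addr = PAGE_ALIGN(start + size);
	start &= ~(sizeof(unsigned long) - 1);

	/* trigger a write fault on each page, excluding the first page */
	while ((pg_addr -= PAGE_SIZE) > start)
		cmpxchg((unsigned long *)pg_addr, 0xdead, 0xdead);

	/* trigger a write fault on the first word of the DMA buffer */
	cmpxchg((unsigned long *)start, 0xdead, 0xdead);
}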

>> >> >
>> >> > start might not be aligned correctly for a cast to atomic_t.
>> >> > It's harmless to do this for any memory, so I think you should
>> >> > just do this for 1st byte of all pages including the first one.
>> >>
>> >> You may not have noticed it but I actually aligned start in the line
>> >> after pg_addr.
>> >
>> > Yes you did. alignof would make it a bit more noticeable.
>> >
>> >>  However instead of aligning to the start of the next
>> >> atomic_t I just masked off the lower bits so that we start at the
>> >> DWORD that contains the first byte of the starting address.  The
>> >> assumption here is that I cannot trigger any sort of fault since if I
>> >> have access to a given byte within a DWORD I will have access to the
>> >> entire DWORD.
>> >
>> > I'm curious where does this come from.  Isn't it true that access is
>> > controlled at page granularity normally, so you can touch beginning of
>> > page just as well?
>>
>> Yeah, I am pretty sure it probably is page granularity.  However my
>> thought was to try and stick to the start of the DMA as the last
>> access.  That way we don't pull in any more cache lines than we need
>> to in order to dirty the pages.

Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2015-12-13 Thread Alexander Duyck
On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang <yang.zhang...@gmail.com> wrote:
> On 2015/12/14 5:28, Alexander Duyck wrote:
>>
>> This patch set is meant to be the guest side code for a proof of concept
>> involving leaving pass-through devices in the guest during the warm-up
>> phase of guest live migration.  In order to accomplish this I have added a
>> new function called dma_mark_dirty that will mark the pages associated
>> with
>> the DMA transaction as dirty in the case of either an unmap or a
>> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
>> DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
>> the stop-and-copy phase, however allowing the device to be present should
>> significantly improve the performance of the guest during the warm-up
>> period.
>>
>> This current implementation is very preliminary and there are a number of
>> items still missing.  Specifically, in order to make this a more complete
>> solution we need to support:
>> 1.  Notifying hypervisor that drivers are dirtying DMA pages received
>> 2.  Bypassing page dirtying when it is not needed.
>>
>
> Shouldn't current log dirty mechanism already cover them?

The guest has no way of currently knowing that the hypervisor is doing
dirty page logging, and the log dirty mechanism currently has no way
of tracking device DMA accesses.  This change is meant to bridge the
two so that the guest device driver will force the SWIOTLB DMA API to
mark pages written to by the device as dirty.

>> The two mechanisms referenced above would likely require coordination with
>> QEMU and as such are open to discussion.  I haven't attempted to address
>> them as I am not sure there is a consensus as of yet.  My personal
>> preference would be to add a vendor-specific configuration block to the
>> emulated pci-bridge interfaces created by QEMU that would allow us to
>> essentially extend shpc to support guest live migration with pass-through
>> devices.
>>
>> The functionality in this patch set is currently disabled by default.  To
>> enable it you can select "SWIOTLB page dirtying" from the "Processor type
>> and features" menu.
>
>
> Only SWIOTLB is supported?

Yes.  For right now this only supports SWIOTLB.  The assumption here
is that SWIOTLB is in use for most cases where an IOMMU is not
present.  If an IOMMU is present in a virtualized guest then most
likely the IOMMU might be able to provide a separate mechanism for
dirty page tracking.

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-13 Thread Alexander Duyck
On Sun, Dec 13, 2015 at 7:47 AM, Lan, Tianyu <tianyu@intel.com> wrote:
>
>
> On 12/11/2015 1:16 AM, Alexander Duyck wrote:
>>
>> On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu@intel.com> wrote:
>>>
>>>
>>>
>>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>>>
>>>>>
>>>>> Ideally, it is able to leave guest driver unmodified but it requires
>>>>> the
>>>>>>
>>>>>> hypervisor or qemu to aware the device which means we may need a
>>>>>> driver
>>>>>> in
>>>>>> hypervisor or qemu to handle the device on behalf of guest driver.
>>>>
>>>>
>>>> Can you answer the question of when do you use your code -
>>>>  at the start of migration or
>>>>  just before the end?
>>>
>>>
>>>
>>> Just before stopping the VCPU in this version; a VF mailbox irq is injected
>>> to notify the driver if the irq handler is installed.
>>> The QEMU side will also check this via the faked PCI migration capability,
>>> and the driver will set the status during its device open() or resume()
>>> callback.
>>
>>
>> The VF mailbox interrupt is a very bad idea.  Really the device should
>> be in a reset state on the other side of a migration.  It doesn't make
>> sense to have the interrupt firing if the device is not configured.
>> This is one of the things that is preventing you from being able to
>> migrate the device while the interface is administratively down or the
>> VF driver is not loaded.
>
>
> In my opinion, if the VF driver is not loaded and the hardware hasn't started
> to work, the device state doesn't need to be migrated.
>
> We may add a flag for the driver to check whether migration happened while it
> was down, reinitialize the hardware, and clear the flag when the system tries
> to bring it up.
>
> We may add a migration core in the Linux kernel and provide some helper
> functions to make it easier to add migration support to drivers.
> The migration core would be in charge of syncing status with QEMU.
>
> Example:
> migration_register()
> The driver provides
> - Callbacks to be called before and after migration, or for the bad path
> - The irq it prefers to use to deal with migration events

You would be better off just using function pointers in the pci_driver
struct and let the PCI driver registration take care of all that.

> migration_event_check()
> The driver calls it in the irq handler.  Migration core code will check
> the migration status and call the driver's callbacks when migration happens.

No, this is still a bad idea.  You haven't addressed what you do when
the device has had interrupts disabled such as being in the down
state.

This is the biggest issue I see with your whole patch set.  It
requires the driver containing certain changes and being in a certain
state.  You cannot put those expectations on the guest.  You really
need to try and move as much of this out to existing functionality as
possible.

>>
>> My thought on all this is that it might make sense to move this
>> functionality into a PCI-to-PCI bridge device and make it a
>> requirement that all direct-assigned devices have to exist behind that
>> device in order to support migration.  That way you would be working
>> with a directly emulated device that would likely already be
>> supporting hot-plug anyway.  Then it would just be a matter of coming
>> up with a few Qemu specific extensions that you would need to add to
>> the device itself.  The same approach would likely be portable enough
>> that you could achieve it with PCIe as well via the same configuration
>> space being present on the upstream side of a PCIe port or maybe a
>> PCIe switch of some sort.
>>
>> It would then be possible to signal via your vendor-specific PCI
>> capability on that device that all devices behind this bridge require
>> DMA page dirtying, you could use the configuration in addition to the
>> interrupt already provided for hot-plug to signal things like when you
>> are starting migration, and possibly even just extend the shpc
>> functionality so that if this capability is present you have the
>> option to pause/resume instead of remove/probe the device in the case
>> of certain hot-plug events.  The fact is there may be some use for a
>> pause/resume type approach for PCIe hot-plug in the near future
>> anyway.  From the sounds of it Apple has required it for all
>> Thunderbolt device drivers so that they can halt the device in order
>> to shuffle resources around, perhaps we should look at something
>> similar for Linux

[RFC PATCH 2/3] xen/swiotlb: Fold static unmap and sync calls into calling functions

2015-12-13 Thread Alexander Duyck
This change essentially does two things.  First it folds the swiotlb_unmap
and swiotlb_sync calls into their callers.  The goal behind this is threefold.
First, it helps to reduce execution time and improves performance
since we aren't having to call into as many functions.  Second, it allows us
to split up some of the sync functionality, as there is the dma_mark_clean
portion of the sync call that is only really needed for dma_sync_for_cpu,
since we don't actually want to mark the page as clean if we are syncing
for the device.

The second change is to move dma_mark_clean inside the if statement instead
of using a return in the case of sync and unmap.  By doing this we make it
so that we can also add a dma_mark_dirty function later.

Signed-off-by: Alexander Duyck <adu...@mirantis.com>
---
 drivers/xen/swiotlb-xen.c |   90 ++---
 1 file changed, 36 insertions(+), 54 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 7399782c0998..2154c70e47da 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -432,9 +432,9 @@ EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);
  * After this call, reads by the cpu to the buffer are guaranteed to see
  * whatever the device wrote there.
  */
-static void xen_unmap_single(struct device *hwdev, dma_addr_t dev_addr,
-size_t size, enum dma_data_direction dir,
-struct dma_attrs *attrs)
+void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
+   size_t size, enum dma_data_direction dir,
+   struct dma_attrs *attrs)
 {
phys_addr_t paddr = xen_bus_to_phys(dev_addr);
 
@@ -448,23 +448,14 @@ static void xen_unmap_single(struct device *hwdev, 
dma_addr_t dev_addr,
return;
}
 
-   if (dir != DMA_FROM_DEVICE)
-   return;
-
/*
 * phys_to_virt doesn't work with hihgmem page but we could
 * call dma_mark_clean() with hihgmem page here. However, we
 * are fine since dma_mark_clean() is null on POWERPC. We can
 * make dma_mark_clean() take a physical address if necessary.
 */
-   dma_mark_clean(phys_to_virt(paddr), size);
-}
-
-void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
-   size_t size, enum dma_data_direction dir,
-   struct dma_attrs *attrs)
-{
-   xen_unmap_single(hwdev, dev_addr, size, dir, attrs);
+   if (dir == DMA_FROM_DEVICE)
+   dma_mark_clean(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_page);
 
@@ -478,36 +469,22 @@ EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_page);
  * address back to the card, you must first perform a
  * xen_swiotlb_dma_sync_for_device, and then the device again owns the buffer
  */
-static void
-xen_swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
-   size_t size, enum dma_data_direction dir,
-   enum dma_sync_target target)
+void
+xen_swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
+   size_t size, enum dma_data_direction dir)
 {
phys_addr_t paddr = xen_bus_to_phys(dev_addr);
 
BUG_ON(dir == DMA_NONE);
 
-   if (target == SYNC_FOR_CPU)
-   xen_dma_sync_single_for_cpu(hwdev, dev_addr, size, dir);
+   xen_dma_sync_single_for_cpu(hwdev, dev_addr, size, dir);
 
/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr))
-   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
-
-   if (target == SYNC_FOR_DEVICE)
-   xen_dma_sync_single_for_device(hwdev, dev_addr, size, dir);
-
-   if (dir != DMA_FROM_DEVICE)
-   return;
+   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, SYNC_FOR_CPU);
 
-   dma_mark_clean(phys_to_virt(paddr), size);
-}
-
-void
-xen_swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
-   size_t size, enum dma_data_direction dir)
-{
-   xen_swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_CPU);
+   if (dir == DMA_FROM_DEVICE)
+   dma_mark_clean(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_sync_single_for_cpu);
 
@@ -515,7 +492,16 @@ void
 xen_swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr,
   size_t size, enum dma_data_direction dir)
 {
-   xen_swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_DEVICE);
+   phys_addr_t paddr = xen_bus_to_phys(dev_addr);
+
+   BUG_ON(dir == DMA_NONE);
+
+   /* NOTE: We use dev_addr here, not paddr! */
+   if (is_xen_swiotlb_buffer(dev_addr))
+   swiotlb_tbl_sync_single(hwdev, paddr, size, dir,
+   SYNC_FOR_DEVICE);

[RFC PATCH 1/3] swiotlb: Fold static unmap and sync calls into calling functions

2015-12-13 Thread Alexander Duyck
This change essentially does two things.  First it folds the swiotlb_unmap
and swiotlb_sync calls into their callers.  The goal behind this is threefold.
First, it helps to reduce execution time and improves performance
since we aren't having to call into as many functions.  Second, it allows us
to split up some of the sync functionality, as there is the dma_mark_clean
portion of the sync call that is only really needed for dma_sync_for_cpu,
since we don't actually want to mark the page as clean if we are syncing
for the device.

The second change is to move dma_mark_clean inside the if statement instead
of using a return in the case of sync and unmap.  By doing this we make it
so that we can also add a dma_mark_dirty function later.

Signed-off-by: Alexander Duyck <adu...@mirantis.com>
---
 lib/swiotlb.c |   81 +++--
 1 file changed, 33 insertions(+), 48 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 76f29ecba8f4..384ac06217b2 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -781,8 +781,9 @@ EXPORT_SYMBOL_GPL(swiotlb_map_page);
  * After this call, reads by the cpu to the buffer are guaranteed to see
  * whatever the device wrote there.
  */
-static void unmap_single(struct device *hwdev, dma_addr_t dev_addr,
-size_t size, enum dma_data_direction dir)
+void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
+   size_t size, enum dma_data_direction dir,
+   struct dma_attrs *attrs)
 {
phys_addr_t paddr = dma_to_phys(hwdev, dev_addr);
 
@@ -793,23 +794,14 @@ static void unmap_single(struct device *hwdev, dma_addr_t 
dev_addr,
return;
}
 
-   if (dir != DMA_FROM_DEVICE)
-   return;
-
/*
 * phys_to_virt doesn't work with hihgmem page but we could
 * call dma_mark_clean() with hihgmem page here. However, we
 * are fine since dma_mark_clean() is null on POWERPC. We can
 * make dma_mark_clean() take a physical address if necessary.
 */
-   dma_mark_clean(phys_to_virt(paddr), size);
-}
-
-void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
-   size_t size, enum dma_data_direction dir,
-   struct dma_attrs *attrs)
-{
-   unmap_single(hwdev, dev_addr, size, dir);
+   if (dir == DMA_FROM_DEVICE)
+   dma_mark_clean(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL_GPL(swiotlb_unmap_page);
 
@@ -823,31 +815,21 @@ EXPORT_SYMBOL_GPL(swiotlb_unmap_page);
  * address back to the card, you must first perform a
  * swiotlb_dma_sync_for_device, and then the device again owns the buffer
  */
-static void
-swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
-   size_t size, enum dma_data_direction dir,
-   enum dma_sync_target target)
+void
+swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
+   size_t size, enum dma_data_direction dir)
 {
phys_addr_t paddr = dma_to_phys(hwdev, dev_addr);
 
BUG_ON(dir == DMA_NONE);
 
if (is_swiotlb_buffer(paddr)) {
-   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
+   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, SYNC_FOR_CPU);
return;
}
 
-   if (dir != DMA_FROM_DEVICE)
-   return;
-
-   dma_mark_clean(phys_to_virt(paddr), size);
-}
-
-void
-swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
-   size_t size, enum dma_data_direction dir)
-{
-   swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_CPU);
+   if (dir == DMA_FROM_DEVICE)
+   dma_mark_clean(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL(swiotlb_sync_single_for_cpu);
 
@@ -855,7 +837,14 @@ void
 swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr,
   size_t size, enum dma_data_direction dir)
 {
-   swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_DEVICE);
+   phys_addr_t paddr = dma_to_phys(hwdev, dev_addr);
+
+   BUG_ON(dir == DMA_NONE);
+
+   if (!is_swiotlb_buffer(paddr))
+   return;
+
+   swiotlb_tbl_sync_single(hwdev, paddr, size, dir, SYNC_FOR_DEVICE);
 }
 EXPORT_SYMBOL(swiotlb_sync_single_for_device);
 
@@ -929,10 +918,9 @@ swiotlb_unmap_sg_attrs(struct device *hwdev, struct 
scatterlist *sgl,
struct scatterlist *sg;
int i;
 
-   BUG_ON(dir == DMA_NONE);
-
for_each_sg(sgl, sg, nelems, i)
-   unmap_single(hwdev, sg->dma_address, sg_dma_len(sg), dir);
+   swiotlb_unmap_page(hwdev, sg->dma_address, sg_dma_len(sg),
+  dir, attrs);
 
 }
 EXPORT_SYMBOL(swiotlb_unmap_sg_attrs);
@@ -952,32 +940,29 @@ EXPORT_SYMBOL(swiotlb_unmap_sg);
 * The same as swiotlb_sync_single_* but for a scatter-gather list.

[RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2015-12-13 Thread Alexander Duyck
This patch set is meant to be the guest side code for a proof of concept
involving leaving pass-through devices in the guest during the warm-up
phase of guest live migration.  In order to accomplish this I have added a
new function called dma_mark_dirty that will mark the pages associated with
the DMA transaction as dirty in the case of either an unmap or a
sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
the stop-and-copy phase, however allowing the device to be present should
significantly improve the performance of the guest during the warm-up
period.

This current implementation is very preliminary and there are a number of
items still missing.  Specifically, in order to make this a more complete
solution we need to support:
1.  Notifying hypervisor that drivers are dirtying DMA pages received
2.  Bypassing page dirtying when it is not needed.

The two mechanisms referenced above would likely require coordination with
QEMU and as such are open to discussion.  I haven't attempted to address
them as I am not sure there is a consensus as of yet.  My personal
preference would be to add a vendor-specific configuration block to the
emulated pci-bridge interfaces created by QEMU that would allow us to
essentially extend shpc to support guest live migration with pass-through
devices.

The functionality in this patch set is currently disabled by default.  To
enable it you can select "SWIOTLB page dirtying" from the "Processor type
and features" menu.

---

Alexander Duyck (3):
  swiotlb: Fold static unmap and sync calls into calling functions
  xen/swiotlb: Fold static unmap and sync calls into calling functions
  x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest


 arch/arm/include/asm/dma-mapping.h   |3 +
 arch/arm64/include/asm/dma-mapping.h |5 +-
 arch/ia64/include/asm/dma.h  |1 
 arch/mips/include/asm/dma-mapping.h  |1 
 arch/powerpc/include/asm/swiotlb.h   |1 
 arch/tile/include/asm/dma-mapping.h  |1 
 arch/unicore32/include/asm/dma-mapping.h |1 
 arch/x86/Kconfig |   11 
 arch/x86/include/asm/swiotlb.h   |   26 
 drivers/xen/swiotlb-xen.c|   92 +-
 lib/swiotlb.c|   83 ---
 11 files changed, 123 insertions(+), 102 deletions(-)



[RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-13 Thread Alexander Duyck
This patch is meant to provide the guest with a way of flagging DMA pages
as being dirty to the host when using a direct-assign device within a
guest.  The advantage to this approach is that it is fairly simple; however,
it currently has a significant impact on device performance in all the
scenarios where it isn't needed.

As such this is really meant only as a proof of concept and to get the ball
rolling in terms of figuring out how best to approach the issue of dirty
page tracking for a guest that is using a direct assigned device.  In
addition with just this patch it should be possible to modify current
migration approaches so that instead of having to hot-remove the device
before starting the migration this can instead be delayed until the period
before the final stop and copy.

Signed-off-by: Alexander Duyck <adu...@mirantis.com>
---
 arch/arm/include/asm/dma-mapping.h   |3 ++-
 arch/arm64/include/asm/dma-mapping.h |5 ++---
 arch/ia64/include/asm/dma.h  |1 +
 arch/mips/include/asm/dma-mapping.h  |1 +
 arch/powerpc/include/asm/swiotlb.h   |1 +
 arch/tile/include/asm/dma-mapping.h  |1 +
 arch/unicore32/include/asm/dma-mapping.h |1 +
 arch/x86/Kconfig |   11 +++
 arch/x86/include/asm/swiotlb.h   |   26 ++
 drivers/xen/swiotlb-xen.c|6 ++
 lib/swiotlb.c|6 ++
 11 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/asm/dma-mapping.h 
b/arch/arm/include/asm/dma-mapping.h
index ccb3aa64640d..1962d7b471c7 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -167,7 +167,8 @@ static inline bool dma_capable(struct device *dev, 
dma_addr_t addr, size_t size)
return 1;
 }
 
-static inline void dma_mark_clean(void *addr, size_t size) { }
+static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 extern int arm_dma_set_mask(struct device *dev, u64 dma_mask);
 
diff --git a/arch/arm64/include/asm/dma-mapping.h 
b/arch/arm64/include/asm/dma-mapping.h
index 61e08f360e31..8d24fe11c8a3 100644
--- a/arch/arm64/include/asm/dma-mapping.h
+++ b/arch/arm64/include/asm/dma-mapping.h
@@ -84,9 +84,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t 
addr, size_t size)
return addr + size - 1 <= *dev->dma_mask;
 }
 
-static inline void dma_mark_clean(void *addr, size_t size)
-{
-}
+static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_DMA_MAPPING_H */
diff --git a/arch/ia64/include/asm/dma.h b/arch/ia64/include/asm/dma.h
index 4d97f60f1ef5..d92ebeb2758e 100644
--- a/arch/ia64/include/asm/dma.h
+++ b/arch/ia64/include/asm/dma.h
@@ -20,5 +20,6 @@ extern unsigned long MAX_DMA_ADDRESS;
 #define free_dma(x)
 
 void dma_mark_clean(void *addr, size_t size);
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 #endif /* _ASM_IA64_DMA_H */
diff --git a/arch/mips/include/asm/dma-mapping.h b/arch/mips/include/asm/dma-mapping.h
index e604f760c4a0..567f6e03e337 100644
--- a/arch/mips/include/asm/dma-mapping.h
+++ b/arch/mips/include/asm/dma-mapping.h
@@ -28,6 +28,7 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
 }
 
 static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 #include <dma-coherence.h>
 
diff --git a/arch/powerpc/include/asm/swiotlb.h b/arch/powerpc/include/asm/swiotlb.h
index de99d6e29430..b694e8399e28 100644
--- a/arch/powerpc/include/asm/swiotlb.h
+++ b/arch/powerpc/include/asm/swiotlb.h
@@ -16,6 +16,7 @@
 extern struct dma_map_ops swiotlb_dma_ops;
 
 static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 extern unsigned int ppc_swiotlb_enable;
 int __init swiotlb_setup_bus_notifier(void);
diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
index 96ac6cce4a32..79953f09e938 100644
--- a/arch/tile/include/asm/dma-mapping.h
+++ b/arch/tile/include/asm/dma-mapping.h
@@ -58,6 +58,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
 }
 
 static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops)
 {
diff --git a/arch/unicore32/include/asm/dma-mapping.h b/arch/unicore32/include/asm/dma-mapping.h
index 8140e053ccd3..b9d357ab122d 100644
--- a/arch/unicore32/include/asm/dma-mapping.h
+++ b/arch/unicore32/include/asm/dma-mapping.h
@@ -49,6 +49,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
 }
 
 static inline void dma_mark_clean(void *addr,
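
The hunks for x86 and lib/swiotlb.c are cut off in this copy of the
patch.  As a rough sketch of the mechanism the description above refers
to -- marking pages dirty in the swiotlb unmap path when the device may
have written to them -- the hook would look something along these lines.
The function and helper names here are assumptions modeled on the
4.4-era lib/swiotlb.c, not the literal missing hunk:

/*
 * Sketch: flag pages the device may have written as dirty before the
 * CPU regains ownership, so hypervisor dirty-page logging picks them up.
 */
static void swiotlb_unmap_mark(struct device *hwdev, dma_addr_t dev_addr,
			       size_t size, enum dma_data_direction dir)
{
	phys_addr_t paddr = dma_to_phys(hwdev, dev_addr);

	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
		dma_mark_dirty(phys_to_virt(paddr), size);
}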

Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2015-12-13 Thread Alexander Duyck
On Sun, Dec 13, 2015 at 9:22 PM, Yang Zhang <yang.zhang...@gmail.com> wrote:
> On 2015/12/14 12:54, Alexander Duyck wrote:
>>
>> On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang <yang.zhang...@gmail.com>
>> wrote:
>>>
>>> On 2015/12/14 5:28, Alexander Duyck wrote:
>>>>
>>>>
>>>> This patch set is meant to be the guest side code for a proof of concept
>>>> involving leaving pass-through devices in the guest during the warm-up
>>>> phase of guest live migration.  In order to accomplish this I have added
>>>> a
>>>> new function called dma_mark_dirty that will mark the pages associated
>>>> with
>>>> the DMA transaction as dirty in the case of either an unmap or a
>>>> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
>>>> DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
>>>> the stop-and-copy phase, however allowing the device to be present
>>>> should
>>>> significantly improve the performance of the guest during the warm-up
>>>> period.
>>>>
>>>> This current implementation is very preliminary and there are a number of
>>>> items still missing.  Specifically in order to make this a more complete
>>>> solution we need to support:
>>>> 1.  Notifying hypervisor that drivers are dirtying DMA pages received
>>>> 2.  Bypassing page dirtying when it is not needed.
>>>>
>>>
>>> Shouldn't current log dirty mechanism already cover them?
>>
>>
>> The guest has no way of currently knowing that the hypervisor is doing
>> dirty page logging, and the log dirty mechanism currently has no way
>> of tracking device DMA accesses.  This change is meant to bridge the
>> two so that the guest device driver will force the SWIOTLB DMA API to
>> mark pages written to by the device as dirty.
>
>
> OK. This is what we called the "dummy write mechanism". Actually, this is
> just a workaround before the IOMMU dirty bit is ready. Eventually, we need
> to change to use the hardware dirty bit. Besides, we may still lose data
> if DMA happens during or just before the stop-and-copy phase.

Right, this is a "dummy write mechanism" in order to allow for entry
tracking.  This only works completely if we force the hardware to
quiesce via a hot-plug event before we reach the stop-and-copy phase
of the migration.

The IOMMU dirty bit approach is likely going to have a significant
number of challenges involved.  Looking over the driver and the data
sheet it looks like the current implementation is using a form of huge
pages in the IOMMU, as such we will need to tear that down and replace
it with 4K pages if we don't want to dirty large regions with each DMA
transaction, and I'm not sure that is something we can change while
DMA is active to the affected regions.  In addition the data sheet
references the fact that the page table entries are stored in a
translation cache and in order to sync things up you have to
invalidate the entries.  I'm not sure what the total overhead would be
for invalidating something like a half million 4K pages to migrate a
guest with just 2G of RAM, but I would think that might be a bit
expensive given the fact that IOMMU accesses aren't known for being
incredibly fast when invalidating DMA on the host.

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Alexander Duyck
On Thu, Dec 10, 2015 at 8:11 AM, Michael S. Tsirkin  wrote:
> On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>> >> Ideally, it is able to leave guest driver unmodified but it requires
>> >> the hypervisor or qemu to be aware of the device, which means we may
>> >> need a driver in the hypervisor or qemu to handle the device on behalf
>> >> of the guest driver.
>> >Can you answer the question of when do you use your code -
>> >at the start of migration or
>> >just before the end?
>>
>> Just before stopping VCPU in this version and inject VF mailbox irq to
>> notify the driver if the irq handler is installed.
>> Qemu side also will check this via the faked PCI migration capability
>> and driver will set the status during device open() or resume() callback.
>
> Right, this is the "good path" optimization. Whether this buys anything
> as compared to just sending reset to the device when VCPU is stopped
> needs to be measured. In any case, we probably do need a way to
> interrupt driver on destination to make it reconfigure the device -
> otherwise it might take seconds for it to notice.  And a way to make
> sure driver can handle this surprise reset so we can block migration if
> it can't.

The question is how do we handle the "bad path"?  From what I can tell
it seems like we would have to have the dirty page tracking for DMA
handled in the host in order to support that.  Otherwise we risk
corrupting the memory in the guest as there are going to be a few
stale pages that end up being in the guest.

The easiest way to probably flag a "bad path" migration would be to
emulate a Manually-operated Retention Latch being opened and closed on
the device.  It may even allow us to work with the desire to support a
means for doing a pause/resume as that would be a hot-plug event where
the latch was never actually opened.  Basically if the retention latch
is released and then re-closed it can be assumed that the device has
lost power and as a result been reset.  As such a normal hot-plug
controller would have to reconfigure the device in such an event.  The
key bit being that with the power being cycled on the port the
assumption is that the device has lost any existing state, and we
should emulate that as well by clearing any state Qemu might be
carrying such as the shadow of the MSI-X table.  In addition we could
also signal if the host supports the dirty page tracking via the IOMMU
so if needed the guest could trigger some sort of memory exception
handling due to the risk of memory corruption.

I would argue that we don't necessarily have to provide a means to
guarantee the driver can support a surprise removal/reset.  Worst case
scenario is that it would be equivalent to somebody pulling the plug
on an externally connected PCIe cage in a physical host.  I know the
Intel Ethernet drivers have already had to add support for surprise
removal due to the fact that such a scenario can occur on Thunderbolt
enabled platforms.  Since it is acceptable for physical hosts to have
such an event occur I think we could support the same type of failure
for direct assigned devices in guests.  That would be the one spot
where I would say it is up to the drivers to figure out how they are
going to deal with it since this is something that can occur for any
given driver on any given OS assuming it can be plugged into an
externally removable cage.

- Alex


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Alexander Duyck
On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu  wrote:
>
>
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>
>>> Ideally, it is able to leave guest driver unmodified but it requires the
>>> hypervisor or qemu to be aware of the device, which means we may need a
>>> driver in the hypervisor or qemu to handle the device on behalf of the
>>> guest driver.
>>
>> Can you answer the question of when do you use your code -
>> at the start of migration or
>> just before the end?
>
>
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

The VF mailbox interrupt is a very bad idea.  Really the device should
be in a reset state on the other side of a migration.  It doesn't make
sense to have the interrupt firing if the device is not configured.
This is one of the things that is preventing you from being able to
migrate the device while the interface is administratively down or the
VF driver is not loaded.

My thought on all this is that it might make sense to move this
functionality into a PCI-to-PCI bridge device and make it a
requirement that all direct-assigned devices have to exist behind that
device in order to support migration.  That way you would be working
with a directly emulated device that would likely already be
supporting hot-plug anyway.  Then it would just be a matter of coming
up with a few Qemu specific extensions that you would need to add to
the device itself.  The same approach would likely be portable enough
that you could achieve it with PCIe as well via the same configuration
space being present on the upstream side of a PCIe port or maybe a
PCIe switch of some sort.

It would then be possible to signal via your vendor-specific PCI
capability on that device that all devices behind this bridge require
DMA page dirtying, you could use the configuration in addition to the
interrupt already provided for hot-plug to signal things like when you
are starting migration, and possibly even just extend the shpc
functionality so that if this capability is present you have the
option to pause/resume instead of remove/probe the device in the case
of certain hot-plug events.  The fact is there may be some use for a
pause/resume type approach for PCIe hot-plug in the near future
anyway.  From the sounds of it Apple has required it for all
Thunderbolt device drivers so that they can halt the device in order
to shuffle resources around, perhaps we should look at something
similar for Linux.
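
To make that concrete, a hypothetical layout for such a vendor-specific
capability is sketched below.  Every field name and bit assignment here
is an illustrative assumption, not a defined interface:

#include <linux/types.h>

/* Hypothetical vendor-specific capability exposed by the emulated
 * PCI-to-PCI bridge; field names and bit meanings are invented for
 * illustration only.
 */
struct mig_bridge_cap {
	u8 cap_id;	/* PCI_CAP_ID_VNDR (0x09) */
	u8 cap_next;	/* next capability pointer */
	u8 cap_len;	/* length of this capability */
	u8 host_ctl;	/* bit 0: migration starting
			 * bit 1: pause requested instead of remove
			 * bit 2: DMA page dirtying required
			 */
	u8 guest_status; /* bit 0: dirty tracking active (guest ack)
			  * bit 1: devices behind bridge quiesced
			  */
	u8 reserved[3];
};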

The other advantage behind grouping functions on one bridge is things
like reset domains.  The PCI error handling logic will want to be able
to reset any devices that experienced an error in the event of
something such as a surprise removal.  By grouping all of the devices
you could disable/reset/enable them as one logical group in the event
of something such as the "bad path" approach Michael has mentioned.

>>>> It would be great if we could avoid changing the guest; but at least
>>>> your guest driver changes don't actually seem to be that hardware
>>>> specific; could your changes actually be moved to generic PCI level so
>>>> they could be made to work for lots of drivers?
>>>
>>> It is impossible to use one common solution for all devices unless the
>>> PCIE spec documents it clearly and I think one day it will be there. But
>>> before that, we need some workarounds on guest driver to make it work
>>> even if it looks ugly.
>
>
> Yes, so far there is not hardware migration support and it's hard to modify
> bus level code. It also will block implementation on the Windows.

Please don't assume things.  Unless you have hard data from Microsoft
that says they want it this way lets just try to figure out what works
best for us for now and then we can start worrying about third party
implementations after we have figured out a solution that actually
works.

- Alex


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Alexander Duyck
On Wed, Dec 9, 2015 at 1:28 AM, Lan, Tianyu <tianyu@intel.com> wrote:
>
>
> On 12/8/2015 1:12 AM, Alexander Duyck wrote:
>>
>> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu <tianyu@intel.com> wrote:
>>>
>>> On 12/5/2015 1:07 AM, Alexander Duyck wrote:
>>>>>
>>>>>
>>>>>
>>>>> We still need to support Windows guest for migration and this is why
>>>>> our
>>>>> patches keep all changes in the driver since it's impossible to change
>>>>> Windows kernel.
>>>>
>>>>
>>>>
>>>> That is a poor argument.  I highly doubt Microsoft is interested in
>>>> having to modify all of the drivers that will support direct assignment
>>>> in order to support migration.  They would likely request something
>>>> similar to what I have in that they will want a way to do DMA tracking
>>>> with minimal modification required to the drivers.
>>>
>>>
>>>
>>> This totally depends on the NIC or other devices' vendors and they
>>> should make decision to support migration or not. If yes, they would
>>> modify driver.
>>
>>
>> Having to modify every driver that wants to support live migration is
>> a bit much.  In addition I don't see this being limited only to NIC
>> devices.  You can direct assign a number of different devices, your
>> solution cannot be specific to NICs.
>
>
> We are also adding such migration support for the QAT device and so our
> solution will not be limited to NICs. This is just the beginning.

Agreed, but still QAT is networking related.  My advice would be to
look at something else that works from within a different subsystem
such as storage.  All I am saying is that your solution is very
networking centric.

> We can't limit user to only use Linux guest. So the migration feature
> should work for both Windows and Linux guest.

Right now what your solution is doing is to limit things so that only
the Intel NICs can support this since it will require driver
modification across the board.  Instead what I have proposed should
make it so that once you have done the work there should be very
little work that has to be done on your port to support any device.

>>
>>> If just target to call suspend/resume during migration, the feature will
>>> be meaningless. Most cases don't want to affect user during migration
>>> a lot and so the service down time is vital. Our target is to apply
>>> SRIOV NIC passthough to cloud service and NFV(network functions
>>> virtualization) projects which are sensitive to network performance
>>> and stability. In my opinion, we should give the device driver a chance
>>> to implement its own migration job. Call suspend and resume
>>> callback in the driver if it doesn't care the performance during
>>> migration.
>>
>>
>> The suspend/resume callback should be efficient in terms of time.
>> After all we don't want the system to stall for a long period of time
>> when it should be either running or asleep.  Having it burn cycles in
>> a power state limbo doesn't do anyone any good.  If nothing else maybe
>> it will help to push the vendors to speed up those functions which
>> then benefit migration and the system sleep states.
>
>
> If we can benefit both migration and suspend, that would be wonderful.
> But migration and system pm is still different. Just for example,
> driver doesn't need to put device into deep D-status during migration
> and host can do this after migration while it's essential for
> system sleep. PCI configure space and interrupt config is emulated by
> Qemu and Qemu can migrate these configures to new machine. Driver
> doesn't need to deal with such thing. So I think migration still needs a
> different callback or different code path than device suspend/resume.

SR-IOV devices are considered to be in D3 as soon as you clear the bus
master enable bit.  They don't actually have a PCIe power management
block in their configuration space.  The advantage of the
suspend/resume approach is that the D0->D3->D0 series of transitions
should trigger a PCIe reset on the device.  As such the resume call is
capable of fully reinitializing a device.

As far as migrating the interrupts themselves moving live interrupts
is problematic.  You are more likely to throw them out of sync since
the state of the device will not match the state of what you migrated
for things like the pending bit array so if there is a device that
actually depending on those bits you might run into issues.

> Another concern is that we have to rework PM core or PCI bus driver

Re: live migration vs device assignment (motivation)

2015-12-09 Thread Alexander Duyck
On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu  wrote:

> For other kind of devices, it's hard to work.
> We are also adding migration support for QAT(QuickAssist Technology) device.
>
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
>
> PCI hotplug will not work for such devices during migration and these
> operations will fail when unplugging the device.

I assume the problem is that with a PCI hotplug event you are losing
the state information for the device, do I have that right?

Looking over the QAT drivers it doesn't seem like any of them support
the suspend/resume PM calls.  I would imagine it makes it difficult
for a system with a QAT card in it to be able to drop the system to a
low power state.  You might want to try enabling suspend/resume
support for the devices on bare metal before you attempt to take on
migration as it would provide you with a good testing framework to see
what needs to be saved/restored within the device and in what order
before you attempt to do the same while migrating from one system to
another.

- Alex


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-07 Thread Alexander Duyck
On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu <tianyu@intel.com> wrote:
> On 12/5/2015 1:07 AM, Alexander Duyck wrote:
>>>
>>>
>>> We still need to support Windows guest for migration and this is why our
>>> patches keep all changes in the driver since it's impossible to change
>>> Windows kernel.
>>
>>
>> That is a poor argument.  I highly doubt Microsoft is interested in
>> having to modify all of the drivers that will support direct assignment
>> in order to support migration.  They would likely request something
>> similar to what I have in that they will want a way to do DMA tracking
>> with minimal modification required to the drivers.
>
>
> This totally depends on the NIC or other devices' vendors and they
> should make decision to support migration or not. If yes, they would
> modify driver.

Having to modify every driver that wants to support live migration is
a bit much.  In addition I don't see this being limited only to NIC
devices.  You can direct assign a number of different devices, your
solution cannot be specific to NICs.

> If just target to call suspend/resume during migration, the feature will
> be meaningless. Most cases don't want to affect user during migration
> a lot and so the service down time is vital. Our target is to apply
> SRIOV NIC passthough to cloud service and NFV(network functions
> virtualization) projects which are sensitive to network performance
> and stability. From my opinion, We should give a change for device
> driver to implement itself migration job. Call suspend and resume
> callback in the driver if it doesn't care the performance during migration.

The suspend/resume callback should be efficient in terms of time.
After all we don't want the system to stall for a long period of time
when it should be either running or asleep.  Having it burn cycles in
a power state limbo doesn't do anyone any good.  If nothing else maybe
it will help to push the vendors to speed up those functions which
then benefit migration and the system sleep states.

Also you keep assuming you can keep the device running while you do
the migration and you can't.  You are going to corrupt the memory if
you do, and you have yet to provide any means to explain how you are
going to solve that.


>
>>
>>> Following is my idea to do DMA tracking.
>>>
>>> Inject event to VF driver after memory iterate stage
>>> and before stop VCPU and then VF driver marks dirty all
>>> using DMA memory. The new allocated pages also need to
>>> be marked dirty before stopping VCPU. All dirty memory
>>> in this time slot will be migrated until stop-and-copy
>>> stage. We also need to make sure to disable VF via clearing the
>>> bus master enable bit for VF before migrating these memory.
>>
>>
>> The ordering of your explanation here doesn't quite work.  What needs to
>> happen is that you have to disable DMA and then mark the pages as dirty.
>>   What the disabling of the BME does is signal to the hypervisor that
>> the device is now stopped.  The ixgbevf_suspend call already supported
>> by the driver is almost exactly what is needed to take care of something
>> like this.
>
>
> This is why I hope to reserve a piece of space in the dma page to do dummy
> write. This can help to mark page dirty while not require to stop DMA and
> not race with DMA data.

You can't and it will still race.  What concerns me is that your
patches and the document you referenced earlier show a considerable
lack of understanding about how DMA and device drivers work.  There is
a reason why device drivers have so many memory barriers and the like
in them.  The fact is when you have CPU and a device both accessing
memory things have to be done in a very specific order and you cannot
violate that.

If you have a contiguous block of memory you expect the device to
write into you cannot just poke a hole in it.  Such a situation is not
supported by any hardware that I am aware of.

As far as writing to dirty the pages it only works so long as you halt
the DMA and then mark the pages dirty.  It has to be in that order.
Any other order will result in data corruption and I am sure the NFV
customers definitely don't want that.
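
A minimal sketch of that ordering, assuming a dma_mark_dirty()-style
helper and using the existing pci_clear_master() call; the buffer
bookkeeping is a stand-in for however a given driver tracks its mapped
receive buffers:

#include <linux/pci.h>

struct vf_rx_buf {
	void *vaddr;
	size_t len;
};

static void vf_quiesce_then_dirty(struct pci_dev *pdev,
				  struct vf_rx_buf *bufs, unsigned int n)
{
	unsigned int i;

	/* 1. Halt DMA by clearing Bus Master Enable. */
	pci_clear_master(pdev);

	/* 2. Only after DMA is stopped, dirty the receive pages. */
	for (i = 0; i < n; i++)
		dma_mark_dirty(bufs[i].vaddr, bufs[i].len);
}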

> If can't do that, we have to stop DMA in a short time to mark all dma
> pages dirty and then reenable it. I am not sure how much we can get by
> this way to track all DMA memory with device running during migration. I
> need to do some tests and compare results with stop DMA diretly at last
> stage during migration.

We have to halt the DMA before we can complete the migration.  So
please feel free to test this.

In addition I still feel you would be better off taking this in
smaller steps.  I still say your first step would be to come up with a
generic solution for the dirty page tracking like the dma_mark_clean()
approach I had mentioned earlier.

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-07 Thread Alexander Duyck
On Mon, Dec 7, 2015 at 9:39 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Dec 07, 2015 at 09:12:08AM -0800, Alexander Duyck wrote:
>> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu <tianyu@intel.com> wrote:
>> > On 12/5/2015 1:07 AM, Alexander Duyck wrote:

>> > If can't do that, we have to stop DMA in a short time to mark all dma
>> > pages dirty and then reenable it. I am not sure how much we can get by
>> > this way to track all DMA memory with device running during migration. I
>> > need to do some tests and compare results with stop DMA diretly at last
>> > stage during migration.
>>
>> We have to halt the DMA before we can complete the migration.  So
>> please feel free to test this.
>>
>> In addition I still feel you would be better off taking this in
>> smaller steps.  I still say your first step would be to come up with a
>> generic solution for the dirty page tracking like the dma_mark_clean()
>> approach I had mentioned earlier.  If I get time I might try to take
>> care of it myself later this week since you don't seem to agree with
>> that approach.
>
> Or even try to look at the dirty bit in the VT-D PTEs
> on the host. See the mail I have just sent.
> Might be slower, or might be faster, but is completely
> transparent.

I just saw it and I am looking over the VT-d spec now.  It looks like
there might be some performance impacts if software is changing the
PTEs, since then the VT-d hardware cannot cache them.  I still have to
do some more reading though so I can fully understand the impacts.

>> >>
>> >> The question is how we would go about triggering it.  I really don't
>> >> think the PCI configuration space approach is the right idea.
>> >>  I wonder
>> >> if we couldn't get away with some sort of ACPI event instead.  We
>> >> already require ACPI support in order to shut down the system
>> >> gracefully, I wonder if we couldn't get away with something similar in
>> >> order to suspend/resume the direct assigned devices gracefully.
>> >>
>> >
>> > I don't think there is such events in the current spec.
>> > Otherwise, There are two kinds of suspend/resume callbacks.
>> > 1) System suspend/resume called during S2RAM and S2DISK.
>> > 2) Runtime suspend/resume called by pm core when device is idle.
>> > If you want to do what you mentioned, you have to change PM core and
>> > ACPI spec.
>>
>> The thought I had was to somehow try to move the direct assigned
>> devices into their own power domain and then simulate a AC power event
>> where that domain is switched off.  However I don't know if there are
>> ACPI events to support that since the power domain code currently only
>> appears to be in use for runtime power management.
>>
>> That had also given me the thought to look at something like runtime
>> power management for the VFs.  We would need to do a runtime
>> suspend/resume.  The only problem is I don't know if there is any way
>> to get the VFs to do a quick wakeup.  It might be worthwhile looking
>> at trying to check with the ACPI experts out there to see if there is
>> anything we can do as bypassing having to use the configuration space
>> mechanism to signal this would definitely be worth it.
>
> I don't much like this idea because it relies on the
> device being exactly the same across source/destination.
> After all, this is always true for suspend/resume.
> Most users do not have control over this, and you would
> often get sightly different versions of firmware,
> etc without noticing.

The original code was operating on that assumption as well.  That is
kind of why I suggested suspend/resume rather than reinventing the
wheel.

> I think we should first see how far along we can get
> by doing a full device reset, and only carrying over
> high level state such as IP, MAC, ARP cache etc.

One advantage of the suspend/resume approach is that it is compatible
with a full reset.  The suspend/resume approach assumes the device
goes through a D0->D3->D0 reset as a part of transitioning between the
system states.

I do admit though that the PCI spec says you aren't supposed to be
hot-swapping devices while the system is in a sleep state so odds are
you would encounter issues if the device changed in any significant
way.

>> >>> The dma page allocated by VF driver also needs to reserve space
>> >>> to do dummy write.
>> >>
>> >>
>> >> No, this will not work.  If for example you have a VF driver allocating
>> >> memory for a 9K receive how will that work?  It isn't as if you can
>> >> poke a hole in the contiguous memory.

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-04 Thread Alexander Duyck

On 12/04/2015 08:32 AM, Lan, Tianyu wrote:

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.

We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.


That is a poor argument.  I highly doubt Microsoft is interested in 
having to modify all of the drivers that will support direct assignment 
in order to support migration.  They would likely request something 
similar to what I have in that they will want a way to do DMA tracking 
with minimal modification required to the drivers.



Following is my idea to do DMA tracking.

Inject event to VF driver after memory iterate stage
and before stop VCPU and then VF driver marks dirty all
using DMA memory. The new allocated pages also need to
be marked dirty before stopping VCPU. All dirty memory
in this time slot will be migrated until stop-and-copy
stage. We also need to make sure to disable VF via clearing the
bus master enable bit for VF before migrating these memory.


The ordering of your explanation here doesn't quite work.  What needs to 
happen is that you have to disable DMA and then mark the pages as dirty. 
 What the disabling of the BME does is signal to the hypervisor that 
the device is now stopped.  The ixgbevf_suspend call already supported 
by the driver is almost exactly what is needed to take care of something 
like this.


The question is how we would go about triggering it.  I really don't 
think the PCI configuration space approach is the right idea.  I wonder 
if we couldn't get away with some sort of ACPI event instead.  We 
already require ACPI support in order to shut down the system 
gracefully, I wonder if we couldn't get away with something similar in 
order to suspend/resume the direct assigned devices gracefully.



The dma page allocated by VF driver also needs to reserve space
to do dummy write.


No, this will not work.  If for example you have a VF driver allocating 
memory for a 9K receive how will that work?  It isn't as if you can poke 
a hole in the contiguous memory.




Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-12-03 Thread Alexander Duyck
On Wed, Dec 2, 2015 at 6:08 AM, Lan, Tianyu  wrote:
> On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote:
>>>
>>> But
>>> it requires guest OS to do specific configurations inside and rely on
>>> bonding driver which blocks it work on Windows.
>>>  From performance side,
>>> putting VF and virtio NIC under bonded interface will affect their
>>> performance even when not do migration. These factors block to use VF
>>> NIC passthough in some user cases(Especially in the cloud) which require
>>> migration.
>>
>>
>> That's really up to guest. You don't need to do bonding,
>> you can just move the IP and mac from userspace, that's
>> possible on most OS-es.
>>
>> Or write something in guest kernel that is more lightweight if you are
>> so inclined. What we are discussing here is the host-guest interface,
>> not the in-guest interface.
>>
>>> Current solution we proposed changes NIC driver and Qemu. Guest Os
>>> doesn't need to do special thing for migration.
>>> It's easy to deploy
>>
>>
>>
>> Except of course these patches don't even work properly yet.
>>
>> And when they do, even minor changes in host side NIC hardware across
>> migration will break guests in hard to predict ways.
>
>
> Switching between PV and VF NIC will introduce network stop and the
> latency of hotplug VF is measurable. For some user cases(cloud service
> and OPNFV) which are sensitive to network stabilization and performance,
> these are not friend and blocks SRIOV NIC usage in these case. We hope
> to find a better way to make SRIOV NIC work in these cases and this is
> worth to do since SRIOV NIC provides better network performance compared
> with PV NIC. Current patches have some issues. I think we can find
> solutions for them and improve them step by step.

I still believe the concepts being put into use here are deeply
flawed.  You are assuming you can somehow complete the migration while
the device is active and I seriously doubt that is the case.  You are
going to cause data corruption or worse cause a kernel panic when you
end up corrupting the guest memory.

You have to halt the device at some point in order to complete the
migration.  Now I fully agree it is best to do this for as small a
window as possible.  I really think that your best approach would be
embrace and extend the current solution that is making use of bonding.
The first step being to make it so that you don't have to hot-plug the
VF until just before you halt the guest instead of before you start the
migration.  Just doing that would yield a significant gain in terms of
performance during the migration.  In addition something like that
should be able to be done without having to be overly invasive into
the drivers.  A few tweaks to the DMA API and you could probably have
that resolved.

As far as avoiding the hot-plug itself that would be better handled as
a separate follow-up, and really belongs more to the PCI layer than
the NIC device drivers.  The device drivers should already have code
for handling a suspend/resume due to a power cycle event.  If you
could make use of that then it is just a matter of implementing
something in the hot-plug or PCIe drivers that would allow QEMU to
signal when the device needs to go into D3 and when it can resume
normal operation at D0.  You could probably use the PCI Bus Master
Enable bit as the test of whether the device is ready for migration.
If the bit is set you cannot migrate the VM, and if it is cleared then
you are ready to migrate.
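
A sketch of that readiness test on the QEMU side; the helper name is
made up, and it assumes the VF's config space shadow is visible at
dev->config the way it is for emulated devices:

/* Hypothetical check: the VF is safe to migrate only once the guest
 * driver has cleared Bus Master Enable, i.e. DMA is halted.
 */
static bool vf_ready_for_migration(PCIDevice *dev)
{
    uint16_t cmd = pci_get_word(dev->config + PCI_COMMAND);

    return !(cmd & PCI_COMMAND_MASTER);
}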


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Alexander Duyck
On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Dec 01, 2015 at 11:04:31PM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/1/2015 12:07 AM, Alexander Duyck wrote:
>> >They can only be corrected if the underlying assumptions are correct
>> >and they aren't.  Your solution would have never worked correctly.
>> >The problem is you assume you can keep the device running when you are
>> >migrating and you simply cannot.  At some point you will always have
>> >to stop the device in order to complete the migration, and you cannot
>> >stop it before you have stopped your page tracking mechanism.  So
>> >unless the platform has an IOMMU that is somehow taking part in the
>> >dirty page tracking you will not be able to stop the guest and then
>> >the device, it will have to be the device and then the guest.
>> >
>> >>>Doing suspend and resume() may help to do migration easily but some
>> >>>devices requires low service down time. Especially network and I got
>> >>>that some cloud company promised less than 500ms network service downtime.
>> >Honestly focusing on the downtime is getting the cart ahead of the
>> >horse.  First you need to be able to do this without corrupting system
>> >memory and regardless of the state of the device.  You haven't even
>> >gotten to that state yet.  Last I knew the device had to be up in
>> >order for your migration to even work.
>>
>> I think the issue is that the content of an RX packet delivered to the
>> stack may be changed during migration because the piece of memory won't be
>> migrated to the new machine. This may confuse applications or the stack.
>> The current dummy-write solution can ensure the content of a packet won't
>> change after the dummy write, while the content may not be the received
>> data if migration happens before that point. We can recheck the content
>> via a checksum or CRC in the protocol after the dummy write to ensure the
>> content is what the VF received. I think the stack already does such
>> checks and the packet will be dropped if it fails them.
>
>
> Most people nowadays rely on hardware checksums so I don't think this can
> fly.

Correct.  The checksum/crc approach will not work since it is possible
for a checksum to even be mangled in the case of some features such as
LRO or GRO.

>> Another way is to tell Qemu all the memory the driver is using and let
>> Qemu migrate that memory after stopping the VCPU and the device. This
>> seems safe but the implementation may be complex.
>
> Not really 100% safe.  See below.
>
> I think hiding these details behind dma_* API does have
> some appeal. In any case, it gives us a good
> terminology as it covers what most drivers do.

That was kind of my thought.  If we were to build our own
dma_mark_clean() type function that will mark the DMA region dirty on
sync or unmap then that is half the battle right there as we would be
able to at least keep the regions consistent after they have left the
driver.

> There are several components to this:
> - dma_map_* needs to prevent page from
>   being migrated while device is running.
>   For example, expose some kind of bitmap from guest
>   to host, set bit there while page is mapped.
>   What happens if we stop the guest and some
>   bits are still set? See dma_alloc_coherent below
>   for some ideas.

Yeah, I could see something like this working.  Maybe we could do
something like what was done for the NX bit and make use of the upper
order bits beyond the limits of the memory range to mark pages as
non-migratable?

I'm curious.  What we have with a DMA mapped region is essentially
shared memory between the guest and the device.  How would we resolve
something like this with IVSHMEM, or are we blocked there as well in
terms of migration?
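
A sketch of the bitmap idea under discussion, with all names invented:
one bit per guest page frame in a region shared with the host, set while
the page is DMA-mapped so the host knows not to migrate it:

#include <linux/bitops.h>

/* Hypothetical guest-side bookkeeping: a bitmap shared with the host,
 * one bit per pfn, published at a location the host knows about.
 */
static unsigned long *dma_inflight_bitmap;

static void dma_page_mapped(unsigned long pfn)
{
	set_bit(pfn, dma_inflight_bitmap);	/* pinned: do not migrate */
}

static void dma_page_unmapped(unsigned long pfn)
{
	clear_bit(pfn, dma_inflight_bitmap);	/* safe to migrate again */
}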

> - dma_unmap_* needs to mark page as dirty
>   This can be done by writing into a page.
>
> - dma_sync_* needs to mark page as dirty
>   This is trickier as we can not change the data.
>   One solution is using atomics.
>   For example:
> int x = ACCESS_ONCE(*p);
> cmpxchg(p, x, x);
>   Seems to do a write without changing page
>   contents.

Like I said we can probably kill 2 birds with one stone by just
implementing our own dma_mark_clean() for x86 virtualized
environments.

I'd say we could take your solution one step further and just use 0
instead of bothering to read the value.  After all it won't write the
area if the value at the offset is not 0.  The only downside is that
this is a locked operation so we will take a pretty serious
performance penalty when this is active.  As such my preference would

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Alexander Duyck
On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:
>> On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin <m...@redhat.com> wrote:

>> > There are several components to this:
>> > - dma_map_* needs to prevent page from
>> >   being migrated while device is running.
>> >   For example, expose some kind of bitmap from guest
>> >   to host, set bit there while page is mapped.
>> >   What happens if we stop the guest and some
>> >   bits are still set? See dma_alloc_coherent below
>> >   for some ideas.
>>
>> Yeah, I could see something like this working.  Maybe we could do
>> something like what was done for the NX bit and make use of the upper
>> order bits beyond the limits of the memory range to mark pages as
>> non-migratable?
>>
>> I'm curious.  What we have with a DMA mapped region is essentially
>> shared memory between the guest and the device.  How would we resolve
>> something like this with IVSHMEM, or are we blocked there as well in
>> terms of migration?
>
> I have some ideas. Will post later.

I look forward to it.

>> > - dma_unmap_* needs to mark page as dirty
>> >   This can be done by writing into a page.
>> >
>> > - dma_sync_* needs to mark page as dirty
>> >   This is trickier as we can not change the data.
>> >   One solution is using atomics.
>> >   For example:
>> > int x = ACCESS_ONCE(*p);
>> > cmpxchg(p, x, x);
>> >   Seems to do a write without changing page
>> >   contents.
>>
>> Like I said we can probably kill 2 birds with one stone by just
>> implementing our own dma_mark_clean() for x86 virtualized
>> environments.
>>
>> I'd say we could take your solution one step further and just use 0
>> instead of bothering to read the value.  After all it won't write the
>> area if the value at the offset is not 0.
>
> Really almost any atomic that has no side effect will do.
> atomic or with 0
> atomic and with 
>
> It's just that cmpxchg already happens to have a portable
> wrapper.

I was originally thinking maybe an atomic_add with 0 would be the way
to go.  Either way though we still are using a locked prefix and
having to dirty a cache line per page which is going to come at some
cost.
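
Putting the cmpxchg trick into a dma_mark_dirty()-style helper, as a
sketch.  The assumption doing the work is that a locked cmpxchg needs
write access to the page even when the compare fails, so it trips
write-protection-based dirty logging without altering the buffer:

static inline void dma_mark_dirty(void *addr, size_t size)
{
	unsigned long off = (unsigned long)addr & ~PAGE_MASK;
	char *page = (char *)addr - off;
	long remaining = size + off;

	while (remaining > 0) {
		/* Write-intent access; contents are left unchanged. */
		cmpxchg((unsigned long *)page, 0UL, 0UL);
		page += PAGE_SIZE;
		remaining -= PAGE_SIZE;
	}
}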

>> > - dma_alloc_coherent memory (e.g. device rings)
>> >   must be migrated after device stopped modifying it.
>> >   Just stopping the VCPU is not enough:
>> >   you must make sure device is not changing it.
>> >
>> >   Or maybe the device has some kind of ring flush operation,
>> >   if there was a reasonably portable way to do this
>> >   (e.g. a flush capability could maybe be added to SRIOV)
>> >   then hypervisor could do this.
>>
>> This is where things start to get messy. I was suggesting the
>> suspend/resume to resolve this bit, but it might be possible to also
>> deal with this via something like this via clearing the bus master
>> enable bit for the VF.  If I am not mistaken that should disable MSI-X
>> interrupts and halt any DMA.  That should work as long as you have
>> some mechanism that is tracking the pages in use for DMA.
>
> A bigger issue is recovering afterwards.

Agreed.

>> >   In case you need to resume on source, you
>> >   really need to follow the same path
>> >   as on destination, preferably detecting
>> >   device reset and restoring the device
>> >   state.
>>
>> The problem with detecting the reset is that you would likely have to
>> be polling to do something like that.
>
> We could send some event to the guest to notify it about this event
> through a new or existing channel.
>
> Or we could make it possible for userspace to trigger this,
> then notify guest through the guest agent.

The first thing that comes to mind would be to use something like PCIe
Advanced Error Reporting, however I don't know if we can put a
requirement on the system supporting the q35 machine type or not in
order to support migration.

>>  I believe the fm10k driver
>> already has code like that in place where it will detect a reset as a
>> part of its watchdog, however the response time is something like 2
>> seconds for that.  That was one of the reasons I preferred something
>> like hot-plug as that should be functioning as soon as the guest is up
>> and it is a mechanism that operates outside of the VF drivers.
>
> That's pretty minor.
> A bigger issue is making sure guest does not crash
> when device is suddenly reset under its legs.

I know the ixgbevf driver should already have logic to address some of
that.  If you look through the code there should be logic there for a
surprise removal support in ixgbevf.  The only issue is that unlike
fm10k it will not restore itself after a resume or slot_reset call.


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-30 Thread Alexander Duyck
On Sun, Nov 29, 2015 at 10:53 PM, Lan, Tianyu <tianyu@intel.com> wrote:
> On 11/26/2015 11:56 AM, Alexander Duyck wrote:
>>
>> > I am not saying you cannot modify the drivers, however what you are
>> doing is far too invasive.  Do you seriously plan on modifying all of
>> the PCI device drivers out there in order to allow any device that
>> might be direct assigned to a port to support migration?  I certainly
>> hope not.  That is why I have said that this solution will not scale.
>
>
> Current drivers are not migration friendly. If the driver wants to
> support migration, it's necessary to be changed.

Modifying all of the drivers directly will not solve the issue though.
This is why I have suggested looking at possibly implementing
something like dma_mark_clean() which is used for ia64 architectures
to mark pages that were DMAed in as clean.  In your case though you
would want to mark such pages as dirty so that the page migration will
notice them and move them over.

> RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and
> DMA tracking during migration. These are common for most drivers and
> they may be problematic in the previous version but can be corrected later.

They can only be corrected if the underlying assumptions are correct
and they aren't.  Your solution would have never worked correctly.
The problem is you assume you can keep the device running when you are
migrating and you simply cannot.  At some point you will always have
to stop the device in order to complete the migration, and you cannot
stop it before you have stopped your page tracking mechanism.  So
unless the platform has an IOMMU that is somehow taking part in the
dirty page tracking you will not be able to stop the guest and then
the device, it will have to be the device and then the guest.

> Doing suspend and resume() may help to do migration easily but some
> devices requires low service down time. Especially network and I got
> that some cloud company promised less than 500ms network service downtime.

Honestly focusing on the downtime is getting the cart ahead of the
horse.  First you need to be able to do this without corrupting system
memory and regardless of the state of the device.  You haven't even
gotten to that state yet.  Last I knew the device had to be up in
order for your migration to even work.

Many devices are very state driven.  As such you cannot just freeze
them and restore them like you would regular device memory.  That is
where something like suspend/resume comes in because it already takes
care of getting the device ready for halt, and then resume.  Keep in
mind that those functions were meant to function on a device doing
something like a suspend to RAM or disk.  This is not too far off from
what a migration is doing since you need to halt the guest before you
move it.

As such the first step is to make it so that we can do the current
bonding approach with one change.  Specifically we want to leave the
device in the guest until the last portion of the migration instead of
having to remove it first.  To that end I would suggest focusing on
solving the DMA problem via something like a dma_mark_clean() type
solution as that would be one issue resolved and we all would see an
immediate gain instead of just those users of the ixgbevf driver.

> So I think performance effect also should be taken into account when we
> design the framework.

What you are proposing I would call premature optimization.  You need
to actually solve the problem before you can start optimizing things
and I don't see anything actually solved yet since your solution is
too unstable.

>>
>> What I am counter proposing seems like a very simple proposition.  It
>> can be implemented in two steps.
>>
>> 1.  Look at modifying dma_mark_clean().  It is a function called in
>> the sync and unmap paths of the lib/swiotlb.c.  If you could somehow
>> modify it to take care of marking the pages you unmap for Rx as being
>> dirty it will get you a good way towards your goal as it will allow
>> you to continue to do DMA while you are migrating the VM.
>>
>> 2.  Look at making use of the existing PCI suspend/resume calls that
>> are there to support PCI power management.  They have everything
>> needed to allow you to pause and resume DMA for the device before and
>> after the migration while retaining the driver state.  If you can
>> implement something that allows you to trigger these calls from the
>> PCI subsystem such as hot-plug then you would have a generic solution
>> that can be easily reproduced for multiple drivers beyond those
>> supported by ixgbevf.
>
>
> Glanced at PCI hotplug code. The hotplug events are triggered by PCI hotplug
> controller and these events are defined in the controller spec.
> It's ha

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Alexander Duyck
On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu <tianyu@intel.com> wrote:
> On 2015年11月25日 13:30, Alexander Duyck wrote:
>> No, what I am getting at is that you can't go around and modify the
>> configuration space for every possible device out there.  This
>> solution won't scale.
>
>
> PCI config space regs are emulated by Qemu and so we can find free
> PCI config space regs for the faked PCI capability. Its position need
> not be permanent.

Yes, but do you really want to edit every driver on every OS that you
plan to support this on.  What about things like direct assignment of
regular Ethernet ports?  What you really need is a solution that will
work generically on any existing piece of hardware out there.

>>  If you instead moved the logic for notifying
>> the device into a separate mechanism such as making it a part of the
>> hot-plug logic then you only have to write the code once per OS in
>> order to get the hot-plug capability to pause/resume the device.  What
>> I am talking about is not full hot-plug, but rather to extend the
>> existing hot-plug in Qemu and the Linux kernel to support a
>> "pause/resume" functionality.  The PCI hot-plug specification calls
>> out the option of implementing something like this, but we don't
>> currently have support for it.
>>
>
> Could you elaborate the part of PCI hot-plug specification you mentioned?
>
> My concern is whether it needs to change PCI spec or not.


In the PCI Hot-Plug Specification 1.1, in section 4.1.2 it states:
 In addition to quiescing add-in card activity, an operating-system
vendor may optionally implement a less drastic “pause” capability, in
anticipation of the same or a similar add-in card being reinserted.

The idea I had was basically if we were to implement something like
that in Linux then we could pause/resume the device instead of
outright removing it.  The pause functionality could make use of the
suspend/resume functionality most drivers already have for PCI power
management.
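
As a sketch, the pause half of such an event could reuse the legacy PCI
driver PM entry points of this era; pci_pause_device() itself is
hypothetical, while the .suspend callback and the pci_save_state() and
pci_set_power_state() helpers are the existing machinery:

#include <linux/pci.h>

static int pci_pause_device(struct pci_dev *pdev)
{
	struct pci_driver *drv = pdev->driver;
	int err = 0;

	/* Let the driver quiesce exactly as it would for suspend-to-RAM. */
	if (drv && drv->suspend)
		err = drv->suspend(pdev, PMSG_SUSPEND);
	if (err)
		return err;

	pci_save_state(pdev);
	pci_set_power_state(pdev, PCI_D3hot);
	return 0;
}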

- Alex


Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-25 Thread Alexander Duyck
On Wed, Nov 25, 2015 at 8:02 AM, Lan, Tianyu  wrote:
> On 11/25/2015 8:28 PM, Michael S. Tsirkin wrote:
>>
>> Frankly, I don't really see what this short term hack buys us,
>> and if it goes in, we'll have to maintain it forever.
>>
>
> The framework of how to notify VF about migration status won't be
> changed regardless of stopping VF or not before doing migration.
> We hope to reach agreement on this first. Tracking dirty memory still
> needs more discussion and we will continue working on it. Stopping the VF may
> help to work around the issue and make tracking easier.

The problem is you still have to stop the device at some point for the
same reason why you have to halt the VM.  You seem to think you can
get by without doing that but you can't.  All you do is open the
system up to multiple races if you leave the device running.  The goal
should be to avoid stopping the device until the last possible moment,
however it will still have to be stopped eventually.  It isn't as if
you can migrate memory and leave the device doing DMA and expect to
get a clean state.

I agree with Michael.  The focus needs to be on first addressing dirty
page tracking.  Once you have that you could use a variation on the
bonding solution where you postpone the hot-plug event until near the
end of the migration just before you halt the guest instead of having
to do it before you start the migration.  Then after that we could
look at optimizing things further by introducing a variation that you
could further improve on things by introducing a variation of hot-plug
that would pause the device as I suggested instead of removing it.  At
that point you should be able to have almost all of the key issues
addresses so that you could drop the bond interface entirely.

>> Also, assuming you just want to do ifdown/ifup for some reason, it's
>> easy enough to do using a guest agent, in a completely generic way.
>>
>
> Just ifdown/ifup is not enough for migration. It needs to restore some PCI
> settings before doing ifup on the target machine

That is why I have been suggesting making use of suspend/resume logic
that is already in place for PCI power management.  In the case of a
suspend/resume we already have to deal with the fact that the device
will go through a D0->D3->D0 reset so we have to restore all of the
existing state.  It would take a significant load off of Qemu since
the guest would be restoring its own state instead of making Qemu have
to do all of the device migration work.


Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-25 Thread Alexander Duyck
On Wed, Nov 25, 2015 at 8:39 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Nov 25, 2015 at 08:24:38AM -0800, Alexander Duyck wrote:
>> >> Also, assuming you just want to do ifdown/ifup for some reason, it's
>> >> easy enough to do using a guest agent, in a completely generic way.
>> >>
>> >
>> > Just ifdown/ifup is not enough for migration. It needs to restore some PCI
>> > settings before doing ifup on the target machine
>>
>> That is why I have been suggesting making use of suspend/resume logic
>> that is already in place for PCI power management.  In the case of a
>> suspend/resume we already have to deal with the fact that the device
>> will go through a D0->D3->D0 reset so we have to restore all of the
>> existing state.  It would take a significant load off of Qemu since
>> the guest would be restoring its own state instead of making Qemu have
>> to do all of the device migration work.
>
> That can work, though again, the issue is you need guest
> cooperation to migrate.

Right now the problem is you need to have guest cooperation anyway as
you need to have some way of tracking the dirty pages.  If the IOMMU
on the host were to provide some sort of dirty page tracking then we
could exclude the guest from the equation, but until then we need the
guest to notify us of what pages it is letting the device dirty.  I'm
still of the opinion that the best way to go there is to just modify
the DMA API that is used in the guest so that it supports some sort of
page flag modification or something along those lines so we can track
all of the pages that might be written to by the device.

> If you reset device on destination instead of restoring state,
> then that issue goes away, but maybe the downtime
> will be increased.

Yes, the downtime will be increased, but it shouldn't be by much.
Depending on the setup a VF with a single queue can have about 3MB of
data outstanding when you move the driver over.  After that it is just
a matter of bringing the interface back up which should take only a
few hundred milliseconds assuming the PF is fairly responsive.

> Will it really? I think it's worth it to start with the
> simplest solution (reset on destination) and see
> what the effect is, then add optimizations.

Agreed.  My thought would be to start with something like
dma_mark_clean() that could be used to take care of marking the pages
for migration when they are unmapped or synced.

> One thing that I've been thinking about for a while, is saving (some)
> state speculatively.  For example, notify guest a bit before migration
> is done, so it can save device state. If guest responds quickly, you
> have state that can be restored.  If it doesn't, still migrate, and it
> will have to reset on destination.

I'm not sure how much more device state we really need to save.  The
driver in the guest has to have enough state to recover in the event
of a device failure resulting in a slot reset.  To top it off the
driver is able to reconfigure things probably as quick as we could if
we were restoring the state.


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Alexander Duyck
On Wed, Nov 25, 2015 at 7:15 PM, Dong, Eddie <eddie.d...@intel.com> wrote:
>> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu <tianyu@intel.com> wrote:
>> > On 2015-11-25 13:30, Alexander Duyck wrote:
>> >> No, what I am getting at is that you can't go around and modify the
>> >> configuration space for every possible device out there.  This
>> >> solution won't scale.
>> >
>> >
>> > PCI config space regs are emulated by Qemu and so we can find free
>> > PCI config space regs for the faked PCI capability. Its position
>> > need not be permanent.
>>
>> Yes, but do you really want to edit every driver on every OS that you plan to
>> support this on?  What about things like direct assignment of regular 
>> Ethernet
>> ports?  What you really need is a solution that will work generically on any
>> existing piece of hardware out there.
>
> The fundamental assumption of this patch series is to modify the driver in 
> the guest to self-emulate or track the device state, so that migration may 
> be possible.
> I don't think we can modify the OS without modifying the drivers, even using 
> the PCIe hotplug mechanism.
> In the meantime, modifying the Windows OS is a big challenge given that only 
> Microsoft can do so.  Modifying the driver, by contrast, is relatively simple 
> and manageable for device vendors, if the device vendor wants to support 
> state-clone based migration.

The problem is the code you are presenting, even as a proof of concept,
is seriously flawed.  It does a poor job of exposing how any of this
can be duplicated for any other VF beyond the one you are working
on.

I am not saying you cannot modify the drivers, however what you are
doing is far too invasive.  Do you seriously plan on modifying all of
the PCI device drivers out there in order to allow any device that
might be direct assigned to a port to support migration?  I certainly
hope not.  That is why I have said that this solution will not scale.

What I am counter-proposing seems like a very simple proposition.  It
can be implemented in two steps (a sketch follows after step 2).

1.  Look at modifying dma_mark_clean().  It is a function called in
the sync and unmap paths of lib/swiotlb.c.  If you could somehow
modify it to take care of marking the pages you unmap for Rx as being
dirty it will get you a good way towards your goal as it will allow
you to continue to do DMA while you are migrating the VM.

2.  Look at making use of the existing PCI suspend/resume calls that
are there to support PCI power management.  They have everything
needed to allow you to pause and resume DMA for the device before and
after the migration while retaining the driver state.  If you can
implement something that allows you to trigger these calls from the
PCI subsystem such as hot-plug then you would have a generic solution
that can be easily reproduced for multiple drivers beyond those
supported by ixgbevf.
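
To make step 1 concrete: dma_mark_clean() is a no-op on most
architectures today, so a migration-aware variant has a free hook to
work with.  A minimal sketch, assuming the hypervisor's dirty logging
is triggered by CPU writes and that addr is page-aligned:

#include <linux/mm.h>

/* Sketch: called from the swiotlb sync/unmap paths; touch each page
 * once so the hypervisor's dirty-page tracking records it.  Real code
 * would need to respect highmem and only fire during migration. */
static void dma_mark_clean(void *addr, size_t size)
{
	volatile u8 *p = addr;
	size_t i;

	for (i = 0; i < size; i += PAGE_SIZE)
		p[i] = p[i];	/* dummy write dirties the page */
}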

Thanks.

- Alex


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Alexander Duyck
On Tue, Nov 24, 2015 at 7:18 PM, Lan Tianyu <tianyu@intel.com> wrote:
> On 2015-11-24 22:20, Alexander Duyck wrote:
>> I'm still not a fan of this approach.  I really feel like this is
>> something that should be resolved by extending the existing PCI hot-plug
>> rather than trying to instrument this per driver.  Then you will get the
>> goodness for multiple drivers and multiple OSes instead of just one.  An
>> added advantage to dealing with this in the PCI hot-plug environment
>> would be that you could then still do a hot-plug even if the guest
>> didn't load a driver for the VF since you would be working with the PCI
>> slot instead of the device itself.
>>
>> - Alex
>
> Hi Alex:
> What you mentioned seems to be the bonding driver solution.
> The paper "Live Migration with Pass-through Device for Linux VM" describes
> it. It does VF hotplug during migration. In order to maintain the network
> connection when the VF is out, it takes advantage of the Linux bonding
> driver to switch between the VF NIC and an emulated NIC. But the side
> effects are that it requires the VM to do additional configuration and
> performance while switching between the two NICs is not good.

No, what I am getting at is that you can't go around and modify the
configuration space for every possible device out there.  This
solution won't scale.  If you instead moved the logic for notifying
the device into a separate mechanism such as making it a part of the
hot-plug logic then you only have to write the code once per OS in
order to get the hot-plug capability to pause/resume the device.  What
I am talking about is not full hot-plug, but rather to extend the
existing hot-plug in Qemu and the Linux kernel to support a
"pause/resume" functionality.  The PCI hot-plug specification calls
out the option of implementing something like this, but we don't
currently have support for it.

I just feel doing it through PCI hot-plug messages will scale much
better as you could likely make use of the power management
suspend/resume calls to take care of most of the needed implementation
details.
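
As a hedged illustration of how thin that glue could be (the function
names here are invented; nothing like them exists in the PCI core
today), a "pause" request arriving via hot-plug could simply be routed
into the driver's existing PM callbacks:

#include <linux/errno.h>
#include <linux/pci.h>
#include <linux/pm.h>

static int pci_slot_pause(struct pci_dev *pdev)
{
	const struct dev_pm_ops *pm =
		pdev->dev.driver ? pdev->dev.driver->pm : NULL;

	/* reuse the suspend path the driver already has for PM */
	if (pm && pm->suspend)
		return pm->suspend(&pdev->dev);
	return -ENOSYS;
}

static int pci_slot_resume(struct pci_dev *pdev)
{
	const struct dev_pm_ops *pm =
		pdev->dev.driver ? pdev->dev.driver->pm : NULL;

	if (pm && pm->resume)
		return pm->resume(&pdev->dev);
	return -ENOSYS;
}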

- Alex


Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Alexander Duyck
On Tue, Nov 24, 2015 at 1:20 PM, Michael S. Tsirkin  wrote:
> On Tue, Nov 24, 2015 at 09:38:18PM +0800, Lan Tianyu wrote:
>> This patch is to add migration support for ixgbevf driver. Using
>> faked PCI migration capability table communicates with Qemu to
>> share migration status and mailbox irq vector index.
>>
>> Qemu will notify VF via sending MSIX msg to trigger mailbox
>> vector during migration and store migration status in the
>> PCI_VF_MIGRATION_VMM_STATUS regs in the new capability table.
>> The mailbox irq will be triggered just before the stop-and-copy stage
>> and after migration on the target machine.
>>
>> The VF driver will put the net device down when it detects migration and
>> tell Qemu it's ready for migration by writing the
>> PCI_VF_MIGRATION_VF_STATUS reg. After migration, it puts the net device
>> up again.
>>
>> Qemu will be in charge of migrating the PCI config space regs and MSI-X
>> config.
>>
>> The patch is dedicated to the normal case where net traffic works and the
>> mailbox irq is enabled. For other cases (such as the driver isn't loaded,
>> or the adapter is suspended or closed), the mailbox irq won't be triggered
>> and the VF driver will disable it via the PCI_VF_MIGRATION_CAP reg. These
>> cases will be resolved later.
>>
>> Signed-off-by: Lan Tianyu 
>
> I have to say, I was much more interested in the idea
> of tracking dirty memory. I have some thoughts about
> that one - did you give up on it then?

The tracking of dirty pages still needs to be addressed unless the
interface is being downed before migration even starts which based on
other comments I am assuming is not the case.

I still feel that having a means of marking a page as being dirty when
it is unmapped would be the best way to go.  That way you only have to
update the DMA API instead of messing with each and every driver
trying to add code to force the page to be dirtied.

- Alex


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Alexander Duyck

On 11/24/2015 05:38 AM, Lan Tianyu wrote:

This patchset is to propose a solution of adding live migration
support for SRIOV NIC.

During migration, Qemu needs to let the VF driver in the VM know about
migration start and end. Qemu adds a faked PCI migration capability
to help sync status between the two sides during migration.

Qemu triggers the VF's mailbox irq by sending an MSI-X msg when the
migration status changes. The VF driver tells Qemu its mailbox vector
index via the new PCI capability. In some cases (the NIC is suspended
or closed), the VF mailbox irq is freed and the VF driver can disable
irq injection via the new capability.

The VF driver will put the NIC down before migration and bring it up
again on the target machine.

Lan Tianyu (3):
   VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO
   PCI: Add macros for faked PCI migration capability
   Ixgbevf: Add migration support for ixgbevf driver

  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
  drivers/vfio/pci/vfio_pci.c   |  21 +
  drivers/vfio/pci/vfio_pci_config.c|  38 ++--
  drivers/vfio/pci/vfio_pci_private.h   |   5 ++
  include/uapi/linux/pci_regs.h |  18 +++-
  include/uapi/linux/vfio.h |  12 +++
  7 files changed, 194 insertions(+), 7 deletions(-)


I'm still not a fan of this approach.  I really feel like this is 
something that should be resolved by extending the existing PCI hot-plug 
rather than trying to instrument this per driver.  Then you will get the 
goodness for multiple drivers and multiple OSes instead of just one.  An 
added advantage to dealing with this in the PCI hot-plug environment 
would be that you could then still do a hot-plug even if the guest 
didn't load a driver for the VF since you would be working with the PCI 
slot instead of the device itself.


- Alex


Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-30 Thread Alexander Duyck

On 10/29/2015 07:41 PM, Lan Tianyu wrote:

On 2015-10-30 00:17, Alexander Duyck wrote:

On 10/29/2015 01:33 AM, Lan Tianyu wrote:

On 2015-10-29 14:58, Alexander Duyck wrote:

Your code was having to do a bunch of shuffling in order to get things
set up so that you could bring the interface back up.  I would argue
that it may actually be faster at least on the bring-up to just drop the
old rings and start over since it greatly reduced the complexity and the
amount of device related data that has to be moved.

If we give up the old ring after migration and keep DMA running before
stopping the VCPU, it seems we don't need to track the Tx/Rx descriptor
rings and just make sure that all Rx buffers delivered to the stack have
been migrated.

1) Dummy write the Rx buffer before checking the Rx descriptor to ensure
the packet is migrated first.

Don't dummy write the Rx descriptor.  You should only really need to
dummy write the Rx buffer and you would do so after checking the
descriptor, not before.  Otherwise you risk corrupting the Rx buffer
because it is possible for you to read the Rx buffer, DMA occurs, and
then you write back the Rx buffer and now you have corrupted the memory.


2) Make a copy of the Rx descriptor and then use the copied data to
check the buffer status. Don't use the original descriptor because it
won't be migrated and migration may happen between two accesses of the
Rx descriptor.

Do not just blindly copy the Rx descriptor ring.  That is a recipe for
disaster.  The problem is DMA has to happen in a very specific order for
things to function correctly.  The Rx buffer has to be written and then
the Rx descriptor.  The problem is you will end up getting a read-ahead
on the Rx descriptor ring regardless of which order you dirty things in.


Sorry, I didn't say it clearly.
I meant to copy one Rx descriptor when we receive the Rx irq and handle
the Rx ring.


No, I understood what you are saying.  My explanation was that it will 
not work.



The current code in ixgbevf_clean_rx_irq() checks the status of the Rx
descriptor to see whether its Rx buffer has been populated with data and
then reads the packet length from the Rx descriptor to handle the Rx
buffer.


That part you have correct.  However there are very explicit rules about 
the ordering of the reads.



My idea is to do the following three steps when receiving an Rx buffer
in ixgbevf_clean_rx_irq().

(1) dummy write the Rx buffer first,


You cannot dummy write the Rx buffer without first being given ownership 
of it.  In the driver this is handled in two phases. First we have to 
read the DD bit to see if it is set.  If it is we can take ownership of 
the buffer.  Second we have to either do a dma_sync_single_range_for_cpu 
or dma_unmap_page call so that we can guarantee the data has been moved 
to the buffer by the DMA API and that it knows it should no longer be 
accessing it.
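
A simplified sketch of that ordering, using the ixgbevf structures
(this is illustrative, not the driver's actual code; the caller is
assumed to supply the descriptor, buffer, and size):

static bool rx_desc_ready(struct ixgbevf_ring *rx_ring,
			  union ixgbe_adv_rx_desc *rx_desc,
			  struct ixgbevf_rx_buffer *rx_buffer,
			  unsigned int size)
{
	/* phase 1: the DD bit -- until it is set the buffer still
	 * belongs to the device and must not be touched by the CPU */
	if (!(rx_desc->wb.upper.status_error &
	      cpu_to_le32(IXGBE_RXD_STAT_DD)))
		return false;

	/* no reads of the rest of the descriptor or buffer may be
	 * reordered before the DD check */
	rmb();

	/* phase 2: reclaim the buffer from the DMA API; only after
	 * this may the CPU read it or dummy-write it */
	dma_sync_single_range_for_cpu(rx_ring->dev, rx_buffer->dma,
				      rx_buffer->page_offset, size,
				      DMA_FROM_DEVICE);
	return true;
}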



(2) make a copy of its Rx descriptor


This is not advisable.  Unless you can guarantee you are going to only 
read the descriptor after the DD bit is set you cannot guarantee that 
you won't race with device DMA.  The problem is you could have the 
migration occur right in the middle of (2).  If that occurs then you 
will have valid status bits, but the rest of the descriptor would be 
invalid data.



(3) Check the buffer status and get length from the copy.


I believe this is the assumption that is leading you down the wrong 
path.  You would have to read the status before you could do the copy.  
You cannot do it after.



Migration may happen at any time, for example between (1) and (2). If
the Rx buffer has been populated with data, the VF driver will not know
that on the new machine because the Rx descriptor isn't migrated. But
it's still safe.


The part I think you are not getting is that DMA can occur between (1) 
and (2).  So if, for example, you were doing your dummy write while DMA 
was occurring: you pull in your value, DMA occurs, you write your value, 
and now you have corrupted an Rx frame by writing stale data back into it.



Or it may happen between (2) and (3). The copy will be migrated to the
new machine and the Rx buffer is migrated first. If there is data in the
Rx buffer, the VF driver can still handle the buffer without migrating
the Rx descriptor.

The next buffers will be ignored since we don't migrate the Rx
descriptors for them. Their status will show as not completed on the new
machine.


You have kind of lost me on this part.  Why do you believe their 
statuses will not be completed?  How are you going to prevent the Rx 
descriptor ring from being migrated, given that it will be a dirty page 
by virtue of the fact that it is a bidirectional DMA mapping where the 
Rx path provides new buffers and writes those addresses in while the 
device writes back the status bits and length?  This is kind of what I 
was getting at.  The Rx descriptor ring will show up as one of the 
dirtiest spots on the driver since it is constantly being overwritten by 
the CPU in ixgbevf_alloc_rx_buffers.


Anyway we are kind of getting side tracked and I really think

Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-29 Thread Alexander Duyck

On 10/29/2015 01:33 AM, Lan Tianyu wrote:

On 2015-10-29 14:58, Alexander Duyck wrote:

Your code was having to do a bunch of shuffling in order to get things
set up so that you could bring the interface back up.  I would argue
that it may actually be faster at least on the bring-up to just drop the
old rings and start over since it greatly reduced the complexity and the
amount of device related data that has to be moved.

If we give up the old ring after migration and keep DMA running before
stopping the VCPU, it seems we don't need to track the Tx/Rx descriptor
rings and just make sure that all Rx buffers delivered to the stack have
been migrated.

1) Dummy write the Rx buffer before checking the Rx descriptor to ensure
the packet is migrated first.


Don't dummy write the Rx descriptor.  You should only really need to 
dummy write the Rx buffer and you would do so after checking the 
descriptor, not before.  Otherwise you risk corrupting the Rx buffer 
because it is possible for you to read the Rx buffer, DMA occurs, and 
then you write back the Rx buffer and now you have corrupted the memory.



2) Make a copy of the Rx descriptor and then use the copied data to check
the buffer status. Don't use the original descriptor because it won't be
migrated and migration may happen between two accesses of the Rx descriptor.


Do not just blindly copy the Rx descriptor ring.  That is a recipe for 
disaster.  The problem is DMA has to happen in a very specific order for 
things to function correctly.  The Rx buffer has to be written and then 
the Rx descriptor.  The problem is you will end up getting a read-ahead 
on the Rx descriptor ring regardless of which order you dirty things in.


The descriptor is only 16 bytes, you can fit 256 of them in a single 
page.  There is a good chance you probably wouldn't be able to migrate 
if you were under heavy network stress, however you could still have 
several buffers written in the time it takes for you to halt the VM and 
migrate the remaining pages.  Those buffers wouldn't be marked as dirty 
but odds are the page the descriptors are in would be.  As such you will 
end up with the descriptors but not the buffers.


The only way you could possibly migrate the descriptors rings cleanly 
would be to have enough knowledge about the layout of things to force 
the descriptor rings to be migrated first followed by all of the 
currently mapped Rx buffers.  In addition you would need to have some 
means of tracking all of the Rx buffers such as an emulated IOMMU as you 
would need to migrate all of them, not just part.  By doing it this way 
you would get the Rx descriptor rings in the earliest state possible and 
would be essentially emulating the Rx buffer writes occurring before the 
Rx descriptor writes.  You would likely have several Rx buffer writes 
that would be discarded in the process as there would be no descriptor 
for them but at least the state of the system would be consistent.


- Alex




Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-29 Thread Alexander Duyck

On 10/28/2015 11:12 PM, Lan Tianyu wrote:

On 2015-10-26 23:03, Alexander Duyck wrote:

No.  I think you are missing the fact that there are 256 descriptors per
page.  As such if you dirty just 1 you will be pulling in 255 more, of
which you may or may not have pulled in the receive buffer for.

So for example if you have the descriptor ring size set to 256 then that
means you are going to get whatever the descriptor ring has since you
will be marking the entire ring dirty with every packet processed,
however you cannot guarantee that you are going to get all of the
receive buffers unless you go through and flush the entire ring prior to
migrating.


Yes, that will be a problem. How about adding a tag to each Rx buffer and
checking the tag when delivering the Rx buffer to the stack? If the tag
has been overwritten, this means the packet data has been migrated.


Then you have to come up with a pattern that you can guarantee is the 
tag and not part of the packet data.  That isn't going to be something 
that is easy to do.  It would also have a serious performance impact on 
the VF.



This is why I have said you will need to do something to force the rings
to be flushed such as initiating a PM suspend prior to migrating.  You
need to do something to stop the DMA and flush the remaining Rx buffers
if you want to have any hope of being able to migrate the Rx in a
consistent state.  Beyond that the only other thing you have to worry
about are the Rx buffers that have already been handed off to the
stack.  However those should be handled if you do a suspend and somehow
flag pages as dirty when they are unmapped from the DMA.

- Alex

This will be simple and maybe our first version to enable migration. But
we still hope to find a way to avoid disabling DMA before stopping the
VCPU, to decrease service downtime.


You have to stop the Rx DMA at some point anyway.  It is the only means 
to guarantee that the device stops updating buffers and descriptors so 
that you will have a consistent state.


Your code was having to do a bunch of shuffling in order to get things 
set up so that you could bring the interface back up.  I would argue 
that it may actually be faster at least on the bring-up to just drop the 
old rings and start over since it greatly reduced the complexity and the 
amount of device related data that has to be moved.




Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-26 Thread Alexander Duyck

On 10/25/2015 10:36 PM, Lan Tianyu wrote:

On 2015-10-24 02:36, Alexander Duyck wrote:

I was thinking about it and I am pretty sure the dummy write approach is
problematic at best.  Specifically the issue is that while you are
performing a dummy write you risk pulling in descriptors for data that
hasn't been dummy written to yet.  So when you resume and restore your
descriptors you will have ones that may contain Rx descriptors
indicating they contain data when after the migration they don't.

How about changing the sequence? Dummy write the Rx packet data first and
then its desc. This can ensure that the Rx data is migrated before its
desc and prevent such a case.


No.  I think you are missing the fact that there are 256 descriptors per 
page.  As such if you dirty just 1 you will be pulling in 255 more, of 
which you may or may not have pulled in the receive buffer for.


So for example if you have the descriptor ring size set to 256 then that 
means you are going to get whatever the descriptor ring has since you 
will be marking the entire ring dirty with every packet processed, 
however you cannot guarantee that you are going to get all of the 
receive buffers unless you go through and flush the entire ring prior to 
migrating.


This is why I have said you will need to do something to force the rings 
to be flushed such as initiating a PM suspend prior to migrating.  You 
need to do something to stop the DMA and flush the remaining Rx buffers 
if you want to have any hope of being able to migrate the Rx in a 
consistent state.  Beyond that the only other thing you have to worry 
about are the Rx buffers that have already been handed off to the 
stack.  However those should be handled if you do a suspend and somehow 
flag pages as dirty when they are unmapped from the DMA.


- Alex


Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"

2015-10-25 Thread Alexander Duyck

On 10/24/2015 08:43 AM, Lan, Tianyu wrote:


On 10/22/2015 4:52 AM, Alexander Duyck wrote:

Also have you even considered the MSI-X configuration on the VF?  I
haven't seen anything anywhere that would have migrated the VF's MSI-X
configuration from BAR 3 on one system to the new system.


MSI-X migration is done by Hypervisor(Qemu).
Following link is my Qemu patch to do that.
http://marc.info/?l=kvm=144544706530484=2


I really don't like the idea of trying to migrate the MSI-X across from 
host to host while it is still active.  I really think Qemu shouldn't be 
moving this kind of data over in a migration.


I think that having the VF do a suspend/resume is the best way to go.  
Then it simplifies things as all you have to deal with is the dirty page 
tracking for the Rx DMA and you should be able to do this without making 
things too difficult.


- Alex


Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-23 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

This patchset is to propose a new solution to add live migration support
for the 82599 SRIOV network card.

In our solution, we prefer to put all device specific operations into the
VF and PF drivers and make the code in Qemu more general.


VF status migration
=
VF status can be divided into 4 parts
1) PCI config regs
2) MSI-X config
3) VF status in the PF driver
4) VF MMIO regs

The first three parts are all handled by Qemu.
The PCI config space regs and MSI-X config are already
stored in Qemu. To let Qemu save and restore the "VF status
in the PF driver" during migration, a new sysfs node "state_in_pf"
is added under the VF sysfs directory.

For VF MMIO regs, we introduce a self-emulation layer in the VF
driver to record MMIO reg values during MMIO reads and writes
and put the data in guest memory. It will be migrated with
guest memory to the new machine.


VF function restoration

Restoring VF function operation is done in the VF and PF drivers.

In order to let the VF driver know the migration status, Qemu fakes
VF PCI config regs to indicate the migration status and adds a new
sysfs node "notify_vf" to trigger the VF mailbox irq in order to
notify the VF about migration status changes.

Transmit/Receive descriptor head regs are read-only, can't be
restored by directly writing back the recorded reg values, and are
set to 0 during VF reset. To reuse the original Tx/Rx rings, shift
the desc ring in order to move the desc pointed to by the original
head reg to the first entry of the ring and then enable the Tx/Rx
rings. The VF restarts receiving and transmitting from the original
head desc.


Tracking DMA accessed memory
=
Migration relies on tracking dirty pages to migrate memory.
Hardware can't automatically mark a page as dirty after DMA
memory access. VF descriptor rings and data buffers are modified
by hardware when receiving and transmitting data. To track such dirty
memory manually, do dummy writes (read a byte and write it back) when
receiving and transmitting data.


I was thinking about it and I am pretty sure the dummy write approach is 
problematic at best.  Specifically the issue is that while you are 
performing a dummy write you risk pulling in descriptors for data that 
hasn't been dummy written to yet.  So when you resume and restore your 
descriptors you will have ones that may contain Rx descriptors 
indicating they contain data when after the migration they don't.


I really think the best approach to take would be to look at 
implementing an emulated IOMMU so that you could track DMA mapped pages 
and avoid migrating the ones marked as DMA_FROM_DEVICE until they are 
unmapped.  The advantage to this is that in the case of the ixgbevf 
driver it now reuses the same pages for Rx DMA.  As a result it will be 
rewriting the same pages often and if you are marking those pages as 
dirty and transitioning them it is possible for a flow of small packets 
to really make a mess of things since you would be rewriting the same 
pages in a loop while the device is processing packets.
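
A hedged sketch of the minimal bookkeeping such an emulated IOMMU (or a
migration-aware DMA layer) would need: remember which pfns are
currently mapped FROM_DEVICE and defer migrating them until unmap.
All names here are invented for illustration:

#include <linux/bitmap.h>
#include <linux/spinlock.h>

#define TRACKED_PFNS (1UL << 20)	/* 4GB of guest memory, 4K pages */

static DECLARE_BITMAP(dma_busy_pfns, TRACKED_PFNS);
static DEFINE_SPINLOCK(dma_busy_lock);

static void track_dma_map(unsigned long pfn)
{
	spin_lock(&dma_busy_lock);
	__set_bit(pfn, dma_busy_pfns);	/* skip this page while migrating */
	spin_unlock(&dma_busy_lock);
}

static void track_dma_unmap(unsigned long pfn)
{
	spin_lock(&dma_busy_lock);
	__clear_bit(pfn, dma_busy_pfns);	/* now safe (and dirty) to send */
	spin_unlock(&dma_busy_lock);
}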


Beyond that I would say you could suspend/resume the device in order to 
get it to stop and flush the descriptor rings and any outstanding 
packets.  The code for suspend would unmap the DMA memory which would 
then be the trigger to flush it across in the migration, and the resume 
code would take care of any state restoration needed beyond any values 
that can be configured with the ip link command.


If you wanted to do a proof of concept of this you could probably do so 
with very little overhead.  Basically you would need the "page_addr" 
portion of patch 12 to emulate a slightly migration aware DMA API, and 
then beyond that you would need something like patch 9 but instead of 
adding new functions and API you would be switching things on and off 
via the ixgbevf_suspend/resume calls.


- Alex










Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-23 Thread Alexander Duyck

On 10/23/2015 12:05 PM, Alex Williamson wrote:

On Fri, 2015-10-23 at 11:36 -0700, Alexander Duyck wrote:

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

This patchset is to propose a new solution to add live migration support
for the 82599 SRIOV network card.

In our solution, we prefer to put all device specific operations into the
VF and PF drivers and make the code in Qemu more general.


VF status migration
=
VF status can be divided into 4 parts
1) PCI config regs
2) MSI-X config
3) VF status in the PF driver
4) VF MMIO regs

The first three parts are all handled by Qemu.
The PCI config space regs and MSI-X config are already
stored in Qemu. To let Qemu save and restore the "VF status
in the PF driver" during migration, a new sysfs node "state_in_pf"
is added under the VF sysfs directory.

For VF MMIO regs, we introduce a self-emulation layer in the VF
driver to record MMIO reg values during MMIO reads and writes
and put the data in guest memory. It will be migrated with
guest memory to the new machine.


VF function restoration

Restoring VF function operation is done in the VF and PF drivers.

In order to let the VF driver know the migration status, Qemu fakes
VF PCI config regs to indicate the migration status and adds a new
sysfs node "notify_vf" to trigger the VF mailbox irq in order to
notify the VF about migration status changes.

Transmit/Receive descriptor head regs are read-only, can't be
restored by directly writing back the recorded reg values, and are
set to 0 during VF reset. To reuse the original Tx/Rx rings, shift
the desc ring in order to move the desc pointed to by the original
head reg to the first entry of the ring and then enable the Tx/Rx
rings. The VF restarts receiving and transmitting from the original
head desc.


Tracking DMA accessed memory
=
Migration relies on tracking dirty pages to migrate memory.
Hardware can't automatically mark a page as dirty after DMA
memory access. VF descriptor rings and data buffers are modified
by hardware when receiving and transmitting data. To track such dirty
memory manually, do dummy writes (read a byte and write it back) when
receiving and transmitting data.


I was thinking about it and I am pretty sure the dummy write approach is
problematic at best.  Specifically the issue is that while you are
performing a dummy write you risk pulling in descriptors for data that
hasn't been dummy written to yet.  So when you resume and restore your
descriptors you will have ones that may contain Rx descriptors
indicating they contain data when after the migration they don't.

I really think the best approach to take would be to look at
implementing an emulated IOMMU so that you could track DMA mapped pages
and avoid migrating the ones marked as DMA_FROM_DEVICE until they are
unmapped.  The advantage to this is that in the case of the ixgbevf
driver it now reuses the same pages for Rx DMA.  As a result it will be
rewriting the same pages often and if you are marking those pages as
dirty and transitioning them it is possible for a flow of small packets
to really make a mess of things since you would be rewriting the same
pages in a loop while the device is processing packets.


I'd be concerned that an emulated IOMMU on the DMA path would reduce
throughput to the point where we shouldn't even bother with assigning
the device in the first place and should be using virtio-net instead.
POWER systems have a guest visible IOMMU and it's been challenging for
them to get to 10Gbps, requiring real-mode tricks.  virtio-net may add
some latency, but it's not that hard to get it to 10Gbps and it already
supports migration.  An emulated IOMMU in the guest is really only good
for relatively static mappings, the latency for anything else is likely
too high.  Maybe there are shadow page table tricks that could help, but
it's imposing overhead the whole time the guest is running, not only on
migration.  Thanks,



The big overhead I have seen with IOMMU implementations is the fact that 
they almost always have some sort of locked table or tree that prevents 
multiple CPUs from accessing resources in any kind of timely fashion. 
As a result things like Tx are usually slowed down for network workloads 
when multiple CPUs are enabled.


I admit doing a guest visible IOMMU would probably add some overhead, 
but this current patch set as implemented already has some of the hints 
of that as the descriptor rings are locked which means we cannot unmap 
in the Tx clean-up while we are mapping on another Tx queue for instance.


One approach for this would be to implement or extend a lightweight DMA 
API such as swiotlb or nommu.  The code would need to have a bit in 
there so it can take care of marking the pages as dirty on sync_for_cpu 
and unmap calls when set for BIDIRECTIONAL or FROM_DEVICE.  Then if we 
cou

Re: [Qemu-devel] [RFC Patch 06/12] IXGBEVF: Add self emulation layer

2015-10-22 Thread Alexander Duyck

On 10/22/2015 05:50 AM, Michael S. Tsirkin wrote:

On Wed, Oct 21, 2015 at 01:58:19PM -0700, Alexander Duyck wrote:

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

In order to restore VF function after migration, add a self-emulation
layer to record reg values during reg accesses.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
  drivers/net/ethernet/intel/ixgbevf/Makefile|  3 ++-
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
  .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++
  drivers/net/ethernet/intel/ixgbevf/vf.h|  5 -
  4 files changed, 33 insertions(+), 3 deletions(-)
  create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile 
b/drivers/net/ethernet/intel/ixgbevf/Makefile
index 4ce4c97..841c884 100644
--- a/drivers/net/ethernet/intel/ixgbevf/Makefile
+++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
@@ -31,7 +31,8 @@
  obj-$(CONFIG_IXGBEVF) += ixgbevf.o
-ixgbevf-objs := vf.o \
+ixgbevf-objs := self-emulation.o \
+   vf.o \
  mbx.o \
  ethtool.o \
  ixgbevf_main.o
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..4446916 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
if (IXGBE_REMOVED(reg_addr))
return IXGBE_FAILED_READ_REG;
-   value = readl(reg_addr + reg);
+   value = ixgbe_self_emul_readl(reg_addr, reg);
if (unlikely(value == IXGBE_FAILED_READ_REG))
ixgbevf_check_remove(hw, reg);
return value;
diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c 
b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
new file mode 100644
index 000..d74b2da
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
@@ -0,0 +1,26 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vf.h"
+#include "ixgbevf.h"
+
+static u32 hw_regs[0x4000];
+
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
+{
+   u32 tmp;
+
+   tmp = readl(base + addr);
+   hw_regs[(unsigned long)addr] = tmp;
+
+   return tmp;
+}
+
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
+{
+   hw_regs[(unsigned long)addr] = val;
+   writel(val, (volatile void __iomem *)(base + addr));
+}

So I see what you are doing, however I don't think this adds much value.
Many of the key registers for the device are not simple Read/Write
registers.  Most of them are things like write 1 to clear or some other sort
of value where writing doesn't set the bit but has some other side effect.
Just take a look through the Datasheet at registers such as the VFCTRL,
VFMAILBOX, or most of the interrupt registers.  The fact is simply storing
the values off doesn't give you any real idea of what the state of things
are.

It doesn't, but I guess the point is to isolate the migration-related logic
in the recovery code.

An alternative would be to have some smart logic all over the place to
only store what's required - that would be much more intrusive.


After reviewing all of the patches yesterday I would say that almost all 
the values being stored aren't needed.  They can be restored from the 
settings of the driver itself anyway.  Copying the values out doesn't make 
much sense here since there are already enough caches for almost all of 
this data.


- Alex


Re: [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

Add "virtfn_index" member in the struct pci_device to record VF sequence
of PF. This will be used in the VF sysfs node handle.

Signed-off-by: Lan Tianyu 
---
  drivers/pci/iov.c   | 1 +
  include/linux/pci.h | 1 +
  2 files changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..065b6bb 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int 
reset)
virtfn->physfn = pci_dev_get(dev);
virtfn->is_virtfn = 1;
virtfn->multifunction = 0;
+   virtfn->virtfn_index = id;
  
  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {

res = >resource[i + PCI_IOV_RESOURCES];
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..85c5531 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -356,6 +356,7 @@ struct pci_dev {
unsigned intio_window_1k:1; /* Intel P2P bridge 1K I/O windows */
unsigned intirq_managed:1;
pci_dev_flags_t dev_flags;
+   unsigned intvirtfn_index;
atomic_tenable_cnt; /* pci_enable_device has been called */
  
  	u32		saved_config_space[16]; /* config space saved at suspend time */




Can't you just calculate the VF index based on the VF BDF number 
combined with the information in the PF BDF number and VF 
offset/stride?  Seems kind of pointless to add a variable that is only 
used by one driver and is in a slowpath when you can just calculate it 
pretty quickly.
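
For illustration, a hedged sketch of that calculation, inverting the
routing-ID formula patch 3 itself uses (rid = pf_devfn + offset +
stride * vfn); in general the offset and stride should come from the
PF's SR-IOV capability rather than the hard-coded constants reused
here:

static int ixgbe_vf_index_sketch(struct pci_dev *pf, struct pci_dev *vf)
{
	u16 vf_rid = (vf->bus->number << 8) | vf->devfn;
	u16 pf_rid = (pf->bus->number << 8) | pf->devfn;

	/* undo rid = pf_rid + IXGBE_SRIOV_VF_OFFSET + stride * vfn */
	return (vf_rid - pf_rid - IXGBE_SRIOV_VF_OFFSET) /
	       IXGBE_SRIOV_VF_STRIDE;
}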


- Alex


Re: [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

This patch adds the sysfs interface "state_in_pf" under the sysfs
directory of the VF PCI device, for Qemu to get and put VF status in the
PF driver during migration.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 -
  1 file changed, 155 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index ab2a2e2..89671eb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter 
*adapter)
return -ENOMEM;
  }
  
+#define IXGBE_PCI_VFCOMMAND   0x4

+#define IXGBE_PCI_VFMSIXMC0x72
+#define IXGBE_SRIOV_VF_OFFSET 0x180
+#define IXGBE_SRIOV_VF_STRIDE 0x2
+
+#define to_adapter(dev) ((struct ixgbe_adapter 
*)(pci_get_drvdata(to_pci_dev(dev)->physfn)))
+
+struct state_in_pf {
+   u16 command;
+   u16 msix_message_control;
+   struct vf_data_storage vf_data;
+};
+
+static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn)
+{
+   u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * 
vfn;
+   return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff);
+}
+
+static ssize_t ixgbe_show_state_in_pf(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct pci_dev *pdev = adapter->pdev, *vdev;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   struct ixgbe_hw *hw = >hw;
+   struct state_in_pf *state = (struct state_in_pf *)buf;
+   int vfn = vf_pdev->virtfn_index;
+   u32 reg, reg_offset, vf_shift;
+
+   /* Clear VF mac and disable VF */
+   ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, 
vfn);
+
+   /* Record PCI configurations */
+   vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+   if (vdev) {
+   pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, 
>command);
+   pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, 
>msix_message_control);
+   }
+   else
+   printk(KERN_WARNING "Unable to find VF device.\n");
+


Formatting for the if/else is incorrect.  The else condition should be 
in brackets as well.



+   /* Record states hold by PF */
+   memcpy(>vf_data, >vfinfo[vfn], sizeof(struct 
vf_data_storage));
+
+   vf_shift = vfn % 32;
+   reg_offset = vfn / 32;
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+   return sizeof(struct state_in_pf);
+}
+


This is a read.  Why does it need to switch off the VF?  Also why turn 
off the anti-spoof?  It doesn't make much sense.



+static ssize_t ixgbe_store_state_in_pf(struct device *dev,
+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct pci_dev *pdev = adapter->pdev, *vdev;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   struct state_in_pf *state = (struct state_in_pf *)buf;
+   int vfn = vf_pdev->virtfn_index;
+
+   /* Check struct size */
+   if (count != sizeof(struct state_in_pf)) {
+   printk(KERN_ERR "State in PF size does not fit.\n");
+   goto out;
+   }
+
+   /* Restore PCI configurations */
+   vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+   if (vdev) {
+   pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, 
state->command);
+   pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, 
state->msix_message_control);
+   }
+
+   /* Restore states hold by PF */
+   memcpy(>vfinfo[vfn], >vf_data, sizeof(struct 
vf_data_storage));
+
+  out:
+   return count;
+}


Just doing a memcpy to move the vfinfo over adds no value.  The fact is 
there are a number of filters that have to be configured in hardware 
afterwards, and it isn't as simple as just migrating the values stored.  
As I mentioned, in the case of the 82598 there are also jumbo frames to 
take into account.  If the first PF didn't have them enabled, but the 
second one does, that implies the state of the VF needs to change to 
account for that.


I really think you would be better off only migrating the data related 
to what can be configured using the ip link command and leaving other 
values such as clear_to_send at the reset value of 0. Then 

Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

This patch adds a new sysfs interface "notify_vf" under the sysfs
directory of the VF PCI device for Qemu to notify the VF when the
migration status changes.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 30 ++
  drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |  4 
  2 files changed, 34 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index e247d67..5cc7817 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -217,10 +217,37 @@ static ssize_t ixgbe_store_state_in_pf(struct device *dev,
return count;
  }
  
+static ssize_t ixgbe_store_notify_vf(struct device *dev,

+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct ixgbe_hw *hw = >hw;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   int vfn = vf_pdev->virtfn_index;
+   u32 ivar;
+
+   /* Enable VF mailbox irq first */
+   IXGBE_WRITE_REG(hw, IXGBE_PVTEIMS(vfn), 0x4);
+   IXGBE_WRITE_REG(hw, IXGBE_PVTEIAM(vfn), 0x4);
+   IXGBE_WRITE_REG(hw, IXGBE_PVTEIAC(vfn), 0x4);
+
+   ivar = IXGBE_READ_REG(hw, IXGBE_PVTIVAR_MISC(vfn));
+   ivar &= ~0xFF;
+   ivar |= 0x2 | IXGBE_IVAR_ALLOC_VAL;
+   IXGBE_WRITE_REG(hw, IXGBE_PVTIVAR_MISC(vfn), ivar);
+
+   ixgbe_ping_vf(adapter, vfn);
+   return count;
+}
+


NAK, this won't fly.  You can't just go in from the PF and enable 
interrupts on the VF hoping they are configured well enough to handle an 
interrupt you decide to trigger from them.


Also have you even considered the MSI-X configuration on the VF?  I 
haven't seen anything anywhere that would have migrated the VF's MSI-X 
configuration from BAR 3 on one system to the new system.



  static struct device_attribute ixgbe_per_state_in_pf_attribute =
__ATTR(state_in_pf, S_IRUGO | S_IWUSR,
ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
  
+static struct device_attribute ixgbe_per_notify_vf_attribute =

+   __ATTR(notify_vf, S_IWUSR, NULL, ixgbe_store_notify_vf);
+
  void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
  {
struct pci_dev *pdev = adapter->pdev;
@@ -241,6 +268,8 @@ void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
if (vfdev->is_virtfn) {
ret = device_create_file(>dev,
_per_state_in_pf_attribute);
+   ret |= device_create_file(>dev,
+   _per_notify_vf_attribute);
if (ret)
pr_warn("Unable to add VF attribute for dev 
%s,\n",
dev_name(>dev));
@@ -269,6 +298,7 @@ void ixgbe_remove_vf_attrib(struct ixgbe_adapter *adapter)
while (vfdev) {
if (vfdev->is_virtfn) {
device_remove_file(>dev, 
_per_state_in_pf_attribute);
+   device_remove_file(>dev, 
_per_notify_vf_attribute);
}
  
  		vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);


More driver specific sysfs.  This needs to be moved out of the driver if 
this is to be considered anything more than a proof of concept.



diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index dd6ba59..c6ddb66 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -2302,6 +2302,10 @@ enum {
  #define IXGBE_PVFTDT(P)   (0x06018 + (0x40 * (P)))
  #define IXGBE_PVFTDWBAL(P)(0x06038 + (0x40 * (P)))
  #define IXGBE_PVFTDWBAH(P)(0x0603C + (0x40 * (P)))
+#define IXGBE_PVTEIMS(P)   (0x00D00 + (4 * (P)))
+#define IXGBE_PVTIVAR_MISC(P)  (0x04E00 + (4 * (P)))
+#define IXGBE_PVTEIAC(P)   (0x00F00 + (4 * P))
+#define IXGBE_PVTEIAM(P)   (0x04D00 + (4 * P))
  
  #define IXGBE_PVFTDWBALn(q_per_pool, vf_number, vf_q_index) \

(IXGBE_PVFTDWBAL((q_per_pool)*(vf_number) + (vf_q_index)))




Re: [RFC Patch 06/12] IXGBEVF: Add self emulation layer

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

In order to restore VF function after migration, add a self-emulation
layer to record reg values during reg accesses.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbevf/Makefile|  3 ++-
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
  .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++
  drivers/net/ethernet/intel/ixgbevf/vf.h|  5 -
  4 files changed, 33 insertions(+), 3 deletions(-)
  create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile 
b/drivers/net/ethernet/intel/ixgbevf/Makefile
index 4ce4c97..841c884 100644
--- a/drivers/net/ethernet/intel/ixgbevf/Makefile
+++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
@@ -31,7 +31,8 @@
  
  obj-$(CONFIG_IXGBEVF) += ixgbevf.o
  
-ixgbevf-objs := vf.o \

+ixgbevf-objs := self-emulation.o \
+   vf.o \
  mbx.o \
  ethtool.o \
  ixgbevf_main.o
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..4446916 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
  
  	if (IXGBE_REMOVED(reg_addr))

return IXGBE_FAILED_READ_REG;
-   value = readl(reg_addr + reg);
+   value = ixgbe_self_emul_readl(reg_addr, reg);
if (unlikely(value == IXGBE_FAILED_READ_REG))
ixgbevf_check_remove(hw, reg);
return value;
diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c 
b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
new file mode 100644
index 000..d74b2da
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
@@ -0,0 +1,26 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vf.h"
+#include "ixgbevf.h"
+
+static u32 hw_regs[0x4000];
+
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
+{
+   u32 tmp;
+
+   tmp = readl(base + addr);
+   hw_regs[(unsigned long)addr] = tmp;
+
+   return tmp;
+}
+
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
+{
+   hw_regs[(unsigned long)addr] = val;
+   writel(val, (volatile void __iomem *)(base + addr));
+}


So I see what you are doing, however I don't think this adds much 
value.  Many of the key registers for the device are not simple 
Read/Write registers.  Most of them are things like write 1 to clear or 
some other sort of value where writing doesn't set the bit but has some 
other side effect.  Just take a look through the Datasheet at registers 
such as the VFCTRL, VFMAILBOX, or most of the interrupt registers.  The 
fact is simply storing the values off doesn't give you any real idea of 
what the state of things are.
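
A hedged illustration of the pitfall, using the VF interrupt cause
register as the example (exact read/clear semantics depend on the
VTEIAC configuration per the 82599 datasheet; hw_regs is the shadow
array from this patch):

static void shadow_restore_pitfall(void __iomem *base, u32 *hw_regs)
{
	/* reading VTEICR returns -- and typically clears -- the
	 * pending causes, so the shadow only captures a moment in time */
	hw_regs[IXGBE_VTEICR] = readl(base + IXGBE_VTEICR);

	/* "restoring" it later does not recreate that state: the
	 * register is write-1-to-clear, so this write clears whatever
	 * is pending now instead of setting the saved bits */
	writel(hw_regs[IXGBE_VTEICR], base + IXGBE_VTEICR);
}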



diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h 
b/drivers/net/ethernet/intel/ixgbevf/vf.h
index d40f036..6a3f4eb 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -39,6 +39,9 @@
  
  struct ixgbe_hw;
  
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr);

+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr);
+
  /* iterator type for walking multicast address lists */
  typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr,
  u32 *vmdq);
@@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 
reg, u32 value)
  
  	if (IXGBE_REMOVED(reg_addr))

return;
-   writel(value, reg_addr + reg);
+   ixgbe_self_emul_writel(value, reg_addr, reg);
  }
  
  #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v)




Re: [RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

This patch restores VF status in the PF driver when getting an event
from the VF.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  1 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h   |  1 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++
  3 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 636f9e3..9d5669a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -148,6 +148,7 @@ struct vf_data_storage {
bool pf_set_mac;
u16 pf_vlan; /* When set, guest VLAN config not allowed. */
u16 pf_qos;
+   u32 vf_lpe;
u16 tx_rate;
u16 vlan_count;
u8 spoofchk_enabled;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
index b1e4703..8fdb38d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
@@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev {

  /* mailbox API, version 1.1 VF requests */
  #define IXGBE_VF_GET_QUEUES   0x09 /* get queue configuration */
+#define IXGBE_VF_NOTIFY_RESUME0x0c /* VF notify PF migration finishing */

  /* GET_QUEUES return data indices within the mailbox */
  #define IXGBE_VF_TX_QUEUES1   /* number of Tx queues supported */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 1d17b58..ab2a2e2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter 
*adapter, u32 vf,
}
  }

+/**
+ *  Restore the settings by mailbox, after migration
+ **/
+void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf)
+{
+   struct ixgbe_hw *hw = >hw;
+   u32 reg, reg_offset, vf_shift;
+   int rar_entry = hw->mac.num_rar_entries - (vf + 1);
+
+   vf_shift = vf % 32;
+   reg_offset = vf / 32;
+
+   /* enable transmit and receive for vf */
+   reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+


This is just blanket enabling Rx and Tx.  I don't see how this can be 
valid.  It seems like it would result in memory corruption for the guest 
if you are enabling Rx on a device that is not ready.  A perfect example 
is if the guest is not configured to handle jumbo frames and the PF has 
jumbo frames enabled.



+   reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);


This assumes that the anti-spoof is enabled.  That may not be the case.


+   ixgbe_vf_reset_event(adapter, vf);
+
+   hw->mac.ops.set_rar(hw, rar_entry,
+   adapter->vfinfo[vf].vf_mac_addresses,
+   vf, IXGBE_RAH_AV);
+
+
+   if (adapter->vfinfo[vf].vf_lpe)
+   ixgbe_set_vf_lpe(adapter, >vfinfo[vf].vf_lpe, vf);
+}
+


The function ixgbe_set_vf_lpe also enables the receive path; you should 
take a look at it.  For 82598 you cannot just arbitrarily enable the Rx 
as there is a risk of corrupting guest memory or causing a kernel panic.



  static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf)
  {
struct ixgbe_ring_feature *vmdq = >ring_feature[RING_F_VMDQ];
@@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter 
*adapter, u32 vf)
break;
case IXGBE_VF_SET_LPE:
retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf);
+   adapter->vfinfo[vf].vf_lpe = *msgbuf;
break;


Why not just leave this for the VF to notify us of via a reset?  It 
seems like if the VF is migrated it should start with the cts bits of 
the mailbox cleared as though the PF driver has been reloaded.



case IXGBE_VF_SET_MACVLAN:
retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf);
@@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter 
*adapter, u32 vf)
case IXGBE_VF_GET_RSS_KEY:
retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf);
break;
+   case IXGBE_VF_NOTIFY_RESUME:
+   ixgbe_restore_setting(adapter, vf);
+   break;
default:
e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]);
retval = IXGBE_ERR_MBX;



I really don't think the VF should be sending us a message telling us to 
restore settings.  Why not just use the existing messages?
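
To sketch what reusing the existing messages could look like on the VF
side -- the names are the ixgbevf ones at the time, but treat this as an
outline, not a tested patch:

	/* on resume from migration, go through the normal reset
	 * handshake, which already restores RAR/queue state on the
	 * PF side, instead of a new NOTIFY_RESUME message
	 */
	static void ixgbevf_migration_resume(struct ixgbevf_adapter *adapter)
	{
		struct ixgbe_hw *hw = &adapter->hw;

		hw->mac.ops.reset_hw(hw);	/* issues IXGBE_VF_RESET */
		ixgbevf_negotiate_api(adapter);
		ixgbevf_configure(adapter);
	}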


The VF as it is now can survive a suspend/resume 

Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

When transmitting a packet, the last transmit descriptor of the packet
indicates whether the packet has been sent. The current code records a
pointer to that descriptor in the next_to_watch field of the tx buffer
struct. That pointer becomes invalid if the desc ring is shifted after
migration. This patch replaces the recorded pointer with the desc count
of the packet and finds the last desc from the first desc plus that
count.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  1 +
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 ---
  2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..c823616 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -54,6 +54,7 @@
   */
  struct ixgbevf_tx_buffer {
union ixgbe_adv_tx_desc *next_to_watch;
+   u16 desc_num;
unsigned long time_stamp;
struct sk_buff *skb;
unsigned int bytecount;


So if you can't use next_to_watch why is it left in here?  Also you 
might want to take a look at moving desc_num to a different spot in the 
buffer struct, as you are leaving a 6 byte hole after next_to_watch.
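
Taking just the fields visible in this hunk, something like the
following packs desc_num into the padding after bytecount instead of
splitting the pointer/long pair (illustrative; the real struct has more
members, so the exact savings depend on config):

	struct ixgbevf_tx_buffer {
		union ixgbe_adv_tx_desc *next_to_watch;
		unsigned long time_stamp;
		struct sk_buff *skb;
		unsigned int bytecount;
		u16 desc_num;	/* shares a word with bytecount, no 6 byte hole */
	};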



diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 4446916..056841c 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct ixgbevf_ring *tx_ring,
   DMA_TO_DEVICE);
}
tx_buffer->next_to_watch = NULL;
+   tx_buffer->desc_num = 0;
tx_buffer->skb = NULL;
dma_unmap_len_set(tx_buffer, len, 0);


This opens up a race condition.  If you have a descriptor ready to be 
cleaned at offset 0 what is to prevent you from just running through the 
ring?  You likely need to find a descriptor number that cannot be valid 
to use here.



/* tx_buffer must be completely set up in the transmit path */
@@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
union ixgbe_adv_tx_desc *tx_desc;
unsigned int total_bytes = 0, total_packets = 0;
unsigned int budget = tx_ring->count / 2;
-   unsigned int i = tx_ring->next_to_clean;
+   int i, watch_index;
  


Where is i being initialized?  It was here but you removed it.  Are you 
using i without initializing it?



	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
return true;
@@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
i -= tx_ring->count;
  
  	do {

-   union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
+   union ixgbe_adv_tx_desc *eop_desc;
+
+   if (!tx_buffer->desc_num)
+   break;
+
+   if (i + tx_buffer->desc_num >= 0)
+   watch_index = i + tx_buffer->desc_num;
+   else
+   watch_index = i + tx_ring->count + tx_buffer->desc_num;
  
-		/* if next_to_watch is not set then there is no work pending */

+   eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index);
if (!eop_desc)
break;
  


So I don't see how this isn't triggering Tx hangs.  I suspect for the 
simple ping case desc_num will often be 0.  The fact is there are many 
cases where first and tx_buffer_info are the same descriptor.



@@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
  
  		/* clear next_to_watch to prevent false hangs */

tx_buffer->next_to_watch = NULL;
+   tx_buffer->desc_num = 0;
  
  		/* update the statistics for this packet */

total_bytes += tx_buffer->bytecount;


You cannot use 0 because 0 is a valid number.  You are using it as a 
look-ahead currently and there are cases where i is the eop_desc index.



@@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
u32 tx_flags = first->tx_flags;
__le32 cmd_type;
u16 i = tx_ring->next_to_use;
+   u16 start;
  
  	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
  
@@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
  
  	/* set next_to_watch value indicating a packet is present */

first->next_to_watch = tx_desc;
+   start = first - tx_ring->tx_buffer_info;
+   first->desc_num = (i - start >= 0) ? i - start : i + tx_ring->count - start;
  
  	i++;

if (i == tx_ring->count)


start and i could be the same value.  If you look at ixgbevf_tx_map you 
should find that if the packet is contained in a single buffer then the 
first and last descriptor in your 

Re: [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

To let the VF driver in the guest know the migration status, Qemu will
fake PCI config regs 0xF0 and 0xF1 to expose the migration status and
get an ack from the VF driver.

When migration starts, Qemu sets reg 0xF0 to 1, notifies the VF driver
via a mailbox msg, and waits for the VF driver to signal that it is
ready for migration (by setting reg 0xF1 to 1). After migration, Qemu
sets reg 0xF0 to 0 and notifies the VF driver by mailbox irq. The VF
driver then begins to restore tx/rx function after detecting the status
change.

When the VF receives the mailbox irq, it checks reg 0xF0 in the service
task function to get the migration status and performs the related
operations according to its value.

Steps to restart receive and transmit function:
1) Restore VF status in the PF driver via sending a mailbox event to the PF driver
2) Write back reg values recorded by the self emulation layer
3) Restart rx/tx rings
4) Recover interrupts

Transmit/Receive descriptor head regs are read-only and can't be
restored by writing back the recorded reg values directly, and they are
set to 0 during VF reset. To reuse the original tx/rx rings, shift the
desc ring so that the desc pointed to by the original head reg moves to
the first entry of the ring, and then enable the tx/rx rings. The VF
resumes receiving and transmitting from the original head desc.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbevf/defines.h   |   6 ++
  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h   |   7 +-
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 115 -
  .../net/ethernet/intel/ixgbevf/self-emulation.c| 107 +++
  4 files changed, 232 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h
index 770e21a..113efd2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/defines.h
+++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
@@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc {
__le32 mss_l4len_idx;
  };

+union ixgbevf_desc {
+   union ixgbe_adv_rx_desc rx_desc;
+   union ixgbe_adv_tx_desc tx_desc;
+   struct ixgbe_adv_tx_context_desc tx_context_desc;
+};
+
  /* Adv Transmit Descriptor Config Masks */
  #define IXGBE_ADVTXD_DTYP_MASK	0x00F00000 /* DTYP mask */
  #define IXGBE_ADVTXD_DTYP_CTXT	0x00200000 /* Advanced Context Desc */
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index c823616..6eab402e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -109,7 +109,7 @@ struct ixgbevf_ring {
struct ixgbevf_ring *next;
struct net_device *netdev;
struct device *dev;
-   void *desc; /* descriptor ring memory */
+   union ixgbevf_desc *desc;   /* descriptor ring memory */
dma_addr_t dma; /* phys. address of descriptor ring */
unsigned int size;  /* length in bytes */
u16 count;  /* amount of descriptors */
@@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector);

  void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
  void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter);
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head);
+int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head);
+void ixgbevf_restore_state(struct ixgbevf_adapter *adapter);
+void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter);
+

  #ifdef DEBUG
  char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw);
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 056841c..15ec361 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function Network Driver");
  MODULE_LICENSE("GPL");
  MODULE_VERSION(DRV_VERSION);

+
+#define MIGRATION_COMPLETED   0x00
+#define MIGRATION_IN_PROGRESS 0x01
+
  #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
  static int debug = -1;
  module_param(debug, int, 0);
@@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring)
return ring->stats.packets;
  }

+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
+{
+   struct ixgbevf_tx_buffer *tx_buffer = NULL;
+   union ixgbevf_desc *tx_desc = NULL;
+
+   tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
+   if (!tx_buffer)
+   return -ENOMEM;
+
+   tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count);
+   if (!tx_desc) {
+   vfree(tx_buffer);
+   return -ENOMEM;
+   }
+
+   memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
+   memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
+   

Re: [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

Ring shifting while restoring the VF may race with normal ring
operation (transmitting/receiving packets). This patch adds tx/rx
locks to protect the ring related data.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  2 ++
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 ---
  2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 6eab402e..3a748c8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -448,6 +448,8 @@ struct ixgbevf_adapter {

spinlock_t mbx_lock;
unsigned long last_reset;
+   spinlock_t mg_rx_lock;
+   spinlock_t mg_tx_lock;
  };



Really, a shared lock for all of the Rx or Tx rings?  This is going to 
kill any chance at performance.  Especially since just recently the VFs 
got support for RSS.


To top it off it also means we cannot clean Tx while adding new buffers 
which will kill Tx performance.


The other concern I have is what is supposed to prevent the hardware 
from accessing the rings while you are reading?  I suspect nothing so I 
don't see how this helps anything.


I would honestly say you are better off just giving up on all of the 
data stored in the descriptor rings rather than trying to restore them. 
 Yes you are going to lose a few packets but you don't have the risk 
for races that this code introduces.
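
In sketch form that alternative is just a ring reset around migration,
reusing the driver's existing up/down paths rather than adding locks
(illustrative only):

	/* drop in-flight descriptors at migration time; a few packets
	 * are lost, but the hot paths stay lock-free
	 */
	static void ixgbevf_migration_reset_rings(struct ixgbevf_adapter *adapter)
	{
		if (netif_running(adapter->netdev)) {
			ixgbevf_down(adapter);	/* frees and cleans all rings */
			ixgbevf_up(adapter);	/* starts over with fresh rings */
		}
	}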



  enum ixbgevf_state_t {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 15ec361..04b6ce7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring)

  int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
  {
+   struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
struct ixgbevf_tx_buffer *tx_buffer = NULL;
	union ixgbevf_desc *tx_desc = NULL;
+   unsigned long flags;

tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
if (!tx_buffer)
@@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
if (!tx_desc)
return -ENOMEM;

+   spin_lock_irqsave(&adapter->mg_tx_lock, flags);
	memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
	memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
	memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
@@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
else
r->next_to_use += (r->count - head);

+   spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
+
vfree(tx_buffer);
vfree(tx_desc);
return 0;
@@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)

  int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
  {
+   struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
struct ixgbevf_rx_buffer *rx_buffer = NULL;
	union ixgbevf_desc *rx_desc = NULL;
+   unsigned long flags;

rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
if (!rx_buffer)
@@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
if (!rx_desc)
return -ENOMEM;

+   spin_lock_irqsave(&adapter->mg_rx_lock, flags);
	memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
	memcpy(r->desc, &rx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
	memcpy(&r->desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head);
@@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
r->next_to_use -= head;
else
r->next_to_use += (r->count - head);
+   spin_unlock_irqrestore(&adapter->mg_rx_lock, flags);

vfree(rx_buffer);
vfree(rx_desc);
@@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
return true;

+   spin_lock(&adapter->mg_tx_lock);
+   i = tx_ring->next_to_clean;
	tx_buffer = &tx_ring->tx_buffer_info[i];
tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
i -= tx_ring->count;
@@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
q_vector->tx.total_bytes += total_bytes;
q_vector->tx.total_packets += total_packets;

+   spin_unlock(&adapter->mg_tx_lock);
+
if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) {
	struct ixgbe_hw *hw = &adapter->hw;
union ixgbe_adv_tx_desc *eop_desc;
@@ -999,10 +1012,12 @@ static int 

Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Alexander Duyck

On 10/21/2015 12:20 PM, Alex Williamson wrote:

On Wed, 2015-10-21 at 21:45 +0300, Or Gerlitz wrote:

On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu  wrote:

This patchset is to propose a new solution to add live migration support
for 82599 SRIOV network card.



In our solution, we prefer to put all device specific operation into VF and
PF driver and make code in the Qemu more general.


[...]


Service down time test
So far, we tested migration between two laptops with 82599 nics
connected to a gigabit switch, pinging the VF at a 0.001s interval
during migration from the source-side host. The service downtime is
about 180ms.


So... what would you expect service down wise for the following
solution which is zero touch and I think should work for any VF
driver:

on host A: unplug the VF from the VM and conduct live migration to
host B a la the no-SR-IOV case.


The trouble here is that the VF needs to be unplugged prior to the start
of migration because we can't do effective dirty page tracking while the
device is connected and doing DMA.  So the downtime, assuming we're
counting only VF connectivity, is dependent on memory size, rate of
dirtying, and network bandwidth; seconds for small guests, minutes or
more (maybe much, much more) for large guests.


The question of dirty page tracking though should be pretty simple.  We 
start the Tx packets out as dirty so we don't need to add anything 
there.  It seems like the Rx data and Tx/Rx descriptor rings are the issue.



This is why the typical VF-agnostic approach here is to use bonding
and fail over to an emulated device during migration; performance
suffers, but downtime is acceptable.

If we want the ability to defer the VF unplug until just before the
final stages of the migration, we need the VF to participate in dirty
page tracking.  Here it's done via an enlightened guest driver.  Alex
Graf presented a solution using a device specific enlightenment in QEMU.
Otherwise we'd need hardware support from the IOMMU.


My only real complaint with this patch series is that it seems like 
there was too much focus on instrumenting the driver instead of providing 
the code necessary to enable a driver ecosystem that enables migration.


I don't know if what we need is a full hardware IOMMU.  It seems like a 
good way to take care of the need to flag dirty pages for DMA capable 
devices would be to add functionality to the dma_map_ops calls 
sync_{sg|single}_for_cpu and unmap_{page|sg} so that they would take care 
of marking the pages as dirty for us when needed.  We could probably 
make do with just a few tweaks to existing API in order to make this work.
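
Not an existing kernel interface, but roughly what I mean -- a wrapper
around the unmap path that flags device-written pages before the mapping
goes away; real_dma_ops and migration_mark_dirty() are both hypothetical:

	static const struct dma_map_ops *real_dma_ops;	/* the wrapped ops */

	static void migration_unmap_page(struct device *dev, dma_addr_t addr,
					 size_t size, enum dma_data_direction dir,
					 struct dma_attrs *attrs)
	{
		/* anything the device may have written must be flagged
		 * dirty before the mapping is torn down
		 */
		if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
			migration_mark_dirty(addr, size);

		real_dma_ops->unmap_page(dev, addr, size, dir, attrs);
	}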


As far as the descriptor rings go, I would argue they are invalid as soon as 
we migrate.  The problem is there is no way to guarantee ordering as we 
cannot pre-emptively mark an Rx data buffer as being a dirty page when 
we haven't even looked at the Rx descriptor for the given buffer yet. 
Tx has similar issues as we cannot guarantee the Tx will disable itself 
after a complete frame.  As such I would say the moment we migrate we 
should just give up on the frames that are still in the descriptor 
rings, drop them, and then start over with fresh rings.


- Alex


Re: [PATCH 1/4] PCI: introduce helper functions for device flag operation

2014-07-28 Thread Alexander Duyck
On 07/28/2014 07:43 PM, ethan zhao wrote:
 
 On 2014/7/29 10:31, Alex Williamson wrote:
 On Tue, 2014-07-29 at 09:53 +0800, ethan zhao wrote:
 On 2014/7/29 5:00, Alex Williamson wrote:
 On Wed, 2014-07-23 at 00:19 +0800, Ethan Zhao wrote:
 This patch introduced three helper functions to hide direct
 device flag operation.

 void pci_set_dev_assigned(struct pci_dev *pdev);
 void pci_set_dev_deassigned(struct pci_dev *pdev);
 bool pci_is_dev_assigned(struct pci_dev *pdev);

 Signed-off-by: Ethan Zhao ethan.z...@oracle.com
 ---
include/linux/pci.h |   13 +
1 files changed, 13 insertions(+), 0 deletions(-)

 diff --git a/include/linux/pci.h b/include/linux/pci.h
 index aab57b4..5f6f8fa 100644
 --- a/include/linux/pci.h
 +++ b/include/linux/pci.h
 @@ -1129,6 +1129,19 @@ resource_size_t pcibios_window_alignment(struct pci_bus *bus,
   int pci_set_vga_state(struct pci_dev *pdev, bool decode,
  unsigned int command_bits, u32 flags);
 +/* helper functions for operation of device flag */
 +static inline void pci_set_dev_assigned(struct pci_dev *pdev)
 +{
 +	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
 +}
 +static inline void pci_set_dev_deassigned(struct pci_dev *pdev)
 +{
 +	pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
 +}
 I think pci_clear_dev_assigned() would make more sense, we're not
 setting a flag named DEASSIGNED.
 Though it is a flag operation now, it may not be later; we define it
 because we want to hide the internal operation.  'set' to 'deassigned'
 status is enough, so I would like to keep it.
 I disagree, the opposite of a 'set' is a 'clear', or at least an
 'unset'.  Using bit-ops-like terminology doesn't lock us into an
 implementation.  As written, this could just as easily be setting two
 different variables.
  So there are two pairs of opposites:
 
  set assigned --- unset assigned
  set assigned --- set deassigned
 
  Here you prefer the verbs 'set'/'unset', and I prefer the adjectives
 'assigned'/'deassigned'.
 
  Do they really have different meanings or cause confusion?  I don't think
 so.
 
  Thanks,
  Ethan
 

I agree with Alex W.  If you are going to use the 'set' name you should
probably use 'clear' for the operation that undoes it.  Using the term
'set' implies that it is setting a bit, and we currently don't have a
deassigned/unassigned bit.

If that doesn't work perhaps you could use something like
get/put_assigned_device or acquire/release_assigned_device.  You just
need to describe what it is you are doing.  You could even consider
adding some tests to return an error if you attempt to assign a device
that is already assigned.
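
Something like this, purely as a sketch -- callers would still need to
serialize, since dev_flags is not written atomically here:

	static inline int pci_set_dev_assigned(struct pci_dev *pdev)
	{
		if (pdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED)
			return -EBUSY;	/* already assigned */
		pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
		return 0;
	}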

Thanks,

Alex D



Re: [PATCH 1/2 V3] PCI: introduce device assignment interface and refactory related code

2014-07-21 Thread Alexander Duyck
On 07/11/2014 05:30 AM, Ethan Zhao wrote:
 This patch introduces two new device assignment functions
 
 pci_iov_assign_device(),
 pci_iov_deassign_device()
 
 along with the existed one
 
 pci_vfs_assigned()
 
 They construct the device assignment management interface, used to assign/
 deassign a device to a VM and to query the VF reference count, instead of
 manipulating the device flag directly.
 
 This patch refashions the related code and makes it atomic.
 
 v3: change the naming of the device assignment helpers, because they work
 for all kinds of PCI devices, not only SR-IOV (david.vra...@citrix.com)
 
 v2: reorder the patchset to make it bisectable and atomic, with clear steps
 between interface definition and implementation, according to the
 suggestion from alex.william...@redhat.com
 
 Signed-off-by: Ethan Zhao ethan.z...@oracle.com
 ---
  drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |   17 ++---
  drivers/pci/iov.c  |   20 
 
  drivers/xen/xen-pciback/pci_stub.c |4 ++--
  include/linux/pci.h|4 
  virt/kvm/assigned-dev.c|2 +-
  virt/kvm/iommu.c   |4 ++--
  6 files changed, 31 insertions(+), 20 deletions(-)
 
 diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
 index 02c11a7..781040e 100644
 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
 +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
 @@ -693,22 +693,9 @@ complete_reset:
  static bool i40e_vfs_are_assigned(struct i40e_pf *pf)
  {
   struct pci_dev *pdev = pf->pdev;
  - struct pci_dev *vfdev;
  -
  - /* loop through all the VFs to see if we own any that are assigned */
  - vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, I40E_DEV_ID_VF, NULL);
  - while (vfdev) {
  - /* if we don't own it we don't care */
  - if (vfdev->is_virtfn && pci_physfn(vfdev) == pdev) {
  - /* if it is assigned we cannot release it */
  - if (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED)
 - return true;
 - }
  
 - vfdev = pci_get_device(PCI_VENDOR_ID_INTEL,
 -I40E_DEV_ID_VF,
 -vfdev);
 - }
 + if (pci_vfs_assigned(pdev))
 + return true;
  
   return false;
  }

This portion for i40e should be in one patch by itself.  It shouldn't be
included in the bits below.  Normally this would go through netdev.  The
rest of this below would go through linux-pci.

 diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
 index de7a747..090f827 100644
 --- a/drivers/pci/iov.c
 +++ b/drivers/pci/iov.c
 @@ -644,6 +644,26 @@ int pci_vfs_assigned(struct pci_dev *dev)
  EXPORT_SYMBOL_GPL(pci_vfs_assigned);
  
  /**
 + * pci_iov_assign_device - assign device to VM
 + * @pdev: the device to be assigned.
 + */
 +void pci_iov_assign_device(struct pci_dev *pdev)
 +{
 + pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
 +}
 +EXPORT_SYMBOL_GPL(pci_iov_assign_device);
 +
 +/**
 + * pci_iov_deassign_device - deassign device from VM
 + * @pdev: the device to be deassigned.
 + */
 +void pci_iov_deassign_device(struct pci_dev *pdev)
 +{
 + pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
 +}
 +EXPORT_SYMBOL_GPL(pci_iov_deassign_device);
 +
 +/**
   * pci_sriov_set_totalvfs -- reduce the TotalVFs available
   * @dev: the PCI PF device
   * @numvfs: number that should be used for TotalVFs supported

The two functions above don't have anything to do with IOV.  You can
direct assign a device that doesn't even support SR-IOV or MR-IOV.  You
might be better off defining this as something like
pci_set_flag_assigned/pci_clear_flag_assigned.  I would likely also
make them inline and possibly move them to pci.h since it would likely
result in less actual code after you consider the overhead to push
everything on the stack prior to making the call.
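
Roughly what I have in mind, as static inlines in include/linux/pci.h
(sketch only):

	/* generic: applies to any assigned device, not just SR-IOV VFs */
	static inline void pci_set_flag_assigned(struct pci_dev *pdev)
	{
		pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
	}

	static inline void pci_clear_flag_assigned(struct pci_dev *pdev)
	{
		pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
	}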

 diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
 index 62fcd48..27e00d1 100644
 --- a/drivers/xen/xen-pciback/pci_stub.c
 +++ b/drivers/xen/xen-pciback/pci_stub.c
 @@ -133,7 +133,7 @@ static void pcistub_device_release(struct kref *kref)
   xen_pcibk_config_free_dyn_fields(dev);
   xen_pcibk_config_free_dev(dev);
  
 - dev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
 + pci_iov_deassign_device(dev);
   pci_dev_put(dev);
  
   kfree(psdev);
 @@ -404,7 +404,7 @@ static int pcistub_init_device(struct pci_dev *dev)
   dev_dbg(&dev->dev, "reset device\n");
   xen_pcibk_reset_device(dev);
  
 - dev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
 + pci_iov_assign_device(dev);
   return 0;
  
  config_release:
 diff --git a/include/linux/pci.h b/include/linux/pci.h
 index aab57b4..5ece6d6 100644
 --- a/include/linux/pci.h
 +++ b/include/linux/pci.h
 @@ -1603,6 

Re: [net-next PATCH 1/2] igbvf: add new driver to support 82576 virtual functions

2009-03-18 Thread Alexander Duyck

Andrew Morton wrote:

On Tue, 17 Mar 2009 17:12:48 -0700 Alexander Duyck alexander.h.du...@intel.com wrote:

Thanks for all the comments.  I tried to incorporate most of them into 
the igbvf driver and also ended up porting some over to our other 
drivers, specifically igb since the igbvf driver copies much of the code.


I have added my comments inline below.

Thanks,

Alex

Andrew Morton wrote:

On Tue, 10 Mar 2009 19:09:28 -0700 Jeff Kirsher jeffrey.t.kirs...@intel.com wrote:


From: Alexander Duyck alexander.h.du...@intel.com

This adds an igbvf driver to handle virtual functions provided
by the igb driver.

The drive-by reader is now wondering what a virtual function is.


^^ this comment was missed.

I was indirectly asking for an overview (preferably in the changelog) of
what the whole patch actually does.


Sorry, while I missed the comment in my response I had gotten to 
addressing it in the next version.  I updated it to more thoroughly 
describe what the VF driver is doing.  I also included instructions on 
how to enable the VFs from the PF so that they can be tested.



+static int igbvf_set_ringparam(struct net_device *netdev,
+   struct ethtool_ringparam *ring)
+{
+ struct igbvf_adapter *adapter = netdev_priv(netdev);
+ struct igbvf_ring *tx_ring, *tx_old;
+ struct igbvf_ring *rx_ring, *rx_old;
+ int err;
+
 + if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
+ return -EINVAL;
+
 + while (test_and_set_bit(__IGBVF_RESETTING, &adapter->state))
+ msleep(1);

No timeout needed here?  Interrupts might not be working, for example..
This bit isn't set in interrupt context.  This is always used out of 
interrupt context and is just to prevent multiple setting changes at the 
same time.


Oh.  Can't use plain old mutex_lock()?


We have one or two spots that actually check to see if the bit is set 
and just report a warning instead of actually waiting on the bit to clear.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next PATCH 1/2] igbvf: add new driver to support 82576 virtual functions

2009-03-17 Thread Alexander Duyck
Thanks for all the comments.  I tried to incorporate most of them into 
the igbvf driver and also ended up porting some over to our other 
drivers, specifically igb since the igbvf driver copies much of the code.


I have added my comments inline below.

Thanks,

Alex

Andrew Morton wrote:

On Tue, 10 Mar 2009 19:09:28 -0700 Jeff Kirsher jeffrey.t.kirs...@intel.com wrote:


From: Alexander Duyck alexander.h.du...@intel.com

This adds an igbvf driver to handle virtual functions provided
by the igb driver.


The drive-by reader is now wondering what a virtual function is.


...

+#define IGBVF_STAT(m, b) sizeof(((struct igbvf_adapter *)0)->m), \
+ offsetof(struct igbvf_adapter, m), \
+ offsetof(struct igbvf_adapter, b)

+static const struct igbvf_stats igbvf_gstrings_stats[] = {
+ { "rx_packets", IGBVF_STAT(stats.gprc, stats.base_gprc) },
+ { "tx_packets", IGBVF_STAT(stats.gptc, stats.base_gptc) },
+ { "rx_bytes", IGBVF_STAT(stats.gorc, stats.base_gorc) },
+ { "tx_bytes", IGBVF_STAT(stats.gotc, stats.base_gotc) },
+ { "multicast", IGBVF_STAT(stats.mprc, stats.base_mprc) },
+ { "lbrx_bytes", IGBVF_STAT(stats.gorlbc, stats.base_gorlbc) },
+ { "lbrx_packets", IGBVF_STAT(stats.gprlbc, stats.base_gprlbc) },
+ { "tx_restart_queue", IGBVF_STAT(restart_queue, zero_base) },
+ { "rx_long_byte_count", IGBVF_STAT(stats.gorc, stats.base_gorc) },
+ { "rx_csum_offload_good", IGBVF_STAT(hw_csum_good, zero_base) },
+ { "rx_csum_offload_errors", IGBVF_STAT(hw_csum_err, zero_base) },
+ { "rx_header_split", IGBVF_STAT(rx_hdr_split, zero_base) },
+ { "alloc_rx_buff_failed", IGBVF_STAT(alloc_rx_buff_failed, zero_base) },
+};


*stares at it for a while*

It would be clearer if `m' and `b' were (much) more meaningful identifiers.


I agree, the values have been changed to current and base.


+#define IGBVF_GLOBAL_STATS_LEN   \
+ (sizeof(igbvf_gstrings_stats) / sizeof(struct igbvf_stats))


This is ARRAY_SIZE().


+#define IGBVF_STATS_LEN (IGBVF_GLOBAL_STATS_LEN)


Why does this need to exist?


It doesn't, it has been dropped and references to it replaced with 
IGBVF_GLOBAL_STATS_LEN.



...

+static int igbvf_set_ringparam(struct net_device *netdev,
+   struct ethtool_ringparam *ring)
+{
+ struct igbvf_adapter *adapter = netdev_priv(netdev);
+ struct igbvf_ring *tx_ring, *tx_old;
+ struct igbvf_ring *rx_ring, *rx_old;
+ int err;
+
+ if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
+ return -EINVAL;
+
+ while (test_and_set_bit(__IGBVF_RESETTING, &adapter->state))
+ msleep(1);


No timeout needed here?  Interrupts might not be working, for example..


This bit isn't set in interrupt context.  This is always used out of 
interrupt context and is just to prevent multiple setting changes at the 
same time.



+ if (netif_running(adapter->netdev))
+ igbvf_down(adapter);
+
+ tx_old = adapter->tx_ring;
+ rx_old = adapter->rx_ring;
+
+ err = -ENOMEM;
+ tx_ring = kzalloc(sizeof(struct igbvf_ring), GFP_KERNEL);
+ if (!tx_ring)
+ goto err_alloc_tx;
+ /*
+  * use a memcpy to save any previously configured
+  * items like napi structs from having to be
+  * reinitialized
+  */
+ memcpy(tx_ring, tx_old, sizeof(struct igbvf_ring));
+
+ rx_ring = kzalloc(sizeof(struct igbvf_ring), GFP_KERNEL);
+ if (!rx_ring)
+ goto err_alloc_rx;
+ memcpy(rx_ring, rx_old, sizeof(struct igbvf_ring));
+
+ adapter->tx_ring = tx_ring;
+ adapter->rx_ring = rx_ring;
+
+ rx_ring->count = max(ring->rx_pending, (u32)IGBVF_MIN_RXD);
+ rx_ring->count = min(rx_ring->count, (u32)(IGBVF_MAX_RXD));
+ rx_ring->count = ALIGN(rx_ring->count, REQ_RX_DESCRIPTOR_MULTIPLE);
+
+ tx_ring->count = max(ring->tx_pending, (u32)IGBVF_MIN_TXD);
+ tx_ring->count = min(tx_ring->count, (u32)(IGBVF_MAX_TXD));
+ tx_ring->count = ALIGN(tx_ring->count, REQ_TX_DESCRIPTOR_MULTIPLE);
+
+ if (netif_running(adapter->netdev)) {
+ /* Try to get new resources before deleting old */
+ err = igbvf_setup_rx_resources(adapter);
+ if (err)
+ goto err_setup_rx;
+ err = igbvf_setup_tx_resources(adapter);
+ if (err)
+ goto err_setup_tx;
+
+ /*
+  * restore the old in order to free it,
+  * then add in the new
+  */
+ adapter->rx_ring = rx_old;
+ adapter->tx_ring = tx_old;
+ igbvf_free_rx_resources(adapter);
+ igbvf_free_tx_resources(adapter);
+ kfree(tx_old);
+ kfree(rx_old);


That's odd-looking.  Why take a copy of rx_old and tx_old when we're
about to free them?


This whole section has been redone.  The approach here didn't work well 
and so I redid it to match how we do this in igb.



+ adapter->rx_ring = rx_ring;
+ adapter