Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
On Tue, Jan 5, 2016 at 1:40 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote: >> >> The two mechanisms referenced above would likely require coordination with >> >> QEMU and as such are open to discussion. I haven't attempted to address >> >> them as I am not sure there is a consensus as of yet. My personal >> >> preference would be to add a vendor-specific configuration block to the >> >> emulated pci-bridge interfaces created by QEMU that would allow us to >> >> essentially extend shpc to support guest live migration with pass-through >> >> devices. >> > >> > shpc? >> >> That is kind of what I was thinking. We basically need some mechanism >> to allow for the host to ask the device to quiesce. It has been >> proposed to possibly even look at something like an ACPI interface >> since I know ACPI is used by QEMU to manage hot-plug in the standard >> case. >> >> - Alex > > > Start by using hot-unplug for this! > > Really use your patch guest side, and write host side > to allow starting migration with the device, but > defer completing it. Yeah, I'm fully on board with this idea, though I'm not really working on this right now since last I knew the folks on this thread from Intel were working on it. My patches were mostly meant to be a nudge in this direction so that we could get away from the driver specific code. > So > > 1.- host tells guest to start tracking memory writes > 2.- guest acks > 3.- migration starts > 4.- most memory is migrated > 5.- host tells guest to eject device > 6.- guest acks > 7.- stop vm and migrate rest of state > Sounds about right. The only way this differs from what I see as the final solution for this is that instead of fully ejecting the device in step 5 the driver would instead pause the device and give the host something like 10 seconds to stop the VM and resume with the same device connected if it is available. 
We would probably also need to look at a solution that would force the device to be ejected or abort prior to starting the migration if it doesn't give us the ack in step 2. > It will already be a win since hot unplug after migration starts and > most memory has been migrated is better than hot unplug before migration > starts. Right. Generally the longer the VF can be maintained as a part of the guest the greater the network performance benefit versus using a purely virtual interface. > Then measure downtime and profile. Then we can look at ways > to quiesce device faster which really means step 5 is replaced > with "host tells guest to quiesce device and dirty (or just unmap!) > all memory mapped for write by device". Step 5 will be the spot where we really need to start modifying drivers. Specifically we probably need to go through and clean things up so that we can reduce as many of the delays in the driver suspend/resume path as possible. I suspect there is quite a bit that can be done there that would probably also improve boot and shutdown times since those are also impacted by the devices. - Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
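The seven-step handshake sketched above boils down to a small set of ordering constraints, which can be modeled as a toy state machine. All names here are invented for illustration; nothing corresponds to an actual QEMU or kernel interface.

```c
#include <assert.h>

/* Illustrative model of the migration handshake discussed above.
 * It only encodes the ordering: tracking must be acked before
 * pre-copy starts, and the device must be quiesced/ejected before
 * the final stop-and-copy. */
enum mig_state {
    MIG_IDLE,
    MIG_TRACKING_REQUESTED,   /* 1. host tells guest to start tracking writes */
    MIG_TRACKING_ACKED,       /* 2. guest acks */
    MIG_PRECOPY,              /* 3-4. migration starts, most memory copied */
    MIG_DEVICE_QUIESCED,      /* 5-6. guest pauses/ejects device and acks */
    MIG_STOP_AND_COPY,        /* 7. stop vm, migrate rest of state */
};

/* Advance one step; returns -1 on an illegal transition. */
static int mig_step(enum mig_state *s, enum mig_state next)
{
    if (next != *s + 1)
        return -1;      /* e.g. starting pre-copy before the guest acked */
    *s = next;
    return 0;
}

static int mig_run_handshake(void)
{
    enum mig_state s = MIG_IDLE;

    if (mig_step(&s, MIG_TRACKING_REQUESTED)) return -1;
    /* Without the ack in step 2, the host must eject or abort here. */
    if (mig_step(&s, MIG_TRACKING_ACKED)) return -1;
    if (mig_step(&s, MIG_PRECOPY)) return -1;
    if (mig_step(&s, MIG_DEVICE_QUIESCED)) return -1;
    if (mig_step(&s, MIG_STOP_AND_COPY)) return -1;
    return 0;
}
```

The key invariant is step 2: if the guest never acks the tracking request, the only legal moves left are to eject the device or abort before pre-copy begins.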
Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
On Mon, Jan 4, 2016 at 12:41 PM, Konrad Rzeszutek Wilk <konrad.w...@oracle.com> wrote: > On Sun, Dec 13, 2015 at 01:28:09PM -0800, Alexander Duyck wrote: >> This patch set is meant to be the guest side code for a proof of concept >> involving leaving pass-through devices in the guest during the warm-up >> phase of guest live migration. In order to accomplish this I have added a > > What does that mean? 'warm-up-phase'? It is the first phase in a pre-copy migration. https://en.wikipedia.org/wiki/Live_migration Basically in this phase all the memory is marked as dirty and then copied. Any memory that changes gets marked as dirty as well. Currently DMA circumvents this as the user space dirty page tracking isn't able to track DMA. >> new function called dma_mark_dirty that will mark the pages associated with >> the DMA transaction as dirty in the case of either an unmap or a >> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or >> DMA_BIDIRECTIONAL. The pass-through device must still be removed before >> the stop-and-copy phase, however allowing the device to be present should >> significantly improve the performance of the guest during the warm-up >> period. > > .. if the warm-up phase is short I presume? If the warm-up phase takes > a long time (busy guest that is of 1TB size) it wouldn't help much as the > tracking of these DMA's may be quite long? > >> >> This current implementation is very preliminary and there are a number of >> items still missing. Specifically in order to make this a more complete >> solution we need to support: >> 1. Notifying hypervisor that drivers are dirtying DMA pages received > > .. And somehow giving the hypervisor the GPFN so it can retain the PFN in > the VT-d as long as possible. Yes, what has happened is that the host went through and marked all memory as read-only. So trying to do any operation that requires write access triggers a page fault which is then used by the host to track pages that were dirtied. >> 2. 
Bypassing page dirtying when it is not needed. > > How would this work with the device doing DMA operations _after_ the migration? > That is the driver submits a DMA READ.. migrates away, device is unplugged, > VT-d context is torn down - device does the DMA READ gets a VT-d error... > > and what then? How should the device on the other host replay the DMA READ? The device has to quiesce before the migration can occur. We cannot have any DMA mappings still open when we reach the stop-and-copy phase of the migration. The solution I have proposed here works for streaming mappings but doesn't solve the case for things like dma_alloc_coherent where a bidirectional mapping is maintained between the CPU and the device. >> >> The two mechanisms referenced above would likely require coordination with >> QEMU and as such are open to discussion. I haven't attempted to address >> them as I am not sure there is a consensus as of yet. My personal >> preference would be to add a vendor-specific configuration block to the >> emulated pci-bridge interfaces created by QEMU that would allow us to >> essentially extend shpc to support guest live migration with pass-through >> devices. > > shpc? That is kind of what I was thinking. We basically need some mechanism to allow for the host to ask the device to quiesce. It has been proposed to possibly even look at something like an ACPI interface since I know ACPI is used by QEMU to manage hot-plug in the standard case. - Alex
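As a rough illustration of where the dirtying would hook in for streaming mappings, here is a userspace sketch. The enum values and function names are stand-ins, not the kernel's DMA API, and the page counting is simplified.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch of the proposed hook: on unmap (or sync_*_for_cpu),
 * any mapping the device may have written is reported dirty so the
 * hypervisor's pre-copy pass picks it up.  The enum and both functions
 * are hypothetical stand-ins for the kernel's DMA API. */
enum dma_dir { DMA_TO_DEVICE, DMA_FROM_DEVICE, DMA_BIDIRECTIONAL };

static int pages_marked_dirty;

static void dma_mark_dirty_stub(void *addr, size_t size)
{
    (void)addr;
    /* Real code would touch each 4 KiB page to fault it as written;
     * here we just count the pages covered by the mapping. */
    pages_marked_dirty += (int)((size + 4095) / 4096);
}

static void dma_unmap_sketch(void *addr, size_t size, enum dma_dir dir)
{
    /* Only mappings the device can write need dirtying; TO_DEVICE
     * mappings were read-only from the device's point of view. */
    if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
        dma_mark_dirty_stub(addr, size);
}
```

Note that this covers exactly the streaming case discussed above; a coherent mapping from dma_alloc_coherent has no unmap/sync point to hook, which is why it remains unsolved.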
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin wrote: > On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote: >> >> >> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote: >> >As long as you keep up this vague talk about performance during >> >migration, without even bothering with any measurements, this patchset >> >will keep going nowhere. >> > >> >> I measured network service downtime for "keep device alive"(RFC patch V1 >> presented) and "put down and up network interface"(RFC patch V2 presented) >> during migration with some optimizations. >> >> The former is around 140ms and the latter is around 240ms. >> >> My patchset relies on the mailbox irq which doesn't work in the suspend state >> and so can't get downtime for suspend/resume cases. Will try to get the >> result later. > > > Interesting. So you are saying merely ifdown/ifup is 100ms? > This does not sound reasonable. > Is there a chance you are e.g. getting IP from dhcp? Actually it wouldn't surprise me if that is due to the reset logic in the driver. For starters there is a 10 msec delay in the call ixgbevf_reset_hw_vf which I believe is present to allow the PF time to clear registers after the VF has requested a reset. There is also a 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues were disabled. That is in addition to the fact that the function that disables the queues does so serially and polls each queue until the hardware acknowledges that the queues are actually disabled. The driver also does the serial enable with poll logic on re-enabling the queues which likely doesn't help things. Really this driver is probably in need of a refactor to clean the cruft out of the reset and initialization logic. I suspect we have far more delays than we really need and that is the source of much of the slow down. 
- Alex
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Tue, Dec 29, 2015 at 9:15 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote: >> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin <m...@redhat.com> wrote: >> > On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote: >> >> >> >> >> >> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote: >> >> >As long as you keep up this vague talk about performance during >> >> >migration, without even bothering with any measurements, this patchset >> >> >will keep going nowhere. >> >> > >> >> >> >> I measured network service downtime for "keep device alive"(RFC patch V1 >> >> presented) and "put down and up network interface"(RFC patch V2 presented) >> >> during migration with some optimizations. >> >> >> >> The former is around 140ms and the later is around 240ms. >> >> >> >> My patchset relies on the maibox irq which doesn't work in the suspend >> >> state >> >> and so can't get downtime for suspend/resume cases. Will try to get the >> >> result later. >> > >> > >> > Interesting. So you sare saying merely ifdown/ifup is 100ms? >> > This does not sound reasonable. >> > Is there a chance you are e.g. getting IP from dhcp? >> >> >> Actually it wouldn't surprise me if that is due to a reset logic in >> the driver. For starters there is a 10 msec delay in the call >> ixgbevf_reset_hw_vf which I believe is present to allow the PF time to >> clear registers after the VF has requested a reset. There is also a >> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues >> were disabled. That is in addition to the fact that the function that >> disables the queues does so serially and polls each queue until the >> hardware acknowledges that the queues are actually disabled. The >> driver also does the serial enable with poll logic on re-enabling the >> queues which likely doesn't help things. 
>> >> Really this driver is probably in need of a refactor to clean the >> cruft out of the reset and initialization logic. I suspect we have >> far more delays than we really need and that is the source of much of >> the slow down. > > For ifdown, why is there any need to reset the device at all? > Is it so buffers can be reclaimed? > I believe it is mostly historical. All the Intel drivers are derived from e1000. The e1000 has a 10ms sleep to allow outstanding PCI transactions to complete before resetting and it looks like they ended up inheriting that in the ixgbevf driver. I suppose it does allow for the buffers to be reclaimed which is something we may need, though the VF driver should have already verified that it disabled the queues when it was doing the polling on the bits being cleared in the individual queue control registers. Likely the 10ms sleep is redundant as a result. - Alex
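To see why the serial disable-and-poll pattern described above hurts, here is a toy model. The queue count and poll counts are invented; in the real driver each poll iteration includes a delay, so total ifdown latency grows linearly with the number of queues.

```c
#include <assert.h>

/* Toy model of serial queue disable: for each queue in turn, clear the
 * enable bit and poll until the (simulated) hardware acknowledges.
 * All numbers are made up for illustration. */
#define NUM_QUEUES      8
#define ACK_AFTER_POLLS 3   /* pretend hw acks on the 3rd read */

static int polls_total;

/* Simulated register read of the per-queue enable/ack bit. */
static int hw_queue_disabled(int queue, int polls_so_far)
{
    (void)queue;
    return polls_so_far >= ACK_AFTER_POLLS;
}

/* Returns the total number of polls; real code sleeps between polls,
 * so this count translates directly into ifdown latency. */
static int disable_queues_serially(void)
{
    polls_total = 0;
    for (int q = 0; q < NUM_QUEUES; q++) {
        int polls = 0;
        while (!hw_queue_disabled(q, polls))
            polls++;              /* each iteration costs a delay */
        polls_total += polls;
    }
    return polls_total;
}
```

An obvious refactor along the lines discussed above is to clear all the enable bits first and then poll all queues in one pass, so the hardware quiesces the queues in parallel rather than one at a time.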
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote: >> The PCI hot-plug specification calls out that the OS can optionally >> implement a "pause" mechanism which is meant to be used for high >> availability type environments. What I am proposing is basically >> extending the standard SHPC capable PCI bridge so that we can support >> the DMA page dirtying for everything hosted on it, add a vendor >> specific block to the config space so that the guest can notify the >> host that it will do page dirtying, and add a mechanism to indicate >> that all hot-plug events during the warm-up phase of the migration are >> pause events instead of full removals. > > Two comments: > > 1. A vendor specific capability will always be problematic. > Better to register a capability id with pci sig. > > 2. There are actually several capabilities: > > A. support for memory dirtying > if not supported, we must stop device before migration > > This is supported by core guest OS code, > using patches similar to posted by you. > > > B. support for device replacement > This is a faster form of hotplug, where device is removed and > later another device using same driver is inserted in the same slot. > > This is a possible optimization, but I am convinced > (A) should be implemented independently of (B). > My thought on this was that we don't need much to really implement either feature. Really only a bit or two for either one. I had thought about extending the PCI Advanced Features, but for now it might make more sense to just implement it as a vendor capability for the QEMU based bridges instead of trying to make this a true PCI capability since I am not sure if this in any way would apply to physical hardware. 
The fact is the PCI Advanced Features capability is essentially just a vendor specific capability with a different ID so if we were to use 2 bits that are currently reserved in the capability we could later merge the functionality without much overhead. I fully agree that the two implementations should be separate but nothing says we have to implement them completely differently. If we are just using 3 bits for capability, status, and control of each feature there is no reason for them to need to be stored in separate locations. >> I've been poking around in the kernel and QEMU code and the part I >> have been trying to sort out is how to get QEMU based pci-bridge to >> use the SHPC driver because from what I can tell the driver never >> actually gets loaded on the device as it is left in the control of >> ACPI hot-plug. > > There are ways, but you can just use pci express, it's easier. That's true. I should probably just give up on trying to do an implementation that works with the i440fx implementation. I could probably move over to the q35 and once that is done then we could look at something like the PCI Advanced Features solution for something like the PCI-bridge drivers. - Alex
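A hypothetical encoding of the "3 bits per feature" idea might look like the following. None of these registers or bit positions exist anywhere; this is purely a sketch of the negotiation, with feature A being DMA page dirtying and feature B the device replacement / "pause" path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout for the vendor-specific block discussed above:
 * capability/status/control bits per feature.  This is not a real
 * QEMU or PCI-SIG definition. */
#define MIG_CAP_DIRTY_SUPPORTED  (1u << 0)  /* RO: host supports tracking  */
#define MIG_STA_DIRTY_ACTIVE     (1u << 1)  /* RO: guest ack visible       */
#define MIG_CTL_DIRTY_ENABLE     (1u << 2)  /* RW: guest enables dirtying  */
#define MIG_CAP_PAUSE_SUPPORTED  (1u << 4)
#define MIG_STA_PAUSE_ACTIVE     (1u << 5)
#define MIG_CTL_PAUSE_ENABLE     (1u << 6)

/* Host side: migration with the device attached may proceed only once
 * the guest has set the control bit for a host-advertised capability;
 * otherwise the device must be evicted before warm-up. */
static int dirty_tracking_negotiated(uint32_t reg)
{
    return (reg & MIG_CAP_DIRTY_SUPPORTED) &&
           (reg & MIG_CTL_DIRTY_ENABLE);
}
```

Keeping both features in one register like this is the "no reason for them to be stored in separate locations" point: the implementations stay independent while sharing one config-space block.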
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Sun, Dec 27, 2015 at 7:20 PM, Dong, Eddie wrote: >> > >> > Even if the device driver doesn't support migration, you still want to >> > migrate VM? That may be risky and we should add the "bad path" for the >> > driver at least. >> >> At a minimum we should have support for hot-plug if we are expecting to >> support migration. You would simply have to hot-plug the device before you >> start migration and then return it after. That is how the current bonding >> approach for this works if I am not mistaken. > > Hotplug is good to eliminate the device specific state clone, but the bonding > approach is very network specific, it doesn’t work for other devices such as > FPGA devices, QAT devices & GPU devices, which we plan to support gradually :) Hotplug would be usable for that assuming the guest supports the optional "pause" implementation as called out in the PCI hotplug spec. With that the device can maintain state for some period of time after the hotplug remove event has occurred. The problem is that you have to get the device to quiesce at some point as you cannot complete the migration with the device still active. The way you were doing it was using the per-device configuration space mechanism. That doesn't scale when you have to implement it for each and every driver for each and every OS you have to support. Using the "pause" implementation for hot-plug would have a much greater likelihood of scaling as you could either take the fast path approach of "pausing" the device to resume it when migration has completed, or you could just remove the device and restart the driver on the other side if the pause support is not yet implemented. You would lose the state under such a migration but it is much more practical than having to implement a per device solution. - Alex
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Thu, Dec 24, 2015 at 11:03 PM, Lan Tianyu <tianyu@intel.com> wrote: > Merry Christmas. > Sorry for the late response due to personal affairs. > > On 2015年12月14日 03:30, Alexander Duyck wrote: >>> > This sounds like we need to add a fake bridge for migration and add a >>> > driver in the guest for it. It also needs to extend the PCI bus/hotplug >>> > driver to do pause/resume of other devices, right? >>> > >>> > My concern is still whether we can change the PCI bus/hotplug like that >>> > without a spec change. >>> > >>> > IRQ should be general for any device and we may extend it for >>> > migration. The device driver can also decide whether to support migration >>> > or not. >> The device should have no say in the matter. Either we are going to >> migrate or we will not. This is why I have suggested my approach as >> it allows for the least amount of driver intrusion while providing the >> maximum number of ways to still perform migration even if the device >> doesn't support it. > > Even if the device driver doesn't support migration, you still want to > migrate the VM? That may be risky and we should add the "bad path" for the > driver at least. At a minimum we should have support for hot-plug if we are expecting to support migration. You would simply have to hot-plug the device before you start migration and then return it after. That is how the current bonding approach for this works if I am not mistaken. The advantage we are looking to gain is to avoid removing/disabling the device for as long as possible. Ideally we want to keep the device active through the warm-up period, but if the guest doesn't do that we should still be able to fall back on the older approaches if needed. >> >> The solution I have proposed is simple: >> >> 1. Extend swiotlb to allow for a page dirtying functionality. >> >> This part is pretty straight forward. I'll submit a few patches >> later today as RFC that can provide the minimal functionality needed >> for this. > > Would very much appreciate that. 
> >> 2. Provide a vendor specific configuration space option on the QEMU >> implementation of a PCI bridge to act as a bridge between direct >> assigned devices and the host bridge. >> >> My thought was to add some vendor specific block that includes a >> capabilities, status, and control register so you could go through and >> synchronize things like the DMA page dirtying feature. The bridge >> itself could manage the migration capable bit inside QEMU for all >> devices assigned to it. So if you added a VF to the bridge it would >> flag that you can support migration in QEMU, while the bridge would >> indicate you cannot until the DMA page dirtying control bit is set by >> the guest. >> >> We could also go through and optimize the DMA page dirtying after >> this is added so that we can narrow down the scope of use, and as a >> result improve the performance for other devices that don't need to >> support migration. It would then be a matter of adding an interrupt >> in the device to handle an event such as the DMA page dirtying status >> bit being set in the config space status register, while the bit is >> not set in the control register. If it doesn't get set then we would >> have to evict the devices before the warm-up phase of the migration, >> otherwise we can defer it until the end of the warm-up phase. >> >> 3. Extend existing shpc driver to support the optional "pause" >> functionality as called out in section 4.1.2 of the Revision 1.1 PCI >> hot-plug specification. > > Since your solution has added a fake PCI bridge, why not notify the > bridge directly during migration via irq and call the device driver's > callback in the new bridge driver? > > Otherwise, the new bridge driver also can check whether the device > driver provides a migration callback or not and call it to improve the > passthrough device's performance during migration. This is basically what I had in mind. Though I would take things one step further. 
You don't need to add any new call-backs if you make use of the existing suspend/resume logic. For a VF this does exactly what you would need since the VFs don't support wake on LAN so it will simply clear the bus master enable and put the netdev in a suspended state until the resume can be called. The PCI hot-plug specification calls out that the OS can optionally implement a "pause" mechanism which is meant to be used for high availability type environments. What I am proposing is basically extending the standard SHPC capable PCI bridge so th
Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
On Mon, Dec 14, 2015 at 6:00 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Sun, Dec 13, 2015 at 01:28:31PM -0800, Alexander Duyck wrote:
>> This patch is meant to provide the guest with a way of flagging DMA pages
>> as being dirty to the host when using a direct-assign device within a
>> guest. The advantage to this approach is that it is fairly simple, however
>> it currently has a significant impact on device performance in all the
>> scenarios where it won't be needed.
>>
>> As such this is really meant only as a proof of concept and to get the ball
>> rolling in terms of figuring out how best to approach the issue of dirty
>> page tracking for a guest that is using a direct assigned device. In
>> addition with just this patch it should be possible to modify current
>> migration approaches so that instead of having to hot-remove the device
>> before starting the migration this can instead be delayed until the period
>> before the final stop and copy.
>>
>> Signed-off-by: Alexander Duyck <adu...@mirantis.com>
>> ---
>>  arch/arm/include/asm/dma-mapping.h       |  3 ++-
>>  arch/arm64/include/asm/dma-mapping.h     |  5 ++---
>>  arch/ia64/include/asm/dma.h              |  1 +
>>  arch/mips/include/asm/dma-mapping.h      |  1 +
>>  arch/powerpc/include/asm/swiotlb.h       |  1 +
>>  arch/tile/include/asm/dma-mapping.h      |  1 +
>>  arch/unicore32/include/asm/dma-mapping.h |  1 +
>>  arch/x86/Kconfig                         | 11 +++
>>  arch/x86/include/asm/swiotlb.h           | 26 ++
>>  drivers/xen/swiotlb-xen.c                |  6 ++
>>  lib/swiotlb.c                            |  6 ++
>>  11 files changed, 58 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
>> index ccb3aa64640d..1962d7b471c7 100644
>> --- a/arch/arm/include/asm/dma-mapping.h
>> +++ b/arch/arm/include/asm/dma-mapping.h
>> @@ -167,7 +167,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
>>  	return 1;
>>  }
>>
>> -static inline void dma_mark_clean(void *addr, size_t size) { }
>> +static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  extern int arm_dma_set_mask(struct device *dev, u64 dma_mask);
>>
>> diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h
>> index 61e08f360e31..8d24fe11c8a3 100644
>> --- a/arch/arm64/include/asm/dma-mapping.h
>> +++ b/arch/arm64/include/asm/dma-mapping.h
>> @@ -84,9 +84,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
>>  	return addr + size - 1 <= *dev->dma_mask;
>>  }
>>
>> -static inline void dma_mark_clean(void *addr, size_t size)
>> -{
>> -}
>> +static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #endif /* __KERNEL__ */
>>  #endif /* __ASM_DMA_MAPPING_H */
>> diff --git a/arch/ia64/include/asm/dma.h b/arch/ia64/include/asm/dma.h
>> index 4d97f60f1ef5..d92ebeb2758e 100644
>> --- a/arch/ia64/include/asm/dma.h
>> +++ b/arch/ia64/include/asm/dma.h
>> @@ -20,5 +20,6 @@ extern unsigned long MAX_DMA_ADDRESS;
>>  #define free_dma(x)
>>
>>  void dma_mark_clean(void *addr, size_t size);
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #endif /* _ASM_IA64_DMA_H */
>> diff --git a/arch/mips/include/asm/dma-mapping.h b/arch/mips/include/asm/dma-mapping.h
>> index e604f760c4a0..567f6e03e337 100644
>> --- a/arch/mips/include/asm/dma-mapping.h
>> +++ b/arch/mips/include/asm/dma-mapping.h
>> @@ -28,6 +28,7 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
>>  }
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #include
>>
>> diff --git a/arch/powerpc/include/asm/swiotlb.h b/arch/powerpc/include/asm/swiotlb.h
>> index de99d6e29430..b694e8399e28 100644
>> --- a/arch/powerpc/include/asm/swiotlb.h
>> +++ b/arch/powerpc/include/asm/swiotlb.h
>> @@ -16,6 +16,7 @@
>>  extern struct dma_map_ops swiotlb_dma_ops;
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void
Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote: >> > This way distro can use a guest agent to disable >> > dirtying until before migration starts. >> >> Right. For a v2 version I would definitely want to have some way to >> limit the scope of this. My main reason for putting this out here is >> to start altering the course of discussions since it seems like we >> weren't getting anywhere with the ixgbevf migration changes that were >> being proposed.
> >
>> >> +	unsigned long pg_addr, start;
>> >> +
>> >> +	start = (unsigned long)addr;
>> >> +	pg_addr = PAGE_ALIGN(start + size);
>> >> +	start &= ~(sizeof(atomic_t) - 1);
>> >> +
>> >> +	/* trigger a write fault on each page, excluding first page */
>> >> +	while ((pg_addr -= PAGE_SIZE) > start)
>> >> +		atomic_add(0, (atomic_t *)pg_addr);
>> >> +
>> >> +	/* trigger a write fault on first word of DMA */
>> >> +	atomic_add(0, (atomic_t *)start);
>> > >> > start might not be aligned correctly for a cast to atomic_t. >> > It's harmless to do this for any memory, so I think you should >> > just do this for 1st byte of all pages including the first one. >> >> You may not have noticed it but I actually aligned start in the line >> after pg_addr. > > Yes you did. alignof would make it a bit more noticeable. > >> However instead of aligning to the start of the next >> atomic_t I just masked off the lower bits so that we start at the >> DWORD that contains the first byte of the starting address. The >> assumption here is that I cannot trigger any sort of fault since if I >> have access to a given byte within a DWORD I will have access to the >> entire DWORD. > > I'm curious where does this come from. Isn't it true that access is > controlled at page granularity normally, so you can touch beginning of > page just as well? Yeah, I am pretty sure it probably is page granularity. 
However my thought was to try and stick to the start of the DMA as the last access. That way we don't pull in any more cache lines than we need to in order to dirty the pages. Usually the start of the DMA region will contain some sort of headers or something that needs to be accessed with the highest priority so I wanted to make certain that we were forcing usable data into the L1 cache rather than just the first cache line of the page where the DMA started. If however the start of a DMA was the start of the page there is nothing there to prevent that. >> I coded this up so that the spots where we touch the >> memory should match up with addresses provided by the hardware to >> perform the DMA over the PCI bus. > > Yes but there's no requirement to do it like this from > virt POV. You just need to touch each page. I know, but at the same time if we match up with the DMA then it is more likely that we avoid grabbing unneeded cache lines. In the case of most drivers the data for headers and start is at the start of the DMA. So if we dirty the cache line associated with the start of the DMA it will be pulled into the L1 cache and there is a greater chance that it may already be prefetched as well. >> Also I intentionally ran from highest address to lowest since that way >> we don't risk pushing the first cache line of the DMA buffer out of >> the L1 cache due to the PAGE_SIZE stride. > > Interesting. How does order of access help with this? If you use a PAGE_SIZE stride you will start evicting things from L1 cache after something like 8 accesses on an x86 processor as most of the recent ones have a 32K 8 way associative L1 cache. So if I go from back to front then I evict the stuff that would likely be in the data portion of a buffer instead of headers which are usually located at the front. > By the way, if you are into these micro-optimizations you might want to > limit prefetch, to this end you want to access the last line of the > page. 
And it's probably worth benchmarking a bit and not doing it all just > based on theory, keep code simple in v1 otherwise. My main goal for now is functional code over high performance code. That is why I have kept this code fairly simple. I might have done some optimization but it was as much about the optimization as keeping the code simple. For example by using the start of the page instead of the end I could easily do the comparison against start and avoid doing more than one write per page. The issue for me doing performance testing is that I don't have anyth
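The dirtying loop discussed in this sub-thread can be exercised in userspace. This analog keeps the same high-to-low walk and the masked start address from the patch, with a counter standing in for the write faults; the page size is hard-coded and all names are invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PG_SIZE     4096UL
#define PG_ALIGN(x) (((x) + PG_SIZE - 1) & ~(PG_SIZE - 1))

static int touches;

/* Stand-in for atomic_add(0, ...): a locked read-modify-write of zero
 * leaves the value unchanged but counts as a store, so a
 * write-protected page would fault and be logged as dirty. */
static void touch_word(volatile uint32_t *p)
{
    __sync_fetch_and_add(p, 0);
    touches++;
}

/* Userspace analog of the patch's loop: walk from the highest page
 * down to the first word of the buffer, one store per page.  Going
 * high-to-low keeps the cache line at the start of the DMA (usually
 * headers) hottest in L1, per the discussion above. */
static void mark_dirty_sketch(void *addr, size_t size)
{
    unsigned long start = (unsigned long)addr;
    unsigned long pg_addr = PG_ALIGN(start + size);

    start &= ~(sizeof(uint32_t) - 1);

    /* touch each page except the one containing 'start'... */
    while ((pg_addr -= PG_SIZE) > start)
        touch_word((volatile uint32_t *)pg_addr);

    /* ...then touch the first word of the DMA region last */
    touch_word((volatile uint32_t *)start);
}
```

For a page-aligned three-page buffer this performs exactly three stores, one per page, ending at the first word of the mapping.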
Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
On Mon, Dec 14, 2015 at 12:52 PM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Mon, Dec 14, 2015 at 09:59:13AM -0800, Alexander Duyck wrote: >> On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin <m...@redhat.com> wrote: >> > On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote: >> >> > This way distro can use a guest agent to disable >> >> > dirtying until before migration starts. >> >> >> >> Right. For a v2 version I would definitely want to have some way to >> >> limit the scope of this. My main reason for putting this out here is >> >> to start altering the course of discussions since it seems like we >> >> weren't getting anywhere with the ixgbevf migration changes that were >> >> being proposed. >> > >> > Absolutely, thanks for working on this. >> >
>> >> >> +	unsigned long pg_addr, start;
>> >> >> +
>> >> >> +	start = (unsigned long)addr;
>> >> >> +	pg_addr = PAGE_ALIGN(start + size);
>> >> >> +	start &= ~(sizeof(atomic_t) - 1);
>> >> >> +
>> >> >> +	/* trigger a write fault on each page, excluding first page */
>> >> >> +	while ((pg_addr -= PAGE_SIZE) > start)
>> >> >> +		atomic_add(0, (atomic_t *)pg_addr);
>> >> >> +
>> >> >> +	/* trigger a write fault on first word of DMA */
>> >> >> +	atomic_add(0, (atomic_t *)start);
>
> Actually, I have second thoughts about using atomic_add here,
> especially for _sync.
>
> Many architectures do
>
> #define ATOMIC_OP_RETURN(op, c_op)					\
> static inline int atomic_##op##_return(int i, atomic_t *v)		\
> {									\
> 	unsigned long flags;						\
> 	int ret;							\
> 									\
> 	raw_local_irq_save(flags);					\
> 	ret = (v->counter = v->counter c_op i);				\
> 	raw_local_irq_restore(flags);					\
> 									\
> 	return ret;							\
> }
>
> and this is not safe if device is still doing DMA to/from
> this memory.
>
> Generally, atomic_t is there for SMP effects, not for sync
> with devices. 
> > This is why I said you should do > cmpxchg(pg_addr, 0xdead, 0xdead); > > Yes, we probably never actually want to run m68k within a VM, > but let's not misuse interfaces like this. Right now this implementation is for x86 only. Any other architecture currently reports dma_mark_dirty as an empty inline function. The reason why I chose the atomic_add for x86 is simply because it is guaranteed dirty the cache line with relatively few instructions and operands as all I have to have is the pointer and 0. For the m68k we could implement it as a cmpxchg instead. The general thought here is that each architecture is probably going to have to do it a little bit differently. >> >> > >> >> > start might not be aligned correctly for a cast to atomic_t. >> >> > It's harmless to do this for any memory, so I think you should >> >> > just do this for 1st byte of all pages including the first one. >> >> >> >> You may not have noticed it but I actually aligned start in the line >> >> after pg_addr. >> > >> > Yes you did. alignof would make it a bit more noticeable. >> > >> >> However instead of aligning to the start of the next >> >> atomic_t I just masked off the lower bits so that we start at the >> >> DWORD that contains the first byte of the starting address. The >> >> assumption here is that I cannot trigger any sort of fault since if I >> >> have access to a given byte within a DWORD I will have access to the >> >> entire DWORD. >> > >> > I'm curious where does this come from. Isn't it true that access is >> > controlled at page granularity normally, so you can touch beginning of >> > page just as well? >> >> Yeah, I am pretty sure it probably is page granularity. However my >> thought was to try and stick to the start of the DMA as the last >> access. That way we don't pull in any more cache lines
Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang <yang.zhang...@gmail.com> wrote: > On 2015/12/14 5:28, Alexander Duyck wrote: >> >> This patch set is meant to be the guest side code for a proof of concept >> involving leaving pass-through devices in the guest during the warm-up >> phase of guest live migration. In order to accomplish this I have added a >> new function called dma_mark_dirty that will mark the pages associated >> with >> the DMA transaction as dirty in the case of either an unmap or a >> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or >> DMA_BIDIRECTIONAL. The pass-through device must still be removed before >> the stop-and-copy phase, however allowing the device to be present should >> significantly improve the performance of the guest during the warm-up >> period. >> >> This current implementation is very preliminary and there are number of >> items still missing. Specifically in order to make this a more complete >> solution we need to support: >> 1. Notifying hypervisor that drivers are dirtying DMA pages received >> 2. Bypassing page dirtying when it is not needed. >> > > Shouldn't current log dirty mechanism already cover them? The guest has no way of currently knowing that the hypervisor is doing dirty page logging, and the log dirty mechanism currently has no way of tracking device DMA accesses. This change is meant to bridge the two so that the guest device driver will force the SWIOTLB DMA API to mark pages written to by the device as dirty. >> The two mechanisms referenced above would likely require coordination with >> QEMU and as such are open to discussion. I haven't attempted to address >> them as I am not sure there is a consensus as of yet. My personal >> preference would be to add a vendor-specific configuration block to the >> emulated pci-bridge interfaces created by QEMU that would allow us to >> essentially extend shpc to support guest live migration with pass-through >> devices. 
>> >> The functionality in this patch set is currently disabled by default. To >> enable it you can select "SWIOTLB page dirtying" from the "Processor type >> and features" menu. > > > Only SWIOTLB is supported? Yes. For right now this only supports SWIOTLB. The assumption here is that SWIOTLB is in use for most cases where an IOMMU is not present. If an IOMMU is present in a virtualized guest then most likely the IOMMU might be able to provide a separate mechanism for dirty page tracking. - Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Sun, Dec 13, 2015 at 7:47 AM, Lan, Tianyu <tianyu@intel.com> wrote: > > > On 12/11/2015 1:16 AM, Alexander Duyck wrote: >> >> On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu@intel.com> wrote: >>> >>> >>> >>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: >>>>> >>>>> >>>>> Ideally, it is able to leave guest driver unmodified but it requires >>>>> the >>>>>> >>>>>> hypervisor or qemu to aware the device which means we may need a >>>>>> driver >>>>>> in >>>>>> hypervisor or qemu to handle the device on behalf of guest driver. >>>> >>>> >>>> Can you answer the question of when do you use your code - >>>> at the start of migration or >>>> just before the end? >>> >>> >>> >>> Just before stopping VCPU in this version and inject VF mailbox irq to >>> notify the driver if the irq handler is installed. >>> Qemu side also will check this via the faked PCI migration capability >>> and driver will set the status during device open() or resume() callback. >> >> >> The VF mailbox interrupt is a very bad idea. Really the device should >> be in a reset state on the other side of a migration. It doesn't make >> sense to have the interrupt firing if the device is not configured. >> This is one of the things that is preventing you from being able to >> migrate the device while the interface is administratively down or the >> VF driver is not loaded. > > > From my opinion, if VF driver is not loaded and hardware doesn't start > to work, the device state doesn't need to be migrated. > > We may add a flag for driver to check whether migration happened during it's > down and reinitialize the hardware and clear the flag when system try to put > it up. > > We may add migration core in the Linux kernel and provide some helps > functions to facilitate to add migration support for drivers. > Migration core is in charge to sync status with Qemu. > > Example. 
> migration_register() > Driver provides > - Callbacks to be called before and after migration or for bad path > - Its irq which it prefers to deal with migration event. You would be better off just using function pointers in the pci_driver struct and let the PCI driver registration take care of all that. > migration_event_check() > Driver calls it in the irq handler. Migration core code will check > migration status and call its callbacks when migration happens. No, this is still a bad idea. You haven't addressed what you do when the device has had interrupts disabled such as being in the down state. This is the biggest issue I see with your whole patch set. It requires the driver containing certain changes and being in a certain state. You cannot put those expectations on the guest. You really need to try and move as much of this out to existing functionality as possible. >> >> My thought on all this is that it might make sense to move this >> functionality into a PCI-to-PCI bridge device and make it a >> requirement that all direct-assigned devices have to exist behind that >> device in order to support migration. That way you would be working >> with a directly emulated device that would likely already be >> supporting hot-plug anyway. Then it would just be a matter of coming >> up with a few Qemu specific extensions that you would need to add to >> the device itself. The same approach would likely be portable enough >> that you could achieve it with PCIe as well via the same configuration >> space being present on the upstream side of a PCIe port or maybe a >> PCIe switch of some sort. 
>> >> It would then be possible to signal via your vendor-specific PCI >> capability on that device that all devices behind this bridge require >> DMA page dirtying, you could use the configuration in addition to the >> interrupt already provided for hot-plug to signal things like when you >> are starting migration, and possibly even just extend the shpc >> functionality so that if this capability is present you have the >> option to pause/resume instead of remove/probe the device in the case >> of certain hot-plug events. The fact is there may be some use for a >> pause/resume type approach for PCIe hot-plug in the near future >> anyway. From the sounds of it Apple has required it for all >> Thunderbolt device drivers so that they can halt the device in order >> to shuffle resources around, perhaps we should look at something >> similar for Linux
[RFC PATCH 2/3] xen/swiotlb: Fold static unmap and sync calls into calling functions
This change essentially does two things. First it folds the swiotlb_unmap and swiotlb_sync calls into their callers. The goal behind this is three fold. First this helps to reduce execution time and improves performance since we aren't having to call into as many functions. Second it allows us to split up some of the sync functionality as there is the dma_mark_clean portion of the sync call that is only really needed for dma_sync_for_cpu since we don't actually want to mark the page as clean if we are syncing for the device. The second change is to move dma_mark_clean inside the if statement instead of using a return in the case of sync and unmap. By doing this we make it so that we can also add a dma_mark_dirty function later. Signed-off-by: Alexander Duyck <adu...@mirantis.com> --- drivers/xen/swiotlb-xen.c | 90 ++--- 1 file changed, 36 insertions(+), 54 deletions(-) diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 7399782c0998..2154c70e47da 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -432,9 +432,9 @@ EXPORT_SYMBOL_GPL(xen_swiotlb_map_page); * After this call, reads by the cpu to the buffer are guaranteed to see * whatever the device wrote there. */ -static void xen_unmap_single(struct device *hwdev, dma_addr_t dev_addr, -size_t size, enum dma_data_direction dir, -struct dma_attrs *attrs) +void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) { phys_addr_t paddr = xen_bus_to_phys(dev_addr); @@ -448,23 +448,14 @@ static void xen_unmap_single(struct device *hwdev, dma_addr_t dev_addr, return; } - if (dir != DMA_FROM_DEVICE) - return; - /* * phys_to_virt doesn't work with hihgmem page but we could * call dma_mark_clean() with hihgmem page here. However, we * are fine since dma_mark_clean() is null on POWERPC. We can * make dma_mark_clean() take a physical address if necessary. 
*/ - dma_mark_clean(phys_to_virt(paddr), size); -} - -void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir, - struct dma_attrs *attrs) -{ - xen_unmap_single(hwdev, dev_addr, size, dir, attrs); + if (dir == DMA_FROM_DEVICE) + dma_mark_clean(phys_to_virt(paddr), size); } EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_page); @@ -478,36 +469,22 @@ EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_page); * address back to the card, you must first perform a * xen_swiotlb_dma_sync_for_device, and then the device again owns the buffer */ -static void -xen_swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir, - enum dma_sync_target target) +void +xen_swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr, + size_t size, enum dma_data_direction dir) { phys_addr_t paddr = xen_bus_to_phys(dev_addr); BUG_ON(dir == DMA_NONE); - if (target == SYNC_FOR_CPU) - xen_dma_sync_single_for_cpu(hwdev, dev_addr, size, dir); + xen_dma_sync_single_for_cpu(hwdev, dev_addr, size, dir); /* NOTE: We use dev_addr here, not paddr! 
*/ if (is_xen_swiotlb_buffer(dev_addr)) - swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target); - - if (target == SYNC_FOR_DEVICE) - xen_dma_sync_single_for_device(hwdev, dev_addr, size, dir); - - if (dir != DMA_FROM_DEVICE) - return; + swiotlb_tbl_sync_single(hwdev, paddr, size, dir, SYNC_FOR_CPU); - dma_mark_clean(phys_to_virt(paddr), size); -} - -void -xen_swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir) -{ - xen_swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_CPU); + if (dir == DMA_FROM_DEVICE) + dma_mark_clean(phys_to_virt(paddr), size); } EXPORT_SYMBOL_GPL(xen_swiotlb_sync_single_for_cpu); @@ -515,7 +492,16 @@ void xen_swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr, size_t size, enum dma_data_direction dir) { - xen_swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_DEVICE); + phys_addr_t paddr = xen_bus_to_phys(dev_addr); + + BUG_ON(dir == DMA_NONE); + + /* NOTE: We use dev_addr here, not paddr! */ + if (is_xen_swiotlb_buffer(dev_addr)) + swiotlb_tbl_sync_single(hwdev, paddr, size, dir, + SYNC_FOR_
[RFC PATCH 1/3] swiotlb: Fold static unmap and sync calls into calling functions
This change essentially does two things. First it folds the swiotlb_unmap and swiotlb_sync calls into their callers. The goal behind this is three fold. First this helps to reduce execution time and improves performance since we aren't having to call into as many functions. Second it allows us to split up some of the sync functionality as there is the dma_mark_clean portion of the sync call that is only really needed for dma_sync_for_cpu since we don't actually want to mark the page as clean if we are syncing for the device. The second change is to move dma_mark_clean inside the if statement instead of using a return in the case of sync and unmap. By doing this we make it so that we can also add a dma_mark_dirty function later. Signed-off-by: Alexander Duyck <adu...@mirantis.com> --- lib/swiotlb.c | 81 +++-- 1 file changed, 33 insertions(+), 48 deletions(-) diff --git a/lib/swiotlb.c b/lib/swiotlb.c index 76f29ecba8f4..384ac06217b2 100644 --- a/lib/swiotlb.c +++ b/lib/swiotlb.c @@ -781,8 +781,9 @@ EXPORT_SYMBOL_GPL(swiotlb_map_page); * After this call, reads by the cpu to the buffer are guaranteed to see * whatever the device wrote there. */ -static void unmap_single(struct device *hwdev, dma_addr_t dev_addr, -size_t size, enum dma_data_direction dir) +void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) { phys_addr_t paddr = dma_to_phys(hwdev, dev_addr); @@ -793,23 +794,14 @@ static void unmap_single(struct device *hwdev, dma_addr_t dev_addr, return; } - if (dir != DMA_FROM_DEVICE) - return; - /* * phys_to_virt doesn't work with hihgmem page but we could * call dma_mark_clean() with hihgmem page here. However, we * are fine since dma_mark_clean() is null on POWERPC. We can * make dma_mark_clean() take a physical address if necessary. 
*/ - dma_mark_clean(phys_to_virt(paddr), size); -} - -void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir, - struct dma_attrs *attrs) -{ - unmap_single(hwdev, dev_addr, size, dir); + if (dir == DMA_FROM_DEVICE) + dma_mark_clean(phys_to_virt(paddr), size); } EXPORT_SYMBOL_GPL(swiotlb_unmap_page); @@ -823,31 +815,21 @@ EXPORT_SYMBOL_GPL(swiotlb_unmap_page); * address back to the card, you must first perform a * swiotlb_dma_sync_for_device, and then the device again owns the buffer */ -static void -swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir, - enum dma_sync_target target) +void +swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr, + size_t size, enum dma_data_direction dir) { phys_addr_t paddr = dma_to_phys(hwdev, dev_addr); BUG_ON(dir == DMA_NONE); if (is_swiotlb_buffer(paddr)) { - swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target); + swiotlb_tbl_sync_single(hwdev, paddr, size, dir, SYNC_FOR_CPU); return; } - if (dir != DMA_FROM_DEVICE) - return; - - dma_mark_clean(phys_to_virt(paddr), size); -} - -void -swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr, - size_t size, enum dma_data_direction dir) -{ - swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_CPU); + if (dir == DMA_FROM_DEVICE) + dma_mark_clean(phys_to_virt(paddr), size); } EXPORT_SYMBOL(swiotlb_sync_single_for_cpu); @@ -855,7 +837,14 @@ void swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr, size_t size, enum dma_data_direction dir) { - swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_DEVICE); + phys_addr_t paddr = dma_to_phys(hwdev, dev_addr); + + BUG_ON(dir == DMA_NONE); + + if (!is_swiotlb_buffer(paddr)) + return; + + swiotlb_tbl_sync_single(hwdev, paddr, size, dir, SYNC_FOR_DEVICE); } EXPORT_SYMBOL(swiotlb_sync_single_for_device); @@ -929,10 +918,9 @@ swiotlb_unmap_sg_attrs(struct device *hwdev, struct 
scatterlist *sgl, struct scatterlist *sg; int i; - BUG_ON(dir == DMA_NONE); - for_each_sg(sgl, sg, nelems, i) - unmap_single(hwdev, sg->dma_address, sg_dma_len(sg), dir); + swiotlb_unmap_page(hwdev, sg->dma_address, sg_dma_len(sg), + dir, attrs); } EXPORT_SYMBOL(swiotlb_unmap_sg_attrs); @@ -952,32 +940,29 @@ EXPORT_SYMBOL(swiotlb_unmap_sg); * The same as swiotlb_sync_single_* but for a sc
[RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
This patch set is meant to be the guest side code for a proof of concept involving leaving pass-through devices in the guest during the warm-up phase of guest live migration. In order to accomplish this I have added a new function called dma_mark_dirty that will mark the pages associated with the DMA transaction as dirty in the case of either an unmap or a sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL. The pass-through device must still be removed before the stop-and-copy phase; however, allowing the device to be present should significantly improve the performance of the guest during the warm-up period.

This current implementation is very preliminary and there are a number of items still missing. Specifically, in order to make this a more complete solution we need to support:
1. Notifying the hypervisor that drivers are dirtying DMA pages received
2. Bypassing page dirtying when it is not needed.

The two mechanisms referenced above would likely require coordination with QEMU and as such are open to discussion. I haven't attempted to address them as I am not sure there is a consensus as of yet. My personal preference would be to add a vendor-specific configuration block to the emulated pci-bridge interfaces created by QEMU that would allow us to essentially extend shpc to support guest live migration with pass-through devices.

The functionality in this patch set is currently disabled by default. To enable it you can select "SWIOTLB page dirtying" from the "Processor type and features" menu.
---

Alexander Duyck (3):
      swiotlb: Fold static unmap and sync calls into calling functions
      xen/swiotlb: Fold static unmap and sync calls into calling functions
      x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

 arch/arm/include/asm/dma-mapping.h       |    3
 arch/arm64/include/asm/dma-mapping.h     |    5
 arch/ia64/include/asm/dma.h              |    1
 arch/mips/include/asm/dma-mapping.h      |    1
 arch/powerpc/include/asm/swiotlb.h       |    1
 arch/tile/include/asm/dma-mapping.h      |    1
 arch/unicore32/include/asm/dma-mapping.h |    1
 arch/x86/Kconfig                         |   11
 arch/x86/include/asm/swiotlb.h           |   26
 drivers/xen/swiotlb-xen.c                |   92 +-
 lib/swiotlb.c                            |   83 ---
 11 files changed, 123 insertions(+), 102 deletions(-)

--
[RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
This patch is meant to provide the guest with a way of flagging DMA pages as being dirty to the host when using a direct-assign device within a guest. The advantage to this approach is that it is fairly simple; however, it currently has a significant impact on device performance in all the scenarios where it won't be needed. As such this is really meant only as a proof of concept and to get the ball rolling in terms of figuring out how best to approach the issue of dirty page tracking for a guest that is using a direct assigned device. In addition, with just this patch it should be possible to modify current migration approaches so that instead of having to hot-remove the device before starting the migration this can instead be delayed until the period before the final stop and copy.

Signed-off-by: Alexander Duyck <adu...@mirantis.com> --- arch/arm/include/asm/dma-mapping.h |3 ++- arch/arm64/include/asm/dma-mapping.h |5 ++--- arch/ia64/include/asm/dma.h |1 + arch/mips/include/asm/dma-mapping.h |1 + arch/powerpc/include/asm/swiotlb.h |1 + arch/tile/include/asm/dma-mapping.h |1 + arch/unicore32/include/asm/dma-mapping.h |1 + arch/x86/Kconfig | 11 +++ arch/x86/include/asm/swiotlb.h | 26 ++ drivers/xen/swiotlb-xen.c|6 ++ lib/swiotlb.c|6 ++ 11 files changed, 58 insertions(+), 4 deletions(-) diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h index ccb3aa64640d..1962d7b471c7 100644 --- a/arch/arm/include/asm/dma-mapping.h +++ b/arch/arm/include/asm/dma-mapping.h @@ -167,7 +167,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size) return 1; } -static inline void dma_mark_clean(void *addr, size_t size) { } +static inline void dma_mark_clean(void *addr, size_t size) {} +static inline void dma_mark_dirty(void *addr, size_t size) {} extern int arm_dma_set_mask(struct device *dev, u64 dma_mask); diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h index 61e08f360e31..8d24fe11c8a3
100644 --- a/arch/arm64/include/asm/dma-mapping.h +++ b/arch/arm64/include/asm/dma-mapping.h @@ -84,9 +84,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size) return addr + size - 1 <= *dev->dma_mask; } -static inline void dma_mark_clean(void *addr, size_t size) -{ -} +static inline void dma_mark_clean(void *addr, size_t size) {} +static inline void dma_mark_dirty(void *addr, size_t size) {} #endif /* __KERNEL__ */ #endif /* __ASM_DMA_MAPPING_H */ diff --git a/arch/ia64/include/asm/dma.h b/arch/ia64/include/asm/dma.h index 4d97f60f1ef5..d92ebeb2758e 100644 --- a/arch/ia64/include/asm/dma.h +++ b/arch/ia64/include/asm/dma.h @@ -20,5 +20,6 @@ extern unsigned long MAX_DMA_ADDRESS; #define free_dma(x) void dma_mark_clean(void *addr, size_t size); +static inline void dma_mark_dirty(void *addr, size_t size) {} #endif /* _ASM_IA64_DMA_H */ diff --git a/arch/mips/include/asm/dma-mapping.h b/arch/mips/include/asm/dma-mapping.h index e604f760c4a0..567f6e03e337 100644 --- a/arch/mips/include/asm/dma-mapping.h +++ b/arch/mips/include/asm/dma-mapping.h @@ -28,6 +28,7 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size) } static inline void dma_mark_clean(void *addr, size_t size) {} +static inline void dma_mark_dirty(void *addr, size_t size) {} #include diff --git a/arch/powerpc/include/asm/swiotlb.h b/arch/powerpc/include/asm/swiotlb.h index de99d6e29430..b694e8399e28 100644 --- a/arch/powerpc/include/asm/swiotlb.h +++ b/arch/powerpc/include/asm/swiotlb.h @@ -16,6 +16,7 @@ extern struct dma_map_ops swiotlb_dma_ops; static inline void dma_mark_clean(void *addr, size_t size) {} +static inline void dma_mark_dirty(void *addr, size_t size) {} extern unsigned int ppc_swiotlb_enable; int __init swiotlb_setup_bus_notifier(void); diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h index 96ac6cce4a32..79953f09e938 100644 --- a/arch/tile/include/asm/dma-mapping.h +++ 
b/arch/tile/include/asm/dma-mapping.h @@ -58,6 +58,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr) } static inline void dma_mark_clean(void *addr, size_t size) {} +static inline void dma_mark_dirty(void *addr, size_t size) {} static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops) { diff --git a/arch/unicore32/include/asm/dma-mapping.h b/arch/unicore32/include/asm/dma-mapping.h index 8140e053ccd3..b9d357ab122d 100644 --- a/arch/unicore32/include/asm/dma-mapping.h +++ b/arch/unicore32/include/asm/dma-mapping.h @@ -49,6 +49,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr) } static inline void dma_mark_clean(void *addr,
Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
On Sun, Dec 13, 2015 at 9:22 PM, Yang Zhang <yang.zhang...@gmail.com> wrote: > On 2015/12/14 12:54, Alexander Duyck wrote: >> >> On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang <yang.zhang...@gmail.com> >> wrote: >>> >>> On 2015/12/14 5:28, Alexander Duyck wrote: >>>> >>>> >>>> This patch set is meant to be the guest side code for a proof of concept >>>> involving leaving pass-through devices in the guest during the warm-up >>>> phase of guest live migration. In order to accomplish this I have added >>>> a >>>> new function called dma_mark_dirty that will mark the pages associated >>>> with >>>> the DMA transaction as dirty in the case of either an unmap or a >>>> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or >>>> DMA_BIDIRECTIONAL. The pass-through device must still be removed before >>>> the stop-and-copy phase, however allowing the device to be present >>>> should >>>> significantly improve the performance of the guest during the warm-up >>>> period. >>>> >>>> This current implementation is very preliminary and there are number of >>>> items still missing. Specifically in order to make this a more complete >>>> solution we need to support: >>>> 1. Notifying hypervisor that drivers are dirtying DMA pages received >>>> 2. Bypassing page dirtying when it is not needed. >>>> >>> >>> Shouldn't current log dirty mechanism already cover them? >> >> >> The guest has no way of currently knowing that the hypervisor is doing >> dirty page logging, and the log dirty mechanism currently has no way >> of tracking device DMA accesses. This change is meant to bridge the >> two so that the guest device driver will force the SWIOTLB DMA API to >> mark pages written to by the device as dirty. > > > OK. This is what we called "dummy write mechanism". Actually, this is just a > workaround before iommu dirty bit ready. Eventually, we need to change to > use the hardware dirty bit. 
> Besides, we may still lost the data if dma happens during/just before stop and copy phase.

Right, this is a "dummy write mechanism" in order to allow for entry tracking. This only works completely if we force the hardware to quiesce via a hot-plug event before we reach the stop-and-copy phase of the migration.

The IOMMU dirty bit approach is likely going to have a significant number of challenges involved. Looking over the driver and the data sheet, it looks like the current implementation is using a form of huge pages in the IOMMU; as such we will need to tear that down and replace it with 4K pages if we don't want to dirty large regions with each DMA transaction, and I'm not sure that is something we can change while DMA is active to the affected regions. In addition, the data sheet references the fact that the page table entries are stored in a translation cache, and in order to sync things up you have to invalidate the entries. I'm not sure what the total overhead would be for invalidating something like a half million 4K pages to migrate a guest with just 2G of RAM, but I would think that might be a bit expensive given the fact that IOMMU accesses aren't known for being incredibly fast when invalidating DMA on the host.

- Alex
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Thu, Dec 10, 2015 at 8:11 AM, Michael S. Tsirkin wrote: > On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote: >> >> >> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: >> >>Ideally, it is able to leave guest driver unmodified but it requires the >> >>>hypervisor or qemu to aware the device which means we may need a driver in >> >>>hypervisor or qemu to handle the device on behalf of guest driver. >> >Can you answer the question of when do you use your code - >> >at the start of migration or >> >just before the end? >> >> Just before stopping VCPU in this version and inject VF mailbox irq to >> notify the driver if the irq handler is installed. >> Qemu side also will check this via the faked PCI migration capability >> and driver will set the status during device open() or resume() callback. > > Right, this is the "good path" optimization. Whether this buys anything > as compared to just sending reset to the device when VCPU is stopped > needs to be measured. In any case, we probably do need a way to > interrupt driver on destination to make it reconfigure the device - > otherwise it might take seconds for it to notice. And a way to make > sure driver can handle this surprise reset so we can block migration if > it can't.

The question is how do we handle the "bad path"? From what I can tell it seems like we would have to have the dirty page tracking for DMA handled in the host in order to support that. Otherwise we risk corrupting the memory in the guest as there are going to be a few stale pages that end up being in the guest. The easiest way to probably flag a "bad path" migration would be to emulate a Manually-operated Retention Latch being opened and closed on the device. It may even allow us to work with the desire to support a means for doing a pause/resume as that would be a hot-plug event where the latch was never actually opened.
Basically if the retention latch is released and then re-closed it can be assumed that the device has lost power and as a result been reset. As such a normal hot-plug controller would have to reconfigure the device in such an event. The key bit being that with the power being cycled on the port the assumption is that the device has lost any existing state, and we should emulate that as well by clearing any state Qemu might be carrying such as the shadow of the MSI-X table. In addition we could also signal if the host supports the dirty page tracking via the IOMMU so if needed the guest could trigger some sort of memory exception handling due to the risk of memory corruption. I would argue that we don't necessarily have to provide a means to guarantee the driver can support a surprise removal/reset. Worst case scenario is that it would be equivalent to somebody pulling the plug on an externally connected PCIe cage in a physical host. I know the Intel Ethernet drivers have already had to add support for surprise removal due to the fact that such a scenario can occur on Thunderbolt enabled platforms. Since it is acceptable for physical hosts to have such an event occur I think we could support the same type of failure for direct assigned devices in guests. That would be the one spot where I would say it is up to the drivers to figure out how they are going to deal with it since this is something that can occur for any given driver on any given OS assuming it can be plugged into an externally removable cage.

- Alex
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu wrote: > > > On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: >>> >>> Ideally, it is able to leave guest driver unmodified but it requires the >>> >hypervisor or qemu to aware the device which means we may need a driver >>> > in >>> >hypervisor or qemu to handle the device on behalf of guest driver. >> >> Can you answer the question of when do you use your code - >> at the start of migration or >> just before the end? > > > Just before stopping VCPU in this version and inject VF mailbox irq to > notify the driver if the irq handler is installed. > Qemu side also will check this via the faked PCI migration capability > and driver will set the status during device open() or resume() callback.

The VF mailbox interrupt is a very bad idea. Really the device should be in a reset state on the other side of a migration. It doesn't make sense to have the interrupt firing if the device is not configured. This is one of the things that is preventing you from being able to migrate the device while the interface is administratively down or the VF driver is not loaded.

My thought on all this is that it might make sense to move this functionality into a PCI-to-PCI bridge device and make it a requirement that all direct-assigned devices have to exist behind that device in order to support migration. That way you would be working with a directly emulated device that would likely already be supporting hot-plug anyway. Then it would just be a matter of coming up with a few Qemu specific extensions that you would need to add to the device itself. The same approach would likely be portable enough that you could achieve it with PCIe as well via the same configuration space being present on the upstream side of a PCIe port or maybe a PCIe switch of some sort.
It would then be possible to signal via your vendor-specific PCI capability on that device that all devices behind this bridge require DMA page dirtying, you could use the configuration in addition to the interrupt already provided for hot-plug to signal things like when you are starting migration, and possibly even just extend the shpc functionality so that if this capability is present you have the option to pause/resume instead of remove/probe the device in the case of certain hot-plug events. The fact is there may be some use for a pause/resume type approach for PCIe hot-plug in the near future anyway. From the sounds of it Apple has required it for all Thunderbolt device drivers so that they can halt the device in order to shuffle resources around, perhaps we should look at something similar for Linux. The other advantage behind grouping functions on one bridge is things like reset domains. The PCI error handling logic will want to be able to reset any devices that experienced an error in the event of something such as a surprise removal. By grouping all of the devices you could disable/reset/enable them as one logical group in the event of something such as the "bad path" approach Michael has mentioned. >> > >It would be great if we could avoid changing the guest; but at least > > your guest > >driver changes don't actually seem to be that hardware specific; > > could your > >changes actually be moved to generic PCI level so they could be made > >to work for lots of drivers? >>> >>> > >>> >It is impossible to use one common solution for all devices unless the >>> > PCIE >>> >spec documents it clearly and i think one day it will be there. But >>> > before >>> >that, we need some workarounds on guest driver to make it work even it >>> > looks >>> >ugly. > > > Yes, so far there is not hardware migration support and it's hard to modify > bus level code. It also will block implementation on the Windows. Please don't assume things. 
Unless you have hard data from Microsoft that says they want it this way let's just try to figure out what works best for us for now and then we can start worrying about third party implementations after we have figured out a solution that actually works. - Alex
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Wed, Dec 9, 2015 at 1:28 AM, Lan, Tianyu <tianyu@intel.com> wrote: > > > On 12/8/2015 1:12 AM, Alexander Duyck wrote: >> >> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu <tianyu@intel.com> wrote: >>> >>> On 12/5/2015 1:07 AM, Alexander Duyck wrote: >>>>> >>>>> >>>>> >>>>> We still need to support Windows guest for migration and this is why >>>>> our >>>>> patches keep all changes in the driver since it's impossible to change >>>>> Windows kernel. >>>> >>>> >>>> >>>> That is a poor argument. I highly doubt Microsoft is interested in >>>> having to modify all of the drivers that will support direct assignment >>>> in order to support migration. They would likely request something >>>> similar to what I have in that they will want a way to do DMA tracking >>>> with minimal modification required to the drivers. >>> >>> >>> >>> This totally depends on the NIC or other devices' vendors and they >>> should make decision to support migration or not. If yes, they would >>> modify driver. >> >> >> Having to modify every driver that wants to support live migration is >> a bit much. In addition I don't see this being limited only to NIC >> devices. You can direct assign a number of different devices, your >> solution cannot be specific to NICs. > > > We are also adding such migration support for QAT device and so our > solution will not just be limit to NIC. Now just is the beginning. Agreed, but still QAT is networking related. My advice would be to look at something else that works from within a different subsystem such as storage. All I am saying is that your solution is very networking centric. > We can't limit user to only use Linux guest. So the migration feature > should work for both Windows and Linux guest. Right now what your solution is doing is to limit things so that only the Intel NICs can support this since it will require driver modification across the board. 
Instead what I have proposed should make it so that once you have done the work there should be very little work that has to be done on your port to support any device. >> >>> If just target to call suspend/resume during migration, the feature will >>> be meaningless. Most cases don't want to affect user during migration >>> a lot and so the service down time is vital. Our target is to apply >>> SRIOV NIC passthough to cloud service and NFV(network functions >>> virtualization) projects which are sensitive to network performance >>> and stability. From my opinion, We should give a change for device >>> driver to implement itself migration job. Call suspend and resume >>> callback in the driver if it doesn't care the performance during >>> migration. >> >> >> The suspend/resume callback should be efficient in terms of time. >> After all we don't want the system to stall for a long period of time >> when it should be either running or asleep. Having it burn cycles in >> a power state limbo doesn't do anyone any good. If nothing else maybe >> it will help to push the vendors to speed up those functions which >> then benefit migration and the system sleep states. > > > If we can benefit both migration and suspend, that would be wonderful. > But migration and system pm is still different. Just for example, > driver doesn't need to put device into deep D-status during migration > and host can do this after migration while it's essential for > system sleep. PCI configure space and interrupt config is emulated by > Qemu and Qemu can migrate these configures to new machine. Driver > doesn't need to deal with such thing. So I think migration still needs a > different callback or different code path than device suspend/resume. SR-IOV devices are considered to be in D3 as soon as you clear the bus master enable bit. They don't actually have a PCIe power management block in their configuration space. 
The advantage of the suspend/resume approach is that the D0->D3->D0 series of transitions should trigger a PCIe reset on the device. As such the resume call is capable of fully reinitializing a device. As far as migrating the interrupts themselves, moving live interrupts is problematic. You are more likely to throw them out of sync since the state of the device will not match the state of what you migrated for things like the pending bit array, so if there is a device that actually depends on those bits you might run into issues. > Another concern is that we have to rework PM core ore PCI bus driver &
Re: live migration vs device assignment (motivation)
On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu wrote: > For other kind of devices, it's hard to work. > We are also adding migration support for QAT(QuickAssist Technology) device. > > QAT device user case introduction. > Server, networking, big data, and storage applications use QuickAssist > Technology to offload servers from handling compute-intensive operations, > such as: > 1) Symmetric cryptography functions including cipher operations and > authentication operations > 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve > cryptography > 3) Compression and decompression functions including DEFLATE and LZS > > PCI hotplug will not work for such devices during migration and these > operations will fail when unplug device. I assume the problem is that with a PCI hotplug event you are losing the state information for the device, do I have that right? Looking over the QAT drivers it doesn't seem like any of them support the suspend/resume PM calls. I would imagine it makes it difficult for a system with a QAT card in it to be able to drop the system to a low power state. You might want to try enabling suspend/resume support for the devices on bare metal before you attempt to take on migration as it would provide you with a good testing framework to see what needs to be saved/restored within the device and in what order before you attempt to do the same while migrating from one system to another. - Alex
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu <tianyu@intel.com> wrote: > On 12/5/2015 1:07 AM, Alexander Duyck wrote: >>> >>> >>> We still need to support Windows guest for migration and this is why our >>> patches keep all changes in the driver since it's impossible to change >>> Windows kernel. >> >> >> That is a poor argument. I highly doubt Microsoft is interested in >> having to modify all of the drivers that will support direct assignment >> in order to support migration. They would likely request something >> similar to what I have in that they will want a way to do DMA tracking >> with minimal modification required to the drivers. > > > This totally depends on the NIC or other devices' vendors and they > should make decision to support migration or not. If yes, they would > modify driver. Having to modify every driver that wants to support live migration is a bit much. In addition I don't see this being limited only to NIC devices. You can direct assign a number of different devices, your solution cannot be specific to NICs. > If just target to call suspend/resume during migration, the feature will > be meaningless. Most cases don't want to affect user during migration > a lot and so the service down time is vital. Our target is to apply > SRIOV NIC passthough to cloud service and NFV(network functions > virtualization) projects which are sensitive to network performance > and stability. From my opinion, We should give a change for device > driver to implement itself migration job. Call suspend and resume > callback in the driver if it doesn't care the performance during migration. The suspend/resume callback should be efficient in terms of time. After all we don't want the system to stall for a long period of time when it should be either running or asleep. Having it burn cycles in a power state limbo doesn't do anyone any good. 
If nothing else maybe it will help to push the vendors to speed up those functions which then benefit migration and the system sleep states. Also you keep assuming you can keep the device running while you do the migration and you can't. You are going to corrupt the memory if you do, and you have yet to provide any means to explain how you are going to solve that. > >> >>> Following is my idea to do DMA tracking. >>> >>> Inject event to VF driver after memory iterate stage >>> and before stop VCPU and then VF driver marks dirty all >>> using DMA memory. The new allocated pages also need to >>> be marked dirty before stopping VCPU. All dirty memory >>> in this time slot will be migrated until stop-and-copy >>> stage. We also need to make sure to disable VF via clearing the >>> bus master enable bit for VF before migrating these memory. >> >> >> The ordering of your explanation here doesn't quite work. What needs to >> happen is that you have to disable DMA and then mark the pages as dirty. >> What the disabling of the BME does is signal to the hypervisor that >> the device is now stopped. The ixgbevf_suspend call already supported >> by the driver is almost exactly what is needed to take care of something >> like this. > > > This is why I hope to reserve a piece of space in the dma page to do dummy > write. This can help to mark page dirty while not require to stop DMA and > not race with DMA data. You can't and it will still race. What concerns me is that your patches and the document you referenced earlier show a considerable lack of understanding about how DMA and device drivers work. There is a reason why device drivers have so many memory barriers and the like in them. The fact is when you have CPU and a device both accessing memory things have to be done in a very specific order and you cannot violate that. If you have a contiguous block of memory you expect the device to write into you cannot just poke a hole in it. 
Such a situation is not supported by any hardware that I am aware of. As far as writing to dirty the pages it only works so long as you halt the DMA and then mark the pages dirty. It has to be in that order. Any other order will result in data corruption and I am sure the NFV customers definitely don't want that. > If can't do that, we have to stop DMA in a short time to mark all dma > pages dirty and then reenable it. I am not sure how much we can get by > this way to track all DMA memory with device running during migration. I > need to do some tests and compare results with stop DMA diretly at last > stage during migration. We have to halt the DMA before we can complete the migration. So please feel free to test this. In addition I still feel you would be better off taking this in smaller steps. I still say your fir
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Mon, Dec 7, 2015 at 9:39 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Mon, Dec 07, 2015 at 09:12:08AM -0800, Alexander Duyck wrote: >> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu <tianyu@intel.com> wrote: >> > On 12/5/2015 1:07 AM, Alexander Duyck wrote: >> > If can't do that, we have to stop DMA in a short time to mark all dma >> > pages dirty and then reenable it. I am not sure how much we can get by >> > this way to track all DMA memory with device running during migration. I >> > need to do some tests and compare results with stop DMA diretly at last >> > stage during migration. >> >> We have to halt the DMA before we can complete the migration. So >> please feel free to test this. >> >> In addition I still feel you would be better off taking this in >> smaller steps. I still say your first step would be to come up with a >> generic solution for the dirty page tracking like the dma_mark_clean() >> approach I had mentioned earlier. If I get time I might try to take >> care of it myself later this week since you don't seem to agree with >> that approach. > > Or even try to look at the dirty bit in the VT-D PTEs > on the host. See the mail I have just sent. > Might be slower, or might be faster, but is completely > transparent. I just saw it and I am looking over the VTd spec now. It looks like there might be some performance impacts if software is changing the PTEs since then the VTd hardware cannot cache them. I still have to do some more reading though so I can fully understand the impacts. >> >> >> >> The question is how we would go about triggering it. I really don't >> >> think the PCI configuration space approach is the right idea. >> >> I wonder >> >> if we couldn't get away with some sort of ACPI event instead. We >> >> already require ACPI support in order to shut down the system >> >> gracefully, I wonder if we couldn't get away with something similar in >> >> order to suspend/resume the direct assigned devices gracefully. 
>> >> >> > >> > I don't think there is such events in the current spec. >> > Otherwise, There are two kinds of suspend/resume callbacks. >> > 1) System suspend/resume called during S2RAM and S2DISK. >> > 2) Runtime suspend/resume called by pm core when device is idle. >> > If you want to do what you mentioned, you have to change PM core and >> > ACPI spec. >> >> The thought I had was to somehow try to move the direct assigned >> devices into their own power domain and then simulate an AC power event >> where that domain is switched off. However I don't know if there are >> ACPI events to support that since the power domain code currently only >> appears to be in use for runtime power management. >> >> That had also given me the thought to look at something like runtime >> power management for the VFs. We would need to do a runtime >> suspend/resume. The only problem is I don't know if there is any way >> to get the VFs to do a quick wakeup. It might be worthwhile looking >> at trying to check with the ACPI experts out there to see if there is >> anything we can do as bypassing having to use the configuration space >> mechanism to signal this would definitely be worth it. > > I don't much like this idea because it relies on the > device being exactly the same across source/destination. > After all, this is always true for suspend/resume. > Most users do not have control over this, and you would > often get slightly different versions of firmware, > etc without noticing. The original code was operating on that assumption as well. That is kind of why I suggested suspend/resume rather than reinventing the wheel. > I think we should first see how far along we can get > by doing a full device reset, and only carrying over > high level state such as IP, MAC, ARP cache etc. One advantage of the suspend/resume approach is that it is compatible with a full reset. 
The suspend/resume approach assumes the device goes through a D0->D3->D0 reset as a part of transitioning between the system states. I do admit though that the PCI spec says you aren't supposed to be hot-swapping devices while the system is in a sleep state so odds are you would encounter issues if the device changed in any significant way. >> >>> The dma page allocated by VF driver also needs to reserve space >> >>> to do dummy write. >> >> >> >> >> >> No, this will not work. If for example you have a VF driver allocating >> >> memory for a 9K receive how will that work? It i
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/04/2015 08:32 AM, Lan, Tianyu wrote: Hi Michael & Alexander: Thanks a lot for your comments and suggestions. We still need to support Windows guest for migration and this is why our patches keep all changes in the driver since it's impossible to change Windows kernel. That is a poor argument. I highly doubt Microsoft is interested in having to modify all of the drivers that will support direct assignment in order to support migration. They would likely request something similar to what I have in that they will want a way to do DMA tracking with minimal modification required to the drivers. Following is my idea to do DMA tracking. Inject event to VF driver after memory iterate stage and before stop VCPU and then VF driver marks dirty all using DMA memory. The new allocated pages also need to be marked dirty before stopping VCPU. All dirty memory in this time slot will be migrated until stop-and-copy stage. We also need to make sure to disable VF via clearing the bus master enable bit for VF before migrating these memory. The ordering of your explanation here doesn't quite work. What needs to happen is that you have to disable DMA and then mark the pages as dirty. What the disabling of the BME does is signal to the hypervisor that the device is now stopped. The ixgbevf_suspend call already supported by the driver is almost exactly what is needed to take care of something like this. The question is how we would go about triggering it. I really don't think the PCI configuration space approach is the right idea. I wonder if we couldn't get away with some sort of ACPI event instead. We already require ACPI support in order to shut down the system gracefully, I wonder if we couldn't get away with something similar in order to suspend/resume the direct assigned devices gracefully. The dma page allocated by VF driver also needs to reserve space to do dummy write. No, this will not work. 
If for example you have a VF driver allocating memory for a 9K receive how will that work? It isn't as if you can poke a hole in the contiguous memory.
Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
On Wed, Dec 2, 2015 at 6:08 AM, Lan, Tianyu wrote: > On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote: >>> >>> But >>> it requires guest OS to do specific configurations inside and rely on >>> bonding driver which blocks it work on Windows. >>> From performance side, >>> putting VF and virtio NIC under bonded interface will affect their >>> performance even when not do migration. These factors block to use VF >>> NIC passthough in some user cases(Especially in the cloud) which require >>> migration. >> >> That's really up to guest. You don't need to do bonding, >> you can just move the IP and mac from userspace, that's >> possible on most OS-es. >> >> Or write something in guest kernel that is more lightweight if you are >> so inclined. What we are discussing here is the host-guest interface, >> not the in-guest interface. >> >>> Current solution we proposed changes NIC driver and Qemu. Guest Os >>> doesn't need to do special thing for migration. >>> It's easy to deploy >> >> >> >> Except of course these patches don't even work properly yet. >> >> And when they do, even minor changes in host side NIC hardware across >> migration will break guests in hard to predict ways. > > > Switching between PV and VF NIC will introduce network stop and the > latency of hotplug VF is measurable. For some user cases(cloud service > and OPNFV) which are sensitive to network stabilization and performance, > these are not friend and blocks SRIOV NIC usage in these case. We hope > to find a better way to make SRIOV NIC work in these cases and this is > worth to do since SRIOV NIC provides better network performance compared > with PV NIC. Current patches have some issues. I think we can find > solution for them and improve them step by step. I still believe the concepts being put into use here are deeply flawed. You are assuming you can somehow complete the migration while the device is active and I seriously doubt that is the case. 
You are going to cause data corruption or worse cause a kernel panic when you end up corrupting the guest memory. You have to halt the device at some point in order to complete the migration. Now I fully agree it is best to do this for as small a window as possible. I really think that your best approach would be to embrace and extend the current solution that is making use of bonding. The first step being to make it so that you don't have to hot-plug the VF until just before you halt the guest instead of before you start the migration. Just doing that would yield a significant gain in terms of performance during the migration. In addition something like that should be able to be done without having to be overly invasive into the drivers. A few tweaks to the DMA API and you could probably have that resolved. As far as avoiding the hot-plug itself that would be better handled as a separate follow-up, and really belongs more to the PCI layer than the NIC device drivers. The device drivers should already have code for handling a suspend/resume due to a power cycle event. If you could make use of that then it is just a matter of implementing something in the hot-plug or PCIe drivers that would allow QEMU to signal when the device needs to go into D3 and when it can resume normal operation at D0. You could probably use the PCI Bus Master Enable bit as the test on if the device is ready for migration or not. If the bit is set you cannot migrate the VM, and if it is cleared then you are ready to migrate.
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Tue, Dec 01, 2015 at 11:04:31PM +0800, Lan, Tianyu wrote: >> >> >> On 12/1/2015 12:07 AM, Alexander Duyck wrote: >> >They can only be corrected if the underlying assumptions are correct >> >and they aren't. Your solution would have never worked correctly. >> >The problem is you assume you can keep the device running when you are >> >migrating and you simply cannot. At some point you will always have >> >to stop the device in order to complete the migration, and you cannot >> >stop it before you have stopped your page tracking mechanism. So >> >unless the platform has an IOMMU that is somehow taking part in the >> >dirty page tracking you will not be able to stop the guest and then >> >the device, it will have to be the device and then the guest. >> > >> >>>Doing suspend and resume() may help to do migration easily but some >> >>>devices requires low service down time. Especially network and I got >> >>>that some cloud company promised less than 500ms network service downtime. >> >Honestly focusing on the downtime is getting the cart ahead of the >> >horse. First you need to be able to do this without corrupting system >> >memory and regardless of the state of the device. You haven't even >> >gotten to that state yet. Last I knew the device had to be up in >> >order for your migration to even work. >> >> I think the issue is that the content of rx package delivered to stack maybe >> changed during migration because the piece of memory won't be migrated to >> new machine. This may confuse applications or stack. Current dummy write >> solution can ensure the content of package won't change after doing dummy >> write while the content maybe not received data if migration happens before >> that point. We can recheck the content via checksum or crc in the protocol >> after dummy write to ensure the content is what VF received. 
I think stack >> has already done such checks and the package will be abandoned if failed to >> pass through the check. > > > Most people nowdays rely on hardware checksums so I don't think this can > fly. Correct. The checksum/crc approach will not work since it is possible for a checksum to even be mangled in the case of some features such as LRO or GRO. >> Another way is to tell all memory driver are using to Qemu and let Qemu to >> migrate these memory after stopping VCPU and the device. This seems safe but >> implementation maybe complex. > > Not really 100% safe. See below. > > I think hiding these details behind dma_* API does have > some appeal. In any case, it gives us a good > terminology as it covers what most drivers do. That was kind of my thought. If we were to build our own dma_mark_clean() type function that will mark the DMA region dirty on sync or unmap then that is half the battle right there as we would be able to at least keep the regions consistent after they have left the driver. > There are several components to this: > - dma_map_* needs to prevent page from > being migrated while device is running. > For example, expose some kind of bitmap from guest > to host, set bit there while page is mapped. > What happens if we stop the guest and some > bits are still set? See dma_alloc_coherent below > for some ideas. Yeah, I could see something like this working. Maybe we could do something like what was done for the NX bit and make use of the upper order bits beyond the limits of the memory range to mark pages as non-migratable? I'm curious. What we have with a DMA mapped region is essentially shared memory between the guest and the device. How would we resolve something like this with IVSHMEM, or are we blocked there as well in terms of migration? > - dma_unmap_* needs to mark page as dirty > This can be done by writing into a page. > > - dma_sync_* needs to mark page as dirty > This is trickier as we can not change the data. 
> One solution is using atomics. > For example: > int x = ACCESS_ONCE(*p); > cmpxchg(p, x, x); > Seems to do a write without changing page > contents. Like I said we can probably kill 2 birds with one stone by just implementing our own dma_mark_clean() for x86 virtualized environments. I'd say we could take your solution one step further and just use 0 instead of bothering to read the value. After all it won't write the area if the value at the offset is not 0. The only downside is that this is a locked operation so we will take a pretty serious performance penalty when this is active. As such my preference would
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote: >> On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin <m...@redhat.com> wrote: >> > There are several components to this: >> > - dma_map_* needs to prevent page from >> > being migrated while device is running. >> > For example, expose some kind of bitmap from guest >> > to host, set bit there while page is mapped. >> > What happens if we stop the guest and some >> > bits are still set? See dma_alloc_coherent below >> > for some ideas. >> >> Yeah, I could see something like this working. Maybe we could do >> something like what was done for the NX bit and make use of the upper >> order bits beyond the limits of the memory range to mark pages as >> non-migratable? >> >> I'm curious. What we have with a DMA mapped region is essentially >> shared memory between the guest and the device. How would we resolve >> something like this with IVSHMEM, or are we blocked there as well in >> terms of migration? > > I have some ideas. Will post later. I look forward to it. >> > - dma_unmap_* needs to mark page as dirty >> > This can be done by writing into a page. >> > >> > - dma_sync_* needs to mark page as dirty >> > This is trickier as we can not change the data. >> > One solution is using atomics. >> > For example: >> > int x = ACCESS_ONCE(*p); >> > cmpxchg(p, x, x); >> > Seems to do a write without changing page >> > contents. >> >> Like I said we can probably kill 2 birds with one stone by just >> implementing our own dma_mark_clean() for x86 virtualized >> environments. >> >> I'd say we could take your solution one step further and just use 0 >> instead of bothering to read the value. After all it won't write the >> area if the value at the offset is not 0. > > Really almost any atomic that has no side effect will do. > atomic or with 0 > atomic and with > > It's just that cmpxchg already happens to have a portable > wrapper. 
I was originally thinking maybe an atomic_add with 0 would be the way to go. Either way though we still are using a locked prefix and having to dirty a cache line per page which is going to come at some cost. >> > - dma_alloc_coherent memory (e.g. device rings) >> > must be migrated after device stopped modifying it. >> > Just stopping the VCPU is not enough: >> > you must make sure device is not changing it. >> > >> > Or maybe the device has some kind of ring flush operation, >> > if there was a reasonably portable way to do this >> > (e.g. a flush capability could maybe be added to SRIOV) >> > then hypervisor could do this. >> >> This is where things start to get messy. I was suggesting the >> suspend/resume to resolve this bit, but it might be possible to also >> deal with this via something like this via clearing the bus master >> enable bit for the VF. If I am not mistaken that should disable MSI-X >> interrupts and halt any DMA. That should work as long as you have >> some mechanism that is tracking the pages in use for DMA. > > A bigger issue is recovering afterwards. Agreed. >> > In case you need to resume on source, you >> > really need to follow the same path >> > as on destination, preferably detecting >> > device reset and restoring the device >> > state. >> >> The problem with detecting the reset is that you would likely have to >> be polling to do something like that. > > We could some event to guest to notify it about this event > through a new or existing channel. > > Or we could make it possible for userspace to trigger this, > then notify guest through the guest agent. The first thing that comes to mind would be to use something like PCIe Advanced Error Reporting, however I don't know if we can put a requirement on the system supporting the q35 machine type or not in order to support migration. 
>> I believe the fm10k driver >> already has code like that in place where it will detect a reset as a >> part of its watchdog, however the response time is something like 2 >> seconds for that. That was one of the reasons I preferred something >> like hot-plug as that should be functioning as soon as the guest is up >> and it is a mechanism that operates outside of the VF drivers. > > That's pretty minor. > A bigger issue is making sure guest does not crash > when device is suddenly reset under its legs. I know the ixgbevf driver should already have logic to address some of that. If you look through the code there should be logic there for surprise removal support in ixgbevf. The only issue is that unlike fm10k it will not restore itself after a resume or slot_reset call. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Sun, Nov 29, 2015 at 10:53 PM, Lan, Tianyu <tianyu@intel.com> wrote: > On 11/26/2015 11:56 AM, Alexander Duyck wrote: >> >> > I am not saying you cannot modify the drivers, however what you are >> doing is far too invasive. Do you seriously plan on modifying all of >> the PCI device drivers out there in order to allow any device that >> might be direct assigned to a port to support migration? I certainly >> hope not. That is why I have said that this solution will not scale. > > > Current drivers are not migration friendly. If the driver wants to > support migration, it's necessary to be changed. Modifying all of the drivers directly will not solve the issue though. This is why I have suggested looking at possibly implementing something like dma_mark_clean() which is used for ia64 architectures to mark pages that were DMAed in as clean. In your case though you would want to mark such pages as dirty so that the page migration will notice them and move them over. > RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and > DMA tracking during migration. These are common for most drivers and > they may be problematic in the previous version but can be corrected later. They can only be corrected if the underlying assumptions are correct and they aren't. Your solution would have never worked correctly. The problem is you assume you can keep the device running when you are migrating and you simply cannot. At some point you will always have to stop the device in order to complete the migration, and you cannot stop it before you have stopped your page tracking mechanism. So unless the platform has an IOMMU that is somehow taking part in the dirty page tracking you will not be able to stop the guest and then the device, it will have to be the device and then the guest. > Doing suspend and resume() may help to do migration easily but some > devices requires low service down time. 
Especially network and I got > that some cloud company promised less than 500ms network service downtime. Honestly focusing on the downtime is getting the cart ahead of the horse. First you need to be able to do this without corrupting system memory and regardless of the state of the device. You haven't even gotten to that state yet. Last I knew the device had to be up in order for your migration to even work. Many devices are very state driven. As such you cannot just freeze them and restore them like you would regular device memory. That is where something like suspend/resume comes in because it already takes care of getting the device ready for halt, and then resume. Keep in mind that those functions were meant to function on a device doing something like a suspend to RAM or disk. This is not too far off from what a migration is doing since you need to halt the guest before you move it. As such the first step is to make it so that we can do the current bonding approach with one change. Specifically we want to leave the device in the guest until the last portion of the migration instead of having to remove it first. To that end I would suggest focusing on solving the DMA problem via something like a dma_mark_clean() type solution as that would be one issue resolved and we all would see an immediate gain instead of just those users of the ixgbevf driver. > So I think performance effect also should be taken into account when we > design the framework. What you are proposing I would call premature optimization. You need to actually solve the problem before you can start optimizing things and I don't see anything actually solved yet since your solution is too unstable. >> >> What I am counter proposing seems like a very simple proposition. It >> can be implemented in two steps. >> >> 1. Look at modifying dma_mark_clean(). It is a function called in >> the sync and unmap paths of the lib/swiotlb.c. 
If you could somehow >> modify it to take care of marking the pages you unmap for Rx as being >> dirty it will get you a good way towards your goal as it will allow >> you to continue to do DMA while you are migrating the VM. >> >> 2. Look at making use of the existing PCI suspend/resume calls that >> are there to support PCI power management. They have everything >> needed to allow you to pause and resume DMA for the device before and >> after the migration while retaining the driver state. If you can >> implement something that allows you to trigger these calls from the >> PCI subsystem such as hot-plug then you would have a generic solution >> that can be easily reproduced for multiple drivers beyond those >> supported by ixgbevf. > > > Glanced at PCI hotplug code. The hotplug events are triggered by PCI hotplug > controller and these event are defined in the controller spec. > It's ha
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu <tianyu@intel.com> wrote: > On 2015年11月25日 13:30, Alexander Duyck wrote: >> No, what I am getting at is that you can't go around and modify the >> configuration space for every possible device out there. This >> solution won't scale. > > > PCI config space regs are emulated by Qemu and so we can find the free > PCI config space regs for the faked PCI capability. Its position can be > not permanent. Yes, but do you really want to edit every driver on every OS that you plan to support this on? What about things like direct assignment of regular Ethernet ports? What you really need is a solution that will work generically on any existing piece of hardware out there. >> If you instead moved the logic for notifying >> the device into a separate mechanism such as making it a part of the >> hot-plug logic then you only have to write the code once per OS in >> order to get the hot-plug capability to pause/resume the device. What >> I am talking about is not full hot-plug, but rather to extend the >> existing hot-plug in Qemu and the Linux kernel to support a >> "pause/resume" functionality. The PCI hot-plug specification calls >> out the option of implementing something like this, but we don't >> currently have support for it. >> > > Could you elaborate the part of PCI hot-plug specification you mentioned? > > My concern is whether it needs to change PCI spec or not. In the PCI Hot-Plug Specification 1.1, in section 4.1.2 it states: In addition to quiescing add-in card activity, an operating-system vendor may optionally implement a less drastic “pause” capability, in anticipation of the same or a similar add-in card being reinserted. The idea I had was basically if we were to implement something like that in Linux then we could pause/resume the device instead of outright removing it. The pause functionality could make use of the suspend/resume functionality most drivers already have for PCI power management. 
- Alex
Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver
On Wed, Nov 25, 2015 at 8:02 AM, Lan, Tianyu wrote: > On 11/25/2015 8:28 PM, Michael S. Tsirkin wrote: >> >> Frankly, I don't really see what this short term hack buys us, >> and if it goes in, we'll have to maintain it forever. >> > > The framework of how to notify VF about migration status won't be > changed regardless of stopping VF or not before doing migration. > We hope to reach agreement on this first. Tracking dirty memory still > needs more discussion and we will continue working on it. Stop VF may > help to work around the issue and make tracking easier. The problem is you still have to stop the device at some point for the same reason why you have to halt the VM. You seem to think you can get by without doing that but you can't. All you do is open the system up to multiple races if you leave the device running. The goal should be to avoid stopping the device until the last possible moment, however it will still have to be stopped eventually. It isn't as if you can migrate memory and leave the device doing DMA and expect to get a clean state. I agree with Michael. The focus needs to be on first addressing dirty page tracking. Once you have that you could use a variation on the bonding solution where you postpone the hot-plug event until near the end of the migration just before you halt the guest instead of having to do it before you start the migration. Then after that we could look at optimizing things further by introducing a variation of hot-plug that would pause the device as I suggested instead of removing it. At that point you should be able to have almost all of the key issues addressed so that you could drop the bond interface entirely. >> Also, assuming you just want to do ifdown/ifup for some reason, it's >> easy enough to do using a guest agent, in a completely generic way. >> > > Just ifdown/ifup is not enough for migration. 
It needs to restore some PCI > settings before doing ifup on the target machine That is why I have been suggesting making use of suspend/resume logic that is already in place for PCI power management. In the case of a suspend/resume we already have to deal with the fact that the device will go through a D0->D3->D0 reset so we have to restore all of the existing state. It would take a significant load off of Qemu since the guest would be restoring its own state instead of making Qemu have to do all of the device migration work.
Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver
On Wed, Nov 25, 2015 at 8:39 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Wed, Nov 25, 2015 at 08:24:38AM -0800, Alexander Duyck wrote: >> >> Also, assuming you just want to do ifdown/ifup for some reason, it's >> >> easy enough to do using a guest agent, in a completely generic way. >> >> >> > >> > Just ifdown/ifup is not enough for migration. It needs to restore some PCI >> > settings before doing ifup on the target machine >> >> That is why I have been suggesting making use of suspend/resume logic >> that is already in place for PCI power management. In the case of a >> suspend/resume we already have to deal with the fact that the device >> will go through a D0->D3->D0 reset so we have to restore all of the >> existing state. It would take a significant load off of Qemu since >> the guest would be restoring its own state instead of making Qemu have >> to do all of the device migration work. > > That can work, though again, the issue is you need guest > cooperation to migrate. Right now the problem is you need to have guest cooperation anyway as you need to have some way of tracking the dirty pages. If the IOMMU on the host were to provide some sort of dirty page tracking then we could exclude the guest from the equation, but until then we need the guest to notify us of what pages it is letting the device dirty. I'm still of the opinion that the best way to go there is to just modify the DMA API that is used in the guest so that it supports some sort of page flag modification or something along those lines so we can track all of the pages that might be written to by the device. > If you reset device on destination instead of restoring state, > then that issue goes away, but maybe the downtime > will be increased. Yes, the downtime will be increased, but it shouldn't be by much. Depending on the setup a VF with a single queue can have about 3MB of data outstanding when you move the driver over. 
After that it is just a matter of bringing the interface back up which should take only a few hundred milliseconds assuming the PF is fairly responsive. > Will it really? I think it's worth it to start with the > simplest solution (reset on destination) and see > what the effect is, then add optimizations. Agreed. My thought would be to start with something like dma_mark_clean() that could be used to take care of marking the pages for migration when they are unmapped or synced. > One thing that I've been thinking about for a while, is saving (some) > state speculatively. For example, notify guest a bit before migration > is done, so it can save device state. If guest responds quickly, you > have state that can be restored. If it doesn't, still migrate, and it > will have to reset on destination. I'm not sure how much more device state we really need to save. The driver in the guest has to have enough state to recover in the event of a device failure resulting in a slot reset. To top it off the driver is able to reconfigure things probably as quick as we could if we were restoring the state.
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Wed, Nov 25, 2015 at 7:15 PM, Dong, Eddie <eddie.d...@intel.com> wrote: >> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu <tianyu@intel.com> wrote: >> > On 2015年11月25日 13:30, Alexander Duyck wrote: >> >> No, what I am getting at is that you can't go around and modify the >> >> configuration space for every possible device out there. This >> >> solution won't scale. >> > >> > >> > PCI config space regs are emulation by Qemu and so We can find the >> > free PCI config space regs for the faked PCI capability. Its position >> > can be not permanent. >> >> Yes, but do you really want to edit every driver on every OS that you plan to >> support this on. What about things like direct assignment of regular >> Ethernet >> ports? What you really need is a solution that will work generically on any >> existing piece of hardware out there. > > The fundamental assumption of this patch series is to modify the driver in > guest to self-emulate or track the device state, so that the migration may be > possible. > I don't think we can modify OS, without modifying the drivers, even using the > PCIe hotplug mechanism. > In the meantime, modifying Windows OS is a big challenge given that only > Microsoft can do. While, modifying driver is relatively simple and manageable > to device vendors, if the device vendor want to support state-clone based > migration. The problem is the code you are presenting, even as a proof of concept is seriously flawed. It does a poor job of exposing how any of this can be duplicated for any other VF other than the one you are working on. I am not saying you cannot modify the drivers, however what you are doing is far too invasive. Do you seriously plan on modifying all of the PCI device drivers out there in order to allow any device that might be direct assigned to a port to support migration? I certainly hope not. That is why I have said that this solution will not scale. What I am counter proposing seems like a very simple proposition. 
It can be implemented in two steps. 1. Look at modifying dma_mark_clean(). It is a function called in the sync and unmap paths of the lib/swiotlb.c. If you could somehow modify it to take care of marking the pages you unmap for Rx as being dirty it will get you a good way towards your goal as it will allow you to continue to do DMA while you are migrating the VM. 2. Look at making use of the existing PCI suspend/resume calls that are there to support PCI power management. They have everything needed to allow you to pause and resume DMA for the device before and after the migration while retaining the driver state. If you can implement something that allows you to trigger these calls from the PCI subsystem such as hot-plug then you would have a generic solution that can be easily reproduced for multiple drivers beyond those supported by ixgbevf. Thanks. - Alex
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Tue, Nov 24, 2015 at 7:18 PM, Lan Tianyu <tianyu@intel.com> wrote: > On 2015年11月24日 22:20, Alexander Duyck wrote: >> I'm still not a fan of this approach. I really feel like this is >> something that should be resolved by extending the existing PCI hot-plug >> rather than trying to instrument this per driver. Then you will get the >> goodness for multiple drivers and multiple OSes instead of just one. An >> added advantage to dealing with this in the PCI hot-plug environment >> would be that you could then still do a hot-plug even if the guest >> didn't load a driver for the VF since you would be working with the PCI >> slot instead of the device itself. >> >> - Alex > > Hi Alex: > What you mentioned seems like the bonding driver solution. > Paper "Live Migration with Pass-through Device for Linux VM" describes > it. It does VF hotplug during migration. In order to maintain Network > connection when VF is out, it takes advantage of Linux bonding driver to > switch between VF NIC and emulated NIC. But the side effects are that it > requires the VM to do additional configuration and the performance while > switching between the two NICs is not good. No, what I am getting at is that you can't go around and modify the configuration space for every possible device out there. This solution won't scale. If you instead moved the logic for notifying the device into a separate mechanism such as making it a part of the hot-plug logic then you only have to write the code once per OS in order to get the hot-plug capability to pause/resume the device. What I am talking about is not full hot-plug, but rather to extend the existing hot-plug in Qemu and the Linux kernel to support a "pause/resume" functionality. The PCI hot-plug specification calls out the option of implementing something like this, but we don't currently have support for it. 
I just feel doing it through PCI hot-plug messages will scale much better as you could likely make use of the power management suspend/resume calls to take care of most of the needed implementation details. - Alex
Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver
On Tue, Nov 24, 2015 at 1:20 PM, Michael S. Tsirkin wrote: > On Tue, Nov 24, 2015 at 09:38:18PM +0800, Lan Tianyu wrote: >> This patch is to add migration support for ixgbevf driver. Using >> faked PCI migration capability table communicates with Qemu to >> share migration status and mailbox irq vector index. >> >> Qemu will notify VF via sending MSIX msg to trigger mailbox >> vector during migration and store migration status in the >> PCI_VF_MIGRATION_VMM_STATUS regs in the new capability table. >> The mailbox irq will be triggered just before stop-and-copy stage >> and after migration on the target machine. >> >> VF driver will put down net when it detects migration and tell >> Qemu it's ready for migration via writing PCI_VF_MIGRATION_VF_STATUS >> reg. After migration, put up net again. >> >> Qemu will be in charge of migrating PCI config space regs and MSIX config. >> >> The patch is to dedicate on the normal case that net traffic works >> when mailbox irq is enabled. For other cases (such as the driver >> isn't loaded, adapter is suspended or closed), mailbox irq won't be >> triggered and VF driver will disable it via PCI_VF_MIGRATION_CAP >> reg. These cases will be resolved later. >> >> Signed-off-by: Lan Tianyu > > I have to say, I was much more interested in the idea > of tracking dirty memory. I have some thoughts about > that one - did you give up on it then? The tracking of dirty pages still needs to be addressed unless the interface is being downed before migration even starts which based on other comments I am assuming is not the case. I still feel that having a means of marking a page as being dirty when it is unmapped would be the best way to go. That way you only have to update the DMA API instead of messing with each and every driver trying to add code to force the page to be dirtied. 
- Alex
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 11/24/2015 05:38 AM, Lan Tianyu wrote: This patchset is to propose a solution of adding live migration support for SRIOV NIC. During migration, Qemu needs to let VF driver in the VM to know migration start and end. Qemu adds faked PCI migration capability to help to sync status between two sides during migration. Qemu triggers VF's mailbox irq via sending MSIX msg when migration status is changed. VF driver tells Qemu its mailbox vector index via the new PCI capability. In some cases(NIC is suspended or closed), VF mailbox irq is freed and VF driver can disable irq injecting via new capability. VF driver will put down nic before migration and put up again on the target machine. Lan Tianyu (3): VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO PCI: Add macros for faked PCI migration capability Ixgbevf: Add migration support for ixgbevf driver drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++ drivers/vfio/pci/vfio_pci.c | 21 + drivers/vfio/pci/vfio_pci_config.c | 38 ++-- drivers/vfio/pci/vfio_pci_private.h | 5 ++ include/uapi/linux/pci_regs.h | 18 +++- include/uapi/linux/vfio.h | 12 +++ 7 files changed, 194 insertions(+), 7 deletions(-) I'm still not a fan of this approach. I really feel like this is something that should be resolved by extending the existing PCI hot-plug rather than trying to instrument this per driver. Then you will get the goodness for multiple drivers and multiple OSes instead of just one. An added advantage to dealing with this in the PCI hot-plug environment would be that you could then still do a hot-plug even if the guest didn't load a driver for the VF since you would be working with the PCI slot instead of the device itself. - Alex
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 10/29/2015 07:41 PM, Lan Tianyu wrote: On 2015年10月30日 00:17, Alexander Duyck wrote: On 10/29/2015 01:33 AM, Lan Tianyu wrote: On 2015年10月29日 14:58, Alexander Duyck wrote: Your code was having to do a bunch of shuffling in order to get things set up so that you could bring the interface back up. I would argue that it may actually be faster at least on the bring-up to just drop the old rings and start over since it greatly reduced the complexity and the amount of device related data that has to be moved. If give up the old ring after migration and keep DMA running before stopping VCPU, it seems we don't need to track Tx/Rx descriptor ring and just make sure that all Rx buffers delivered to stack has been migrated. 1) Dummy write Rx buffer before checking Rx descriptor to ensure packet migrated first. Don't dummy write the Rx descriptor. You should only really need to dummy write the Rx buffer and you would do so after checking the descriptor, not before. Otherwise you risk corrupting the Rx buffer because it is possible for you to read the Rx buffer, DMA occurs, and then you write back the Rx buffer and now you have corrupted the memory. 2) Make a copy of Rx descriptor and then use the copied data to check buffer status. Not use the original descriptor because it won't be migrated and migration may happen between two access of the Rx descriptor. Do not just blindly copy the Rx descriptor ring. That is a recipe for disaster. The problem is DMA has to happen in a very specific order for things to function correctly. The Rx buffer has to be written and then the Rx descriptor. The problem is you will end up getting a read-ahead on the Rx descriptor ring regardless of which order you dirty things in. Sorry, I didn't say clearly. I meant to copy one Rx descriptor when receive rx irq and handle Rx ring. No, I understood what you are saying. My explanation was that it will not work. 
Current code in the ixgbevf_clean_rx_irq() checks status of the Rx descriptor whether its Rx buffer has been populated data and then read the packet length from Rx descriptor to handle the Rx buffer. That part you have correct. However there are very explicit rules about the ordering of the reads. My idea is to do the following three steps when receive Rx buffer in the ixgbevf_clean_rx_irq(). (1) dummy write the Rx buffer first, You cannot dummy write the Rx buffer without first being given ownership of it. In the driver this is handled in two phases. First we have to read the DD bit to see if it is set. If it is we can take ownership of the buffer. Second we have to either do a dma_sync_range_for_cpu or dma_unmap_page call so that we can guarantee the data has been moved to the buffer by the DMA API and that it knows it should no longer be accessing it. (2) make a copy of its Rx descriptor This is not advisable. Unless you can guarantee you are going to only read the descriptor after the DD bit is set you cannot guarantee that you won't race with device DMA. The problem is you could have the migration occur right in the middle of (2). If that occurs then you will have valid status bits, but the rest of the descriptor would be invalid data. (3) Check the buffer status and get length from the copy. I believe this is the assumption that is leading you down the wrong path. You would have to read the status before you could do the copy. You cannot do it after. Migration may happen every time. Happen between (1) and (2). If the Rx buffer has been populated data, VF driver will not know that on the new machine because the Rx descriptor isn't migrated. But it's still safe. The part I think you are not getting is that DMA can occur between (1) and (2). So if for example you were doing your dummy write while DMA was occurring you pull in your value, DMA occurs, you write your value and now you have corrupted an Rx frame by writing stale data back into it. 
Happen between (2) and (3). The copy will be migrated to new machine and Rx buffer is migrated firstly. If there is data in the Rx buffer, VF driver still can handle the buffer without migrating Rx descriptor. The next buffers will be ignored since we don't migrate Rx descriptor for them. Their status will be not completed on the new machine. You have kind of lost me on this part. Why do you believe their statuses will not be completed? How are you going to prevent the Rx descriptor ring from being migrated as it will be a dirty page by virtue of the fact that it is a bidirectional DMA mapping where the Rx path provides new buffers and writes those addresses in while the device is writing back the status bits and length. This is kind of what I was getting at. The Rx descriptor ring will show up as one of the dirtiest spots on the driver since it is constantly being overwritten by the CPU in ixgbevf_alloc_rx_buffers. Anyway we are kind of getting side tracked and I really think
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 10/29/2015 01:33 AM, Lan Tianyu wrote: On 2015年10月29日 14:58, Alexander Duyck wrote: Your code was having to do a bunch of shuffling in order to get things set up so that you could bring the interface back up. I would argue that it may actually be faster at least on the bring-up to just drop the old rings and start over since it greatly reduced the complexity and the amount of device related data that has to be moved. If give up the old ring after migration and keep DMA running before stopping VCPU, it seems we don't need to track Tx/Rx descriptor ring and just make sure that all Rx buffers delivered to stack has been migrated. 1) Dummy write Rx buffer before checking Rx descriptor to ensure packet migrated first. Don't dummy write the Rx descriptor. You should only really need to dummy write the Rx buffer and you would do so after checking the descriptor, not before. Otherwise you risk corrupting the Rx buffer because it is possible for you to read the Rx buffer, DMA occurs, and then you write back the Rx buffer and now you have corrupted the memory. 2) Make a copy of Rx descriptor and then use the copied data to check buffer status. Not use the original descriptor because it won't be migrated and migration may happen between two access of the Rx descriptor. Do not just blindly copy the Rx descriptor ring. That is a recipe for disaster. The problem is DMA has to happen in a very specific order for things to function correctly. The Rx buffer has to be written and then the Rx descriptor. The problem is you will end up getting a read-ahead on the Rx descriptor ring regardless of which order you dirty things in. The descriptor is only 16 bytes, you can fit 256 of them in a single page. There is a good chance you probably wouldn't be able to migrate if you were under heavy network stress, however you could still have several buffers written in the time it takes for you to halt the VM and migrate the remaining pages. 
Those buffers wouldn't be marked as dirty but odds are the page the descriptors are in would be. As such you will end up with the descriptors but not the buffers. The only way you could possibly migrate the descriptor rings cleanly would be to have enough knowledge about the layout of things to force the descriptor rings to be migrated first followed by all of the currently mapped Rx buffers. In addition you would need to have some means of tracking all of the Rx buffers such as an emulated IOMMU as you would need to migrate all of them, not just part. By doing it this way you would get the Rx descriptor rings in the earliest state possible and would be essentially emulating the Rx buffer writes occurring before the Rx descriptor writes. You would likely have several Rx buffer writes that would be discarded in the process as there would be no descriptor for them but at least the state of the system would be consistent. - Alex
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 10/28/2015 11:12 PM, Lan Tianyu wrote: On 2015年10月26日 23:03, Alexander Duyck wrote: No. I think you are missing the fact that there are 256 descriptors per page. As such if you dirty just 1 you will be pulling in 255 more, of which you may or may not have pulled in the receive buffer for. So for example if you have the descriptor ring size set to 256 then that means you are going to get whatever the descriptor ring has since you will be marking the entire ring dirty with every packet processed, however you cannot guarantee that you are going to get all of the receive buffers unless you go through and flush the entire ring prior to migrating. Yes, that will be a problem. How about adding tag for each Rx buffer and check the tag when deliver the Rx buffer to stack? If tag has been overwritten, this means the packet data has been migrated. Then you have to come up with a pattern that you can guarantee is the tag and not part of the packet data. That isn't going to be something that is easy to do. It would also have a serious performance impact on the VF. This is why I have said you will need to do something to force the rings to be flushed such as initiating a PM suspend prior to migrating. You need to do something to stop the DMA and flush the remaining Rx buffers if you want to have any hope of being able to migrate the Rx in a consistent state. Beyond that the only other thing you have to worry about are the Rx buffers that have already been handed off to the stack. However those should be handled if you do a suspend and somehow flag pages as dirty when they are unmapped from the DMA. - Alex This will be simple and maybe our first version to enable migration. But we still hope to find a way not to disable DMA before stopping VCPU to decrease service down time. You have to stop the Rx DMA at some point anyway. It is the only means to guarantee that the device stops updating buffers and descriptors so that you will have a consistent state. 
Your code was having to do a bunch of shuffling in order to get things set up so that you could bring the interface back up. I would argue that it may actually be faster at least on the bring-up to just drop the old rings and start over since it greatly reduced the complexity and the amount of device related data that has to be moved. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 10/25/2015 10:36 PM, Lan Tianyu wrote: On 2015-10-24 02:36, Alexander Duyck wrote: I was thinking about it and I am pretty sure the dummy write approach is problematic at best. Specifically the issue is that while you are performing a dummy write you risk pulling in descriptors for data that hasn't been dummy written to yet. So when you resume and restore your descriptors you will have ones that may contain Rx descriptors indicating they contain data when, after the migration, they don't. How about changing the sequence: dummy write the Rx packet data first and then its desc? This can ensure that Rx data is migrated before its desc and prevent such a case. No. I think you are missing the fact that there are 256 descriptors per page. As such if you dirty just 1 you will be pulling in 255 more, of which you may or may not have pulled in the receive buffer for. So for example if you have the descriptor ring size set to 256, that means you are going to get whatever the descriptor ring has, since you will be marking the entire ring dirty with every packet processed. However, you cannot guarantee that you are going to get all of the receive buffers unless you go through and flush the entire ring prior to migrating. This is why I have said you will need to do something to force the rings to be flushed, such as initiating a PM suspend prior to migrating. You need to do something to stop the DMA and flush the remaining Rx buffers if you want to have any hope of being able to migrate the Rx in a consistent state. Beyond that the only other thing you have to worry about are the Rx buffers that have already been handed off to the stack. However those should be handled if you do a suspend and somehow flag pages as dirty when they are unmapped from the DMA. - Alex
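The "256 descriptors per page" arithmetic above can be sketched directly. This is an illustrative calculation, assuming 16-byte ixgbe advanced descriptors and 4 KiB pages; it is not code from any of the patches:

```c
#include <assert.h>

/* ixgbe advanced Rx/Tx descriptors are 16 bytes, so a 4 KiB page holds
 * 4096 / 16 = 256 of them.  Dirty-page logging is page-granular, so
 * dirtying any one descriptor marks all 256 on that page dirty. */
#define DESC_SIZE      16u
#define PAGE_SIZE_4K   4096u
#define DESC_PER_PAGE  (PAGE_SIZE_4K / DESC_SIZE)

/* Range of ring indices co-dirtied with descriptor idx (same page). */
static void codirty_range(unsigned idx, unsigned *first, unsigned *last)
{
    *first = idx & ~(DESC_PER_PAGE - 1);
    *last  = *first + DESC_PER_PAGE - 1;
}
```

With a 256-entry ring the whole ring fits on a single page, which is why every processed packet re-dirties the entire descriptor ring.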
Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
On 10/24/2015 08:43 AM, Lan, Tianyu wrote: On 10/22/2015 4:52 AM, Alexander Duyck wrote: Also have you even considered the MSI-X configuration on the VF? I haven't seen anything anywhere that would have migrated the VF's MSI-X configuration from BAR 3 on one system to the new system. MSI-X migration is done by Hypervisor(Qemu). Following link is my Qemu patch to do that. http://marc.info/?l=kvm=144544706530484=2 I really don't like the idea of trying to migrate the MSI-X across from host to host while it is still active. I really think Qemu shouldn't be moving this kind of data over in a migration. I think that having the VF do a suspend/resume is the best way to go. Then it simplifies things as all you have to deal with is the dirty page tracking for the Rx DMA and you should be able to do this without making things too difficult. - Alex
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 10/21/2015 09:37 AM, Lan Tianyu wrote: This patchset is to propose a new solution to add live migration support for the 82599 SRIOV network card. In our solution, we prefer to put all device specific operations into the VF and PF drivers and make the code in Qemu more general. VF status migration = VF status can be divided into 4 parts 1) PCI configure regs 2) MSIX configure 3) VF status in the PF driver 4) VF MMIO regs The first three parts are all handled by Qemu. The PCI configure space regs and MSIX configure are originally stored in Qemu. To save and restore "VF status in the PF driver" by Qemu during migration, we add a new sysfs node "state_in_pf" under the VF sysfs directory. For VF MMIO regs, we introduce a self emulation layer in the VF driver to record MMIO reg values during reading or writing MMIO and put these data in guest memory. It will be migrated with guest memory to the new machine. VF function restoration Restoring VF function operation is done in the VF and PF drivers. In order to let the VF driver know the migration status, Qemu fakes VF PCI configure regs to indicate migration status and adds a new sysfs node "notify_vf" to trigger the VF mailbox irq in order to notify the VF about migration status changes. Transmit/Receive descriptor head regs are read-only and can't be restored via writing back the recorded reg values directly, and they are set to 0 during VF reset. To reuse the original tx/rx rings, shift the desc ring in order to move the desc pointed to by the original head reg to the first entry of the ring and then enable the tx/rx rings. The VF restarts to receive and transmit from the original head desc. Tracking DMA accessed memory = Migration relies on tracking dirty pages to migrate memory. Hardware can't automatically mark a page as dirty after DMA memory access. VF descriptor rings and data buffers are modified by hardware when receiving and transmitting data. To track such dirty memory manually, do dummy writes (read a byte and write it back) when receiving and transmitting data.
I was thinking about it and I am pretty sure the dummy write approach is problematic at best. Specifically the issue is that while you are performing a dummy write you risk pulling in descriptors for data that hasn't been dummy written to yet. So when you resume and restore your descriptors you will have ones that may contain Rx descriptors indicating they contain data when, after the migration, they don't. I really think the best approach to take would be to look at implementing an emulated IOMMU so that you could track DMA mapped pages and avoid migrating the ones marked as DMA_FROM_DEVICE until they are unmapped. The advantage to this is that in the case of the ixgbevf driver it now reuses the same pages for Rx DMA. As a result it will be rewriting the same pages often, and if you are marking those pages as dirty and transitioning them it is possible for a flow of small packets to really make a mess of things, since you would be rewriting the same pages in a loop while the device is processing packets. Beyond that I would say you could suspend/resume the device in order to get it to stop and flush the descriptor rings and any outstanding packets. The code for suspend would unmap the DMA memory, which would then be the trigger to flush it across in the migration, and the resume code would take care of any state restoration needed beyond any values that can be configured with the ip link command. If you wanted to do a proof of concept of this you could probably do so with very little overhead. Basically you would need the "page_addr" portion of patch 12 to emulate a slightly migration aware DMA API, and then beyond that you would need something like patch 9, but instead of adding new functions and API you would be switching things on and off via the ixgbevf_suspend/resume calls. - Alex
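The "defer migrating DMA_FROM_DEVICE pages until they are unmapped" idea can be sketched as a tiny tracking shim. This is a hypothetical illustration of the bookkeeping only, not the kernel DMA API; all names are invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum dma_dir { DMA_TO_DEVICE, DMA_FROM_DEVICE, DMA_BIDIRECTIONAL };

#define MAX_TRACK 64
static struct { uintptr_t pfn; bool from_dev; bool unmapped; } track[MAX_TRACK];
static int ntrack;

/* dma_map-time hook: remember which pages the device may write into. */
static void shim_map(uintptr_t pfn, enum dma_dir dir)
{
    track[ntrack].pfn = pfn;
    track[ntrack].from_dev = (dir != DMA_TO_DEVICE);
    track[ntrack].unmapped = false;
    ntrack++;
}

/* dma_unmap-time hook: the device is done with the page; it is now
 * coherent and safe to copy (and should be flagged dirty). */
static void shim_unmap(uintptr_t pfn)
{
    for (int i = 0; i < ntrack; i++)
        if (track[i].pfn == pfn)
            track[i].unmapped = true;
}

/* Migration pass: only copy pages the device can no longer scribble on. */
static bool migratable(uintptr_t pfn)
{
    for (int i = 0; i < ntrack; i++)
        if (track[i].pfn == pfn && track[i].from_dev && !track[i].unmapped)
            return false;
    return true;
}
```

The point of the design is that the unmap call, which the driver already issues, doubles as the "this page is now in a consistent state" signal for the dirty-page copier.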
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 10/23/2015 12:05 PM, Alex Williamson wrote: On Fri, 2015-10-23 at 11:36 -0700, Alexander Duyck wrote: On 10/21/2015 09:37 AM, Lan Tianyu wrote: This patchset is to propose a new solution to add live migration support for the 82599 SRIOV network card. In our solution, we prefer to put all device specific operations into the VF and PF drivers and make the code in Qemu more general. VF status migration = VF status can be divided into 4 parts 1) PCI configure regs 2) MSIX configure 3) VF status in the PF driver 4) VF MMIO regs The first three parts are all handled by Qemu. The PCI configure space regs and MSIX configure are originally stored in Qemu. To save and restore "VF status in the PF driver" by Qemu during migration, we add a new sysfs node "state_in_pf" under the VF sysfs directory. For VF MMIO regs, we introduce a self emulation layer in the VF driver to record MMIO reg values during reading or writing MMIO and put these data in guest memory. It will be migrated with guest memory to the new machine. VF function restoration Restoring VF function operation is done in the VF and PF drivers. In order to let the VF driver know the migration status, Qemu fakes VF PCI configure regs to indicate migration status and adds a new sysfs node "notify_vf" to trigger the VF mailbox irq in order to notify the VF about migration status changes. Transmit/Receive descriptor head regs are read-only and can't be restored via writing back the recorded reg values directly, and they are set to 0 during VF reset. To reuse the original tx/rx rings, shift the desc ring in order to move the desc pointed to by the original head reg to the first entry of the ring and then enable the tx/rx rings. The VF restarts to receive and transmit from the original head desc. Tracking DMA accessed memory = Migration relies on tracking dirty pages to migrate memory. Hardware can't automatically mark a page as dirty after DMA memory access. VF descriptor rings and data buffers are modified by hardware when receiving and transmitting data.
To track such dirty memory manually, do dummy writes (read a byte and write it back) when receiving and transmitting data. I was thinking about it and I am pretty sure the dummy write approach is problematic at best. Specifically the issue is that while you are performing a dummy write you risk pulling in descriptors for data that hasn't been dummy written to yet. So when you resume and restore your descriptors you will have ones that may contain Rx descriptors indicating they contain data when, after the migration, they don't. I really think the best approach to take would be to look at implementing an emulated IOMMU so that you could track DMA mapped pages and avoid migrating the ones marked as DMA_FROM_DEVICE until they are unmapped. The advantage to this is that in the case of the ixgbevf driver it now reuses the same pages for Rx DMA. As a result it will be rewriting the same pages often, and if you are marking those pages as dirty and transitioning them it is possible for a flow of small packets to really make a mess of things, since you would be rewriting the same pages in a loop while the device is processing packets. I'd be concerned that an emulated IOMMU on the DMA path would reduce throughput to the point where we shouldn't even bother with assigning the device in the first place and should be using virtio-net instead. POWER systems have a guest visible IOMMU and it's been challenging for them to get to 10Gbps, requiring real-mode tricks. virtio-net may add some latency, but it's not that hard to get it to 10Gbps and it already supports migration. An emulated IOMMU in the guest is really only good for relatively static mappings; the latency for anything else is likely too high. Maybe there are shadow page table tricks that could help, but it's imposing overhead the whole time the guest is running, not only on migration.
Thanks, The big overhead I have seen with IOMMU implementations is the fact that they almost always have some sort of locked table or tree that prevents multiple CPUs from accessing resources in any kind of timely fashion. As a result things like Tx is usually slowed down for network workloads when multiple CPUs are enabled. I admit doing a guest visible IOMMU would probably add some overhead, but this current patch set as implemented already has some of the hints of that as the descriptor rings are locked which means we cannot unmap in the Tx clean-up while we are mapping on another Tx queue for instance. One approach for this would be to implement or extend a lightweight DMA API such as swiotlb or nommu. The code would need to have a bit in there so it can take care of marking the pages as dirty on sync_for_cpu and unmap calls when set for BIDIRECTIONAL or FROM_DEVICE. Then if we cou
Re: [Qemu-devel] [RFC Patch 06/12] IXGBEVF: Add self emulation layer
On 10/22/2015 05:50 AM, Michael S. Tsirkin wrote: On Wed, Oct 21, 2015 at 01:58:19PM -0700, Alexander Duyck wrote: On 10/21/2015 09:37 AM, Lan Tianyu wrote: In order to restore VF function after migration, add self emulation layer to record regs' values during accessing regs. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- drivers/net/ethernet/intel/ixgbevf/Makefile| 3 ++- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++ drivers/net/ethernet/intel/ixgbevf/vf.h| 5 - 4 files changed, 33 insertions(+), 3 deletions(-) create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile b/drivers/net/ethernet/intel/ixgbevf/Makefile index 4ce4c97..841c884 100644 --- a/drivers/net/ethernet/intel/ixgbevf/Makefile +++ b/drivers/net/ethernet/intel/ixgbevf/Makefile @@ -31,7 +31,8 @@ obj-$(CONFIG_IXGBEVF) += ixgbevf.o -ixgbevf-objs := vf.o \ +ixgbevf-objs := self-emulation.o \ + vf.o \ mbx.o \ ethtool.o \ ixgbevf_main.o diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index a16d267..4446916 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg) if (IXGBE_REMOVED(reg_addr)) return IXGBE_FAILED_READ_REG; - value = readl(reg_addr + reg); + value = ixgbe_self_emul_readl(reg_addr, reg); if (unlikely(value == IXGBE_FAILED_READ_REG)) ixgbevf_check_remove(hw, reg); return value; diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c new file mode 100644 index 000..d74b2da --- /dev/null +++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c @@ -0,0 +1,26 @@ +#include +#include +#include +#include +#include + +#include "vf.h" +#include "ixgbevf.h" + +static u32 hw_regs[0x4000]; + +u32 
ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr) +{ + u32 tmp; + + tmp = readl(base + addr); + hw_regs[(unsigned long)addr] = tmp; + + return tmp; +} + +void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32 addr) +{ + hw_regs[(unsigned long)addr] = val; + writel(val, (volatile void __iomem *)(base + addr)); +} So I see what you are doing, however I don't think this adds much value. Many of the key registers for the device are not simple Read/Write registers. Most of them are things like write 1 to clear or some other sort of value where writing doesn't set the bit but has some other side effect. Just take a look through the Datasheet at registers such as the VFCTRL, VFMAILBOX, or most of the interrupt registers. The fact is simply storing the values off doesn't give you any real idea of what the state of things are. It doesn't, but I guess the point is to isolate the migration-related logic in the recovery code. An alternative would be to have some smart logic all over the place to only store what's required - that would be much more intrusive. After reviewing all of the patches yesterday I would say that almost all the values being stored aren't needed. They can be restored from the settings of the driver itself anyway. Copying the values out doesn't make much sense here since there are already enough caches for almost all of this data. - Alex
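A concrete illustration of why blind register shadowing goes wrong: many device registers are write-1-to-clear (W1C), for example interrupt cause registers, so a shadow that just stores the last written value immediately diverges from the hardware state. The register below is hypothetical, not a real ixgbe register:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t hw_icr = 0x5;   /* "hardware": two cause bits pending */
static uint32_t shadow_icr;     /* naive self-emulation shadow */

/* Naive shadow write: store the value, then forward it to hardware.
 * For a W1C register, writing 1s clears bits rather than setting them,
 * so shadow and hardware immediately disagree. */
static void shadow_writel(uint32_t val)
{
    shadow_icr = val;      /* shadow now claims these bits are "set" */
    hw_icr &= ~val;        /* real W1C semantics: those bits clear */
}
```

After one write the shadow says a bit is set that the hardware has already cleared, which is the failure mode the review describes.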
Re: [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device
On 10/21/2015 09:37 AM, Lan Tianyu wrote: Add "virtfn_index" member in the struct pci_device to record the VF sequence of the PF. This will be used in the VF sysfs node handle. Signed-off-by: Lan Tianyu--- drivers/pci/iov.c | 1 + include/linux/pci.h | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index ee0ebff..065b6bb 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset) virtfn->physfn = pci_dev_get(dev); virtfn->is_virtfn = 1; virtfn->multifunction = 0; + virtfn->virtfn_index = id; for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { res = &dev->resource[i + PCI_IOV_RESOURCES]; diff --git a/include/linux/pci.h b/include/linux/pci.h index 353db8d..85c5531 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -356,6 +356,7 @@ struct pci_dev { unsigned int io_window_1k:1; /* Intel P2P bridge 1K I/O windows */ unsigned int irq_managed:1; pci_dev_flags_t dev_flags; + unsigned int virtfn_index; atomic_t enable_cnt; /* pci_enable_device has been called */ u32 saved_config_space[16]; /* config space saved at suspend time */ Can't you just calculate the VF index based on the VF BDF number combined with the information in the PF BDF number and VF offset/stride? Seems kind of pointless to add a variable that is only used by one driver and is in a slowpath when you can just calculate it pretty quickly. - Alex
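The calculation Alex refers to follows from the SR-IOV routing-ID rule (VF n's routing ID = PF RID + First VF Offset + n × VF Stride), so the index can be recovered without a new struct member. A minimal sketch, with all values in the test purely hypothetical:

```c
#include <assert.h>

/* Recover a VF's index from routing IDs.  Returns -1 if vf_rid does not
 * belong to this PF's VF range.  offset/stride stand in for the PF's
 * SR-IOV capability fields (First VF Offset, VF Stride). */
static int vf_index_from_rid(unsigned pf_rid, unsigned vf_rid,
                             unsigned offset, unsigned stride)
{
    unsigned delta;

    if (stride == 0 || vf_rid < pf_rid + offset)
        return -1;
    delta = vf_rid - pf_rid - offset;
    if (delta % stride)
        return -1;              /* not a VF of this PF */
    return (int)(delta / stride);
}
```

With the 82599's values quoted in patch 3 (offset 0x180, stride 2), the index drops out of pure arithmetic on the BDF, which is why storing it separately adds little.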
Re: [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver
On 10/21/2015 09:37 AM, Lan Tianyu wrote: This patch is to add a sysfs interface state_in_pf under the sysfs directory of the VF PCI device for Qemu to get and put VF status in the PF driver during migration. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 - 1 file changed, 155 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index ab2a2e2..89671eb 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter) return -ENOMEM; } +#define IXGBE_PCI_VFCOMMAND 0x4 +#define IXGBE_PCI_VFMSIXMC 0x72 +#define IXGBE_SRIOV_VF_OFFSET 0x180 +#define IXGBE_SRIOV_VF_STRIDE 0x2 + +#define to_adapter(dev) ((struct ixgbe_adapter *)(pci_get_drvdata(to_pci_dev(dev)->physfn))) + +struct state_in_pf { + u16 command; + u16 msix_message_control; + struct vf_data_storage vf_data; +}; + +static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn) +{ + u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * vfn; + return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff); +} + +static ssize_t ixgbe_show_state_in_pf(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct ixgbe_adapter *adapter = to_adapter(dev); + struct pci_dev *pdev = adapter->pdev, *vdev; + struct pci_dev *vf_pdev = to_pci_dev(dev); + struct ixgbe_hw *hw = &adapter->hw; + struct state_in_pf *state = (struct state_in_pf *)buf; + int vfn = vf_pdev->virtfn_index; + u32 reg, reg_offset, vf_shift; + + /* Clear VF mac and disable VF */ + ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, vfn); + + /* Record PCI configurations */ + vdev = ixgbe_get_virtfn_dev(pdev, vfn); + if (vdev) { + pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, &state->command); + pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, &state->msix_message_control); + } + else + printk(KERN_WARNING "Unable to find VF device.\n"); + Formatting for the if/else is incorrect. The else condition should be in brackets as well. + /* Record states held by PF */ + memcpy(&state->vf_data, &adapter->vfinfo[vfn], sizeof(struct vf_data_storage)); + + vf_shift = vfn % 32; + reg_offset = vfn / 32; + + reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg); + + return sizeof(struct state_in_pf); +} + This is a read. Why does it need to switch off the VF? Also why turn off the anti-spoof? It doesn't make much sense. +static ssize_t ixgbe_store_state_in_pf(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct ixgbe_adapter *adapter = to_adapter(dev); + struct pci_dev *pdev = adapter->pdev, *vdev; + struct pci_dev *vf_pdev = to_pci_dev(dev); + struct state_in_pf *state = (struct state_in_pf *)buf; + int vfn = vf_pdev->virtfn_index; + + /* Check struct size */ + if (count != sizeof(struct state_in_pf)) { + printk(KERN_ERR "State in PF size does not fit.\n"); + goto out; + } + + /* Restore PCI configurations */ + vdev = ixgbe_get_virtfn_dev(pdev, vfn); + if (vdev) { + pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, state->command); + pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, state->msix_message_control); + } + + /* Restore states held by PF */ + memcpy(&adapter->vfinfo[vfn], &state->vf_data, sizeof(struct vf_data_storage)); + + out: + return count; +} Just doing a memcpy to move the vfinfo over adds no value. The fact is there are a number of filters that have to be configured in hardware after, and it isn't as simple as just migrating the values stored.
As I mentioned, in the case of the 82598 there is also jumbo frames to take into account. If the first PF didn't have it enabled, but the second one does, that implies the state of the VF needs to change to account for that. I really think you would be better off only migrating the data related to what can be configured using the ip link command and leaving other values such as clear_to_send at the reset value of 0. Then
Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
On 10/21/2015 09:37 AM, Lan Tianyu wrote: This patch is to add new sysfs interface of "notify_vf" under sysfs directory of VF PCI device for Qemu to notify VF when migration status is changed. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 30 ++ drivers/net/ethernet/intel/ixgbe/ixgbe_type.h | 4 2 files changed, 34 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index e247d67..5cc7817 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -217,10 +217,37 @@ static ssize_t ixgbe_store_state_in_pf(struct device *dev, return count; } +static ssize_t ixgbe_store_notify_vf(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct ixgbe_adapter *adapter = to_adapter(dev); + struct ixgbe_hw *hw = &adapter->hw; + struct pci_dev *vf_pdev = to_pci_dev(dev); + int vfn = vf_pdev->virtfn_index; + u32 ivar; + + /* Enable VF mailbox irq first */ + IXGBE_WRITE_REG(hw, IXGBE_PVTEIMS(vfn), 0x4); + IXGBE_WRITE_REG(hw, IXGBE_PVTEIAM(vfn), 0x4); + IXGBE_WRITE_REG(hw, IXGBE_PVTEIAC(vfn), 0x4); + + ivar = IXGBE_READ_REG(hw, IXGBE_PVTIVAR_MISC(vfn)); + ivar &= ~0xFF; + ivar |= 0x2 | IXGBE_IVAR_ALLOC_VAL; + IXGBE_WRITE_REG(hw, IXGBE_PVTIVAR_MISC(vfn), ivar); + + ixgbe_ping_vf(adapter, vfn); + return count; +} + NAK, this won't fly. You can't just go in from the PF and enable interrupts on the VF hoping they are configured well enough to handle an interrupt you decide to trigger from them. Also have you even considered the MSI-X configuration on the VF? I haven't seen anything anywhere that would have migrated the VF's MSI-X configuration from BAR 3 on one system to the new system.
static struct device_attribute ixgbe_per_state_in_pf_attribute = __ATTR(state_in_pf, S_IRUGO | S_IWUSR, ixgbe_show_state_in_pf, ixgbe_store_state_in_pf); +static struct device_attribute ixgbe_per_notify_vf_attribute = + __ATTR(notify_vf, S_IWUSR, NULL, ixgbe_store_notify_vf); + void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter) { struct pci_dev *pdev = adapter->pdev; @@ -241,6 +268,8 @@ void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter) if (vfdev->is_virtfn) { ret = device_create_file(&vfdev->dev, &ixgbe_per_state_in_pf_attribute); + ret |= device_create_file(&vfdev->dev, + &ixgbe_per_notify_vf_attribute); if (ret) pr_warn("Unable to add VF attribute for dev %s,\n", dev_name(&vfdev->dev)); @@ -269,6 +298,7 @@ void ixgbe_remove_vf_attrib(struct ixgbe_adapter *adapter) while (vfdev) { if (vfdev->is_virtfn) { device_remove_file(&vfdev->dev, &ixgbe_per_state_in_pf_attribute); + device_remove_file(&vfdev->dev, &ixgbe_per_notify_vf_attribute); } vfdev = pci_get_device(pdev->vendor, vf_id, vfdev); More driver specific sysfs. This needs to be moved out of the driver if this is to be considered anything more than a proof of concept.
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h index dd6ba59..c6ddb66 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h @@ -2302,6 +2302,10 @@ enum { #define IXGBE_PVFTDT(P) (0x06018 + (0x40 * (P))) #define IXGBE_PVFTDWBAL(P)(0x06038 + (0x40 * (P))) #define IXGBE_PVFTDWBAH(P)(0x0603C + (0x40 * (P))) +#define IXGBE_PVTEIMS(P) (0x00D00 + (4 * (P))) +#define IXGBE_PVTIVAR_MISC(P) (0x04E00 + (4 * (P))) +#define IXGBE_PVTEIAC(P) (0x00F00 + (4 * P)) +#define IXGBE_PVTEIAM(P) (0x04D00 + (4 * P)) #define IXGBE_PVFTDWBALn(q_per_pool, vf_number, vf_q_index) \ (IXGBE_PVFTDWBAL((q_per_pool)*(vf_number) + (vf_q_index)))
Re: [RFC Patch 06/12] IXGBEVF: Add self emulation layer
On 10/21/2015 09:37 AM, Lan Tianyu wrote: In order to restore VF function after migration, add self emulation layer to record regs' values during accessing regs. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbevf/Makefile| 3 ++- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++ drivers/net/ethernet/intel/ixgbevf/vf.h| 5 - 4 files changed, 33 insertions(+), 3 deletions(-) create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile b/drivers/net/ethernet/intel/ixgbevf/Makefile index 4ce4c97..841c884 100644 --- a/drivers/net/ethernet/intel/ixgbevf/Makefile +++ b/drivers/net/ethernet/intel/ixgbevf/Makefile @@ -31,7 +31,8 @@ obj-$(CONFIG_IXGBEVF) += ixgbevf.o -ixgbevf-objs := vf.o \ +ixgbevf-objs := self-emulation.o \ + vf.o \ mbx.o \ ethtool.o \ ixgbevf_main.o diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index a16d267..4446916 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg) if (IXGBE_REMOVED(reg_addr)) return IXGBE_FAILED_READ_REG; - value = readl(reg_addr + reg); + value = ixgbe_self_emul_readl(reg_addr, reg); if (unlikely(value == IXGBE_FAILED_READ_REG)) ixgbevf_check_remove(hw, reg); return value; diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c new file mode 100644 index 000..d74b2da --- /dev/null +++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c @@ -0,0 +1,26 @@ +#include +#include +#include +#include +#include + +#include "vf.h" +#include "ixgbevf.h" + +static u32 hw_regs[0x4000]; + +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr) +{ + u32 tmp; + + tmp = readl(base + addr); + hw_regs[(unsigned long)addr] = tmp; + + return 
tmp; +} + +void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32 addr) +{ + hw_regs[(unsigned long)addr] = val; + writel(val, (volatile void __iomem *)(base + addr)); +} So I see what you are doing, however I don't think this adds much value. Many of the key registers for the device are not simple Read/Write registers. Most of them are things like write 1 to clear or some other sort of value where writing doesn't set the bit but has some other side effect. Just take a look through the Datasheet at registers such as the VFCTRL, VFMAILBOX, or most of the interrupt registers. The fact is simply storing the values off doesn't give you any real idea of what the state of things are. diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h index d40f036..6a3f4eb 100644 --- a/drivers/net/ethernet/intel/ixgbevf/vf.h +++ b/drivers/net/ethernet/intel/ixgbevf/vf.h @@ -39,6 +39,9 @@ struct ixgbe_hw; +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr); +void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32 addr); + /* iterator type for walking multicast address lists */ typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr, u32 *vmdq); @@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 reg, u32 value) if (IXGBE_REMOVED(reg_addr)) return; - writel(value, reg_addr + reg); + ixgbe_self_emul_writel(value, reg_addr, reg); } #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v)
Re: [RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver
On 10/21/2015 09:37 AM, Lan Tianyu wrote: This patch is to restore VF status in the PF driver when getting an event from the VF. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbe/ixgbe.h | 1 + drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h | 1 + drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++ 3 files changed, 42 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index 636f9e3..9d5669a 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -148,6 +148,7 @@ struct vf_data_storage { bool pf_set_mac; u16 pf_vlan; /* When set, guest VLAN config not allowed. */ u16 pf_qos; + u32 vf_lpe; u16 tx_rate; u16 vlan_count; u8 spoofchk_enabled; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h index b1e4703..8fdb38d 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h @@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev { /* mailbox API, version 1.1 VF requests */ #define IXGBE_VF_GET_QUEUES 0x09 /* get queue configuration */ +#define IXGBE_VF_NOTIFY_RESUME 0x0c /* VF notify PF migration finishing */ /* GET_QUEUES return data indices within the mailbox */ #define IXGBE_VF_TX_QUEUES 1 /* number of Tx queues supported */ diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index 1d17b58..ab2a2e2 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter *adapter, u32 vf, } } +/** + * Restore the settings by mailbox, after migration + **/ +void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf) +{ + struct ixgbe_hw *hw = &adapter->hw; + u32 reg, reg_offset, vf_shift; + int rar_entry = hw->mac.num_rar_entries - (vf + 1); + + vf_shift = vf % 32; + reg_offset = vf / 32; + + /* enable transmit 
and receive for vf */ + reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset)); + reg |= (1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset)); + reg |= (1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg); + This is just blanket enabling Rx and Tx. I don't see how this can be valid. It seems like it would result in memory corruption for the guest if you are enabling Rx on a device that is not ready. A perfect example is if the guest is not configured to handle jumbo frames and the PF has jumbo frames enabled. + reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset)); + reg |= (1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg); This assumes that the anti-spoof is enabled. That may not be the case. + ixgbe_vf_reset_event(adapter, vf); + + hw->mac.ops.set_rar(hw, rar_entry, + adapter->vfinfo[vf].vf_mac_addresses, + vf, IXGBE_RAH_AV); + + + if (adapter->vfinfo[vf].vf_lpe) + ixgbe_set_vf_lpe(adapter, &adapter->vfinfo[vf].vf_lpe, vf); +} + The function ixgbe_set_vf_lpe also enables the receive; you should take a look at it. For 82598 you cannot just arbitrarily enable the Rx as there is a risk of corrupting guest memory or causing a kernel panic. static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf) { struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ]; @@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf) break; case IXGBE_VF_SET_LPE: retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf); + adapter->vfinfo[vf].vf_lpe = *msgbuf; break; Why not just leave this for the VF to notify us of via a reset? It seems like if the VF is migrated it should start with the cts bits of the mailbox cleared as though the PF driver has been reloaded.
case IXGBE_VF_SET_MACVLAN: retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf); @@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf) case IXGBE_VF_GET_RSS_KEY: retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf); break; + case IXGBE_VF_NOTIFY_RESUME: + ixgbe_restore_setting(adapter, vf); + break; default: e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]); retval = IXGBE_ERR_MBX; I really don't think the VF should be sending us a message telling us to restore settings. Why not just use the existing messages? The VF as it is now can survive a suspend/resume
Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
On 10/21/2015 09:37 AM, Lan Tianyu wrote: When transmit a package, the end transmit desc of package indicates whether package is sent already. Current code records the end desc's pointer in the next_to_watch of struct tx buffer. This code will be broken if shifting desc ring after migration. The pointer will be invalid. This patch is to replace recording pointer with recording the desc number of the package and find the end decs via the first desc and desc number. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 1 + drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 --- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index 775d089..c823616 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -54,6 +54,7 @@ */ struct ixgbevf_tx_buffer { union ixgbe_adv_tx_desc *next_to_watch; + u16 desc_num; unsigned long time_stamp; struct sk_buff *skb; unsigned int bytecount; So if you can't use next_to_watch why is it left in here? Also you might want to take a look at moving desc_num to a different spot in the buffer as you are leaving a 6 byte hole in the descriptor. diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 4446916..056841c 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct ixgbevf_ring *tx_ring, DMA_TO_DEVICE); } tx_buffer->next_to_watch = NULL; + tx_buffer->desc_num = 0; tx_buffer->skb = NULL; dma_unmap_len_set(tx_buffer, len, 0); This opens up a race condition. If you have a descriptor ready to be cleaned at offset 0 what is to prevent you from just running through the ring? You likely need to find a descriptor number that cannot be valid to use here. 
/* tx_buffer must be completely set up in the transmit path */ @@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, union ixgbe_adv_tx_desc *tx_desc; unsigned int total_bytes = 0, total_packets = 0; unsigned int budget = tx_ring->count / 2; - unsigned int i = tx_ring->next_to_clean; + int i, watch_index; Where is i being initialized? It was here but you removed it. Are you using i without initializing it? if (test_bit(__IXGBEVF_DOWN, >state)) return true; @@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, i -= tx_ring->count; do { - union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch; + union ixgbe_adv_tx_desc *eop_desc; + + if (!tx_buffer->desc_num) + break; + + if (i + tx_buffer->desc_num >= 0) + watch_index = i + tx_buffer->desc_num; + else + watch_index = i + tx_ring->count + tx_buffer->desc_num; - /* if next_to_watch is not set then there is no work pending */ + eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index); if (!eop_desc) break; So I don't see how this isn't triggering Tx hangs. I suspect for the simple ping case desc_num will often be 0. The fact is there are many cases where first and tx_buffer_info are the same descriptor. @@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, /* clear next_to_watch to prevent false hangs */ tx_buffer->next_to_watch = NULL; + tx_buffer->desc_num = 0; /* update the statistics for this packet */ total_bytes += tx_buffer->bytecount; You cannot use 0 because 0 is a valid number. You are using it as a look-ahead currently and there are cases where i is the eop_desc index. 
@@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, u32 tx_flags = first->tx_flags; __le32 cmd_type; u16 i = tx_ring->next_to_use; + u16 start; tx_desc = IXGBEVF_TX_DESC(tx_ring, i); @@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, /* set next_to_watch value indicating a packet is present */ first->next_to_watch = tx_desc; + start = first - tx_ring->tx_buffer_info; + first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - start; i++; if (i == tx_ring->count) start and i could be the same value. If you look at ixgbevf_tx_map you should find that if the packet is contained in a single buffer then the first and last descriptor in your
Re: [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver
On 10/21/2015 09:37 AM, Lan Tianyu wrote:

To let the VF driver in the guest know the migration status, QEMU fakes PCI config registers 0xF0 and 0xF1 to expose the migration state and to collect the ack from the VF driver. When migration starts, QEMU sets reg 0xF0 to 1, notifies the VF driver by triggering a mailbox msg, and waits for the VF driver to report that it is ready for migration (by setting reg 0xF1 to 1). After migration, QEMU sets reg 0xF0 back to 0 and notifies the VF driver via a mailbox irq. The VF driver begins restoring tx/rx function after detecting the status change. When the VF receives the mailbox irq, it checks reg 0xF0 in the service task function to get the migration status and performs the related operations according to its value.

Steps for restarting the receive and transmit function:
1) Restore VF status in the PF driver by sending a mailbox event to the PF driver
2) Write back reg values recorded by the self-emulation layer
3) Restart the rx/tx rings
4) Recover interrupts

The transmit/receive descriptor head registers are read-only, can't be restored by writing back the recorded reg values directly, and are reset to 0 during VF reset. To reuse the original tx/rx rings, shift the descriptor ring so that the descriptor pointed to by the original head reg moves to the first entry of the ring, then enable the tx/rx rings. The VF resumes receiving and transmitting from the original head descriptor.
Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbevf/defines.h | 6 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 7 +- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 115 - .../net/ethernet/intel/ixgbevf/self-emulation.c| 107 +++ 4 files changed, 232 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h index 770e21a..113efd2 100644 --- a/drivers/net/ethernet/intel/ixgbevf/defines.h +++ b/drivers/net/ethernet/intel/ixgbevf/defines.h @@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc { __le32 mss_l4len_idx; }; +union ixgbevf_desc { + union ixgbe_adv_tx_desc rx_desc; + union ixgbe_adv_rx_desc tx_desc; + struct ixgbe_adv_tx_context_desc tx_context_desc; +}; + /* Adv Transmit Descriptor Config Masks */ #define IXGBE_ADVTXD_DTYP_MASK0x00F0 /* DTYP mask */ #define IXGBE_ADVTXD_DTYP_CTXT0x0020 /* Advanced Context Desc */ diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index c823616..6eab402e 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -109,7 +109,7 @@ struct ixgbevf_ring { struct ixgbevf_ring *next; struct net_device *netdev; struct device *dev; - void *desc; /* descriptor ring memory */ + union ixgbevf_desc *desc; /* descriptor ring memory */ dma_addr_t dma; /* phys. 
address of descriptor ring */ unsigned int size; /* length in bytes */ u16 count; /* amount of descriptors */ @@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector); void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter); void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter); +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head); +int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head); +void ixgbevf_restore_state(struct ixgbevf_adapter *adapter); +inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter); + #ifdef DEBUG char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw); diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 056841c..15ec361 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function Network Driver"); MODULE_LICENSE("GPL"); MODULE_VERSION(DRV_VERSION); + +#define MIGRATION_COMPLETED 0x00 +#define MIGRATION_IN_PROGRESS 0x01 + #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK) static int debug = -1; module_param(debug, int, 0); @@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring) return ring->stats.packets; } +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) +{ + struct ixgbevf_tx_buffer *tx_buffer = NULL; + static union ixgbevf_desc *tx_desc = NULL; + + tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count)); + if (!tx_buffer) + return -ENOMEM; + + tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count); + if (!tx_desc) + return -ENOMEM; + + memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count); + memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - head)); +
Re: [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation
On 10/21/2015 09:37 AM, Lan Tianyu wrote: Ring shifting during restoring VF function maybe race with original ring operation(transmit/receive package). This patch is to add tx/rx lock to protect ring related data. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 2 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 --- 2 files changed, 27 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index 6eab402e..3a748c8 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -448,6 +448,8 @@ struct ixgbevf_adapter { spinlock_t mbx_lock; unsigned long last_reset; + spinlock_t mg_rx_lock; + spinlock_t mg_tx_lock; }; Really, a shared lock for all of the Rx or Tx rings? This is going to kill any chance at performance. Especially since just recently the VFs got support for RSS. To top it off it also means we cannot clean Tx while adding new buffers which will kill Tx performance. The other concern I have is what is supposed to prevent the hardware from accessing the rings while you are reading? I suspect nothing so I don't see how this helps anything. I would honestly say you are better off just giving up on all of the data stored in the descriptor rings rather than trying to restore them. Yes you are going to lose a few packets but you don't have the risk for races that this code introduces. 
enum ixbgevf_state_t { diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 15ec361..04b6ce7 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring) int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) { + struct ixgbevf_adapter *adapter = netdev_priv(r->netdev); struct ixgbevf_tx_buffer *tx_buffer = NULL; static union ixgbevf_desc *tx_desc = NULL; + unsigned long flags; tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count)); if (!tx_buffer) @@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) if (!tx_desc) return -ENOMEM; + spin_lock_irqsave(>mg_tx_lock, flags); memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count); memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - head)); memcpy(>desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head); @@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) else r->next_to_use += (r->count - head); + spin_unlock_irqrestore(>mg_tx_lock, flags); + vfree(tx_buffer); vfree(tx_desc); return 0; @@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) { + struct ixgbevf_adapter *adapter = netdev_priv(r->netdev); struct ixgbevf_rx_buffer *rx_buffer = NULL; static union ixgbevf_desc *rx_desc = NULL; + unsigned long flags; rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count)); if (!rx_buffer) @@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) if (!rx_desc) return -ENOMEM; + spin_lock_irqsave(>mg_rx_lock, flags); memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count)); memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - head)); memcpy(>desc[r->count - head], rx_desc, 
sizeof(union ixgbevf_desc) * head); @@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) r->next_to_use -= head; else r->next_to_use += (r->count - head); + spin_unlock_irqrestore(>mg_rx_lock, flags); vfree(rx_buffer); vfree(rx_desc); @@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, if (test_bit(__IXGBEVF_DOWN, >state)) return true; + spin_lock(>mg_tx_lock); + i = tx_ring->next_to_clean; tx_buffer = _ring->tx_buffer_info[i]; tx_desc = IXGBEVF_TX_DESC(tx_ring, i); i -= tx_ring->count; @@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, q_vector->tx.total_bytes += total_bytes; q_vector->tx.total_packets += total_packets; + spin_unlock(>mg_tx_lock); + if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) { struct ixgbe_hw *hw = >hw; union ixgbe_adv_tx_desc *eop_desc; @@ -999,10 +1012,12 @@ static int
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 10/21/2015 12:20 PM, Alex Williamson wrote: On Wed, 2015-10-21 at 21:45 +0300, Or Gerlitz wrote: On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyuwrote: This patchset is to propose a new solution to add live migration support for 82599 SRIOV network card. In our solution, we prefer to put all device specific operation into VF and PF driver and make code in the Qemu more general. [...] Service down time test So far, we tested migration between two laptops with 82599 nic which are connected to a gigabit switch. Ping VF in the 0.001s interval during migration on the host of source side. It service down time is about 180ms. So... what would you expect service down wise for the following solution which is zero touch and I think should work for any VF driver: on host A: unplug the VM and conduct live migration to host B ala the no-SRIOV case. The trouble here is that the VF needs to be unplugged prior to the start of migration because we can't do effective dirty page tracking while the device is connected and doing DMA. So the downtime, assuming we're counting only VF connectivity, is dependent on memory size, rate of dirtying, and network bandwidth; seconds for small guests, minutes or more (maybe much, much more) for large guests. The question of dirty page tracking though should be pretty simple. We start the Tx packets out as dirty so we don't need to add anything there. It seems like the Rx data and Tx/Rx descriptor rings are the issue. This is why the typical VF agnostic approach here is to using bonding and fail over to a emulated device during migration, so performance suffers, but downtime is something acceptable. If we want the ability to defer the VF unplug until just before the final stages of the migration, we need the VF to participate in dirty page tracking. Here it's done via an enlightened guest driver. Alex Graf presented a solution using a device specific enlightenment in QEMU. Otherwise we'd need hardware support from the IOMMU. 
My only real complaint with this patch series is that it seems like there was to much focus on instrumenting the driver instead of providing the code necessary to enable a driver ecosystem that enables migration. I don't know if what we need is a full hardware IOMMU. It seems like a good way to take care of the need to flag dirty pages for DMA capable devices would be to add functionality to the dma_map_ops calls sync_{sg|single}for_cpu and unmap_{page|sg} so that they would take care of mapping the pages as dirty for us when needed. We could probably make do with just a few tweaks to existing API in order to make this work. As far as the descriptor rings I would argue they are invalid as soon as we migrate. The problem is there is no way to guarantee ordering as we cannot pre-emptively mark an Rx data buffer as being a dirty page when we haven't even looked at the Rx descriptor for the given buffer yet. Tx has similar issues as we cannot guarantee the Tx will disable itself after a complete frame. As such I would say the moment we migrate we should just give up on the frames that are still in the descriptor rings, drop them, and then start over with fresh rings. - Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] PCI: introduce helper functions for device flag operation
On 07/28/2014 07:43 PM, ethan zhao wrote: On 2014/7/29 10:31, Alex Williamson wrote: On Tue, 2014-07-29 at 09:53 +0800, ethan zhao wrote: On 2014/7/29 5:00, Alex Williamson wrote: On Wed, 2014-07-23 at 00:19 +0800, Ethan Zhao wrote: This patch introduced three helper functions to hide direct device flag operation. void pci_set_dev_assigned(struct pci_dev *pdev); void pci_set_dev_deassigned(struct pci_dev *pdev); bool pci_is_dev_assigned(struct pci_dev *pdev); Signed-off-by: Ethan Zhao ethan.z...@oracle.com --- include/linux/pci.h | 13 + 1 files changed, 13 insertions(+), 0 deletions(-) diff --git a/include/linux/pci.h b/include/linux/pci.h index aab57b4..5f6f8fa 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1129,6 +1129,19 @@ resource_size_t pcibios_window_alignment(struct pci_bus *bus, int pci_set_vga_state(struct pci_dev *pdev, bool decode, unsigned int command_bits, u32 flags); +/* helper functions for operation of device flag */ +static inline void pci_set_dev_assigned(struct pci_dev *pdev) +{ +pdev-dev_flags |= PCI_DEV_FLAGS_ASSIGNED; +} +static inline void pci_set_dev_deassigned(struct pci_dev *pdev) +{ +pdev-dev_flags = ~PCI_DEV_FLAGS_ASSIGNED; +} I think pci_clear_dev_assigned() would make more sense, we're not setting a flag named DEASSIGNED. Though it is a flag operation now, may not later, we define it because we want to hide the internal operation. 'set' to 'deassigned' status is enough. So I would like keep it. I disagree, the opposite of a 'set' is a 'clear', or at least an 'unset'. Using bit-ops-like terminology doesn't lock us into an implementation. As written, this could just as easily be setting two different variables. So there are two pairs of opposite: set assigned --- unset assigned set assigned --- set deassigned Here you prefer the 'verb' set /unset, and I prefer the 'adj.' assigned / deassigned. Do they really have different meaning or make confusion ? I don't think so. Thanks, Ethan I agree with Alex W. 
If you are going to use the set name you should probably use clear for the operation that undoes it. Using the term set implies that it is setting a bit and we currently don't have a deassigned/unassigned bit. If that doesn't work perhaps you could use something like get/put_assigned_device or acquire/release_assigned_device. You just need to describe what it is you are doing. You could even consider adding some tests to return an error if you attempt to assign a device that is already assigned. Thanks, Alex D

Re: [PATCH 1/2 V3] PCI: introduce device assignment interface and refactory related code
On 07/11/2014 05:30 AM, Ethan Zhao wrote: This patch introduces two new device assignment functions pci_iov_assign_device(), pci_iov_deassign_device() along with the existed one pci_vfs_assigned() They construct the VFs assignment management interface, used to assign/ deassign device to VM and query the VFs reference counter. instead of direct manipulation of device flag. This patch refashioned the related code and make them atomic. v3: change the naming of device assignment helpers, because they work for all kind of PCI device, not only SR-IOV (david.vra...@citrix.com) v2: reorder the patchset and make it bisectable and atomic, steps clear between interface defination and implemenation according to the suggestion from alex.william...@redhat.com Signed-off-by: Ethan Zhao ethan.z...@oracle.com --- drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 17 ++--- drivers/pci/iov.c | 20 drivers/xen/xen-pciback/pci_stub.c |4 ++-- include/linux/pci.h|4 virt/kvm/assigned-dev.c|2 +- virt/kvm/iommu.c |4 ++-- 6 files changed, 31 insertions(+), 20 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index 02c11a7..781040e 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -693,22 +693,9 @@ complete_reset: static bool i40e_vfs_are_assigned(struct i40e_pf *pf) { struct pci_dev *pdev = pf-pdev; - struct pci_dev *vfdev; - - /* loop through all the VFs to see if we own any that are assigned */ - vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, I40E_DEV_ID_VF , NULL); - while (vfdev) { - /* if we don't own it we don't care */ - if (vfdev-is_virtfn pci_physfn(vfdev) == pdev) { - /* if it is assigned we cannot release it */ - if (vfdev-dev_flags PCI_DEV_FLAGS_ASSIGNED) - return true; - } - vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, -I40E_DEV_ID_VF, -vfdev); - } + if (pci_vfs_assigned(pdev)) + return true; return false; } This portion for i40e 
should be in one patch by itself. It shouldn't be included in the bits below. Normally this would go through netdev. The rest of this below would go through linux-pci. diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index de7a747..090f827 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -644,6 +644,26 @@ int pci_vfs_assigned(struct pci_dev *dev) EXPORT_SYMBOL_GPL(pci_vfs_assigned); /** + * pci_iov_assign_device - assign device to VM + * @pdev: the device to be assigned. + */ +void pci_iov_assign_device(struct pci_dev *pdev) +{ + pdev-dev_flags |= PCI_DEV_FLAGS_ASSIGNED; +} +EXPORT_SYMBOL_GPL(pci_iov_assign_device); + +/** + * pci_iov_deassign_device - deasign device from VM + * @pdev: the device to be deassigned. + */ +void pci_iov_deassign_device(struct pci_dev *pdev) +{ + pdev-dev_flags = ~PCI_DEV_FLAGS_ASSIGNED; +} +EXPORT_SYMBOL_GPL(pci_iov_deassign_device); + +/** * pci_sriov_set_totalvfs -- reduce the TotalVFs available * @dev: the PCI PF device * @numvfs: number that should be used for TotalVFs supported The two functions above don't have anything to do with IOV. You can direct assign a device that doesn't even support SR-IOV or MR-IOV. You might be better off defining this as something like pci_set_flag_assigned/pci_clear_flag_assigned. I would likely also make them inline and possibly move them to pci.h since it would likely result in less actual code after you consider the overhead to push everything on the stack prior to making the call. 
diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c index 62fcd48..27e00d1 100644 --- a/drivers/xen/xen-pciback/pci_stub.c +++ b/drivers/xen/xen-pciback/pci_stub.c @@ -133,7 +133,7 @@ static void pcistub_device_release(struct kref *kref) xen_pcibk_config_free_dyn_fields(dev); xen_pcibk_config_free_dev(dev); - dev-dev_flags = ~PCI_DEV_FLAGS_ASSIGNED; + pci_iov_deassign_device(dev); pci_dev_put(dev); kfree(psdev); @@ -404,7 +404,7 @@ static int pcistub_init_device(struct pci_dev *dev) dev_dbg(dev-dev, reset device\n); xen_pcibk_reset_device(dev); - dev-dev_flags |= PCI_DEV_FLAGS_ASSIGNED; + pci_iov_assign_device(dev); return 0; config_release: diff --git a/include/linux/pci.h b/include/linux/pci.h index aab57b4..5ece6d6 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1603,6
Re: [net-next PATCH 1/2] igbvf: add new driver to support 82576 virtual functions
Andrew Morton wrote: On Tue, 17 Mar 2009 17:12:48 -0700 Alexander Duyck alexander.h.du...@intel.com wrote: Thanks for all the comments. I tried to incorporate most of them into the igbvf driver and also ended up porting some over to our other drivers, specifically igb since the igbvf driver copies much of the code. I have added my comments inline below. Thanks, Alex Andrew Morton wrote: On Tue, 10 Mar 2009 19:09:28 -0700 Jeff Kirsher jeffrey.t.kirs...@intel.com wrote: From: Alexander Duyck alexander.h.du...@intel.com This adds an igbvf driver to handle virtual functions provided by the igb driver. The drive-by reader is now wondering what a virtual function is. ^^ this comment was missed. I was indirectly asking for an overview (preferably in the changelog) of what the whole patch actually does. Sorry, while I missed the comment in my response I had gotten to addressing it in the next version. I updated it to more thoroughly describe what the VF driver is doing. I also included instructions on how to enable the VFs from the PF so that they can be tested. +static int igbvf_set_ringparam(struct net_device *netdev, + struct ethtool_ringparam *ring) +{ + struct igbvf_adapter *adapter = netdev_priv(netdev); + struct igbvf_ring *tx_ring, *tx_old; + struct igbvf_ring *rx_ring, *rx_old; + int err; + + if ((ring-rx_mini_pending) || (ring-rx_jumbo_pending)) + return -EINVAL; + + while (test_and_set_bit(__IGBVF_RESETTING, adapter-state)) + msleep(1); No timeout needed here? Interrupts might not be working, for example.. This bit isn't set in interrupt context. This is always used out of interrupt context and is just to prevent multiple setting changes at the same time. Oh. Can't use plain old mutex_lock()? We have one or two spots that actually check to see if the bit is set and just report a warning instead of actually waiting on the bit to clear. 
Re: [net-next PATCH 1/2] igbvf: add new driver to support 82576 virtual functions
Thanks for all the comments. I tried to incorporate most of them into the igbvf driver and also ended up porting some over to our other drivers, specifically igb since the igbvf driver copies much of the code. I have added my comments inline below. Thanks, Alex Andrew Morton wrote: On Tue, 10 Mar 2009 19:09:28 -0700 Jeff Kirsher jeffrey.t.kirs...@intel.com wrote: From: Alexander Duyck alexander.h.du...@intel.com This adds an igbvf driver to handle virtual functions provided by the igb driver. The drive-by reader is now wondering what a virtual function is. ... +#define IGBVF_STAT(m, b) sizeof(((struct igbvf_adapter *)0)-m), \ + offsetof(struct igbvf_adapter, m), \ + offsetof(struct igbvf_adapter, b) +static const struct igbvf_stats igbvf_gstrings_stats[] = { + { rx_packets, IGBVF_STAT(stats.gprc, stats.base_gprc) }, + { tx_packets, IGBVF_STAT(stats.gptc, stats.base_gptc) }, + { rx_bytes, IGBVF_STAT(stats.gorc, stats.base_gorc) }, + { tx_bytes, IGBVF_STAT(stats.gotc, stats.base_gotc) }, + { multicast, IGBVF_STAT(stats.mprc, stats.base_mprc) }, + { lbrx_bytes, IGBVF_STAT(stats.gorlbc, stats.base_gorlbc) }, + { lbrx_packets, IGBVF_STAT(stats.gprlbc, stats.base_gprlbc) }, + { tx_restart_queue, IGBVF_STAT(restart_queue, zero_base) }, + { rx_long_byte_count, IGBVF_STAT(stats.gorc, stats.base_gorc) }, + { rx_csum_offload_good, IGBVF_STAT(hw_csum_good, zero_base) }, + { rx_csum_offload_errors, IGBVF_STAT(hw_csum_err, zero_base) }, + { rx_header_split, IGBVF_STAT(rx_hdr_split, zero_base) }, + { alloc_rx_buff_failed, IGBVF_STAT(alloc_rx_buff_failed, zero_base) }, +}; stares at it for a while It would be clearer if `m' and `b' were (much) more meaningful identifiers. I agree, the values have been changed to current and base. +#define IGBVF_GLOBAL_STATS_LEN \ + (sizeof(igbvf_gstrings_stats) / sizeof(struct igbvf_stats)) This is ARRAY_SIZE(). +#define IGBVF_STATS_LEN (IGBVF_GLOBAL_STATS_LEN) Why does this need to exist? 
It doesn't, it has been dropped and references to it replaced with IGBVF_GLOBAL_STATS_LEN. ... +static int igbvf_set_ringparam(struct net_device *netdev, + struct ethtool_ringparam *ring) +{ + struct igbvf_adapter *adapter = netdev_priv(netdev); + struct igbvf_ring *tx_ring, *tx_old; + struct igbvf_ring *rx_ring, *rx_old; + int err; + + if ((ring-rx_mini_pending) || (ring-rx_jumbo_pending)) + return -EINVAL; + + while (test_and_set_bit(__IGBVF_RESETTING, adapter-state)) + msleep(1); No timeout needed here? Interrupts might not be working, for example.. This bit isn't set in interrupt context. This is always used out of interrupt context and is just to prevent multiple setting changes at the same time. + if (netif_running(adapter-netdev)) + igbvf_down(adapter); + + tx_old = adapter-tx_ring; + rx_old = adapter-rx_ring; + + err = -ENOMEM; + tx_ring = kzalloc(sizeof(struct igbvf_ring), GFP_KERNEL); + if (!tx_ring) + goto err_alloc_tx; + /* + * use a memcpy to save any previously configured + * items like napi structs from having to be + * reinitialized + */ + memcpy(tx_ring, tx_old, sizeof(struct igbvf_ring)); + + rx_ring = kzalloc(sizeof(struct igbvf_ring), GFP_KERNEL); + if (!rx_ring) + goto err_alloc_rx; + memcpy(rx_ring, rx_old, sizeof(struct igbvf_ring)); + + adapter-tx_ring = tx_ring; + adapter-rx_ring = rx_ring; + + rx_ring-count = max(ring-rx_pending, (u32)IGBVF_MIN_RXD); + rx_ring-count = min(rx_ring-count, (u32)(IGBVF_MAX_RXD)); + rx_ring-count = ALIGN(rx_ring-count, REQ_RX_DESCRIPTOR_MULTIPLE); + + tx_ring-count = max(ring-tx_pending, (u32)IGBVF_MIN_TXD); + tx_ring-count = min(tx_ring-count, (u32)(IGBVF_MAX_TXD)); + tx_ring-count = ALIGN(tx_ring-count, REQ_TX_DESCRIPTOR_MULTIPLE); + + if (netif_running(adapter-netdev)) { + /* Try to get new resources before deleting old */ + err = igbvf_setup_rx_resources(adapter); + if (err) + goto err_setup_rx; + err = igbvf_setup_tx_resources(adapter); + if (err) + goto err_setup_tx; + + /* + * restore the old in order to 
free it, + * then add in the new + */ + adapter-rx_ring = rx_old; + adapter-tx_ring = tx_old; + igbvf_free_rx_resources(adapter); + igbvf_free_tx_resources(adapter); + kfree(tx_old); + kfree(rx_old); That's odd-looking. Why take a copy of rx_old and tx_old when we're about to free them? This whole section has been redone. The approach here didn't work well and so I redid it to match how we do this in igb. + adapter-rx_ring = rx_ring; + adapter