Re: [Qemu-devel] live migration vs device assignment (motivation)
On 2015-12-30 00:46, Michael S. Tsirkin wrote:
> Interesting. So you are saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?
>
> If so that is wrong - clearly should reconfigure the old IP
> back without playing with dhcp. For testing, just set up
> a static IP.

The MAC and IP are migrated with the VM to the target machine, so there is no need to reconfigure the IP after migration.

From my test results, ixgbevf_down() consumes 35ms and ixgbevf_up() consumes 55ms during migration.

--
Best regards
Tianyu Lan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> As long as you keep up this vague talk about performance during
> migration, without even bothering with any measurements, this patchset
> will keep going nowhere.

I measured network service downtime for "keep device alive" (presented in RFC patch V1) and "put down and up network interface" (presented in RFC patch V2) during migration, with some optimizations. The former is around 140ms and the latter is around 240ms.

My patchset relies on the mailbox irq, which doesn't work in the suspend state, so I can't get downtime numbers for the suspend/resume case yet. Will try to get the result later.

> There's Alex's patch that tracks memory changes during migration. It
> needs some simple enhancements to be useful in production (e.g. add a
> host/guest handshake to both enable tracking in guest and to detect the
> support in host), then it can allow starting migration with an assigned
> device, by invoking hot-unplug after most of memory has been migrated.
> Please implement this in qemu and measure the speed.

Sure. Will do that.

> I will not be surprised if destroying/creating netdev in linux turns
> out to take too long, but before anyone bothers checking, it does not
> make sense to discuss further enhancements.
Re: [Qemu-devel] live migration vs device assignment (motivation)
Merry Christmas. Sorry for the late response due to personal affairs.

On 2015-12-14 03:30, Alexander Duyck wrote:
>> This sounds like we need to add a fake bridge for migration and a
>> driver in the guest for it. It also needs to extend the PCI
>> bus/hotplug driver to pause/resume other devices, right?
>>
>> My concern is still whether we can change PCI bus/hotplug like that
>> without a spec change.
>>
>> IRQ should be general for any device and we may extend it for
>> migration. The device driver also can decide whether to support
>> migration or not.
>
> The device should have no say in the matter. Either we are going to
> migrate or we will not. This is why I have suggested my approach as
> it allows for the least amount of driver intrusion while providing the
> maximum number of ways to still perform migration even if the device
> doesn't support it.

Even if the device driver doesn't support migration, do you still want to migrate the VM? That may be risky, and we should at least add the "bad path" handling in the driver.

> The solution I have proposed is simple:
>
> 1. Extend swiotlb to allow for a page dirtying functionality.
>
> This part is pretty straight forward. I'll submit a few patches
> later today as RFC that can provide the minimal functionality needed
> for this.

That would be much appreciated.

> 2. Provide a vendor specific configuration space option on the QEMU
> implementation of a PCI bridge to act as a bridge between direct
> assigned devices and the host bridge.
>
> My thought was to add some vendor specific block that includes a
> capabilities, status, and control register so you could go through and
> synchronize things like the DMA page dirtying feature. The bridge
> itself could manage the migration capable bit inside QEMU for all
> devices assigned to it. So if you added a VF to the bridge it would
> flag that you can support migration in QEMU, while the bridge would
> indicate you cannot until the DMA page dirtying control bit is set by
> the guest.
>
> We could also go through and optimize the DMA page dirtying after
> this is added so that we can narrow down the scope of use, and as a
> result improve the performance for other devices that don't need to
> support migration. It would then be a matter of adding an interrupt
> in the device to handle an event such as the DMA page dirtying status
> bit being set in the config space status register, while the bit is
> not set in the control register. If it doesn't get set then we would
> have to evict the devices before the warm-up phase of the migration,
> otherwise we can defer it until the end of the warm-up phase.
>
> 3. Extend existing shpc driver to support the optional "pause"
> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> hot-plug specification.

Since your solution adds a fake PCI bridge, why not notify the bridge directly during migration via irq and call the device driver's callbacks in the new bridge driver? The new bridge driver also could check whether the device driver provides migration callbacks and, if so, call them to improve the passthrough device's performance during migration.

> Note I call out "extend" here instead of saying to add this.
> Basically what we should do is provide a means of quiescing the device
> without unloading the driver. This is called out as something the OS
> vendor can optionally implement in the PCI hot-plug specification. On
> OSes that wouldn't support this it would just be treated as a standard
> hot-plug event. We could add a capability, status, and control bit
> in the vendor specific configuration block for this as well; if we set
> the status bit it would indicate the host wants to pause instead of
> remove, and the control bit would indicate the guest supports "pause"
> in the OS. We then could optionally disable guest migration while the
> VF is present and pause is not supported.
>
> To support this we would need to add a timer and if a new device
> is not inserted in some period of time (60 seconds for example), or if
> a different device is inserted, we need to unload the original driver
> from the device. In addition we would need to verify if drivers can
> call the remove function after having called suspend without resume.
> If not, we could look at adding a recovery function to remove the
> driver from the device in the case of a suspend with either a failed
> resume or no resume call. Once again it would probably be useful to
> have for those cases where power management suspend/resume runs into
> an issue like somebody causing a surprise removal while a device was
> suspended.

--
Best regards
Tianyu Lan
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/11/2015 1:16 AM, Alexander Duyck wrote:
> On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu@intel.com> wrote:
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>> Ideally, it is able to leave the guest driver unmodified, but it
>>>> requires the hypervisor or qemu to be aware of the device, which
>>>> means we may need a driver in the hypervisor or qemu to handle the
>>>> device on behalf of the guest driver.
>>> Can you answer the question of when do you use your code - at the
>>> start of migration or just before the end?
>>
>> Just before stopping the VCPU in this version; we inject the VF
>> mailbox irq to notify the driver if the irq handler is installed. The
>> Qemu side also checks this via the fake PCI migration capability, and
>> the driver sets the status during its open() or resume() callback.
>
> The VF mailbox interrupt is a very bad idea. Really the device should
> be in a reset state on the other side of a migration. It doesn't make
> sense to have the interrupt firing if the device is not configured.
> This is one of the things that is preventing you from being able to
> migrate the device while the interface is administratively down or the
> VF driver is not loaded.

In my opinion, if the VF driver is not loaded and the hardware hasn't started to work, the device state doesn't need to be migrated. We may add a flag for the driver to check whether a migration happened while the device was down, and reinitialize the hardware and clear the flag when the system tries to bring it up.

We may add a migration core to the Linux kernel and provide some helper functions to make it easy to add migration support to drivers. The migration core would be in charge of syncing status with Qemu. For example:

migration_register()
    The driver provides:
    - Callbacks to be called before and after migration, or for the bad path
    - The irq it prefers to use for handling migration events

migration_event_check()
    The driver calls this in its irq handler. The migration core checks
    the migration status and calls the driver's callbacks when migration
    happens.
> My thought on all this is that it might make sense to move this
> functionality into a PCI-to-PCI bridge device and make it a
> requirement that all direct-assigned devices have to exist behind that
> device in order to support migration. That way you would be working
> with a directly emulated device that would likely already be
> supporting hot-plug anyway. Then it would just be a matter of coming
> up with a few Qemu specific extensions that you would need to add to
> the device itself. The same approach would likely be portable enough
> that you could achieve it with PCIe as well via the same configuration
> space being present on the upstream side of a PCIe port or maybe a
> PCIe switch of some sort.
>
> It would then be possible to signal via your vendor-specific PCI
> capability on that device that all devices behind this bridge require
> DMA page dirtying, you could use the configuration in addition to the
> interrupt already provided for hot-plug to signal things like when you
> are starting migration, and possibly even just extend the shpc
> functionality so that if this capability is present you have the
> option to pause/resume instead of remove/probe the device in the case
> of certain hot-plug events. The fact is there may be some use for a
> pause/resume type approach for PCIe hot-plug in the near future
> anyway. From the sounds of it Apple has required it for all
> Thunderbolt device drivers so that they can halt the device in order
> to shuffle resources around, perhaps we should look at something
> similar for Linux.
>
> The other advantage behind grouping functions on one bridge is things
> like reset domains. The PCI error handling logic will want to be able
> to reset any devices that experienced an error in the event of
> something such as a surprise removal. By grouping all of the devices
> you could disable/reset/enable them as one logical group in the event
> of something such as the "bad path" approach Michael has mentioned.
This sounds like we need to add a fake bridge for migration and a driver in the guest for it. It also needs to extend the PCI bus/hotplug driver to pause/resume other devices, right?

My concern is still whether we can change PCI bus/hotplug like that without a spec change.

IRQ should be general for any device and we may extend it for migration. The device driver also can decide whether to support migration or not.

> It would be great if we could avoid changing the guest; but at least
> your guest driver changes don't actually seem to be that hardware
> specific; could your changes actually be moved to the generic PCI
> level so they could be made to work for lots of drivers?

It is impossible to use one common solution for all devices unless the PCIe spec documents it clearly, and I think one day it will be there. But before that, we need some workarounds in the guest driver to make it work, even if they look ugly. Yes, so far there is no hardware migration support, and it's hard to modify bus-level code. It also will block implementation on Windows.
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>> Ideally, it is able to leave the guest driver unmodified, but it
>>>> requires the hypervisor or qemu to be aware of the device, which
>>>> means we may need a driver in the hypervisor or qemu to handle the
>>>> device on behalf of the guest driver.
>>> Can you answer the question of when do you use your code - at the
>>> start of migration or just before the end?
>>
>> Just before stopping the VCPU in this version; we inject the VF
>> mailbox irq to notify the driver if the irq handler is installed. The
>> Qemu side also checks this via the fake PCI migration capability, and
>> the driver sets the status during its open() or resume() callback.
>
> Right, this is the "good path" optimization. Whether this buys
> anything as compared to just sending reset to the device when VCPU is
> stopped needs to be measured. In any case, we probably do need a way
> to interrupt driver on destination to make it reconfigure the device -
> otherwise it might take seconds for it to notice. And a way to make
> sure driver can handle this surprise reset so we can block migration
> if it can't.

Yes, we need such a way to notify the driver about migration status and do the reset or restore operation on the destination machine. My original design is to take advantage of the device's irq to do that. The driver can tell Qemu which irq it prefers to handle such tasks and whether that irq is enabled or bound to a handler. We may discuss the details in the other thread.

>>> It would be great if we could avoid changing the guest; but at least
>>> your guest driver changes don't actually seem to be that hardware
>>> specific; could your changes actually be moved to the generic PCI
>>> level so they could be made to work for lots of drivers?
>>
>> It is impossible to use one common solution for all devices unless
>> the PCIe spec documents it clearly, and I think one day it will be
>> there. But before that, we need some workarounds in the guest driver
>> to make it work, even if they look ugly. Yes, so far there is no
>> hardware migration support
>
> VT-D supports setting dirty bit in the PTE in hardware.

Actually, the current hardware doesn't support this. The VT-d spec documents the dirty bit for first-level translation, which requires devices to issue DMA requests with a PASID (process address space identifier). Most devices don't support that feature.

>> and it's hard to modify bus level code.
>
> Why is it hard?

As Yang said, the concern is that the PCI spec doesn't document how to do migration.

>> It also will block implementation on the Windows.
>
> Implementation of what? We are discussing motivation here, not
> implementation. E.g. windows drivers typically support surprise
> removal, should you use that, you get some working code for free.
> Just stop worrying about it. Make it work, worry about closed source
> software later.
Re: live migration vs device assignment (motivation)
On 12/10/2015 4:38 PM, Michael S. Tsirkin wrote:
> Let's assume you do save state and do have a way to detect whether
> state matches a given hardware. For example, driver could store
> firmware and hardware versions in the state, and then on destination,
> retrieve them and compare. It will be pretty common that you have a
> mismatch, and you must not just fail migration. You need a way to
> recover, maybe with more downtime.
>
> Second, you can change the driver but you can not be sure it will have
> the chance to run at all. Host overload is a common reason to migrate
> out of the host. You also can not trust guest to do the right thing.
> So how long do you want to wait until you decide guest is not
> cooperating and kill it? Most people will probably experiment a bit
> and then add a bit of a buffer. This is not robust at all.
>
> Again, maybe you ask driver to save state, and if it does not respond
> for a while, then you still migrate, and driver has to recover on
> destination.
>
> With the above in mind, you need to support two paths:
> 1. "good path": driver stores state on source, checks it on
>    destination, detects a match and restores state into the device
> 2. "bad path": driver does not store state, or detects a mismatch on
>    destination. driver has to assume device was lost, and reset it
>
> So what I am saying is, implement bad path first. Then good path is
> an optimization - measure whether it's faster, and by how much.

These sound reasonable. The driver should be able to do such checks to ensure hardware and firmware coherence after migration, and to reset the device when migration happens at some unexpected point.

> Also, it would be nice if on the bad path there was a way to switch to
> another driver entirely, even if that means a bit more downtime. For
> example, have a way for driver to tell Linux it has to re-do probing
> for the device.

Just glanced at the device core code: device_reprobe() does what you said.
/**
 * device_reprobe - remove driver for a device and probe for a new driver
 * @dev: the device to reprobe
 *
 * This function detaches the attached driver (if any) for the given
 * device and restarts the driver probing process. It is intended
 * to use if probing criteria changed during a devices lifetime and
 * driver attachment should change accordingly.
 */
int device_reprobe(struct device *dev)
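On the bad path, the fallback Michael describes could look something like the kernel-style sketch below. This is pseudocode under stated assumptions: device_reprobe() is the real driver-core function quoted above, but the wrapper, state_matches_hw(), restore_device_state(), and struct saved_state are all hypothetical names invented to show the shape of the idea.

```c
/* Hypothetical bad-path handler: if the saved state doesn't match the
 * destination hardware, fall back to a full re-probe of the device. */
static int migration_restore_or_reprobe(struct device *dev,
                                        const struct saved_state *state)
{
        if (state && state_matches_hw(dev, state))
                return restore_device_state(dev, state);  /* good path */

        /* Bad path: treat the device as lost and let the driver core
         * detach the old driver and restart probing from scratch. */
        return device_reprobe(dev);
}
```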
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>> Ideally, it is able to leave the guest driver unmodified, but it
>> requires the hypervisor or qemu to be aware of the device, which
>> means we may need a driver in the hypervisor or qemu to handle the
>> device on behalf of the guest driver.
>
> Can you answer the question of when do you use your code - at the
> start of migration or just before the end?

Just before stopping the VCPU in this version; we inject the VF mailbox irq to notify the driver if the irq handler is installed. The Qemu side also checks this via the fake PCI migration capability, and the driver sets the status during its open() or resume() callback.

> It would be great if we could avoid changing the guest; but at least
> your guest driver changes don't actually seem to be that hardware
> specific; could your changes actually be moved to the generic PCI
> level so they could be made to work for lots of drivers?

It is impossible to use one common solution for all devices unless the PCIe spec documents it clearly, and I think one day it will be there. But before that, we need some workarounds in the guest driver to make it work, even if they look ugly. Yes, so far there is no hardware migration support, and it's hard to modify bus-level code. It also will block implementation on Windows.
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/8/2015 1:12 AM, Alexander Duyck wrote:
> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu <tianyu@intel.com> wrote:
>> On 12/5/2015 1:07 AM, Alexander Duyck wrote:
>> We still need to support Windows guest for migration and this is why
>> our patches keep all changes in the driver since it's impossible to
>> change Windows kernel.
>
> That is a poor argument. I highly doubt Microsoft is interested in
> having to modify all of the drivers that will support direct
> assignment in order to support migration. They would likely request
> something similar to what I have in that they will want a way to do
> DMA tracking with minimal modification required to the drivers.

>> This totally depends on the NIC or other devices' vendors, and they
>> should decide whether to support migration or not. If yes, they would
>> modify the driver.
>
> Having to modify every driver that wants to support live migration is
> a bit much. In addition I don't see this being limited only to NIC
> devices. You can direct assign a number of different devices, your
> solution cannot be specific to NICs.

We are also adding such migration support for the QAT device, so our solution will not be limited to NICs. This is just the beginning.

We can't limit users to only using Linux guests, so the migration feature should work for both Windows and Linux guests.

If we just call suspend/resume during migration, the feature will be meaningless. Most use cases don't want the user to be affected much during migration, so the service downtime is vital. Our target is to apply SRIOV NIC passthrough to cloud service and NFV (network functions virtualization) projects, which are sensitive to network performance and stability.

In my opinion, we should give the device driver a chance to implement its own migration job, and only call the suspend and resume callbacks in the driver if it doesn't care about performance during migration.

> The suspend/resume callback should be efficient in terms of time.
> After all we don't want the system to stall for a long period of time
> when it should be either running or asleep. Having it burn cycles in
> a power state limbo doesn't do anyone any good. If nothing else maybe
> it will help to push the vendors to speed up those functions which
> then benefit migration and the system sleep states.

If we can benefit both migration and suspend, that would be wonderful. But migration and system PM are still different. For example, the driver doesn't need to put the device into a deep D-state during migration; the host can do this after migration, while it's essential for system sleep. PCI config space and interrupt configuration are emulated by Qemu, which can migrate these configurations to the new machine, so the driver doesn't need to deal with them. So I think migration still needs a different callback or a different code path than device suspend/resume.

Another concern is that we would have to rework the PM core or PCI bus driver to call suspend/resume for passthrough devices during migration. This also blocks the feature on Windows.

> Also you keep assuming you can keep the device running while you do
> the migration and you can't. You are going to corrupt the memory if
> you do, and you have yet to provide any means to explain how you are
> going to solve that.

The main problem is the DMA tracking issue, and I will repost my solution in a new thread for discussion. If there is no way to mark a DMA page dirty while DMA is enabled, we have to stop DMA for a short time to do that at the last stage.

Following is my idea to do DMA tracking: inject an event to the VF driver after the memory iterate stage and before stopping the VCPU, and then the VF driver marks dirty all in-use DMA memory. Newly allocated pages also need to be marked dirty before stopping the VCPU. All memory dirtied in this time slot will be migrated until the stop-and-copy stage. We also need to make sure to disable the VF, by clearing its bus master enable bit, before migrating this memory.

> The ordering of your explanation here doesn't quite work. What needs
> to happen is that you have to disable DMA and then mark the pages as
> dirty. What the disabling of the BME does is signal to the hypervisor
> that the device is now stopped. The ixgbevf_suspend call already
> supported by the driver is almost exactly what is needed to take care
> of something like this.

This is why I hope to reserve a piece of space in the DMA page to do a dummy write. This can help mark the page dirty while not requiring DMA to stop and not racing with the DMA data.

> You can't and it will still race. What concerns me is that your
> patches and the document you referenced earlier show a considerable
> lack of understanding about how DMA and device drivers work. There is
> a reason why device drivers have so many memory barriers and the like
> in them. The fact is when you have CPU and a device both accessing
> memory things have to be done in a very specific order and you cannot
> violate that. If you have a contiguous block of memory you expect the
> device to write into you cannot just poke
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote:
> On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:
>> Hi Michael & Alexander:
>> Thanks a lot for your comments and suggestions.
>
> It's nice that it's appreciated, but you then go on and ignore all
> that I have written here:
> https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html

No, I will reply to it separately and, according to your suggestion, split it into 3 threads.

>> We still need to support Windows guest for migration and this is why
>> our patches keep all changes in the driver since it's impossible to
>> change Windows kernel.
>
> This is not a reasonable argument. It makes no sense to duplicate
> code on Linux because you must duplicate code on Windows. Let's
> assume you must do it in the driver on windows because windows has
> closed source drivers. What does it matter? Linux can still do it as
> part of DMA API and have it apply to all drivers.

Sure. Duplicated code should be encapsulated and made reusable by other drivers, just like the dummy write part you mentioned.

I meant that the framework should not require changing Windows kernel code (such as the PM core or PCI bus driver), since that would block implementation on Windows. I think it's not a problem to duplicate code in the Windows drivers.

>> Following is my idea to do DMA tracking.
>>
>> Inject an event to the VF driver after the memory iterate stage and
>> before stopping the VCPU, and then the VF driver marks dirty all
>> in-use DMA memory. Newly allocated pages also need to be marked
>> dirty before stopping the VCPU. All memory dirtied in this time slot
>> will be migrated until the stop-and-copy stage. We also need to make
>> sure to disable the VF, by clearing its bus master enable bit, before
>> migrating this memory. The DMA pages allocated by the VF driver also
>> need reserved space for the dummy write.
>
> I suggested ways to do it all in the hypervisor without driver hacks,
> or hide it within DMA API without need to reserve extra space. Both
> approaches seem much cleaner.

This sounds reasonable. We may discuss the details in a separate thread.
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/9/2015 7:28 PM, Michael S. Tsirkin wrote:
> I remember reading that it's possible to implement a bus driver on
> windows if required. But basically I don't see how windows can be
> relevant to discussing guest driver patches. That discussion probably
> belongs on the qemu mailing list, not on lkml.

I am not sure whether we can write a bus driver for Windows to support migration. But I think device vendors who want to support migration will improve their drivers if we provide such a framework in the hypervisor that only requires driver changes.
Re: live migration vs device assignment (motivation)
On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> I thought about what this is doing at the high level, and I do have
> some value in what you are trying to do, but I also think we need to
> clarify the motivation a bit more. What you are saying is not really
> what the patches are doing. And with that clearer understanding of
> the motivation in mind (assuming it actually captures a real need), I
> would also like to suggest some changes.

Motivation: Most current solutions for migration with a passthrough device are based on PCI hotplug, but that has side effects and can't work for all devices.

For NIC devices: the PCI hotplug solution can work around network device migration by switching between VF and PF. But switching network interfaces will introduce service downtime. I tested the service downtime by putting the VF and PV interfaces into a bonded interface and pinging the bonded interface during VF plug and unplug:
1) About 100ms when adding the VF
2) About 30ms when removing the VF

It also requires the guest to do switch configuration. These are hard to manage and deploy for our customers. To maintain PV performance during migration, the host side also needs to assign a VF to the PV device, which affects scalability. These factors block SRIOV NIC passthrough usage in cloud services and OPNFV, which require high network performance and stability.

For other kinds of devices, it's hard to make this work at all. We are also adding migration support for the QAT (QuickAssist Technology) device.

QAT device use case introduction: Server, networking, big data, and storage applications use QuickAssist Technology to offload servers from handling compute-intensive operations, such as:
1) Symmetric cryptography functions including cipher operations and authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration, and these operations will fail when the device is unplugged.

So we are trying to implement a new solution that really migrates device state to the target machine, with low service downtime and without affecting the user during migration.
Re: live migration vs device assignment (motivation)
On 12/10/2015 1:14 AM, Alexander Duyck wrote:
> On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu <tianyu@intel.com> wrote:
>> For other kinds of devices, it's hard to make this work at all. We
>> are also adding migration support for the QAT (QuickAssist
>> Technology) device.
>>
>> QAT device use case introduction: Server, networking, big data, and
>> storage applications use QuickAssist Technology to offload servers
>> from handling compute-intensive operations, such as:
>> 1) Symmetric cryptography functions including cipher operations and
>> authentication operations
>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic
>> curve cryptography
>> 3) Compression and decompression functions including DEFLATE and LZS
>>
>> PCI hotplug will not work for such devices during migration, and
>> these operations will fail when the device is unplugged.
>
> I assume the problem is that with a PCI hotplug event you are losing
> the state information for the device, do I have that right?
>
> Looking over the QAT drivers it doesn't seem like any of them support
> the suspend/resume PM calls. I would imagine it makes it difficult
> for a system with a QAT card in it to be able to drop the system to a
> low power state. You might want to try enabling suspend/resume
> support for the devices on bare metal before you attempt to take on
> migration as it would provide you with a good testing framework to
> see what needs to be saved/restored within the device and in what
> order before you attempt to do the same while migrating from one
> system to another.

Sure. The suspend/resume work is under way. Actually, we have already made QAT work for migration internally; we are doing more testing and fixing bugs.
Re: live migration vs device assignment (motivation)
On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
>> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>>> I thought about what this is doing at the high level, and I do have
>>> some value in what you are trying to do, but I also think we need
>>> to clarify the motivation a bit more. What you are saying is not
>>> really what the patches are doing. And with that clearer
>>> understanding of the motivation in mind (assuming it actually
>>> captures a real need), I would also like to suggest some changes.
>>
>> Motivation: Most current solutions for migration with a passthrough
>> device are based on PCI hotplug, but that has side effects and can't
>> work for all devices.
>>
>> For NIC devices: the PCI hotplug solution can work around network
>> device migration by switching between VF and PF.
>
> This is just more confusion. hotplug is just a way to add and remove
> devices. switching VF and PF is up to guest and hypervisor.

It's a combination. Because the current world can't migrate device state during migration (which is what we are doing), existing solutions for migrating a VM with a passthrough NIC rely on PCI hotplug: unplug the VF before starting migration, then switch the network from the VF NIC to a PV NIC to maintain the network connection. Plug the VF back in after migration, then switch from PV back to VF. The bonding driver provides a way to switch between PV and VF NICs automatically with the same IP and MAC, so the bonding driver is preferred.

>> But switching network interfaces will introduce service downtime. I
>> tested the service downtime by putting the VF and PV interfaces into
>> a bonded interface and pinging the bonded interface during VF plug
>> and unplug:
>> 1) About 100ms when adding the VF
>> 2) About 30ms when removing the VF
>
> OK and what's the source of the downtime? I'm guessing that's just
> arp being repopulated. So simply save and re-populate it. There
> would be a much cleaner solution. Or maybe there's a timer there that
> just delays hotplug for no reason. Fix it, everyone will benefit.
It also requires the guest to do switch configuration. That's just wrong. If you want a switch, you need to configure a switch. I meant the configuration of the switching operation between PV and VF. These are hard for our customers to manage and deploy. So the kernel wants to remain flexible, and the stack is configurable. Downside: customers need to deploy userspace to configure it. Your solution: a hard-coded configuration within kernel and hypervisor. Sorry, this makes no sense. If the kernel is easier for you to deploy than userspace, you need to rethink your deployment strategy. This is one factor. To maintain PV performance during migration, the host side also needs to assign a VF to the PV device. This affects scalability. No idea what this means. These factors block SR-IOV NIC passthrough usage in cloud services and OPNFV, which require high network performance and stability. Everyone needs performance and scalability. For other kinds of devices, hotplug is hard to make work at all. We are also adding migration support for the QAT (QuickAssist Technology) device. QAT use-case introduction: server, networking, big data, and storage applications use QuickAssist Technology to offload servers from handling compute-intensive operations, such as: 1) Symmetric cryptography functions, including cipher operations and authentication operations 2) Public key functions, including RSA, Diffie-Hellman, and elliptic curve cryptography 3) Compression and decompression functions, including DEFLATE and LZS. PCI hotplug will not work for such devices during migration, and these operations will fail when the device is unplugged. So we are trying to implement a new solution that really migrates device state to the target machine, with low service downtime and without affecting the user during migration. Let's assume for the sake of the argument that there's a lot going on and removing the device is just too slow (though you should figure out what's going on before giving up and just building something new from scratch). 
No, we can find a PV NIC as backup for the VF NIC during migration, but that doesn't work for other kinds of devices since there is no backup for them. E.g., when migration happens while a user is compressing files via QAT, it's impossible to remove the QAT device at that point. If we do that, the compress operation will fail and affect the user experience. I still don't think you should be migrating state. That's just too fragile, and it also means you depend on the driver to be nice and shut down the device on the source, so you can not migrate at will. Instead, reset the device on the destination and re-initialize it. Yes, saving and restoring device state relies on the driver, so we rework the driver and make it more migration-friendly. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
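The "save and re-populate arp" suggestion in the thread above is usually done by broadcasting a gratuitous ARP after the MAC/IP have moved. As a hedged, user-space sketch (not part of the patches under discussion), the following only builds the 42-byte Ethernet + ARP request frame with sender IP equal to target IP; actually transmitting it (raw socket, arping, etc.) is left out, and all names here are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GARP_FRAME_LEN 42

/* Build a gratuitous ARP request: broadcast destination, ARP opcode 1,
 * sender IP == target IP so peers refresh their ARP cache entry. */
void build_gratuitous_arp(uint8_t *f, const uint8_t mac[6],
                          const uint8_t ip[4])
{
    memset(f, 0xff, 6);             /* Ethernet dst: broadcast */
    memcpy(f + 6, mac, 6);          /* Ethernet src */
    f[12] = 0x08; f[13] = 0x06;     /* EtherType: ARP */
    f[14] = 0x00; f[15] = 0x01;     /* htype: Ethernet */
    f[16] = 0x08; f[17] = 0x00;     /* ptype: IPv4 */
    f[18] = 6;    f[19] = 4;        /* hlen, plen */
    f[20] = 0x00; f[21] = 0x01;     /* opcode: request */
    memcpy(f + 22, mac, 6);         /* sender MAC */
    memcpy(f + 28, ip, 4);          /* sender IP */
    memset(f + 32, 0x00, 6);        /* target MAC: unknown */
    memcpy(f + 38, ip, 4);          /* target IP == sender IP (gratuitous) */
}
```

This is one concrete way the downtime attributed to ARP repopulation could be shortened without touching dhcp at all.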
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/5/2015 1:07 AM, Alexander Duyck wrote: We still need to support Windows guests for migration, and this is why our patches keep all changes in the driver, since it's impossible to change the Windows kernel. That is a poor argument. I highly doubt Microsoft is interested in having to modify all of the drivers that will support direct assignment in order to support migration. They would likely request something similar to what I have, in that they will want a way to do DMA tracking with minimal modification required to the drivers. This totally depends on the NIC or other device vendors, and they should make the decision whether to support migration or not. If yes, they would modify the driver. If the target is just to call suspend/resume during migration, the feature will be meaningless. In most cases we don't want to affect the user much during migration, so the service downtime is vital. Our target is to apply SR-IOV NIC passthrough to cloud service and NFV (network functions virtualization) projects, which are sensitive to network performance and stability. In my opinion, we should give the device driver a chance to implement its own migration job, and call the suspend and resume callbacks in the driver if it doesn't care about performance during migration. Following is my idea for DMA tracking: inject an event to the VF driver after the memory iteration stage and before stopping the VCPU, and then have the VF driver mark dirty all memory it is using for DMA. Newly allocated pages also need to be marked dirty before stopping the VCPU. All memory dirtied in this time slot will be migrated in the stop-and-copy stage. We also need to make sure to disable the VF, via clearing the bus master enable bit of the VF, before migrating this memory. The ordering of your explanation here doesn't quite work. What needs to happen is that you have to disable DMA and then mark the pages as dirty. What the disabling of the BME does is signal to the hypervisor that the device is now stopped. 
The ixgbevf_suspend call already supported by the driver is almost exactly what is needed to take care of something like this. This is why I hope to reserve a piece of space in the DMA page to do a dummy write. This can help to mark the page dirty while not requiring DMA to stop and not racing with the DMA data. If we can't do that, we have to stop DMA for a short time to mark all DMA pages dirty and then re-enable it. I am not sure how much we gain by tracking all DMA memory this way with the device running during migration. I need to do some tests and compare the results with stopping DMA directly at the last stage of migration. The question is how we would go about triggering it. I really don't think the PCI configuration space approach is the right idea. I wonder if we couldn't get away with some sort of ACPI event instead. We already require ACPI support in order to shut down the system gracefully; I wonder if we couldn't get away with something similar in order to suspend/resume the direct assigned devices gracefully. I don't think there are such events in the current spec. Otherwise, there are two kinds of suspend/resume callbacks: 1) system suspend/resume, called during S2RAM and S2DISK; 2) runtime suspend/resume, called by the PM core when the device is idle. If you want to do what you mentioned, you have to change the PM core and the ACPI spec. The DMA page allocated by the VF driver also needs to reserve space to do the dummy write. No, this will not work. If for example you have a VF driver allocating memory for a 9K receive, how will that work? It isn't as if you can poke a hole in the contiguous memory. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
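The ordering Alexander insists on above (stop device DMA first, only then mark pages dirty) can be modeled outside the kernel. This is a minimal user-space sketch under stated assumptions: all names (`dma_mapping`, `migration_quiesce_and_log`, the bitmap helpers) are hypothetical, not real kernel or ixgbevf symbols, and clearing bus master enable is reduced to flipping a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define MAX_PFN    1024

struct dma_mapping {
    uint64_t addr;   /* guest-physical address of the buffer */
    size_t   len;    /* buffer length in bytes */
};

static unsigned char dirty_bitmap[MAX_PFN / 8];
static bool bus_master_enabled = true;

bool pfn_is_dirty(uint64_t pfn)
{
    return dirty_bitmap[pfn / 8] & (1u << (pfn % 8));
}

bool bus_master_is_enabled(void)
{
    return bus_master_enabled;
}

/* Model of clearing the bus master enable bit: after this the device is
 * guaranteed to issue no further DMA writes. */
static void clear_bus_master(void)
{
    bus_master_enabled = false;
}

/* Called at the pre-stop migration event: stop DMA first, then log every
 * page each live mapping touches, including a partial last page. */
void migration_quiesce_and_log(const struct dma_mapping *maps, size_t n)
{
    clear_bus_master();
    for (size_t i = 0; i < n; i++) {
        uint64_t first = maps[i].addr >> PAGE_SHIFT;
        uint64_t last  = (maps[i].addr + maps[i].len - 1) >> PAGE_SHIFT;
        for (uint64_t pfn = first; pfn <= last; pfn++)
            dirty_bitmap[pfn / 8] |= 1u << (pfn % 8);
    }
}
```

The point of the ordering is visible in the code: no page can be dirtied by the device after `clear_bus_master()` returns, so the bitmap walk cannot miss a late DMA write.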
Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
On 12/4/2015 4:05 PM, Michael S. Tsirkin wrote: I haven't read it, but I would like to note you can't rely on research papers. If you propose a patch to be merged you need to measure what is its actual effect on modern linux at the end of 2015. Sure. Will do that.
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
Hi Michael & Alexander: Thanks a lot for your comments and suggestions. We still need to support Windows guests for migration, and this is why our patches keep all changes in the driver, since it's impossible to change the Windows kernel. Following is my idea for DMA tracking: inject an event to the VF driver after the memory iteration stage and before stopping the VCPU, and then have the VF driver mark dirty all memory it is using for DMA. Newly allocated pages also need to be marked dirty before stopping the VCPU. All memory dirtied in this time slot will be migrated in the stop-and-copy stage. We also need to make sure to disable the VF, via clearing the bus master enable bit of the VF, before migrating this memory. The DMA page allocated by the VF driver also needs to reserve space to do a dummy write. On 12/2/2015 7:44 PM, Michael S. Tsirkin wrote: On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote: On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin wrote: On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote: On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin wrote: There are several components to this: - dma_map_* needs to prevent the page from being migrated while the device is running. For example, expose some kind of bitmap from guest to host, and set a bit there while the page is mapped. What happens if we stop the guest and some bits are still set? See dma_alloc_coherent below for some ideas. Yeah, I could see something like this working. Maybe we could do something like what was done for the NX bit and make use of the upper order bits beyond the limits of the memory range to mark pages as non-migratable? I'm curious. What we have with a DMA mapped region is essentially shared memory between the guest and the device. How would we resolve something like this with IVSHMEM, or are we blocked there as well in terms of migration? I have some ideas. Will post later. I look forward to it. - dma_unmap_* needs to mark the page as dirty. This can be done by writing into the page. 
- dma_sync_* needs to mark the page as dirty. This is trickier as we can not change the data. One solution is using atomics. For example: int x = ACCESS_ONCE(*p); cmpxchg(p, x, x); seems to do a write without changing the page contents. Like I said, we can probably kill two birds with one stone by just implementing our own dma_mark_clean() for x86 virtualized environments. I'd say we could take your solution one step further and just use 0 instead of bothering to read the value. After all, it won't write the area if the value at the offset is not 0. Really almost any atomic that has no side effect will do: atomic or with 0, atomic and with all-ones. It's just that cmpxchg already happens to have a portable wrapper. I was originally thinking maybe an atomic_add with 0 would be the way to go. cmpxchg with any value too. Either way though we still are using a locked prefix and having to dirty a cache line per page, which is going to come at some cost. I agree. It's likely not necessary for everyone to be doing this: only people that both run within the VM and want migration to work need to do this logging. So set some module option to have the driver tell the hypervisor that it supports logging. If bus mastering is enabled before this, migration is blocked. Or even pass some flag from the hypervisor so the driver can detect it needs to log writes. I guess this could be put in device config somewhere, though in practice it's a global thing, not a per-device one, so maybe we need some new channel to pass this flag to the guest. CPUID? Or maybe we can put some kind of agent in the initrd and use the existing guest agent channel after all. An agent in the initrd could open up a lot of new possibilities. - dma_alloc_coherent memory (e.g. device rings) must be migrated after the device has stopped modifying it. Just stopping the VCPU is not enough: you must make sure the device is not changing it. Or maybe the device has some kind of ring flush operation, if there was a reasonably portable way to do this (e.g. 
a flush capability could maybe be added to SR-IOV) then the hypervisor could do this. This is where things start to get messy. I was suggesting the suspend/resume to resolve this bit, but it might be possible to also deal with this by clearing the bus master enable bit for the VF. If I am not mistaken, that should disable MSI-X interrupts and halt any DMA. That should work as long as you have some mechanism that is tracking the pages in use for DMA. A bigger issue is recovering afterwards. Agreed. In case you need to resume on the source, you really need to follow the same path as on the destination, preferably detecting the device reset and restoring the device state. The problem with detecting the reset is that you would likely have to be polling to do something like that. We could send some event to the guest to notify it about this through a new or
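The ACCESS_ONCE/cmpxchg trick quoted in this exchange can be demonstrated in user space with C11 atomics standing in for the kernel helpers. This is only an illustration of the idea, not the kernel implementation: a compare-and-swap of a value with itself performs a store (so the CPU dirties the cache line and, under dirty-page logging, the page) without changing the data.

```c
#include <assert.h>
#include <stdatomic.h>

/* Write *p back to itself atomically. If *p still equals the value just
 * read, the CAS succeeds and stores the same value: a write with no
 * visible data change. If *p changed concurrently, the CAS fails, but the
 * concurrent writer already dirtied the page for us. */
void dma_mark_dirty_word(_Atomic int *p)
{
    int x = atomic_load(p);                  /* kernel: ACCESS_ONCE(*p) */
    atomic_compare_exchange_strong(p, &x, x);/* kernel: cmpxchg(p, x, x) */
}
```

As the thread notes, each such store takes a locked operation and dirties a cache line per page, which is the cost being weighed.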
Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote: >We hope >to find a better way to make SRIOV NIC work in these cases and this is >worth to do since SRIOV NIC provides better network performance compared >with PV NIC. If this is a performance optimization as the above implies, you need to include some numbers, and document how you implemented the switch and how you measured the performance. OK. Some ideas in my patches come from the paper "CompSC: Live Migration with Pass-through Devices". http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf It compares performance data between the solution of switching PV and VF and VF migration. (Chapter 7: Discussion) >Current patches have some issues. I think we can find >solutions for them and improve them step by step.
Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
On 12/3/2015 6:25 AM, Alex Williamson wrote: On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote: This patch is to add SRIOV VF migration support. Create a new device type "vfio-sriov" and add a faked PCI migration capability to that device type. The purpose of the new capability is to 1) sync migration status with the VF driver in the VM 2) get the mailbox irq vector to notify the VF driver during migration 3) provide a way to control whether to inject the irq or not. Qemu will migrate PCI config space regs and MSIX config for the VF. Inject the mailbox irq at the last stage of migration to notify the VF about the migration event and wait for the VF driver to be ready for migration. The VF driver writes the PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table to tell Qemu. What makes this sr-iov specific? Why wouldn't we simply extend vfio-pci with a migration=on feature? Thanks, Alex Sounds reasonable and will update it.
Re: [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability
On 12/3/2015 6:25 AM, Alex Williamson wrote: This will of course break if the PCI SIG defines that capability index. Couldn't this be done within a vendor defined capability? Thanks, Yes, it should work, and thanks for the suggestion.
Re: [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
On 12/3/2015 6:25 AM, Alex Williamson wrote: I didn't see a matching kernel patch series for this, but why is the kernel more capable of doing this than userspace is already? The following link is the kernel patch: http://marc.info/?l=kvm=144837328920989=2 These seem like pointless ioctls; we're creating a purely virtual PCI capability, the kernel doesn't really need to participate in that. The VFIO kernel driver has pci_config_map, which records each PCI capability's position and length and so helps to find free PCI config regs. The Qemu side doesn't have such info and can't get the exact table size of a PCI capability. If we want to add such support in Qemu, it needs to duplicate a lot of code from vfio_pci_config.c. Also, why are we restricting ourselves to standard capabilities? This version is to check whether we're on the right track; we can extend this to PCI extended capabilities later. That's often a crowded space and we can't always know whether an area is free or not based only on it being covered by a capability. Some capabilities can also appear more than once, so there's context that isn't being passed to the kernel here. Thanks, The regions outside of PCI capabilities are not passed to the kernel or used by Qemu for MSI/MSIX. It's possible to use those places for the new capability. One concern is that the guest driver may abuse them, so a quirk masking some special regs outside of the capabilities may be helpful.
Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote: But it requires the guest OS to do specific configuration inside and rely on the bonding driver, which blocks it from working on Windows. From the performance side, putting the VF and virtio NIC under a bonded interface affects their performance even when not doing migration. These factors block using VF NIC passthrough in some use cases (especially in the cloud) which require migration. That's really up to the guest. You don't need to do bonding, you can just move the IP and mac from userspace; that's possible on most OS-es. Or write something in the guest kernel that is more lightweight if you are so inclined. What we are discussing here is the host-guest interface, not the in-guest interface. The solution we propose changes the NIC driver and Qemu. The guest OS doesn't need to do anything special for migration, so it's easy to deploy. Except of course these patches don't even work properly yet. And when they do, even minor changes in host side NIC hardware across migration will break guests in hard to predict ways. Switching between the PV and VF NIC introduces a network stop, and the latency of hotplugging the VF is measurable. For some use cases (cloud service and OPNFV) which are sensitive to network stability and performance, these are not friendly and block SR-IOV NIC usage in those cases. We hope to find a better way to make the SR-IOV NIC work in these cases, and this is worth doing since the SR-IOV NIC provides better network performance than a PV NIC. The current patches have some issues. I think we can find solutions for them and improve them step by step.
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/1/2015 12:07 AM, Alexander Duyck wrote: They can only be corrected if the underlying assumptions are correct, and they aren't. Your solution would have never worked correctly. The problem is you assume you can keep the device running while you are migrating, and you simply cannot. At some point you will always have to stop the device in order to complete the migration, and you cannot stop it before you have stopped your page tracking mechanism. So unless the platform has an IOMMU that is somehow taking part in the dirty page tracking, you will not be able to stop the guest and then the device; it will have to be the device and then the guest. >Doing suspend and resume() may help to do migration easily but some >devices requires low service down time. Especially network and I got >that some cloud company promised less than 500ms network service downtime. Honestly, focusing on the downtime is getting the cart ahead of the horse. First you need to be able to do this without corrupting system memory, regardless of the state of the device. You haven't even gotten to that state yet. Last I knew, the device had to be up in order for your migration to even work. I think the issue is that the content of an rx packet delivered to the stack may change during migration, because that piece of memory won't be migrated to the new machine. This may confuse applications or the stack. The current dummy-write solution can ensure the packet contents won't change after the dummy write is done, while the contents may not yet be the received data if migration happens before that point. We can recheck the contents via the checksum or CRC in the protocol after the dummy write, to ensure the contents are what the VF received. I think the stack already does such checks, and the packet will be dropped if it fails to pass them. Another way is to tell Qemu about all memory the driver is using and let Qemu migrate that memory after stopping the VCPU and the device. This seems safe, but the implementation may be complex. 
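The dummy-write-plus-checksum-recheck sequence discussed above can be sketched in user space. This is a hedged illustration under stated assumptions: the checksum here is a plain byte sum for demonstration, not the checksum of any real protocol, and the function names are hypothetical rather than taken from the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative payload checksum: a simple byte sum, standing in for the
 * protocol checksum/CRC the stack would actually verify. */
static uint32_t payload_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Dummy-write one byte of the buffer back to itself (so dirty-page logging
 * sees a write to the page) and then re-verify the checksum to confirm the
 * contents were not perturbed. Returns 1 on match, 0 otherwise. */
int dummy_write_and_verify(uint8_t *buf, size_t len, uint32_t expected_sum)
{
    volatile uint8_t *p = buf;

    p[0] = p[0];   /* dummy write: dirties the page, preserves the data */
    return payload_sum(buf, len) == expected_sum;
}
```

A mismatch after the dummy write would correspond to the case the email worries about: migration raced with DMA and the buffer does not yet hold the received data, so the stack's existing checksum validation would drop the packet.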
Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
On 11/30/2015 4:01 PM, Michael S. Tsirkin wrote: It is still not very clear what it is you are trying to achieve, and whether your patchset achieves it. You merely say "adding live migration" but it seems pretty clear this isn't about being able to migrate a guest transparently, since you are adding a host/guest handshake. This isn't about functionality either: I think that on KVM, it isn't hard to live migrate if you can do a host/guest handshake, even today, with no kernel changes: 1. before migration, expose a pv nic to guest (can be done directly on boot) 2. use e.g. a serial connection to move IP from an assigned device to pv nic 3. maybe move the mac as well 4. eject the assigned device 5. detect eject on host (QEMU generates a DEVICE_DELETED event when this happens) and start migration. This looks like the bonding driver solution, which puts the PV NIC and VF into one bonded interface under active-backup mode. The bonding driver switches from the VF to the PV NIC automatically when the VF is unplugged during migration. This is the only available solution for VF NIC migration today. But it requires the guest OS to do specific configuration inside and rely on the bonding driver, which blocks it from working on Windows. From the performance side, putting the VF and virtio NIC under a bonded interface affects their performance even when not doing migration. These factors block using VF NIC passthrough in some use cases (especially in the cloud) which require migration. The solution we propose changes the NIC driver and Qemu. The guest OS doesn't need to do anything special for migration. It's easy to deploy, and since all changes are in the NIC driver, a NIC vendor can implement migration support entirely in their driver.
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 11/26/2015 11:56 AM, Alexander Duyck wrote: > I am not saying you cannot modify the drivers, however what you are doing is far too invasive. Do you seriously plan on modifying all of the PCI device drivers out there in order to allow any device that might be direct assigned to a port to support migration? I certainly hope not. That is why I have said that this solution will not scale. Current drivers are not migration friendly. If a driver wants to support migration, it necessarily has to be changed. RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and DMA tracking during migration. These are common to most drivers; they may be problematic in the previous version but can be corrected later. Doing suspend and resume() may help to do migration easily, but some devices require low service downtime, especially network devices, and I heard that some cloud companies promise less than 500ms network service downtime. So I think the performance effect should also be taken into account when we design the framework. What I am counter-proposing seems like a very simple proposition. It can be implemented in two steps. 1. Look at modifying dma_mark_clean(). It is a function called in the sync and unmap paths of lib/swiotlb.c. If you could somehow modify it to take care of marking the pages you unmap for Rx as being dirty, it will get you a good way towards your goal, as it will allow you to continue to do DMA while you are migrating the VM. 2. Look at making use of the existing PCI suspend/resume calls that are there to support PCI power management. They have everything needed to allow you to pause and resume DMA for the device before and after the migration while retaining the driver state. If you can implement something that allows you to trigger these calls from the PCI subsystem, such as hot-plug, then you would have a generic solution that can be easily reproduced for multiple drivers beyond those supported by ixgbevf. I glanced at the PCI hotplug code. 
The hotplug events are triggered by the PCI hotplug controller, and these events are defined in the controller spec. It's hard to extend it with more events. Otherwise, we would also need to add some specific code in the PCI hotplug core, since it only adds and removes PCI devices when it gets events. It's also a challenge to modify the Windows hotplug code. So we may need to find another way. Thanks. - Alex
Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver
On 11/25/2015 8:28 PM, Michael S. Tsirkin wrote: Frankly, I don't really see what this short term hack buys us, and if it goes in, we'll have to maintain it forever. The framework for how to notify the VF about migration status won't change regardless of whether the VF is stopped before migration or not. We hope to reach agreement on this first. Tracking dirty memory still needs more discussion, and we will continue working on it. Stopping the VF may help to work around the issue and make tracking easier. Also, assuming you just want to do ifdown/ifup for some reason, it's easy enough to do using a guest agent, in a completely generic way. Just ifdown/ifup is not enough for migration. It needs to restore some PCI settings before doing ifup on the target machine.
Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
On 11/25/2015 5:03 AM, Michael S. Tsirkin wrote: >+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr, >+ uint32_t val, int len) >+{ >+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev); >+ >+if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS >+&& val == PCI_VF_READY_FOR_MIGRATION) { >+qemu_event_set(_event); This would wake migration so it can proceed - except it needs the QEMU lock to run, and that's taken by the migration thread. Sorry, I seem to miss something. Which lock may cause a deadlock when vfio_migration_cap_handle() is called while migration runs? The function is called when the VF accesses the faked PCI capability. It seems unlikely that this ever worked - how did you test this?
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 11/25/2015 13:30, Alexander Duyck wrote: > No, what I am getting at is that you can't go around and modify the > configuration space for every possible device out there. This > solution won't scale. PCI config space regs are emulated by Qemu, so we can find free PCI config space regs for the faked PCI capability. Its position need not be permanent. > If you instead moved the logic for notifying > the device into a separate mechanism such as making it a part of the > hot-plug logic then you only have to write the code once per OS in > order to get the hot-plug capability to pause/resume the device. What > I am talking about is not full hot-plug, but rather to extend the > existing hot-plug in Qemu and the Linux kernel to support a > "pause/resume" functionality. The PCI hot-plug specification calls > out the option of implementing something like this, but we don't > currently have support for it. > Could you elaborate on the part of the PCI hot-plug specification you mentioned? My concern is whether it needs a change to the PCI spec or not. > I just feel doing it through PCI hot-plug messages will scale much > better as you could likely make use of the power management > suspend/resume calls to take care of most of the needed implementation > details. > > - Alex -- Best regards Tianyu Lan
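The "find free PCI config space regs" step mentioned above could, in principle, be done by walking the standard capability list over an in-memory copy of config space. This is a hedged user-space sketch, not the VFIO/Qemu implementation: as Alex points out in the thread, config space itself does not encode a capability's length, so sizes come from a caller-supplied table here, and all names are illustrative.

```c
#include <stdint.h>
#include <string.h>

#define CFG_SIZE    256
#define CAP_PTR_OFF 0x34   /* standard capabilities pointer */
#define FIRST_FREE  0x40   /* search above the standard header region */

/* Walk the capability list (each entry: ID byte, next-pointer byte), mark
 * bytes covered by each capability as used via the per-ID length table,
 * then return the offset of the first gap of `need` bytes, or -1. */
int find_free_cap_space(const uint8_t *cfg,
                        const uint8_t *cap_len_by_id,  /* 256 entries */
                        int need)
{
    uint8_t used[CFG_SIZE];

    memset(used, 0, sizeof(used));
    memset(used, 1, FIRST_FREE);          /* standard header is off-limits */

    for (uint8_t pos = cfg[CAP_PTR_OFF]; pos && pos < CFG_SIZE - 1;
         pos = cfg[pos + 1]) {
        int len = cap_len_by_id[cfg[pos]];
        for (int i = 0; i < len && pos + i < CFG_SIZE; i++)
            used[pos + i] = 1;
    }

    for (int start = FIRST_FREE; start + need <= CFG_SIZE; start++) {
        int ok = 1;
        for (int i = 0; i < need; i++)
            if (used[start + i]) { ok = 0; break; }
        if (ok)
            return start;
    }
    return -1;   /* no room for the faked capability */
}
```

This also makes the objection concrete: without a reliable length table (the role pci_config_map plays in the VFIO kernel driver), a gap that merely looks uncovered cannot safely be claimed.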
Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 11/24/2015 22:20, Alexander Duyck wrote: > I'm still not a fan of this approach. I really feel like this is > something that should be resolved by extending the existing PCI hot-plug > rather than trying to instrument this per driver. Then you will get the > goodness for multiple drivers and multiple OSes instead of just one. An > added advantage to dealing with this in the PCI hot-plug environment > would be that you could then still do a hot-plug even if the guest > didn't load a driver for the VF since you would be working with the PCI > slot instead of the device itself. > > - Alex Hi Alex: What you mentioned seems to be the bonding driver solution. The paper "Live Migration with Pass-through Device for Linux VM" describes it. It does VF hotplug during migration. In order to maintain the network connection while the VF is out, it takes advantage of the Linux bonding driver to switch between the VF NIC and an emulated NIC. But it has side effects: it requires the VM to do additional configuration, and the performance while switching between the two NICs is not good. -- Best regards Tianyu Lan
Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver
On 11/25/2015 05:20, Michael S. Tsirkin wrote: > I have to say, I was much more interested in the idea > of tracking dirty memory. I have some thoughts about > that one - did you give up on it then? No, our final target is to keep the VF active before doing migration, and tracking dirty memory is essential for that. But this seems not easy to do in the short term for upstream. As a starting point, we stop the VF before migration. After deeper thought, the approach of stopping the VF still needs to track DMA-dirtied memory, to make sure that receive data buffers filled before the VF is stopped get migrated. It's easier to do that via a dummy write to the data buffer when receiving a packet. -- Best regards Tianyu Lan
[RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver
This patch is to add migration support for the ixgbevf driver. A faked PCI migration capability table is used to communicate with Qemu, sharing the migration status and the mailbox irq vector index. Qemu notifies the VF of migration events by sending an MSIX msg to trigger the mailbox vector during migration, and stores the migration status in the PCI_VF_MIGRATION_VMM_STATUS reg in the new capability table. The mailbox irq is triggered just before the stop-and-copy stage, and again after migration on the target machine. The VF driver puts down the net device when it detects migration, and tells Qemu it's ready for migration by writing the PCI_VF_MIGRATION_VF_STATUS reg. After migration, it puts the net device up again. Qemu is in charge of migrating the PCI config space regs and MSIX config. The patch is dedicated to the normal case in which net traffic works and the mailbox irq is enabled. For other cases (such as the driver not being loaded, or the adapter being suspended or closed), the mailbox irq won't be triggered and the VF driver will disable it via the PCI_VF_MIGRATION_CAP reg. These cases will be resolved later. 
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |   5 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
 2 files changed, 107 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..4b8ba2f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -438,6 +438,11 @@ struct ixgbevf_adapter {
        u64 bp_tx_missed;
 #endif

+       u8 migration_cap;
+       u8 last_migration_reg;
+       unsigned long migration_status;
+       struct work_struct migration_task;
+
        u8 __iomem *io_addr; /* Mainly for iounmap use */
        u32 link_speed;
        bool link_up;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..95860c2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -96,6 +96,8 @@ static int debug = -1;
 module_param(debug, int, 0);
 MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");

+#define MIGRATION_IN_PROGRESS  0
+
 static void ixgbevf_service_event_schedule(struct ixgbevf_adapter *adapter)
 {
        if (!test_bit(__IXGBEVF_DOWN, &adapter->state) &&
@@ -1262,6 +1264,22 @@ static void ixgbevf_set_itr(struct ixgbevf_q_vector *q_vector)
        }
 }

+static void ixgbevf_migration_check(struct ixgbevf_adapter *adapter)
+{
+       struct pci_dev *pdev = adapter->pdev;
+       u8 val;
+
+       pci_read_config_byte(pdev,
+               adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
+               &val);
+
+       if (val != adapter->last_migration_reg) {
+               schedule_work(&adapter->migration_task);
+               adapter->last_migration_reg = val;
+       }
+
+}
+
 static irqreturn_t ixgbevf_msix_other(int irq, void *data)
 {
        struct ixgbevf_adapter *adapter = data;
@@ -1269,6 +1287,7 @@ static irqreturn_t ixgbevf_msix_other(int irq, void *data)

        hw->mac.get_link_status = 1;

+       ixgbevf_migration_check(adapter);
        ixgbevf_service_event_schedule(adapter);

        IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, adapter->eims_other);
@@ -1383,6 +1402,7 @@ out:
 static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter)
 {
        struct net_device *netdev = adapter->netdev;
+       struct pci_dev *pdev = adapter->pdev;
        int q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
        int vector, err;
        int ri = 0, ti = 0;
@@ -1423,6 +1443,12 @@ static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter)
                goto free_queue_irqs;
        }

+       if (adapter->migration_cap) {
+               pci_write_config_byte(pdev,
+                       adapter->migration_cap + PCI_VF_MIGRATION_IRQ,
+                       vector);
+       }
+
        return 0;

 free_queue_irqs:
@@ -2891,6 +2917,59 @@ static void ixgbevf_watchdog_subtask(struct ixgbevf_adapter *adapter)
        ixgbevf_update_stats(adapter);
 }

+static void ixgbevf_migration_task(struct work_struct *work)
+{
+       struct ixgbevf_adapter *adapter = container_of(work,
+                       struct ixgbevf_adapter,
+                       migration_task);
+       struct pci_dev *pdev = adapter->pdev;
+       struct net_device *netdev = adapter->netdev;
+       u8 val;
+
+       if (!test_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status)) {
+               pci_read_config_byte(pdev,
+                       adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
+                       &val);
+               if (val != VMM_MIGRATION_START)
+                       return;
+
+               pr_info("migration start\n");
+               set_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status);
[RFC PATCH V2 1/3] VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO
This patch adds a new ioctl cmd, VFIO_GET_PCI_CAP_INFO, to get the PCI cap table size and to find free PCI config space regs according to pos and size. Qemu will add a faked PCI capability for migration and needs this info.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 21
 drivers/vfio/pci/vfio_pci_config.c  | 38 +++--
 drivers/vfio/pci/vfio_pci_private.h |  5 +
 include/uapi/linux/vfio.h           | 12
 4 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 69fab0f..2e42de0 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -784,6 +784,27 @@ hot_reset_release:
                kfree(groups);
                return ret;
+       } else if (cmd == VFIO_GET_PCI_CAP_INFO) {
+               struct vfio_pci_cap_info info;
+               int offset;
+
+               if (copy_from_user(&info, (void __user *)arg, sizeof(info)))
+                       return -EFAULT;
+
+               switch (info.index) {
+               case VFIO_PCI_CAP_GET_SIZE:
+                       info.size = vfio_get_cap_size(vdev, info.cap, info.offset);
+                       break;
+               case VFIO_PCI_CAP_GET_FREE_REGION:
+                       offset = vfio_find_free_pci_config_reg(vdev,
+                                       info.offset, info.size);
+                       info.offset = offset;
+                       break;
+               default:
+                       return -EINVAL;
+               }
+
+               return copy_to_user((void __user *)arg, &info, sizeof(info));
        }

        return -ENOTTY;
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index ff75ca3..8afbda4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -841,6 +841,21 @@ static int vfio_find_cap_start(struct vfio_pci_device *vdev, int pos)
        return pos;
 }

+int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev,
+               int pos, int size)
+{
+       int i, offset = pos;
+
+       for (i = pos; i < PCI_CFG_SPACE_SIZE; i++) {
+               if (vdev->pci_config_map[i] != PCI_CAP_ID_INVALID)
+                       offset = i + 1;
+               else if (i - offset + 1 == size)
+                       return offset;
+       }
+
+       return 0;
+}
+
 static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
                                int count, struct perm_bits *perm,
                                int offset, __le32 *val)
@@ -1199,6 +1214,20 @@ static int vfio_fill_vconfig_bytes(struct vfio_pci_device *vdev,
        return ret;
 }

+int vfio_get_cap_size(struct vfio_pci_device *vdev, u8 cap, int pos)
+{
+       int len;
+
+       len = pci_cap_length[cap];
+       if (len == 0xFF) { /* Variable length */
+               len = vfio_cap_len(vdev, cap, pos);
+               if (len < 0)
+                       return len;
+       }
+
+       return len;
+}
+
 static int vfio_cap_init(struct vfio_pci_device *vdev)
 {
        struct pci_dev *pdev = vdev->pdev;
@@ -1238,12 +1267,9 @@ static int vfio_cap_init(struct vfio_pci_device *vdev)
                        return ret;

                if (cap <= PCI_CAP_ID_MAX) {
-                       len = pci_cap_length[cap];
-                       if (len == 0xFF) { /* Variable length */
-                               len = vfio_cap_len(vdev, cap, pos);
-                               if (len < 0)
-                                       return len;
-                       }
+                       len = vfio_get_cap_size(vdev, cap, pos);
+                       if (len < 0)
+                               return len;
                }

                if (!len) {
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index ae0e1b4..91b4f9b 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -89,4 +89,9 @@ extern void vfio_pci_uninit_perm_bits(void);
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);

+extern int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev,
+               int pos, int size);
+extern int vfio_get_cap_size(struct vfio_pci_device *vdev,
+               u8 cap, int pos);
+
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index b57b750..dfa7023 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -495,6 +495,18 @@ struct vfio_eeh_pe_op {

 #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)

+#define VFIO_GET_PCI_CAP_INFO _IO(VFIO_TYPE, VFIO_BASE + 22)
+struct vfio_pci_cap_info {
+       __u32 argsz;
+       __u32 flags;
+#define VFIO_PCI_CAP_GET_SIZE (1 << 0)
+#define VFIO_
[RFC PATCH V2 2/3] PCI: Add macros for faked PCI migration capability
This patch extends the PCI CAP id range with a migration cap and adds reg macros. The CAP ID is tentative; we may find a better one if the solution turns out to be feasible.

*PCI_VF_MIGRATION_CAP
For the VF driver to control whether the mailbox irq is triggered during migration.

*PCI_VF_MIGRATION_VMM_STATUS
Qemu stores the migration status in this reg.

*PCI_VF_MIGRATION_VF_STATUS
The VF driver tells Qemu it is ready for migration.

*PCI_VF_MIGRATION_IRQ
The VF driver stores the mailbox interrupt vector in this reg for Qemu to trigger during migration.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 include/uapi/linux/pci_regs.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index efe3443..9defb6f 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -216,7 +216,8 @@
 #define  PCI_CAP_ID_MSIX       0x11    /* MSI-X */
 #define  PCI_CAP_ID_SATA       0x12    /* SATA Data/Index Conf. */
 #define  PCI_CAP_ID_AF         0x13    /* PCI Advanced Features */
-#define  PCI_CAP_ID_MAX        PCI_CAP_ID_AF
+#define  PCI_CAP_ID_MIGRATION  0X14
+#define  PCI_CAP_ID_MAX        PCI_CAP_ID_MIGRATION
 #define PCI_CAP_LIST_NEXT      1       /* Next capability in the list */
 #define PCI_CAP_FLAGS          2       /* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF         4
@@ -904,4 +905,19 @@
 #define PCI_TPH_CAP_ST_SHIFT   16      /* st table shift */
 #define PCI_TPH_BASE_SIZEOF    12      /* size with no st table */

+/* Migration */
+#define PCI_VF_MIGRATION_CAP           0x04
+#define PCI_VF_MIGRATION_VMM_STATUS    0x05
+#define PCI_VF_MIGRATION_VF_STATUS     0x06
+#define PCI_VF_MIGRATION_IRQ           0x07
+
+#define PCI_VF_MIGRATION_DISABLE       0x00
+#define PCI_VF_MIGRATION_ENABLE        0x01
+
+#define VMM_MIGRATION_END              0x00
+#define VMM_MIGRATION_START            0x01
+
+#define PCI_VF_WAIT_FOR_MIGRATION      0x00
+#define PCI_VF_READY_FOR_MIGRATION     0x01
+
 #endif /* LINUX_PCI_REGS_H */
--
1.8.4.rc0.1.g8f6a3e5.dirty
[RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
This patch adds SRIOV VF migration support. It creates a new device type, "vfio-sriov", and adds the faked PCI migration capability to that device type.

The purpose of the new capability is to
1) sync the migration status with the VF driver in the VM,
2) get the mailbox irq vector used to notify the VF driver during migration, and
3) provide a way to control whether the irq is injected.

Qemu migrates the PCI config space regs and the MSIX config for the VF. It injects the mailbox irq at the last stage of migration to notify the VF about the migration event and waits for the VF driver to be ready. The VF driver writes the PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table to tell Qemu.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/Makefile.objs |   2 +-
 hw/vfio/pci.c         |   6 ++
 hw/vfio/pci.h         |   4 ++
 hw/vfio/sriov.c       | 178 ++
 4 files changed, 189 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/sriov.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d540c9d..9cf0178 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
-obj-$(CONFIG_PCI) += pci.o
+obj-$(CONFIG_PCI) += pci.o sriov.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7c43fc1..e7583b5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
     } else if (was_enabled && !is_enabled) {
         vfio_disable_msix(vdev);
     }
+    } else if (vdev->migration_cap &&
+        ranges_overlap(addr, len, vdev->migration_cap, 0x10)) {
+        /* Write everything to QEMU to keep emulated bits correct */
+        pci_default_write_config(pdev, addr, val, len);
+        vfio_migration_cap_handle(pdev, addr, val, len);
     } else {
         /* Write everything to QEMU to keep emulated bits correct */
         pci_default_write_config(pdev, addr, val, len);
@@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev)
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn(vdev);
+    vfio_add_migration_capability(vdev);

     return 0;
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6c00575..ee6ca5e 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
     EventNotifier req_notifier;
+    uint16_t migration_cap;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t features;
 #define VFIO_FEATURE_ENABLE_VGA_BIT 0
@@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
 void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
                            uint32_t val, int len);
 void vfio_enable_msix(VFIOPCIDevice *vdev);
+void vfio_add_migration_capability(VFIOPCIDevice *vdev);
+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
+                               uint32_t val, int len);
diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c
new file mode 100644
index 000..3109538
--- /dev/null
+++ b/hw/vfio/sriov.c
@@ -0,0 +1,178 @@
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "hw/hw.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/vfio-common.h"
+
+#define TYPE_VFIO_SRIOV "vfio-sriov"
+
+#define SRIOV_LM_SETUP 0x01
+#define SRIOV_LM_COMPLETE 0x02
+
+QemuEvent migration_event;
+
+static void vfio_dev_post_load(void *opaque)
+{
+    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    MSIMessage msg;
+    int vector;
+
+    if (vfio_pci_read_config(pdev,
+        vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
+        != PCI_VF_MIGRATION_ENABLE)
+        return;
+
+    vector = vfio_pci_read_config(pdev,
+        vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
+
+    msg = msix_get_message(pdev, vector);
+    kvm_irqchip_send_msi(kvm_state, msg);
+}
+
+static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id)
+{
+    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    int ret;
+
+    if (qemu_get_byte(f) != SRIOV_LM_COMPLETE)
+        return 0;
+
+    ret = pci_device_load(pdev, f);
+    if (ret) {
+        error_report("Failed to load PCI config space.\n");
+        return ret;
+    }
+
+    if (msix_enabled(pdev)) {
+        vfio_enable_msix(vdev);
+        msix_load(pdev, f);
+    }
+
+    vfio_pci_write_config(pdev, vdev->migration_cap +
+        PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_END, 1);
+    vfio_pci_write_config(pdev, vdev->migration_cap +
+        PCI_VF_MIGRATION_VF_STATUS, P
[RFC PATCH V2 10/10] Qemu/VFIO: Misc change for enable migration with VFIO
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e7583b5..404a5cd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3625,11 +3625,6 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };

-static const VMStateDescription vfio_pci_vmstate = {
-    .name = "vfio-pci",
-    .unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3637,7 +3632,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)

     dc->reset = vfio_pci_reset;
     dc->props = vfio_pci_dev_properties;
-    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->init = vfio_initfn;
--
1.9.3
[RFC PATCH V2 01/10] Qemu/VFIO: Create head file pci.h to share data struct.
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 137 +-
 hw/vfio/pci.h | 158 ++
 2 files changed, 159 insertions(+), 136 deletions(-)
 create mode 100644 hw/vfio/pci.h

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e0e339a..5c3f8a7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,138 +42,7 @@
 #include "trace.h"
 #include "hw/vfio/vfio.h"
 #include "hw/vfio/vfio-common.h"
-
-struct VFIOPCIDevice;
-
-typedef struct VFIOQuirk {
-    MemoryRegion mem;
-    struct VFIOPCIDevice *vdev;
-    QLIST_ENTRY(VFIOQuirk) next;
-    struct {
-        uint32_t base_offset:TARGET_PAGE_BITS;
-        uint32_t address_offset:TARGET_PAGE_BITS;
-        uint32_t address_size:3;
-        uint32_t bar:3;
-
-        uint32_t address_match;
-        uint32_t address_mask;
-
-        uint32_t address_val:TARGET_PAGE_BITS;
-        uint32_t data_offset:TARGET_PAGE_BITS;
-        uint32_t data_size:3;
-
-        uint8_t flags;
-        uint8_t read_flags;
-        uint8_t write_flags;
-    } data;
-} VFIOQuirk;
-
-typedef struct VFIOBAR {
-    VFIORegion region;
-    bool ioport;
-    bool mem64;
-    QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOBAR;
-
-typedef struct VFIOVGARegion {
-    MemoryRegion mem;
-    off_t offset;
-    int nr;
-    QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOVGARegion;
-
-typedef struct VFIOVGA {
-    off_t fd_offset;
-    int fd;
-    VFIOVGARegion region[QEMU_PCI_VGA_NUM_REGIONS];
-} VFIOVGA;
-
-typedef struct VFIOINTx {
-    bool pending; /* interrupt pending */
-    bool kvm_accel; /* set when QEMU bypass through KVM enabled */
-    uint8_t pin; /* which pin to pull for qemu_set_irq */
-    EventNotifier interrupt; /* eventfd triggered on interrupt */
-    EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
-    PCIINTxRoute route; /* routing info for QEMU bypass */
-    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
-    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
-} VFIOINTx;
-
-typedef struct VFIOMSIVector {
-    /*
-     * Two interrupt paths are configured per vector.  The first, is only used
-     * for interrupts injected via QEMU.  This is typically the non-accel path,
-     * but may also be used when we want QEMU to handle masking and pending
-     * bits.  The KVM path bypasses QEMU and is therefore higher performance,
-     * but requires masking at the device.  virq is used to track the MSI route
-     * through KVM, thus kvm_interrupt is only available when virq is set to a
-     * valid (>= 0) value.
-     */
-    EventNotifier interrupt;
-    EventNotifier kvm_interrupt;
-    struct VFIOPCIDevice *vdev; /* back pointer to device */
-    int virq;
-    bool use;
-} VFIOMSIVector;
-
-enum {
-    VFIO_INT_NONE = 0,
-    VFIO_INT_INTx = 1,
-    VFIO_INT_MSI  = 2,
-    VFIO_INT_MSIX = 3,
-};
-
-/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
-typedef struct VFIOMSIXInfo {
-    uint8_t table_bar;
-    uint8_t pba_bar;
-    uint16_t entries;
-    uint32_t table_offset;
-    uint32_t pba_offset;
-    MemoryRegion mmap_mem;
-    void *mmap;
-} VFIOMSIXInfo;
-
-typedef struct VFIOPCIDevice {
-    PCIDevice pdev;
-    VFIODevice vbasedev;
-    VFIOINTx intx;
-    unsigned int config_size;
-    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
-    off_t config_offset; /* Offset of config space region within device fd */
-    unsigned int rom_size;
-    off_t rom_offset; /* Offset of ROM region within device fd */
-    void *rom;
-    int msi_cap_size;
-    VFIOMSIVector *msi_vectors;
-    VFIOMSIXInfo *msix;
-    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
-    int interrupt; /* Current interrupt type */
-    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
-    VFIOVGA vga; /* 0xa, 0x3b0, 0x3c0 */
-    PCIHostDeviceAddress host;
-    EventNotifier err_notifier;
-    EventNotifier req_notifier;
-    int (*resetfn)(struct VFIOPCIDevice *);
-    uint32_t features;
-#define VFIO_FEATURE_ENABLE_VGA_BIT 0
-#define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
-#define VFIO_FEATURE_ENABLE_REQ_BIT 1
-#define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
-    int32_t bootindex;
-    uint8_t pm_cap;
-    bool has_vga;
-    bool pci_aer;
-    bool req_enabled;
-    bool has_flr;
-    bool has_pm_reset;
-    bool rom_read_failed;
-} VFIOPCIDevice;
-
-typedef struct VFIORomBlacklistEntry {
-    uint16_t vendor_id;
-    uint16_t device_id;
-} VFIORomBlacklistEntry;
+#include "hw/vfio/pci.h"

 /*
  * List of device ids/vendor ids for which to disable
@@ -193,12 +62,8 @@ static const VFIORomBlacklistEntry romblacklist[] = {
     { 0x14e4, 0x168e }
 };

-#define MSIX_CAP_LENGTH 12
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
-static uint32_t vfio_pci_
[RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
This patchset proposes a solution for adding live migration support for an SRIOV NIC.

During migration, Qemu needs to let the VF driver in the VM know when migration starts and ends. Qemu adds a faked PCI migration capability to help sync status between the two sides during migration.

Qemu triggers the VF's mailbox irq by sending an MSIX msg when the migration status changes. The VF driver tells Qemu its mailbox vector index via the new PCI capability. In some cases (NIC suspended or closed), the VF mailbox irq is freed, and the VF driver can disable irq injection via the new capability.

The VF driver brings the nic down before migration and up again on the target machine.

Lan Tianyu (10):
  Qemu/VFIO: Create head file pci.h to share data struct.
  Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
  Qemu/VFIO: Rework vfio_std_cap_max_size() function
  Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space regs
  Qemu/VFIO: Expose PCI config space read/write and msix functions
  Qemu/PCI: Add macros for faked PCI migration capability
  Qemu: Add post_load_state() to run after restoring CPU state
  Qemu: Add save_before_stop callback to run just before stopping VCPU during migration
  Qemu/VFIO: Add SRIOV VF migration support
  Qemu/VFIO: Misc change for enable migration with VFIO

 hw/vfio/Makefile.objs       |   2 +-
 hw/vfio/pci.c               | 196 +---
 hw/vfio/pci.h               | 168 +
 hw/vfio/sriov.c             | 178
 include/hw/pci/pci_regs.h   |  19 +
 include/migration/vmstate.h |   5 ++
 include/sysemu/sysemu.h     |   1 +
 linux-headers/linux/vfio.h  |  16
 migration/migration.c       |   3 +-
 migration/savevm.c          |  28 +++
 10 files changed, 459 insertions(+), 157 deletions(-)
 create mode 100644 hw/vfio/pci.h
 create mode 100644 hw/vfio/sriov.c

--
1.9.3
[RFC PATCH V2 03/10] Qemu/VFIO: Rework vfio_std_cap_max_size() function
Use the new ioctl cmd VFIO_GET_PCI_CAP_INFO to get the PCI cap table size. This gives an accurate table size and makes it easier to find free PCI config space regs for the faked PCI capability. The current code assigns the config space from the start of the last PCI capability all the way to pos 0xff to that last capability, occupying free PCI config space regs.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5c3f8a7..29845e3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2344,18 +2344,20 @@ static void vfio_unmap_bars(VFIOPCIDevice *vdev)
 /*
  * General setup
  */
-static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
+static uint8_t vfio_std_cap_max_size(VFIOPCIDevice *vdev, uint8_t cap)
 {
-    uint8_t tmp, next = 0xff;
+    struct vfio_pci_cap_info reg_info = {
+        .argsz = sizeof(reg_info),
+        .index = VFIO_PCI_CAP_GET_SIZE,
+        .cap = cap
+    };
+    int ret;

-    for (tmp = pdev->config[PCI_CAPABILITY_LIST]; tmp;
-         tmp = pdev->config[tmp + 1]) {
-        if (tmp > pos && tmp < next) {
-            next = tmp;
-        }
-    }
+    ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, &reg_info);
+    if (ret || reg_info.size == 0)
+        error_report("vfio: Failed to find free PCI config reg: %m\n");

-    return next - pos;
+    return reg_info.size;
 }

 static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
@@ -2521,7 +2523,7 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
      * Since QEMU doesn't actually handle many of the config accesses,
      * exact size doesn't seem worthwhile.
      */
-    size = vfio_std_cap_max_size(pdev, pos);
+    size = vfio_std_cap_max_size(vdev, cap_id);

     /*
      * pci_add_capability always inserts the new capability at the head
--
1.9.3
[RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 linux-headers/linux/vfio.h | 16
 1 file changed, 16 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 0508d0b..732b0bd 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -495,6 +495,22 @@ struct vfio_eeh_pe_op {

 #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)

+
+#define VFIO_FIND_FREE_PCI_CONFIG_REG _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+#define VFIO_GET_PCI_CAP_INFO _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+struct vfio_pci_cap_info {
+    __u32 argsz;
+    __u32 flags;
+#define VFIO_PCI_CAP_GET_SIZE (1 << 0)
+#define VFIO_PCI_CAP_GET_FREE_REGION (1 << 1)
+    __u32 index;
+    __u32 offset;
+    __u32 size;
+    __u8 cap;
+};
+
 /*
  *
  */
 #endif /* VFIO_H */
--
1.9.3
[RFC PATCH V2 07/10] Qemu: Add post_load_state() to run after restoring CPU state
After migration, Qemu needs to trigger the mailbox irq to notify the VF driver in the guest about the status change. Irq delivery only works again after the CPU state has been restored. This patch adds a new callback that runs after restoring CPU state, providing a way to trigger the mailbox irq at that point.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 include/migration/vmstate.h |  2 ++
 migration/savevm.c          | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 0695d7c..dc681a6 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -56,6 +56,8 @@ typedef struct SaveVMHandlers {
     int (*save_live_setup)(QEMUFile *f, void *opaque);
     uint64_t (*save_live_pending)(QEMUFile *f, void *opaque,
                                   uint64_t max_size);
+    /* This runs after restoring CPU related state */
+    void (*post_load_state)(void *opaque);
     LoadStateHandler *load_state;
 } SaveVMHandlers;

diff --git a/migration/savevm.c b/migration/savevm.c
index 9e0e286..48b6223 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -702,6 +702,20 @@ bool qemu_savevm_state_blocked(Error **errp)
     return false;
 }

+void qemu_savevm_post_load(void)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->post_load_state) {
+            continue;
+        }
+
+        se->ops->post_load_state(se->opaque);
+    }
+}
+
+
 void qemu_savevm_state_header(QEMUFile *f)
 {
     trace_savevm_state_header();
@@ -1140,6 +1154,7 @@ int qemu_loadvm_state(QEMUFile *f)
     }

     cpu_synchronize_all_post_init();
+    qemu_savevm_post_load();

     ret = 0;
--
1.9.3
[RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
This patchset proposes a solution for adding live migration support for an SRIOV NIC.

During migration, Qemu needs to let the VF driver in the VM know when migration starts and ends. Qemu adds a faked PCI migration capability to help sync status between the two sides during migration.

Qemu triggers the VF's mailbox irq by sending an MSIX msg when the migration status changes. The VF driver tells Qemu its mailbox vector index via the new PCI capability. In some cases (NIC suspended or closed), the VF mailbox irq is freed, and the VF driver can disable irq injection via the new capability.

The VF driver brings the nic down before migration and up again on the target machine.

Lan Tianyu (3):
  VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO
  PCI: Add macros for faked PCI migration capability
  Ixgbevf: Add migration support for ixgbevf driver

 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |   5 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
 drivers/vfio/pci/vfio_pci.c                       |  21 +
 drivers/vfio/pci/vfio_pci_config.c                |  38 ++--
 drivers/vfio/pci/vfio_pci_private.h               |   5 ++
 include/uapi/linux/pci_regs.h                     |  18 +++-
 include/uapi/linux/vfio.h                         |  12 +++
 7 files changed, 194 insertions(+), 7 deletions(-)

--
1.8.4.rc0.1.g8f6a3e5.dirty
[RFC PATCH V2 08/10] Qemu: Add save_before_stop callback to run just before stopping VCPU during migration
This patch adds a callback that is called just before the VCPU is stopped. VF migration uses it to trigger the mailbox irq as late as possible, to decrease service downtime.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 include/migration/vmstate.h |  3 +++
 include/sysemu/sysemu.h     |  1 +
 migration/migration.c       |  3 ++-
 migration/savevm.c          | 13 +
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index dc681a6..093faf1 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -58,6 +58,9 @@ typedef struct SaveVMHandlers {
     /* This runs after restoring CPU related state */
     void (*post_load_state)(void *opaque);

+
+    /* This runs before stopping VCPU */
+    void (*save_before_stop)(QEMUFile *f, void *opaque);
     LoadStateHandler *load_state;
 } SaveVMHandlers;

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index df80951..3d0d72c 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -84,6 +84,7 @@ void qemu_announce_self(void);

 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_state_begin(QEMUFile *f,
                              const MigrationParams *params);
+void qemu_savevm_save_before_stop(QEMUFile *f);
 void qemu_savevm_state_header(QEMUFile *f);
 int qemu_savevm_state_iterate(QEMUFile *f);
 void qemu_savevm_state_complete(QEMUFile *f);
diff --git a/migration/migration.c b/migration/migration.c
index c6ac08a..fccadea 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -759,7 +759,6 @@ int64_t migrate_xbzrle_cache_size(void)
 }

 /* migration thread support */
-
 static void *migration_thread(void *opaque)
 {
     MigrationState *s = opaque;
@@ -788,6 +787,8 @@ static void *migration_thread(void *opaque)
             } else {
                 int ret;

+                qemu_savevm_save_before_stop(s->file);
+
                 qemu_mutex_lock_iothread();
                 start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
                 qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
diff --git a/migration/savevm.c b/migration/savevm.c
index 48b6223..c2e4802 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -715,6 +715,19 @@ void qemu_savevm_post_load(void)
     }
 }

+void qemu_savevm_save_before_stop(QEMUFile *f)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->save_before_stop) {
+            continue;
+        }
+
+        se->ops->save_before_stop(f, se->opaque);
+    }
+}
+
 void qemu_savevm_state_header(QEMUFile *f)
 {
--
1.9.3
[RFC PATCH V2 05/10] Qemu/VFIO: Expose PCI config space read/write and msix functions
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 6 +++---
 hw/vfio/pci.h | 4
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d0354a0..7c43fc1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -613,7 +613,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
     }
 }

-static void vfio_enable_msix(VFIOPCIDevice *vdev)
+void vfio_enable_msix(VFIOPCIDevice *vdev)
 {
     vfio_disable_interrupts(vdev);

@@ -1931,7 +1931,7 @@ static void vfio_bar_quirk_free(VFIOPCIDevice *vdev, int nr)
 /*
  * PCI config space
  */
-static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
+uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
     uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
@@ -1964,7 +1964,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
     return val;
 }

-static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
                            uint32_t val, int len)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6083300..6c00575 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -158,3 +158,7 @@ typedef struct VFIORomBlacklistEntry {
 #define MSIX_CAP_LENGTH 12

 uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size);
+uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
+void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+                           uint32_t val, int len);
+void vfio_enable_msix(VFIOPCIDevice *vdev);
--
1.9.3
[RFC PATCH V2 04/10] Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space regs
This patch is to add ioctl wrap to find free PCI config sapce regs. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- hw/vfio/pci.c | 19 +++ hw/vfio/pci.h | 2 ++ 2 files changed, 21 insertions(+) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 29845e3..d0354a0 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2508,6 +2508,25 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos) } } +uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size) +{ +struct vfio_pci_cap_info reg_info = { +.argsz = sizeof(reg_info), +.offset = pos, +.index = VFIO_PCI_CAP_GET_FREE_REGION, +.size = size, +}; +int ret; + +ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, _info); +if (ret || reg_info.offset == 0) { +error_report("vfio: Failed to find free PCI config reg: %m\n"); +return -EFAULT; +} + +return reg_info.offset; +} + static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos) { PCIDevice *pdev = >pdev; diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h index 9f360bf..6083300 100644 --- a/hw/vfio/pci.h +++ b/hw/vfio/pci.h @@ -156,3 +156,5 @@ typedef struct VFIORomBlacklistEntry { } VFIORomBlacklistEntry; #define MSIX_CAP_LENGTH 12 + +uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size); -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability
This patch is to extend PCI CAP id for migration cap and add reg macros. The CAP ID is trial and we may find better one if the solution is feasible. *PCI_VF_MIGRATION_CAP For VF driver to control that triggers mailbox irq or not during migration. *PCI_VF_MIGRATION_VMM_STATUS Qemu stores migration status in the reg *PCI_VF_MIGRATION_VF_STATUS VF driver tells Qemu ready for migration *PCI_VF_MIGRATION_IRQ VF driver stores mailbox interrupt vector in the reg for Qemu to trigger during migration. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- include/hw/pci/pci_regs.h | 19 +++ 1 file changed, 19 insertions(+) diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h index 57e8c80..0dcaf7e 100644 --- a/include/hw/pci/pci_regs.h +++ b/include/hw/pci/pci_regs.h @@ -213,6 +213,7 @@ #define PCI_CAP_ID_MSIX 0x11/* MSI-X */ #define PCI_CAP_ID_SATA 0x12/* Serial ATA */ #define PCI_CAP_ID_AF 0x13/* PCI Advanced Features */ +#define PCI_CAP_ID_MIGRATION 0x14 #define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */ #define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */ #define PCI_CAP_SIZEOF 4 @@ -716,4 +717,22 @@ #define PCI_ACS_CTRL 0x06/* ACS Control Register */ #define PCI_ACS_EGRESS_CTL_V 0x08/* ACS Egress Control Vector */ +/* Migration*/ +#define PCI_VF_MIGRATION_CAP0x04 +#define PCI_VF_MIGRATION_VMM_STATUS0x05 +#define PCI_VF_MIGRATION_VF_STATUS 0x06 +#define PCI_VF_MIGRATION_IRQ 0x07 + +#define PCI_VF_MIGRATION_CAP_SIZE 0x08 + +#define VMM_MIGRATION_END0x00 +#define VMM_MIGRATION_START 0x01 + +#define PCI_VF_WAIT_FOR_MIGRATION 0x00 +#define PCI_VF_READY_FOR_MIGRATION 0x01 + +#define PCI_VF_MIGRATION_DISABLE0x00 +#define PCI_VF_MIGRATION_ENABLE 0x01 + + #endif /* LINUX_PCI_REGS_H */ -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 2015年10月26日 23:03, Alexander Duyck wrote:
> No. I think you are missing the fact that there are 256 descriptors per
> page. As such if you dirty just 1 you will be pulling in 255 more, of
> which you may or may not have pulled in the receive buffer for.
>
> So for example if you have the descriptor ring size set to 256 then that
> means you are going to get whatever the descriptor ring has since you
> will be marking the entire ring dirty with every packet processed,
> however you cannot guarantee that you are going to get all of the
> receive buffers unless you go through and flush the entire ring prior to
> migrating.

Yes, that will be a problem. How about adding a tag to each Rx buffer and checking the tag when delivering the Rx buffer to the stack? If the tag has been overwritten, that means the packet data has been migrated.

> This is why I have said you will need to do something to force the rings
> to be flushed such as initiating a PM suspend prior to migrating. You
> need to do something to stop the DMA and flush the remaining Rx buffers
> if you want to have any hope of being able to migrate the Rx in a
> consistent state. Beyond that the only other thing you have to worry
> about are the Rx buffers that have already been handed off to the
> stack. However those should be handled if you do a suspend and somehow
> flag pages as dirty when they are unmapped from the DMA.
>
> - Alex

This will be simple and may be our first version to enable migration. But we still hope to find a way to avoid disabling DMA before stopping the VCPU, to decrease service downtime.

--
Best regards
Tianyu Lan
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 2015年10月29日 14:58, Alexander Duyck wrote:
> Your code was having to do a bunch of shuffling in order to get things
> set up so that you could bring the interface back up. I would argue
> that it may actually be faster at least on the bring-up to just drop the
> old rings and start over since it greatly reduced the complexity and the
> amount of device related data that has to be moved.

If we give up the old rings after migration and keep DMA running before stopping the VCPU, it seems we don't need to track the Tx/Rx descriptor rings; we just need to make sure that all Rx buffers delivered to the stack have been migrated.

1) Dummy write the Rx buffer before checking the Rx descriptor, to ensure the packet data is migrated first.

2) Make a copy of the Rx descriptor and then use the copied data to check the buffer status. Don't use the original descriptor, because it won't be migrated and migration may happen between two accesses of the Rx descriptor.

--
Best regards
Tianyu Lan
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 2015年10月30日 00:17, Alexander Duyck wrote: > On 10/29/2015 01:33 AM, Lan Tianyu wrote: >> On 2015年10月29日 14:58, Alexander Duyck wrote: >>> Your code was having to do a bunch of shuffling in order to get things >>> set up so that you could bring the interface back up. I would argue >>> that it may actually be faster at least on the bring-up to just drop the >>> old rings and start over since it greatly reduced the complexity and the >>> amount of device related data that has to be moved. >> If give up the old ring after migration and keep DMA running before >> stopping VCPU, it seems we don't need to track Tx/Rx descriptor ring and >> just make sure that all Rx buffers delivered to stack has been migrated. >> >> 1) Dummy write Rx buffer before checking Rx descriptor to ensure packet >> migrated first. > > Don't dummy write the Rx descriptor. You should only really need to > dummy write the Rx buffer and you would do so after checking the > descriptor, not before. Otherwise you risk corrupting the Rx buffer > because it is possible for you to read the Rx buffer, DMA occurs, and > then you write back the Rx buffer and now you have corrupted the memory. > >> 2) Make a copy of Rx descriptor and then use the copied data to check >> buffer status. Not use the original descriptor because it won't be >> migrated and migration may happen between two access of the Rx >> descriptor. > > Do not just blindly copy the Rx descriptor ring. That is a recipe for > disaster. The problem is DMA has to happen in a very specific order for > things to function correctly. The Rx buffer has to be written and then > the Rx descriptor. The problem is you will end up getting a read-ahead > on the Rx descriptor ring regardless of which order you dirty things in. Sorry, I didn't say clearly. I meant to copy one Rx descriptor when receive rx irq and handle Rx ring. 
Current code in ixgbevf_clean_rx_irq() checks the status of the Rx descriptor to see whether its Rx buffer has been populated with data, and then reads the packet length from the Rx descriptor to handle the Rx buffer.

My idea is to do the following three steps when receiving an Rx buffer in ixgbevf_clean_rx_irq():

(1) dummy write the Rx buffer first,
(2) make a copy of its Rx descriptor,
(3) check the buffer status and get the length from the copy.

Migration may happen at any time.

If it happens between (1) and (2): even if the Rx buffer has been populated with data, the VF driver will not know that on the new machine because the Rx descriptor isn't migrated, but it's still safe.

If it happens between (2) and (3): the copy will be migrated to the new machine, and the Rx buffer is migrated first. If there is data in the Rx buffer, the VF driver can still handle the buffer without migrating the Rx descriptor.

The subsequent buffers will be ignored, since we don't migrate the Rx descriptors for them. Their status will show as not completed on the new machine.

--
Best regards
Tianyu Lan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver
On 10/22/2015 4:45 AM, Alexander Duyck wrote: +/* Record states hold by PF */ +memcpy(>vf_data, >vfinfo[vfn], sizeof(struct vf_data_storage)); + +vf_shift = vfn % 32; +reg_offset = vfn / 32; + +reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset)); +reg &= ~(1 << vf_shift); +IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg); + +reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset)); +reg &= ~(1 << vf_shift); +IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg); + +reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset)); +reg &= ~(1 << vf_shift); +IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg); + +return sizeof(struct state_in_pf); +} + This is a read. Why does it need to switch off the VF? Also why turn of the anti-spoof, it doesn't make much sense. This is to prevent packets which target to VM from delivering to original VF after migration. E,G After migration, VM pings the PF of original machine and the ping reply packet will forward to original VF if it wasn't disabled. BTW, the read is done when VM has been stopped on the source machine. +static ssize_t ixgbe_store_state_in_pf(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ +struct ixgbe_adapter *adapter = to_adapter(dev); +struct pci_dev *pdev = adapter->pdev, *vdev; +struct pci_dev *vf_pdev = to_pci_dev(dev); +struct state_in_pf *state = (struct state_in_pf *)buf; +int vfn = vf_pdev->virtfn_index; + +/* Check struct size */ +if (count != sizeof(struct state_in_pf)) { +printk(KERN_ERR "State in PF size does not fit.\n"); +goto out; +} + +/* Restore PCI configurations */ +vdev = ixgbe_get_virtfn_dev(pdev, vfn); +if (vdev) { +pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, state->command); +pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, state->msix_message_control); +} + +/* Restore states hold by PF */ +memcpy(>vfinfo[vfn], >vf_data, sizeof(struct vf_data_storage)); + + out: +return count; +} Just doing a memcpy to move the vfinfo over adds no value. 
The fact is there are a number of filters that have to be configured in hardware after, and it isn't as simple as just migrating the values stored. Restoring VF status in the PF is triggered by VF driver via new mailbox msg and call ixgbe_restore_setting(). Here just copy data into vfinfo. If configure hardware early, state will be cleared by FLR which is trigger by restoring operation in the VF driver. As I mentioned in the case of the 82598 there is also jumbo frames to take into account. If the first PF didn't have it enabled, but the second one does that implies the state of the VF needs to change to account for that. Yes, that will be a problem and VF driver also need to know this change after migration and reconfigure jumbo frame I really think you would be better off only migrating the data related to what can be configured using the ip link command and leaving other values such as clear_to_send at the reset value of 0. Then you can at least restore state from the VF after just a couple of quick messages. This sounds good. I will try it later. +static struct device_attribute ixgbe_per_state_in_pf_attribute = +__ATTR(state_in_pf, S_IRUGO | S_IWUSR, +ixgbe_show_state_in_pf, ixgbe_store_state_in_pf); + +void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter) +{ +struct pci_dev *pdev = adapter->pdev; +struct pci_dev *vfdev; +unsigned short vf_id; +int pos, ret; + +pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV); +if (!pos) +return; + +/* get the device ID for the VF */ +pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, _id); + +vfdev = pci_get_device(pdev->vendor, vf_id, NULL); + +while (vfdev) { +if (vfdev->is_virtfn) { +ret = device_create_file(>dev, +_per_state_in_pf_attribute); +if (ret) +pr_warn("Unable to add VF attribute for dev %s,\n", +dev_name(>dev)); +} + +vfdev = pci_get_device(pdev->vendor, vf_id, vfdev); +} +} Driver specific sysfs is a no-go. Otherwise we will end up with a different implementation of this for every driver. 
You will need to find a way to make this generic in order to have a hope of getting this to be acceptable.

Yes, Alex Williamson proposed getting/putting the data via the VFIO interface, which would be more general. I will do more research on how to communicate between the PF driver and the VFIO subsystem.
Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
On 10/25/2015 2:03 PM, Alexander Duyck wrote:
> On 10/24/2015 08:43 AM, Lan, Tianyu wrote:
>> On 10/22/2015 4:52 AM, Alexander Duyck wrote:
>>> Also have you even considered the MSI-X configuration on the VF? I
>>> haven't seen anything anywhere that would have migrated the VF's MSI-X
>>> configuration from BAR 3 on one system to the new system.
>>
>> MSI-X migration is done by Hypervisor(Qemu). Following link is my Qemu
>> patch to do that.
>> http://marc.info/?l=kvm=144544706530484=2
>
> I really don't like the idea of trying to migrate the MSI-X across from
> host to host while it is still active. I really think Qemu shouldn't be
> moving this kind of data over in a migration.

Hi Alex:
The VF's MSI-X regs in the VM are faked by Qemu, and Qemu maps the VF's host vectors to the guest's vectors. The migrated MSI-X data is for the faked regs rather than the ones on the host. After migration, Qemu will remap guest vectors to host vectors on the new machine. Moreover, the VM is stopped while the MSI-X data is migrated.

> I think that having the VF do a suspend/resume is the best way to go.
> Then it simplifies things as all you have to deal with is the dirty page
> tracking for the Rx DMA and you should be able to do this without making
> things too difficult.

Yes, that will be simple, and the main concern is service downtime. I will test it later.

> - Alex
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 2015年10月24日 02:36, Alexander Duyck wrote:
> I was thinking about it and I am pretty sure the dummy write approach is
> problematic at best. Specifically the issue is that while you are
> performing a dummy write you risk pulling in descriptors for data that
> hasn't been dummy written to yet. So when you resume and restore your
> descriptors you will have once that may contain Rx descriptors
> indicating they contain data when after the migration they don't.

How about changing the sequence: dummy write the Rx packet data first and then its descriptor. This ensures the Rx data is migrated before its descriptor and prevents such a case.

--
Best regards
Tianyu Lan
Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
On 10/22/2015 4:52 AM, Alexander Duyck wrote:
> Also have you even considered the MSI-X configuration on the VF? I
> haven't seen anything anywhere that would have migrated the VF's MSI-X
> configuration from BAR 3 on one system to the new system.

MSI-X migration is done by the hypervisor (Qemu). The following link is my Qemu patch to do that.
http://marc.info/?l=kvm=144544706530484=2
Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
On 10/22/2015 5:14 AM, Alexander Duyck wrote:
> Where is i being initialized? It was here but you removed it. Are
> you using i without initializing it?

Sorry, the initialization was put into patch 10 by mistake. "i" is assigned from "tx_ring->next_to_clean".
Re: [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device
On 10/22/2015 2:07 AM, Alexander Duyck wrote: On 10/21/2015 09:37 AM, Lan Tianyu wrote: Add "virtfn_index" member in the struct pci_device to record VF sequence of PF. This will be used in the VF sysfs node handle. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- drivers/pci/iov.c | 1 + include/linux/pci.h | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index ee0ebff..065b6bb 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset) virtfn->physfn = pci_dev_get(dev); virtfn->is_virtfn = 1; virtfn->multifunction = 0; +virtfn->virtfn_index = id; for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { res = >resource[i + PCI_IOV_RESOURCES]; diff --git a/include/linux/pci.h b/include/linux/pci.h index 353db8d..85c5531 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -356,6 +356,7 @@ struct pci_dev { unsigned intio_window_1k:1;/* Intel P2P bridge 1K I/O windows */ unsigned intirq_managed:1; pci_dev_flags_t dev_flags; +unsigned intvirtfn_index; atomic_tenable_cnt;/* pci_enable_device has been called */ u32saved_config_space[16]; /* config space saved at suspend time */ Can't you just calculate the VF index based on the VF BDF number combined with the information in the PF BDF number and VF offset/stride? Seems kind of pointless to add a variable that is only used by one driver and is in a slowpath when you can just calculate it pretty quickly. Good suggestion. Will try it. - Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
On 10/22/2015 8:58 PM, Michael S. Tsirkin wrote:
> Do you really need to play the shifting games?
> Can't you just reset everything and re-initialize the rings?
> It's slower but way less intrusive.
> Also removes the need to track writes into rings.

Shifting the ring is to avoid losing the packets already in the ring. This may cause some race conditions, so I introduced a lock to prevent such cases in a later patch.

Yes, resetting everything after migration would make things easy, but as you said it would affect performance and lose more packets. I can run a test later to get data on these two approaches.
[RFC Patch 11/12] IXGBEVF: Migrate VF statistic data
VF statistic regs are read-only and can't be migrated by writing them back directly. The statistic data the driver returns to user space is not the raw value of the statistic regs: the VF driver records the reg values as base data when the net interface is brought up or opened, computes the increase in the regs during the last period of online service, and adds it to the saved_reset data. When user space collects statistics, the VF driver returns "current - base + saved_reset", where "current" is the reg value at that point.

Restoring net function after migration is just like bringing the net interface up or opening it. Call the existing functions to update the base and saved_reset data so the statistic data stays continuous across migration.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 04b6ce7..d22160f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3005,6 +3005,7 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
 		return 0;
 
 	del_timer_sync(&adapter->service_timer);
+	ixgbevf_update_stats(adapter);
 	pr_info("migration start\n");
 	migration_status = MIGRATION_IN_PROGRESS;
@@ -3017,6 +3018,8 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
 		return 1;
 
 	ixgbevf_restore_state(adapter);
+	ixgbevf_save_reset_stats(adapter);
+	ixgbevf_init_last_counter_stats(adapter);
 	migration_status = MIGRATION_COMPLETED;
 	pr_info("migration end\n");
 	return 0;
--
1.8.4.rc0.1.g8f6a3e5.dirty
[RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device
Add a "virtfn_index" member in struct pci_dev to record the VF's sequence number within its PF. This will be used by the VF sysfs node handling.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/pci/iov.c   | 1 +
 include/linux/pci.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..065b6bb 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	virtfn->physfn = pci_dev_get(dev);
 	virtfn->is_virtfn = 1;
 	virtfn->multifunction = 0;
+	virtfn->virtfn_index = id;
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = &dev->resource[i + PCI_IOV_RESOURCES];
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..85c5531 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -356,6 +356,7 @@ struct pci_dev {
 	unsigned int	io_window_1k:1;	/* Intel P2P bridge 1K I/O windows */
 	unsigned int	irq_managed:1;
 	pci_dev_flags_t	dev_flags;
+	unsigned int	virtfn_index;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */
 	u32		saved_config_space[16]; /* config space saved at suspend time */
--
1.8.4.rc0.1.g8f6a3e5.dirty
[RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
82.3 ms
[983769928.256503] 64 bytes from 10.239.48.100: icmp_seq=4150 ttl=64 time=72.2 ms
[983769928.256631] 64 bytes from 10.239.48.100: icmp_seq=4158 ttl=64 time=0.500 ms
[983769928.257284] 64 bytes from 10.239.48.100: icmp_seq=4159 ttl=64 time=0.154 ms
[983769928.258297] 64 bytes from 10.239.48.100: icmp_seq=4160 ttl=64 time=0.165 ms

Todo
===
So far, the patchset isn't perfect. The VF net interface can't be opened, closed, or brought down/up during migration. Such operations will be prevented during migration in a future version.

Your comments are very much appreciated.

Lan Tianyu (12):
  PCI: Add virtfn_index for struct pci_device
  IXGBE: Add new mail box event to restore VF status in the PF driver
  IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver
  IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg.
  IXGBE: Add new sysfs interface of "notify_vf"
  IXGBEVF: Add self emulation layer
  IXGBEVF: Add new mail box event for migration
  IXGBEVF: Rework code of finding the end transmit desc of package
  IXGBEVF: Add live migration support for VF driver
  IXGBEVF: Add lock to protect tx/rx ring operation
  IXGBEVF: Migrate VF statistic data
  IXGBEVF: Track dma dirty pages

 drivers/net/ethernet/intel/ixgbe/ixgbe.h           | 1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h       | 1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c     | 245 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h     | 1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h      | 4 +
 drivers/net/ethernet/intel/ixgbevf/Makefile        | 3 +-
 drivers/net/ethernet/intel/ixgbevf/defines.h       | 6 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h       | 10 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 179 ++-
 drivers/net/ethernet/intel/ixgbevf/mbx.h           | 3 +
 .../net/ethernet/intel/ixgbevf/self-emulation.c    | 133 +++
 drivers/net/ethernet/intel/ixgbevf/vf.c            | 10 +
 drivers/net/ethernet/intel/ixgbevf/vf.h            | 6 +-
 drivers/pci/iov.c                                  | 1 +
 include/linux/pci.h                                | 1 +
 15 files changed, 582 insertions(+), 22 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

--
1.8.4.rc0.1.g8f6a3e5.dirty
[RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
When transmitting a packet, the end transmit descriptor of the packet indicates whether the packet has been sent. Current code records a pointer to the end descriptor in the next_to_watch field of the tx buffer struct. This breaks if the descriptor ring is shifted after migration, because the pointer becomes invalid. This patch replaces the recorded pointer with the descriptor count of the packet, and finds the end descriptor from the first descriptor plus that count.

Signed-off-by: Lan Tianyu <tianyu@intel.com> --- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 1 + drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 --- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index 775d089..c823616 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -54,6 +54,7 @@ */ struct ixgbevf_tx_buffer { union ixgbe_adv_tx_desc *next_to_watch; + u16 desc_num; unsigned long time_stamp; struct sk_buff *skb; unsigned int bytecount; diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 4446916..056841c 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct ixgbevf_ring *tx_ring, DMA_TO_DEVICE); } tx_buffer->next_to_watch = NULL; + tx_buffer->desc_num = 0; tx_buffer->skb = NULL; dma_unmap_len_set(tx_buffer, len, 0); /* tx_buffer must be completely set up in the transmit path */ @@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, union ixgbe_adv_tx_desc *tx_desc; unsigned int total_bytes = 0, total_packets = 0; unsigned int budget = tx_ring->count / 2; - unsigned int i = tx_ring->next_to_clean; + int i, watch_index; if (test_bit(__IXGBEVF_DOWN, >state)) return true; @@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct
ixgbevf_q_vector *q_vector, i -= tx_ring->count; do { - union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch; + union ixgbe_adv_tx_desc *eop_desc; + + if (!tx_buffer->desc_num) + break; + + if (i + tx_buffer->desc_num >= 0) + watch_index = i + tx_buffer->desc_num; + else + watch_index = i + tx_ring->count + tx_buffer->desc_num; - /* if next_to_watch is not set then there is no work pending */ + eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index); if (!eop_desc) break; @@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, /* clear next_to_watch to prevent false hangs */ tx_buffer->next_to_watch = NULL; + tx_buffer->desc_num = 0; /* update the statistics for this packet */ total_bytes += tx_buffer->bytecount; @@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, u32 tx_flags = first->tx_flags; __le32 cmd_type; u16 i = tx_ring->next_to_use; + u16 start; tx_desc = IXGBEVF_TX_DESC(tx_ring, i); @@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, /* set next_to_watch value indicating a packet is present */ first->next_to_watch = tx_desc; + start = first - tx_ring->tx_buffer_info; + first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - start; i++; if (i == tx_ring->count) -- 1.8.4.rc0.1.g8f6a3e5.dirty -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver
To let the VF driver in the guest know the migration status, Qemu fakes PCI config regs 0xF0 and 0xF1 to expose the migration status and to get an ack from the VF driver.

When migration starts, Qemu sets reg 0xF0 to 1, notifies the VF driver by triggering a mailbox msg, and waits for the VF driver to signal that it's ready for migration (by setting reg 0xF1 to 1). After migration, Qemu sets reg 0xF0 to 0 and notifies the VF driver by a mailbox irq. The VF driver begins to restore tx/rx function after detecting the status change.

When the VF receives the mailbox irq, it checks reg 0xF0 in the service task function to get the migration status and performs the related operations according to its value.

Steps of restarting receive and transmit function:
1) Restore VF status in the PF driver by sending a mailbox event to the PF driver
2) Write back reg values recorded by the self emulation layer
3) Restart rx/tx rings
4) Recover interrupts

Transmit/receive descriptor head regs are read-only, can't be restored by writing back recorded reg values directly, and are set to 0 during VF reset. To reuse the original tx/rx rings, shift the descriptor ring so that the descriptor pointed to by the original head reg moves to the first entry of the ring, and then enable the tx/rx rings. The VF restarts receiving and transmitting from the original head descriptor. 
Signed-off-by: Lan Tianyu <tianyu@intel.com> --- drivers/net/ethernet/intel/ixgbevf/defines.h | 6 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 7 +- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 115 - .../net/ethernet/intel/ixgbevf/self-emulation.c| 107 +++ 4 files changed, 232 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h index 770e21a..113efd2 100644 --- a/drivers/net/ethernet/intel/ixgbevf/defines.h +++ b/drivers/net/ethernet/intel/ixgbevf/defines.h @@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc { __le32 mss_l4len_idx; }; +union ixgbevf_desc { + union ixgbe_adv_tx_desc rx_desc; + union ixgbe_adv_rx_desc tx_desc; + struct ixgbe_adv_tx_context_desc tx_context_desc; +}; + /* Adv Transmit Descriptor Config Masks */ #define IXGBE_ADVTXD_DTYP_MASK 0x00F0 /* DTYP mask */ #define IXGBE_ADVTXD_DTYP_CTXT 0x0020 /* Advanced Context Desc */ diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index c823616..6eab402e 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -109,7 +109,7 @@ struct ixgbevf_ring { struct ixgbevf_ring *next; struct net_device *netdev; struct device *dev; - void *desc; /* descriptor ring memory */ + union ixgbevf_desc *desc; /* descriptor ring memory */ dma_addr_t dma; /* phys. 
address of descriptor ring */ unsigned int size; /* length in bytes */ u16 count; /* amount of descriptors */ @@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector); void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter); void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter); +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head); +int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head); +void ixgbevf_restore_state(struct ixgbevf_adapter *adapter); +inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter); + #ifdef DEBUG char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw); diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 056841c..15ec361 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function Network Driver"); MODULE_LICENSE("GPL"); MODULE_VERSION(DRV_VERSION); + +#define MIGRATION_COMPLETED 0x00 +#define MIGRATION_IN_PROGRESS 0x01 + #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK) static int debug = -1; module_param(debug, int, 0); @@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring) return ring->stats.packets; } +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) +{ + struct ixgbevf_tx_buffer *tx_buffer = NULL; + static union ixgbevf_desc *tx_desc = NULL; + + tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count)); + if (!tx_buffer) + return -ENOMEM; + + tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count); + if (!tx_desc) + return -ENOMEM; + + memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count); + memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - hea
[RFC Patch 04/12] IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg.
This patch is to add ixgbe_ping_vf() to notify a specified VF. When migration status is changed, it's necessary to notify VF the change. VF driver will check the migrate status when it gets mailbox msg. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 19 --- drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h | 1 + 2 files changed, 13 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index 89671eb..e247d67 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -1318,18 +1318,23 @@ void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter) IXGBE_WRITE_REG(hw, IXGBE_VFRE(1), 0); } -void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter) +void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn) { struct ixgbe_hw *hw = >hw; u32 ping; + + ping = IXGBE_PF_CONTROL_MSG; + if (adapter->vfinfo[vfn].clear_to_send) + ping |= IXGBE_VT_MSGTYPE_CTS; + ixgbe_write_mbx(hw, , 1, vfn); +} + +void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter) +{ int i; - for (i = 0 ; i < adapter->num_vfs; i++) { - ping = IXGBE_PF_CONTROL_MSG; - if (adapter->vfinfo[i].clear_to_send) - ping |= IXGBE_VT_MSGTYPE_CTS; - ixgbe_write_mbx(hw, , 1, i); - } + for (i = 0 ; i < adapter->num_vfs; i++) + ixgbe_ping_vf(adapter, i); } int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h index 2c197e6..143e2fd 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h @@ -41,6 +41,7 @@ void ixgbe_msg_task(struct ixgbe_adapter *adapter); int ixgbe_vf_configuration(struct pci_dev *pdev, unsigned int event_mask); void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter); void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter); +void ixgbe_ping_vf(struct 
ixgbe_adapter *adapter, int vfn); int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int queue, u8 *mac); int ixgbe_ndo_set_vf_vlan(struct net_device *netdev, int queue, u16 vlan, u8 qos); -- 1.8.4.rc0.1.g8f6a3e5.dirty -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC Patch 07/12] IXGBEVF: Add new mail box event for migration
VF status in the PF driver needs to be restored after migration and reset VF hardware. This patch is to add a new event for VF driver to notify PF driver to restore status. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- drivers/net/ethernet/intel/ixgbevf/mbx.h | 3 +++ drivers/net/ethernet/intel/ixgbevf/vf.c | 10 ++ drivers/net/ethernet/intel/ixgbevf/vf.h | 1 + 3 files changed, 14 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbevf/mbx.h b/drivers/net/ethernet/intel/ixgbevf/mbx.h index 82f44e0..22761d8 100644 --- a/drivers/net/ethernet/intel/ixgbevf/mbx.h +++ b/drivers/net/ethernet/intel/ixgbevf/mbx.h @@ -112,6 +112,9 @@ enum ixgbe_pfvf_api_rev { #define IXGBE_VF_GET_RETA 0x0a/* VF request for RETA */ #define IXGBE_VF_GET_RSS_KEY 0x0b/* get RSS hash key */ +/* mail box event for live migration */ +#define IXGBE_VF_NOTIFY_RESUME 0x0c /* VF notify PF migration to restore status */ + /* length of permanent address message returned from PF */ #define IXGBE_VF_PERMADDR_MSG_LEN 4 /* word in permanent address message with the current multicast type */ diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c b/drivers/net/ethernet/intel/ixgbevf/vf.c index d1339b0..1e4e5e6 100644 --- a/drivers/net/ethernet/intel/ixgbevf/vf.c +++ b/drivers/net/ethernet/intel/ixgbevf/vf.c @@ -717,6 +717,15 @@ int ixgbevf_get_queues(struct ixgbe_hw *hw, unsigned int *num_tcs, return err; } +static void ixgbevf_notify_resume_vf(struct ixgbe_hw *hw) +{ + struct ixgbe_mbx_info *mbx = >mbx; + u32 msgbuf[1]; + + msgbuf[0] = IXGBE_VF_NOTIFY_RESUME; + mbx->ops.write_posted(hw, msgbuf, 1); +} + static const struct ixgbe_mac_operations ixgbevf_mac_ops = { .init_hw= ixgbevf_init_hw_vf, .reset_hw = ixgbevf_reset_hw_vf, @@ -729,6 +738,7 @@ static const struct ixgbe_mac_operations ixgbevf_mac_ops = { .update_mc_addr_list= ixgbevf_update_mc_addr_list_vf, .set_uc_addr= ixgbevf_set_uc_addr_vf, .set_vfta = ixgbevf_set_vfta_vf, + .notify_resume = ixgbevf_notify_resume_vf, }; const struct ixgbevf_info 
ixgbevf_82599_vf_info = { diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h index 6a3f4eb..a25fe81 100644 --- a/drivers/net/ethernet/intel/ixgbevf/vf.h +++ b/drivers/net/ethernet/intel/ixgbevf/vf.h @@ -70,6 +70,7 @@ struct ixgbe_mac_operations { s32 (*disable_mc)(struct ixgbe_hw *); s32 (*clear_vfta)(struct ixgbe_hw *); s32 (*set_vfta)(struct ixgbe_hw *, u32, u32, bool); + void (*notify_resume)(struct ixgbe_hw *); }; enum ixgbe_mac_type { -- 1.8.4.rc0.1.g8f6a3e5.dirty -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 1/3] Qemu: Add pci-assign.h to share functions and struct definition with new file
Signed-off-by: Lan Tianyu <tianyu@intel.com> --- hw/i386/kvm/pci-assign.c | 111 ++- hw/i386/kvm/pci-assign.h | 109 ++ 2 files changed, 112 insertions(+), 108 deletions(-) create mode 100644 hw/i386/kvm/pci-assign.h diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c index 74d22f4..616532d 100644 --- a/hw/i386/kvm/pci-assign.c +++ b/hw/i386/kvm/pci-assign.c @@ -37,112 +37,7 @@ #include "hw/pci/pci.h" #include "hw/pci/msi.h" #include "kvm_i386.h" - -#define MSIX_PAGE_SIZE 0x1000 - -/* From linux/ioport.h */ -#define IORESOURCE_IO 0x0100 /* Resource type */ -#define IORESOURCE_MEM 0x0200 -#define IORESOURCE_IRQ 0x0400 -#define IORESOURCE_DMA 0x0800 -#define IORESOURCE_PREFETCH 0x2000 /* No side effects */ -#define IORESOURCE_MEM_64 0x0010 - -//#define DEVICE_ASSIGNMENT_DEBUG - -#ifdef DEVICE_ASSIGNMENT_DEBUG -#define DEBUG(fmt, ...) \ -do { \ -fprintf(stderr, "%s: " fmt, __func__ , __VA_ARGS__); \ -} while (0) -#else -#define DEBUG(fmt, ...) -#endif - -typedef struct PCIRegion { -int type; /* Memory or port I/O */ -int valid; -uint64_t base_addr; -uint64_t size;/* size of the region */ -int resource_fd; -} PCIRegion; - -typedef struct PCIDevRegions { -uint8_t bus, dev, func; /* Bus inside domain, device and function */ -int irq;/* IRQ number */ -uint16_t region_number; /* number of active regions */ - -/* Port I/O or MMIO Regions */ -PCIRegion regions[PCI_NUM_REGIONS - 1]; -int config_fd; -} PCIDevRegions; - -typedef struct AssignedDevRegion { -MemoryRegion container; -MemoryRegion real_iomem; -union { -uint8_t *r_virtbase; /* mmapped access address for memory regions */ -uint32_t r_baseport; /* the base guest port for I/O regions */ -} u; -pcibus_t e_size;/* emulated size of region in bytes */ -pcibus_t r_size;/* real size of region in bytes */ -PCIRegion *region; -} AssignedDevRegion; - -#define ASSIGNED_DEVICE_PREFER_MSI_BIT 0 -#define ASSIGNED_DEVICE_SHARE_INTX_BIT 1 - -#define ASSIGNED_DEVICE_PREFER_MSI_MASK (1 << ASSIGNED_DEVICE_PREFER_MSI_BIT) 
-#define ASSIGNED_DEVICE_SHARE_INTX_MASK (1 << ASSIGNED_DEVICE_SHARE_INTX_BIT) - -typedef struct MSIXTableEntry { -uint32_t addr_lo; -uint32_t addr_hi; -uint32_t data; -uint32_t ctrl; -} MSIXTableEntry; - -typedef enum AssignedIRQType { -ASSIGNED_IRQ_NONE = 0, -ASSIGNED_IRQ_INTX_HOST_INTX, -ASSIGNED_IRQ_INTX_HOST_MSI, -ASSIGNED_IRQ_MSI, -ASSIGNED_IRQ_MSIX -} AssignedIRQType; - -typedef struct AssignedDevice { -PCIDevice dev; -PCIHostDeviceAddress host; -uint32_t dev_id; -uint32_t features; -int intpin; -AssignedDevRegion v_addrs[PCI_NUM_REGIONS - 1]; -PCIDevRegions real_device; -PCIINTxRoute intx_route; -AssignedIRQType assigned_irq_type; -struct { -#define ASSIGNED_DEVICE_CAP_MSI (1 << 0) -#define ASSIGNED_DEVICE_CAP_MSIX (1 << 1) -uint32_t available; -#define ASSIGNED_DEVICE_MSI_ENABLED (1 << 0) -#define ASSIGNED_DEVICE_MSIX_ENABLED (1 << 1) -#define ASSIGNED_DEVICE_MSIX_MASKED (1 << 2) -uint32_t state; -} cap; -uint8_t emulate_config_read[PCI_CONFIG_SPACE_SIZE]; -uint8_t emulate_config_write[PCI_CONFIG_SPACE_SIZE]; -int msi_virq_nr; -int *msi_virq; -MSIXTableEntry *msix_table; -hwaddr msix_table_addr; -uint16_t msix_max; -MemoryRegion mmio; -char *configfd_name; -int32_t bootindex; -} AssignedDevice; - -#define TYPE_PCI_ASSIGN "kvm-pci-assign" -#define PCI_ASSIGN(obj) OBJECT_CHECK(AssignedDevice, (obj), TYPE_PCI_ASSIGN) +#include "pci-assign.h" static void assigned_dev_update_irq_routing(PCIDevice *dev); @@ -1044,7 +939,7 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry) * sure the physical MSI-X state tracks the guest's view, which is important * for some VF/PF and PF/fw communication channels. 
*/ -static bool assigned_dev_msix_skipped(MSIXTableEntry *entry) +bool assigned_dev_msix_skipped(MSIXTableEntry *entry) { return !entry->data; } @@ -1114,7 +1009,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev) return r; } -static void assigned_dev_update_msix(PCIDevice *pci_dev) +void assigned_dev_update_msix(PCIDevice *pci_dev) { AssignedDevice *assigned_dev = PCI_ASSIGN(pci_dev); uint16_t ctrl_word = pci_get_word(pci_dev->config + pci_dev->msix_cap + diff --git a/hw/i386/kvm/pci-assign.h b/hw/i386/kvm/pci-assign.h new file mode 100644 index 000..91d00ea --- /dev/null +++ b/hw/i386/kvm/pci-assign.h @@ -0,0 +1,109 @@ +#define MSIX_PAGE_SIZE 0x1
[RFC PATCH 0/3] Qemu/IXGBE: Add live migration support for SRIOV NIC
This patchset is Qemu part for live migration support for SRIOV NIC. kernel part patch information is in the following link. http://marc.info/?l=kvm=144544635330193=2 Lan Tianyu (3): Qemu: Add pci-assign.h to share functions and struct definition with new file Qemu: Add post_load_state() to run after restoring CPU state Qemu: Introduce pci-sriov device type to support VF live migration hw/i386/kvm/Makefile.objs | 2 +- hw/i386/kvm/pci-assign.c| 113 +-- hw/i386/kvm/pci-assign.h| 109 +++ hw/i386/kvm/sriov.c | 213 include/migration/vmstate.h | 2 + migration/savevm.c | 15 6 files changed, 344 insertions(+), 110 deletions(-) create mode 100644 hw/i386/kvm/pci-assign.h create mode 100644 hw/i386/kvm/sriov.c -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC Patch 06/12] IXGBEVF: Add self emulation layer
In order to restore VF function after migration, add self emulation layer to record regs' values during accessing regs. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- drivers/net/ethernet/intel/ixgbevf/Makefile| 3 ++- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++ drivers/net/ethernet/intel/ixgbevf/vf.h| 5 - 4 files changed, 33 insertions(+), 3 deletions(-) create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile b/drivers/net/ethernet/intel/ixgbevf/Makefile index 4ce4c97..841c884 100644 --- a/drivers/net/ethernet/intel/ixgbevf/Makefile +++ b/drivers/net/ethernet/intel/ixgbevf/Makefile @@ -31,7 +31,8 @@ obj-$(CONFIG_IXGBEVF) += ixgbevf.o -ixgbevf-objs := vf.o \ +ixgbevf-objs := self-emulation.o \ + vf.o \ mbx.o \ ethtool.o \ ixgbevf_main.o diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index a16d267..4446916 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg) if (IXGBE_REMOVED(reg_addr)) return IXGBE_FAILED_READ_REG; - value = readl(reg_addr + reg); + value = ixgbe_self_emul_readl(reg_addr, reg); if (unlikely(value == IXGBE_FAILED_READ_REG)) ixgbevf_check_remove(hw, reg); return value; diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c new file mode 100644 index 000..d74b2da --- /dev/null +++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c @@ -0,0 +1,26 @@ +#include +#include +#include +#include +#include + +#include "vf.h" +#include "ixgbevf.h" + +static u32 hw_regs[0x4000]; + +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr) +{ + u32 tmp; + + tmp = readl(base + addr); + hw_regs[(unsigned long)addr] = tmp; + + return tmp; +} + +void 
ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32 addr) +{ + hw_regs[(unsigned long)addr] = val; + writel(val, (volatile void __iomem *)(base + addr)); +} diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h index d40f036..6a3f4eb 100644 --- a/drivers/net/ethernet/intel/ixgbevf/vf.h +++ b/drivers/net/ethernet/intel/ixgbevf/vf.h @@ -39,6 +39,9 @@ struct ixgbe_hw; +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr); +void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32 addr); + /* iterator type for walking multicast address lists */ typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr, u32 *vmdq); @@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 reg, u32 value) if (IXGBE_REMOVED(reg_addr)) return; - writel(value, reg_addr + reg); + ixgbe_self_emul_writel(value, reg_addr, reg); } #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v) -- 1.8.4.rc0.1.g8f6a3e5.dirty -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 2/3] Qemu: Add post_load_state() to run after restoring CPU state
After migration, Qemu needs to trigger mailbox irq to notify VF driver in the guest about status change. The irq delivery restarts to work after restoring CPU state. This patch is to add new callback to run after restoring CPU state and provide a way to trigger mailbox irq later. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- include/migration/vmstate.h | 2 ++ migration/savevm.c | 15 +++ 2 files changed, 17 insertions(+) diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h index 0695d7c..dc681a6 100644 --- a/include/migration/vmstate.h +++ b/include/migration/vmstate.h @@ -56,6 +56,8 @@ typedef struct SaveVMHandlers { int (*save_live_setup)(QEMUFile *f, void *opaque); uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size); +/* This runs after restoring CPU related state */ +void (*post_load_state)(void *opaque); LoadStateHandler *load_state; } SaveVMHandlers; diff --git a/migration/savevm.c b/migration/savevm.c index 9e0e286..48b6223 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -702,6 +702,20 @@ bool qemu_savevm_state_blocked(Error **errp) return false; } +void qemu_savevm_post_load(void) +{ +SaveStateEntry *se; + +QTAILQ_FOREACH(se, _state.handlers, entry) { +if (!se->ops || !se->ops->post_load_state) { +continue; +} + +se->ops->post_load_state(se->opaque); +} +} + + void qemu_savevm_state_header(QEMUFile *f) { trace_savevm_state_header(); @@ -1140,6 +1154,7 @@ int qemu_loadvm_state(QEMUFile *f) } cpu_synchronize_all_post_init(); +qemu_savevm_post_load(); ret = 0; -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation
Ring shifting during restoring VF function maybe race with original ring operation(transmit/receive package). This patch is to add tx/rx lock to protect ring related data. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 2 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 --- 2 files changed, 27 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index 6eab402e..3a748c8 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -448,6 +448,8 @@ struct ixgbevf_adapter { spinlock_t mbx_lock; unsigned long last_reset; + spinlock_t mg_rx_lock; + spinlock_t mg_tx_lock; }; enum ixbgevf_state_t { diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 15ec361..04b6ce7 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring) int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) { + struct ixgbevf_adapter *adapter = netdev_priv(r->netdev); struct ixgbevf_tx_buffer *tx_buffer = NULL; static union ixgbevf_desc *tx_desc = NULL; + unsigned long flags; tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count)); if (!tx_buffer) @@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) if (!tx_desc) return -ENOMEM; + spin_lock_irqsave(>mg_tx_lock, flags); memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count); memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - head)); memcpy(>desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head); @@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) else r->next_to_use += (r->count - head); + spin_unlock_irqrestore(>mg_tx_lock, flags); + 
vfree(tx_buffer); vfree(tx_desc); return 0; @@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) { + struct ixgbevf_adapter *adapter = netdev_priv(r->netdev); struct ixgbevf_rx_buffer *rx_buffer = NULL; static union ixgbevf_desc *rx_desc = NULL; + unsigned long flags; rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count)); if (!rx_buffer) @@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) if (!rx_desc) return -ENOMEM; + spin_lock_irqsave(>mg_rx_lock, flags); memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count)); memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - head)); memcpy(>desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head); @@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) r->next_to_use -= head; else r->next_to_use += (r->count - head); + spin_unlock_irqrestore(>mg_rx_lock, flags); vfree(rx_buffer); vfree(rx_desc); @@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, if (test_bit(__IXGBEVF_DOWN, >state)) return true; + spin_lock(>mg_tx_lock); + i = tx_ring->next_to_clean; tx_buffer = _ring->tx_buffer_info[i]; tx_desc = IXGBEVF_TX_DESC(tx_ring, i); i -= tx_ring->count; @@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, q_vector->tx.total_bytes += total_bytes; q_vector->tx.total_packets += total_packets; + spin_unlock(>mg_tx_lock); + if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) { struct ixgbe_hw *hw = >hw; union ixgbe_adv_tx_desc *eop_desc; @@ -999,10 +1012,12 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector, struct ixgbevf_ring *rx_ring, int budget) { + struct ixgbevf_adapter *adapter = netdev_priv(rx_ring->netdev); unsigned int total_rx_bytes = 0, total_rx_packets = 0; u16 cleaned_count = ixgbevf_desc_unused(rx_ring); struct sk_buff *skb = 
rx_ring->skb; + spin_lock(>mg_rx_lock); while (likely(total_rx_packets < budget)) { union ixgbe_adv_rx_desc *rx_desc; @@ -1078,6 +1093,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector, q_vec
[RFC Patch 12/12] IXGBEVF: Track dma dirty pages
Migration relies on dirty-page tracking to migrate memory, but hardware can't mark a page dirty after writing to it via DMA. The VF descriptor rings and data buffers are modified by hardware when receiving and transmitting data. To track such dirty memory manually, do dummy writes (read a byte and write it back) while receiving and transmitting data.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index d22160f..ce7bd7a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -414,6 +414,9 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 		if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
 			break;
 
+		/* write back status to mark page dirty */
+		eop_desc->wb.status = eop_desc->wb.status;
+
 		/* clear next_to_watch to prevent false hangs */
 		tx_buffer->next_to_watch = NULL;
 		tx_buffer->desc_num = 0;
@@ -946,15 +949,17 @@ static struct sk_buff *ixgbevf_fetch_rx_buffer(struct ixgbevf_ring *rx_ring,
 {
 	struct ixgbevf_rx_buffer *rx_buffer;
 	struct page *page;
+	u8 *page_addr;
 
 	rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
 	page = rx_buffer->page;
 	prefetchw(page);
 
-	if (likely(!skb)) {
-		void *page_addr = page_address(page) +
-				  rx_buffer->page_offset;
+	/* Mark page dirty */
+	page_addr = page_address(page) + rx_buffer->page_offset;
+	*page_addr = *page_addr;
 
+	if (likely(!skb)) {
 		/* prefetch first cache line of first page */
 		prefetch(page_addr);
 #if L1_CACHE_BYTES < 128
@@ -1032,6 +1037,9 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
 		if (!ixgbevf_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
 			break;
 
+		/* Write back status to mark page dirty */
+		rx_desc->wb.upper.status_error = rx_desc->wb.upper.status_error;
+
 		/* This memory barrier is needed to keep us from reading
 		 * any other fields out of the rx_desc until we know the
 		 * RXD_STAT_DD bit is set
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
[RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver
This patch is to restore VF status in the PF driver when get event from VF. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- drivers/net/ethernet/intel/ixgbe/ixgbe.h | 1 + drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h | 1 + drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++ 3 files changed, 42 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index 636f9e3..9d5669a 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -148,6 +148,7 @@ struct vf_data_storage { bool pf_set_mac; u16 pf_vlan; /* When set, guest VLAN config not allowed. */ u16 pf_qos; + u32 vf_lpe; u16 tx_rate; u16 vlan_count; u8 spoofchk_enabled; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h index b1e4703..8fdb38d 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h @@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev { /* mailbox API, version 1.1 VF requests */ #define IXGBE_VF_GET_QUEUES0x09 /* get queue configuration */ +#define IXGBE_VF_NOTIFY_RESUME0x0c /* VF notify PF migration finishing */ /* GET_QUEUES return data indices within the mailbox */ #define IXGBE_VF_TX_QUEUES 1 /* number of Tx queues supported */ diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index 1d17b58..ab2a2e2 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter *adapter, u32 vf, } } +/** + * Restore the settings by mailbox, after migration + **/ +void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf) +{ + struct ixgbe_hw *hw = >hw; + u32 reg, reg_offset, vf_shift; + int rar_entry = hw->mac.num_rar_entries - (vf + 1); + + vf_shift = vf % 32; + reg_offset = vf / 32; + + /* enable transmit and receive for vf */ 
+ reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset)); + reg |= (1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset)); + reg |= (1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset)); + reg |= (1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg); + + ixgbe_vf_reset_event(adapter, vf); + + hw->mac.ops.set_rar(hw, rar_entry, + adapter->vfinfo[vf].vf_mac_addresses, + vf, IXGBE_RAH_AV); + + + if (adapter->vfinfo[vf].vf_lpe) + ixgbe_set_vf_lpe(adapter, >vfinfo[vf].vf_lpe, vf); +} + static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf) { struct ixgbe_ring_feature *vmdq = >ring_feature[RING_F_VMDQ]; @@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf) break; case IXGBE_VF_SET_LPE: retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf); + adapter->vfinfo[vf].vf_lpe = *msgbuf; break; case IXGBE_VF_SET_MACVLAN: retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf); @@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf) case IXGBE_VF_GET_RSS_KEY: retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf); break; + case IXGBE_VF_NOTIFY_RESUME: + ixgbe_restore_setting(adapter, vf); + break; default: e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]); retval = IXGBE_ERR_MBX; -- 1.8.4.rc0.1.g8f6a3e5.dirty -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver
This patch is to add sysfs interface state_in_pf under sysfs directory of VF PCI device for Qemu to get and put VF status in the PF driver during migration. Signed-off-by: Lan Tianyu <tianyu@intel.com> --- drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 - 1 file changed, 155 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index ab2a2e2..89671eb 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter) return -ENOMEM; } +#define IXGBE_PCI_VFCOMMAND 0x4 +#define IXGBE_PCI_VFMSIXMC0x72 +#define IXGBE_SRIOV_VF_OFFSET 0x180 +#define IXGBE_SRIOV_VF_STRIDE 0x2 + +#define to_adapter(dev) ((struct ixgbe_adapter *)(pci_get_drvdata(to_pci_dev(dev)->physfn))) + +struct state_in_pf { + u16 command; + u16 msix_message_control; + struct vf_data_storage vf_data; +}; + +static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn) +{ + u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * vfn; + return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff); +} + +static ssize_t ixgbe_show_state_in_pf(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct ixgbe_adapter *adapter = to_adapter(dev); + struct pci_dev *pdev = adapter->pdev, *vdev; + struct pci_dev *vf_pdev = to_pci_dev(dev); + struct ixgbe_hw *hw = >hw; + struct state_in_pf *state = (struct state_in_pf *)buf; + int vfn = vf_pdev->virtfn_index; + u32 reg, reg_offset, vf_shift; + + /* Clear VF mac and disable VF */ + ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, vfn); + + /* Record PCI configurations */ + vdev = ixgbe_get_virtfn_dev(pdev, vfn); + if (vdev) { + pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, >command); + pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, >msix_message_control); + } + else + 
printk(KERN_WARNING "Unable to find VF device.\n"); + + /* Record states hold by PF */ + memcpy(>vf_data, >vfinfo[vfn], sizeof(struct vf_data_storage)); + + vf_shift = vfn % 32; + reg_offset = vfn / 32; + + reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg); + + return sizeof(struct state_in_pf); +} + +static ssize_t ixgbe_store_state_in_pf(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct ixgbe_adapter *adapter = to_adapter(dev); + struct pci_dev *pdev = adapter->pdev, *vdev; + struct pci_dev *vf_pdev = to_pci_dev(dev); + struct state_in_pf *state = (struct state_in_pf *)buf; + int vfn = vf_pdev->virtfn_index; + + /* Check struct size */ + if (count != sizeof(struct state_in_pf)) { + printk(KERN_ERR "State in PF size does not fit.\n"); + goto out; + } + + /* Restore PCI configurations */ + vdev = ixgbe_get_virtfn_dev(pdev, vfn); + if (vdev) { + pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, state->command); + pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, state->msix_message_control); + } + + /* Restore states hold by PF */ + memcpy(>vfinfo[vfn], >vf_data, sizeof(struct vf_data_storage)); + + out: + return count; +} + +static struct device_attribute ixgbe_per_state_in_pf_attribute = + __ATTR(state_in_pf, S_IRUGO | S_IWUSR, + ixgbe_show_state_in_pf, ixgbe_store_state_in_pf); + +void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter) +{ + struct pci_dev *pdev = adapter->pdev; + struct pci_dev *vfdev; + unsigned short vf_id; + int pos, ret; + + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV); + if (!pos) + return; + + /* get the device ID for the VF */ + 
pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, _id); + + vfdev = pci_get_device(pdev->vendor, vf_id, NULL); + + while (vfdev) { + if (vfdev->is_virtfn) { + ret = device_create_file(>dev, +
[PATCH] KVM: Avoid the "user requested TSC rate below hardware speed" warning when creating a VM.
KVM populates max_tsc_khz with tsc_khz at the arch init stage on constant-TSC machines and creates VMs with max_tsc_khz as the TSC rate. However, tsc_khz may change later, while the tsc clocksource driver refines its calibration. This can create a VM with a slower TSC than the host and produce the following warning. To fix the issue, compare max_tsc_khz with tsc_khz when creating a VM and update max_tsc_khz with the new value of tsc_khz if it has changed.

[   94.916906] ------------[ cut here ]------------
[   94.922127] WARNING: CPU: 0 PID: 824 at arch/x86/kvm/vmx.c:2272 vmx_set_tsc_khz+0x3e/0x50()
[   94.931503] user requested TSC rate below hardware speed
[   94.937470] Modules linked in:
[   94.940923] CPU: 0 PID: 824 Comm: qemu-system-x86 Tainted: G D W 4.1.0-rc3+ #4
[   94.960350] 81f453f8 88027e9f3bc8 81b5eb8a
[   94.968721] 88027e9f3c18 88027e9f3c08 810e6f8a 8802
[   94.977103] 001d3300 88027e98 0001 88027e98
[   94.985476] Call Trace:
[   94.988242] [81b5eb8a] dump_stack+0x45/0x57
[   94.994020] [810e6f8a] warn_slowpath_common+0x8a/0xc0
[   95.000772] [810e7006] warn_slowpath_fmt+0x46/0x50
[   95.007222] [8104676e] vmx_set_tsc_khz+0x3e/0x50
[   95.013488] [810112f7] kvm_set_tsc_khz.part.106+0xa7/0xe0
[   95.020618] [8101e628] kvm_arch_vcpu_init+0x208/0x230
[   95.027373] [81003bf9] kvm_vcpu_init+0xc9/0x110
[   95.033540] [81049fd0] vmx_create_vcpu+0x70/0xc30
[   95.039911] [81049f80] ? vmx_create_vcpu+0x20/0xc30
[   95.046476] [8101dc9e] kvm_arch_vcpu_create+0x3e/0x60
[   95.053233] [81009f00] kvm_vm_ioctl+0x1a0/0x770
[   95.059400] [8129f395] ? __fget+0x5/0x200
[   95.064991] [8115b85f] ? rcu_irq_exit+0x7f/0xd0
[   95.071157] [81293448] do_vfs_ioctl+0x308/0x540
[   95.077323] [8129f301] ? expand_files+0x1f1/0x280
[   95.083684] [8147836b] ? selinux_file_ioctl+0x5b/0x100
[   95.090538] [81293701] SyS_ioctl+0x81/0xa0
[   95.096218] [81b6a72e] system_call_fastpath+0x12/0x76
[   95.102974] ---[ end trace 08ade884081d9dd7 ]---

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99861
Signed-off-by: Lan Tianyu tianyu@intel.com
---
 arch/x86/kvm/x86.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43f0df7..6c7fefe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7814,6 +7814,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
 	struct page *page;
 	struct kvm *kvm;
+	int cpu;
 	int r;
 
 	BUG_ON(vcpu->kvm == NULL);
@@ -7833,6 +7834,21 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	}
 	vcpu->arch.pio_data = page_address(page);
 
+	/*
+	 * max_tsc_khz records tsc_khz at the arch init stage on a constant
+	 * tsc machine. However, tsc_khz may be changed while the tsc
+	 * clocksource driver refines calibration. This can create a VM with
+	 * a slow tsc and produce a warning. To avoid such a case, check
+	 * whether tsc_khz has been changed here and update max_tsc_khz with
+	 * the new value of tsc_khz if so.
+	 */
+	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && max_tsc_khz != tsc_khz) {
+		max_tsc_khz = tsc_khz;
+		pr_debug("kvm: max_tsc_khz is changed to %ld\n", max_tsc_khz);
+		for_each_online_cpu(cpu)
+			smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
+	}
+
 	kvm_set_tsc_khz(vcpu, max_tsc_khz);
 
 	r = kvm_mmu_create(vcpu);
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] kvm tools: Use kernel error check functions
Add a compiler.h file to support using the kernel error-check functions (e.g. ERR_PTR, PTR_ERR and so on).

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/include/linux/compiler.h | 16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)
 create mode 100644 tools/kvm/include/linux/compiler.h

diff --git a/tools/kvm/include/linux/compiler.h b/tools/kvm/include/linux/compiler.h
new file mode 100644
index 0000000..bd360c2
--- /dev/null
+++ b/tools/kvm/include/linux/compiler.h
@@ -0,0 +1,16 @@
+#ifndef _PERF_LINUX_COMPILER_H_
+#define _PERF_LINUX_COMPILER_H_
+
+#ifndef __always_inline
+#define __always_inline	inline
+#endif
+#define __user
+#define __attribute_const__
+
+#define __used		__attribute__((__unused__))
+#define __iomem
+#define __force
+#define __must_check
+#define unlikely
+
+#endif
-- 
1.7.6.rc2.8.g28eb
[PATCH 2/2] kvm tools, qcow: Add support for growing refcount blocks
This patch enables allocating new refcount blocks, so kvm tools can grow a qcow2 image much larger.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c | 111 ++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 94 insertions(+), 17 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index e139fa5..b2dae66 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -12,14 +12,20 @@
 #include <string.h>
 #include <unistd.h>
 #include <fcntl.h>
+#include <errno.h>
 #ifdef CONFIG_HAS_ZLIB
 #include <zlib.h>
 #endif
 
+#include <linux/err.h>
 #include <linux/byteorder.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append);
+static int qcow_write_refcount_table(struct qcow *q);
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref);
+static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size);
 
 static inline int qcow_pwrite_sync(int fd,
 		void *buf, size_t count, off_t offset)
@@ -657,6 +663,58 @@ static struct qcow_refcount_block *refcount_block_search(struct qcow *q, u64 off
 	return rfb;
 }
 
+static struct qcow_refcount_block *qcow_grow_refcount_block(struct qcow *q,
+	u64 clust_idx)
+{
+	struct qcow_header *header = q->header;
+	struct qcow_refcount_table *rft = &q->refcount_table;
+	struct qcow_refcount_block *rfb;
+	u64 new_block_offset;
+	u64 rft_idx;
+
+	rft_idx = clust_idx >> (header->cluster_bits -
+		QCOW_REFCOUNT_BLOCK_SHIFT);
+
+	if (rft_idx >= rft->rf_size) {
+		pr_warning("Don't support grow refcount block table");
+		return NULL;
+	}
+
+	new_block_offset = qcow_alloc_clusters(q, q->cluster_size, 0);
+	if (new_block_offset < 0)
+		return NULL;
+
+	rfb = new_refcount_block(q, new_block_offset);
+	if (!rfb)
+		return NULL;
+
+	memset(rfb->entries, 0x00, q->cluster_size);
+	rfb->dirty = 1;
+
+	/* write refcount block */
+	if (write_refcount_block(q, rfb) < 0)
+		goto free_rfb;
+
+	if (cache_refcount_block(q, rfb) < 0)
+		goto free_rfb;
+
+	rft->rf_table[rft_idx] = cpu_to_be64(new_block_offset);
+	if (update_cluster_refcount(q, new_block_offset >>
+	    header->cluster_bits, 1) < 0)
+		goto recover_rft;
+
+	if (qcow_write_refcount_table(q) < 0)
+		goto recover_rft;
+
+	return rfb;
+
+recover_rft:
+	rft->rf_table[rft_idx] = 0;
+free_rfb:
+	free(rfb);
+	return NULL;
+}
+
 static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 clust_idx)
 {
 	struct qcow_header *header = q->header;
@@ -667,14 +725,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 	rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
 	if (rft_idx >= rft->rf_size)
-		return NULL;
+		return ERR_PTR(-ENOSPC);
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
-
-	if (!rfb_offset) {
-		pr_warning("Don't support to grow refcount table");
-		return NULL;
-	}
+	if (!rfb_offset)
+		return ERR_PTR(-ENOSPC);
 
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
@@ -705,7 +760,9 @@ static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
 	u64 rfb_idx;
 
 	rfb = qcow_read_refcount_block(q, clust_idx);
-	if (!rfb) {
+	if (PTR_ERR(rfb) == -ENOSPC)
+		return 0;
+	else if (IS_ERR_OR_NULL(rfb)) {
 		pr_warning("Error while reading refcount table");
 		return -1;
 	}
@@ -729,7 +786,13 @@ static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
 	u64 rfb_idx;
 
 	rfb = qcow_read_refcount_block(q, clust_idx);
-	if (!rfb) {
+	if (PTR_ERR(rfb) == -ENOSPC) {
+		rfb = qcow_grow_refcount_block(q, clust_idx);
+		if (!rfb) {
+			pr_warning("error while growing refcount table");
+			return -1;
+		}
+	} else if (IS_ERR_OR_NULL(rfb)) {
 		pr_warning("error while reading refcount table");
 		return -1;
 	}
@@ -774,11 +837,11 @@ static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size)
  * can satisfy the size. free_clust_idx is initialized to zero and
  * Record last position.
  */
-static u64 qcow_alloc_clusters(struct qcow *q, u64 size)
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref)
 {
 	struct qcow_header *header = q->header;
 	u16 clust_refcount;
-	u32 clust_idx, i;
+	u32 clust_idx = 0, i;
 	u64 clust_num;
 
 	clust_num = (size + (q->cluster_size - 1)) >> header->cluster_bits;
 
@@ -793,12 +856,15 @@ again
Re: [RFC PATCH] kvm tools, qcow: Add support for growing refcount blocks
Oh, yeah. Thanks for your reply. It's a good lesson for me. :)
I will update soon.

On Tue, 2011-12-13 at 16:57 +0800, Kevin Wolf wrote:
> On 13.12.2011 04:41, Lan Tianyu wrote:
>> On Mon, 2011-12-12 at 19:15 +0800, Kevin Wolf wrote:
>>> On 12.12.2011 11:58, Pekka Enberg wrote:
>>>> On Mon, 12 Dec 2011, Kevin Wolf wrote:
>>>>> @@ -667,14 +722,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
>>>>>  	rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
>>>>>  	if (rft_idx >= rft->rf_size)
>>>>> -		return NULL;
>>>>> +		return (void *)-ENOSPC;
>>>>>
>>>>> Is this allowed style in kvm-tool? :-/
>>>>
>>>> It needs to use ERR_PTR() and related macros but otherwise I don't
>>>> see a big problem with it.
>>>
>>> Can you be sure that it never clashes with a valid allocation when
>>> you use this in userspace?
>>>
>>> But yes, at least using appropriate functions should be required.
>>> And this means that you can't only check for -ENOSPC, but you need
>>> to check for all possible error codes (IS_ERR_VALUE() I guess).
>>
>> qcow_read_refcount_block() is invoked in two places,
>> qcow_get_refcount() and update_cluster_refcount(), and will only
>> return NULL or -ENOSPC.
>>
>> In qcow_get_refcount(), when qcow_read_refcount_block() returns
>> -ENOSPC (no refcount block is available), it returns zero, which
>> means the cluster is not in use, and the refcount block can be grown
>> in update_cluster_refcount().
>>
>> In update_cluster_refcount(), when qcow_read_refcount_block() returns
>> -ENOSPC, it is necessary to grow the refcount blocks. So -ENOSPC has
>> to be handled specially.
>>
>> Does this make sense? :)
>
> I'm not saying that your code won't work today, just that it's brittle.
> If someone adds a different error return code somewhere and doesn't
> check if all callers properly deal with error codes, it will break.
>
> Kevin

Thanks
Tianyu Lan
Re: [RFC PATCH] kvm tools, qcow: Add support for growing refcount blocks
Thanks for your review.

On Mon, 2011-12-12 at 17:55 +0800, Kevin Wolf wrote:
> On 12.12.2011 03:03, Lan Tianyu wrote:
>> This patch enables allocating new refcount blocks, so kvm tools can
>> grow a qcow2 image much larger.
>>
>> Signed-off-by: Lan Tianyu tianyu@intel.com
>> ---
>>  tools/kvm/disk/qcow.c | 105 +++++++++++++++++++++++++++++++++++------
>>  1 files changed, 89 insertions(+), 16 deletions(-)
>>
>> diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
>> index e139fa5..929ba69 100644
>> --- a/tools/kvm/disk/qcow.c
>> +++ b/tools/kvm/disk/qcow.c
>> @@ -12,6 +12,7 @@
>>  #include <string.h>
>>  #include <unistd.h>
>>  #include <fcntl.h>
>> +#include <errno.h>
>>  #ifdef CONFIG_HAS_ZLIB
>>  #include <zlib.h>
>>  #endif
>> @@ -20,6 +21,10 @@
>>  #include <linux/kernel.h>
>>  #include <linux/types.h>
>>
>> +static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append);
>> +static int qcow_write_refcount_table(struct qcow *q);
>> +static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref);
>> +static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size);
>>
>>  static inline int qcow_pwrite_sync(int fd,
>>  		void *buf, size_t count, off_t offset)
>> @@ -657,6 +662,56 @@ static struct qcow_refcount_block *refcount_block_search(struct qcow *q, u64 off
>>  	return rfb;
>>  }
>>
>> +static struct qcow_refcount_block *qcow_grow_refcount_block(struct qcow *q,
>> +	u64 clust_idx)
>> +{
>> +	struct qcow_header *header = q->header;
>> +	struct qcow_refcount_table *rft = &q->refcount_table;
>> +	struct qcow_refcount_block *rfb;
>> +	u64 new_block_offset;
>> +	u64 rft_idx;
>> +
>> +	rft_idx = clust_idx >> (header->cluster_bits -
>> +		QCOW_REFCOUNT_BLOCK_SHIFT);
>> +
>> +	if (rft_idx >= rft->rf_size) {
>> +		pr_warning("Don't support grow refcount block table");
>> +		return NULL;
>> +	}
>> +
>> +	new_block_offset = qcow_alloc_clusters(q, q->cluster_size, 0);
>> +	if (new_block_offset < 0)
>> +		return NULL;
>> +
>> +	rfb = new_refcount_block(q, new_block_offset);
>> +	if (!rfb)
>> +		return NULL;
>> +
>> +	memset(rfb->entries, 0x00, q->cluster_size);
>> +	rfb->dirty = 1;
>> +
>> +	/* write refcount block */
>> +	if (write_refcount_block(q, rfb) < 0)
>> +		goto free_rfb;
>> +
>> +	if (cache_refcount_block(q, rfb) < 0)
>> +		goto free_rfb;
>> +
>> +	rft->rf_table[rft_idx] = cpu_to_be64(new_block_offset);
>> +	if (qcow_write_refcount_table(q) < 0)
>> +		goto free_rfb;
>> +
>> +	if (update_cluster_refcount(q, new_block_offset >>
>> +	    header->cluster_bits, 1) < 0)
>> +		goto free_rfb;
>
> This order is unsafe, you could end up with a refcount block that is
> already in use, but still has a refcount of 0.

How about the following?

	rft->rf_table[rft_idx] = cpu_to_be64(new_block_offset);
	if (update_cluster_refcount(q, new_block_offset >>
	    header->cluster_bits, 1) < 0) {
		rft->rf_table[rft_idx] = 0;
		goto free_rfb;
	}

	if (qcow_write_refcount_table(q) < 0) {
		rft->rf_table[rft_idx] = 0;
		qcow_free_clusters(q, new_block_offset, q->cluster_size);
		goto free_rfb;
	}

update_cluster_refcount() will use rft->rf_table, so updating
rft->rf_table must happen before update_cluster_refcount().

>> +
>> +	return rfb;
>> +
>> +free_rfb:
>> +	free(rfb);
>> +	return NULL;
>> +}
>> +
>>  static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 clust_idx)
>>  {
>>  	struct qcow_header *header = q->header;
>> @@ -667,14 +722,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
>>  	rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
>>  	if (rft_idx >= rft->rf_size)
>> -		return NULL;
>> +		return (void *)-ENOSPC;
>
> Is this allowed style in kvm-tool? :-/
>
> Kevin
Re: [RFC PATCH] kvm tools, qcow: Add support for growing refcount blocks
On Mon, 2011-12-12 at 18:58 +0800, Pekka Enberg wrote:
> On Mon, 12 Dec 2011, Kevin Wolf wrote:
>> @@ -667,14 +722,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
>>  	rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
>>  	if (rft_idx >= rft->rf_size)
>> -		return NULL;
>> +		return (void *)-ENOSPC;
>>
>> Is this allowed style in kvm-tool? :-/
>
> It needs to use ERR_PTR() and related macros but otherwise I don't see
> a big problem with it.
>
> 			Pekka

I have tried to use ERR_PTR(), but when I included linux/err.h I got the
following compile errors:

  CC       disk/qcow.o
In file included from disk/qcow.c:20:
../../include/linux/err.h:22: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'ERR_PTR'
../../include/linux/err.h:27: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'PTR_ERR'
../../include/linux/err.h:32: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'IS_ERR'
../../include/linux/err.h:37: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'IS_ERR_OR_NULL'
../../include/linux/err.h:49: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'ERR_CAST'
../../include/linux/err.h:55: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'PTR_RET'

In linux/err.h:

static inline void * __must_check ERR_PTR(long error)
{
	return (void *) error;
}

When I comment out __must_check, the errors disappear. In the end I did
the cast myself rather than use ERR_PTR(). Are there userspace
substitutes for ERR_PTR, or other options?
Re: [RFC PATCH] kvm tools, qcow: Add support for growing refcount blocks
On Mon, 2011-12-12 at 19:15 +0800, Kevin Wolf wrote:
> On 12.12.2011 11:58, Pekka Enberg wrote:
>> On Mon, 12 Dec 2011, Kevin Wolf wrote:
>>> @@ -667,14 +722,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
>>>  	rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
>>>  	if (rft_idx >= rft->rf_size)
>>> -		return NULL;
>>> +		return (void *)-ENOSPC;
>>>
>>> Is this allowed style in kvm-tool? :-/
>>
>> It needs to use ERR_PTR() and related macros but otherwise I don't see
>> a big problem with it.
>
> Can you be sure that it never clashes with a valid allocation when you
> use this in userspace?
>
> But yes, at least using appropriate functions should be required. And
> this means that you can't only check for -ENOSPC, but you need to check
> for all possible error codes (IS_ERR_VALUE() I guess).

qcow_read_refcount_block() is invoked in two places, qcow_get_refcount()
and update_cluster_refcount(), and will only return NULL or -ENOSPC.

In qcow_get_refcount(), when qcow_read_refcount_block() returns -ENOSPC
(no refcount block is available), it returns zero, which means the
cluster is not in use, and the refcount block can be grown in
update_cluster_refcount().

In update_cluster_refcount(), when qcow_read_refcount_block() returns
-ENOSPC, it is necessary to grow the refcount blocks. So -ENOSPC has to
be handled specially.

Does this make sense? :)

> Kevin

Thanks
Tianyu Lan.
[RFC PATCH] kvm tools, qcow: Add support for growing refcount blocks
This patch enables allocating new refcount blocks, so kvm tools can grow a qcow2 image much larger.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c | 105 ++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 89 insertions(+), 16 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index e139fa5..929ba69 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -12,6 +12,7 @@
 #include <string.h>
 #include <unistd.h>
 #include <fcntl.h>
+#include <errno.h>
 #ifdef CONFIG_HAS_ZLIB
 #include <zlib.h>
 #endif
@@ -20,6 +21,10 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append);
+static int qcow_write_refcount_table(struct qcow *q);
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref);
+static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size);
 
 static inline int qcow_pwrite_sync(int fd,
 		void *buf, size_t count, off_t offset)
@@ -657,6 +662,56 @@ static struct qcow_refcount_block *refcount_block_search(struct qcow *q, u64 off
 	return rfb;
 }
 
+static struct qcow_refcount_block *qcow_grow_refcount_block(struct qcow *q,
+	u64 clust_idx)
+{
+	struct qcow_header *header = q->header;
+	struct qcow_refcount_table *rft = &q->refcount_table;
+	struct qcow_refcount_block *rfb;
+	u64 new_block_offset;
+	u64 rft_idx;
+
+	rft_idx = clust_idx >> (header->cluster_bits -
+		QCOW_REFCOUNT_BLOCK_SHIFT);
+
+	if (rft_idx >= rft->rf_size) {
+		pr_warning("Don't support grow refcount block table");
+		return NULL;
+	}
+
+	new_block_offset = qcow_alloc_clusters(q, q->cluster_size, 0);
+	if (new_block_offset < 0)
+		return NULL;
+
+	rfb = new_refcount_block(q, new_block_offset);
+	if (!rfb)
+		return NULL;
+
+	memset(rfb->entries, 0x00, q->cluster_size);
+	rfb->dirty = 1;
+
+	/* write refcount block */
+	if (write_refcount_block(q, rfb) < 0)
+		goto free_rfb;
+
+	if (cache_refcount_block(q, rfb) < 0)
+		goto free_rfb;
+
+	rft->rf_table[rft_idx] = cpu_to_be64(new_block_offset);
+	if (qcow_write_refcount_table(q) < 0)
+		goto free_rfb;
+
+	if (update_cluster_refcount(q, new_block_offset >>
+	    header->cluster_bits, 1) < 0)
+		goto free_rfb;
+
+	return rfb;
+
+free_rfb:
+	free(rfb);
+	return NULL;
+}
+
 static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 clust_idx)
 {
 	struct qcow_header *header = q->header;
@@ -667,14 +722,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 	rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
 	if (rft_idx >= rft->rf_size)
-		return NULL;
+		return (void *)-ENOSPC;
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
-
-	if (!rfb_offset) {
-		pr_warning("Don't support to grow refcount table");
-		return NULL;
-	}
+	if (!rfb_offset)
+		return (void *)-ENOSPC;
 
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
@@ -708,7 +760,8 @@ static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
 	if (!rfb) {
 		pr_warning("Error while reading refcount table");
 		return -1;
-	}
+	} else if ((long)rfb == -ENOSPC)
+		return 0;
 
 	rfb_idx = clust_idx & (((1ULL <<
 		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
@@ -732,6 +785,12 @@ static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
 	if (!rfb) {
 		pr_warning("error while reading refcount table");
 		return -1;
+	} else if ((long)rfb == -ENOSPC) {
+		rfb = qcow_grow_refcount_block(q, clust_idx);
+		if (!rfb) {
+			pr_warning("error while growing refcount table");
+			return -1;
+		}
 	}
 
 	rfb_idx = clust_idx & (((1ULL <<
@@ -774,11 +833,11 @@ static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size)
  * can satisfy the size. free_clust_idx is initialized to zero and
  * Record last position.
  */
-static u64 qcow_alloc_clusters(struct qcow *q, u64 size)
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref)
 {
 	struct qcow_header *header = q->header;
 	u16 clust_refcount;
-	u32 clust_idx, i;
+	u32 clust_idx = 0, i;
 	u64 clust_num;
 
 	clust_num = (size + (q->cluster_size - 1)) >> header->cluster_bits;
 
@@ -793,12 +852,15 @@ again:
 		goto again;
 	}
 
-	for (i = 0; i < clust_num; i++)
-		if (update_cluster_refcount(q,
-			q->free_clust_idx
[PATCH v5] kvm tools, qcow: Add the support for copy-on-write cluster
When a write request targets a cluster without the copied flag, allocate a new cluster and write the original data, with the modification, to the new cluster. This also adds support for writing to qcow2 compressed images. After testing, the image file passes qemu-img check. Performance still needs to be improved.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c        | 454 ++++++++++++++++++++++++++--------------
 tools/kvm/include/kvm/qcow.h |   2 +
 2 files changed, 289 insertions(+), 167 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..e139fa5 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+
+static inline int qcow_pwrite_sync(int fd,
+		void *buf, size_t count, off_t offset)
+{
+	if (pwrite_in_full(fd, buf, count, offset) < 0)
+		return -1;
+
+	return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
 	struct rb_node **link = &(root->rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c)
 
 	size = 1 << header->l2_bits;
 
-	if (pwrite_in_full(q->fd, c->table, size * sizeof(u64), c->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, c->table,
+		size * sizeof(u64), c->offset) < 0)
 		return -1;
 
 	c->dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 		 */
 		lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-		if (qcow_l2_cache_write(q, lru) < 0)
-			goto error;
-
 		/* Remove the node from the cache */
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
@@ -508,47 +516,6 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector,
 	return total;
 }
 
-static inline u64 file_size(int fd)
-{
-	struct stat st;
-
-	if (fstat(fd, &st) < 0)
-		return 0;
-
-	return st.st_size;
-}
-
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset)
-{
-	if (pwrite_in_full(fd, buf, count, offset) < 0)
-		return -1;
-
-	return fdatasync(fd);
-}
-
-/* Writes a level 2 table at the end of the file. */
-static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
-{
-	struct qcow_header *header = q->header;
-	u64 clust_sz;
-	u64 f_sz;
-	u64 off;
-	u64 sz;
-
-	f_sz		= file_size(q->fd);
-	if (!f_sz)
-		return 0;
-
-	sz		= 1 << header->l2_bits;
-	clust_sz	= 1 << header->cluster_bits;
-	off		= ALIGN(f_sz, clust_sz);
-
-	if (pwrite_in_full(q->fd, table, sz * sizeof(u64), off) < 0)
-		return 0;
-
-	return off;
-}
-
 static void refcount_table_free_cache(struct qcow_refcount_table *rft)
 {
 	struct rb_root *r = &rft->root;
@@ -601,7 +568,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
 	if (!rfb->dirty)
 		return 0;
 
-	if (pwrite_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, rfb->entries,
+		rfb->size * sizeof(u16), rfb->offset) < 0)
 		return -1;
 
 	rfb->dirty = 0;
@@ -618,9 +586,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +671,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
 		return rfb;
@@ -728,35 +698,140 @@ error_free_rfb:
 	return NULL;
 }
 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1
[PATCH v3] kvm tools, qcow: Add the support for copy-on-write cluster
When meeting request to write the cluster without copied flag, allocate a new cluster and write original data with modification to the new cluster. This also adds support for the writing operation of the qcow2 compressed image. After testing, image file can pass through qemu-img check. The performance is needed to be improved. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c| 456 +++--- tools/kvm/include/kvm/qcow.h |2 + 2 files changed, 292 insertions(+), 166 deletions(-) diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c index 680b37d..7cf6fda 100644 --- a/tools/kvm/disk/qcow.c +++ b/tools/kvm/disk/qcow.c @@ -20,6 +20,16 @@ #include linux/kernel.h #include linux/types.h + +static inline int qcow_pwrite_sync(int fd, + void *buf, size_t count, off_t offset) +{ + if (pwrite_in_full(fd, buf, count, offset) 0) + return -1; + + return fdatasync(fd); +} + static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new) { struct rb_node **link = (root-rb_node), *parent = NULL; @@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c) size = 1 header-l2_bits; - if (pwrite_in_full(q-fd, c-table, size * sizeof(u64), c-offset) 0) + if (qcow_pwrite_sync(q-fd, c-table, + size * sizeof(u64), c-offset) 0) return -1; c-dirty = 0; @@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c) */ lru = list_first_entry(l1t-lru_list, struct qcow_l2_table, list); - if (qcow_l2_cache_write(q, lru) 0) - goto error; - /* Remove the node from the cache */ rb_erase(lru-node, r); list_del_init(lru-list); @@ -508,47 +516,6 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector, return total; } -static inline u64 file_size(int fd) -{ - struct stat st; - - if (fstat(fd, st) 0) - return 0; - - return st.st_size; -} - -static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset) -{ - if (pwrite_in_full(fd, buf, count, offset) 0) - return -1; - - return fdatasync(fd); 
-} - -/* Writes a level 2 table at the end of the file. */ -static u64 qcow_write_l2_table(struct qcow *q, u64 *table) -{ - struct qcow_header *header = q-header; - u64 clust_sz; - u64 f_sz; - u64 off; - u64 sz; - - f_sz= file_size(q-fd); - if (!f_sz) - return 0; - - sz = 1 header-l2_bits; - clust_sz= 1 header-cluster_bits; - off = ALIGN(f_sz, clust_sz); - - if (pwrite_in_full(q-fd, table, sz * sizeof(u64), off) 0) - return 0; - - return off; -} - static void refcount_table_free_cache(struct qcow_refcount_table *rft) { struct rb_root *r = rft-root; @@ -601,7 +568,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb) if (!rfb-dirty) return 0; - if (pwrite_in_full(q-fd, rfb-entries, rfb-size * sizeof(u16), rfb-offset) 0) + if (qcow_pwrite_sync(q-fd, rfb-entries, + rfb-size * sizeof(u16), rfb-offset) 0) return -1; rfb-dirty = 0; @@ -618,9 +586,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c) if (rft-nr_cached == MAX_CACHE_NODES) { lru = list_first_entry(rft-lru_list, struct qcow_refcount_block, list); - if (write_refcount_block(q, lru) 0) - goto error; - rb_erase(lru-node, r); list_del_init(lru-list); rft-nr_cached--; @@ -706,6 +671,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 rfb_offset = be64_to_cpu(rft-rf_table[rft_idx]); + if (!rfb_offset) { + pr_warning(Don't support to grow refcount table); + return NULL; + } + rfb = refcount_block_search(q, rfb_offset); if (rfb) return rfb; @@ -728,35 +698,140 @@ error_free_rfb: return NULL; } +static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(Error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + + if (rfb_idx = rfb-size) { + pr_warning(L1: 
refcount block index out of bounds); + return -1
[PATCH v4] kvm tools, qcow: Add the support for copy-on-write cluster
When meeting request to write the cluster without copied flag, allocate a new cluster and write original data with modification to the new cluster. This also adds support for the writing operation of the qcow2 compressed image. After testing, image file can pass through qemu-img check. The performance is needed to be improved. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c| 456 +++--- tools/kvm/include/kvm/qcow.h |2 + 2 files changed, 292 insertions(+), 166 deletions(-) diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c index 680b37d..400aae8 100644 --- a/tools/kvm/disk/qcow.c +++ b/tools/kvm/disk/qcow.c @@ -20,6 +20,16 @@ #include linux/kernel.h #include linux/types.h + +static inline int qcow_pwrite_sync(int fd, + void *buf, size_t count, off_t offset) +{ + if (pwrite_in_full(fd, buf, count, offset) 0) + return -1; + + return fdatasync(fd); +} + static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new) { struct rb_node **link = (root-rb_node), *parent = NULL; @@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c) size = 1 header-l2_bits; - if (pwrite_in_full(q-fd, c-table, size * sizeof(u64), c-offset) 0) + if (qcow_pwrite_sync(q-fd, c-table, + size * sizeof(u64), c-offset) 0) return -1; c-dirty = 0; @@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c) */ lru = list_first_entry(l1t-lru_list, struct qcow_l2_table, list); - if (qcow_l2_cache_write(q, lru) 0) - goto error; - /* Remove the node from the cache */ rb_erase(lru-node, r); list_del_init(lru-list); @@ -508,47 +516,6 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector, return total; } -static inline u64 file_size(int fd) -{ - struct stat st; - - if (fstat(fd, st) 0) - return 0; - - return st.st_size; -} - -static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset) -{ - if (pwrite_in_full(fd, buf, count, offset) 0) - return -1; - - return fdatasync(fd); 
-} - -/* Writes a level 2 table at the end of the file. */ -static u64 qcow_write_l2_table(struct qcow *q, u64 *table) -{ - struct qcow_header *header = q-header; - u64 clust_sz; - u64 f_sz; - u64 off; - u64 sz; - - f_sz= file_size(q-fd); - if (!f_sz) - return 0; - - sz = 1 header-l2_bits; - clust_sz= 1 header-cluster_bits; - off = ALIGN(f_sz, clust_sz); - - if (pwrite_in_full(q-fd, table, sz * sizeof(u64), off) 0) - return 0; - - return off; -} - static void refcount_table_free_cache(struct qcow_refcount_table *rft) { struct rb_root *r = rft-root; @@ -601,7 +568,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb) if (!rfb-dirty) return 0; - if (pwrite_in_full(q-fd, rfb-entries, rfb-size * sizeof(u16), rfb-offset) 0) + if (qcow_pwrite_sync(q-fd, rfb-entries, + rfb-size * sizeof(u16), rfb-offset) 0) return -1; rfb-dirty = 0; @@ -618,9 +586,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c) if (rft-nr_cached == MAX_CACHE_NODES) { lru = list_first_entry(rft-lru_list, struct qcow_refcount_block, list); - if (write_refcount_block(q, lru) 0) - goto error; - rb_erase(lru-node, r); list_del_init(lru-list); rft-nr_cached--; @@ -706,6 +671,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 rfb_offset = be64_to_cpu(rft-rf_table[rft_idx]); + if (!rfb_offset) { + pr_warning(Don't support to grow refcount table); + return NULL; + } + rfb = refcount_block_search(q, rfb_offset); if (rfb) return rfb; @@ -728,35 +698,140 @@ error_free_rfb: return NULL; } +static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(Error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + + if (rfb_idx = rfb-size) { + pr_warning(L1: 
refcount block index out of bounds); + return -1
RE: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
Yeah. I will fix them in the next version.

-----Original Message-----
From: penb...@gmail.com [mailto:penb...@gmail.com] On Behalf Of Pekka Enberg
Sent: Monday, November 21, 2011 5:06 PM
To: Lan, Tianyu
Cc: kvm@vger.kernel.org; asias.he...@gmail.com; levinsasha...@gmail.com; prasadjoshi...@gmail.com; kw...@redhat.com
Subject: Re: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

On Mon, Nov 21, 2011 at 9:12 AM, Lan Tianyu tianyu@intel.com wrote:

+/*Allocate clusters according to the size. Find a postion that
+ *can satisfy the size. free_clust_idx is initialized to zero and
+ *Record last position.
+*/

Can you please fix up your comments to use the following standard kernel style:

/*
 * Allocate clusters according to the size. Find a postion that
 * can satisfy the size. free_clust_idx is initialized to zero and
 * Record last position.
 */

[ Whitespace after asterisk and starting line doesn't have text. ]

-	rfb = qcow_read_refcount_block(q, clust_idx);
-	if (!rfb) {
-		pr_warning("L1: error while reading refcount table");
+	clust_new_start = qcow_alloc_clusters(q, q->cluster_size);
+	if (clust_new_start < 0) {
+		pr_warning("Cluster alloc error!");

Please drop the exclamation marks from pr_warning() and pr_error() messages. It just adds pointless noise.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] kvm tools, qcow: Add the support for copy-on-write cluster
When meeting request to write the cluster without copied flag, allocate a new cluster and write original data with modification to the new cluster. This also adds support for the writing operation of the qcow2 compressed image. After testing, image file can pass through qemu-img check. The performance is needed to be improved. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c| 421 +- tools/kvm/include/kvm/qcow.h |2 + 2 files changed, 292 insertions(+), 131 deletions(-) diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c index 680b37d..99b13e7 100644 --- a/tools/kvm/disk/qcow.c +++ b/tools/kvm/disk/qcow.c @@ -20,6 +20,16 @@ #include linux/kernel.h #include linux/types.h + +static inline int qcow_pwrite_sync(int fd, + void *buf, size_t count, off_t offset) +{ + if (pwrite_in_full(fd, buf, count, offset) 0) + return -1; + + return fdatasync(fd); +} + static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new) { struct rb_node **link = (root-rb_node), *parent = NULL; @@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c) size = 1 header-l2_bits; - if (pwrite_in_full(q-fd, c-table, size * sizeof(u64), c-offset) 0) + if (qcow_pwrite_sync(q-fd, c-table, + size * sizeof(u64), c-offset) 0) return -1; c-dirty = 0; @@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c) */ lru = list_first_entry(l1t-lru_list, struct qcow_l2_table, list); - if (qcow_l2_cache_write(q, lru) 0) - goto error; - /* Remove the node from the cache */ rb_erase(lru-node, r); list_del_init(lru-list); @@ -518,14 +526,6 @@ static inline u64 file_size(int fd) return st.st_size; } -static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset) -{ - if (pwrite_in_full(fd, buf, count, offset) 0) - return -1; - - return fdatasync(fd); -} - /* Writes a level 2 table at the end of the file. 
*/ static u64 qcow_write_l2_table(struct qcow *q, u64 *table) { @@ -601,7 +601,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb) if (!rfb-dirty) return 0; - if (pwrite_in_full(q-fd, rfb-entries, rfb-size * sizeof(u16), rfb-offset) 0) + if (qcow_pwrite_sync(q-fd, rfb-entries, + rfb-size * sizeof(u16), rfb-offset) 0) return -1; rfb-dirty = 0; @@ -618,9 +619,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c) if (rft-nr_cached == MAX_CACHE_NODES) { lru = list_first_entry(rft-lru_list, struct qcow_refcount_block, list); - if (write_refcount_block(q, lru) 0) - goto error; - rb_erase(lru-node, r); list_del_init(lru-list); rft-nr_cached--; @@ -706,6 +704,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 rfb_offset = be64_to_cpu(rft-rf_table[rft_idx]); + if (!rfb_offset) { + pr_warning(Don't support to grow refcount table); + return NULL; + } + rfb = refcount_block_search(q, rfb_offset); if (rfb) return rfb; @@ -728,35 +731,140 @@ error_free_rfb: return NULL; } +static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(Error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + + if (rfb_idx = rfb-size) { + pr_warning(L1: refcount block index out of bounds); + return -1; + } + + return be16_to_cpu(rfb-entries[rfb_idx]); +} + +static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u16 refcount; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - 
QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + if (rfb_idx = rfb-size) { + pr_warning(refcount block index out of bounds); + return -1; + } + + refcount = be16_to_cpu(rfb-entries[rfb_idx]) + append; + rfb
[PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
When meeting request to write the cluster without copied flag, allocate a new cluster and write original data with modification to the new cluster. This also adds support for the writing operation of the qcow2 compressed image. After testing, image file can pass through qemu-img check. The performance is needed to be improved. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c| 411 +- tools/kvm/include/kvm/qcow.h |2 + 2 files changed, 285 insertions(+), 128 deletions(-) diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c index 680b37d..723faaa 100644 --- a/tools/kvm/disk/qcow.c +++ b/tools/kvm/disk/qcow.c @@ -20,6 +20,16 @@ #include linux/kernel.h #include linux/types.h + +static inline int qcow_pwrite_sync(int fd, + void *buf, size_t count, off_t offset) +{ + if (pwrite_in_full(fd, buf, count, offset) 0) + return -1; + + return fdatasync(fd); +} + static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new) { struct rb_node **link = (root-rb_node), *parent = NULL; @@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c) size = 1 header-l2_bits; - if (pwrite_in_full(q-fd, c-table, size * sizeof(u64), c-offset) 0) + if (qcow_pwrite_sync(q-fd, c-table, + size * sizeof(u64), c-offset) 0) return -1; c-dirty = 0; @@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c) */ lru = list_first_entry(l1t-lru_list, struct qcow_l2_table, list); - if (qcow_l2_cache_write(q, lru) 0) - goto error; - /* Remove the node from the cache */ rb_erase(lru-node, r); list_del_init(lru-list); @@ -518,14 +526,6 @@ static inline u64 file_size(int fd) return st.st_size; } -static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset) -{ - if (pwrite_in_full(fd, buf, count, offset) 0) - return -1; - - return fdatasync(fd); -} - /* Writes a level 2 table at the end of the file. 
*/ static u64 qcow_write_l2_table(struct qcow *q, u64 *table) { @@ -601,7 +601,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb) if (!rfb-dirty) return 0; - if (pwrite_in_full(q-fd, rfb-entries, rfb-size * sizeof(u16), rfb-offset) 0) + if (qcow_pwrite_sync(q-fd, rfb-entries, + rfb-size * sizeof(u16), rfb-offset) 0) return -1; rfb-dirty = 0; @@ -618,9 +619,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c) if (rft-nr_cached == MAX_CACHE_NODES) { lru = list_first_entry(rft-lru_list, struct qcow_refcount_block, list); - if (write_refcount_block(q, lru) 0) - goto error; - rb_erase(lru-node, r); list_del_init(lru-list); rft-nr_cached--; @@ -706,6 +704,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 rfb_offset = be64_to_cpu(rft-rf_table[rft_idx]); + if (!rfb_offset) { + pr_warning(Don't support to grow refcount table); + return NULL; + } + rfb = refcount_block_search(q, rfb_offset); if (rfb) return rfb; @@ -728,35 +731,138 @@ error_free_rfb: return NULL; } -/* - * QCOW file might grow during a write operation. Not only data but metadata is - * also written at the end of the file. Therefore it is necessary to ensure - * every write is committed to disk. Hence we use uses qcow_pwrite_sync() to - * synchronize the in-core state of QCOW image to disk. - * - * We also try to restore the image to a consistent state if the metdata - * operation fails. The two metadat operations are: level 1 and level 2 table - * update. If either of them fails the image is truncated to a consistent state. 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(Error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + + if (rfb_idx = rfb-size) { + pr_warning(L1: refcount block index out of bounds); + return -1; + } + + return be16_to_cpu(rfb-entries[rfb_idx]); +} + +static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u16
RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
How about using sync_file_range() to sync the metadata?

-----Original Message-----
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Lan, Tianyu
Sent: Saturday, November 19, 2011 10:09 AM
To: Kevin Wolf
Cc: penb...@kernel.org; kvm@vger.kernel.org; asias.he...@gmail.com; levinsasha...@gmail.com; prasadjoshi...@gmail.com
Subject: RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

Hi Kevin:
Thanks for your review. The following means that there should be an fsync after updating the metadata (refcount block, L1 table and L2 table).

Thanks
Tianyu Lan

-----Original Message-----
+	/* write l2 table */
+	l2t->dirty = 1;
+	if (qcow_l2_cache_write(q, l2t) < 0)
+		goto free_cache;

You need to make sure that the refcount update is written first (e.g. with fsync), otherwise you risk corruption when the host crashes in the middle.

-	if (cache_table(q, l2t) < 0) {
-		if (ftruncate(q->fd, f_sz) < 0)
-			goto free_cache;
+	/* Update the l1 table */
+	l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_new_offset
+				| QCOW2_OFLAG_COPIED);
-		goto free_cache;
-	}
+	if (pwrite_in_full(q->fd, l1t->l1_table,
+			l1t->table_size * sizeof(u64),
+			header->l1_table_offset) < 0)
+		goto error;

Likewise, the L1 table write must be ordered against the L2 write. "goto error" is using the wrong label.

-	/* Update the in-core entry */
-	l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_offset);
+	/* cache l2 table */
+	cache_table(q, l2t);

After so many explicit comments, you can probably guess what's wrong here...
RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
OK, thanks. But fsync() is too slow; I am trying to find a way to sync only a range of a file. Are there any solutions that meet my purpose?

Thanks
Tianyu Lan

-----Original Message-----
From: Sasha Levin [mailto:levinsasha...@gmail.com]
Sent: Sunday, November 20, 2011 12:27 AM
To: Lan, Tianyu
Cc: Kevin Wolf; penb...@kernel.org; kvm@vger.kernel.org; asias.he...@gmail.com; prasadjoshi...@gmail.com
Subject: RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

On Sat, 2011-11-19 at 23:30 +0800, Lan, Tianyu wrote:
How about using sync_file_range() to sync the metadata?

sync_file_range() is only a hint; it doesn't actually assure anything.

--
Sasha.
RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
Yeah. That will make the work quite simple. After testing, fdatasync() performs better than fsync().

-----Original Message-----
From: Sasha Levin [mailto:levinsasha...@gmail.com]
Sent: Sunday, November 20, 2011 2:23 PM
To: Lan, Tianyu
Cc: Kevin Wolf; penb...@kernel.org; kvm@vger.kernel.org; asias.he...@gmail.com; prasadjoshi...@gmail.com
Subject: RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

On Sun, 2011-11-20 at 14:14 +0800, Lan, Tianyu wrote:
OK, thanks. But fsync() is too slow; I am trying to find a way to sync only a range of a file. Are there any solutions that meet my purpose?

fdatasync() is as good as it'll get. tbh, maybe we should just consider opening QCOW images with O_SYNC and just get it over with?

Thanks
Tianyu Lan

-----Original Message-----
From: Sasha Levin [mailto:levinsasha...@gmail.com]
Sent: Sunday, November 20, 2011 12:27 AM
To: Lan, Tianyu
Cc: Kevin Wolf; penb...@kernel.org; kvm@vger.kernel.org; asias.he...@gmail.com; prasadjoshi...@gmail.com
Subject: RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

On Sat, 2011-11-19 at 23:30 +0800, Lan, Tianyu wrote:
How about using sync_file_range() to sync the metadata?

sync_file_range() is only a hint; it doesn't actually assure anything.

--
Sasha.
[RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
When meeting request to write the cluster without copied flag, allocate a new cluster and write original data with modification to the new cluster. This also adds support for the writing operation of the qcow2 compressed image. After testing, image file can pass through qemu-img check. Please enter the commit message for your changes. Lines starting Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c| 366 +- tools/kvm/include/kvm/qcow.h |2 + 2 files changed, 255 insertions(+), 113 deletions(-) diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c index 680b37d..4d9125d 100644 --- a/tools/kvm/disk/qcow.c +++ b/tools/kvm/disk/qcow.c @@ -122,9 +122,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c) */ lru = list_first_entry(l1t-lru_list, struct qcow_l2_table, list); - if (qcow_l2_cache_write(q, lru) 0) - goto error; - /* Remove the node from the cache */ rb_erase(lru-node, r); list_del_init(lru-list); @@ -618,9 +615,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c) if (rft-nr_cached == MAX_CACHE_NODES) { lru = list_first_entry(rft-lru_list, struct qcow_refcount_block, list); - if (write_refcount_block(q, lru) 0) - goto error; - rb_erase(lru-node, r); list_del_init(lru-list); rft-nr_cached--; @@ -706,6 +700,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 rfb_offset = be64_to_cpu(rft-rf_table[rft_idx]); + if (!rfb_offset) { + pr_warning(Don't support to grow refcount table); + return NULL; + } + rfb = refcount_block_search(q, rfb_offset); if (rfb) return rfb; @@ -728,35 +727,121 @@ error_free_rfb: return NULL; } -/* - * QCOW file might grow during a write operation. Not only data but metadata is - * also written at the end of the file. Therefore it is necessary to ensure - * every write is committed to disk. Hence we use uses qcow_pwrite_sync() to - * synchronize the in-core state of QCOW image to disk. 
- * - * We also try to restore the image to a consistent state if the metdata - * operation fails. The two metadat operations are: level 1 and level 2 table - * update. If either of them fails the image is truncated to a consistent state. +static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(Error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + + if (rfb_idx = rfb-size) { + pr_warning(L1: refcount block index out of bounds); + return -1; + } + + return be16_to_cpu(rfb-entries[rfb_idx]); +} + +static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u16 refcount; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + if (rfb_idx = rfb-size) { + pr_warning(refcount block index out of bounds); + return -1; + } + + refcount = be16_to_cpu(rfb-entries[rfb_idx]) + append; + rfb-entries[rfb_idx] = cpu_to_be16(refcount); + rfb-dirty = 1; + + /*write refcount block*/ + write_refcount_block(q, rfb); + + /*update free_clust_idx since refcount becomes zero*/ + if (!refcount clust_idx q-free_clust_idx) + q-free_clust_idx = clust_idx; + + return 0; +} + +static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size) +{ + struct qcow_header *header = q-header; + u64 start, end, offset; + + start = clust_start ~(q-cluster_size - 1); + end = (clust_start + size - 1) ~(q-cluster_size - 1); + for (offset = start; offset = end; offset += q-cluster_size) + update_cluster_refcount(q, offset header-cluster_bits, 
-1); +} + +/*Allocate clusters according to the size. Find a postion that + *can satisfy the size. free_clust_idx is initialized to zero and + *Record last position. +*/ +static u64 qcow_alloc_clusters(struct qcow *q, u64 size) +{ + struct qcow_header *header = q-header
RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
Hi Kevin:
Thanks for your review. The following means that there should be an fsync after updating the metadata (refcount block, L1 table and L2 table).

Thanks
Tianyu Lan

-----Original Message-----
+	/* write l2 table */
+	l2t->dirty = 1;
+	if (qcow_l2_cache_write(q, l2t) < 0)
+		goto free_cache;

You need to make sure that the refcount update is written first (e.g. with fsync), otherwise you risk corruption when the host crashes in the middle.

-	if (cache_table(q, l2t) < 0) {
-		if (ftruncate(q->fd, f_sz) < 0)
-			goto free_cache;
+	/* Update the l1 table */
+	l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_new_offset
+				| QCOW2_OFLAG_COPIED);
-		goto free_cache;
-	}
+	if (pwrite_in_full(q->fd, l1t->l1_table,
+			l1t->table_size * sizeof(u64),
+			header->l1_table_offset) < 0)
+		goto error;

Likewise, the L1 table write must be ordered against the L2 write. "goto error" is using the wrong label.

-	/* Update the in-core entry */
-	l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_offset);
+	/* cache l2 table */
+	cache_table(q, l2t);

After so many explicit comments, you can probably guess what's wrong here...
[RFC PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
When meeting request to write the cluster without copied flag, allocate a new cluster and write original data with modification to the new cluster. This also can add support for the writing operation of the qcow2 compressed image. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c| 322 -- tools/kvm/include/kvm/qcow.h |2 + 2 files changed, 218 insertions(+), 106 deletions(-) diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c index 680b37d..2b9af73 100644 --- a/tools/kvm/disk/qcow.c +++ b/tools/kvm/disk/qcow.c @@ -122,9 +122,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c) */ lru = list_first_entry(l1t-lru_list, struct qcow_l2_table, list); - if (qcow_l2_cache_write(q, lru) 0) - goto error; - /* Remove the node from the cache */ rb_erase(lru-node, r); list_del_init(lru-list); @@ -728,35 +725,110 @@ error_free_rfb: return NULL; } -/* - * QCOW file might grow during a write operation. Not only data but metadata is - * also written at the end of the file. Therefore it is necessary to ensure - * every write is committed to disk. Hence we use uses qcow_pwrite_sync() to - * synchronize the in-core state of QCOW image to disk. - * - * We also try to restore the image to a consistent state if the metdata - * operation fails. The two metadat operations are: level 1 and level 2 table - * update. If either of them fails the image is truncated to a consistent state. 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + + if (rfb_idx = rfb-size) { + pr_warning(L1: refcount block index out of bounds); + return -1; + } + + return be16_to_cpu(rfb-entries[rfb_idx]); +} + +static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u16 refcount; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + if (rfb_idx = rfb-size) { + pr_warning(refcount block index out of bounds); + return -1; + } + + refcount = be16_to_cpu(rfb-entries[rfb_idx]) + append; + rfb-entries[rfb_idx] = cpu_to_be16(refcount); + rfb-dirty = 1; + + /*write refcount block*/ + write_refcount_block(q, rfb); + + /*update free_clust_idx since refcount becomes zero*/ + if (!refcount clust_idx q-free_clust_idx) + q-free_clust_idx = clust_idx; + + return 0; +} + +/*Allocate clusters according to the size. Find a postion that + *can satisfy the size. free_clust_idx is initialized to zero and + *Record last position. 
+*/ +static u64 qcow_alloc_clusters(struct qcow *q, u64 size) +{ + struct qcow_header *header = q-header; + u16 clust_refcount; + u32 clust_idx, i; + u64 clust_num; + + clust_num = (size + (q-cluster_size - 1)) header-cluster_bits; + +again: + for (i = 0; i clust_num; i++) { + clust_idx = q-free_clust_idx++; + clust_refcount = qcow_get_refcount(q, clust_idx); + if (clust_refcount 0) + return -1; + else if (clust_refcount 0) + goto again; + } + + for (i = 0; i clust_num; i++) + update_cluster_refcount(q, + q-free_clust_idx - clust_num + i, 1); + + return (q-free_clust_idx - clust_num) header-cluster_bits; +} + +/*Get l2 table. If the table has been copied, read table directly. + *If the table exists, allocate a new cluster and copy the table + *to the new cluster. */ -static ssize_t qcow_write_cluster(struct qcow *q, u64 offset, void *buf, u32 src_len) +static int get_cluster_table(struct qcow *q, u64 offset, + struct qcow_l2_table **result_l2t, u64 *result_l2_index) { struct qcow_header *header = q-header; struct qcow_l1_table *l1t = q-table; struct qcow_l2_table *l2t; - u64 clust_start; - u64 clust_flags; - u64 l2t_offset; - u64 clust_off; - u64 l2t_size; - u64 clust_sz; u64 l1t_idx; + u64 l2t_offset
[PATCH RFC] kvm tools: Add support for the read operation of qcow and qcow2 compressed image
This patch adds the decompression operation when confirming the qcow or qcow2 image is compressed. This patch also divides the read cluster fucntion into two which are respective for qcow and qcow2 in order to be convenient to support these two kind images. Add some macros for qcow. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/Makefile |1 + tools/kvm/disk/qcow.c| 315 ++ tools/kvm/include/kvm/qcow.h | 33 +++-- 3 files changed, 276 insertions(+), 73 deletions(-) diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile index fef624d..0d0da4b 100644 --- a/tools/kvm/Makefile +++ b/tools/kvm/Makefile @@ -116,6 +116,7 @@ OBJS+= bios/bios-rom.o LIBS += -lrt LIBS += -lpthread LIBS += -lutil +LIBS += -lz # Additional ARCH settings for x86 ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/ -e s/sun4u/sparc64/ \ diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c index 2471aeb..2f5c16a 100644 --- a/tools/kvm/disk/qcow.c +++ b/tools/kvm/disk/qcow.c @@ -12,6 +12,7 @@ #include string.h #include unistd.h #include fcntl.h +#include zlib.h #include linux/byteorder.h #include linux/kernel.h @@ -233,12 +234,41 @@ error: return NULL; } -static ssize_t qcow_read_cluster(struct qcow *q, u64 offset, void *dst, u32 dst_len) +static int qcow_decompress_buffer(u8 *out_buf, int out_buf_size, + const u8 *buf, int buf_size) +{ + z_stream strm1, *strm = strm1; + int ret, out_len; + + memset(strm, 0, sizeof(*strm)); + + strm-next_in = (u8 *)buf; + strm-avail_in = buf_size; + strm-next_out = out_buf; + strm-avail_out = out_buf_size; + + ret = inflateInit2(strm, -12); + if (ret != Z_OK) + return -1; + + ret = inflate(strm, Z_FINISH); + out_len = strm-next_out - out_buf; + if ((ret != Z_STREAM_END ret != Z_BUF_ERROR) || + out_len != out_buf_size) { + inflateEnd(strm); + return -1; + } + + inflateEnd(strm); + return 0; +} + +static ssize_t qcow1_read_cluster(struct qcow *q, u64 offset, + void *dst, u32 dst_len) { struct qcow_header *header = q-header; struct qcow_l1_table *l1t = 
q-table; struct qcow_l2_table *l2t; - u64 cluster_size; u64 clust_offset; u64 clust_start; u64 l2t_offset; @@ -246,30 +276,24 @@ static ssize_t qcow_read_cluster(struct qcow *q, u64 offset, void *dst, u32 dst_ u64 l2t_size; u64 l1_idx; u64 l2_idx; - - cluster_size = 1 header-cluster_bits; + int coffset; + int csize; l1_idx = get_l1_index(q, offset); if (l1_idx = l1t-table_size) return -1; clust_offset = get_cluster_offset(q, offset); - if (clust_offset = cluster_size) + if (clust_offset = q-cluster_size) return -1; - length = cluster_size - clust_offset; + length = q-cluster_size - clust_offset; if (length dst_len) length = dst_len; mutex_lock(q-mutex); l2t_offset = be64_to_cpu(l1t-l1_table[l1_idx]); - if (l2t_offset QCOW_OFLAG_COMPRESSED) { - pr_warning(compressed sectors are not supported); - goto out_error; - } - - l2t_offset = QCOW_OFFSET_MASK; if (!l2t_offset) goto zero_cluster; @@ -285,20 +309,130 @@ static ssize_t qcow_read_cluster(struct qcow *q, u64 offset, void *dst, u32 dst_ goto out_error; clust_start = be64_to_cpu(l2t-table[l2_idx]); - if (clust_start QCOW_OFLAG_COMPRESSED) { - pr_warning(compressed sectors are not supported); - goto out_error; + if (clust_start QCOW1_OFLAG_COMPRESSED) { + coffset = clust_start q-cluster_offset_mask; + csize = clust_start (63 - q-header-cluster_bits); + csize = (q-cluster_size - 1); + + if (pread_in_full(q-fd, q-cluster_data, csize, + coffset) 0) { + goto out_error; + } + + if (qcow_decompress_buffer(q-cluster_cache, q-cluster_size, + q-cluster_data, csize) 0) { + goto out_error; + } + + memcpy(dst, q-cluster_cache + clust_offset, length); + mutex_unlock(q-mutex); + } else{ + if (!clust_start) + goto zero_cluster; + + mutex_unlock(q-mutex); + + if (pread_in_full(q-fd, dst, length, + clust_start + clust_offset) 0) + return -1; } - clust_start = QCOW_OFFSET_MASK; - if (!clust_start) - goto zero_cluster; + return length; + +zero_cluster: + mutex_unlock(q-mutex