Re: [Qemu-devel] live migration vs device assignment (motivation)

2016-01-03 Thread Lan Tianyu
On 2015年12月30日 00:46, Michael S. Tsirkin wrote:
> Interesting. So you are saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?
> 
> If so that is wrong - clearly should reconfigure the old IP
> back without playing with dhcp. For testing, just set up
> a static IP.

The MAC and IP are migrated with the VM to the target machine, so there
is no need to reconfigure the IP after migration.

From my test results, ixgbevf_down() consumes about 35ms and
ixgbevf_up() about 55ms during migration.

-- 
Best regards
Tianyu Lan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-28 Thread Lan, Tianyu



On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:

As long as you keep up this vague talk about performance during
migration, without even bothering with any measurements, this patchset
will keep going nowhere.



I measured network service downtime for "keep device alive" (presented
in RFC patch V1) and "put the network interface down and up" (presented
in RFC patch V2) during migration, with some optimizations.


The former is around 140ms and the latter is around 240ms.

My patchset relies on the mailbox irq, which doesn't work in the
suspend state, so I can't measure downtime for the suspend/resume case.
I will try to get that result later.






There's Alex's patch that tracks memory changes during migration.  It
needs some simple enhancements to be useful in production (e.g. add a
host/guest handshake to both enable tracking in guest and to detect the
support in host), then it can allow starting migration with an assigned
device, by invoking hot-unplug after most of memory has been migrated.

Please implement this in qemu and measure the speed.


Sure. Will do that.


I will not be surprised if destroying/creating netdev in linux
turns out to take too long, but before anyone bothered
checking, it does not make sense to discuss further enhancements.



Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-24 Thread Lan Tianyu
Merry Christmas.
Sorry for the late response due to a personal matter.

On 2015年12月14日 03:30, Alexander Duyck wrote:
>> > This sounds like we need to add a fake bridge for migration and a
>> > driver in the guest for it. It also needs to extend the PCI
>> > bus/hotplug driver to pause/resume other devices, right?
>> >
>> > My concern is still whether we can change the PCI bus/hotplug
>> > behavior like that without a spec change.
>> >
>> > The IRQ should be generic for any device, and we may extend it for
>> > migration. The device driver can also decide whether to support
>> > migration.
> The device should have no say in the matter.  Either we are going to
> migrate or we will not.  This is why I have suggested my approach as
> it allows for the least amount of driver intrusion while providing the
> maximum number of ways to still perform migration even if the device
> doesn't support it.

Even if the device driver doesn't support migration, do you still want
to migrate the VM? That may be risky, and we should at least add the
"bad path" for the driver.

> 
> The solution I have proposed is simple:
> 
> 1.  Extend swiotlb to allow for a page dirtying functionality.
> 
>  This part is pretty straight forward.  I'll submit a few patches
> later today as RFC that can provide the minimal functionality needed
> for this.

That would be much appreciated.
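For illustration, here is a toy userspace model of why the swiotlb bounce-buffer path in step 1 is a natural place for page dirtying (this is not the real swiotlb API; all names and sizes here are invented): device writes land in a bounce slot, and the CPU copy back to the driver's buffer is exactly the point where a dirty-page hook can live.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 256
#define NSLOTS    4

static uint8_t bounce[NSLOTS][SLOT_SIZE]; /* bounce-buffer slots */
static void   *orig[NSLOTS];              /* driver buffer behind each slot */
static int     dirtied[NSLOTS];           /* stand-in for a dirty-page bitmap */

/* Map a driver buffer through a bounce slot; returns the slot index. */
static int swiotlb_map(void *buf)
{
    for (int i = 0; i < NSLOTS; i++)
        if (!orig[i]) { orig[i] = buf; return i; }
    return -1;
}

/* Device DMA lands in the bounce slot, never in the driver's buffer. */
static void device_dma_write(int slot, const void *data, size_t len)
{
    memcpy(bounce[slot], data, len);
}

/* The CPU copies the data back for the driver; because the CPU performs
 * this write, the hypervisor's dirty logging naturally sees the page -
 * this is where the page-dirtying hook would go. */
static void swiotlb_sync_for_cpu(int slot, size_t len)
{
    memcpy(orig[slot], bounce[slot], len);
    dirtied[slot] = 1;
}
```

The point of the sketch is only the ordering: the guest-visible buffer is touched by the CPU, not by the device, so no extra driver-side marking is needed.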

> 
> 2.  Provide a vendor specific configuration space option on the QEMU
> implementation of a PCI bridge to act as a bridge between direct
> assigned devices and the host bridge.
> 
>  My thought was to add some vendor specific block that includes a
> capabilities, status, and control register so you could go through and
> synchronize things like the DMA page dirtying feature.  The bridge
> itself could manage the migration capable bit inside QEMU for all
> devices assigned to it.  So if you added a VF to the bridge it would
> flag that you can support migration in QEMU, while the bridge would
> indicate you cannot until the DMA page dirtying control bit is set by
> the guest.
> 
>  We could also go through and optimize the DMA page dirtying after
> this is added so that we can narrow down the scope of use, and as a
> result improve the performance for other devices that don't need to
> support migration.  It would then be a matter of adding an interrupt
> in the device to handle an event such as the DMA page dirtying status
> bit being set in the config space status register, while the bit is
> not set in the control register.  If it doesn't get set then we would
> have to evict the devices before the warm-up phase of the migration,
> otherwise we can defer it until the end of the warm-up phase.
> 
> 3.  Extend existing shpc driver to support the optional "pause"
> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> hot-plug specification.

Since your solution already adds a fake PCI bridge, why not notify the
bridge directly during migration via an irq and call the device
driver's callbacks from the new bridge driver?

The new bridge driver could also check whether the device driver
provides migration callbacks and, if so, call them to improve the
passthrough device's performance during migration.

> 
>  Note I call out "extend" here instead of saying to add this.
> Basically what we should do is provide a means of quiescing the device
> without unloading the driver.  This is called out as something the OS
> vendor can optionally implement in the PCI hot-plug specification.  On
> OSes that wouldn't support this it would just be treated as a standard
> hot-plug event.   We could add a capability, status, and control bit
> in the vendor specific configuration block for this as well and if we
> set the status bit would indicate the host wants to pause instead of
> remove and the control bit would indicate the guest supports "pause"
> in the OS.  We then could optionally disable guest migration while the
> VF is present and pause is not supported.
> 
>  To support this we would need to add a timer and if a new device
> is not inserted in some period of time (60 seconds for example), or if
> a different device is inserted,
> we need to unload the original driver
> from the device.  In addition we would need to verify if drivers can
> call the remove function after having called suspend without resume.
> If not, we could look at adding a recovery function to remove the
> driver from the device in the case of a suspend with either a failed
> resume or no resume call.  Once again it would probably be useful to
> have for those cases where power management suspend/resume runs into
> an issue like somebody causing a surprise removal while a device was
> suspended.


-- 
Best regards
Tianyu Lan


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-13 Thread Lan, Tianyu



On 12/11/2015 1:16 AM, Alexander Duyck wrote:

On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu@intel.com> wrote:



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:


Ideally, it would be possible to leave the guest driver unmodified, but
that requires the hypervisor or qemu to be aware of the device, which
means we may need a driver in the hypervisor or qemu to handle the
device on behalf of the guest driver.


Can you answer the question of when do you use your code -
 at the start of migration or
 just before the end?



Just before stopping the VCPU in this version; we inject a VF mailbox
irq to notify the driver if the irq handler is installed.
The Qemu side will also check this via the faked PCI migration
capability, and the driver sets the status during its device open() or
resume() callback.


The VF mailbox interrupt is a very bad idea.  Really the device should
be in a reset state on the other side of a migration.  It doesn't make
sense to have the interrupt firing if the device is not configured.
This is one of the things that is preventing you from being able to
migrate the device while the interface is administratively down or the
VF driver is not loaded.


In my opinion, if the VF driver is not loaded and the hardware hasn't
started working, the device state doesn't need to be migrated.

We could add a flag the driver checks to see whether a migration
happened while the interface was down, reinitialize the hardware, and
clear the flag when the system brings the interface back up.


We could add a migration core to the Linux kernel and provide some
helper functions to make it easier to add migration support to drivers.

The migration core would be in charge of syncing status with Qemu.

Example:
migration_register()
The driver provides:
- Callbacks to be called before and after migration, or for the bad path
- The irq it prefers to use for migration events

migration_event_check()
The driver calls this in its irq handler. The migration core checks the
migration status and calls the driver's callbacks when a migration
happens.
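A rough sketch of how this migration_register()/migration_event_check() pairing could look (a userspace model of a proposed, not-yet-existing API; everything beyond those two names is invented here for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Status values the migration core would sync with Qemu via the faked
 * PCI migration capability (names hypothetical). */
enum mig_status { MIG_NONE, MIG_PRE, MIG_POST, MIG_BAD_PATH };

struct migration_ops {
    void (*pre_migrate)(void *priv);   /* before the VCPU is stopped */
    void (*post_migrate)(void *priv);  /* after resume on the destination */
    void (*bad_path)(void *priv);      /* state lost: do a full reset */
};

static struct {
    const struct migration_ops *ops;
    void *priv;
    int irq;                           /* irq the driver wants events on */
    enum mig_status status;
} client;

/* Driver registers its callbacks and preferred irq with the core. */
int migration_register(const struct migration_ops *ops, void *priv, int irq)
{
    client.ops = ops;
    client.priv = priv;
    client.irq = irq;
    client.status = MIG_NONE;
    return 0;
}

/* Called by the driver from its irq handler: dispatch any pending event. */
void migration_event_check(void)
{
    switch (client.status) {
    case MIG_PRE:      client.ops->pre_migrate(client.priv);  break;
    case MIG_POST:     client.ops->post_migrate(client.priv); break;
    case MIG_BAD_PATH: client.ops->bad_path(client.priv);     break;
    default:           break;
    }
    client.status = MIG_NONE;
}

/* Test hook standing in for Qemu flipping the status field. */
void migration_set_status(enum mig_status s) { client.status = s; }

/* Example driver callbacks that just count invocations. */
static int pre_calls, post_calls, resets;
static void demo_pre(void *p)  { (void)p; pre_calls++; }
static void demo_post(void *p) { (void)p; post_calls++; }
static void demo_bad(void *p)  { (void)p; resets++; }
static const struct migration_ops demo_ops = { demo_pre, demo_post, demo_bad };
```

The bad-path callback maps onto Michael's "driver has to assume the device was lost and reset it" case; the pre/post pair is the "good path" optimization.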




My thought on all this is that it might make sense to move this
functionality into a PCI-to-PCI bridge device and make it a
requirement that all direct-assigned devices have to exist behind that
device in order to support migration.  That way you would be working
with a directly emulated device that would likely already be
supporting hot-plug anyway.  Then it would just be a matter of coming
up with a few Qemu specific extensions that you would need to add to
the device itself.  The same approach would likely be portable enough
that you could achieve it with PCIe as well via the same configuration
space being present on the upstream side of a PCIe port or maybe a
PCIe switch of some sort.

It would then be possible to signal via your vendor-specific PCI
capability on that device that all devices behind this bridge require
DMA page dirtying, you could use the configuration in addition to the
interrupt already provided for hot-plug to signal things like when you
are starting migration, and possibly even just extend the shpc
functionality so that if this capability is present you have the
option to pause/resume instead of remove/probe the device in the case
of certain hot-plug events.  The fact is there may be some use for a
pause/resume type approach for PCIe hot-plug in the near future
anyway.  From the sounds of it Apple has required it for all
Thunderbolt device drivers so that they can halt the device in order
to shuffle resources around, perhaps we should look at something
similar for Linux.

The other advantage behind grouping functions on one bridge is things
like reset domains.  The PCI error handling logic will want to be able
to reset any devices that experienced an error in the event of
something such as a surprise removal.  By grouping all of the devices
you could disable/reset/enable them as one logical group in the event
of something such as the "bad path" approach Michael has mentioned.



This sounds like we need to add a fake bridge for migration and a
driver in the guest for it. It also needs to extend the PCI bus/hotplug
driver to pause/resume other devices, right?

My concern is still whether we can change the PCI bus/hotplug behavior
like that without a spec change.

The IRQ should be generic for any device, and we may extend it for
migration. The device driver can also decide whether to support
migration.






It would be great if we could avoid changing the guest; but at least
your guest driver changes don't actually seem to be that hardware
specific; could your changes actually be moved to the generic PCI level
so they could be made to work for lots of drivers?




It is impossible to use one common solution for all devices unless the
PCIe spec documents it clearly, and I think one day it will. But before
that, we need some workarounds in the guest driver to make it work,
even if that looks ugly.



Yes, so far there is no hardware migration support and it's hard to
modify bus-level code. It will also block implementation on Windows.

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu



On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:

On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:

Ideally, it would be possible to leave the guest driver unmodified, but
that requires the hypervisor or qemu to be aware of the device, which
means we may need a driver in the hypervisor or qemu to handle the
device on behalf of the guest driver.

Can you answer the question of when do you use your code -
at the start of migration or
just before the end?


Just before stopping the VCPU in this version; we inject a VF mailbox
irq to notify the driver if the irq handler is installed.
The Qemu side will also check this via the faked PCI migration
capability, and the driver sets the status during its device open() or
resume() callback.


Right, this is the "good path" optimization. Whether this buys anything
as compared to just sending reset to the device when VCPU is stopped
needs to be measured. In any case, we probably do need a way to
interrupt driver on destination to make it reconfigure the device -
otherwise it might take seconds for it to notice.  And a way to make
sure driver can handle this surprise reset so we can block migration if
it can't.



Yes, we need such a way to notify the driver about migration status and
do the reset or restore operations on the destination machine. My
original design is to take advantage of the device's irq to do that.
The driver can tell Qemu which irq it prefers for such tasks and
whether that irq is enabled and bound to a handler. We may discuss the
details in the other thread.




It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?


It is impossible to use one common solution for all devices unless the
PCIe spec documents it clearly, and I think one day it will. But before
that, we need some workarounds in the guest driver to make it work,
even if that looks ugly.


Yes, so far there is no hardware migration support


VT-D supports setting dirty bit in the PTE in hardware.


Actually, current hardware doesn't support this.
The VT-d spec documents the dirty bit only for first-level translation,
which requires devices to support DMA requests with a PASID (process
address space identifier). Most devices don't support that feature.




and it's hard to modify
bus level code.


Why is it hard?


As Yang said, the concern is that the PCI spec doesn't document how to
do migration.





It will also block implementation on Windows.


Implementation of what?  We are discussing motivation here, not
implementation.  E.g. windows drivers typically support surprise
removal, should you use that, you get some working code for free.  Just
stop worrying about it.  Make it work, worry about closed source
software later.


Dave




Re: live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu

On 12/10/2015 4:38 PM, Michael S. Tsirkin wrote:

Let's assume you do save state and do have a way to detect
whether state matches a given hardware. For example,
driver could store firmware and hardware versions
in the state, and then on destination, retrieve them
and compare. It will be pretty common that you have a mismatch,
and you must not just fail migration. You need a way to recover,
maybe with more downtime.


Second, you can change the driver but you can not be sure it will have
the chance to run at all. Host overload is a common reason to migrate
out of the host.  You also can not trust guest to do the right thing.
So how long do you want to wait until you decide guest is not
cooperating and kill it?  Most people will probably experiment a bit and
then add a bit of a buffer. This is not robust at all.

Again, maybe you ask driver to save state, and if it does
not respond for a while, then you still migrate,
and driver has to recover on destination.


With the above in mind, you need to support two paths:
1. "good path": driver stores state on source, checks it on destination
detects a match and restores state into the device
2. "bad path": driver does not store state, or detects a mismatch
on destination. driver has to assume device was lost,
and reset it

So what I am saying is, implement bad path first. Then good path
is an optimization - measure whether it's faster, and by how much.



These sound reasonable. The driver should be able to do such checks to
ensure hardware/firmware coherence after migration, and to reset the
device when migration happens at some unexpected point.




Also, it would be nice if on the bad path there was a way
to switch to another driver entirely, even if that means
a bit more downtime. For example, have a way for driver to
tell Linux it has to re-do probing for the device.


I just glanced at the driver core code; device_reprobe() does what you said.

/**
 * device_reprobe - remove driver for a device and probe for a new  driver
 * @dev: the device to reprobe
 *
 * This function detaches the attached driver (if any) for the given
 * device and restarts the driver probing process.  It is intended
 * to use if probing criteria changed during a devices lifetime and
 * driver attachment should change accordingly.
 */
int device_reprobe(struct device *dev)







Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:

Ideally, it would be possible to leave the guest driver unmodified, but
that requires the hypervisor or qemu to be aware of the device, which
means we may need a driver in the hypervisor or qemu to handle the
device on behalf of the guest driver.

Can you answer the question of when do you use your code -
at the start of migration or
just before the end?


Just before stopping the VCPU in this version; we inject a VF mailbox
irq to notify the driver if the irq handler is installed.
The Qemu side will also check this via the faked PCI migration
capability, and the driver sets the status during its device open() or
resume() callback.




> > It would be great if we could avoid changing the guest; but at least
> > your guest driver changes don't actually seem to be that hardware
> > specific; could your changes actually be moved to the generic PCI
> > level so they could be made to work for lots of drivers?
>
> It is impossible to use one common solution for all devices unless the
> PCIe spec documents it clearly, and I think one day it will. But
> before that, we need some workarounds in the guest driver to make it
> work, even if that looks ugly.


Yes, so far there is no hardware migration support and it's hard to
modify bus-level code. It will also block implementation on Windows.



Dave




Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu



On 12/8/2015 1:12 AM, Alexander Duyck wrote:

On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu <tianyu@intel.com> wrote:

On 12/5/2015 1:07 AM, Alexander Duyck wrote:



We still need to support Windows guests for migration, and this is why
our patches keep all changes in the driver: it's impossible to change
the Windows kernel.



That is a poor argument.  I highly doubt Microsoft is interested in
having to modify all of the drivers that will support direct assignment
in order to support migration.  They would likely request something
similar to what I have in that they will want a way to do DMA tracking
with minimal modification required to the drivers.



This totally depends on the NIC (or other device) vendors; it is their
decision whether to support migration. If yes, they would modify the
driver.


Having to modify every driver that wants to support live migration is
a bit much.  In addition I don't see this being limited only to NIC
devices.  You can direct assign a number of different devices, your
solution cannot be specific to NICs.


We are also adding such migration support for the QAT device, so our
solution will not be limited to NICs. This is just the beginning.

We can't limit users to Linux guests only, so the migration feature
should work for both Windows and Linux guests.




If the only target is to call suspend/resume during migration, the
feature will be meaningless. In most cases we don't want to affect the
user much during migration, so the service downtime is vital. Our
target is to apply SR-IOV NIC passthrough to cloud service and NFV
(network functions virtualization) projects, which are sensitive to
network performance and stability. In my opinion, we should give the
device driver a chance to implement its own migration job, and only
call the suspend and resume callbacks in a driver that doesn't care
about performance during migration.


The suspend/resume callback should be efficient in terms of time.
After all we don't want the system to stall for a long period of time
when it should be either running or asleep.  Having it burn cycles in
a power state limbo doesn't do anyone any good.  If nothing else maybe
it will help to push the vendors to speed up those functions which
then benefit migration and the system sleep states.


If we can benefit both migration and suspend, that would be wonderful.
But migration and system PM are still different. For example, during
migration the driver doesn't need to put the device into a deep
D-state (the host can do that after migration), while for system sleep
that is essential. The PCI configuration space and interrupt
configuration are emulated by Qemu, and Qemu can migrate them to the
new machine, so the driver doesn't need to deal with them. So I think
migration still needs a different callback, or a different code path,
than device suspend/resume.

Another concern is that we would have to rework the PM core or the PCI
bus driver to call suspend/resume for passthrough devices during
migration. This also blocks the feature from working on Windows.



Also you keep assuming you can keep the device running while you do
the migration and you can't.  You are going to corrupt the memory if
you do, and you have yet to provide any means to explain how you are
going to solve that.



The main problem is the DMA tracking issue. I will repost my solution
in a new thread for discussion. If there is no way to mark DMA pages
dirty while DMA is enabled, we have to stop DMA for a short time to do
that at the last stage.








Following is my idea for DMA tracking.

Inject an event to the VF driver after the memory iteration stage and
before stopping the VCPU; the VF driver then marks dirty all memory it
is using for DMA. Newly allocated pages also need to be marked dirty
before stopping the VCPU. All memory dirtied in this window will be
migrated during the stop-and-copy stage. We also need to make sure to
disable the VF, by clearing its bus master enable bit, before
migrating this memory.



The ordering of your explanation here doesn't quite work.  What needs to
happen is that you have to disable DMA and then mark the pages as dirty.
   What the disabling of the BME does is signal to the hypervisor that
the device is now stopped.  The ixgbevf_suspend call already supported
by the driver is almost exactly what is needed to take care of something
like this.



This is why I hope to reserve a piece of space in the DMA page for a
dummy write. That can mark the page dirty without requiring DMA to
stop and without racing with the DMA data.
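A toy model of that dummy-write idea (a pure userspace simulation; the layout, sizes, and names are invented to illustrate the scheme, not taken from the patches): each DMA page reserves a small CPU-only pad, and the driver dirties every in-flight page by touching only the pad, so device DMA to the rest of the page is never disturbed by the marking write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define DIRTY_PAD 8      /* bytes reserved at the end of each DMA page */

struct dma_page {
    uint8_t data[PAGE_SIZE - DIRTY_PAD]; /* device-visible DMA area */
    uint8_t pad[DIRTY_PAD];              /* CPU-only dummy-write area */
};

/* Stand-in for the hypervisor's write-driven dirty-page log. */
static int page_dirty[16];

static void log_write(const void *addr, const struct dma_page *pages,
                      int npages)
{
    /* Resolve which page the CPU write hit and mark it dirty. */
    ptrdiff_t off = (const uint8_t *)addr - (const uint8_t *)pages;
    int idx = (int)(off / (ptrdiff_t)sizeof(struct dma_page));
    if (idx >= 0 && idx < npages)
        page_dirty[idx] = 1;
}

/* Driver side: mark every mapped DMA page dirty with a harmless write
 * to its pad; in-flight DMA to the data area is not touched. */
static void mark_dma_pages_dirty(struct dma_page *pages, int npages)
{
    for (int i = 0; i < npages; i++) {
        pages[i].pad[0] ^= 1;
        log_write(&pages[i].pad[0], pages, npages);
    }
}
```

As Alexander points out in the reply below the model does not remove the ordering problem by itself; it only shows how a CPU write confined to the pad avoids overwriting device data.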


You can't and it will still race.  What concerns me is that your
patches and the document you referenced earlier show a considerable
lack of understanding about how DMA and device drivers work.  There is
a reason why device drivers have so many memory barriers and the like
in them.  The fact is when you have CPU and a device both accessing
memory things have to be done in a very specific order and you cannot
violate that.

If you have a contiguous block of memory you expect the device to
write into you cannot just poke 

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu

On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote:

On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.


It's nice that it's appreciated, but you then go on and ignore
all that I have written here:
https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html



No, I will reply to it separately and, per your suggestion, split it
into 3 threads.



We still need to support Windows guests for migration, and this is why
our patches keep all changes in the driver: it's impossible to change
the Windows kernel.


This is not a reasonable argument.  It makes no sense to duplicate code
on Linux because you must duplicate code on Windows.  Let's assume you
must do it in the driver on windows because windows has closed source
drivers.  What does it matter? Linux can still do it as part of DMA API
and have it apply to all drivers.



Sure. Duplicated code should be encapsulated so it can be reused by
other drivers, just like the dummy-write part you mentioned.

I meant that the framework should not require changing Windows kernel
code (such as the PM core or the PCI bus driver), since that would
block implementation on Windows.

I think it's not a problem to duplicate code in the Windows drivers.


Following is my idea for DMA tracking.

Inject an event to the VF driver after the memory iteration stage and
before stopping the VCPU; the VF driver then marks dirty all memory it
is using for DMA. Newly allocated pages also need to be marked dirty
before stopping the VCPU. All memory dirtied in this window will be
migrated during the stop-and-copy stage. We also need to make sure to
disable the VF, by clearing its bus master enable bit, before
migrating this memory.

The DMA pages allocated by the VF driver also need to reserve space
for the dummy write.


I suggested ways to do it all in the hypervisor without driver hacks, or
hide it within DMA API without need to reserve extra space. Both
approaches seem much cleaner.



This sounds reasonable. We can discuss it in detail in a separate thread.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu

On 12/9/2015 7:28 PM, Michael S. Tsirkin wrote:

I remember reading that it's possible to implement a bus driver
on windows if required.  But basically I don't see how windows can be
relevant to discussing guest driver patches. That discussion
probably belongs on the qemu mailing list, not on lkml.


I am not sure whether we can write a bus driver for Windows to support
migration. But I think device vendors who want to support migration
will improve their drivers if we provide such a framework in the
hypervisor that only requires them to change their drivers.


Re: live migration vs device assignment (motivation)

2015-12-09 Thread Lan, Tianyu

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do see some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.


Motivation:
Most current solutions for migrating a VM with a passthrough device
are based on PCI hotplug, but that has side effects and doesn't work
for all devices.

For NIC devices:
The PCI hotplug solution can work around network device migration by
switching between the VF and the PF.

But switching the network interface introduces service downtime.

I tested the service downtime by putting the VF and PV interfaces into
a bonded interface and pinging the bonded interface while plugging and
unplugging the VF:
1) About 100ms when adding the VF
2) About 30ms when removing the VF
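For reference, a minimal sketch of the bonded-interface setup used for this kind of test (a config fragment; the interface names `eth0` for the PV/virtio NIC and `eth1` for the VF, the addressing, and the active-backup mode are assumptions, not taken from the original test description):

```shell
# Guest side: enslave the PV NIC and the VF into an active-backup bond
# so traffic fails over to the PV path while the VF is hot-unplugged.
ip link add bond0 type bond mode active-backup miimon 100
ip link set eth0 down && ip link set eth0 master bond0   # PV/virtio NIC
ip link set eth1 down && ip link set eth1 master bond0   # VF
ip link set bond0 up
ip addr add 192.168.0.2/24 dev bond0

# From another host, ping 192.168.0.2 while hot-(un)plugging the VF to
# measure the failover downtime.
```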

It also requires the guest to do the switch configuration. These are
hard to manage and deploy for our customers. To maintain PV performance
during migration, the host side also needs to assign a VF to the PV
device, which affects scalability.

These factors block SR-IOV NIC passthrough usage in cloud services and
OPNFV, which require high network performance and stability.


For other kinds of devices, hotplug is hard to make work at all.
We are also adding migration support for the QAT (QuickAssist
Technology) device.

QAT device use case introduction:
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive
operations, such as:
1) Symmetric cryptography functions, including cipher operations and
authentication operations
2) Public key functions, including RSA, Diffie-Hellman, and elliptic
curve cryptography
3) Compression and decompression functions, including DEFLATE and LZS

PCI hotplug will not work for such devices during migration; these
operations will fail when the device is unplugged.


So we are trying to implement a new solution that really migrates the
device state to the target machine and doesn't disturb the user during
migration, with low service downtime.


Re: live migration vs device assignment (motivation)

2015-12-09 Thread Lan, Tianyu



On 12/10/2015 1:14 AM, Alexander Duyck wrote:

On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu <tianyu@intel.com> wrote:


For other kind of devices, it's hard to work.
We are also adding migration support for QAT(QuickAssist Technology) device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive operations,
such as:
1) Symmetric cryptography functions including cipher operations and
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration; these
operations will fail when the device is unplugged.


I assume the problem is that with a PCI hotplug event you are losing
the state information for the device, do I have that right?

Looking over the QAT drivers it doesn't seem like any of them support
the suspend/resume PM calls.  I would imagine it makes it difficult
for a system with a QAT card in it to be able to drop the system to a
low power state.  You might want to try enabling suspend/resume
support for the devices on bare metal before you attempt to take on
migration as it would provide you with a good testing framework to see
what needs to be saved/restored within the device and in what order
before you attempt to do the same while migrating from one system to
another.


Sure. The suspend/resume work is under way.
Actually, we have already enabled QAT migration internally; we are doing 
more testing and fixing bugs.




- Alex




Re: live migration vs device assignment (motivation)

2015-12-09 Thread Lan, Tianyu


On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:

On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.


Motivation:
Most current solutions for migration with a passthrough device are based on
PCI hotplug, but it has side effects and can't work for all devices.

For NIC device:
PCI hotplug solution can work around Network device migration
via switching VF and PF.


This is just more confusion. hotplug is just a way to add and remove
devices. switching VF and PF is up to guest and hypervisor.


This is a combination. Because the current world can't migrate device state
during migration (what we are doing), existing solutions for migrating a VM
with a passthrough NIC rely on PCI hotplug: unplug the VF before starting
migration and switch the network from the VF NIC to the PV NIC to maintain
the connection, then plug the VF back in after migration and switch from PV
back to VF. The bonding driver provides a way to switch between the PV and
VF NICs automatically while keeping the IP and MAC, and so the bonding
driver is the preferred approach.





But switching network interface will introduce service down time.

I tested the service down time via putting VF and PV interface
into a bonded interface and ping the bonded interface during plug
and unplug VF.
1) About 100ms when add VF
2) About 30ms when del VF


OK and what's the source of the downtime?
I'm guessing that's just arp being repopulated.  So simply save and
re-populate it.

There would be a much cleaner solution.

Or maybe there's a timer there that just delays hotplug
for no reason. Fix it, everyone will benefit.


It also requires the guest to do the switch configuration.


That's just wrong. if you want a switch, you need to
configure a switch.


I meant the configuration of the switching operation between PV and VF.




These are hard for our customers to
manage and deploy.


So kernel want to remain flexible, and the stack is
configurable. Downside: customers need to deploy userspace
to configure it. Your solution: a hard-coded configuration
within kernel and hypervisor.  Sorry, this makes no sense.
If kernel is easier for you to deploy than userspace,
you need to rethink your deployment strategy.


This is one factor.




To maintain PV performance during
migration, host side also needs to assign a VF to PV device. This
affects scalability.


No idea what this means.


These factors block SRIOV NIC passthrough usage in cloud services and
OPNFV, which require high network performance and stability.


Everyone needs performance and scalability.



For other kinds of devices, it's hard to make this work.
We are also adding migration support for the QAT (QuickAssist Technology) device.

QAT device use case introduction.
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive operations,
such as:
1) Symmetric cryptography functions including cipher operations and
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration, and these
operations will fail when the device is unplugged.

So we are trying to implement a new solution that really migrates
device state to the target machine and won't affect the user during
migration, with low service downtime.


Let's assume for the sake of the argument that there's a lot going on
and removing the device is just too slow (though you should figure out
what's going on before giving up and just building something new from
scratch).


No. We can use a PV NIC as a backup for the VF NIC during migration, but 
that doesn't work for other kinds of devices since there is no backup for 
them. E.g., when migration happens while a user is compressing files via 
QAT, it's impossible to remove the QAT device at that point. If we do, the 
compress operation will fail and affect the user experience.




I still don't think you should be migrating state.  That's just too
fragile, and it also means you depend on driver to be nice and shut down
device on source, so you can not migrate at will.  Instead, reset device
on destination and re-initialize it.



Yes, saving and restoring device state relies on the driver, so we 
rework the driver and make it more migration-friendly.





Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-07 Thread Lan, Tianyu

On 12/5/2015 1:07 AM, Alexander Duyck wrote:


We still need to support Windows guests for migration, and this is why our
patches keep all changes in the driver, since it's impossible to change
the Windows kernel.


That is a poor argument.  I highly doubt Microsoft is interested in
having to modify all of the drivers that will support direct assignment
in order to support migration.  They would likely request something
similar to what I have in that they will want a way to do DMA tracking
with minimal modification required to the drivers.


This totally depends on the NIC or other device vendors; they should
make the decision whether to support migration or not. If yes, they will
modify their drivers.

If the only target is to call suspend/resume during migration, the feature
will be meaningless. Most use cases don't want migration to affect the user
much, so the service downtime is vital. Our target is to apply SRIOV NIC
passthrough to cloud services and NFV (network functions virtualization)
projects, which are sensitive to network performance and stability. In my
opinion, we should give the device driver a chance to implement its own
migration job, and call the suspend and resume callbacks in the driver if
it doesn't care about performance during migration.




The following is my idea for DMA tracking.

Inject an event to the VF driver after the memory iterate stage
and before stopping the VCPU; the VF driver then marks all memory
in use for DMA as dirty. Newly allocated pages also need to
be marked dirty before stopping the VCPU. All dirty memory
in this time slot will be migrated until the stop-and-copy
stage. We also need to make sure to disable the VF, by clearing its
bus master enable bit, before migrating this memory.


The ordering of your explanation here doesn't quite work.  What needs to
happen is that you have to disable DMA and then mark the pages as dirty.
  What the disabling of the BME does is signal to the hypervisor that
the device is now stopped.  The ixgbevf_suspend call already supported
by the driver is almost exactly what is needed to take care of something
like this.


This is why I hope to reserve a piece of space in each DMA page for a 
dummy write. This helps mark the page dirty without requiring DMA to 
stop, and without racing with the DMA data.


If we can't do that, we have to stop DMA for a short time to mark all DMA
pages dirty and then re-enable it. I am not sure how much we can gain this
way when tracking all DMA memory with the device running during migration.
I need to do some tests and compare the results with stopping DMA directly
at the last stage of migration.




The question is how we would go about triggering it.  I really don't
think the PCI configuration space approach is the right idea.  I wonder
if we couldn't get away with some sort of ACPI event instead.  We
already require ACPI support in order to shut down the system
gracefully, I wonder if we couldn't get away with something similar in
order to suspend/resume the direct assigned devices gracefully.



I don't think there are such events in the current spec.
Otherwise, there are two kinds of suspend/resume callbacks:
1) System suspend/resume, called during S2RAM and S2DISK.
2) Runtime suspend/resume, called by the PM core when the device is idle.
If you want to do what you mentioned, you have to change the PM core and
the ACPI spec.


The DMA pages allocated by the VF driver also need to reserve space
for the dummy write.


No, this will not work.  If for example you have a VF driver allocating
memory for a 9K receive how will that work?  It isn't as if you can poke
a hole in the contiguous memory.



Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-12-04 Thread Lan, Tianyu


On 12/4/2015 4:05 PM, Michael S. Tsirkin wrote:

I haven't read it, but I would like to note you can't rely on research
papers.  If you propose a patch to be merged you need to measure what is
its actual effect on modern linux at the end of 2015.


Sure. Will do that.


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-04 Thread Lan, Tianyu

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.

We still need to support Windows guests for migration, and this is why our 
patches keep all changes in the driver, since it's impossible to change 
the Windows kernel.


The following is my idea for DMA tracking.

Inject an event to the VF driver after the memory iterate stage
and before stopping the VCPU; the VF driver then marks all memory
in use for DMA as dirty. Newly allocated pages also need to
be marked dirty before stopping the VCPU. All dirty memory
in this time slot will be migrated until the stop-and-copy
stage. We also need to make sure to disable the VF, by clearing its
bus master enable bit, before migrating this memory.

The DMA pages allocated by the VF driver also need to reserve space
for the dummy write.


On 12/2/2015 7:44 PM, Michael S. Tsirkin wrote:

On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote:

On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin  wrote:

On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:

On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin  wrote:



There are several components to this:
- dma_map_* needs to prevent page from
   being migrated while device is running.
   For example, expose some kind of bitmap from guest
   to host, set bit there while page is mapped.
   What happens if we stop the guest and some
   bits are still set? See dma_alloc_coherent below
   for some ideas.


Yeah, I could see something like this working.  Maybe we could do
something like what was done for the NX bit and make use of the upper
order bits beyond the limits of the memory range to mark pages as
non-migratable?

I'm curious.  What we have with a DMA mapped region is essentially
shared memory between the guest and the device.  How would we resolve
something like this with IVSHMEM, or are we blocked there as well in
terms of migration?


I have some ideas. Will post later.


I look forward to it.


- dma_unmap_* needs to mark page as dirty
   This can be done by writing into a page.

- dma_sync_* needs to mark page as dirty
   This is trickier as we can not change the data.
   One solution is using atomics.
   For example:
 int x = ACCESS_ONCE(*p);
 cmpxchg(p, x, x);
   Seems to do a write without changing page
   contents.


Like I said we can probably kill 2 birds with one stone by just
implementing our own dma_mark_clean() for x86 virtualized
environments.

I'd say we could take your solution one step further and just use 0
instead of bothering to read the value.  After all it won't write the
area if the value at the offset is not 0.


Really almost any atomic that has no side effect will do.
atomic or with 0
atomic and with 

It's just that cmpxchg already happens to have a portable
wrapper.


I was originally thinking maybe an atomic_add with 0 would be the way
to go.


cmpxchg with any value too.


  Either way though we still are using a locked prefix and
having to dirty a cache line per page which is going to come at some
cost.


I agree. It's likely not necessary for everyone
to be doing this: only people that both
run within the VM and want migration to work
need to do this logging.

So set some module option to have driver tell hypervisor that it
supports logging.  If bus mastering is enabled before this, migration is
blocked.  Or even pass some flag from hypervisor so
driver can detect it needs to log writes.
I guess this could be put in device config somewhere,
though in practice it's a global thing, not a per device one, so
maybe we need some new channel to
pass this flag to guest. CPUID?
Or maybe we can put some kind of agent in the initrd
and use the existing guest agent channel after all.
agent in initrd could open up a lot of new possibilities.



- dma_alloc_coherent memory (e.g. device rings)
   must be migrated after device stopped modifying it.
   Just stopping the VCPU is not enough:
   you must make sure device is not changing it.

   Or maybe the device has some kind of ring flush operation,
   if there was a reasonably portable way to do this
   (e.g. a flush capability could maybe be added to SRIOV)
   then hypervisor could do this.


This is where things start to get messy. I was suggesting the
suspend/resume to resolve this bit, but it might be possible to also
deal with this via something like this via clearing the bus master
enable bit for the VF.  If I am not mistaken that should disable MSI-X
interrupts and halt any DMA.  That should work as long as you have
some mechanism that is tracking the pages in use for DMA.


A bigger issue is recovering afterwards.


Agreed.


   In case you need to resume on source, you
   really need to follow the same path
   as on destination, preferably detecting
   device reset and restoring the device
   state.


The problem with detecting the reset is that you would likely have to
be polling to do something like that.


We could some event to guest to notify it about this event
through a new or 

Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-12-03 Thread Lan, Tianyu


On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote:

>We hope
>to find a better way to make SRIOV NIC work in these cases and this is
>worth to do since SRIOV NIC provides better network performance compared
>with PV NIC.

If this is a performance optimization as the above implies,
you need to include some numbers, and document how did
you implement the switch and how did you measure the performance.



OK. Some ideas in my patches come from the paper "CompSC: Live Migration
with Pass-through Devices".
http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf

It compares performance data between the PV/VF switching solution and 
VF migration (Chapter 7: Discussion).




>Current patches have some issues. I think we can find
>solutions for them and improve them step by step.



Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-12-03 Thread Lan, Tianyu



On 12/3/2015 6:25 AM, Alex Williamson wrote:

On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:

This patch is to add SRIOV VF migration support.
Create new device type "vfio-sriov" and add faked PCI migration capability
to the type device.

The purpose of the new capability
1) sync migration status with VF driver in the VM
2) Get mailbox irq vector to notify VF driver during migration.
3) Provide a way to control injecting irq or not.

Qemu will migrate the PCI config space regs and MSI-X config for the VF.
It injects the mailbox irq at the last stage of migration to notify the VF
about the migration event and waits for the VF driver to be ready. The VF
driver writes the PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap
table to tell Qemu.


What makes this sr-iov specific?  Why wouldn't we simply extend vfio-pci
with a migration=on feature?  Thanks,


Sounds reasonable and will update it.



Alex




Re: [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability

2015-12-03 Thread Lan, Tianyu



On 12/3/2015 6:25 AM, Alex Williamson wrote:

This will of course break if the PCI SIG defines that capability index.
Couldn't this be done within a vendor defined capability?  Thanks,


Yes, it should work and thanks for suggestion.


Re: [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition

2015-12-03 Thread Lan, Tianyu


On 12/3/2015 6:25 AM, Alex Williamson wrote:

I didn't seen a matching kernel patch series for this, but why is the
kernel more capable of doing this than userspace is already?

The following link is the kernel patch.
http://marc.info/?l=kvm=144837328920989=2


These seem
like pointless ioctls, we're creating a purely virtual PCI capability,
the kernel doesn't really need to participate in that.


The VFIO kernel driver has pci_config_map, which indicates the PCI
capability positions and lengths and helps to find free PCI config regs.
The Qemu side doesn't have such info and can't get the exact table size of
a PCI capability. If we want to add such support in Qemu, it needs to
duplicate a lot of code from vfio_pci_configs.c.



Also, why are we
restricting ourselves to standard capabilities?


This version is to check whether we are on the right track, and we can
extend this to PCI extended capabilities later.


That's often a crowded
space and we can't always know whether an area is free or not based only
on it being covered by a capability.  Some capabilities can also appear
more than once, so there's context that isn't being passed to the kernel
here.  Thanks,


The regions outside of PCI capabilities are not passed to the kernel or 
used by Qemu for MSI/MSI-X. It's possible to use these places for the new 
capability. One concern is that the guest driver may abuse them, and a 
quirk masking some special regs outside of capabilities may be helpful.







Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-12-02 Thread Lan, Tianyu

On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote:

But
it requires guest OS to do specific configurations inside and rely on
bonding driver which blocks it work on Windows.
 From performance side,
putting VF and virtio NIC under bonded interface will affect their
performance even when not do migration. These factors block to use VF
NIC passthough in some user cases(Especially in the cloud) which require
migration.


That's really up to guest. You don't need to do bonding,
you can just move the IP and mac from userspace, that's
possible on most OS-es.

Or write something in guest kernel that is more lightweight if you are
so inclined. What we are discussing here is the host-guest interface,
not the in-guest interface.


The solution we proposed changes the NIC driver and Qemu. The guest OS
doesn't need to do anything special for migration.
It's easy to deploy.



Except of course these patches don't even work properly yet.

And when they do, even minor changes in host side NIC hardware across
migration will break guests in hard to predict ways.


Switching between the PV and VF NIC will introduce a network stop, and the
latency of hotplugging the VF is measurable. For some use cases (cloud
services and OPNFV) which are sensitive to network stability and
performance, these are not friendly and block SRIOV NIC usage in those
cases. We hope to find a better way to make SRIOV NICs work in these
cases, and this is worth doing since a SRIOV NIC provides better network
performance compared with a PV NIC. The current patches have some issues.
I think we can find solutions for them and improve them step by step.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Lan, Tianyu



On 12/1/2015 12:07 AM, Alexander Duyck wrote:

They can only be corrected if the underlying assumptions are correct
and they aren't.  Your solution would have never worked correctly.
The problem is you assume you can keep the device running when you are
migrating and you simply cannot.  At some point you will always have
to stop the device in order to complete the migration, and you cannot
stop it before you have stopped your page tracking mechanism.  So
unless the platform has an IOMMU that is somehow taking part in the
dirty page tracking you will not be able to stop the guest and then
the device, it will have to be the device and then the guest.


>Doing suspend and resume() may help to do migration easily but some
>devices requires low service down time. Especially network and I got
>that some cloud company promised less than 500ms network service downtime.

Honestly focusing on the downtime is getting the cart ahead of the
horse.  First you need to be able to do this without corrupting system
memory and regardless of the state of the device.  You haven't even
gotten to that state yet.  Last I knew the device had to be up in
order for your migration to even work.


I think the issue is that the content of an Rx packet delivered to the 
stack may be changed during migration, because that piece of memory won't 
be migrated to the new machine. This may confuse applications or the 
stack. The current dummy-write solution ensures the packet content won't 
change after the dummy write, though the content may not be the received 
data if migration happens before that point. We can recheck the content 
via the checksum or CRC in the protocol after the dummy write to ensure 
the content is what the VF received. I think the stack already does such 
checks, and the packet will be dropped if it fails them.


Another way is to tell Qemu about all the memory the driver is using and 
let Qemu migrate that memory after stopping the VCPU and the device. This 
seems safe, but the implementation may be complex.



Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-11-30 Thread Lan, Tianyu



On 11/30/2015 4:01 PM, Michael S. Tsirkin wrote:

It is still not very clear what it is you are trying to achieve, and
whether your patchset achieves it.  You merely say "adding live
migration" but it seems pretty clear this isn't about being able to
migrate a guest transparently, since you are adding a host/guest
handshake.

This isn't about functionality either: I think that on KVM, it isn't
hard to live migrate if you can do a host/guest handshake, even today,
with no kernel changes:
1. before migration, expose a pv nic to guest (can be done directly on
   boot)
2. use e.g. a serial connection to move IP from an assigned device to pv nic
3. maybe move the mac as well
4. eject the assigned device
5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
happens) and start migration



This looks like the bonding driver solution, which puts the PV NIC and VF
in one bonded interface under active-backup mode. The bonding driver
will switch from the VF to the PV NIC automatically when the VF is
unplugged during migration. This is the only available solution for VF NIC
migration today. But it requires the guest OS to do specific configuration
inside, and relies on the bonding driver, which blocks it from working on
Windows. From the performance side, putting the VF and virtio NIC under a
bonded interface will affect their performance even when not migrating.
These factors block using VF NIC passthrough in some use cases (especially
in the cloud) which require migration.

The solution we proposed changes the NIC driver and Qemu. The guest OS
doesn't need to do anything special for migration. It's easy to deploy,
and since all changes are in the NIC driver, a NIC vendor can implement
migration support entirely in their driver.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-29 Thread Lan, Tianyu

On 11/26/2015 11:56 AM, Alexander Duyck wrote:

> I am not saying you cannot modify the drivers, however what you are
doing is far too invasive.  Do you seriously plan on modifying all of
the PCI device drivers out there in order to allow any device that
might be direct assigned to a port to support migration?  I certainly
hope not.  That is why I have said that this solution will not scale.


Current drivers are not migration-friendly. If a driver wants to
support migration, it has to be changed.

RFC PATCH V1 presented our ideas about how to deal with MMIO, ring, and
DMA tracking during migration. These are common for most drivers; they may
have been problematic in the previous version, but that can be corrected later.

Doing suspend and resume() may make migration easier, but some devices
require low service downtime, especially network devices; I heard that
some cloud companies promise less than 500ms of network service downtime.

So I think the performance effect should also be taken into account when 
we design the framework.





What I am counter proposing seems like a very simple proposition.  It
can be implemented in two steps.

1.  Look at modifying dma_mark_clean().  It is a function called in
the sync and unmap paths of the lib/swiotlb.c.  If you could somehow
modify it to take care of marking the pages you unmap for Rx as being
dirty it will get you a good way towards your goal as it will allow
you to continue to do DMA while you are migrating the VM.

2.  Look at making use of the existing PCI suspend/resume calls that
are there to support PCI power management.  They have everything
needed to allow you to pause and resume DMA for the device before and
after the migration while retaining the driver state.  If you can
implement something that allows you to trigger these calls from the
PCI subsystem such as hot-plug then you would have a generic solution
that can be easily reproduced for multiple drivers beyond those
supported by ixgbevf.


I glanced at the PCI hotplug code. The hotplug events are triggered by the 
PCI hotplug controller, and these events are defined in the controller 
spec; it's hard to add more events. Otherwise, we would need to add some 
specific code in the PCI hotplug core, since it only adds and removes
PCI devices when it gets events. It's also a challenge to modify the 
Windows hotplug code. So we may need to find another way.






Thanks.

- Alex



Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-25 Thread Lan, Tianyu

On 11/25/2015 8:28 PM, Michael S. Tsirkin wrote:

Frankly, I don't really see what this short term hack buys us,
and if it goes in, we'll have to maintain it forever.



The framework for how to notify the VF about migration status won't change
regardless of whether the VF is stopped before migration.
We hope to reach agreement on this first. Tracking dirty memory still
needs more discussion, and we will continue working on it. Stopping the VF
may help to work around the issue and make tracking easier.



Also, assuming you just want to do ifdown/ifup for some reason, it's
easy enough to do using a guest agent, in a completely generic way.



Just ifdown/ifup is not enough for migration. Some PCI settings need to be 
restored before doing ifup on the target machine.



Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-11-25 Thread Lan, Tianyu


On 11/25/2015 5:03 AM, Michael S. Tsirkin wrote:

>+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
>+  uint32_t val, int len)
>+{
>+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>+
>+if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
>+&& val == PCI_VF_READY_FOR_MIGRATION) {
>+qemu_event_set(_event);

This would wake migration so it can proceed -
except it needs QEMU lock to run, and that's
taken by the migration thread.


Sorry, I seem to be missing something.
Which lock may cause a deadlock when vfio_migration_cap_handle() is called
while migration runs?
The function is called when the VF accesses the faked PCI capability.



It seems unlikely that this ever worked - how
did you test this?




Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Lan Tianyu
On 2015-11-25 13:30, Alexander Duyck wrote:
> No, what I am getting at is that you can't go around and modify the
> configuration space for every possible device out there.  This
> solution won't scale.


PCI config space regs are emulated by Qemu, so we can find free
PCI config space regs for the faked PCI capability. Its position need not
be permanent.


>  If you instead moved the logic for notifying
> the device into a separate mechanism such as making it a part of the
> hot-plug logic then you only have to write the code once per OS in
> order to get the hot-plug capability to pause/resume the device.  What
> I am talking about is not full hot-plug, but rather to extend the
> existing hot-plug in Qemu and the Linux kernel to support a
> "pause/resume" functionality.  The PCI hot-plug specification calls
> out the option of implementing something like this, but we don't
> currently have support for it.
>

Could you elaborate on the part of the PCI hot-plug specification you mentioned?

My concern is whether it requires changing the PCI spec or not.



> I just feel doing it through PCI hot-plug messages will scale much
> better as you could likely make use of the power management
> suspend/resume calls to take care of most of the needed implementation
> details.
> 
> - Alex
-- 
Best regards
Tianyu Lan


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu
On 2015年11月24日 22:20, Alexander Duyck wrote:
> I'm still not a fan of this approach.  I really feel like this is
> something that should be resolved by extending the existing PCI hot-plug
> rather than trying to instrument this per driver.  Then you will get the
> goodness for multiple drivers and multiple OSes instead of just one.  An
> added advantage to dealing with this in the PCI hot-plug environment
> would be that you could then still do a hot-plug even if the guest
> didn't load a driver for the VF since you would be working with the PCI
> slot instead of the device itself.
> 
> - Alex

Hi Alex:
What you mentioned seems to be the bonding driver solution.
The paper "Live Migration with Pass-through Device for Linux VM" describes
it. It hot-plugs the VF during migration. In order to maintain the network
connection while the VF is removed, it takes advantage of the Linux bonding
driver to switch between the VF NIC and an emulated NIC. But it has side
effects: it requires additional configuration in the VM, and performance
while switching between the two NICs is not good.

-- 
Best regards
Tianyu Lan


Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Lan Tianyu
On 2015年11月25日 05:20, Michael S. Tsirkin wrote:
> I have to say, I was much more interested in the idea
> of tracking dirty memory. I have some thoughts about
> that one - did you give up on it then?

No, our final target is to keep the VF active during migration, and
tracking dirty memory is essential for that. But that seems hard to
achieve upstream in the short term. As a starting point, we stop the
VF before migration.

On further thought, the approach of stopping the VF still needs to track
DMA-dirtied memory, to make sure the receive buffers written before the
VF was stopped get migrated. It's easier to do that via a dummy write to
the data buffer when a packet is received.


-- 
Best regards
Tianyu Lan


[RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Lan Tianyu
This patch adds migration support to the ixgbevf driver. A faked PCI
migration capability table is used to communicate with Qemu and to
share migration status and the mailbox irq vector index.

Qemu notifies the VF by sending an MSIX msg to trigger the mailbox
vector during migration, and stores the migration status in the
PCI_VF_MIGRATION_VMM_STATUS reg in the new capability table.
The mailbox irq is triggered just before the stop-and-copy stage
and again after migration on the target machine.

The VF driver brings the net device down when it detects migration and
tells Qemu it is ready for migration by writing the
PCI_VF_MIGRATION_VF_STATUS reg. After migration, it brings the net
device up again.

Qemu is in charge of migrating the PCI config space regs and MSIX config.

The patch addresses the normal case, where net traffic works while the
mailbox irq is enabled. In other cases (such as when the driver isn't
loaded, or the adapter is suspended or closed), the mailbox irq won't be
triggered, and the VF driver will disable it via the PCI_VF_MIGRATION_CAP
reg. These cases will be resolved later.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
 2 files changed, 107 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..4b8ba2f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -438,6 +438,11 @@ struct ixgbevf_adapter {
u64 bp_tx_missed;
 #endif
 
+   u8 migration_cap;
+   u8 last_migration_reg;
+   unsigned long migration_status;
+   struct work_struct migration_task;
+
u8 __iomem *io_addr; /* Mainly for iounmap use */
u32 link_speed;
bool link_up;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..95860c2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -96,6 +96,8 @@ static int debug = -1;
 module_param(debug, int, 0);
 MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 
+#define MIGRATION_IN_PROGRESS  0
+
 static void ixgbevf_service_event_schedule(struct ixgbevf_adapter *adapter)
 {
if (!test_bit(__IXGBEVF_DOWN, &adapter->state) &&
@@ -1262,6 +1264,22 @@ static void ixgbevf_set_itr(struct ixgbevf_q_vector 
*q_vector)
}
 }
 
+static void ixgbevf_migration_check(struct ixgbevf_adapter *adapter) 
+{
+   struct pci_dev *pdev = adapter->pdev;
+   u8 val;
+
+   pci_read_config_byte(pdev,
+adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
+&val);
+
+   if (val != adapter->last_migration_reg) {
+   schedule_work(&adapter->migration_task);
+   adapter->last_migration_reg = val;
+   }
+
+}
+
 static irqreturn_t ixgbevf_msix_other(int irq, void *data)
 {
struct ixgbevf_adapter *adapter = data;
@@ -1269,6 +1287,7 @@ static irqreturn_t ixgbevf_msix_other(int irq, void *data)
 
hw->mac.get_link_status = 1;
 
+   ixgbevf_migration_check(adapter);
ixgbevf_service_event_schedule(adapter);
 
IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, adapter->eims_other);
@@ -1383,6 +1402,7 @@ out:
 static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter)
 {
struct net_device *netdev = adapter->netdev;
+   struct pci_dev *pdev = adapter->pdev;
int q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
int vector, err;
int ri = 0, ti = 0;
@@ -1423,6 +1443,12 @@ static int ixgbevf_request_msix_irqs(struct 
ixgbevf_adapter *adapter)
goto free_queue_irqs;
}
 
+   if (adapter->migration_cap) {
+   pci_write_config_byte(pdev,
+   adapter->migration_cap + PCI_VF_MIGRATION_IRQ,
+   vector);
+   }
+
return 0;
 
 free_queue_irqs:
@@ -2891,6 +2917,59 @@ static void ixgbevf_watchdog_subtask(struct 
ixgbevf_adapter *adapter)
ixgbevf_update_stats(adapter);
 }
 
+static void ixgbevf_migration_task(struct work_struct *work)
+{
+   struct ixgbevf_adapter *adapter = container_of(work,
+   struct ixgbevf_adapter,
+   migration_task);
+   struct pci_dev *pdev = adapter->pdev;
+   struct net_device *netdev = adapter->netdev;
+   u8 val;
+
+   if (!test_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status)) {
+   pci_read_config_byte(pdev,
+adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
+&val);
+   if (val != VMM_MIGRATION_START)
+   return;
+
+   pr_info("migration start\n");
+   set_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status);

[RFC PATCH V2 1/3] VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO

2015-11-24 Thread Lan Tianyu
This patch adds a new ioctl cmd VFIO_GET_PCI_CAP_INFO to get the
PCI cap table size and to find free PCI config space regs according
to pos and size.

Qemu will add a faked PCI capability for migration and needs such
info.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c | 21 
 drivers/vfio/pci/vfio_pci_config.c  | 38 +++--
 drivers/vfio/pci/vfio_pci_private.h |  5 +
 include/uapi/linux/vfio.h   | 12 
 4 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 69fab0f..2e42de0 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -784,6 +784,27 @@ hot_reset_release:
 
kfree(groups);
return ret;
+   } else if (cmd == VFIO_GET_PCI_CAP_INFO) {
+   struct vfio_pci_cap_info info;
+   int offset;
+
+   if (copy_from_user(&info, (void __user *)arg, sizeof(info)))
+   return -EFAULT;
+
+   switch (info.index) {
+   case VFIO_PCI_CAP_GET_SIZE:
+   info.size = vfio_get_cap_size(vdev, info.cap, 
info.offset);
+   break;
+   case VFIO_PCI_CAP_GET_FREE_REGION:
+   offset = vfio_find_free_pci_config_reg(vdev,
+   info.offset, info.size);
+   info.offset = offset;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return copy_to_user((void __user *)arg, &info, sizeof(info));
}
 
return -ENOTTY;
diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index ff75ca3..8afbda4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -841,6 +841,21 @@ static int vfio_find_cap_start(struct vfio_pci_device 
*vdev, int pos)
return pos;
 }
 
+int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev,
+   int pos, int size)
+{
+   int i, offset = pos;
+
+   for (i = pos; i < PCI_CFG_SPACE_SIZE; i++) {
+   if (vdev->pci_config_map[i] != PCI_CAP_ID_INVALID)
+   offset = i + 1;
+   else if (i - offset + 1 == size)
+   return offset;
+   }
+
+   return 0;
+}
+
 static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
int count, struct perm_bits *perm,
int offset, __le32 *val)
@@ -1199,6 +1214,20 @@ static int vfio_fill_vconfig_bytes(struct 
vfio_pci_device *vdev,
return ret;
 }
 
+int vfio_get_cap_size(struct vfio_pci_device *vdev, u8 cap, int pos)
+{
+   int len;
+
+   len = pci_cap_length[cap];
+   if (len == 0xFF) { /* Variable length */
+   len = vfio_cap_len(vdev, cap, pos);
+   if (len < 0)
+   return len;
+   }
+
+   return len;
+}
+
 static int vfio_cap_init(struct vfio_pci_device *vdev)
 {
struct pci_dev *pdev = vdev->pdev;
@@ -1238,12 +1267,9 @@ static int vfio_cap_init(struct vfio_pci_device *vdev)
return ret;
 
if (cap <= PCI_CAP_ID_MAX) {
-   len = pci_cap_length[cap];
-   if (len == 0xFF) { /* Variable length */
-   len = vfio_cap_len(vdev, cap, pos);
-   if (len < 0)
-   return len;
-   }
+   len = vfio_get_cap_size(vdev, cap, pos);
+   if (len < 0)
+   return len;
}
 
if (!len) {
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index ae0e1b4..91b4f9b 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -89,4 +89,9 @@ extern void vfio_pci_uninit_perm_bits(void);
 
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);
+extern int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev,
+   int pos, int size);
+extern int vfio_get_cap_size(struct vfio_pci_device *vdev,
+   u8 cap, int pos);
+
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index b57b750..dfa7023 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -495,6 +495,18 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
 
+#define VFIO_GET_PCI_CAP_INFO  _IO(VFIO_TYPE, VFIO_BASE + 22)
+struct vfio_pci_cap_info {
+   __u32   argsz;
+   __u32   flags;
+#define VFIO_PCI_CAP_GET_SIZE  (1 << 0)
+#define VFIO_

[RFC PATCH V2 2/3] PCI: Add macros for faked PCI migration capability

2015-11-24 Thread Lan Tianyu
This patch extends the PCI CAP ids with a migration cap and adds reg
macros. The CAP ID is tentative; we may find a better one if the
solution proves feasible.

*PCI_VF_MIGRATION_CAP
Controls whether the mailbox irq is triggered during migration; set by
the VF driver.

*PCI_VF_MIGRATION_VMM_STATUS
Qemu stores the migration status in this reg.

*PCI_VF_MIGRATION_VF_STATUS
The VF driver tells Qemu it is ready for migration via this reg.

*PCI_VF_MIGRATION_IRQ
The VF driver stores the mailbox interrupt vector in this reg for Qemu to
trigger during migration.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 include/uapi/linux/pci_regs.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index efe3443..9defb6f 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -216,7 +216,8 @@
 #define  PCI_CAP_ID_MSIX   0x11/* MSI-X */
 #define  PCI_CAP_ID_SATA   0x12/* SATA Data/Index Conf. */
 #define  PCI_CAP_ID_AF 0x13/* PCI Advanced Features */
-#define  PCI_CAP_ID_MAXPCI_CAP_ID_AF
+#define  PCI_CAP_ID_MIGRATION  0x14
+#define  PCI_CAP_ID_MAXPCI_CAP_ID_MIGRATION
 #define PCI_CAP_LIST_NEXT  1   /* Next capability in the list */
 #define PCI_CAP_FLAGS  2   /* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF 4
@@ -904,4 +905,19 @@
 #define PCI_TPH_CAP_ST_SHIFT   16  /* st table shift */
 #define PCI_TPH_BASE_SIZEOF12  /* size with no st table */
 
+/* Migration*/
+#define PCI_VF_MIGRATION_CAP   0x04
+#define PCI_VF_MIGRATION_VMM_STATUS0x05
+#define PCI_VF_MIGRATION_VF_STATUS 0x06
+#define PCI_VF_MIGRATION_IRQ   0x07
+
+#define PCI_VF_MIGRATION_DISABLE   0x00
+#define PCI_VF_MIGRATION_ENABLE0x01
+
+#define VMM_MIGRATION_END0x00
+#define VMM_MIGRATION_START  0x01
+
+#define PCI_VF_WAIT_FOR_MIGRATION   0x00
+#define PCI_VF_READY_FOR_MIGRATION  0x01
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-11-24 Thread Lan Tianyu
This patch adds SRIOV VF migration support.
It creates a new device type "vfio-sriov" and adds a faked PCI migration
capability to that device type.

The purposes of the new capability are to:
1) sync migration status with the VF driver in the VM
2) get the mailbox irq vector to notify the VF driver during migration
3) provide a way to control whether the irq is injected or not

Qemu will migrate the PCI config space regs and MSIX config for the VF.
It injects the mailbox irq at the last stage of migration to notify the
VF about the migration event and waits until the VF driver is ready. The
VF driver writes the PCI config reg PCI_VF_MIGRATION_VF_STATUS in the
new cap table to tell Qemu.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/Makefile.objs |   2 +-
 hw/vfio/pci.c |   6 ++
 hw/vfio/pci.h |   4 ++
 hw/vfio/sriov.c   | 178 ++
 4 files changed, 189 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/sriov.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d540c9d..9cf0178 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
-obj-$(CONFIG_PCI) += pci.o
+obj-$(CONFIG_PCI) += pci.o sriov.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7c43fc1..e7583b5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t 
addr,
 } else if (was_enabled && !is_enabled) {
 vfio_disable_msix(vdev);
 }
+} else if (vdev->migration_cap &&
+ranges_overlap(addr, len, vdev->migration_cap, 0x10)) {
+/* Write everything to QEMU to keep emulated bits correct */
+pci_default_write_config(pdev, addr, val, len);
+vfio_migration_cap_handle(pdev, addr, val, len);
 } else {
 /* Write everything to QEMU to keep emulated bits correct */
 pci_default_write_config(pdev, addr, val, len);
@@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev)
 vfio_register_err_notifier(vdev);
 vfio_register_req_notifier(vdev);
 vfio_setup_resetfn(vdev);
+vfio_add_migration_capability(vdev);
 
 return 0;
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6c00575..ee6ca5e 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
 PCIHostDeviceAddress host;
 EventNotifier err_notifier;
 EventNotifier req_notifier;
+uint16_tmigration_cap;
 int (*resetfn)(struct VFIOPCIDevice *);
 uint32_t features;
 #define VFIO_FEATURE_ENABLE_VGA_BIT 0
@@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t 
addr, int len);
 void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
uint32_t val, int len);
 void vfio_enable_msix(VFIOPCIDevice *vdev);
+void vfio_add_migration_capability(VFIOPCIDevice *vdev);
+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
+   uint32_t val, int len);
diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c
new file mode 100644
index 000..3109538
--- /dev/null
+++ b/hw/vfio/sriov.c
@@ -0,0 +1,178 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hw/hw.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/vfio-common.h"
+
+#define TYPE_VFIO_SRIOV "vfio-sriov"
+
+#define SRIOV_LM_SETUP 0x01
+#define SRIOV_LM_COMPLETE 0x02
+
+QemuEvent migration_event;
+
+static void vfio_dev_post_load(void *opaque)
+{
+struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+MSIMessage msg;
+int vector;
+
+if (vfio_pci_read_config(pdev,
+vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
+!= PCI_VF_MIGRATION_ENABLE)
+return;
+
+vector = vfio_pci_read_config(pdev,
+vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
+
+msg = msix_get_message(pdev, vector);
+kvm_irqchip_send_msi(kvm_state, msg);
+}
+
+static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id)
+{
+struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+int ret;
+
+if (qemu_get_byte(f) != SRIOV_LM_COMPLETE)
+return 0;
+
+ret = pci_device_load(pdev, f);
+if (ret) {
+error_report("Failed to load PCI config space.\n");
+return ret;
+}
+
+if (msix_enabled(pdev)) {
+vfio_enable_msix(vdev);
+msix_load(pdev, f);
+}
+
+vfio_pci_write_config(pdev, vdev->migration_cap +
+PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_END, 1);
+vfio_pci_write_config(pdev, vdev->migration_cap +
+PCI_VF_MIGRATION_VF_STATUS, P

[RFC PATCH V2 10/10] Qemu/VFIO: Misc change for enable migration with VFIO

2015-11-24 Thread Lan Tianyu
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e7583b5..404a5cd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3625,11 +3625,6 @@ static Property vfio_pci_dev_properties[] = {
 DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-.name = "vfio-pci",
-.unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3637,7 +3632,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, 
void *data)
 
 dc->reset = vfio_pci_reset;
 dc->props = vfio_pci_dev_properties;
-dc->vmsd = &vfio_pci_vmstate;
 dc->desc = "VFIO-based PCI device assignment";
 set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 pdc->init = vfio_initfn;
-- 
1.9.3



[RFC PATCH V2 01/10] Qemu/VFIO: Create head file pci.h to share data struct.

2015-11-24 Thread Lan Tianyu
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 137 +-
 hw/vfio/pci.h | 158 ++
 2 files changed, 159 insertions(+), 136 deletions(-)
 create mode 100644 hw/vfio/pci.h

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e0e339a..5c3f8a7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,138 +42,7 @@
 #include "trace.h"
 #include "hw/vfio/vfio.h"
 #include "hw/vfio/vfio-common.h"
-
-struct VFIOPCIDevice;
-
-typedef struct VFIOQuirk {
-MemoryRegion mem;
-struct VFIOPCIDevice *vdev;
-QLIST_ENTRY(VFIOQuirk) next;
-struct {
-uint32_t base_offset:TARGET_PAGE_BITS;
-uint32_t address_offset:TARGET_PAGE_BITS;
-uint32_t address_size:3;
-uint32_t bar:3;
-
-uint32_t address_match;
-uint32_t address_mask;
-
-uint32_t address_val:TARGET_PAGE_BITS;
-uint32_t data_offset:TARGET_PAGE_BITS;
-uint32_t data_size:3;
-
-uint8_t flags;
-uint8_t read_flags;
-uint8_t write_flags;
-} data;
-} VFIOQuirk;
-
-typedef struct VFIOBAR {
-VFIORegion region;
-bool ioport;
-bool mem64;
-QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOBAR;
-
-typedef struct VFIOVGARegion {
-MemoryRegion mem;
-off_t offset;
-int nr;
-QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOVGARegion;
-
-typedef struct VFIOVGA {
-off_t fd_offset;
-int fd;
-VFIOVGARegion region[QEMU_PCI_VGA_NUM_REGIONS];
-} VFIOVGA;
-
-typedef struct VFIOINTx {
-bool pending; /* interrupt pending */
-bool kvm_accel; /* set when QEMU bypass through KVM enabled */
-uint8_t pin; /* which pin to pull for qemu_set_irq */
-EventNotifier interrupt; /* eventfd triggered on interrupt */
-EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
-PCIINTxRoute route; /* routing info for QEMU bypass */
-uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
-QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
-} VFIOINTx;
-
-typedef struct VFIOMSIVector {
-/*
- * Two interrupt paths are configured per vector.  The first, is only used
- * for interrupts injected via QEMU.  This is typically the non-accel path,
- * but may also be used when we want QEMU to handle masking and pending
- * bits.  The KVM path bypasses QEMU and is therefore higher performance,
- * but requires masking at the device.  virq is used to track the MSI route
- * through KVM, thus kvm_interrupt is only available when virq is set to a
- * valid (>= 0) value.
- */
-EventNotifier interrupt;
-EventNotifier kvm_interrupt;
-struct VFIOPCIDevice *vdev; /* back pointer to device */
-int virq;
-bool use;
-} VFIOMSIVector;
-
-enum {
-VFIO_INT_NONE = 0,
-VFIO_INT_INTx = 1,
-VFIO_INT_MSI  = 2,
-VFIO_INT_MSIX = 3,
-};
-
-/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
-typedef struct VFIOMSIXInfo {
-uint8_t table_bar;
-uint8_t pba_bar;
-uint16_t entries;
-uint32_t table_offset;
-uint32_t pba_offset;
-MemoryRegion mmap_mem;
-void *mmap;
-} VFIOMSIXInfo;
-
-typedef struct VFIOPCIDevice {
-PCIDevice pdev;
-VFIODevice vbasedev;
-VFIOINTx intx;
-unsigned int config_size;
-uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
-off_t config_offset; /* Offset of config space region within device fd */
-unsigned int rom_size;
-off_t rom_offset; /* Offset of ROM region within device fd */
-void *rom;
-int msi_cap_size;
-VFIOMSIVector *msi_vectors;
-VFIOMSIXInfo *msix;
-int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
-int interrupt; /* Current interrupt type */
-VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
-VFIOVGA vga; /* 0xa, 0x3b0, 0x3c0 */
-PCIHostDeviceAddress host;
-EventNotifier err_notifier;
-EventNotifier req_notifier;
-int (*resetfn)(struct VFIOPCIDevice *);
-uint32_t features;
-#define VFIO_FEATURE_ENABLE_VGA_BIT 0
-#define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
-#define VFIO_FEATURE_ENABLE_REQ_BIT 1
-#define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
-int32_t bootindex;
-uint8_t pm_cap;
-bool has_vga;
-bool pci_aer;
-bool req_enabled;
-bool has_flr;
-bool has_pm_reset;
-bool rom_read_failed;
-} VFIOPCIDevice;
-
-typedef struct VFIORomBlacklistEntry {
-uint16_t vendor_id;
-uint16_t device_id;
-} VFIORomBlacklistEntry;
+#include "hw/vfio/pci.h"
 
 /*
  * List of device ids/vendor ids for which to disable
@@ -193,12 +62,8 @@ static const VFIORomBlacklistEntry romblacklist[] = {
 { 0x14e4, 0x168e }
 };
 
-#define MSIX_CAP_LENGTH 12
 
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
-static uint32_t vfio_pci_

[RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu
This patchset proposes a solution for adding live migration
support for SRIOV NICs.

During migration, Qemu needs to let the VF driver in the VM know when
migration starts and ends. Qemu adds a faked PCI migration capability
to help sync status between the two sides during migration.

Qemu triggers the VF's mailbox irq by sending an MSIX msg when the
migration status changes. The VF driver tells Qemu its mailbox vector
index via the new PCI capability. In some cases (the NIC is suspended or
closed), the VF mailbox irq is freed, and the VF driver can disable irq
injection via the new capability.

The VF driver will bring the NIC down before migration and bring it up
again on the target machine.

Lan Tianyu (10):
  Qemu/VFIO: Create head file pci.h to share data struct.
  Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
  Qemu/VFIO: Rework vfio_std_cap_max_size() function
  Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space
regs
  Qemu/VFIO: Expose PCI config space read/write and msix functions
  Qemu/PCI: Add macros for faked PCI migration capability
  Qemu: Add post_load_state() to run after restoring CPU state
  Qemu: Add save_before_stop callback to run just before stopping VCPU
during migration
  Qemu/VFIO: Add SRIOV VF migration support
  Qemu/VFIO: Misc change for enable migration with VFIO

 hw/vfio/Makefile.objs   |   2 +-
 hw/vfio/pci.c   | 196 +---
 hw/vfio/pci.h   | 168 +
 hw/vfio/sriov.c | 178 
 include/hw/pci/pci_regs.h   |  19 +
 include/migration/vmstate.h |   5 ++
 include/sysemu/sysemu.h |   1 +
 linux-headers/linux/vfio.h  |  16 
 migration/migration.c   |   3 +-
 migration/savevm.c  |  28 +++
 10 files changed, 459 insertions(+), 157 deletions(-)
 create mode 100644 hw/vfio/pci.h
 create mode 100644 hw/vfio/sriov.c

-- 
1.9.3



[RFC PATCH V2 03/10] Qemu/VFIO: Rework vfio_std_cap_max_size() function

2015-11-24 Thread Lan Tianyu
Use the new ioctl cmd VFIO_GET_PCI_CAP_INFO to get the PCI cap table
size. This gives an accurate table size and makes it easier to find free
PCI config space regs for the faked PCI capability. The current code
assigns the PCI config space regs from the start of the last PCI
capability to pos 0xff to that capability, which occupies some free PCI
config space regs.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5c3f8a7..29845e3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2344,18 +2344,20 @@ static void vfio_unmap_bars(VFIOPCIDevice *vdev)
 /*
  * General setup
  */
-static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
+static uint8_t vfio_std_cap_max_size(VFIOPCIDevice *vdev, uint8_t cap)
 {
-uint8_t tmp, next = 0xff;
+struct vfio_pci_cap_info reg_info = {
+.argsz = sizeof(reg_info),
+.index = VFIO_PCI_CAP_GET_SIZE,
+.cap = cap
+};
+int ret;
 
-for (tmp = pdev->config[PCI_CAPABILITY_LIST]; tmp;
- tmp = pdev->config[tmp + 1]) {
-if (tmp > pos && tmp < next) {
-next = tmp;
-}
-}
+ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, &reg_info);
+if (ret || reg_info.size == 0)
+error_report("vfio: Failed to find free PCI config reg: %m\n");
 
-return next - pos;
+return reg_info.size;
 }
 
 static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
@@ -2521,7 +2523,7 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t 
pos)
  * Since QEMU doesn't actually handle many of the config accesses,
  * exact size doesn't seem worthwhile.
  */
-size = vfio_std_cap_max_size(pdev, pos);
+size = vfio_std_cap_max_size(vdev, cap_id);
 
 /*
  * pci_add_capability always inserts the new capability at the head
-- 
1.9.3



[RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition

2015-11-24 Thread Lan Tianyu
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 linux-headers/linux/vfio.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 0508d0b..732b0bd 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -495,6 +495,22 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
 
+
+#define VFIO_FIND_FREE_PCI_CONFIG_REG   _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+#define VFIO_GET_PCI_CAP_INFO   _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+struct vfio_pci_cap_info {
+__u32 argsz;
+__u32 flags;
+#define VFIO_PCI_CAP_GET_SIZE (1 << 0)
+#define VFIO_PCI_CAP_GET_FREE_REGION (1 << 1)
+__u32 index;
+__u32 offset;
+__u32 size;
+__u8 cap;
+};
+
 /* * */
 
 #endif /* VFIO_H */
-- 
1.9.3



[RFC PATCH V2 07/10] Qemu: Add post_load_state() to run after restoring CPU state

2015-11-24 Thread Lan Tianyu
After migration, Qemu needs to trigger the mailbox irq to notify the VF
driver in the guest about the status change. IRQ delivery only works
again after the CPU state has been restored. This patch adds a new
callback that runs after restoring CPU state and so provides a point
from which to trigger the mailbox irq later.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 include/migration/vmstate.h |  2 ++
 migration/savevm.c  | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 0695d7c..dc681a6 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -56,6 +56,8 @@ typedef struct SaveVMHandlers {
 int (*save_live_setup)(QEMUFile *f, void *opaque);
 uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t 
max_size);
 
+/* This runs after restoring CPU related state */
+void (*post_load_state)(void *opaque);
 LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 9e0e286..48b6223 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -702,6 +702,20 @@ bool qemu_savevm_state_blocked(Error **errp)
 return false;
 }
 
+void qemu_savevm_post_load(void)
+{
+SaveStateEntry *se;
+
+QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+if (!se->ops || !se->ops->post_load_state) {
+continue;
+}
+
+se->ops->post_load_state(se->opaque);
+}
+}
+
+
 void qemu_savevm_state_header(QEMUFile *f)
 {
 trace_savevm_state_header();
@@ -1140,6 +1154,7 @@ int qemu_loadvm_state(QEMUFile *f)
 }
 
 cpu_synchronize_all_post_init();
+qemu_savevm_post_load();
 
 ret = 0;
 
-- 
1.9.3



[RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu
This patchset proposes a solution for adding live migration
support for SRIOV NICs.

During migration, Qemu needs to let the VF driver in the VM know when
migration starts and ends. Qemu adds a faked PCI migration capability
to help sync status between the two sides during migration.

Qemu triggers the VF's mailbox irq by sending an MSIX msg when the
migration status changes. The VF driver tells Qemu its mailbox vector
index via the new PCI capability. In some cases (the NIC is suspended or
closed), the VF mailbox irq is freed, and the VF driver can disable irq
injection via the new capability.

The VF driver will bring the NIC down before migration and bring it up
again on the target machine.

Lan Tianyu (3):
  VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO
  PCI: Add macros for faked PCI migration capability
  Ixgbevf: Add migration support for ixgbevf driver

 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
 drivers/vfio/pci/vfio_pci.c   |  21 +
 drivers/vfio/pci/vfio_pci_config.c|  38 ++--
 drivers/vfio/pci/vfio_pci_private.h   |   5 ++
 include/uapi/linux/pci_regs.h |  18 +++-
 include/uapi/linux/vfio.h |  12 +++
 7 files changed, 194 insertions(+), 7 deletions(-)

-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC PATCH V2 08/10] Qemu: Add save_before_stop callback to run just before stopping VCPU during migration

2015-11-24 Thread Lan Tianyu
This patch adds a callback which is invoked just before stopping the VCPU.
It lets VF migration trigger the mailbox irq as late as possible, to
decrease service downtime.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 include/migration/vmstate.h |  3 +++
 include/sysemu/sysemu.h |  1 +
 migration/migration.c   |  3 ++-
 migration/savevm.c  | 13 +
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index dc681a6..093faf1 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -58,6 +58,9 @@ typedef struct SaveVMHandlers {
 
 /* This runs after restoring CPU related state */
 void (*post_load_state)(void *opaque);
+
+/* This runs before stopping VCPU */
+void (*save_before_stop)(QEMUFile *f, void *opaque);
 LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index df80951..3d0d72c 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -84,6 +84,7 @@ void qemu_announce_self(void);
 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_state_begin(QEMUFile *f,
  const MigrationParams *params);
+void qemu_savevm_save_before_stop(QEMUFile *f);
 void qemu_savevm_state_header(QEMUFile *f);
 int qemu_savevm_state_iterate(QEMUFile *f);
 void qemu_savevm_state_complete(QEMUFile *f);
diff --git a/migration/migration.c b/migration/migration.c
index c6ac08a..fccadea 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -759,7 +759,6 @@ int64_t migrate_xbzrle_cache_size(void)
 }
 
 /* migration thread support */
-
 static void *migration_thread(void *opaque)
 {
 MigrationState *s = opaque;
@@ -788,6 +787,8 @@ static void *migration_thread(void *opaque)
 } else {
 int ret;
 
+qemu_savevm_save_before_stop(s->file);
+
 qemu_mutex_lock_iothread();
 start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
 qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
diff --git a/migration/savevm.c b/migration/savevm.c
index 48b6223..c2e4802 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -715,6 +715,19 @@ void qemu_savevm_post_load(void)
 }
 }
 
+void qemu_savevm_save_before_stop(QEMUFile *f)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->save_before_stop) {
+            continue;
+        }
+
+        se->ops->save_before_stop(f, se->opaque);
+    }
+}
+
 
 void qemu_savevm_state_header(QEMUFile *f)
 {
-- 
1.9.3



[RFC PATCH V2 05/10] Qemu/VFIO: Expose PCI config space read/write and msix functions

2015-11-24 Thread Lan Tianyu
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 6 +++---
 hw/vfio/pci.h | 4 
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d0354a0..7c43fc1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -613,7 +613,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev, 
unsigned int nr)
 }
 }
 
-static void vfio_enable_msix(VFIOPCIDevice *vdev)
+void vfio_enable_msix(VFIOPCIDevice *vdev)
 {
 vfio_disable_interrupts(vdev);
 
@@ -1931,7 +1931,7 @@ static void vfio_bar_quirk_free(VFIOPCIDevice *vdev, int 
nr)
 /*
  * PCI config space
  */
-static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
+uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 {
 VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
 uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
@@ -1964,7 +1964,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, 
uint32_t addr, int len)
 return val;
 }
 
-static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
   uint32_t val, int len)
 {
 VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6083300..6c00575 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -158,3 +158,7 @@ typedef struct VFIORomBlacklistEntry {
 #define MSIX_CAP_LENGTH 12
 
 uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size);
+uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
+void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+   uint32_t val, int len);
+void vfio_enable_msix(VFIOPCIDevice *vdev);
-- 
1.9.3



[RFC PATCH V2 04/10] Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space regs

2015-11-24 Thread Lan Tianyu
This patch adds an ioctl wrapper to find free PCI config space regs.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 19 +++
 hw/vfio/pci.h |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 29845e3..d0354a0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2508,6 +2508,25 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, 
uint8_t pos)
 }
 }
 
+uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size)
+{
+    struct vfio_pci_cap_info reg_info = {
+        .argsz = sizeof(reg_info),
+        .offset = pos,
+        .index = VFIO_PCI_CAP_GET_FREE_REGION,
+        .size = size,
+    };
+    int ret;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, &reg_info);
+    if (ret || reg_info.offset == 0) {
+        error_report("vfio: Failed to find free PCI config reg: %m\n");
+        return -EFAULT;
+    }
+
+    return reg_info.offset;
+}
+
 static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
 {
PCIDevice *pdev = &vdev->pdev;
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 9f360bf..6083300 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -156,3 +156,5 @@ typedef struct VFIORomBlacklistEntry {
 } VFIORomBlacklistEntry;
 
 #define MSIX_CAP_LENGTH 12
+
+uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size);
-- 
1.9.3



[RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability

2015-11-24 Thread Lan Tianyu
This patch extends the PCI CAP IDs with a migration cap and adds
reg macros. The CAP ID is tentative; we may find a better one if the
solution proves feasible.

*PCI_VF_MIGRATION_CAP
Lets the VF driver control whether the mailbox irq is triggered during migration.

*PCI_VF_MIGRATION_VMM_STATUS
Qemu stores the migration status in this reg.

*PCI_VF_MIGRATION_VF_STATUS
The VF driver tells Qemu it is ready for migration.

*PCI_VF_MIGRATION_IRQ
The VF driver stores its mailbox interrupt vector in this reg for Qemu to
trigger during migration.
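
To make the handshake concrete, here is a minimal sketch of how a VF driver
might react to its mailbox irq using these regs. This is an assumption-laden
illustration: the cfg[] array stands in for the 8-byte capability region and
the handler name is hypothetical; a real driver would go through
pci_read_config_byte()/pci_write_config_byte() at the capability offset.

```c
#include <assert.h>
#include <stdint.h>

/* Reg offsets and values from the proposed capability */
#define PCI_VF_MIGRATION_CAP        0x04
#define PCI_VF_MIGRATION_VMM_STATUS 0x05
#define PCI_VF_MIGRATION_VF_STATUS  0x06
#define PCI_VF_MIGRATION_IRQ        0x07

#define VMM_MIGRATION_END           0x00
#define VMM_MIGRATION_START         0x01
#define PCI_VF_WAIT_FOR_MIGRATION   0x00
#define PCI_VF_READY_FOR_MIGRATION  0x01
#define PCI_VF_MIGRATION_ENABLE     0x01

/* Hypothetical handler: cfg models the capability region. */
static void vf_migration_mbox_irq(uint8_t *cfg, uint8_t mbox_vector)
{
    if (cfg[PCI_VF_MIGRATION_CAP] != PCI_VF_MIGRATION_ENABLE)
        return; /* irq injection disabled, e.g. NIC suspended or closed */

    if (cfg[PCI_VF_MIGRATION_VMM_STATUS] == VMM_MIGRATION_START) {
        /* quiesce the NIC here, then report the vector and readiness */
        cfg[PCI_VF_MIGRATION_IRQ]       = mbox_vector;
        cfg[PCI_VF_MIGRATION_VF_STATUS] = PCI_VF_READY_FOR_MIGRATION;
    }
}
```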

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 include/hw/pci/pci_regs.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
index 57e8c80..0dcaf7e 100644
--- a/include/hw/pci/pci_regs.h
+++ b/include/hw/pci/pci_regs.h
@@ -213,6 +213,7 @@
 #define  PCI_CAP_ID_MSIX   0x11/* MSI-X */
 #define  PCI_CAP_ID_SATA   0x12/* Serial ATA */
 #define  PCI_CAP_ID_AF 0x13/* PCI Advanced Features */
+#define  PCI_CAP_ID_MIGRATION   0x14 
 #define PCI_CAP_LIST_NEXT  1   /* Next capability in the list */
 #define PCI_CAP_FLAGS  2   /* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF 4
@@ -716,4 +717,22 @@
 #define PCI_ACS_CTRL   0x06/* ACS Control Register */
 #define PCI_ACS_EGRESS_CTL_V   0x08/* ACS Egress Control Vector */
 
+/* Migration */
+#define PCI_VF_MIGRATION_CAP           0x04
+#define PCI_VF_MIGRATION_VMM_STATUS    0x05
+#define PCI_VF_MIGRATION_VF_STATUS     0x06
+#define PCI_VF_MIGRATION_IRQ           0x07
+
+#define PCI_VF_MIGRATION_CAP_SIZE      0x08
+
+#define VMM_MIGRATION_END              0x00
+#define VMM_MIGRATION_START            0x01
+
+#define PCI_VF_WAIT_FOR_MIGRATION      0x00
+#define PCI_VF_READY_FOR_MIGRATION     0x01
+
+#define PCI_VF_MIGRATION_DISABLE       0x00
+#define PCI_VF_MIGRATION_ENABLE        0x01
+
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.9.3



Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-29 Thread Lan Tianyu
On 2015年10月26日 23:03, Alexander Duyck wrote:
> No.  I think you are missing the fact that there are 256 descriptors per
> page.  As such if you dirty just 1 you will be pulling in 255 more, of
> which you may or may not have pulled in the receive buffer for.
> 
> So for example if you have the descriptor ring size set to 256 then that
> means you are going to get whatever the descriptor ring has since you
> will be marking the entire ring dirty with every packet processed,
> however you cannot guarantee that you are going to get all of the
> receive buffers unless you go through and flush the entire ring prior to
> migrating.


Yes, that will be a problem. How about adding a tag to each Rx buffer and
checking the tag when delivering the Rx buffer to the stack? If the tag has
been overwritten, it means the packet data has already been migrated.


> 
> This is why I have said you will need to do something to force the rings
> to be flushed such as initiating a PM suspend prior to migrating.  You
> need to do something to stop the DMA and flush the remaining Rx buffers
> if you want to have any hope of being able to migrate the Rx in a
> consistent state.  Beyond that the only other thing you have to worry
> about are the Rx buffers that have already been handed off to the
> stack.  However those should be handled if you do a suspend and somehow
> flag pages as dirty when they are unmapped from the DMA.
> 
> - Alex

This will be simple and may be our first version to enable migration. But
we still hope to find a way to avoid disabling DMA before stopping the VCPU,
to decrease service downtime.

-- 
Best regards
Tianyu Lan


Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-29 Thread Lan Tianyu
On 2015年10月29日 14:58, Alexander Duyck wrote:
> 
> Your code was having to do a bunch of shuffling in order to get things
> set up so that you could bring the interface back up.  I would argue
> that it may actually be faster at least on the bring-up to just drop the
> old rings and start over since it greatly reduced the complexity and the
> amount of device related data that has to be moved.

If we give up the old ring after migration and keep DMA running before
stopping the VCPU, it seems we don't need to track the Tx/Rx descriptor
rings; we just need to make sure that all Rx buffers delivered to the stack
have been migrated.

1) Dummy-write the Rx buffer before checking the Rx descriptor, to ensure
the packet is migrated first.

2) Make a copy of the Rx descriptor and then use the copied data to check
the buffer status. Don't use the original descriptor, because it won't be
migrated, and migration may happen between two accesses of the Rx descriptor.

-- 
Best regards
Tianyu Lan


Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-29 Thread Lan Tianyu
On 2015年10月30日 00:17, Alexander Duyck wrote:
> On 10/29/2015 01:33 AM, Lan Tianyu wrote:
>> On 2015年10月29日 14:58, Alexander Duyck wrote:
>>> Your code was having to do a bunch of shuffling in order to get things
>>> set up so that you could bring the interface back up.  I would argue
>>> that it may actually be faster at least on the bring-up to just drop the
>>> old rings and start over since it greatly reduced the complexity and the
>>> amount of device related data that has to be moved.
>> If give up the old ring after migration and keep DMA running before
>> stopping VCPU, it seems we don't need to track Tx/Rx descriptor ring and
>> just make sure that all Rx buffers delivered to stack has been migrated.
>>
>> 1) Dummy write Rx buffer before checking Rx descriptor to ensure packet
>> migrated first.
> 
> Don't dummy write the Rx descriptor.  You should only really need to
> dummy write the Rx buffer and you would do so after checking the
> descriptor, not before.  Otherwise you risk corrupting the Rx buffer
> because it is possible for you to read the Rx buffer, DMA occurs, and
> then you write back the Rx buffer and now you have corrupted the memory.
> 
>> 2) Make a copy of Rx descriptor and then use the copied data to check
>> buffer status. Not use the original descriptor because it won't be
>> migrated and migration may happen between two access of the Rx
>> descriptor.
> 
> Do not just blindly copy the Rx descriptor ring.  That is a recipe for
> disaster.  The problem is DMA has to happen in a very specific order for
> things to function correctly.  The Rx buffer has to be written and then
> the Rx descriptor.  The problem is you will end up getting a read-ahead
> on the Rx descriptor ring regardless of which order you dirty things in.


Sorry, I didn't put it clearly.
I meant to copy one Rx descriptor when receiving the Rx irq and handling
the Rx ring.

The current code in ixgbevf_clean_rx_irq() checks the status of the Rx
descriptor to see whether its Rx buffer has been populated with data, and
then reads the packet length from the Rx descriptor to handle the Rx buffer.

My idea is to do the following three steps when receiving an Rx buffer in
ixgbevf_clean_rx_irq():

(1) dummy-write the Rx buffer first,
(2) make a copy of its Rx descriptor,
(3) check the buffer status and get the length from the copy.

Migration may happen at any time.

If it happens between (1) and (2): even if the Rx buffer has been populated
with data, the VF driver will not know that on the new machine, because the
Rx descriptor isn't migrated. But it's still safe.

If it happens between (2) and (3): the copy will be migrated to the new
machine, and the Rx buffer is migrated first. If there is data in the Rx
buffer, the VF driver can still handle the buffer without migrating the Rx
descriptor.

The subsequent buffers will be ignored, since we don't migrate the Rx
descriptors for them. Their status will show as not completed on the new
machine.
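
A minimal model of the three steps above (simplified structures, not the
actual ixgbevf code; the DD bit and field names are illustrative, and real
code would work on DMA-coherent memory):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified Rx descriptor: bit 0 of status models the DD (done) bit. */
struct rx_desc {
    uint32_t status;
    uint16_t length;
};
#define RX_DESC_DD 0x1

/* Returns the packet length, or -1 if the descriptor isn't done yet. */
static int clean_one_rx_desc(const struct rx_desc *desc, uint8_t *buf)
{
    struct rx_desc snap;

    buf[0] = buf[0];   /* (1) dummy write marks the buffer page dirty */
    snap = *desc;      /* (2) snapshot the descriptor */
    if (!(snap.status & RX_DESC_DD))
        return -1;     /* (3) trust only the copy when checking status */
    return snap.length;
}
```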

-- 
Best regards
Tianyu Lan


Re: [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver

2015-10-25 Thread Lan, Tianyu



On 10/22/2015 4:45 AM, Alexander Duyck wrote:

+/* Record states hold by PF */
+memcpy(&state->vf_data, &adapter->vfinfo[vfn], sizeof(struct
vf_data_storage));
+
+vf_shift = vfn % 32;
+reg_offset = vfn / 32;
+
+reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+reg &= ~(1 << vf_shift);
+IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+reg &= ~(1 << vf_shift);
+IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+reg &= ~(1 << vf_shift);
+IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+return sizeof(struct state_in_pf);
+}
+


This is a read.  Why does it need to switch off the VF?  Also why turn
of the anti-spoof, it doesn't make much sense.


This is to prevent packets targeted at the VM from being delivered to the
original VF after migration. E.g., after migration, the VM pings the PF of
the original machine, and the ping reply packet would be forwarded to the
original VF if it weren't disabled.

BTW, the read is done after the VM has been stopped on the source machine.





+static ssize_t ixgbe_store_state_in_pf(struct device *dev,
+   struct device_attribute *attr,
+   const char *buf, size_t count)
+{
+struct ixgbe_adapter *adapter = to_adapter(dev);
+struct pci_dev *pdev = adapter->pdev, *vdev;
+struct pci_dev *vf_pdev = to_pci_dev(dev);
+struct state_in_pf *state = (struct state_in_pf *)buf;
+int vfn = vf_pdev->virtfn_index;
+
+/* Check struct size */
+if (count != sizeof(struct state_in_pf)) {
+printk(KERN_ERR "State in PF size does not fit.\n");
+goto out;
+}
+
+/* Restore PCI configurations */
+vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+if (vdev) {
+pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND,
state->command);
+pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC,
state->msix_message_control);
+}
+
+/* Restore states hold by PF */
+memcpy(&adapter->vfinfo[vfn], &state->vf_data, sizeof(struct
vf_data_storage));
+
+  out:
+return count;
+}


Just doing a memcpy to move the vfinfo over adds no value.  The fact is
there are a number of filters that have to be configured in hardware
after, and it isn't as simple as just migrating the values stored.


Restoring the VF status in the PF is triggered by the VF driver via a new
mailbox msg, which calls ixgbe_restore_setting(). Here we just copy the data
into vfinfo. If we configured the hardware early, the state would be cleared
by the FLR that is triggered by the restore operation in the VF driver.



 As I
mentioned in the case of the 82598 there is also jumbo frames to take
into account.  If the first PF didn't have it enabled, but the second
one does that implies the state of the VF needs to change to account for
that.


Yes, that will be a problem; the VF driver also needs to know about this
change after migration and reconfigure jumbo frames.




I really think you would be better off only migrating the data related
to what can be configured using the ip link command and leaving other
values such as clear_to_send at the reset value of 0. Then you can at
least restore state from the VF after just a couple of quick messages.


This sounds good. I will try it later.




+static struct device_attribute ixgbe_per_state_in_pf_attribute =
+__ATTR(state_in_pf, S_IRUGO | S_IWUSR,
+ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
+
+void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
+{
+struct pci_dev *pdev = adapter->pdev;
+struct pci_dev *vfdev;
+unsigned short vf_id;
+int pos, ret;
+
+pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+if (!pos)
+return;
+
+/* get the device ID for the VF */
+pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
+
+vfdev = pci_get_device(pdev->vendor, vf_id, NULL);
+
+while (vfdev) {
+if (vfdev->is_virtfn) {
+ret = device_create_file(&vfdev->dev,
+&ixgbe_per_state_in_pf_attribute);
+if (ret)
+pr_warn("Unable to add VF attribute for dev %s,\n",
+dev_name(&vfdev->dev));
+}
+
+vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);
+}
+}


Driver specific sysfs is a no-go.  Otherwise we will end up with a
different implementation of this for every driver.  You will need to
find a way to make this generic in order to have a hope of getting this
to be acceptable.


Yes, Alex Williamson proposed getting/putting the data via the VFIO
interface, which will be more general. I will do more research on how to
communicate between the PF driver and the VFIO subsystem.


Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"

2015-10-25 Thread Lan, Tianyu



On 10/25/2015 2:03 PM, Alexander Duyck wrote:

On 10/24/2015 08:43 AM, Lan, Tianyu wrote:


On 10/22/2015 4:52 AM, Alexander Duyck wrote:

Also have you even considered the MSI-X configuration on the VF?  I
haven't seen anything anywhere that would have migrated the VF's MSI-X
configuration from BAR 3 on one system to the new system.


MSI-X migration is done by Hypervisor(Qemu).
Following link is my Qemu patch to do that.
http://marc.info/?l=kvm&m=144544706530484&w=2


I really don't like the idea of trying to migrate the MSI-X across from
host to host while it is still active.  I really think Qemu shouldn't be
moving this kind of data over in a migration.


Hi Alex:

VF MSI-X regs in the VM are faked by Qemu, and Qemu maps the VF's host
vectors to the guest's vectors. The MSI-X data migrated is for the faked
regs rather than the ones on the host. After migration, Qemu will remap
guest vectors to host vectors on the new machine. Moreover, the VM is
stopped while the MSI-X data is migrated.




I think that having the VF do a suspend/resume is the best way to go.
Then it simplifies things as all you have to deal with is the dirty page
tracking for the Rx DMA and you should be able to do this without making
things too difficult.



Yes, that will be simple; the main concern is service downtime. I will
test it later.



- Alex



Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-25 Thread Lan Tianyu
On 2015年10月24日 02:36, Alexander Duyck wrote:
> I was thinking about it and I am pretty sure the dummy write approach is
> problematic at best.  Specifically the issue is that while you are
> performing a dummy write you risk pulling in descriptors for data that
> hasn't been dummy written to yet.  So when you resume and restore your
> descriptors you will have once that may contain Rx descriptors
> indicating they contain data when after the migration they don't.

How about changing the sequence: dummy-write the Rx packet data first and
then its descriptor? This can ensure that Rx data is migrated before its
descriptor and prevents such a case.

-- 
Best regards
Tianyu Lan


Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"

2015-10-24 Thread Lan, Tianyu


On 10/22/2015 4:52 AM, Alexander Duyck wrote:

Also have you even considered the MSI-X configuration on the VF?  I
haven't seen anything anywhere that would have migrated the VF's MSI-X
configuration from BAR 3 on one system to the new system.


MSI-X migration is done by Hypervisor(Qemu).
Following link is my Qemu patch to do that.
http://marc.info/?l=kvm&m=144544706530484&w=2


Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package

2015-10-24 Thread Lan, Tianyu


On 10/22/2015 5:14 AM, Alexander Duyck wrote:

Where is i being initialized?  It was here but you removed it.  Are you
using i without initializing it?


Sorry, the initialization was put into patch 10 by mistake. "i" is
assigned from "tx_ring->next_to_clean".


Re: [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device

2015-10-24 Thread Lan, Tianyu



On 10/22/2015 2:07 AM, Alexander Duyck wrote:

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

Add "virtfn_index" member in the struct pci_device to record VF sequence
of PF. This will be used in the VF sysfs node handle.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
  drivers/pci/iov.c   | 1 +
  include/linux/pci.h | 1 +
  2 files changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..065b6bb 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id,
int reset)
  virtfn->physfn = pci_dev_get(dev);
  virtfn->is_virtfn = 1;
  virtfn->multifunction = 0;
+virtfn->virtfn_index = id;
  for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &virtfn->resource[i + PCI_IOV_RESOURCES];
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..85c5531 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -356,6 +356,7 @@ struct pci_dev {
  unsigned intio_window_1k:1;/* Intel P2P bridge 1K I/O
windows */
  unsigned intirq_managed:1;
  pci_dev_flags_t dev_flags;
+unsigned intvirtfn_index;
  atomic_tenable_cnt;/* pci_enable_device has been called */
  u32saved_config_space[16]; /* config space saved at
suspend time */



Can't you just calculate the VF index based on the VF BDF number
combined with the information in the PF BDF number and VF
offset/stride?  Seems kind of pointless to add a variable that is only
used by one driver and is in a slowpath when you can just calculate it
pretty quickly.


Good suggestion. Will try it.
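
For reference, a sketch of that calculation from the SR-IOV routing-ID
arithmetic (per the SR-IOV spec, VF RID = PF RID + First VF Offset +
VF Stride * index); the helper name is hypothetical and the routing IDs
are the usual 16-bit bus/devfn values:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: derive the zero-based VF index from the PF and VF
 * routing IDs plus the offset/stride read from the PF's SR-IOV capability.
 * Returns -1 if vf_rid doesn't belong to this PF's VF range. */
static int vf_index_from_rid(uint16_t pf_rid, uint16_t vf_rid,
                             uint16_t offset, uint16_t stride)
{
    uint16_t delta = (uint16_t)(vf_rid - pf_rid);

    if (stride == 0 || delta < offset || (delta - offset) % stride)
        return -1;
    return (delta - offset) / stride;
}
```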



- Alex



Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package

2015-10-24 Thread Lan, Tianyu



On 10/22/2015 8:58 PM, Michael S. Tsirkin wrote:

Do you really need to play the shifting games?
Can't you just reset everything and re-initialize the rings?
It's slower but way less intrusive.
Also removes the need to track writes into rings.


Shifting the ring is to avoid losing the packets in the ring.
This may cause some race conditions, so I introduced a lock to prevent
such cases in a later patch.
Yes, resetting everything after migration would make things easy.
But just like you said, it would affect performance and lose more
packets. I can run a test later to get data on these two approaches.



[RFC Patch 11/12] IXGBEVF: Migrate VF statistic data

2015-10-21 Thread Lan Tianyu
VF statistic regs are read-only and can't be migrated by writing them
back directly.

Currently, the statistic data returned to user space by the driver is not
equal to the value of the statistic regs. The VF driver records the value of
the statistic regs as base data when the net interface is brought up or
opened, calculates the increase in the regs during the last period of online
service, and adds it to the saved_reset data. When user space collects
statistic data, the VF driver returns the result of
"current - base + saved_reset". "Current" is the reg value at that point.

Restoring net function after migration is just like bringing the net
interface up or opening it. Call the existing functions to update the base
and saved_reset data, to keep the statistic data continuous across migration.
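
The accounting described above can be sketched as follows (field and
function names are illustrative, not the actual ixgbevf ones; unsigned
32-bit subtraction handles counter wraparound):

```c
#include <assert.h>
#include <stdint.h>

/* One counter's state in the VF driver. */
struct vf_stat {
    uint32_t base;        /* reg value sampled at up/open (or after restore) */
    uint64_t saved_reset; /* total accumulated before the last up/open */
};

/* At up/open time (and when restoring after migration): fold the count
 * gathered so far into saved_reset and re-sample the base. */
static void vf_stat_rebase(struct vf_stat *s, uint32_t current_reg)
{
    s->saved_reset += (uint32_t)(current_reg - s->base);
    s->base = current_reg;
}

/* What the driver reports to user space: current - base + saved_reset. */
static uint64_t vf_stat_read(const struct vf_stat *s, uint32_t current_reg)
{
    return s->saved_reset + (uint32_t)(current_reg - s->base);
}
```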

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 04b6ce7..d22160f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3005,6 +3005,7 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
return 0;
 
del_timer_sync(&adapter->service_timer);
+   ixgbevf_update_stats(adapter);
pr_info("migration start\n");
migration_status = MIGRATION_IN_PROGRESS; 
 
@@ -3017,6 +3018,8 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
return 1;
 
ixgbevf_restore_state(adapter);
+   ixgbevf_save_reset_stats(adapter);
+   ixgbevf_init_last_counter_stats(adapter);
migration_status = MIGRATION_COMPLETED;
pr_info("migration end\n");
return 0;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device

2015-10-21 Thread Lan Tianyu
Add a "virtfn_index" member in struct pci_dev to record the VF's sequence
number within the PF. This will be used in the VF sysfs node handler.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/pci/iov.c   | 1 +
 include/linux/pci.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..065b6bb 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int 
reset)
virtfn->physfn = pci_dev_get(dev);
virtfn->is_virtfn = 1;
virtfn->multifunction = 0;
+   virtfn->virtfn_index = id;
 
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &virtfn->resource[i + PCI_IOV_RESOURCES];
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..85c5531 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -356,6 +356,7 @@ struct pci_dev {
unsigned intio_window_1k:1; /* Intel P2P bridge 1K I/O windows */
unsigned intirq_managed:1;
pci_dev_flags_t dev_flags;
+   unsigned intvirtfn_index;
atomic_tenable_cnt; /* pci_enable_device has been called */
 
u32 saved_config_space[16]; /* config space saved at 
suspend time */
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Lan Tianyu
82.3 
ms
[983769928.256503] 64 bytes from 10.239.48.100: icmp_seq=4150 ttl=64 time=72.2 
ms
[983769928.256631] 64 bytes from 10.239.48.100: icmp_seq=4158 ttl=64 time=0.500 
ms
[983769928.257284] 64 bytes from 10.239.48.100: icmp_seq=4159 ttl=64 time=0.154 
ms
[983769928.258297] 64 bytes from 10.239.48.100: icmp_seq=4160 ttl=64 time=0.165 
ms

Todo
===
So far, the patchset isn't perfect. The VF net interface can't be opened,
closed, or brought down and up during migration. We will prevent such
operations during migration in future work.

Your comments are much appreciated.


Lan Tianyu (12):
  PCI: Add virtfn_index for struct pci_device
  IXGBE: Add new mail box event to restore VF status in the PF driver
  IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF
driver
  IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg.
  IXGBE: Add new sysfs interface of "notify_vf"
  IXGBEVF: Add self emulation layer
  IXGBEVF: Add new mail box event for migration
  IXGBEVF: Rework code of finding the end transmit desc of package
  IXGBEVF: Add live migration support for VF driver
  IXGBEVF: Add lock to protect tx/rx ring operation
  IXGBEVF: Migrate VF statistic data
  IXGBEVF: Track dma dirty pages

 drivers/net/ethernet/intel/ixgbe/ixgbe.h   |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h   |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 245 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |   4 +
 drivers/net/ethernet/intel/ixgbevf/Makefile|   3 +-
 drivers/net/ethernet/intel/ixgbevf/defines.h   |   6 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h   |  10 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 179 ++-
 drivers/net/ethernet/intel/ixgbevf/mbx.h   |   3 +
 .../net/ethernet/intel/ixgbevf/self-emulation.c| 133 +++
 drivers/net/ethernet/intel/ixgbevf/vf.c|  10 +
 drivers/net/ethernet/intel/ixgbevf/vf.h|   6 +-
 drivers/pci/iov.c  |   1 +
 include/linux/pci.h|   1 +
 15 files changed, 582 insertions(+), 22 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package

2015-10-21 Thread Lan Tianyu
When transmitting a packet, the end transmit desc of the packet
indicates whether the packet has been sent. The current code records
the end desc's pointer in the next_to_watch field of struct tx buffer.
This code breaks if the desc ring is shifted after migration, because
the pointer becomes invalid. This patch replaces the recorded pointer
with the desc count of the packet and finds the end desc from the
first desc and that desc count.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  1 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 ---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..c823616 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -54,6 +54,7 @@
  */
 struct ixgbevf_tx_buffer {
union ixgbe_adv_tx_desc *next_to_watch;
+   u16 desc_num;
unsigned long time_stamp;
struct sk_buff *skb;
unsigned int bytecount;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 4446916..056841c 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct 
ixgbevf_ring *tx_ring,
   DMA_TO_DEVICE);
}
tx_buffer->next_to_watch = NULL;
+   tx_buffer->desc_num = 0;
tx_buffer->skb = NULL;
dma_unmap_len_set(tx_buffer, len, 0);
/* tx_buffer must be completely set up in the transmit path */
@@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
union ixgbe_adv_tx_desc *tx_desc;
unsigned int total_bytes = 0, total_packets = 0;
unsigned int budget = tx_ring->count / 2;
-   unsigned int i = tx_ring->next_to_clean;
+   int i, watch_index;
 
if (test_bit(__IXGBEVF_DOWN, >state))
return true;
@@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
i -= tx_ring->count;
 
do {
-   union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
+   union ixgbe_adv_tx_desc *eop_desc;
+
+   if (!tx_buffer->desc_num)
+   break;
+
+   if (i + tx_buffer->desc_num >= 0)
+   watch_index = i + tx_buffer->desc_num;
+   else
+   watch_index = i + tx_ring->count + tx_buffer->desc_num;
 
-   /* if next_to_watch is not set then there is no work pending */
+   eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index);
if (!eop_desc)
break;
 
@@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
 
/* clear next_to_watch to prevent false hangs */
tx_buffer->next_to_watch = NULL;
+   tx_buffer->desc_num = 0;
 
/* update the statistics for this packet */
total_bytes += tx_buffer->bytecount;
@@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
u32 tx_flags = first->tx_flags;
__le32 cmd_type;
u16 i = tx_ring->next_to_use;
+   u16 start;
 
tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
 
@@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 
/* set next_to_watch value indicating a packet is present */
first->next_to_watch = tx_desc;
+   start = first - tx_ring->tx_buffer_info;
+   first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - 
start;
 
i++;
if (i == tx_ring->count)
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver

2015-10-21 Thread Lan Tianyu
To let the VF driver in the guest know the migration status, Qemu fakes
PCI config regs 0xF0 and 0xF1 to expose the migration status and to
get an ack from the VF driver.

When migration starts, Qemu sets reg "0xF0" to 1, notifies the VF driver
by triggering a mailbox msg and waits for the VF driver to signal that
it's ready for migration (by setting reg "0xF1" to 1). After migration,
Qemu sets reg "0xF0" to 0 and notifies the VF driver by a mailbox irq.
The VF driver then begins to restore tx/rx function after detecting the
status change.

When the VF receives the mailbox irq, it checks reg "0xF0" in the service
task function to get the migration status and performs the related
operations according to its value.

Steps of restarting the receive and transmit function:
1) Restore VF status in the PF driver by sending a mailbox event to the PF driver
2) Write back the reg values recorded by the self emulation layer
3) Restart the rx/tx rings
4) Recover interrupts

The transmit/receive descriptor head regs are read-only, so they can't
be restored by writing back the recorded reg values directly, and they
are reset to 0 during VF reset. To reuse the original tx/rx rings, the
desc rings are shifted so that the desc pointed to by the original head
reg moves to the first entry of the ring, and then the tx/rx rings are
enabled. The VF resumes receiving and transmitting from the original
head desc.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/defines.h   |   6 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h   |   7 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 115 -
 .../net/ethernet/intel/ixgbevf/self-emulation.c| 107 +++
 4 files changed, 232 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h 
b/drivers/net/ethernet/intel/ixgbevf/defines.h
index 770e21a..113efd2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/defines.h
+++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
@@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc {
__le32 mss_l4len_idx;
 };
 
+union ixgbevf_desc {
+   union ixgbe_adv_tx_desc rx_desc;
+   union ixgbe_adv_rx_desc tx_desc;
+   struct ixgbe_adv_tx_context_desc tx_context_desc;
+};
+
 /* Adv Transmit Descriptor Config Masks */
 #define IXGBE_ADVTXD_DTYP_MASK 0x00F0 /* DTYP mask */
 #define IXGBE_ADVTXD_DTYP_CTXT 0x0020 /* Advanced Context Desc */
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index c823616..6eab402e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -109,7 +109,7 @@ struct ixgbevf_ring {
struct ixgbevf_ring *next;
struct net_device *netdev;
struct device *dev;
-   void *desc; /* descriptor ring memory */
+   union ixgbevf_desc *desc;   /* descriptor ring memory */
dma_addr_t dma; /* phys. address of descriptor ring */
unsigned int size;  /* length in bytes */
u16 count;  /* amount of descriptors */
@@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector 
*q_vector);
 
 void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
 void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter);
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head);
+int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head);
+void ixgbevf_restore_state(struct ixgbevf_adapter *adapter);
+inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter);
+
 
 #ifdef DEBUG
 char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw);
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 056841c..15ec361 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function 
Network Driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_VERSION);
 
+
+#define MIGRATION_COMPLETED   0x00
+#define MIGRATION_IN_PROGRESS 0x01
+
 #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
 static int debug = -1;
 module_param(debug, int, 0);
@@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring 
*ring)
return ring->stats.packets;
 }
 
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
+{
+   struct ixgbevf_tx_buffer *tx_buffer = NULL;
+   static union ixgbevf_desc *tx_desc = NULL;
+
+   tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
+   if (!tx_buffer)
+   return -ENOMEM;
+
+   tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count);
+   if (!tx_desc)
+   return -ENOMEM;
+
+   memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
+   memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count 
- hea

[RFC Patch 04/12] IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg.

2015-10-21 Thread Lan Tianyu
This patch adds ixgbe_ping_vf() to notify a specified VF. When the
migration status changes, it's necessary to notify the VF of the change.
The VF driver will check the migration status when it gets a mailbox msg.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 19 ---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h |  1 +
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 89671eb..e247d67 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -1318,18 +1318,23 @@ void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter)
IXGBE_WRITE_REG(hw, IXGBE_VFRE(1), 0);
 }
 
-void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
+void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn)
 {
struct ixgbe_hw *hw = >hw;
u32 ping;
+
+   ping = IXGBE_PF_CONTROL_MSG;
+   if (adapter->vfinfo[vfn].clear_to_send)
+   ping |= IXGBE_VT_MSGTYPE_CTS;
+   ixgbe_write_mbx(hw, , 1, vfn);
+}
+
+void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
+{
int i;
 
-   for (i = 0 ; i < adapter->num_vfs; i++) {
-   ping = IXGBE_PF_CONTROL_MSG;
-   if (adapter->vfinfo[i].clear_to_send)
-   ping |= IXGBE_VT_MSGTYPE_CTS;
-   ixgbe_write_mbx(hw, , 1, i);
-   }
+   for (i = 0 ; i < adapter->num_vfs; i++)
+   ixgbe_ping_vf(adapter, i);
 }
 
 int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
index 2c197e6..143e2fd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
@@ -41,6 +41,7 @@ void ixgbe_msg_task(struct ixgbe_adapter *adapter);
 int ixgbe_vf_configuration(struct pci_dev *pdev, unsigned int event_mask);
 void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter);
 void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter);
+void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn);
 int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int queue, u8 *mac);
 int ixgbe_ndo_set_vf_vlan(struct net_device *netdev, int queue, u16 vlan,
   u8 qos);
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 07/12] IXGBEVF: Add new mail box event for migration

2015-10-21 Thread Lan Tianyu
The VF's status in the PF driver needs to be restored after migration
resets the VF hardware. This patch adds a new event by which the VF
driver notifies the PF driver to restore that status.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/mbx.h |  3 +++
 drivers/net/ethernet/intel/ixgbevf/vf.c  | 10 ++
 drivers/net/ethernet/intel/ixgbevf/vf.h  |  1 +
 3 files changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/mbx.h 
b/drivers/net/ethernet/intel/ixgbevf/mbx.h
index 82f44e0..22761d8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/mbx.h
+++ b/drivers/net/ethernet/intel/ixgbevf/mbx.h
@@ -112,6 +112,9 @@ enum ixgbe_pfvf_api_rev {
 #define IXGBE_VF_GET_RETA  0x0a/* VF request for RETA */
 #define IXGBE_VF_GET_RSS_KEY   0x0b/* get RSS hash key */
 
+/* mail box event for live migration  */
+#define IXGBE_VF_NOTIFY_RESUME  0x0c /* VF notify PF migration to restore 
status */
+
 /* length of permanent address message returned from PF */
 #define IXGBE_VF_PERMADDR_MSG_LEN  4
 /* word in permanent address message with the current multicast type */
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c 
b/drivers/net/ethernet/intel/ixgbevf/vf.c
index d1339b0..1e4e5e6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.c
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.c
@@ -717,6 +717,15 @@ int ixgbevf_get_queues(struct ixgbe_hw *hw, unsigned int 
*num_tcs,
return err;
 }
 
+static void ixgbevf_notify_resume_vf(struct ixgbe_hw *hw)
+{
+   struct ixgbe_mbx_info *mbx = >mbx;
+   u32 msgbuf[1];
+
+   msgbuf[0] = IXGBE_VF_NOTIFY_RESUME;
+   mbx->ops.write_posted(hw, msgbuf, 1);
+}
+
 static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
.init_hw= ixgbevf_init_hw_vf,
.reset_hw   = ixgbevf_reset_hw_vf,
@@ -729,6 +738,7 @@ static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
.update_mc_addr_list= ixgbevf_update_mc_addr_list_vf,
.set_uc_addr= ixgbevf_set_uc_addr_vf,
.set_vfta   = ixgbevf_set_vfta_vf,
+   .notify_resume  = ixgbevf_notify_resume_vf,
 };
 
 const struct ixgbevf_info ixgbevf_82599_vf_info = {
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h 
b/drivers/net/ethernet/intel/ixgbevf/vf.h
index 6a3f4eb..a25fe81 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -70,6 +70,7 @@ struct ixgbe_mac_operations {
s32 (*disable_mc)(struct ixgbe_hw *);
s32 (*clear_vfta)(struct ixgbe_hw *);
s32 (*set_vfta)(struct ixgbe_hw *, u32, u32, bool);
+   void (*notify_resume)(struct ixgbe_hw *); 
 };
 
 enum ixgbe_mac_type {
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC PATCH 1/3] Qemu: Add pci-assign.h to share functions and struct definition with new file

2015-10-21 Thread Lan Tianyu
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/i386/kvm/pci-assign.c | 111 ++-
 hw/i386/kvm/pci-assign.h | 109 ++
 2 files changed, 112 insertions(+), 108 deletions(-)
 create mode 100644 hw/i386/kvm/pci-assign.h

diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
index 74d22f4..616532d 100644
--- a/hw/i386/kvm/pci-assign.c
+++ b/hw/i386/kvm/pci-assign.c
@@ -37,112 +37,7 @@
 #include "hw/pci/pci.h"
 #include "hw/pci/msi.h"
 #include "kvm_i386.h"
-
-#define MSIX_PAGE_SIZE 0x1000
-
-/* From linux/ioport.h */
-#define IORESOURCE_IO   0x0100  /* Resource type */
-#define IORESOURCE_MEM  0x0200
-#define IORESOURCE_IRQ  0x0400
-#define IORESOURCE_DMA  0x0800
-#define IORESOURCE_PREFETCH 0x2000  /* No side effects */
-#define IORESOURCE_MEM_64   0x0010
-
-//#define DEVICE_ASSIGNMENT_DEBUG
-
-#ifdef DEVICE_ASSIGNMENT_DEBUG
-#define DEBUG(fmt, ...)   \
-do {  \
-fprintf(stderr, "%s: " fmt, __func__ , __VA_ARGS__);  \
-} while (0)
-#else
-#define DEBUG(fmt, ...)
-#endif
-
-typedef struct PCIRegion {
-int type;   /* Memory or port I/O */
-int valid;
-uint64_t base_addr;
-uint64_t size;/* size of the region */
-int resource_fd;
-} PCIRegion;
-
-typedef struct PCIDevRegions {
-uint8_t bus, dev, func; /* Bus inside domain, device and function */
-int irq;/* IRQ number */
-uint16_t region_number; /* number of active regions */
-
-/* Port I/O or MMIO Regions */
-PCIRegion regions[PCI_NUM_REGIONS - 1];
-int config_fd;
-} PCIDevRegions;
-
-typedef struct AssignedDevRegion {
-MemoryRegion container;
-MemoryRegion real_iomem;
-union {
-uint8_t *r_virtbase; /* mmapped access address for memory regions */
-uint32_t r_baseport; /* the base guest port for I/O regions */
-} u;
-pcibus_t e_size;/* emulated size of region in bytes */
-pcibus_t r_size;/* real size of region in bytes */
-PCIRegion *region;
-} AssignedDevRegion;
-
-#define ASSIGNED_DEVICE_PREFER_MSI_BIT  0
-#define ASSIGNED_DEVICE_SHARE_INTX_BIT  1
-
-#define ASSIGNED_DEVICE_PREFER_MSI_MASK (1 << ASSIGNED_DEVICE_PREFER_MSI_BIT)
-#define ASSIGNED_DEVICE_SHARE_INTX_MASK (1 << ASSIGNED_DEVICE_SHARE_INTX_BIT)
-
-typedef struct MSIXTableEntry {
-uint32_t addr_lo;
-uint32_t addr_hi;
-uint32_t data;
-uint32_t ctrl;
-} MSIXTableEntry;
-
-typedef enum AssignedIRQType {
-ASSIGNED_IRQ_NONE = 0,
-ASSIGNED_IRQ_INTX_HOST_INTX,
-ASSIGNED_IRQ_INTX_HOST_MSI,
-ASSIGNED_IRQ_MSI,
-ASSIGNED_IRQ_MSIX
-} AssignedIRQType;
-
-typedef struct AssignedDevice {
-PCIDevice dev;
-PCIHostDeviceAddress host;
-uint32_t dev_id;
-uint32_t features;
-int intpin;
-AssignedDevRegion v_addrs[PCI_NUM_REGIONS - 1];
-PCIDevRegions real_device;
-PCIINTxRoute intx_route;
-AssignedIRQType assigned_irq_type;
-struct {
-#define ASSIGNED_DEVICE_CAP_MSI (1 << 0)
-#define ASSIGNED_DEVICE_CAP_MSIX (1 << 1)
-uint32_t available;
-#define ASSIGNED_DEVICE_MSI_ENABLED (1 << 0)
-#define ASSIGNED_DEVICE_MSIX_ENABLED (1 << 1)
-#define ASSIGNED_DEVICE_MSIX_MASKED (1 << 2)
-uint32_t state;
-} cap;
-uint8_t emulate_config_read[PCI_CONFIG_SPACE_SIZE];
-uint8_t emulate_config_write[PCI_CONFIG_SPACE_SIZE];
-int msi_virq_nr;
-int *msi_virq;
-MSIXTableEntry *msix_table;
-hwaddr msix_table_addr;
-uint16_t msix_max;
-MemoryRegion mmio;
-char *configfd_name;
-int32_t bootindex;
-} AssignedDevice;
-
-#define TYPE_PCI_ASSIGN "kvm-pci-assign"
-#define PCI_ASSIGN(obj) OBJECT_CHECK(AssignedDevice, (obj), TYPE_PCI_ASSIGN)
+#include "pci-assign.h"
 
 static void assigned_dev_update_irq_routing(PCIDevice *dev);
 
@@ -1044,7 +939,7 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
  * sure the physical MSI-X state tracks the guest's view, which is important
  * for some VF/PF and PF/fw communication channels.
  */
-static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
+bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
 {
 return !entry->data;
 }
@@ -1114,7 +1009,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice 
*pci_dev)
 return r;
 }
 
-static void assigned_dev_update_msix(PCIDevice *pci_dev)
+void assigned_dev_update_msix(PCIDevice *pci_dev)
 {
 AssignedDevice *assigned_dev = PCI_ASSIGN(pci_dev);
 uint16_t ctrl_word = pci_get_word(pci_dev->config + pci_dev->msix_cap +
diff --git a/hw/i386/kvm/pci-assign.h b/hw/i386/kvm/pci-assign.h
new file mode 100644
index 000..91d00ea
--- /dev/null
+++ b/hw/i386/kvm/pci-assign.h
@@ -0,0 +1,109 @@
+#define MSIX_PAGE_SIZE 0x1

[RFC PATCH 0/3] Qemu/IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Lan Tianyu
This patchset is Qemu part for live migration support for SRIOV NIC.
kernel part patch information is in the following link.
http://marc.info/?l=kvm=144544635330193=2


Lan Tianyu (3):
  Qemu: Add pci-assign.h to share functions and struct definition with
new file
  Qemu: Add post_load_state() to run after restoring CPU state
  Qemu: Introduce pci-sriov device type to support VF live migration

 hw/i386/kvm/Makefile.objs   |   2 +-
 hw/i386/kvm/pci-assign.c| 113 +--
 hw/i386/kvm/pci-assign.h| 109 +++
 hw/i386/kvm/sriov.c | 213 
 include/migration/vmstate.h |   2 +
 migration/savevm.c  |  15 
 6 files changed, 344 insertions(+), 110 deletions(-)
 create mode 100644 hw/i386/kvm/pci-assign.h
 create mode 100644 hw/i386/kvm/sriov.c

-- 
1.9.3



[RFC Patch 06/12] IXGBEVF: Add self emulation layer

2015-10-21 Thread Lan Tianyu
In order to restore VF function after migration, add a self emulation
layer that records register values as the registers are accessed.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/Makefile|  3 ++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
 .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++
 drivers/net/ethernet/intel/ixgbevf/vf.h|  5 -
 4 files changed, 33 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile 
b/drivers/net/ethernet/intel/ixgbevf/Makefile
index 4ce4c97..841c884 100644
--- a/drivers/net/ethernet/intel/ixgbevf/Makefile
+++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
@@ -31,7 +31,8 @@
 
 obj-$(CONFIG_IXGBEVF) += ixgbevf.o
 
-ixgbevf-objs := vf.o \
+ixgbevf-objs := self-emulation.o \
+   vf.o \
 mbx.o \
 ethtool.o \
 ixgbevf_main.o
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..4446916 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
 
if (IXGBE_REMOVED(reg_addr))
return IXGBE_FAILED_READ_REG;
-   value = readl(reg_addr + reg);
+   value = ixgbe_self_emul_readl(reg_addr, reg);
if (unlikely(value == IXGBE_FAILED_READ_REG))
ixgbevf_check_remove(hw, reg);
return value;
diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c 
b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
new file mode 100644
index 000..d74b2da
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
@@ -0,0 +1,26 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vf.h"
+#include "ixgbevf.h"
+
+static u32 hw_regs[0x4000];
+
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
+{
+   u32 tmp;
+
+   tmp = readl(base + addr);
+   hw_regs[(unsigned long)addr] = tmp;
+
+   return tmp;
+}
+
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
+{
+   hw_regs[(unsigned long)addr] = val;
+   writel(val, (volatile void __iomem *)(base + addr));
+}
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h 
b/drivers/net/ethernet/intel/ixgbevf/vf.h
index d40f036..6a3f4eb 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -39,6 +39,9 @@
 
 struct ixgbe_hw;
 
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr);
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr);
+
 /* iterator type for walking multicast address lists */
 typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr,
  u32 *vmdq);
@@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 
reg, u32 value)
 
if (IXGBE_REMOVED(reg_addr))
return;
-   writel(value, reg_addr + reg);
+   ixgbe_self_emul_writel(value, reg_addr, reg);
 }
 
 #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v)
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC PATCH 2/3] Qemu: Add post_load_state() to run after restoring CPU state

2015-10-21 Thread Lan Tianyu
After migration, Qemu needs to trigger a mailbox irq to notify the VF
driver in the guest about the status change. Irq delivery only works
again after the CPU state has been restored. This patch adds a new
callback that runs after restoring CPU state and so provides a way to
trigger the mailbox irq at that point.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 include/migration/vmstate.h |  2 ++
 migration/savevm.c  | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 0695d7c..dc681a6 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -56,6 +56,8 @@ typedef struct SaveVMHandlers {
 int (*save_live_setup)(QEMUFile *f, void *opaque);
 uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t 
max_size);
 
+/* This runs after restoring CPU related state */
+void (*post_load_state)(void *opaque);
 LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 9e0e286..48b6223 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -702,6 +702,20 @@ bool qemu_savevm_state_blocked(Error **errp)
 return false;
 }
 
+void qemu_savevm_post_load(void)
+{
+SaveStateEntry *se;
+
+QTAILQ_FOREACH(se, _state.handlers, entry) {
+if (!se->ops || !se->ops->post_load_state) {
+continue;
+}
+
+se->ops->post_load_state(se->opaque);
+}
+}
+
+
 void qemu_savevm_state_header(QEMUFile *f)
 {
 trace_savevm_state_header();
@@ -1140,6 +1154,7 @@ int qemu_loadvm_state(QEMUFile *f)
 }
 
 cpu_synchronize_all_post_init();
+qemu_savevm_post_load();
 
 ret = 0;
 
-- 
1.9.3



[RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation

2015-10-21 Thread Lan Tianyu
Ring shifting while restoring VF function may race with normal ring
operation (transmitting/receiving packets). This patch adds tx/rx
locks to protect ring-related data.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  2 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 ---
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 6eab402e..3a748c8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -448,6 +448,8 @@ struct ixgbevf_adapter {
 
spinlock_t mbx_lock;
unsigned long last_reset;
+   spinlock_t mg_rx_lock;
+   spinlock_t mg_tx_lock;
 };
 
 enum ixbgevf_state_t {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 15ec361..04b6ce7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring 
*ring)
 
 int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
 {
+   struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
struct ixgbevf_tx_buffer *tx_buffer = NULL;
static union ixgbevf_desc *tx_desc = NULL;
+   unsigned long flags;
 
tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
if (!tx_buffer)
@@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
if (!tx_desc)
return -ENOMEM;
 
+   spin_lock_irqsave(>mg_tx_lock, flags);
memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count 
- head));
memcpy(>desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * 
head);
@@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
else
r->next_to_use += (r->count - head);
 
+   spin_unlock_irqrestore(>mg_tx_lock, flags);
+
vfree(tx_buffer);
vfree(tx_desc);
return 0;
@@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
 
 int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
 {
+   struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
struct ixgbevf_rx_buffer *rx_buffer = NULL;
static union ixgbevf_desc *rx_desc = NULL;
+   unsigned long flags;
 
rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
if (!rx_buffer)
@@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
if (!rx_desc)
return -ENOMEM;
 
+   spin_lock_irqsave(>mg_rx_lock, flags);
memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count 
- head));
memcpy(>desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * 
head);
@@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
r->next_to_use -= head;
else
r->next_to_use += (r->count - head);
+   spin_unlock_irqrestore(>mg_rx_lock, flags);
 
vfree(rx_buffer);
vfree(rx_desc);
@@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
if (test_bit(__IXGBEVF_DOWN, >state))
return true;
 
+   spin_lock(>mg_tx_lock);
+   i = tx_ring->next_to_clean;
tx_buffer = _ring->tx_buffer_info[i];
tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
i -= tx_ring->count;
@@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
q_vector->tx.total_bytes += total_bytes;
q_vector->tx.total_packets += total_packets;
 
+   spin_unlock(>mg_tx_lock);
+
if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) {
struct ixgbe_hw *hw = >hw;
union ixgbe_adv_tx_desc *eop_desc;
@@ -999,10 +1012,12 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
struct ixgbevf_ring *rx_ring,
int budget)
 {
+   struct ixgbevf_adapter *adapter = netdev_priv(rx_ring->netdev);
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
u16 cleaned_count = ixgbevf_desc_unused(rx_ring);
struct sk_buff *skb = rx_ring->skb;
 
+   spin_lock(>mg_rx_lock);
while (likely(total_rx_packets < budget)) {
union ixgbe_adv_rx_desc *rx_desc;
 
@@ -1078,6 +1093,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
q_vec

[RFC Patch 12/12] IXGBEVF: Track dma dirty pages

2015-10-21 Thread Lan Tianyu
Migration relies on dirty page tracking to migrate memory.
Hardware can't automatically mark a page as dirty after a DMA
memory access. The VF descriptor rings and data buffers are modified
by hardware when receiving and transmitting data. To track such dirty
memory manually, do dummy writes (read a byte and write it back) during
receive and transmit.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index d22160f..ce7bd7a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -414,6 +414,9 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
break;
 
+   /* write back status to mark page dirty */
+   eop_desc->wb.status = eop_desc->wb.status;
+
/* clear next_to_watch to prevent false hangs */
tx_buffer->next_to_watch = NULL;
tx_buffer->desc_num = 0;
@@ -946,15 +949,17 @@ static struct sk_buff *ixgbevf_fetch_rx_buffer(struct 
ixgbevf_ring *rx_ring,
 {
struct ixgbevf_rx_buffer *rx_buffer;
struct page *page;
+   u8 *page_addr;
 
rx_buffer = _ring->rx_buffer_info[rx_ring->next_to_clean];
page = rx_buffer->page;
prefetchw(page);
 
-   if (likely(!skb)) {
-   void *page_addr = page_address(page) +
- rx_buffer->page_offset;
+   /* Mark page dirty */
+   page_addr = page_address(page) + rx_buffer->page_offset;
+   *page_addr = *page_addr;
 
+   if (likely(!skb)) {
/* prefetch first cache line of first page */
prefetch(page_addr);
 #if L1_CACHE_BYTES < 128
@@ -1032,6 +1037,9 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
if (!ixgbevf_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
break;
 
+   /* Write back status to mark page dirty */
+   rx_desc->wb.upper.status_error = rx_desc->wb.upper.status_error;
+
/* This memory barrier is needed to keep us from reading
 * any other fields out of the rx_desc until we know the
 * RXD_STAT_DD bit is set
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver

2015-10-21 Thread Lan Tianyu
This patch restores VF status in the PF driver upon receiving the event
from the VF.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h   |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++
 3 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 636f9e3..9d5669a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -148,6 +148,7 @@ struct vf_data_storage {
bool pf_set_mac;
u16 pf_vlan; /* When set, guest VLAN config not allowed. */
u16 pf_qos;
+   u32 vf_lpe;
u16 tx_rate;
u16 vlan_count;
u8 spoofchk_enabled;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
index b1e4703..8fdb38d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
@@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev {
 
 /* mailbox API, version 1.1 VF requests */
 #define IXGBE_VF_GET_QUEUES	0x09 /* get queue configuration */
+#define IXGBE_VF_NOTIFY_RESUME	0x0c /* VF notify PF migration finishing */
 
 /* GET_QUEUES return data indices within the mailbox */
 #define IXGBE_VF_TX_QUEUES 1   /* number of Tx queues supported */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 1d17b58..ab2a2e2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter *adapter, u32 vf,
}
 }
 
+/**
+ *  Restore the settings by mailbox, after migration
+ **/
+void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf)
+{
+   struct ixgbe_hw *hw = &adapter->hw;
+   u32 reg, reg_offset, vf_shift;
+   int rar_entry = hw->mac.num_rar_entries - (vf + 1);
+
+   vf_shift = vf % 32;
+   reg_offset = vf / 32;
+
+   /* enable transmit and receive for vf */
+   reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+   ixgbe_vf_reset_event(adapter, vf);
+
+   hw->mac.ops.set_rar(hw, rar_entry,
+   adapter->vfinfo[vf].vf_mac_addresses,
+   vf, IXGBE_RAH_AV);
+
+
+   if (adapter->vfinfo[vf].vf_lpe)
+   ixgbe_set_vf_lpe(adapter, &adapter->vfinfo[vf].vf_lpe, vf);
+}
+
 static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf)
 {
	struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
@@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf)
break;
case IXGBE_VF_SET_LPE:
retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf);
+   adapter->vfinfo[vf].vf_lpe = *msgbuf;
break;
case IXGBE_VF_SET_MACVLAN:
retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf);
@@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf)
case IXGBE_VF_GET_RSS_KEY:
retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf);
break;
+   case IXGBE_VF_NOTIFY_RESUME:
+   ixgbe_restore_setting(adapter, vf);
+   break;
default:
e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]);
retval = IXGBE_ERR_MBX;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver

2015-10-21 Thread Lan Tianyu
This patch adds a sysfs interface, state_in_pf, under the sysfs directory
of the VF PCI device, which Qemu uses to get and put the VF status held
in the PF driver during migration.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 -
 1 file changed, 155 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index ab2a2e2..89671eb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
return -ENOMEM;
 }
 
+#define IXGBE_PCI_VFCOMMAND   0x4
+#define IXGBE_PCI_VFMSIXMC0x72
+#define IXGBE_SRIOV_VF_OFFSET 0x180
+#define IXGBE_SRIOV_VF_STRIDE 0x2
+
+#define to_adapter(dev) ((struct ixgbe_adapter *)(pci_get_drvdata(to_pci_dev(dev)->physfn)))
+
+struct state_in_pf {
+   u16 command;
+   u16 msix_message_control;
+   struct vf_data_storage vf_data;
+};
+
+static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn)
+{
+   u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * vfn;
+   return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff);
+}
+
+static ssize_t ixgbe_show_state_in_pf(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct pci_dev *pdev = adapter->pdev, *vdev;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   struct ixgbe_hw *hw = &adapter->hw;
+   struct state_in_pf *state = (struct state_in_pf *)buf;
+   int vfn = vf_pdev->virtfn_index;
+   u32 reg, reg_offset, vf_shift;
+
+   /* Clear VF mac and disable VF */
+   ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, vfn);
+
+   /* Record PCI configurations */
+   vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+   if (vdev) {
+   pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, &state->command);
+   pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, &state->msix_message_control);
+   }
+   else
+   printk(KERN_WARNING "Unable to find VF device.\n");
+
+   /* Record states hold by PF */
+   memcpy(&state->vf_data, &adapter->vfinfo[vfn], sizeof(struct vf_data_storage));
+
+   vf_shift = vfn % 32;
+   reg_offset = vfn / 32;
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+   return sizeof(struct state_in_pf);
+}
+
+static ssize_t ixgbe_store_state_in_pf(struct device *dev,
+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct pci_dev *pdev = adapter->pdev, *vdev;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   struct state_in_pf *state = (struct state_in_pf *)buf;
+   int vfn = vf_pdev->virtfn_index;
+
+   /* Check struct size */
+   if (count != sizeof(struct state_in_pf)) {
+   printk(KERN_ERR "State in PF size does not fit.\n");
+   goto out;
+   }
+
+   /* Restore PCI configurations */
+   vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+   if (vdev) {
+   pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, state->command);
+   pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, state->msix_message_control);
+   }
+
+   /* Restore states hold by PF */
+   memcpy(&adapter->vfinfo[vfn], &state->vf_data, sizeof(struct vf_data_storage));
+
+  out:
+   return count;
+}
+
+static struct device_attribute ixgbe_per_state_in_pf_attribute =
+   __ATTR(state_in_pf, S_IRUGO | S_IWUSR,
+   ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
+
+void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
+{
+   struct pci_dev *pdev = adapter->pdev;
+   struct pci_dev *vfdev;
+   unsigned short vf_id;
+   int pos, ret;
+
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+   if (!pos)
+   return;
+
+   /* get the device ID for the VF */
+   pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
+
+   vfdev = pci_get_device(pdev->vendor, vf_id, NULL);
+
+   while (vfdev) {
+   if (vfdev->is_virtfn) {
+   ret = device_create_file(&vfdev->dev,
+

[PATCH] KVM: Avoid warning user requested TSC rate below hardware speed when create VM.

2015-06-16 Thread Lan Tianyu
KVM populates max_tsc_khz with tsc_khz at the arch init stage on
constant-TSC machines and creates VMs with max_tsc_khz as the TSC rate.
However, tsc_khz may change while the tsc clocksource driver refines its
calibration. This causes VMs to be created with a slow TSC and produces
the following warning. To fix the issue, compare max_tsc_khz with tsc_khz
and update max_tsc_khz with the new value of tsc_khz if it has changed
when a VM is created.

[   94.916906] [ cut here ]
[   94.922127] WARNING: CPU: 0 PID: 824 at arch/x86/kvm/vmx.c:2272 
vmx_set_tsc_khz+0x3e/0x50()
[   94.931503] user requested TSC rate below hardware speed
[   94.937470] Modules linked in:
[   94.940923] CPU: 0 PID: 824 Comm: qemu-system-x86 Tainted: G  D W   
4.1.0-rc3+ #4
[   94.960350]  81f453f8 88027e9f3bc8 81b5eb8a 

[   94.968721]  88027e9f3c18 88027e9f3c08 810e6f8a 
8802
[   94.977103]  001d3300 88027e98 0001 
88027e98
[   94.985476] Call Trace:
[   94.988242]  [81b5eb8a] dump_stack+0x45/0x57
[   94.994020]  [810e6f8a] warn_slowpath_common+0x8a/0xc0
[   95.000772]  [810e7006] warn_slowpath_fmt+0x46/0x50
[   95.007222]  [8104676e] vmx_set_tsc_khz+0x3e/0x50
[   95.013488]  [810112f7] kvm_set_tsc_khz.part.106+0xa7/0xe0
[   95.020618]  [8101e628] kvm_arch_vcpu_init+0x208/0x230
[   95.027373]  [81003bf9] kvm_vcpu_init+0xc9/0x110
[   95.033540]  [81049fd0] vmx_create_vcpu+0x70/0xc30
[   95.039911]  [81049f80] ? vmx_create_vcpu+0x20/0xc30
[   95.046476]  [8101dc9e] kvm_arch_vcpu_create+0x3e/0x60
[   95.053233]  [81009f00] kvm_vm_ioctl+0x1a0/0x770
[   95.059400]  [8129f395] ? __fget+0x5/0x200
[   95.064991]  [8115b85f] ? rcu_irq_exit+0x7f/0xd0
[   95.071157]  [81293448] do_vfs_ioctl+0x308/0x540
[   95.077323]  [8129f301] ? expand_files+0x1f1/0x280
[   95.083684]  [8147836b] ? selinux_file_ioctl+0x5b/0x100
[   95.090538]  [81293701] SyS_ioctl+0x81/0xa0
[   95.096218]  [81b6a72e] system_call_fastpath+0x12/0x76
[   95.102974] ---[ end trace 08ade884081d9dd7 ]---

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99861
Signed-off-by: Lan Tianyu tianyu@intel.com
---
 arch/x86/kvm/x86.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43f0df7..6c7fefe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7814,6 +7814,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
struct page *page;
struct kvm *kvm;
+   int cpu;
int r;
 
	BUG_ON(vcpu->kvm == NULL);
@@ -7833,6 +7834,21 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
}
	vcpu->arch.pio_data = page_address(page);
 
+   /*
+* max_tsc_khz records tsc_khz at the arch init stage on constant-TSC
+* machines. However, tsc_khz may change while the tsc clocksource
+* driver refines its calibration, which would create VMs with a slow
+* TSC and produce a warning. To avoid that, check whether tsc_khz
+* has changed here and update max_tsc_khz with the new value of
+* tsc_khz if it has.
+*/
+   if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && max_tsc_khz != tsc_khz) {
+   max_tsc_khz = tsc_khz;
+   pr_debug("kvm: max_tsc_khz is changed to %ld\n", max_tsc_khz);
+   for_each_online_cpu(cpu)
+   smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
+   }
+
kvm_set_tsc_khz(vcpu, max_tsc_khz);
 
r = kvm_mmu_create(vcpu);
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[PATCH 1/2] kvm tools: Use kernel error check functions

2011-12-15 Thread Lan Tianyu
Add a compiler.h file to support using kernel error-check functions
(e.g. ERR_PTR, PTR_ERR and so on).

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/include/linux/compiler.h |   16 
 1 files changed, 16 insertions(+), 0 deletions(-)
 create mode 100644 tools/kvm/include/linux/compiler.h

diff --git a/tools/kvm/include/linux/compiler.h 
b/tools/kvm/include/linux/compiler.h
new file mode 100644
index 000..bd360c2
--- /dev/null
+++ b/tools/kvm/include/linux/compiler.h
@@ -0,0 +1,16 @@
+#ifndef _PERF_LINUX_COMPILER_H_
+#define _PERF_LINUX_COMPILER_H_
+
+#ifndef __always_inline
+#define __always_inline	inline
+#endif
+#define __user
+#define __attribute_const__
+
+#define __used __attribute__((__unused__))
+#define __iomem
+#define __force
+#define __must_check
+#define unlikely
+
+#endif
-- 
1.7.6.rc2.8.g28eb



[PATCH 2/2] kvm tools, qcow: Add support for growing refcount blocks

2011-12-15 Thread Lan Tianyu
This patch enables allocating new refcount blocks, so that kvm tools
can grow a qcow2 image much larger.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c |  111 +---
 1 files changed, 94 insertions(+), 17 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index e139fa5..b2dae66 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -12,14 +12,20 @@
 #include <string.h>
 #include <unistd.h>
 #include <fcntl.h>
+#include <errno.h>
 #ifdef CONFIG_HAS_ZLIB
 #include <zlib.h>
 #endif
 
+#include <linux/err.h>
 #include <linux/byteorder.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append);
+static int qcow_write_refcount_table(struct qcow *q);
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref);
+static void  qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size);
 
 static inline int qcow_pwrite_sync(int fd,
void *buf, size_t count, off_t offset)
@@ -657,6 +663,58 @@ static struct qcow_refcount_block *refcount_block_search(struct qcow *q, u64 off
return rfb;
 }
 
+static struct qcow_refcount_block *qcow_grow_refcount_block(struct qcow *q,
+   u64 clust_idx)
+{
+   struct qcow_header *header = q->header;
+   struct qcow_refcount_table *rft = &q->refcount_table;
+   struct qcow_refcount_block *rfb;
+   u64 new_block_offset;
+   u64 rft_idx;
+
+   rft_idx = clust_idx >> (header->cluster_bits -
+   QCOW_REFCOUNT_BLOCK_SHIFT);
+
+   if (rft_idx >= rft->rf_size) {
+   pr_warning("Don't support grow refcount block table");
+   return NULL;
+   }
+
+   new_block_offset = qcow_alloc_clusters(q, q->cluster_size, 0);
+   if (new_block_offset < 0)
+   return NULL;
+
+   rfb = new_refcount_block(q, new_block_offset);
+   if (!rfb)
+   return NULL;
+
+   memset(rfb->entries, 0x00, q->cluster_size);
+   rfb->dirty = 1;
+
+   /* write refcount block */
+   if (write_refcount_block(q, rfb) < 0)
+   goto free_rfb;
+
+   if (cache_refcount_block(q, rfb) < 0)
+   goto free_rfb;
+
+   rft->rf_table[rft_idx] = cpu_to_be64(new_block_offset);
+   if (update_cluster_refcount(q, new_block_offset >>
+   header->cluster_bits, 1) < 0)
+   goto recover_rft;
+
+   if (qcow_write_refcount_table(q) < 0)
+   goto recover_rft;
+
+   return rfb;
+
+recover_rft:
+   rft->rf_table[rft_idx] = 0;
+free_rfb:
+   free(rfb);
+   return NULL;
+}
+
 static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 clust_idx)
 {
	struct qcow_header *header = q->header;
@@ -667,14 +725,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
	rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
	if (rft_idx >= rft->rf_size)
-   return NULL;
+   return ERR_PTR(-ENOSPC);
 
	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
-
-   if (!rfb_offset) {
-   pr_warning("Don't support to grow refcount table");
-   return NULL;
-   }
+   if (!rfb_offset)
+   return ERR_PTR(-ENOSPC);
 
	rfb = refcount_block_search(q, rfb_offset);
	if (rfb)
@@ -705,7 +760,9 @@ static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
	u64 rfb_idx;
 
	rfb = qcow_read_refcount_block(q, clust_idx);
-   if (!rfb) {
+   if (PTR_ERR(rfb) == -ENOSPC)
+   return 0;
+   else if (IS_ERR_OR_NULL(rfb)) {
	pr_warning("Error while reading refcount table");
	return -1;
	}
@@ -729,7 +786,13 @@ static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
	u64 rfb_idx;
 
	rfb = qcow_read_refcount_block(q, clust_idx);
-   if (!rfb) {
+   if (PTR_ERR(rfb) == -ENOSPC) {
+   rfb = qcow_grow_refcount_block(q, clust_idx);
+   if (!rfb) {
+   pr_warning("error while growing refcount table");
+   return -1;
+   }
+   } else if (IS_ERR_OR_NULL(rfb)) {
	pr_warning("error while reading refcount table");
	return -1;
	}
@@ -774,11 +837,11 @@ static void  qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size)
  * can satisfy the size. free_clust_idx is initialized to zero and
  * Record last position.
  */
-static u64 qcow_alloc_clusters(struct qcow *q, u64 size)
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref)
 {
	struct qcow_header *header = q->header;
	u16 clust_refcount;
-   u32 clust_idx, i;
+   u32 clust_idx = 0, i;
	u64 clust_num;
 
	clust_num = (size + (q->cluster_size - 1)) >> header->cluster_bits;
@@ -793,12 +856,15 @@ again

Re: [RFC PATCH] kvm tools, qcow: Add support for growing refcount blocks

2011-12-13 Thread lan,Tianyu
Oh yeah, thanks for your reply. It's a good lesson for me. :)
I will update soon.
On 二, 2011-12-13 at 16:57 +0800, Kevin Wolf wrote:
 Am 13.12.2011 04:41, schrieb lan,Tianyu:
  On 一, 2011-12-12 at 19:15 +0800, Kevin Wolf wrote:
  Am 12.12.2011 11:58, schrieb Pekka Enberg:
  On Mon, 12 Dec 2011, Kevin Wolf wrote:
  @@ -667,14 +722,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
  rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
  if (rft_idx >= rft->rf_size)
  -   return NULL;
  +   return (void *)-ENOSPC;
 
  Is this allowed style in kvm-tool? :-/
 
  It needs to use ERR_PTR() and related macros but otherwise I don't see a 
  big problem with it.
 
  Can you be sure that it never clashes with a valid allocation when you
  use this in userspace?
 
  But yes, at least using appropriate functions should be required. And
  this means that you can't only check for -ENOSPC, but you need to check
  for all possible error codes (IS_ERR_VALUE() I guess).
  The qcow_read_refcount_block() is invoked in two places,
  qcow_get_refcount() and update_cluster_refcount(), and will only return
  NULL or -ENOSPC.
  
  In qcow_get_refcount(), when qcow_read_refcount_block() returns
  -ENOSPC (no refcount block is available), zero is returned, which means
  the cluster is not in use and the refcount block can be grown in
  update_cluster_refcount().
  
  In update_cluster_refcount(), when qcow_read_refcount_block()
  returns -ENOSPC, it is necessary to grow the refcount blocks.
  
  So -ENOSPC should be handled specially.
  
  Does this make sense? :)
 
 I'm not saying that your code won't work today, just that it's brittle.
 If someone adds a different error return code somewhere and doesn't
 check if all callers properly deal with error codes, it will break.
 
 Kevin

Thanks
Tianyu Lan



Re: [RFC PATCH] kvm tools, qcow: Add support for growing refcount blocks

2011-12-12 Thread lan,Tianyu
Thanks for your review.
On 一, 2011-12-12 at 17:55 +0800, Kevin Wolf wrote:
 Am 12.12.2011 03:03, schrieb Lan Tianyu:
  This patch enables allocating new refcount blocks and so then kvm tools
  could expand qcow2 image much larger.
  
  Signed-off-by: Lan Tianyu tianyu@intel.com
  ---
   tools/kvm/disk/qcow.c |  105 
  +---
   1 files changed, 89 insertions(+), 16 deletions(-)
  
  diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
  index e139fa5..929ba69 100644
  --- a/tools/kvm/disk/qcow.c
  +++ b/tools/kvm/disk/qcow.c
  @@ -12,6 +12,7 @@
   #include <string.h>
   #include <unistd.h>
   #include <fcntl.h>
  +#include <errno.h>
   #ifdef CONFIG_HAS_ZLIB
   #include <zlib.h>
   #endif
  @@ -20,6 +21,10 @@
   #include <linux/kernel.h>
   #include <linux/types.h>
   
  +static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append);
  +static int qcow_write_refcount_table(struct qcow *q);
  +static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref);
  +static void  qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size);
   
   static inline int qcow_pwrite_sync(int fd,
  void *buf, size_t count, off_t offset)
  @@ -657,6 +662,56 @@ static struct qcow_refcount_block *refcount_block_search(struct qcow *q, u64 off
  return rfb;
   }
   
  +static struct qcow_refcount_block *qcow_grow_refcount_block(struct qcow *q,
  +   u64 clust_idx)
  +{
  +   struct qcow_header *header = q->header;
  +   struct qcow_refcount_table *rft = &q->refcount_table;
  +   struct qcow_refcount_block *rfb;
  +   u64 new_block_offset;
  +   u64 rft_idx;
  +
  +   rft_idx = clust_idx >> (header->cluster_bits -
  +   QCOW_REFCOUNT_BLOCK_SHIFT);
  +
  +   if (rft_idx >= rft->rf_size) {
  +   pr_warning("Don't support grow refcount block table");
  +   return NULL;
  +   }
  +
  +   new_block_offset = qcow_alloc_clusters(q, q->cluster_size, 0);
  +   if (new_block_offset < 0)
  +   return NULL;
  +
  +   rfb = new_refcount_block(q, new_block_offset);
  +   if (!rfb)
  +   return NULL;
  +
  +   memset(rfb->entries, 0x00, q->cluster_size);
  +   rfb->dirty = 1;
  +
  +   /* write refcount block */
  +   if (write_refcount_block(q, rfb) < 0)
  +   goto free_rfb;
  +
  +   if (cache_refcount_block(q, rfb) < 0)
  +   goto free_rfb;
  +
  +   rft->rf_table[rft_idx] = cpu_to_be64(new_block_offset);
  +   if (qcow_write_refcount_table(q) < 0)
  +   goto free_rfb;
  +
  +   if (update_cluster_refcount(q, new_block_offset >>
  +   header->cluster_bits, 1) < 0)
  +   goto free_rfb;
 
 This order is unsafe, you could end up with a refcount block that is
 already in use, but still has a refcount of 0.
How about the following?
rft->rf_table[rft_idx] = cpu_to_be64(new_block_offset);

if (update_cluster_refcount(q, new_block_offset >>
header->cluster_bits, 1) < 0) {
rft->rf_table[rft_idx] = 0;
goto free_rfb;
}

if (qcow_write_refcount_table(q) < 0) {
rft->rf_table[rft_idx] = 0;
qcow_free_clusters(q, new_block_offset, q->cluster_size);
goto free_rfb;
}

update_cluster_refcount() will use rft->rf_table, so updating
rft->rf_table must happen before update_cluster_refcount().
  +
  +   return rfb;
  +
  +free_rfb:
  +   free(rfb);
  +   return NULL;
  +}
  +
   static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 clust_idx)
    {
   struct qcow_header *header = q->header;
  @@ -667,14 +722,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
    
   rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
   if (rft_idx >= rft->rf_size)
   -   return NULL;
   +   return (void *)-ENOSPC;
 
 Is this allowed style in kvm-tool? :-/
 
 Kevin




Re: [RFC PATCH] kvm tools, qcow: Add support for growing refcount blocks

2011-12-12 Thread lan,Tianyu
On 一, 2011-12-12 at 18:58 +0800, Pekka Enberg wrote:
 On Mon, 12 Dec 2011, Kevin Wolf wrote:
  @@ -667,14 +722,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
 	rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
 	if (rft_idx >= rft->rf_size)
  -  return NULL;
  +  return (void *)-ENOSPC;
 
  Is this allowed style in kvm-tool? :-/
 
 It needs to use ERR_PTR() and related macros but otherwise I don't see a 
 big problem with it.
 
   Pekka

I have tried to use ERR_PTR(). But when I included linux/err.h, the
following compile error appeared.

  CC   disk/qcow.o
In file included from disk/qcow.c:20:
../../include/linux/err.h:22: error: expected '=', ',', ';', 'asm' or
'__attribute__' before 'ERR_PTR'
../../include/linux/err.h:27: error: expected '=', ',', ';', 'asm' or
'__attribute__' before 'PTR_ERR'
../../include/linux/err.h:32: error: expected '=', ',', ';', 'asm' or
'__attribute__' before 'IS_ERR'
../../include/linux/err.h:37: error: expected '=', ',', ';', 'asm' or
'__attribute__' before 'IS_ERR_OR_NULL'
../../include/linux/err.h:49: error: expected '=', ',', ';', 'asm' or
'__attribute__' before 'ERR_CAST'
../../include/linux/err.h:55: error: expected '=', ',', ';', 'asm' or
'__attribute__' before 'PTR_RET'


in the linux/err.h

static inline void * __must_check ERR_PTR(long error)
{
return (void *) error;
}

When I comment out __must_check, the error disappears.

Finally I did the cast myself rather than use ERR_PTR().
Are there substitutes for ERR_PTR in user space, or other options?
 




Re: [RFC PATCH] kvm tools, qcow: Add support for growing refcount blocks

2011-12-12 Thread lan,Tianyu
On 一, 2011-12-12 at 19:15 +0800, Kevin Wolf wrote:
 Am 12.12.2011 11:58, schrieb Pekka Enberg:
  On Mon, 12 Dec 2011, Kevin Wolf wrote:
  @@ -667,14 +722,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
	rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
	if (rft_idx >= rft->rf_size)
  - return NULL;
  + return (void *)-ENOSPC;
 
  Is this allowed style in kvm-tool? :-/
  
  It needs to use ERR_PTR() and related macros but otherwise I don't see a 
  big problem with it.
 
 Can you be sure that it never clashes with a valid allocation when you
 use this in userspace?
 
 But yes, at least using appropriate functions should be required. And
 this means that you can't only check for -ENOSPC, but you need to check
 for all possible error codes (IS_ERR_VALUE() I guess).
The qcow_read_refcount_block() is invoked in two places,
qcow_get_refcount() and update_cluster_refcount(), and will only return
NULL or -ENOSPC.

In qcow_get_refcount(), when qcow_read_refcount_block() returns
-ENOSPC (no refcount block is available), zero is returned, which means
the cluster is not in use and the refcount block can be grown in
update_cluster_refcount().

In update_cluster_refcount(), when qcow_read_refcount_block()
returns -ENOSPC, it is necessary to grow the refcount blocks.

So -ENOSPC should be handled specially.

Does this make sense? :)
 Kevin

Thanks
Tianyu Lan.





[RFC PATCH] kvm tools, qcow: Add support for growing refcount blocks

2011-12-11 Thread Lan Tianyu
This patch enables allocating new refcount blocks, so that kvm tools
can grow a qcow2 image much larger.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c |  105 +---
 1 files changed, 89 insertions(+), 16 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index e139fa5..929ba69 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -12,6 +12,7 @@
 #include <string.h>
 #include <unistd.h>
 #include <fcntl.h>
+#include <errno.h>
 #ifdef CONFIG_HAS_ZLIB
 #include <zlib.h>
 #endif
@@ -20,6 +21,10 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append);
+static int qcow_write_refcount_table(struct qcow *q);
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref);
+static void  qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size);
 
 static inline int qcow_pwrite_sync(int fd,
void *buf, size_t count, off_t offset)
@@ -657,6 +662,56 @@ static struct qcow_refcount_block *refcount_block_search(struct qcow *q, u64 off
return rfb;
 }
 
+static struct qcow_refcount_block *qcow_grow_refcount_block(struct qcow *q,
+   u64 clust_idx)
+{
+   struct qcow_header *header = q->header;
+   struct qcow_refcount_table *rft = &q->refcount_table;
+   struct qcow_refcount_block *rfb;
+   u64 new_block_offset;
+   u64 rft_idx;
+
+   rft_idx = clust_idx >> (header->cluster_bits -
+   QCOW_REFCOUNT_BLOCK_SHIFT);
+
+   if (rft_idx >= rft->rf_size) {
+   pr_warning("Don't support grow refcount block table");
+   return NULL;
+   }
+
+   new_block_offset = qcow_alloc_clusters(q, q->cluster_size, 0);
+   if (new_block_offset < 0)
+   return NULL;
+
+   rfb = new_refcount_block(q, new_block_offset);
+   if (!rfb)
+   return NULL;
+
+   memset(rfb->entries, 0x00, q->cluster_size);
+   rfb->dirty = 1;
+
+   /* write refcount block */
+   if (write_refcount_block(q, rfb) < 0)
+   goto free_rfb;
+
+   if (cache_refcount_block(q, rfb) < 0)
+   goto free_rfb;
+
+   rft->rf_table[rft_idx] = cpu_to_be64(new_block_offset);
+   if (qcow_write_refcount_table(q) < 0)
+   goto free_rfb;
+
+   if (update_cluster_refcount(q, new_block_offset >>
+   header->cluster_bits, 1) < 0)
+   goto free_rfb;
+
+   return rfb;
+
+free_rfb:
+   free(rfb);
+   return NULL;
+}
+
 static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 clust_idx)
 {
	struct qcow_header *header = q->header;
@@ -667,14 +722,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
	rft_idx = clust_idx >> (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
	if (rft_idx >= rft->rf_size)
-   return NULL;
+   return (void *)-ENOSPC;
 
	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
-
-   if (!rfb_offset) {
-   pr_warning("Don't support to grow refcount table");
-   return NULL;
-   }
+   if (!rfb_offset)
+   return (void *)-ENOSPC;
 
rfb = refcount_block_search(q, rfb_offset);
if (rfb)
@@ -708,7 +760,8 @@ static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
	if (!rfb) {
	pr_warning("Error while reading refcount table");
	return -1;
-   }
+   } else if ((long)rfb == -ENOSPC)
+   return 0;
 
	rfb_idx = clust_idx & (((1ULL <<
	(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
@@ -732,6 +785,12 @@ static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
	if (!rfb) {
	pr_warning("error while reading refcount table");
	return -1;
+   } else if ((long)rfb == -ENOSPC) {
+   rfb = qcow_grow_refcount_block(q, clust_idx);
+   if (!rfb) {
+   pr_warning("error while growing refcount table");
+   return -1;
+   }
	}
 
rfb_idx = clust_idx  (((1ULL 
@@ -774,11 +833,11 @@ static void  qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size)
  * can satisfy the size. free_clust_idx is initialized to zero and
  * Record last position.
  */
-static u64 qcow_alloc_clusters(struct qcow *q, u64 size)
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size, int update_ref)
 {
	struct qcow_header *header = q->header;
	u16 clust_refcount;
-   u32 clust_idx, i;
+   u32 clust_idx = 0, i;
	u64 clust_num;
 
	clust_num = (size + (q->cluster_size - 1)) >> header->cluster_bits;
@@ -793,12 +852,15 @@ again:
	goto again;
	}
 
-   for (i = 0; i < clust_num; i++)
-   if (update_cluster_refcount(q,
-   q->free_clust_idx

[PATCH v5] kvm tools, qcow: Add the support for copy-on-write cluster

2011-11-28 Thread Lan Tianyu
When a write request targets a cluster without the copied flag,
allocate a new cluster and write the original data, with the
modification applied, to the new cluster. This also adds support for
writing to qcow2 compressed images. After testing, the image file
passes qemu-img check. Performance still needs to be improved.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c|  454 ++
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 289 insertions(+), 167 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..e139fa5 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+
+static inline int qcow_pwrite_sync(int fd,
+	void *buf, size_t count, off_t offset)
+{
+	if (pwrite_in_full(fd, buf, count, offset) < 0)
+		return -1;
+
+	return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
 	struct rb_node **link = &(root->rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c)
 
 	size = 1 << header->l2_bits;
 
-	if (pwrite_in_full(q->fd, c->table, size * sizeof(u64), c->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, c->table,
+			size * sizeof(u64), c->offset) < 0)
 		return -1;
 
 	c->dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 		 */
 		lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-		if (qcow_l2_cache_write(q, lru) < 0)
-			goto error;
-
 		/* Remove the node from the cache */
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
@@ -508,47 +516,6 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector,
 	return total;
 }
 
-static inline u64 file_size(int fd)
-{
-	struct stat st;
-
-	if (fstat(fd, &st) < 0)
-		return 0;
-
-	return st.st_size;
-}
-
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset)
-{
-	if (pwrite_in_full(fd, buf, count, offset) < 0)
-		return -1;
-
-	return fdatasync(fd);
-}
-
-/* Writes a level 2 table at the end of the file. */
-static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
-{
-	struct qcow_header *header = q->header;
-	u64 clust_sz;
-	u64 f_sz;
-	u64 off;
-	u64 sz;
-
-	f_sz		= file_size(q->fd);
-	if (!f_sz)
-		return 0;
-
-	sz		= 1 << header->l2_bits;
-	clust_sz	= 1 << header->cluster_bits;
-	off		= ALIGN(f_sz, clust_sz);
-
-	if (pwrite_in_full(q->fd, table, sz * sizeof(u64), off) < 0)
-		return 0;
-
-	return off;
-}
-
 static void refcount_table_free_cache(struct qcow_refcount_table *rft)
 {
 	struct rb_root *r = &rft->root;
@@ -601,7 +568,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
 	if (!rfb->dirty)
 		return 0;
 
-	if (pwrite_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, rfb->entries,
+			rfb->size * sizeof(u16), rfb->offset) < 0)
 		return -1;
 
 	rfb->dirty = 0;
@@ -618,9 +586,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +671,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
 		return rfb;
@@ -728,35 +698,140 @@ error_free_rfb:
 	return NULL;
 }
 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1

[PATCH v3] kvm tools, qcow: Add the support for copy-on-write cluster

2011-11-23 Thread Lan Tianyu
When a write request targets a cluster that lacks the copied flag,
allocate a new cluster and write the original data, with the requested
modification applied, to that new cluster. This also adds support for
writing to qcow2 compressed images. After testing, the image file passes
qemu-img check. Performance still needs to be improved.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c|  456 +++---
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 292 insertions(+), 166 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..7cf6fda 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+
+static inline int qcow_pwrite_sync(int fd,
+	void *buf, size_t count, off_t offset)
+{
+	if (pwrite_in_full(fd, buf, count, offset) < 0)
+		return -1;
+
+	return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
 	struct rb_node **link = &(root->rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c)
 
 	size = 1 << header->l2_bits;
 
-	if (pwrite_in_full(q->fd, c->table, size * sizeof(u64), c->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, c->table,
+			size * sizeof(u64), c->offset) < 0)
 		return -1;
 
 	c->dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 		 */
 		lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-		if (qcow_l2_cache_write(q, lru) < 0)
-			goto error;
-
 		/* Remove the node from the cache */
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
@@ -508,47 +516,6 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector,
 	return total;
 }
 
-static inline u64 file_size(int fd)
-{
-	struct stat st;
-
-	if (fstat(fd, &st) < 0)
-		return 0;
-
-	return st.st_size;
-}
-
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset)
-{
-	if (pwrite_in_full(fd, buf, count, offset) < 0)
-		return -1;
-
-	return fdatasync(fd);
-}
-
-/* Writes a level 2 table at the end of the file. */
-static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
-{
-	struct qcow_header *header = q->header;
-	u64 clust_sz;
-	u64 f_sz;
-	u64 off;
-	u64 sz;
-
-	f_sz		= file_size(q->fd);
-	if (!f_sz)
-		return 0;
-
-	sz		= 1 << header->l2_bits;
-	clust_sz	= 1 << header->cluster_bits;
-	off		= ALIGN(f_sz, clust_sz);
-
-	if (pwrite_in_full(q->fd, table, sz * sizeof(u64), off) < 0)
-		return 0;
-
-	return off;
-}
-
 static void refcount_table_free_cache(struct qcow_refcount_table *rft)
 {
 	struct rb_root *r = &rft->root;
@@ -601,7 +568,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
 	if (!rfb->dirty)
 		return 0;
 
-	if (pwrite_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, rfb->entries,
+			rfb->size * sizeof(u16), rfb->offset) < 0)
 		return -1;
 
 	rfb->dirty = 0;
@@ -618,9 +586,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +671,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
 		return rfb;
@@ -728,35 +698,140 @@ error_free_rfb:
 	return NULL;
 }
 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1

[PATCH v4] kvm tools, qcow: Add the support for copy-on-write cluster

2011-11-23 Thread Lan Tianyu
When a write request targets a cluster that lacks the copied flag,
allocate a new cluster and write the original data, with the requested
modification applied, to that new cluster. This also adds support for
writing to qcow2 compressed images. After testing, the image file passes
qemu-img check. Performance still needs to be improved.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c|  456 +++---
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 292 insertions(+), 166 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..400aae8 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+
+static inline int qcow_pwrite_sync(int fd,
+	void *buf, size_t count, off_t offset)
+{
+	if (pwrite_in_full(fd, buf, count, offset) < 0)
+		return -1;
+
+	return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
 	struct rb_node **link = &(root->rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c)
 
 	size = 1 << header->l2_bits;
 
-	if (pwrite_in_full(q->fd, c->table, size * sizeof(u64), c->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, c->table,
+			size * sizeof(u64), c->offset) < 0)
 		return -1;
 
 	c->dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 		 */
 		lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-		if (qcow_l2_cache_write(q, lru) < 0)
-			goto error;
-
 		/* Remove the node from the cache */
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
@@ -508,47 +516,6 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector,
 	return total;
 }
 
-static inline u64 file_size(int fd)
-{
-	struct stat st;
-
-	if (fstat(fd, &st) < 0)
-		return 0;
-
-	return st.st_size;
-}
-
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset)
-{
-	if (pwrite_in_full(fd, buf, count, offset) < 0)
-		return -1;
-
-	return fdatasync(fd);
-}
-
-/* Writes a level 2 table at the end of the file. */
-static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
-{
-	struct qcow_header *header = q->header;
-	u64 clust_sz;
-	u64 f_sz;
-	u64 off;
-	u64 sz;
-
-	f_sz		= file_size(q->fd);
-	if (!f_sz)
-		return 0;
-
-	sz		= 1 << header->l2_bits;
-	clust_sz	= 1 << header->cluster_bits;
-	off		= ALIGN(f_sz, clust_sz);
-
-	if (pwrite_in_full(q->fd, table, sz * sizeof(u64), off) < 0)
-		return 0;
-
-	return off;
-}
-
 static void refcount_table_free_cache(struct qcow_refcount_table *rft)
 {
 	struct rb_root *r = &rft->root;
@@ -601,7 +568,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
 	if (!rfb->dirty)
 		return 0;
 
-	if (pwrite_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, rfb->entries,
+			rfb->size * sizeof(u16), rfb->offset) < 0)
 		return -1;
 
 	rfb->dirty = 0;
@@ -618,9 +586,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +671,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
 		return rfb;
@@ -728,35 +698,140 @@ error_free_rfb:
 	return NULL;
 }
 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1

RE: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-21 Thread Lan, Tianyu
Yeah. I will fix them in the next version.

-Original Message-
From: penb...@gmail.com [mailto:penb...@gmail.com] On Behalf Of Pekka Enberg
Sent: Monday, November 21, 2011 5:06 PM
To: Lan, Tianyu
Cc: kvm@vger.kernel.org; asias.he...@gmail.com; levinsasha...@gmail.com; 
prasadjoshi...@gmail.com; kw...@redhat.com
Subject: Re: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

On Mon, Nov 21, 2011 at 9:12 AM, Lan Tianyu tianyu@intel.com wrote:
 +/*Allocate clusters according to the size. Find a postion that
 + *can satisfy the size. free_clust_idx is initialized to zero and
 + *Record last position.
 +*/

Can you please fix up your comments to use the following standard kernel style:

/*
 * Allocate clusters according to the size. Find a postion that
 * can satisfy the size. free_clust_idx is initialized to zero and
 * Record last position.
 */

[ Whitespace after asterisk and starting line doesn't have text. ]

 -               rfb = qcow_read_refcount_block(q, clust_idx);
 -               if (!rfb) {
 -                       pr_warning("L1: error while reading refcount table");
 +               clust_new_start = qcow_alloc_clusters(q, q->cluster_size);
 +               if (clust_new_start < 0) {
 +                       pr_warning("Cluster alloc error!");

Please drop the exclamation marks from pr_warning() and pr_error()
messages. It just adds pointless noise.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] kvm tools, qcow: Add the support for copy-on-write cluster

2011-11-21 Thread Lan Tianyu
When a write request targets a cluster that lacks the copied flag,
allocate a new cluster and write the original data, with the requested
modification applied, to that new cluster. This also adds support for
writing to qcow2 compressed images. After testing, the image file passes
qemu-img check. Performance still needs to be improved.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c|  421 +-
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 292 insertions(+), 131 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..99b13e7 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+
+static inline int qcow_pwrite_sync(int fd,
+	void *buf, size_t count, off_t offset)
+{
+	if (pwrite_in_full(fd, buf, count, offset) < 0)
+		return -1;
+
+	return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
 	struct rb_node **link = &(root->rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c)
 
 	size = 1 << header->l2_bits;
 
-	if (pwrite_in_full(q->fd, c->table, size * sizeof(u64), c->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, c->table,
+			size * sizeof(u64), c->offset) < 0)
 		return -1;
 
 	c->dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 		 */
 		lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-		if (qcow_l2_cache_write(q, lru) < 0)
-			goto error;
-
 		/* Remove the node from the cache */
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
@@ -518,14 +526,6 @@ static inline u64 file_size(int fd)
 	return st.st_size;
 }
 
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset)
-{
-	if (pwrite_in_full(fd, buf, count, offset) < 0)
-		return -1;
-
-	return fdatasync(fd);
-}
-
 /* Writes a level 2 table at the end of the file. */
 static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
 {
@@ -601,7 +601,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
 	if (!rfb->dirty)
 		return 0;
 
-	if (pwrite_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, rfb->entries,
+			rfb->size * sizeof(u16), rfb->offset) < 0)
 		return -1;
 
 	rfb->dirty = 0;
@@ -618,9 +619,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +704,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
 		return rfb;
@@ -728,35 +731,140 @@ error_free_rfb:
 	return NULL;
 }
 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+	return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u16 refcount;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+	if (rfb_idx >= rfb->size) {
+		pr_warning("refcount block index out of bounds");
+		return -1;
+	}
+
+	refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;
+	rfb

[PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-20 Thread Lan Tianyu
When a write request targets a cluster that lacks the copied flag,
allocate a new cluster and write the original data, with the requested
modification applied, to that new cluster. This also adds support for
writing to qcow2 compressed images. After testing, the image file passes
qemu-img check. Performance still needs to be improved.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c|  411 +-
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 285 insertions(+), 128 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..723faaa 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+
+static inline int qcow_pwrite_sync(int fd,
+	void *buf, size_t count, off_t offset)
+{
+	if (pwrite_in_full(fd, buf, count, offset) < 0)
+		return -1;
+
+	return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
 	struct rb_node **link = &(root->rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c)
 
 	size = 1 << header->l2_bits;
 
-	if (pwrite_in_full(q->fd, c->table, size * sizeof(u64), c->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, c->table,
+			size * sizeof(u64), c->offset) < 0)
 		return -1;
 
 	c->dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 		 */
 		lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-		if (qcow_l2_cache_write(q, lru) < 0)
-			goto error;
-
 		/* Remove the node from the cache */
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
@@ -518,14 +526,6 @@ static inline u64 file_size(int fd)
 	return st.st_size;
 }
 
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset)
-{
-	if (pwrite_in_full(fd, buf, count, offset) < 0)
-		return -1;
-
-	return fdatasync(fd);
-}
-
 /* Writes a level 2 table at the end of the file. */
 static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
 {
@@ -601,7 +601,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
 	if (!rfb->dirty)
 		return 0;
 
-	if (pwrite_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, rfb->entries,
+			rfb->size * sizeof(u16), rfb->offset) < 0)
 		return -1;
 
 	rfb->dirty = 0;
@@ -618,9 +619,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +704,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
 		return rfb;
@@ -728,35 +731,138 @@ error_free_rfb:
 	return NULL;
 }
 
-/*
- * QCOW file might grow during a write operation. Not only data but metadata is
- * also written at the end of the file. Therefore it is necessary to ensure
- * every write is committed to disk. Hence we use uses qcow_pwrite_sync() to
- * synchronize the in-core state of QCOW image to disk.
- *
- * We also try to restore the image to a consistent state if the metdata
- * operation fails. The two metadat operations are: level 1 and level 2 table
- * update. If either of them fails the image is truncated to a consistent state.
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+	return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u16

RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-19 Thread Lan, Tianyu
How about using sync_file_range() to sync the metadata?

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Lan, Tianyu
Sent: Saturday, November 19, 2011 10:09 AM
To: Kevin Wolf
Cc: penb...@kernel.org; kvm@vger.kernel.org; asias.he...@gmail.com; 
levinsasha...@gmail.com; prasadjoshi...@gmail.com
Subject: RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write 
clusters

Hi Kevin:
	Thanks for your review. Your comments mean that there should be an
fsync after updating metadata (refcount block, L1 table and L2 table).

Thanks 
Tianyu Lan

-Original Message-
 + /*write l2 table*/
 + l2t->dirty = 1;
 + if (qcow_l2_cache_write(q, l2t) < 0)
   goto free_cache;

You need to make sure that the refcount update is written first (e.g.
with fsync), otherwise you risk corruption when the host crashes in the middle.

 
 - if (cache_table(q, l2t) < 0) {
 - if (ftruncate(q->fd, f_sz) < 0)
 - goto free_cache;
 + /* Update the l1 talble */
 + l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_new_offset
 + | QCOW2_OFLAG_COPIED);
 
 - goto free_cache;
 - }
 + if (pwrite_in_full(q->fd, l1t->l1_table,
 + l1t->table_size * sizeof(u64),
 + header->l1_table_offset) < 0)
 + goto error;

Likewise, the L1 table write must be ordered against the L2 write.

"goto error" is using the wrong label.

 
 - /* Update the in-core entry */
 - l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_offset);
 + /*cache l2 table*/
 + cache_table(q, l2t);

After so many explicit comments, you can probably guess what's wrong here...


RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-19 Thread Lan, Tianyu
OK, thanks.
But fsync is too slow. I am trying to find a way to sync only a range of a file.
Are there any solutions that meet my purpose?

Thanks
Tianyu Lan

-Original Message-
From: Sasha Levin [mailto:levinsasha...@gmail.com] 
Sent: Sunday, November 20, 2011 12:27 AM
To: Lan, Tianyu
Cc: Kevin Wolf; penb...@kernel.org; kvm@vger.kernel.org; asias.he...@gmail.com; 
prasadjoshi...@gmail.com
Subject: RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write 
clusters

On Sat, 2011-11-19 at 23:30 +0800, Lan, Tianyu wrote:
 How about using the sync_file_range to sync the metadata?

sync_file_range() is only a hint; it doesn't actually guarantee anything.

-- 

Sasha.



RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-19 Thread Lan, Tianyu
Yeah. That will make the work quite simple.
After testing, fdatasync performs better than fsync.

-Original Message-
From: Sasha Levin [mailto:levinsasha...@gmail.com] 
Sent: Sunday, November 20, 2011 2:23 PM
To: Lan, Tianyu
Cc: Kevin Wolf; penb...@kernel.org; kvm@vger.kernel.org; asias.he...@gmail.com; 
prasadjoshi...@gmail.com
Subject: RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write 
clusters

On Sun, 2011-11-20 at 14:14 +0800, Lan, Tianyu wrote:
 OK. Thx.
 But fsync is too slow. I try to find a way to sync a range of file. 
 Are there any solutions to meet my purpose?

fdatasync() is as good as it'll get.

tbh, maybe we should just consider opening QCOW images with O_SYNC and
just get it over with?

 Thanks
 Tianyu Lan
 
 -Original Message-
 From: Sasha Levin [mailto:levinsasha...@gmail.com] 
 Sent: Sunday, November 20, 2011 12:27 AM
 To: Lan, Tianyu
 Cc: Kevin Wolf; penb...@kernel.org; kvm@vger.kernel.org; 
 asias.he...@gmail.com; prasadjoshi...@gmail.com
 Subject: RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for 
 copy-on-write clusters
 
 On Sat, 2011-11-19 at 23:30 +0800, Lan, Tianyu wrote:
  How about using the sync_file_range to sync the metadata?
 
 sync_file_range() is only a hint, it doesn't actually assure anything.
 

-- 

Sasha.



[RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-18 Thread Lan Tianyu
When a write request targets a cluster that lacks the copied flag,
allocate a new cluster and write the original data, with the requested
modification applied, to that new cluster. This also adds support for
writing to qcow2 compressed images. After testing, the image file passes
qemu-img check.


Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c|  366 +-
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 255 insertions(+), 113 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..4d9125d 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -122,9 +122,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 		 */
 		lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-		if (qcow_l2_cache_write(q, lru) < 0)
-			goto error;
-
 		/* Remove the node from the cache */
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
@@ -618,9 +615,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +700,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
 		return rfb;
@@ -728,35 +727,121 @@ error_free_rfb:
 	return NULL;
 }
 
-/*
- * QCOW file might grow during a write operation. Not only data but metadata is
- * also written at the end of the file. Therefore it is necessary to ensure
- * every write is committed to disk. Hence we use uses qcow_pwrite_sync() to
- * synchronize the in-core state of QCOW image to disk.
- *
- * We also try to restore the image to a consistent state if the metdata
- * operation fails. The two metadat operations are: level 1 and level 2 table
- * update. If either of them fails the image is truncated to a consistent state.
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+	return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u16 refcount;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+	if (rfb_idx >= rfb->size) {
+		pr_warning("refcount block index out of bounds");
+		return -1;
+	}
+
+	refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;
+	rfb->entries[rfb_idx] = cpu_to_be16(refcount);
+	rfb->dirty = 1;
+
+	/*write refcount block*/
+	write_refcount_block(q, rfb);
+
+	/*update free_clust_idx since refcount becomes zero*/
+	if (!refcount && clust_idx < q->free_clust_idx)
+		q->free_clust_idx = clust_idx;
+
+	return 0;
+}
+
+static void  qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size)
+{
+	struct qcow_header *header = q->header;
+	u64 start, end, offset;
+
+	start = clust_start & ~(q->cluster_size - 1);
+	end = (clust_start + size - 1) & ~(q->cluster_size - 1);
+	for (offset = start; offset <= end; offset += q->cluster_size)
+		update_cluster_refcount(q, offset >> header->cluster_bits, -1);
+}
+
+/*Allocate clusters according to the size. Find a postion that
+ *can satisfy the size. free_clust_idx is initialized to zero and
+ *Record last position.
+*/
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size)
+{
+	struct qcow_header *header = q->header

RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-18 Thread Lan, Tianyu
Hi Kevin:
Thanks for your review. The following means that there should be an
fsync after updating metadata (refcount block, l1 table and l2 table).

Thanks 
Tianyu Lan

-----Original Message-----
 + /* write l2 table */
 + l2t->dirty = 1;
 + if (qcow_l2_cache_write(q, l2t) < 0)
   goto free_cache;

You need to make sure that the refcount update is written first (e.g.
with fsync), otherwise you risk corruption when the host crashes in the middle.

  
 - if (cache_table(q, l2t) < 0) {
 - if (ftruncate(q->fd, f_sz) < 0)
 - goto free_cache;
 + /* Update the l1 table */
 + l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_new_offset
 + | QCOW2_OFLAG_COPIED);

 - goto free_cache;
 - }
 + if (pwrite_in_full(q->fd, l1t->l1_table,
 + l1t->table_size * sizeof(u64),
 + header->l1_table_offset) < 0)
 + goto error;

Likewise, the L1 table write must be ordered against the L2 write.

goto error is using the wrong label.

  
 - /* Update the in-core entry */
 - l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_offset);
 + /* cache l2 table */
 + cache_table(q, l2t);

After so many explicit comments, you can probably guess what's wrong here...


[RFC PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-16 Thread Lan Tianyu
When a write request targets a cluster without the copied flag,
allocate a new cluster and write the original data, with the
modification applied, to the new cluster. This also adds support for
the write operation on qcow2 compressed images.
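A minimal in-memory sketch of the copy-on-write rule described above (all names here are hypothetical; the real patch operates on L2 tables and refcount blocks, not this toy pool):

```c
#include <assert.h>
#include <string.h>

#define OFLAG_COPIED 0x1u	/* stand-in for QCOW2_OFLAG_COPIED */

struct toy_img {
	char pool[16][8];	/* backing clusters */
	int map[4];		/* guest cluster -> pool index (toy L2 table) */
	unsigned flags[4];	/* per-entry flags */
	int next_free;		/* bump allocator, stand-in for refcounts */
};

/* Writing through an entry that lacks the copied flag first allocates a
 * fresh cluster, copies the original contents there, remaps the entry,
 * and only then applies the modification. */
void toy_cow_write(struct toy_img *q, int idx, int off, char byte)
{
	if (!(q->flags[idx] & OFLAG_COPIED)) {
		int n = q->next_free++;			/* qcow_alloc_clusters() */
		memcpy(q->pool[n], q->pool[q->map[idx]],
		       sizeof(q->pool[n]));		/* copy original data */
		q->map[idx] = n;			/* update toy L2 entry */
		q->flags[idx] |= OFLAG_COPIED;
	}
	q->pool[q->map[idx]][off] = byte;		/* in place from now on */
}
```

The point of the flag is that once an entry is marked copied, later writes skip the allocation and copy, and go straight to the cluster.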

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 tools/kvm/disk/qcow.c|  322 --
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 218 insertions(+), 106 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..2b9af73 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -122,9 +122,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 */
	lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table,
			list);
 
-	if (qcow_l2_cache_write(q, lru) < 0)
-   goto error;
-
/* Remove the node from the cache */
	rb_erase(&lru->node, r);
	list_del_init(&lru->list);
@@ -728,35 +725,110 @@ error_free_rfb:
return NULL;
 }
 
-/*
- * QCOW file might grow during a write operation. Not only data but metadata is
- * also written at the end of the file. Therefore it is necessary to ensure
- * every write is committed to disk. Hence we use qcow_pwrite_sync() to
- * synchronize the in-core state of QCOW image to disk.
- *
- * We also try to restore the image to a consistent state if the metadata
- * operation fails. The two metadata operations are: level 1 and level 2 table
- * update. If either of them fails the image is truncated to a consistent state.
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+	return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u16 refcount;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+	if (rfb_idx >= rfb->size) {
+		pr_warning("refcount block index out of bounds");
+		return -1;
+	}
+
+	refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;
+	rfb->entries[rfb_idx] = cpu_to_be16(refcount);
+	rfb->dirty = 1;
+
+	/* write refcount block */
+	write_refcount_block(q, rfb);
+
+	/* update free_clust_idx since refcount becomes zero */
+	if (!refcount && clust_idx < q->free_clust_idx)
+		q->free_clust_idx = clust_idx;
+
+	return 0;
+}
+
+/* Allocate clusters according to the size. Find a position that
+ * can satisfy the size. free_clust_idx is initialized to zero and
+ * records the last position.
+ */
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size)
+{
+	struct qcow_header *header = q->header;
+	u16 clust_refcount;
+	u32 clust_idx, i;
+	u64 clust_num;
+
+	clust_num = (size + (q->cluster_size - 1)) >> header->cluster_bits;
+
+again:
+	for (i = 0; i < clust_num; i++) {
+		clust_idx = q->free_clust_idx++;
+		clust_refcount = qcow_get_refcount(q, clust_idx);
+		if (clust_refcount < 0)
+			return -1;
+		else if (clust_refcount > 0)
+			goto again;
+	}
+
+	for (i = 0; i < clust_num; i++)
+		update_cluster_refcount(q,
+			q->free_clust_idx - clust_num + i, 1);
+
+	return (q->free_clust_idx - clust_num) << header->cluster_bits;
+}
+
+/* Get l2 table. If the table has been copied, read the table directly.
+ * If the table exists but is not marked copied, allocate a new cluster
+ * and copy the table into it.
  */
-static ssize_t qcow_write_cluster(struct qcow *q, u64 offset, void *buf, u32 src_len)
+static int get_cluster_table(struct qcow *q, u64 offset,
+	struct qcow_l2_table **result_l2t, u64 *result_l2_index)
 {
	struct qcow_header *header = q->header;
	struct qcow_l1_table *l1t = q->table;
	struct qcow_l2_table *l2t;
-	u64 clust_start;
-	u64 clust_flags;
-	u64 l2t_offset;
-	u64 clust_off;
-	u64 l2t_size;
-	u64 clust_sz;
	u64 l1t_idx;
+	u64 l2t_offset

[PATCH RFC] kvm tools: Add support for the read operation of qcow and qcow2 compressed image

2011-10-12 Thread Lan Tianyu
This patch adds the decompression operation performed when the qcow or
qcow2 image is detected to be compressed. It also splits the read-cluster
function into two, one for qcow and one for qcow2, to make it easier to
support both kinds of images, and adds some macros for qcow.
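As background for the read path below, the qcow1 compressed-cluster descriptor packs the file offset and compressed byte count into one 64-bit table entry. A standalone sketch of that decoding, mirroring the coffset/csize extraction the patch performs (field layout per the qcow1 format; the values in the usage check are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define QCOW1_OFLAG_COMPRESSED (1ULL << 63)

struct cdesc {
	uint64_t coffset;	/* file offset of the compressed data */
	uint64_t csize;		/* compressed length in bytes */
};

/* The low (63 - cluster_bits) bits hold the offset of the compressed
 * data; the bits above them, masked to cluster_size - 1, hold its
 * byte count.  Bit 63 is the compressed flag and falls outside both
 * masks, so it needs no separate clearing. */
struct cdesc qcow1_decode_compressed(uint64_t entry, unsigned cluster_bits)
{
	uint64_t cluster_size = 1ULL << cluster_bits;
	uint64_t offset_mask  = (1ULL << (63 - cluster_bits)) - 1;
	struct cdesc d;

	d.coffset = entry & offset_mask;
	d.csize   = (entry >> (63 - cluster_bits)) & (cluster_size - 1);
	return d;
}
```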

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/Makefile   |1 +
 tools/kvm/disk/qcow.c|  315 ++
 tools/kvm/include/kvm/qcow.h |   33 +++--
 3 files changed, 276 insertions(+), 73 deletions(-)

diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile
index fef624d..0d0da4b 100644
--- a/tools/kvm/Makefile
+++ b/tools/kvm/Makefile
@@ -116,6 +116,7 @@ OBJS+= bios/bios-rom.o
 LIBS   += -lrt
 LIBS   += -lpthread
 LIBS   += -lutil
+LIBS   += -lz
 
 # Additional ARCH settings for x86
 ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/ -e s/sun4u/sparc64/ \
diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 2471aeb..2f5c16a 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -12,6 +12,7 @@
 #include <string.h>
 #include <unistd.h>
 #include <fcntl.h>
+#include <zlib.h>
 
 #include <linux/byteorder.h>
 #include <linux/kernel.h>
@@ -233,12 +234,41 @@ error:
return NULL;
 }
 
-static ssize_t qcow_read_cluster(struct qcow *q, u64 offset, void *dst, u32 dst_len)
+static int qcow_decompress_buffer(u8 *out_buf, int out_buf_size,
+   const u8 *buf, int buf_size)
+{
+	z_stream strm1, *strm = &strm1;
+	int ret, out_len;
+
+	memset(strm, 0, sizeof(*strm));
+
+	strm->next_in = (u8 *)buf;
+	strm->avail_in = buf_size;
+	strm->next_out = out_buf;
+	strm->avail_out = out_buf_size;
+
+	ret = inflateInit2(strm, -12);
+	if (ret != Z_OK)
+		return -1;
+
+	ret = inflate(strm, Z_FINISH);
+	out_len = strm->next_out - out_buf;
+	if ((ret != Z_STREAM_END && ret != Z_BUF_ERROR) ||
+	    out_len != out_buf_size) {
+		inflateEnd(strm);
+		return -1;
+	}
+
+	inflateEnd(strm);
+   return 0;
+}
+
+static ssize_t qcow1_read_cluster(struct qcow *q, u64 offset,
+   void *dst, u32 dst_len)
 {
	struct qcow_header *header = q->header;
	struct qcow_l1_table *l1t = q->table;
	struct qcow_l2_table *l2t;
-   u64 cluster_size;
u64 clust_offset;
u64 clust_start;
u64 l2t_offset;
@@ -246,30 +276,24 @@ static ssize_t qcow_read_cluster(struct qcow *q, u64 offset, void *dst, u32 dst_
u64 l2t_size;
u64 l1_idx;
u64 l2_idx;
-
-	cluster_size = 1 << header->cluster_bits;
+   int coffset;
+   int csize;
 
l1_idx = get_l1_index(q, offset);
	if (l1_idx >= l1t->table_size)
return -1;
 
clust_offset = get_cluster_offset(q, offset);
-	if (clust_offset >= cluster_size)
+	if (clust_offset >= q->cluster_size)
return -1;
 
-   length = cluster_size - clust_offset;
+	length = q->cluster_size - clust_offset;
	if (length > dst_len)
length = dst_len;
 
	mutex_lock(&q->mutex);
 
	l2t_offset = be64_to_cpu(l1t->l1_table[l1_idx]);
-	if (l2t_offset & QCOW_OFLAG_COMPRESSED) {
-		pr_warning("compressed sectors are not supported");
-		goto out_error;
-	}
-
-	l2t_offset &= QCOW_OFFSET_MASK;
if (!l2t_offset)
goto zero_cluster;
 
@@ -285,20 +309,130 @@ static ssize_t qcow_read_cluster(struct qcow *q, u64 offset, void *dst, u32 dst_
goto out_error;
 
	clust_start = be64_to_cpu(l2t->table[l2_idx]);
-	if (clust_start & QCOW_OFLAG_COMPRESSED) {
-		pr_warning("compressed sectors are not supported");
-		goto out_error;
+	if (clust_start & QCOW1_OFLAG_COMPRESSED) {
+		coffset = clust_start & q->cluster_offset_mask;
+		csize = clust_start >> (63 - q->header->cluster_bits);
+		csize &= (q->cluster_size - 1);
+
+		if (pread_in_full(q->fd, q->cluster_data, csize,
+			coffset) < 0) {
+			goto out_error;
+		}
+
+		if (qcow_decompress_buffer(q->cluster_cache, q->cluster_size,
+			q->cluster_data, csize) < 0) {
+			goto out_error;
+		}
+
+		memcpy(dst, q->cluster_cache + clust_offset, length);
+		mutex_unlock(&q->mutex);
+	} else {
+		if (!clust_start)
+			goto zero_cluster;
+
+		mutex_unlock(&q->mutex);
+
+		if (pread_in_full(q->fd, dst, length,
+			clust_start + clust_offset) < 0)
+			return -1;
	}
 
-	clust_start &= QCOW_OFFSET_MASK;
-   if (!clust_start)
-   goto zero_cluster;
+   return length;
+
+zero_cluster:
+	mutex_unlock(&q->mutex