RE: [PATCH v16 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking.

2020-03-30 Thread Zhao, Yan Y



> -Original Message-
> From: kvm-ow...@vger.kernel.org  On Behalf
> Of Alex Williamson
> Sent: Tuesday, March 31, 2020 10:38 AM
> To: Zhao, Yan Y 
> Cc: Kirti Wankhede ; c...@nvidia.com; Tian, Kevin
> ; Yang, Ziye ; Liu, Changpeng
> ; Liu, Yi L ;
> mlevi...@redhat.com; eskul...@redhat.com; coh...@redhat.com;
> dgilb...@redhat.com; jonathan.dav...@nutanix.com; eau...@redhat.com;
> a...@ozlabs.ru; pa...@linux.ibm.com; fel...@nutanix.com;
> zhengxiao...@alibaba-inc.com; shuangtai@alibaba-inc.com;
> ken@amd.com; Wang, Zhi A ; qemu-
> de...@nongnu.org; k...@vger.kernel.org
> Subject: Re: [PATCH v16 Kernel 4/7] vfio iommu: Implementation of ioctl for
> dirty pages tracking.
> 
> On Mon, 30 Mar 2020 21:16:21 -0400
> Yan Zhao  wrote:
> 
> > On Tue, Mar 31, 2020 at 09:12:59AM +0800, Alex Williamson wrote:
> > > On Mon, 30 Mar 2020 20:50:47 -0400
> > > Yan Zhao  wrote:
> > >
> > > > On Tue, Mar 31, 2020 at 08:53:47AM +0800, Alex Williamson wrote:
> > > > > On Mon, 30 Mar 2020 19:51:31 -0400 Yan Zhao
> > > > >  wrote:
> > > > >
> > > > > > On Mon, Mar 30, 2020 at 09:49:21PM +0800, Kirti Wankhede wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 3/30/2020 8:54 AM, Yan Zhao wrote:
> > > > > > > > On Fri, Mar 27, 2020 at 01:28:13PM +0800, Kirti Wankhede wrote:
> > > > > > > >> Hit send button little early.
> > > > > > > >>
> > > > > > > >>   >
> > > > > > > >>   > I checked v12, it's not like what I said.
> > > > > > > >>   > In v12, bitmaps are generated per vfio_dma, and
> combination of the
> > > > > > > >>   > bitmaps are required in order to generate a big bitmap
> suiting for dirty
> > > > > > > >>   > query. It can cause problem when offset not aligning.
> > > > > > > >>   > But what I propose here is to generate an rb tree 
> > > > > > > >> orthogonal
> to the tree
> > > > > > > >>   > of vfio_dma.
> > > > > > > >>   >
> > > > > > > >>   > as to CPU cycles saving, I don't think 
> > > > > > > >> iterating/translating
> page by page
> > > > > > > >>   > would achieve that purpose.
> > > > > > > >>   >
> > > > > > > >>
> > > > > > > >> Instead of creating one extra rb tree for dirty pages
> > > > > > > >> tracking in v10 tried to use dma->pfn_list itself, we
> > > > > > > >> tried changes in v10, v11 and v12, latest version is
> > > > > > > >> evolved version with best possible approach after discussion.
> Probably, go through v11 as well.
> > > > > > > >> https://patchwork.kernel.org/patch/11298335/
> > > > > > > >>
> > > > > > > > I'm not sure why all those previous implementations are
> > > > > > > > bound to vfio_dma. for vIOMMU on, in most cases, a
> > > > > > > > vfio_dma is only for a page, so generating a one-byte bitmap for
> a single page in each vfio_dma ?
> > > > > > > > is it possible to creating one extra rb tree to keep dirty
> > > > > > > > ranges, and one fixed length kernel bitmap whose content
> > > > > > > > is generated on query, serving as a bouncing buffer for
> > > > > > > > copy_to_user
> > > > > > > >
> > > > > > >
> > > > > > > One fixed length? what should be fixed value? then isn't it
> > > > > > > better to fix the size to dma->size?
> > > > > > >
> > > > > > > This is also to prevent DoS attack, user space application
> > > > > > > can query a very large range.
> > > > > > >
> > > > > > > >>
> > > > > > > >> On 3/27/2020 6:00 AM, Yan Zhao wrote:
> > > > > > > >>> On Fri, Mar 27, 2020 at 05:39:01AM +0800, Kirti Wankhede
> wrote:
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> On 3/25/2020 7:41 AM, Yan Zhao wrote:
> > > > &g

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-31 Thread Zhao Yan
On Sat, Mar 30, 2019 at 10:14:07PM +0800, Alex Williamson wrote:
> On Fri, 29 Mar 2019 19:10:50 -0400
> Zhao Yan  wrote:
> 
> > On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> > > On Thu, 28 Mar 2019 22:47:04 -0400
> > > Zhao Yan  wrote:
> > >   
> > > > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:  
> > > > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > > > Erik Skultety  wrote:
> > > > > 
> > > > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> > > > > > > hi Alex and Dave,
> > > > > > > Thanks for your replies.
> > > > > > > Please see my comments inline.
> > > > > > >
> > > > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:  
> > > > > > > 
> > > > > > > > On Wed, 27 Mar 2019 20:18:54 +
> > > > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > > > >  
> > > > > > > > > * Zhao Yan (yan.y.z...@intel.com) wrote:  
> > > > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck 
> > > > > > > > > > wrote:  
> > > > > > > > > > > > > > >   b) How do we detect if we're migrating from/to 
> > > > > > > > > > > > > > > the wrong device or
> > > > > > > > > > > > > > > version of device?  Or say to a device with older 
> > > > > > > > > > > > > > > firmware or perhaps
> > > > > > > > > > > > > > > a device that has less device memory ?  
> > > > > > > > > > > > > > Actually it's still an open for VFIO migration. 
> > > > > > > > > > > > > > Need to think about
> > > > > > > > > > > > > > whether it's better to check that in libvirt or 
> > > > > > > > > > > > > > qemu (like a device magic
> > > > > > > > > > > > > > along with verion ?).  
> > > > > > > > > > > >
> > > > > > > > > > > > We must keep the hardware generation is the same with 
> > > > > > > > > > > > one POD of public cloud
> > > > > > > > > > > > providers. But we still think about the live migration 
> > > > > > > > > > > > between from the the lower
> > > > > > > > > > > > generation of hardware migrated to the higher 
> > > > > > > > > > > > generation.  
> > > > > > > > > > >
> > > > > > > > > > > Agreed, lower->higher is the one direction that might 
> > > > > > > > > > > make sense to
> > > > > > > > > > > support.
> > > > > > > > > > >
> > > > > > > > > > > But regardless of that, I think we need to make sure that 
> > > > > > > > > > > incompatible
> > > > > > > > > > > devices/versions fail directly instead of failing in a 
> > > > > > > > > > > subtle, hard to
> > > > > > > > > > > debug way. Might be useful to do some initial sanity 
> > > > > > > > > > > checks in libvirt
> > > > > > > > > > > as well.
> > > > > > > > > > >
> > > > > > > > > > > How easy is it to obtain that information in a form that 
> > > > > > > > > > > can be
> > > > > > > > > > > consumed by higher layers? Can we find out the device 
> > > > > > > > > > > type at least?
> > > > > > > > > > > What about some kind of revision?  
> > > > > > > > > > hi Alex and Cornelia
> > > > > > > > > > for device compatibility, do you think it's a good idea to 
> > > > > > > > > > use "version"
> > > > > > > > > > and "device version" f

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-29 Thread Zhao Yan
On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> On Thu, 28 Mar 2019 22:47:04 -0400
> Zhao Yan  wrote:
> 
> > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > Erik Skultety  wrote:
> > >   
> > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:  
> > > > > hi Alex and Dave,
> > > > > Thanks for your replies.
> > > > > Please see my comments inline.
> > > > >
> > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> > > > > > On Wed, 27 Mar 2019 20:18:54 +
> > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > >
> > > > > > > * Zhao Yan (yan.y.z...@intel.com) wrote:
> > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > >   
> > > > > > > > > > > > >   b) How do we detect if we're migrating from/to the 
> > > > > > > > > > > > > wrong device or
> > > > > > > > > > > > > version of device?  Or say to a device with older 
> > > > > > > > > > > > > firmware or perhaps
> > > > > > > > > > > > > a device that has less device memory ?
> > > > > > > > > > > > Actually it's still an open for VFIO migration. Need to 
> > > > > > > > > > > > think about
> > > > > > > > > > > > whether it's better to check that in libvirt or qemu 
> > > > > > > > > > > > (like a device magic
> > > > > > > > > > > > along with verion ?).
> > > > > > > > > >
> > > > > > > > > > We must keep the hardware generation is the same with one 
> > > > > > > > > > POD of public cloud
> > > > > > > > > > providers. But we still think about the live migration 
> > > > > > > > > > between from the the lower
> > > > > > > > > > generation of hardware migrated to the higher generation.   
> > > > > > > > > >  
> > > > > > > > >
> > > > > > > > > Agreed, lower->higher is the one direction that might make 
> > > > > > > > > sense to
> > > > > > > > > support.
> > > > > > > > >
> > > > > > > > > But regardless of that, I think we need to make sure that 
> > > > > > > > > incompatible
> > > > > > > > > devices/versions fail directly instead of failing in a 
> > > > > > > > > subtle, hard to
> > > > > > > > > debug way. Might be useful to do some initial sanity checks 
> > > > > > > > > in libvirt
> > > > > > > > > as well.
> > > > > > > > >
> > > > > > > > > How easy is it to obtain that information in a form that can 
> > > > > > > > > be
> > > > > > > > > consumed by higher layers? Can we find out the device type at 
> > > > > > > > > least?
> > > > > > > > > What about some kind of revision?
> > > > > > > > hi Alex and Cornelia
> > > > > > > > for device compatibility, do you think it's a good idea to use 
> > > > > > > > "version"
> > > > > > > > and "device version" fields?
> > > > > > > >
> > > > > > > > version field: identify live migration interface's version. it 
> > > > > > > > can have a
> > > > > > > > sort of backward compatibility, like target machine's version 
> > > > > > > > >= source
> > > > > > > > machine's version. something like that.
> > > > > >
> > > > > > Don't we essentially already have this via the device specific 
> > > > > > region?
> > > > > > The struct vfio_info_cap_header includes id and version fields, so 
> > > > > > we
> > > > > > can declare a migration id 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-28 Thread Zhao Yan
On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> On Thu, 28 Mar 2019 10:21:38 +0100
> Erik Skultety  wrote:
> 
> > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> > > hi Alex and Dave,
> > > Thanks for your replies.
> > > Please see my comments inline.
> > >
> > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:  
> > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > "Dr. David Alan Gilbert"  wrote:
> > > >  
> > > > > * Zhao Yan (yan.y.z...@intel.com) wrote:  
> > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > > > > >   b) How do we detect if we're migrating from/to the 
> > > > > > > > > > > wrong device or
> > > > > > > > > > > version of device?  Or say to a device with older 
> > > > > > > > > > > firmware or perhaps
> > > > > > > > > > > a device that has less device memory ?  
> > > > > > > > > > Actually it's still an open for VFIO migration. Need to 
> > > > > > > > > > think about
> > > > > > > > > > whether it's better to check that in libvirt or qemu (like 
> > > > > > > > > > a device magic
> > > > > > > > > > along with verion ?).  
> > > > > > > >
> > > > > > > > We must keep the hardware generation is the same with one POD 
> > > > > > > > of public cloud
> > > > > > > > providers. But we still think about the live migration between 
> > > > > > > > from the the lower
> > > > > > > > generation of hardware migrated to the higher generation.  
> > > > > > >
> > > > > > > Agreed, lower->higher is the one direction that might make sense 
> > > > > > > to
> > > > > > > support.
> > > > > > >
> > > > > > > But regardless of that, I think we need to make sure that 
> > > > > > > incompatible
> > > > > > > devices/versions fail directly instead of failing in a subtle, 
> > > > > > > hard to
> > > > > > > debug way. Might be useful to do some initial sanity checks in 
> > > > > > > libvirt
> > > > > > > as well.
> > > > > > >
> > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > consumed by higher layers? Can we find out the device type at 
> > > > > > > least?
> > > > > > > What about some kind of revision?  
> > > > > > hi Alex and Cornelia
> > > > > > for device compatibility, do you think it's a good idea to use 
> > > > > > "version"
> > > > > > and "device version" fields?
> > > > > >
> > > > > > version field: identify live migration interface's version. it can 
> > > > > > have a
> > > > > > sort of backward compatibility, like target machine's version >= 
> > > > > > source
> > > > > > machine's version. something like that.  
> > > >
> > > > Don't we essentially already have this via the device specific region?
> > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > can declare a migration id and increment the version for any
> > > > incompatible changes to the protocol.  
> > > yes, good idea!
> > > so, what about declaring below new cap?
> > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > struct vfio_region_info_cap_migration {
> > > struct vfio_info_cap_header header;
> > > __u32 device_version_len;
> > > __u8  device_version[];
> > > };
> 
> I'm not sure why we need a new region for everything, it seems this
> could fit within the protocol of a single region.  This could simply be
> a new action to retrieve the version where the protocol would report
> the number of bytes available, just like the migration stream itself.
so, to get version of VFIO live migration device state interface (simply
call it migration interface?),
a new cap looks like this:
#define VFIO_REGION_INFO_CAP_MIGRATION 4
it contains struct vfio_i

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-28 Thread Zhao Yan
hi Alex and Dave,
Thanks for your replies.
Please see my comments inline.

On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> On Wed, 27 Mar 2019 20:18:54 +
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Zhao Yan (yan.y.z...@intel.com) wrote:
> > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > >   b) How do we detect if we're migrating from/to the wrong 
> > > > > > > > device or
> > > > > > > > version of device?  Or say to a device with older firmware or 
> > > > > > > > perhaps
> > > > > > > > a device that has less device memory ?
> > > > > > > Actually it's still an open for VFIO migration. Need to think 
> > > > > > > about
> > > > > > > whether it's better to check that in libvirt or qemu (like a 
> > > > > > > device magic
> > > > > > > along with verion ?).
> > > > > 
> > > > > We must keep the hardware generation is the same with one POD of 
> > > > > public cloud
> > > > > providers. But we still think about the live migration between from 
> > > > > the the lower
> > > > > generation of hardware migrated to the higher generation.  
> > > > 
> > > > Agreed, lower->higher is the one direction that might make sense to
> > > > support.
> > > > 
> > > > But regardless of that, I think we need to make sure that incompatible
> > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > as well.
> > > > 
> > > > How easy is it to obtain that information in a form that can be
> > > > consumed by higher layers? Can we find out the device type at least?
> > > > What about some kind of revision?  
> > > hi Alex and Cornelia
> > > for device compatibility, do you think it's a good idea to use "version"
> > > and "device version" fields?
> > > 
> > > version field: identify live migration interface's version. it can have a
> > > sort of backward compatibility, like target machine's version >= source
> > > machine's version. something like that.
> 
> Don't we essentially already have this via the device specific region?
> The struct vfio_info_cap_header includes id and version fields, so we
> can declare a migration id and increment the version for any
> incompatible changes to the protocol.
yes, good idea!
so, what about declaring below new cap? 
#define VFIO_REGION_INFO_CAP_MIGRATION 4
struct vfio_region_info_cap_migration {
struct vfio_info_cap_header header;
__u32 device_version_len;
__u8  device_version[];
};


> > > 
> > > device_version field consists two parts:
> > > 1. vendor id : it takes 32 bits. e.g. 0x8086.
> 
> Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> suggest we use a bit to flag it as such so we can reserve that portion
> of the 32bit address space.  See for example:
> 
> #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE(1 << 31)
> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK(0x)
> 
> For vendor specific regions.
Yes, use PCI vendor ID.
you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
to identify it's a PCI ID.
Thanks for pointing it out. 
But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
why it's 0x? I searched QEMU and kernel code and did not find anywhere
uses it.


> > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > thinks can identify a source device. e.g. pciid + mdev type.
> > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > 
> > > 
> > > struct vfio_device_state_ctl {
> > >  __u32 version;/* ro */
> > >  __u8 device_version[MAX_DEVICE_VERSION_LEN];/* ro */
> > >  struct {
> > >   __u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > >   ...
> > >  }data;
> > >  ...
> > >  };
> 
> We have a buffer area where we can read and write data from the vendor
> driver, why would we have this IS_COMPATIBLE protocol to write the
> device version string but use a static fixed length version string in
> the control header to read it?  IOW, let's use GET_VERSION,
> CHECK_VERSION actions 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-27 Thread Zhao Yan
On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > a device that has less device memory ?  
> > > > Actually it's still an open for VFIO migration. Need to think about
> > > > whether it's better to check that in libvirt or qemu (like a device 
> > > > magic
> > > > along with verion ?).  
> > 
> > We must keep the hardware generation is the same with one POD of public 
> > cloud
> > providers. But we still think about the live migration between from the the 
> > lower
> > generation of hardware migrated to the higher generation.
> 
> Agreed, lower->higher is the one direction that might make sense to
> support.
> 
> But regardless of that, I think we need to make sure that incompatible
> devices/versions fail directly instead of failing in a subtle, hard to
> debug way. Might be useful to do some initial sanity checks in libvirt
> as well.
> 
> How easy is it to obtain that information in a form that can be
> consumed by higher layers? Can we find out the device type at least?
> What about some kind of revision?
hi Alex and Cornelia
for device compatibility, do you think it's a good idea to use "version"
and "device version" fields?

version field: identify live migration interface's version. it can have a
sort of backward compatibility, like target machine's version >= source
machine's version. something like that.

device_version field consists two parts:
1. vendor id : it takes 32 bits. e.g. 0x8086.
2. vendor proprietary string: it can be any string that a vendor driver
thinks can identify a source device. e.g. pciid + mdev type.
"vendor id" is to avoid overlap of "vendor proprietary string".


struct vfio_device_state_ctl {
 __u32 version;/* ro */
 __u8 device_version[MAX_DEVICE_VERSION_LEN];/* ro */
 struct {
__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
...
 }data;
 ...
 };

Then, an action IS_COMPATIBLE is added to check device compatibility.

The flow to figure out whether a source device is migratable to target device
is like that:
1. in source side's .save_setup, save source device's device_version string
2. in target side's .load_state, load source device's device version string
and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
to check whether the source device is compatible to it.

The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
maintain a compatibility table and decide whether source device is compatible
to target device according to its proprietary table.
In device_version string, vendor driver only has to describe the source
device as elaborately as possible and resorts to vendor driver in target side
to figure out whether they are compatible.

Thanks
Yan






Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-17 Thread Zhao Yan
On Mon, Mar 18, 2019 at 11:09:04AM +0800, Alex Williamson wrote:
> On Sun, 17 Mar 2019 22:51:27 -0400
> Zhao Yan  wrote:
> 
> > On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> > > On Thu, 14 Mar 2019 19:05:06 -0400
> > > Zhao Yan  wrote:
> > >   
> > > > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:  
> > > > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > > > Zhao Yan  wrote:
> > > > > 
> > > > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > > > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > > > Zhao Yan  wrote:
> > > > > > >   
> > > > > > > > hi Alex
> > > > > > > > Any comments to the sequence below?
> > > > > > > > 
> > > > > > > > Actaully we have some concerns and suggestions to 
> > > > > > > > userspace-opaque migration
> > > > > > > > data.
> > > > > > > > 
> > > > > > > > 1. if data is opaque to userspace, kernel interface must be 
> > > > > > > > tightly bound to
> > > > > > > > migration. 
> > > > > > > >e.g. vendor driver has to know state (running + not logging) 
> > > > > > > > should not
> > > > > > > >return any data, and state (running + logging) should return 
> > > > > > > > whole
> > > > > > > >snapshot first and dirty later. it also has to know qemu 
> > > > > > > > migration will
> > > > > > > >not call GET_BUFFER in state (running + not logging), 
> > > > > > > > otherwise, it has
> > > > > > > >to adjust its behavior.  
> > > > > > > 
> > > > > > > This all just sounds like defining the protocol we expect with the
> > > > > > > interface.  For instance if we define a session as beginning when
> > > > > > > logging is enabled and ending when the device is stopped and the
> > > > > > > interface reports no more data is available, then we can state 
> > > > > > > that any
> > > > > > > partial accumulation of data is incomplete relative to migration. 
> > > > > > >  If
> > > > > > > userspace wants to initiate a new migration stream, they can 
> > > > > > > simply
> > > > > > > toggle logging.  How the vendor driver provides the data during 
> > > > > > > the
> > > > > > > session is not defined, but beginning the session with a snapshot
> > > > > > > followed by repeated iterations of dirtied data is certainly a 
> > > > > > > valid
> > > > > > > approach.
> > > > > > >   
> > > > > > > > 2. vendor driver cannot ensure userspace get all the data it 
> > > > > > > > intends to
> > > > > > > > save in pre-copy phase.
> > > > > > > >   e.g. in stop-and-copy phase, vendor driver has to first check 
> > > > > > > > and send
> > > > > > > >   data in previous phase.  
> > > > > > > 
> > > > > > > First, I don't think the device has control of when QEMU switches 
> > > > > > > from
> > > > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > > > transition at any point.  However, it seems a simply data 
> > > > > > > available
> > > > > > > counter provides an indication of when it might be optimal to 
> > > > > > > make such
> > > > > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > > > > available data counter would indicate a large value, the entire 
> > > > > > > initial
> > > > > > > snapshot of the device.  As the migration continues and pages are
> > > > > > > dirtied, the device would reach a steady state amount of data
> > > > > > > available, depending on the guest activity.  This could indicate 
> > > > > > > to the
> > > > > > > user to stop the device.  T

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-17 Thread Zhao Yan
On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> On Thu, 14 Mar 2019 19:05:06 -0400
> Zhao Yan  wrote:
> 
> > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > Zhao Yan  wrote:
> > >   
> > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:  
> > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > Zhao Yan  wrote:
> > > > > 
> > > > > > hi Alex
> > > > > > Any comments to the sequence below?
> > > > > > 
> > > > > > Actaully we have some concerns and suggestions to userspace-opaque 
> > > > > > migration
> > > > > > data.
> > > > > > 
> > > > > > 1. if data is opaque to userspace, kernel interface must be tightly 
> > > > > > bound to
> > > > > > migration. 
> > > > > >e.g. vendor driver has to know state (running + not logging) 
> > > > > > should not
> > > > > >return any data, and state (running + logging) should return 
> > > > > > whole
> > > > > >snapshot first and dirty later. it also has to know qemu 
> > > > > > migration will
> > > > > >not call GET_BUFFER in state (running + not logging), otherwise, 
> > > > > > it has
> > > > > >to adjust its behavior.
> > > > > 
> > > > > This all just sounds like defining the protocol we expect with the
> > > > > interface.  For instance if we define a session as beginning when
> > > > > logging is enabled and ending when the device is stopped and the
> > > > > interface reports no more data is available, then we can state that 
> > > > > any
> > > > > partial accumulation of data is incomplete relative to migration.  If
> > > > > userspace wants to initiate a new migration stream, they can simply
> > > > > toggle logging.  How the vendor driver provides the data during the
> > > > > session is not defined, but beginning the session with a snapshot
> > > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > > approach.
> > > > > 
> > > > > > 2. vendor driver cannot ensure userspace get all the data it 
> > > > > > intends to
> > > > > > save in pre-copy phase.
> > > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and 
> > > > > > send
> > > > > >   data in previous phase.
> > > > > 
> > > > > First, I don't think the device has control of when QEMU switches from
> > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > transition at any point.  However, it seems a simply data available
> > > > > counter provides an indication of when it might be optimal to make 
> > > > > such
> > > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > > available data counter would indicate a large value, the entire 
> > > > > initial
> > > > > snapshot of the device.  As the migration continues and pages are
> > > > > dirtied, the device would reach a steady state amount of data
> > > > > available, depending on the guest activity.  This could indicate to 
> > > > > the
> > > > > user to stop the device.  The migration stream would not be considered
> > > > > completed until the available data counter reaches zero while the
> > > > > device is in the stopped|logging state.
> > > > > 
> > > > > > 3. if all the sequence is tightly bound to live migration, can we 
> > > > > > remove the
> > > > > > logging state? what about adding two states migrate-in and 
> > > > > > migrate-out?
> > > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > > >migrate-out is for source side when migration starts. together 
> > > > > > with
> > > > > >state running and stopped, it can substitute state logging.
> > > > > >migrate-in is for target side.
> > > > > 
> > > > > In fact, Kirti's implementation specifies a data direction, but I 
> > > 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-14 Thread Zhao Yan
On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> On Wed, 13 Mar 2019 21:12:22 -0400
> Zhao Yan  wrote:
> 
> > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > Zhao Yan  wrote:
> > >   
> > > > hi Alex
> > > > Any comments to the sequence below?
> > > > 
> > > > Actaully we have some concerns and suggestions to userspace-opaque 
> > > > migration
> > > > data.
> > > > 
> > > > 1. if data is opaque to userspace, kernel interface must be tightly 
> > > > bound to
> > > > migration. 
> > > >e.g. vendor driver has to know state (running + not logging) should 
> > > > not
> > > >return any data, and state (running + logging) should return whole
> > > >snapshot first and dirty later. it also has to know qemu migration 
> > > > will
> > > >not call GET_BUFFER in state (running + not logging), otherwise, it 
> > > > has
> > > >to adjust its behavior.  
> > > 
> > > This all just sounds like defining the protocol we expect with the
> > > interface.  For instance if we define a session as beginning when
> > > logging is enabled and ending when the device is stopped and the
> > > interface reports no more data is available, then we can state that any
> > > partial accumulation of data is incomplete relative to migration.  If
> > > userspace wants to initiate a new migration stream, they can simply
> > > toggle logging.  How the vendor driver provides the data during the
> > > session is not defined, but beginning the session with a snapshot
> > > followed by repeated iterations of dirtied data is certainly a valid
> > > approach.
> > >   
> > > > 2. vendor driver cannot ensure userspace get all the data it intends to
> > > > save in pre-copy phase.
> > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > >   data in previous phase.  
> > > 
> > > First, I don't think the device has control of when QEMU switches from
> > > pre-copy to stop-and-copy, the protocol needs to support that
> > > transition at any point.  However, it seems a simply data available
> > > counter provides an indication of when it might be optimal to make such
> > > a transition.  If a vendor driver follows a scheme as above, the
> > > available data counter would indicate a large value, the entire initial
> > > snapshot of the device.  As the migration continues and pages are
> > > dirtied, the device would reach a steady state amount of data
> > > available, depending on the guest activity.  This could indicate to the
> > > user to stop the device.  The migration stream would not be considered
> > > completed until the available data counter reaches zero while the
> > > device is in the stopped|logging state.
> > >   
> > > > 3. if all the sequence is tightly bound to live migration, can we 
> > > > remove the
> > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > >migrate-out is for source side when migration starts. together with
> > > >state running and stopped, it can substitute state logging.
> > > >migrate-in is for target side.  
> > > 
> > > In fact, Kirti's implementation specifies a data direction, but I think
> > > we still need logging to indicate sessions.  I'd also assume that
> > > logging implies some overhead for the vendor driver.
> > >  
> > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > migrate-out are more universal againt hardware requirement changes.
> > 
> > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:  
> > > > > hi Alex
> > > > > thanks for your reply.
> > > > > 
> > > > > So, if we choose migration data to be userspace opaque, do you think 
> > > > > below
> > > > > sequence is the right behavior for vendor driver to follow:
> > > > > 
> > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER 
> > > > > to
> > > > > vendor driver,  vendor driver should reject and return 0.  
> > > 
> > > What would this state mean otherwise?  If we're not logging then it

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-13 Thread Zhao Yan
On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> On Tue, 12 Mar 2019 21:13:01 -0400
> Zhao Yan  wrote:
> 
> > hi Alex
> > Any comments to the sequence below?
> > 
> > Actaully we have some concerns and suggestions to userspace-opaque migration
> > data.
> > 
> > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > migration. 
> >e.g. vendor driver has to know state (running + not logging) should not
> >return any data, and state (running + logging) should return whole
> >snapshot first and dirty later. it also has to know qemu migration will
> >not call GET_BUFFER in state (running + not logging), otherwise, it has
> >to adjust its behavior.
> 
> This all just sounds like defining the protocol we expect with the
> interface.  For instance if we define a session as beginning when
> logging is enabled and ending when the device is stopped and the
> interface reports no more data is available, then we can state that any
> partial accumulation of data is incomplete relative to migration.  If
> userspace wants to initiate a new migration stream, they can simply
> toggle logging.  How the vendor driver provides the data during the
> session is not defined, but beginning the session with a snapshot
> followed by repeated iterations of dirtied data is certainly a valid
> approach.
> 
> > 2. vendor driver cannot ensure userspace get all the data it intends to
> > save in pre-copy phase.
> >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> >   data in previous phase.
> 
> First, I don't think the device has control of when QEMU switches from
> pre-copy to stop-and-copy, the protocol needs to support that
> transition at any point.  However, it seems a simply data available
> counter provides an indication of when it might be optimal to make such
> a transition.  If a vendor driver follows a scheme as above, the
> available data counter would indicate a large value, the entire initial
> snapshot of the device.  As the migration continues and pages are
> dirtied, the device would reach a steady state amount of data
> available, depending on the guest activity.  This could indicate to the
> user to stop the device.  The migration stream would not be considered
> completed until the available data counter reaches zero while the
> device is in the stopped|logging state.
> 
> > 3. if all the sequence is tightly bound to live migration, can we remove the
> > logging state? what about adding two states migrate-in and migrate-out?
> > so there are four states: running, stopped, migrate-in, migrate-out.
> >migrate-out is for source side when migration starts. together with
> >state running and stopped, it can substitute state logging.
> >migrate-in is for target side.
> 
> In fact, Kirti's implementation specifies a data direction, but I think
> we still need logging to indicate sessions.  I'd also assume that
> logging implies some overhead for the vendor driver.
>
ok. If you prefer logging, I'm ok with it. just found migrate-in and
migrate-out are more universal againt hardware requirement changes.

> > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> > > hi Alex
> > > thanks for your reply.
> > > 
> > > So, if we choose migration data to be userspace opaque, do you think below
> > > sequence is the right behavior for vendor driver to follow:
> > > 
> > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > vendor driver,  vendor driver should reject and return 0.
> 
> What would this state mean otherwise?  If we're not logging then it
> should not be expected that we can construct dirtied data from a
> previous read of the state before logging was enabled (it would be
> outside of the "session").  So at best this is an incomplete segment of
> the initial snapshot of the device, but that presumes how the vendor
> driver constructs the data.  I wouldn't necessarily mandate the vendor
> driver reject it, but I think we should consider it undefined and
> vendor specific relative to the migration interface.
> 
> > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > driver,
> > >a. vendor driver shoud first query a whole snapshot of device memory
> > >(let's use this term to represent device's standalone memory for now),
> > >b. vendor driver returns a chunk of data just queried to userspace,
> > >while recording current pos in data.
> > >c. vendor driver finds all data just queried is finished transmitting 
> > > to
&

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-12 Thread Zhao Yan
hi Alex
Any comments to the sequence below?

Actaully we have some concerns and suggestions to userspace-opaque migration
data.

1. if data is opaque to userspace, kernel interface must be tightly bound to
migration. 
   e.g. vendor driver has to know state (running + not logging) should not
   return any data, and state (running + logging) should return whole
   snapshot first and dirty later. it also has to know qemu migration will
   not call GET_BUFFER in state (running + not logging), otherwise, it has
   to adjust its behavior.

2. vendor driver cannot ensure userspace get all the data it intends to
save in pre-copy phase.
  e.g. in stop-and-copy phase, vendor driver has to first check and send
  data in previous phase.
 

3. if all the sequence is tightly bound to live migration, can we remove the
logging state? what about adding two states migrate-in and migrate-out?
so there are four states: running, stopped, migrate-in, migrate-out.
   migrate-out is for source side when migration starts. together with
   state running and stopped, it can substitute state logging.
   migrate-in is for target side.


Thanks
Yan

On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> hi Alex
> thanks for your reply.
> 
> So, if we choose migration data to be userspace opaque, do you think below
> sequence is the right behavior for vendor driver to follow:
> 
> 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> vendor driver,  vendor driver should reject and return 0.
> 
> 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> driver,
>a. vendor driver shoud first query a whole snapshot of device memory
>(let's use this term to represent device's standalone memory for now),
>b. vendor driver returns a chunk of data just queried to userspace,
>while recording current pos in data.
>c. vendor driver finds all data just queried is finished transmitting to
>userspace, and queries only dirty data in device memory now.
>d. vendor driver returns a chunk of data just quered (this time is dirty
>data )to userspace while recording current pos in data
>e. if all data is transmited to usespace and still GET_BUFFERs come from
>userspace, vendor driver starts another round of dirty data query.
> 
> 3. if LOGGING state is unset then, and userpace calls GET_BUFFER to vendor
> driver,
>a. if vendor driver finds there's previously untransmitted data, returns
>them until all transmitted.
>b. vendor driver then queries dirty data again and transmits them.
>c. at last, vendor driver queris device config data (which has to be
>queried at last and sent once) and transmits them.
> 
> 
> for the 1 bullet, if LOGGING state is firstly set and migration aborts
> then,  vendor driver has to be able to detect that condition. so seemingly,
> vendor driver has to know more qemu's migration state, like migration
> called and failed. Do you think that's acceptable?
> 
> 
> Thanks
> Yan
> 
> 



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-11 Thread Zhao Yan
hi Alex
thanks for your reply.

So, if we choose migration data to be userspace opaque, do you think below
sequence is the right behavior for vendor driver to follow:

1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
vendor driver,  vendor driver should reject and return 0.

2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
driver,
   a. vendor driver shoud first query a whole snapshot of device memory
   (let's use this term to represent device's standalone memory for now),
   b. vendor driver returns a chunk of data just queried to userspace,
   while recording current pos in data.
   c. vendor driver finds all data just queried is finished transmitting to
   userspace, and queries only dirty data in device memory now.
   d. vendor driver returns a chunk of data just quered (this time is dirty
   data )to userspace while recording current pos in data
   e. if all data is transmited to usespace and still GET_BUFFERs come from
   userspace, vendor driver starts another round of dirty data query.

3. if LOGGING state is unset then, and userpace calls GET_BUFFER to vendor
driver,
   a. if vendor driver finds there's previously untransmitted data, returns
   them until all transmitted.
   b. vendor driver then queries dirty data again and transmits them.
   c. at last, vendor driver queris device config data (which has to be
   queried at last and sent once) and transmits them.


for the 1 bullet, if LOGGING state is firstly set and migration aborts
then,  vendor driver has to be able to detect that condition. so seemingly,
vendor driver has to know more qemu's migration state, like migration
called and failed. Do you think that's acceptable?


Thanks
Yan





Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-05 Thread Zhao Yan
hi Alex
we still have some opens as below. could you kindly help review on that? :)

Thanks
Yan

On Mon, Feb 25, 2019 at 10:22:56AM +0800, Zhao Yan wrote:
> On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> > Hi Yan,
> > 
> > Thanks for working on this!
> > 
> > On Tue, 19 Feb 2019 16:50:54 +0800
> > Yan Zhao  wrote:
> > 
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > > 
> > > Device Data
> > > ---
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > > Every device is supposed to possess device config data.
> > >   Usually device config's size is small (no big than 10M), and it
> > 
> > I'm not sure how we can really impose a limit here, it is what it is
> > for a device.  A smaller state is obviously desirable to reduce
> > downtime, but some devices could have very large states.
> > 
> > > needs to be loaded in certain strict order.
> > > Therefore, device config only needs to be saved/loaded in
> > > stop-and-copy phase.
> > > The data of device config is held in device config region.
> > > Size of device config data is smaller than or equal to that of
> > > device config region.
> > 
> > So the intention here is that this is the last data read from the
> > device and it's done in one pass, so the region needs to be large
> > enough to expose all config data at once.  On restore it's the last
> > data written before switching the device to the run state.
> > 
> > > 
> > > Device Memory: device's internal memory, standalone and outside system
> > 
> > s/system/VM/
> > 
> > > memory. It is usually very big.
> > 
> > Or it doesn't exist.  Not sure we should be setting expectations since
> > it will vary per device.
> > 
> > > This kind of data needs to be saved / loaded in pre-copy and
> > > stop-and-copy phase.
> > > The data of device memory is held in device memory region.
> > > Size of devie memory is usually larger than that of device
> > > memory region. qemu needs to save/load it in chunks of size of
> > > device memory region.
> > > Not all device has device memory. Like IGD only uses system 
> > > memory.
> > 
> > It seems a little gratuitous to me that this is a separate region or
> > that this data is handled separately.  All of this data is opaque to
> > QEMU, so why do we need to separate it?
> hi Alex,
> as the device state interfaces are provided by kernel, it is expected to
> meet as general needs as possible. So, do you think there are such use
> cases from user space that user space knows well of the device, and
> it wants kernel to return desired data back to it.
> E.g. It just wants to get whole device config data including all mmios,
> page tables, pci config data...
> or, It just wants to get current device memory snapshot, not including any
> dirty data.
> Or, It just needs the dirty pages in device memory or system memory.
> With all this accurate query, quite a lot of useful features can be
> developped in user space.
> 
> If all of this data is opaque to user app, seems the only use case is
> for live migration.
> 
> From another aspect, if the final solution is to let the data opaque to
> user space, like what NV did, kernel side's implementation will be more
> complicated, and actually a little challenge to vendor driver.
> in that case, in pre-copy phase,
> 1. in not LOGGING state, vendor driver first returns full data including
> full device memory snapshot
> 2. user space reads some data (you can't expect it to finish reading all
> data)
> 3. then userspace set the device state to LOGGING to start dirty data
> logging
> 4. vendor driver starts dirty data logging, and appends the dirty data to
> the tail of remaining unread full data and increase the pending data size?
> 5. user space keeps reading data.
> 6. vendor driver keeps appending new dirty data to the tail of remaining
> unread full data/dirty data and increase the pending data size?
> 
> in stop-and-copy phase

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-24 Thread Zhao Yan
On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> Hi Yan,
> 
> Thanks for working on this!
> 
> On Tue, 19 Feb 2019 16:50:54 +0800
> Yan Zhao  wrote:
> 
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> > 
> > Device Data
> > ---
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> > Every device is supposed to possess device config data.
> > Usually device config's size is small (no big than 10M), and it
> 
> I'm not sure how we can really impose a limit here, it is what it is
> for a device.  A smaller state is obviously desirable to reduce
> downtime, but some devices could have very large states.
> 
> > needs to be loaded in certain strict order.
> > Therefore, device config only needs to be saved/loaded in
> > stop-and-copy phase.
> > The data of device config is held in device config region.
> > Size of device config data is smaller than or equal to that of
> > device config region.
> 
> So the intention here is that this is the last data read from the
> device and it's done in one pass, so the region needs to be large
> enough to expose all config data at once.  On restore it's the last
> data written before switching the device to the run state.
> 
> > 
> > Device Memory: device's internal memory, standalone and outside system
> 
> s/system/VM/
> 
> > memory. It is usually very big.
> 
> Or it doesn't exist.  Not sure we should be setting expectations since
> it will vary per device.
> 
> > This kind of data needs to be saved / loaded in pre-copy and
> > stop-and-copy phase.
> > The data of device memory is held in device memory region.
> > Size of devie memory is usually larger than that of device
> > memory region. qemu needs to save/load it in chunks of size of
> > device memory region.
> > Not all device has device memory. Like IGD only uses system memory.
> 
> It seems a little gratuitous to me that this is a separate region or
> that this data is handled separately.  All of this data is opaque to
> QEMU, so why do we need to separate it?
hi Alex,
as the device state interfaces are provided by kernel, it is expected to
meet as general needs as possible. So, do you think there are such use
cases from user space that user space knows well of the device, and
it wants kernel to return desired data back to it.
E.g. It just wants to get whole device config data including all mmios,
page tables, pci config data...
or, It just wants to get current device memory snapshot, not including any
dirty data.
Or, It just needs the dirty pages in device memory or system memory.
With all this accurate query, quite a lot of useful features can be
developped in user space.

If all of this data is opaque to user app, seems the only use case is
for live migration.

>From another aspect, if the final solution is to let the data opaque to
user space, like what NV did, kernel side's implementation will be more
complicated, and actually a little challenge to vendor driver.
in that case, in pre-copy phase,
1. in not LOGGING state, vendor driver first returns full data including
full device memory snapshot
2. user space reads some data (you can't expect it to finish reading all
data)
3. then userspace set the device state to LOGGING to start dirty data
logging
4. vendor driver starts dirty data logging, and appends the dirty data to
the tail of remaining unread full data and increase the pending data size?
5. user space keeps reading data.
6. vendor driver keeps appending new dirty data to the tail of remaining
unread full data/dirty data and increase the pending data size?

in stop-and-copy phase
1. user space sets device state to exit LOGGING state,
2. vendor driver stops data logging. it has to append device config
   data at the tail of remaining dirty data unread by userspace.

during this flow, when vendor driver should get dirty data? just keeps
logging and appends to tail? how to ensure dirty data is refresh new before
LOGGING state exit? how does vendor driver know whether certain dirty data
is copied or not?

I've no idea how NVidia handle this problem, and they don't open their
kernel side code. 
just feel it's a bit hard for other vendor drivers to follow:)

> > System memory dirty pages: If a device produces dirty pages in system
> > memory, it is able to get dirty bitmap for certain range of system
> > memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
On Thu, Feb 21, 2019 at 03:16:45AM +, Gonglei (Arei) wrote:
> 
> 
> 
> > -Original Message-
> > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > Sent: Thursday, February 21, 2019 10:05 AM
> > To: Gonglei (Arei) 
> > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > k...@vger.kernel.org
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > > >
> > > > > 5) About log sync, why not register log_global_start/stop in
> > > > vfio_memory_listener?
> > > > >
> > > > >
> > > > seems log_global_start/stop cannot be iterately called in pre-copy 
> > > > phase?
> > > > for dirty pages in system memory, it's better to transfer dirty data
> > > > iteratively to reduce down time, right?
> > > >
> > >
> > > We just need invoking only once for start and stop logging. Why we need to
> > call
> > > them literately? See memory_listener of vhost.
> > >
> > the dirty pages in system memory produces by device is incremental.
> > if it can be got iteratively, the dirty pages in stop-and-copy phase can be
> > minimal.
> > :)
> > 
> I mean starting or stopping the capability of logging, not log sync. 
> 
> We register the below callbacks:
> 
> .log_sync = vfio_log_sync,
> .log_global_start = vfio_log_global_start,
> .log_global_stop = vfio_log_global_stop,
>
.log_global_start is also a good point to notify logging state.
But if notifying in .save_setup handler, we can do fine-grained
control of when to notify of logging starting together with get_buffer
operation.
Is there any special benifit by registering to .log_global_start/stop?


> Regards,
> -Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
On Thu, Feb 21, 2019 at 03:33:24AM +, Gonglei (Arei) wrote:
> 
> > -Original Message-
> > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > Sent: Thursday, February 21, 2019 9:59 AM
> > To: Gonglei (Arei) 
> > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > k...@vger.kernel.org
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > On Thu, Feb 21, 2019 at 01:35:43AM +, Gonglei (Arei) wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > > > Sent: Thursday, February 21, 2019 8:25 AM
> > > > To: Gonglei (Arei) 
> > > > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > > > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > > coh...@redhat.com; shuangtai@alibaba-inc.com;
> > dgilb...@redhat.com;
> > > > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > > > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > > > jonathan.dav...@nutanix.com; changpeng@intel.com;
> > ken@amd.com;
> > > > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > > > k...@vger.kernel.org
> > > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > > >
> > > > On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> > > > > Hi yan,
> > > > >
> > > > > Thanks for your work.
> > > > >
> > > > > I have some suggestions or questions:
> > > > >
> > > > > 1) Would you add msix mode support,? if not, pls add a check in
> > > > vfio_pci_save_config(), likes Nvidia's solution.
> > > > ok.
> > > >
> > > > > 2) We should start vfio devices before vcpu resumes, so we can't rely 
> > > > > on
> > vm
> > > > start change handler completely.
> > > > vfio devices is by default set to running state.
> > > > In the target machine, its state transition flow is 
> > > > running->stop->running.
> > >
> > > That's confusing. We should start vfio devices after vfio_load_state,
> > otherwise
> > > how can you keep the devices' information are the same between source side
> > > and destination side?
> > >
> > so, your meaning is to set device state to running in the first call to
> > vfio_load_state?
> > 
> No, it should start devices after vfio_load_state and before vcpu resuming.
>

What about set device state to running in load_cleanup handler ?

> > > > so, maybe you can ignore the stop notification in kernel?
> > > > > 3) We'd better support live migration rollback since have many failure
> > > > scenarios,
> > > > >  register a migration notifier is a good choice.
> > > > I think this patchset can also handle the failure case well.
> > > > if migration failure or cancelling happens,
> > > > in cleanup handler, LOGGING state is cleared. device state(running or
> > > > stopped) keeps as it is).
> > >
> > > IIRC there're many failure paths don't calling cleanup handler.
> > >
> > could you take an example?
> 
> Never mind, that's another bug I think. 
> 
> > > > then,
> > > > if vm switches back to running, device state will be set to running;
> > > > if vm stayes at stopped state, device state is also stopped (it has no
> > > > meaning to let it in running state).
> > > > Do you think so ?
> > > >
> > > IF the underlying state machine is complicated,
> > > We should tell the canceling state to vendor driver proactively.
> > >
> > That makes sense.
> > 
> > > > > 4) Four memory region for live migration is too complicated IMHO.
> > > > one big region requires the sub-regions well padded.
> > > > like for the first control fields, they hav

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
> > 
> > > 5) About log sync, why not register log_global_start/stop in
> > vfio_memory_listener?
> > >
> > >
> > seems log_global_start/stop cannot be iterately called in pre-copy phase?
> > for dirty pages in system memory, it's better to transfer dirty data
> > iteratively to reduce down time, right?
> > 
> 
> We just need invoking only once for start and stop logging. Why we need to 
> call
> them literately? See memory_listener of vhost.
>
the dirty pages in system memory produces by device is incremental.
if it can be got iteratively, the dirty pages in stop-and-copy phase can be
minimal. 
:)

> Regards,
> -Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
On Thu, Feb 21, 2019 at 01:35:43AM +, Gonglei (Arei) wrote:
> 
> 
> > -Original Message-
> > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > Sent: Thursday, February 21, 2019 8:25 AM
> > To: Gonglei (Arei) 
> > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > k...@vger.kernel.org
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> > > Hi yan,
> > >
> > > Thanks for your work.
> > >
> > > I have some suggestions or questions:
> > >
> > > 1) Would you add msix mode support,? if not, pls add a check in
> > vfio_pci_save_config(), likes Nvidia's solution.
> > ok.
> > 
> > > 2) We should start vfio devices before vcpu resumes, so we can't rely on 
> > > vm
> > start change handler completely.
> > vfio devices is by default set to running state.
> > In the target machine, its state transition flow is running->stop->running.
> 
> That's confusing. We should start vfio devices after vfio_load_state, 
> otherwise
> how can you keep the devices' information are the same between source side
> and destination side?
>
so, your meaning is to set device state to running in the first call to
vfio_load_state?

> > so, maybe you can ignore the stop notification in kernel?
> > > 3) We'd better support live migration rollback since have many failure
> > scenarios,
> > >  register a migration notifier is a good choice.
> > I think this patchset can also handle the failure case well.
> > if migration failure or cancelling happens,
> > in cleanup handler, LOGGING state is cleared. device state(running or
> > stopped) keeps as it is).
> 
> IIRC there're many failure paths don't calling cleanup handler.
>
could you take an example?
> > then,
> > if vm switches back to running, device state will be set to running;
> > if vm stayes at stopped state, device state is also stopped (it has no
> > meaning to let it in running state).
> > Do you think so ?
> > 
> IF the underlying state machine is complicated,
> We should tell the canceling state to vendor driver proactively.
> 
That makes sense.

> > > 4) Four memory region for live migration is too complicated IMHO.
> > one big region requires the sub-regions well padded.
> > like for the first control fields, they have to be padded to 4K.
> > the same for other data fields.
> > Otherwise, mmap simply fails, because the start-offset and size for mmap
> > both need to be PAGE aligned.
> > 
> But if we don't need use mmap for control filed and device state, they are 
> small basically.
> The performance is enough using pread/pwrite. 
> 
we don't mmap control fields. but if data fields going immedately after
control fields (e.g. just 64 bytes), we can't mmap data fields
successfully because its start offset is 64. Therefore control fields have
to be padded to 4k to let data fields start from 4k.
That's the drawback of one big region holding both control and data fields.

> > Also, 4 regions is clearer in my view :)
> > 
> > > 5) About log sync, why not register log_global_start/stop in
> > vfio_memory_listener?
> > >
> > >
> > seems log_global_start/stop cannot be iterately called in pre-copy phase?
> > for dirty pages in system memory, it's better to transfer dirty data
> > iteratively to reduce down time, right?
> > 
> 
> We just need invoking only once for start and stop logging. Why we need to 
> call
> them literately? See memory_listener of vhost.
> 



> Regards,
> -Gonglei



Re: [Qemu-devel] [PATCH 1/5] vfio/migration: define kernel interfaces

2019-02-20 Thread Zhao Yan
On Wed, Feb 20, 2019 at 06:08:13PM +0100, Cornelia Huck wrote:
> On Wed, 20 Feb 2019 02:36:36 -0500
> Zhao Yan  wrote:
> 
> > On Tue, Feb 19, 2019 at 02:09:18PM +0100, Cornelia Huck wrote:
> > > On Tue, 19 Feb 2019 16:52:14 +0800
> > > Yan Zhao  wrote:
> (...)
> > > > + *  Size of device config data is smaller than or equal to 
> > > > that of
> > > > + *  device config region.  
> > > 
> > > Not sure if I understand that sentence correctly... but what if a
> > > device has more config state than fits into this region? Is that
> > > supposed to be covered by the device memory region? Or is this assumed
> > > to be something so exotic that we don't need to plan for it?
> > >   
> > Device config data and device config region are all provided by vendor
> > driver, so vendor driver is always able to create a large enough device
> > config region to hold device config data.
> > So, if a device has data that are better to be saved after device stop and
> > saved/loaded in strict order, the data needs to be in device config region.
> > This kind of data is supposed to be small.
> > If the device data can be saved/loaded several times, it can also be put
> > into device memory region.
> 
> So, it is the vendor driver's decision which device information should
> go via which region? With the device config data supposed to be
> saved/loaded in one go?
Right, exactly.


> (...)
> > > > +/* version number of the device state interface */
> > > > +#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1  
> > > 
> > > Hm. Is this supposed to be backwards-compatible, should we need to bump
> > > this?
> > >  
> > currently no backwords-compatible. we can discuss on that.
> 
> It might be useful if we discover that we need some extensions. But I'm
> not sure how much work it would be.
> 
> (...)
> > > > +/*
> > > > + * DEVICE STATES
> > > > + *
> > > > + * Four states are defined for a VFIO device:
> > > > + * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
> > > > + * They can be set by writing to device_state field of
> > > > + * vfio_device_state_ctl region.  
> > > 
> > > Who controls this? Userspace?  
> > 
> > Yes. Userspace notifies vendor driver to do the state switching.
> 
> Might be good to mention this (just to make it obvious).
>
Got it. thanks

> > > > + * LOGGING state is a special state that it CANNOT exist
> > > > + * independently.  
> > > 
> > > So it's not a state, but rather a modifier?
> > >   
> > yes. or thinking LOGGING/not LOGGING as bit 1 of a device state,
> > whereas RUNNING/STOPPED is bit 0 of a device state.
> > They have to be got as a whole.
> 
> So it is (on a bit level):
> RUNNING -> 00
> STOPPED -> 01
> LOGGING/RUNNING -> 10
> LOGGING/STOPPED -> 11
> 

Yes.

> > > > + * It must be set alongside with state RUNNING or STOP, i.e,
> > > > + * RUNNING & LOGGING, STOP & LOGGING.
> > > > + * It is used for dirty data logging both for device memory
> > > > + * and system memory.
> > > > + *
> > > > + * LOGGING only impacts device/system memory. In LOGGING state, get 
> > > > buffer
> > > > + * of device memory returns dirty pages since last call; outside 
> > > > LOGGING
> > > > + * state, get buffer of device memory returns whole snapshot of device
> > > > + * memory. system memory's dirty page is only available in LOGGING 
> > > > state.
> > > > + *
> > > > + * Device config should be always accessible and return whole config 
> > > > snapshot
> > > > + * regardless of LOGGING state.
> > > > + * */
> > > > +#define VFIO_DEVICE_STATE_RUNNING 0
> > > > +#define VFIO_DEVICE_STATE_STOP 1
> > > > +#define VFIO_DEVICE_STATE_LOGGING 2
> 
> This makes it look a bit like LOGGING were an individual state, while 2
> is in reality LOGGING/RUNNING... not sure how to make that more
> obvious. Maybe (as we are dealing with a u32):
> 
> #define VFIO_DEVICE_STATE_RUNNING 0x
> #define VFIO_DEVICE_STATE_STOPPED 0x0001
> #define VFIO_DEVICE_STATE_LOGGING_RUNNING 0x0002
> #define VFIO_DEVICE_STATE_LOGGING_STOPPED 0x0003
> #define VFIO_DEVICE_STATE_LOGGING_MASK 0x0002
>
Yes, yours are better, thanks:)

> > > > +
> > > > +/* action to get data from device memory or device config
>

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
On Wed, Feb 20, 2019 at 11:01:43AM +, Dr. David Alan Gilbert wrote:
> * Zhao Yan (yan.y.z...@intel.com) wrote:
> > On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > This patchset enables VFIO devices to have live migration capability.
> > > > Currently it does not support post-copy phase.
> > > > 
> > > > It follows Alex's comments on last version of VFIO live migration 
> > > > patches,
> > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > query.
> > > 
> > > Hi,
> > >   I've sent minor comments to later patches; but some minor general
> > > comments:
> > > 
> > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > > so check when you can.
> > hi Dave
> > Thanks for this suggestion. I'll add more checks for migration streams.
> > 
> > 
> > >   b) How do we detect if we're migrating from/to the wrong device or
> > > version of device?  Or say to a device with older firmware or perhaps
> > > a device that has less device memory ?
> > Actually it's still an open for VFIO migration. Need to think about
> > whether it's better to check that in libvirt or qemu (like a device magic
> > along with verion ?).
> > This patchset is intended to settle down the main device state interfaces
> > for VFIO migration. So that we can work on that and improve it.
> > 
> > 
> > >   c) Consider using the trace_ mechanism - it's really useful to
> > > add to loops writing/reading data so that you can see when it fails.
> > > 
> > > Dave
> > >
> > Got it. many thanks~~
> > 
> > 
> > > (P.S. You have a few typo's grep your code for 'devcie', 'devie' and
> > > 'migrtion'
> > 
> > sorry :)
> 
> No problem.
> 
> Given the mails, I'm guessing you've mostly tested this on graphics
> devices?  Have you also checked with VFIO network cards?
> 
yes, I tested it on Intel's graphics devices which do not have device
memory. so the cap of device-memory is off.
I believe this patchset can work well on VFIO network cards as well,
because Gonglei once said their NIC can work well on our previous code
(i.e. device-memory cap off).


> Also see the mail I sent in reply to Kirti's series; we need to boil
> these down to one solution.
>
Maybe Kirti can merge their implementaion into the code for device-memory
cap (like in my patch 5 for device-memory).

> Dave
> 
> > > 
> > > > Device Data
> > > > ---
> > > > Device data is divided into three types: device memory, device config,
> > > > and system memory dirty pages produced by device.
> > > > 
> > > > Device config: data like MMIOs, page tables...
> > > > Every device is supposed to possess device config data.
> > > > Usually device config's size is small (no big than 10M), and it
> > > > needs to be loaded in certain strict order.
> > > > Therefore, device config only needs to be saved/loaded in
> > > > stop-and-copy phase.
> > > > The data of device config is held in device config region.
> > > > Size of device config data is smaller than or equal to that of
> > > > device config region.
> > > > 
> > > > Device Memory: device's internal memory, standalone and outside system
> > > > memory. It is usually very big.
> > > > This kind of data needs to be saved / loaded in pre-copy and
> > > > stop-and-copy phase.
> > > > The data of device memory is held in device memory region.
> > > > Size of devie memory is usually larger than that of device
> > > > memory region. qemu needs to save/load it in chunks of size of
> > > > device memory region.
> > > > Not all device has device memory. Like IGD only uses system 
> > > > memory.
> > > > 
> > > > System memory dirty pages: If a device produces dirty pages in system
> > > > memory, it is able to get dirty bitmap for certain range of 
> > > > system
> > > > memory. This dirty bitmap is queried in pre-copy and 
> > > > stop-and-copy
> > > > phase in .log_sync callback. By setting dirty bitmap in 
> > > > .log_sync
> > > > callback, dirty pages in system memo

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Zhao Yan
On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> Hi yan,
> 
> Thanks for your work.
> 
> I have some suggestions or questions:
> 
> 1) Would you add msix mode support,? if not, pls add a check in 
> vfio_pci_save_config(), likes Nvidia's solution.
ok.

> 2) We should start vfio devices before vcpu resumes, so we can't rely on vm 
> start change handler completely.
vfio devices is by default set to running state.
In the target machine, its state transition flow is running->stop->running.
so, maybe you can ignore the stop notification in kernel?
> 3) We'd better support live migration rollback since have many failure 
> scenarios,
>  register a migration notifier is a good choice.
I think this patchset can also handle the failure case well.
if migration failure or cancelling happens, 
in cleanup handler, LOGGING state is cleared. device state(running or
stopped) keeps as it is).
then,
if vm switches back to running, device state will be set to running;
if vm stayes at stopped state, device state is also stopped (it has no
meaning to let it in running state).
Do you think so ?

> 4) Four memory region for live migration is too complicated IMHO. 
one big region requires the sub-regions well padded.
like for the first control fields, they have to be padded to 4K.
the same for other data fields.
Otherwise, mmap simply fails, because the start-offset and size for mmap
both need to be PAGE aligned.

Also, 4 regions is clearer in my view :)

> 5) About log sync, why not register log_global_start/stop in 
> vfio_memory_listener?
> 
> 
seems log_global_start/stop cannot be iterately called in pre-copy phase?
for dirty pages in system memory, it's better to transfer dirty data
iteratively to reduce down time, right?


> Regards,
> -Gonglei
> 
> 
> > -Original Message-
> > From: Yan Zhao [mailto:yan.y.z...@intel.com]
> > Sent: Tuesday, February 19, 2019 4:51 PM
> > To: alex.william...@redhat.com; qemu-devel@nongnu.org
> > Cc: intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com; Gonglei (Arei)
> > ; k...@vger.kernel.org; Yan Zhao
> > 
> > Subject: [PATCH 0/5] QEMU VFIO live migration
> > 
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> > 
> > Device Data
> > ---
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> > Every device is supposed to possess device config data.
> > Usually device config's size is small (no big than 10M), and it
> > needs to be loaded in certain strict order.
> > Therefore, device config only needs to be saved/loaded in
> > stop-and-copy phase.
> > The data of device config is held in device config region.
> > Size of device config data is smaller than or equal to that of
> > device config region.
> > 
> > Device Memory: device's internal memory, standalone and outside system
> > memory. It is usually very big.
> > This kind of data needs to be saved / loaded in pre-copy and
> > stop-and-copy phase.
> > The data of device memory is held in device memory region.
> > Size of devie memory is usually larger than that of device
> > memory region. qemu needs to save/load it in chunks of size of
> > device memory region.
> > Not all device has device memory. Like IGD only uses system memory.
> > 
> > System memory dirty pages: If a device produces dirty pages in system
> > memory, it is able to get dirty bitmap for certain range of system
> > memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > callback, dirty pages in system memory will be save/loaded by ram's
> > live migration code.
> > The dirty bitmap of system memory is held in dirty bitmap region.
> > If system memory range is larger than that dirty bitmap region can
> > hold, qemu will cut it into several chunks and get dirty bitmap in
> > succession.
> > 
> > 
> > Device State Regions
> > 
> > Vendor driver is required to expose two mandatory regions and another two
> > optional regions if it plans to support 

Re: [Qemu-devel] [PATCH 5/5] vfio/migration: support device memory capability

2019-02-20 Thread Zhao Yan
On Wed, Feb 20, 2019 at 11:14:24AM +0100, Christophe de Dinechin wrote:
> 
> 
> > On 20 Feb 2019, at 08:58, Zhao Yan  wrote:
> > 
> > On Tue, Feb 19, 2019 at 03:42:36PM +0100, Christophe de Dinechin wrote:
> >> 
> >> 
> >>> On 19 Feb 2019, at 09:53, Yan Zhao  wrote:
> >>> 
> >>> If a device has device memory capability, save/load data from device 
> >>> memory
> >>> in pre-copy and stop-and-copy phases.
> >>> 
> >>> LOGGING state is set for device memory for dirty page logging:
> >>> in LOGGING state, get device memory returns whole device memory snapshot;
> >>> outside LOGGING state, get device memory returns dirty data since last get
> >>> operation.
> >>> 
> >>> Usually, device memory is very big, qemu needs to chunk it into several
> >>> pieces each with size of device memory region.
> >>> 
> >>> Signed-off-by: Yan Zhao 
> >>> Signed-off-by: Kirti Wankhede 
> >>> ---
> >>> hw/vfio/migration.c | 235 
> >>> ++--
> >>> hw/vfio/pci.h   |   1 +
> >>> 2 files changed, 231 insertions(+), 5 deletions(-)
> >>> 
> >>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >>> index 16d6395..f1e9309 100644
> >>> --- a/hw/vfio/migration.c
> >>> +++ b/hw/vfio/migration.c
> >>> @@ -203,6 +203,201 @@ static int 
> >>> vfio_load_data_device_config(VFIOPCIDevice *vdev,
> >>>return 0;
> >>> }
> >>> 
> >>> +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> >>> +{
> >>> +VFIODevice *vbasedev = >vbasedev;
> >>> +VFIORegion *region_ctl =
> >>> +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +uint64_t len;
> >>> +int sz;
> >>> +
> >>> +sz = sizeof(len);
> >>> +if (pread(vbasedev->fd, , sz,
> >>> +region_ctl->fd_offset +
> >>> +offsetof(struct vfio_device_state_ctl, 
> >>> device_memory.size))
> >>> +!= sz) {
> >>> +error_report("vfio: Failed to get length of device memory”);
> >> 
> >> s/length/size/ ? (to be consistent with function name)
> > 
> > ok. thanks
> >>> +return -1;
> >>> +}
> >>> +vdev->migration->devmem_size = len;
> >>> +return 0;
> >>> +}
> >>> +
> >>> +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t 
> >>> size)
> >>> +{
> >>> +VFIODevice *vbasedev = >vbasedev;
> >>> +VFIORegion *region_ctl =
> >>> +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +int sz;
> >>> +
> >>> +sz = sizeof(size);
> >>> +if (pwrite(vbasedev->fd, , sz,
> >>> +region_ctl->fd_offset +
> >>> +offsetof(struct vfio_device_state_ctl, 
> >>> device_memory.size))
> >>> +!= sz) {
> >>> +error_report("vfio: Failed to set length of device comemory”);
> >> 
> >> What is comemory? Typo?
> > 
> > Right, typo. should be "memory" :)
> >> 
> >> Same comment about length vs size
> >> 
> > got it. thanks
> > 
> >>> +return -1;
> >>> +}
> >>> +vdev->migration->devmem_size = size;
> >>> +return 0;
> >>> +}
> >>> +
> >>> +static
> >>> +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> >>> +uint64_t pos, uint64_t len)
> >>> +{
> >>> +VFIODevice *vbasedev = >vbasedev;
> >>> +VFIORegion *region_ctl =
> >>> +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +VFIORegion *region_devmem =
> >>> +
> >>> >migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> >>> +void *dest;
> >>> +uint32_t sz;
> >>> +uint8_t *buf = NULL;
> >>> +uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> >>> +
> >>> +if (len > region_devmem->size) {
> >> 
> >> Is it intentional that

Re: [Qemu-devel] [PATCH 2/5] vfio/migration: support device of device config capability

2019-02-20 Thread Zhao Yan
On Tue, Feb 19, 2019 at 03:37:24PM +0100, Cornelia Huck wrote:
> On Tue, 19 Feb 2019 16:52:27 +0800
> Yan Zhao  wrote:
> 
> > Device config is the default data that every device should have. so
> > device config capability is by default on, no need to set.
> > 
> > - Currently two type of resources are saved/loaded for device of device
> >   config capability:
> >   General PCI config data, and Device config data.
> >   They are copies as a whole when precopy is stopped.
> > 
> > Migration setup flow:
> > - Setup device state regions, check its device state version and 
> > capabilities.
> >   Mmap Device Config Region and Dirty Bitmap Region, if available.
> > - If device state regions are failed to get setup, a migration blocker is
> >   registered instead.
> > - Added SaveVMHandlers to register device state save/load handlers.
> > - Register VM state change handler to set device's running/stop states.
> > - On migration startup on source machine, set device's state to
> >   VFIO_DEVICE_STATE_LOGGING
> > 
> > Signed-off-by: Yan Zhao 
> > Signed-off-by: Yulei Zhang 
> > ---
> >  hw/vfio/Makefile.objs |   2 +-
> >  hw/vfio/migration.c   | 633 
> > ++
> >  hw/vfio/pci.c |   1 -
> >  hw/vfio/pci.h |  25 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  5 files changed, 659 insertions(+), 3 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> > index 8b3f664..f32ff19 100644
> > --- a/hw/vfio/Makefile.objs
> > +++ b/hw/vfio/Makefile.objs
> > @@ -1,6 +1,6 @@
> >  ifeq ($(CONFIG_LINUX), y)
> >  obj-$(CONFIG_SOFTMMU) += common.o
> > -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> > +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
> 
> I think you want to split the migration code: The type-independent
> code, and the pci-specific code.
>
ok. actually, now only saving/loading of pci generic config data is
pci-specific. the data getting/setting through device state
interfaces are type-independent.

> >  obj-$(CONFIG_VFIO_CCW) += ccw.o
> >  obj-$(CONFIG_SOFTMMU) += platform.o
> >  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > new file mode 100644
> > index 000..16d6395
> > --- /dev/null
> > +++ b/hw/vfio/migration.c
> > @@ -0,0 +1,633 @@
> > +#include "qemu/osdep.h"
> > +
> > +#include "hw/vfio/vfio-common.h"
> > +#include "migration/blocker.h"
> > +#include "migration/register.h"
> > +#include "qapi/error.h"
> > +#include "pci.h"
> > +#include "sysemu/kvm.h"
> > +#include "exec/ram_addr.h"
> > +
> > +#define VFIO_SAVE_FLAG_SETUP 0
> > +#define VFIO_SAVE_FLAG_PCI 1
> > +#define VFIO_SAVE_FLAG_DEVCONFIG 2
> > +#define VFIO_SAVE_FLAG_DEVMEMORY 4
> > +#define VFIO_SAVE_FLAG_CONTINUE 8
> > +
> > +static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
> > +VFIORegion *region, uint32_t subtype, const char *name)
> 
> This function looks like it should be more generic and e.g. take a
> VFIODevice instead of a VFIOPCIDevice as argument.
> 
> > +{
> > +VFIODevice *vbasedev = >vbasedev;
> > +struct vfio_region_info *info;
> > +int ret;
> > +
> > +ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
> > +subtype, );
> > +if (ret) {
> > +error_report("Failed to get info of region %s", name);
> > +return ret;
> > +}
> > +
> > +if (vfio_region_setup(OBJECT(vdev), vbasedev,
> > +region, info->index, name)) {
> > +error_report("Failed to setup migrtion region %s", name);
> > +return ret;
> > +}
> > +
> > +if (vfio_region_mmap(region)) {
> > +error_report("Failed to mmap migrtion region %s", name);
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
> > +{
> > +   return !!(vdev->migration->data_caps & 
> > VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
> > +}
> > +
> > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
> > +{
> > +   return !!(vdev->migration->data_caps & 
> > VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
> > +}
> 
> These two as well. The migration structure should probably hang off the
> VFIODevice instead.
>
ok.

> > +
> > +static bool vfio_device_state_region_mmaped(VFIORegion *region)
> > +{
> > +bool mmaped = true;
> > +if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
> > +(region->size != region->mmaps[0].size) ||
> > +(region->mmaps[0].mmap == NULL)) {
> > +mmaped = false;
> > +}
> > +
> > +return mmaped;
> > +}
> 
> s/mmaped/mmapped/ ?

yes :)
> 
> > +
> > +static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
> > +{
> > +VFIODevice *vbasedev = >vbasedev;
> > +VFIORegion *region_ctl =
> > +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +VFIORegion *region_config =
> > +

Re: [Qemu-devel] [PATCH 5/5] vfio/migration: support device memory capability

2019-02-20 Thread Zhao Yan
On Tue, Feb 19, 2019 at 03:42:36PM +0100, Christophe de Dinechin wrote:
> 
> 
> > On 19 Feb 2019, at 09:53, Yan Zhao  wrote:
> > 
> > If a device has device memory capability, save/load data from device memory
> > in pre-copy and stop-and-copy phases.
> > 
> > LOGGING state is set for device memory for dirty page logging:
> > in LOGGING state, get device memory returns whole device memory snapshot;
> > outside LOGGING state, get device memory returns dirty data since last get
> > operation.
> > 
> > Usually, device memory is very big, qemu needs to chunk it into several
> > pieces each with size of device memory region.
> > 
> > Signed-off-by: Yan Zhao 
> > Signed-off-by: Kirti Wankhede 
> > ---
> > hw/vfio/migration.c | 235 
> > ++--
> > hw/vfio/pci.h   |   1 +
> > 2 files changed, 231 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 16d6395..f1e9309 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice 
> > *vdev,
> > return 0;
> > }
> > 
> > +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> > +{
> > +VFIODevice *vbasedev = >vbasedev;
> > +VFIORegion *region_ctl =
> > +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +uint64_t len;
> > +int sz;
> > +
> > +sz = sizeof(len);
> > +if (pread(vbasedev->fd, , sz,
> > +region_ctl->fd_offset +
> > +offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +!= sz) {
> > +error_report("vfio: Failed to get length of device memory”);
> 
> s/length/size/ ? (to be consistent with function name)

ok. thanks
> > +return -1;
> > +}
> > +vdev->migration->devmem_size = len;
> > +return 0;
> > +}
> > +
> > +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +VFIODevice *vbasedev = >vbasedev;
> > +VFIORegion *region_ctl =
> > +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +int sz;
> > +
> > +sz = sizeof(size);
> > +if (pwrite(vbasedev->fd, , sz,
> > +region_ctl->fd_offset +
> > +offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +!= sz) {
> > +error_report("vfio: Failed to set length of device comemory”);
> 
> What is comemory? Typo?

Right, typo. should be "memory" :)
> 
> Same comment about length vs size
>
got it. thanks

> > +return -1;
> > +}
> > +vdev->migration->devmem_size = size;
> > +return 0;
> > +}
> > +
> > +static
> > +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +uint64_t pos, uint64_t len)
> > +{
> > +VFIODevice *vbasedev = >vbasedev;
> > +VFIORegion *region_ctl =
> > +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +VFIORegion *region_devmem =
> > +>migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +void *dest;
> > +uint32_t sz;
> > +uint8_t *buf = NULL;
> > +uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +
> > +if (len > region_devmem->size) {
> 
> Is it intentional that there is no error_report here?
>
an error_report here may be better.
> > +return -1;
> > +}
> > +
> > +sz = sizeof(pos);
> > +if (pwrite(vbasedev->fd, , sz,
> > +region_ctl->fd_offset +
> > +offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +!= sz) {
> > +error_report("vfio: Failed to set save buffer pos");
> > +return -1;
> > +}
> > +sz = sizeof(action);
> > +if (pwrite(vbasedev->fd, , sz,
> > +region_ctl->fd_offset +
> > +offsetof(struct vfio_device_state_ctl, 
> > device_memory.action))
> > +!= sz) {
> > +error_report("vfio: Failed to set save buffer action");
> > +return -1;
> > +}
> > +
> > +if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +buf = g_malloc(len);
> > +if (buf == NULL) {
> > +error_report("vfio: Failed to allocate memory for migrate”);
> s/migrate/migration/ ?

yes, thanks
> > +return -1;
> > +}
> > +if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != 
> > len) {
> > +error_report("vfio: error load device memory buffer”);
> s/load/loading/ ?
error to load? :)

> > +return -1;
> > +}
> > +qemu_put_be64(f, len);
> > +qemu_put_be64(f, pos);
> > +qemu_put_buffer(f, buf, len);
> > +g_free(buf);
> > +} else {
> > +dest = region_devmem->mmaps[0].mmap;
> > +qemu_put_be64(f, len);
> > +qemu_put_be64(f, pos);
> > +qemu_put_buffer(f, dest, len);
> > +}
> > +return 0;
> > +}
> > +
> > +static int 

Re: [Qemu-devel] [PATCH 1/5] vfio/migration: define kernel interfaces

2019-02-19 Thread Zhao Yan
On Tue, Feb 19, 2019 at 02:09:18PM +0100, Cornelia Huck wrote:
> On Tue, 19 Feb 2019 16:52:14 +0800
> Yan Zhao  wrote:
> 
> > - defined 4 device states regions: one control region and 3 data regions
> > - defined layout of control region in struct vfio_device_state_ctl
> > - defined 4 device states: running, stop, running, stop
> > - define 3 device data categories: device config, device memory, system
> >   memory
> > - defined 2 device data capabilities: device memory and system memory
> > - defined device state interfaces' version and 12 device state interfaces
> > 
> > Signed-off-by: Yan Zhao 
> > Signed-off-by: Kevin Tian 
> > Signed-off-by: Yulei Zhang 
> > ---
> >  linux-headers/linux/vfio.h | 260 
> > +
> >  1 file changed, 260 insertions(+)
> 
> [commenting here for convenience; changes obviously need to be done in
> the Linux patch]
> 
yes, you can find the corresponding kernel part code at
https://patchwork.freedesktop.org/series/56876/


> > 
> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > index ceb6453..a124fc1 100644
> > --- a/linux-headers/linux/vfio.h
> > +++ b/linux-headers/linux/vfio.h
> > @@ -303,6 +303,56 @@ struct vfio_region_info_cap_type {
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
> >  
> > +/* Device State region type and sub-type
> > + *
> > + * A VFIO device driver needs to register up to four device state regions 
> > in
> > + * total: two mandatory and another two optional, if it plans to support 
> > device
> > + * state management.
> 
> Suggest to rephrase:
> 
> "A VFIO device driver that plans to support device state management
> needs to register..."
>
ok :)

> > + *
> > + * 1. region CTL :
> > + *  Mandatory.
> > + *  This is a control region.
> > + *  Its layout is defined in struct vfio_device_state_ctl.
> > + *  Reading from this region can get version, capabilities and data
> > + *  size of device state interfaces.
> > + *  Writing to this region can set device state, data size and
> > + *  choose which interface to use.
> > + * 2. region DEVICE_CONFIG
> > + *  Mandatory.
> > + *  This is a data region that holds device config data.
> > + *  Device config is such kind of data like MMIOs, page tables...
> 
> "Device config is data such as..."

ok :)
> 
> > + *  Every device is supposed to possess device config data.
> > + *  Usually the size of device config data is small (no big
> 
> s/no big/no bigger/

right :)
> 
> > + *  than 10M), and it needs to be loaded in certain strict
> > + *  order.
> > + *  Therefore no dirty data logging is enabled for device
> > + *  config and it must be got/set as a whole.
> > + *  Size of device config data is smaller than or equal to that of
> > + *  device config region.
> 
> Not sure if I understand that sentence correctly... but what if a
> device has more config state than fits into this region? Is that
> supposed to be covered by the device memory region? Or is this assumed
> to be something so exotic that we don't need to plan for it?
> 
Device config data and device config region are all provided by vendor
driver, so vendor driver is always able to create a large enough device
config region to hold device config data.
So, if a device has data that are better to be saved after device stop and
saved/loaded in strict order, the data needs to be in device config region.
This kind of data is supposed to be small.
If the device data can be saved/loaded several times, it can also be put
into device memory region.


> > + *  It is able to be mmaped into user space.
> > + * 3. region DEVICE_MEMORY
> > + *  Optional.
> > + *  This is a data region that holds device memory data.
> > + *  Device memory is device's internal memory, standalone and 
> > outside
> 
> s/outside/distinct from/ ?
ok.
> 
> > + *  system memory.  It is usually very big.
> > + *  Not all device has device memory. Like IGD only uses system
> 
> s/all devices has/all devices have/
> 
> s/Like/E.g./
>
ok :)

> > + *  memory and has no device memory.
> > + *  Size of devie memory is usually larger than that of device
> 
> s/devie/device/
> 
thanks:)

> > + *  memory region. qemu needs to save/load it in chunks of size of
> > + *  device memory region.
> 
> I'd rather not explicitly mention QEMU in this header. Maybe
> "Userspace"?
>
ok.

> > + *  It is able to be mmaped into user space.
> > + * 4. region DIRTY_BITMAP
> > + *  Optional.
> > + *  This is a data region that holds bitmap of dirty pages in 
> > system
> > + *  memory that a VFIO devices produces.
> > + *  It is able to be mmaped into user space.
> > + */
> > +#define VFIO_REGION_TYPE_DEVICE_STATE 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-19 Thread Zhao Yan
On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> 
> Hi,
>   I've sent minor comments to later patches; but some minor general
> comments:
> 
>   a) Never trust the incoming migrations stream - it might be corrupt,
> so check when you can.
hi Dave
Thanks for this suggestion. I'll add more checks for migration streams.


>   b) How do we detect if we're migrating from/to the wrong device or
> version of device?  Or say to a device with older firmware or perhaps
> a device that has less device memory ?
Actually it's still an open for VFIO migration. Need to think about
whether it's better to check that in libvirt or qemu (like a device magic
along with verion ?).
This patchset is intended to settle down the main device state interfaces
for VFIO migration. So that we can work on that and improve it.


>   c) Consider using the trace_ mechanism - it's really useful to
> add to loops writing/reading data so that you can see when it fails.
> 
> Dave
>
Got it. many thanks~~


> (P.S. You have a few typo's grep your code for 'devcie', 'devie' and
> 'migrtion'

sorry :)
> 
> > Device Data
> > ---
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> > Every device is supposed to possess device config data.
> > Usually device config's size is small (no big than 10M), and it
> > needs to be loaded in certain strict order.
> > Therefore, device config only needs to be saved/loaded in
> > stop-and-copy phase.
> > The data of device config is held in device config region.
> > Size of device config data is smaller than or equal to that of
> > device config region.
> > 
> > Device Memory: device's internal memory, standalone and outside system
> > memory. It is usually very big.
> > This kind of data needs to be saved / loaded in pre-copy and
> > stop-and-copy phase.
> > The data of device memory is held in device memory region.
> > Size of devie memory is usually larger than that of device
> > memory region. qemu needs to save/load it in chunks of size of
> > device memory region.
> > Not all device has device memory. Like IGD only uses system memory.
> > 
> > System memory dirty pages: If a device produces dirty pages in system
> > memory, it is able to get dirty bitmap for certain range of system
> > memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > callback, dirty pages in system memory will be save/loaded by ram's
> > live migration code.
> > The dirty bitmap of system memory is held in dirty bitmap region.
> > If system memory range is larger than that dirty bitmap region can
> > hold, qemu will cut it into several chunks and get dirty bitmap in
> > succession.
> > 
> > 
> > Device State Regions
> > 
> > Vendor driver is required to expose two mandatory regions and another two
> > optional regions if it plans to support device state management.
> > 
> > So, there are up to four regions in total.
> > One control region: mandatory.
> > Get access via read/write system call.
> > Its layout is defined in struct vfio_device_state_ctl
> > Three data regions: mmaped into qemu.
> > device config region: mandatory, holding data of device config
> > device memory region: optional, holding data of device memory
> > dirty bitmap region: optional, holding bitmap of system memory
> > dirty pages
> > 
> > (The reason why four seperate regions are defined is that the unit of mmap
> > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > control and three mmaped regions for data seems better than one big region
> > padded and sparse mmaped).
> > 
> > 
> > kernel device state interface [1]
> > --
> > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > 
> > #define VFIO_DEVICE_STATE_RUNNING 0 
> > #define VFIO_DEVICE_STATE_STOP 1
> > #define VFIO_DEVICE_STATE_LOGGING 2
> > 
> > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > 
> > struct vfio_device_state_ctl {
> > __u32 version;/* ro */
> > 

Re: [Qemu-devel] [PATCH 5/5] vfio/migration: support device memory capability

2019-02-19 Thread Zhao Yan
On Tue, Feb 19, 2019 at 11:25:43AM +, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > If a device has device memory capability, save/load data from device memory
> > in pre-copy and stop-and-copy phases.
> > 
> > LOGGING state is set for device memory for dirty page logging:
> > in LOGGING state, get device memory returns whole device memory snapshot;
> > outside LOGGING state, get device memory returns dirty data since last get
> > operation.
> > 
> > Usually, device memory is very big, qemu needs to chunk it into several
> > pieces each with size of device memory region.
> > 
> > Signed-off-by: Yan Zhao 
> > Signed-off-by: Kirti Wankhede 
> > ---
> >  hw/vfio/migration.c | 235 
> > ++--
> >  hw/vfio/pci.h   |   1 +
> >  2 files changed, 231 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 16d6395..f1e9309 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice 
> > *vdev,
> >  return 0;
> >  }
> >  
> > +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> > +{
> > +VFIODevice *vbasedev = >vbasedev;
> > +VFIORegion *region_ctl =
> > +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +uint64_t len;
> > +int sz;
> > +
> > +sz = sizeof(len);
> > +if (pread(vbasedev->fd, , sz,
> > +region_ctl->fd_offset +
> > +offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +!= sz) {
> > +error_report("vfio: Failed to get length of device memory");
> > +return -1;
> > +}
> > +vdev->migration->devmem_size = len;
> > +return 0;
> > +}
> > +
> > +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +VFIODevice *vbasedev = >vbasedev;
> > +VFIORegion *region_ctl =
> > +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +int sz;
> > +
> > +sz = sizeof(size);
> > +if (pwrite(vbasedev->fd, , sz,
> > +region_ctl->fd_offset +
> > +offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +!= sz) {
> > +error_report("vfio: Failed to set length of device comemory");
> > +return -1;
> > +}
> > +vdev->migration->devmem_size = size;
> > +return 0;
> > +}
> > +
> > +static
> > +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +uint64_t pos, uint64_t len)
> > +{
> > +VFIODevice *vbasedev = >vbasedev;
> > +VFIORegion *region_ctl =
> > +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +VFIORegion *region_devmem =
> > +>migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +void *dest;
> > +uint32_t sz;
> > +uint8_t *buf = NULL;
> > +uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +
> > +if (len > region_devmem->size) {
> > +return -1;
> > +}
> > +
> > +sz = sizeof(pos);
> > +if (pwrite(vbasedev->fd, , sz,
> > +region_ctl->fd_offset +
> > +offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +!= sz) {
> > +error_report("vfio: Failed to set save buffer pos");
> > +return -1;
> > +}
> > +sz = sizeof(action);
> > +if (pwrite(vbasedev->fd, , sz,
> > +region_ctl->fd_offset +
> > +offsetof(struct vfio_device_state_ctl, 
> > device_memory.action))
> > +!= sz) {
> > +error_report("vfio: Failed to set save buffer action");
> > +return -1;
> > +}
> > +
> > +if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +buf = g_malloc(len);
> > +if (buf == NULL) {
> > +error_report("vfio: Failed to allocate memory for migrate");
> > +return -1;
> > +}
> > +if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != 
> > len) {
> > +error_report("vfio: error load device memory buffer");
> 
> That's forgotten to g_free(buf)
>
Right, I'll correct that.

> > +return -1;
> > +}
> > +qemu_put_be64(f, len);
> > +qemu_put_be64(f, pos);
> > +qemu_put_buffer(f, buf, len);
> > +g_free(buf);
> > +} else {
> > +dest = region_devmem->mmaps[0].mmap;
> > +qemu_put_be64(f, len);
> > +qemu_put_be64(f, pos);
> > +qemu_put_buffer(f, dest, len);
> > +}
> > +return 0;
> > +}
> > +
> > +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +VFIORegion *region_devmem =
> > +>migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +uint64_t total_len = vdev->migration->devmem_size;
> > +uint64_t pos = 0;
> > +
> > +qemu_put_be64(f, total_len);
> > +while (pos < 

Re: [Qemu-devel] [PATCH 2/5] vfio/migration: support device of device config capability

2019-02-19 Thread Zhao Yan
On Tue, Feb 19, 2019 at 11:01:45AM +, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > Device config is the default data that every device should have. so
> > device config capability is by default on, no need to set.
> > 
> > - Currently two type of resources are saved/loaded for device of device
> >   config capability:
> >   General PCI config data, and Device config data.
> >   They are copies as a whole when precopy is stopped.
> > 
> > Migration setup flow:
> > - Setup device state regions, check its device state version and 
> > capabilities.
> >   Mmap Device Config Region and Dirty Bitmap Region, if available.
> > - If device state regions are failed to get setup, a migration blocker is
> >   registered instead.
> > - Added SaveVMHandlers to register device state save/load handlers.
> > - Register VM state change handler to set device's running/stop states.
> > - On migration startup on source machine, set device's state to
> >   VFIO_DEVICE_STATE_LOGGING
> > 
> > Signed-off-by: Yan Zhao 
> > Signed-off-by: Yulei Zhang 
> > ---
> >  hw/vfio/Makefile.objs |   2 +-
> >  hw/vfio/migration.c   | 633 
> > ++
> >  hw/vfio/pci.c |   1 -
> >  hw/vfio/pci.h |  25 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  5 files changed, 659 insertions(+), 3 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> > index 8b3f664..f32ff19 100644
> > --- a/hw/vfio/Makefile.objs
> > +++ b/hw/vfio/Makefile.objs
> > @@ -1,6 +1,6 @@
> >  ifeq ($(CONFIG_LINUX), y)
> >  obj-$(CONFIG_SOFTMMU) += common.o
> > -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> > +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
> >  obj-$(CONFIG_VFIO_CCW) += ccw.o
> >  obj-$(CONFIG_SOFTMMU) += platform.o
> >  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > new file mode 100644
> > index 000..16d6395
> > --- /dev/null
> > +++ b/hw/vfio/migration.c
> > @@ -0,0 +1,633 @@
> > +#include "qemu/osdep.h"
> > +
> > +#include "hw/vfio/vfio-common.h"
> > +#include "migration/blocker.h"
> > +#include "migration/register.h"
> > +#include "qapi/error.h"
> > +#include "pci.h"
> > +#include "sysemu/kvm.h"
> > +#include "exec/ram_addr.h"
> > +
> > +#define VFIO_SAVE_FLAG_SETUP 0
> > +#define VFIO_SAVE_FLAG_PCI 1
> > +#define VFIO_SAVE_FLAG_DEVCONFIG 2
> > +#define VFIO_SAVE_FLAG_DEVMEMORY 4
> > +#define VFIO_SAVE_FLAG_CONTINUE 8
> > +
> > +static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
> > +VFIORegion *region, uint32_t subtype, const char *name)
> > +{
> > +VFIODevice *vbasedev = >vbasedev;
> > +struct vfio_region_info *info;
> > +int ret;
> > +
> > +ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
> > +subtype, );
> > +if (ret) {
> > +error_report("Failed to get info of region %s", name);
> > +return ret;
> > +}
> > +
> > +if (vfio_region_setup(OBJECT(vdev), vbasedev,
> > +region, info->index, name)) {
> > +error_report("Failed to setup migrtion region %s", name);
> > +return ret;
> > +}
> > +
> > +if (vfio_region_mmap(region)) {
> > +error_report("Failed to mmap migrtion region %s", name);
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
> > +{
> > +   return !!(vdev->migration->data_caps & 
> > VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
> > +}
> > +
> > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
> > +{
> > +   return !!(vdev->migration->data_caps & 
> > VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
> > +}
> > +
> > +static bool vfio_device_state_region_mmaped(VFIORegion *region)
> > +{
> > +bool mmaped = true;
> > +if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
> > +(region->size != region->mmaps[0].size) ||
> > +(region->mmaps[0].mmap == NULL)) {
> > +mmaped = false;
> > +}
> > +
> > +return mmaped;
> > +}
> > +
> > +static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
> > +{
> > +VFIODevice *vbasedev = >vbasedev;
> > +VFIORegion *region_ctl =
> > +>migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +VFIORegion *region_config =
> > +>migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +uint64_t len;
> > +int sz;
> > +
> > +sz = sizeof(len);
> > +if (pread(vbasedev->fd, , sz,
> > +region_ctl->fd_offset +
> > +offsetof(struct vfio_device_state_ctl, device_config.size))
> > +!= sz) {
> > +error_report("vfio: Failed to get length of device config");
> > +return -1;
> > +}
> > +if (len > region_config->size) {
> > +error_report("vfio: Error device config length");
> > +return -1;
> > +}

Re: [Qemu-devel] [PATCH] vfio: assign idstr for VFIO's mmaped regions for migration

2019-01-09 Thread Zhao Yan
On Tue, Jan 08, 2019 at 10:09:11AM -0700, Alex Williamson wrote:
> On Tue,  8 Jan 2019 01:03:48 -0500
> Zhao Yan  wrote:
> 
> > if multiple regions in vfio are mmaped, their corresponding ramblocks
> > are like below, i.e. their idstrs are "".
> > 
> > (qemu) info ramblock
> > Block Name  PSize   Offset   UsedTotal
> > pc.ram  4 KiB  0x 0x2000 0x2000
> > 4 KiB  0x2110 0x2000 0x2000
> > 4 KiB  0x2090 0x0080 0x0080
> > 4 KiB  0x2024 0x00687000 0x00687000
> > 4 KiB  0x200c 0x00178000 0x00178000
> > pc.bios 4 KiB  0x2000 0x0004 0x0004
> > pc.rom  4 KiB  0x2004 0x0002 0x0002
> > 
> > This is because ramblocks' idstr are assigned by calling
> > vmstate_register_ram(), but memory region of type ram device ptr does not
> > call vmstate_register_ram().
> > vfio_region_mmap
> > |->memory_region_init_ram_device_ptr
> >|-> memory_region_init_ram_ptr
> > 
> > Without empty idstrs will cause problem to snapshot copying during
> > migration, because it uses ramblocks' idstr to identify ramblocks.
> > ram_save_setup {
> >   …
> >   RAMBLOCK_FOREACH(block) {
> >   qemu_put_byte(f, strlen(block->idstr));
> >   qemu_put_buffer(f, (uint8_t *)block->idstr,strlen(block->idstr));
> >   qemu_put_be64(f, block->used_length);
> >   }
> >   …
> > }
> > ram_load() {
> > block = qemu_ram_block_by_name(id);
> > if (block) {
> > if (length != block->used_length) {
> > qemu_ram_resize(block, length, _err);
> > }
> >  ….
> >}
> > }
> > 
> > Therefore, in this patch,
> > vmstate_register_ram() is called for memory region of type ram ptr,
> > also a unique vfioid is assigned to vfio devices across source
> > and target vms.
> > e.g. in source vm, use qemu parameter
> > -device
> > vfio-pci,sysfsdev=/sys/bus/pci/devices/:00:02.0/
> > 882cc4da-dede-11e7-9180-078a62063ab1,vfioid=igd
> > 
> > and in target vm, use qemu paramter
> > -device
> > vfio-pci,sysfsdev=/sys/bus/pci/devices/:00:02.0/
> > 5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd,vfioid=igd
> 
> Why wouldn't we just use the id= (DeviceState.id) value instead of
> adding yet another one?  I can't imagine anyone, especially libvirt,
> wants to deal with a vfio specific id for a device.
>
hi Alex
You are right! DeviceState.id can be used here. Thanks for your suggestion.


> > Signed-off-by: Zhao Yan 
> > ---
> >  hw/vfio/pci.c | 8 +++-
> >  include/hw/vfio/vfio-common.h | 1 +
> >  memory.c  | 4 
> >  3 files changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index c0cb1ec289..7bc2ed0752 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -2533,7 +2533,12 @@ static void vfio_populate_device(VFIOPCIDevice 
> > *vdev, Error **errp)
> >  }
> >  
> >  for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; 
> > i++) {
> > -char *name = g_strdup_printf("%s BAR %d", vbasedev->name, i);
> > +char *name;
> > +if (vbasedev->vfioid) {
> > +name = g_strdup_printf("%s BAR %d", vbasedev->vfioid, i);
> > +} else {
> > +name = g_strdup_printf("%s BAR %d", vbasedev->name, i);
> > +}
> >  
> >  ret = vfio_region_setup(OBJECT(vdev), vbasedev,
> >  >bars[i].region, i, name);
> > @@ -3180,6 +3185,7 @@ static void vfio_instance_init(Object *obj)
> >  static Property vfio_pci_dev_properties[] = {
> >  DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
> >  DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev),
> > +DEFINE_PROP_STRING("vfioid", VFIOPCIDevice, vbasedev.vfioid),
> >  DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
> >  display, ON_OFF_AUTO_OFF),
> >  DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index 1b434d02f6..84bab94f52 

[Qemu-devel] [PATCH] vfio: assign idstr for VFIO's mmaped regions for migration

2019-01-07 Thread Zhao Yan
if multiple regions in vfio are mmaped, their corresponding ramblocks
are like below, i.e. their idstrs are "".

(qemu) info ramblock
Block Name  PSize   Offset   UsedTotal
pc.ram  4 KiB  0x 0x2000 0x2000
4 KiB  0x2110 0x2000 0x2000
4 KiB  0x2090 0x0080 0x0080
4 KiB  0x2024 0x00687000 0x00687000
4 KiB  0x200c 0x00178000 0x00178000
pc.bios 4 KiB  0x2000 0x0004 0x0004
pc.rom  4 KiB  0x2004 0x0002 0x0002

This is because ramblocks' idstr are assigned by calling
vmstate_register_ram(), but memory region of type ram device ptr does not
call vmstate_register_ram().
vfio_region_mmap
|->memory_region_init_ram_device_ptr
   |-> memory_region_init_ram_ptr

Without empty idstrs will cause problem to snapshot copying during
migration, because it uses ramblocks' idstr to identify ramblocks.
ram_save_setup {
  …
  RAMBLOCK_FOREACH(block) {
  qemu_put_byte(f, strlen(block->idstr));
  qemu_put_buffer(f, (uint8_t *)block->idstr,strlen(block->idstr));
  qemu_put_be64(f, block->used_length);
  }
  …
}
ram_load() {
block = qemu_ram_block_by_name(id);
if (block) {
if (length != block->used_length) {
qemu_ram_resize(block, length, _err);
}
 ….
   }
}

Therefore, in this patch,
vmstate_register_ram() is called for memory region of type ram ptr,
also a unique vfioid is assigned to vfio devices across source
and target vms.
e.g. in source vm, use qemu parameter
-device
vfio-pci,sysfsdev=/sys/bus/pci/devices/:00:02.0/
882cc4da-dede-11e7-9180-078a62063ab1,vfioid=igd

and in target vm, use qemu paramter
-device
vfio-pci,sysfsdev=/sys/bus/pci/devices/:00:02.0/
5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd,vfioid=igd

Signed-off-by: Zhao Yan 
---
 hw/vfio/pci.c | 8 +++-
 include/hw/vfio/vfio-common.h | 1 +
 memory.c  | 4 
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c0cb1ec289..7bc2ed0752 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2533,7 +2533,12 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, 
Error **errp)
 }
 
 for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
-char *name = g_strdup_printf("%s BAR %d", vbasedev->name, i);
+char *name;
+if (vbasedev->vfioid) {
+name = g_strdup_printf("%s BAR %d", vbasedev->vfioid, i);
+} else {
+name = g_strdup_printf("%s BAR %d", vbasedev->name, i);
+}
 
 ret = vfio_region_setup(OBJECT(vdev), vbasedev,
 >bars[i].region, i, name);
@@ -3180,6 +3185,7 @@ static void vfio_instance_init(Object *obj)
 static Property vfio_pci_dev_properties[] = {
 DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
 DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev),
+DEFINE_PROP_STRING("vfioid", VFIOPCIDevice, vbasedev.vfioid),
 DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
 display, ON_OFF_AUTO_OFF),
 DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1b434d02f6..84bab94f52 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -108,6 +108,7 @@ typedef struct VFIODevice {
 struct VFIOGroup *group;
 char *sysfsdev;
 char *name;
+char *vfioid;
 DeviceState *dev;
 int fd;
 int type;
diff --git a/memory.c b/memory.c
index d14c6dec1d..dbb29fa989 100644
--- a/memory.c
+++ b/memory.c
@@ -1588,6 +1588,7 @@ void memory_region_init_ram_ptr(MemoryRegion *mr,
 uint64_t size,
 void *ptr)
 {
+DeviceState *owner_dev;
 memory_region_init(mr, owner, name, size);
 mr->ram = true;
 mr->terminates = true;
@@ -1597,6 +1598,9 @@ void memory_region_init_ram_ptr(MemoryRegion *mr,
 /* qemu_ram_alloc_from_ptr cannot fail with ptr != NULL.  */
 assert(ptr != NULL);
 mr->ram_block = qemu_ram_alloc_from_ptr(size, ptr, mr, _fatal);
+
+owner_dev = DEVICE(owner);
+vmstate_register_ram(mr, owner_dev);
 }
 
 void memory_region_init_ram_device_ptr(MemoryRegion *mr,
-- 
2.17.1




[Qemu-devel] [PATCH] vfio: assign idstr for VFIO's mmaped regions for migration

2019-01-07 Thread Zhao Yan
if multiple regions in vfio are mmaped, their corresponding ramblocks
are like below, i.e. their idstrs are "".

(qemu) info ramblock
Block Name  PSize   Offset   UsedTotal
pc.ram  4 KiB  0x 0x2000 0x2000
4 KiB  0x2110 0x2000 0x2000
4 KiB  0x2090 0x0080 0x0080
4 KiB  0x2024 0x00687000 0x00687000
4 KiB  0x200c 0x00178000 0x00178000
pc.bios 4 KiB  0x2000 0x0004 0x0004
pc.rom  4 KiB  0x2004 0x0002 0x0002

This is because ramblocks' idstr are assigned by calling
vmstate_register_ram(), but memory region of type ram device ptr does not
call vmstate_register_ram().
vfio_region_mmap
|->memory_region_init_ram_device_ptr
   |-> memory_region_init_ram_ptr

Without empty idstrs will cause problem to snapshot copying during
migration, because it uses ramblocks' idstr to identify ramblocks.
ram_save_setup {
  …
  RAMBLOCK_FOREACH(block) {
  qemu_put_byte(f, strlen(block->idstr));
  qemu_put_buffer(f, (uint8_t *)block->idstr,strlen(block->idstr));
  qemu_put_be64(f, block->used_length);
  }
  …
}
ram_load() {
block = qemu_ram_block_by_name(id);
if (block) {
if (length != block->used_length) {
qemu_ram_resize(block, length, _err);
}
 ….
   }
}

Therefore, in this patch,
vmstate_register_ram() is called for memory region of type ram ptr,
also a unique vfioid is assigned to vfio devices across source
and target vms.
e.g. in source vm, use qemu parameter
-device
vfio-pci,sysfsdev=/sys/bus/pci/devices/:00:02.0/
882cc4da-dede-11e7-9180-078a62063ab1,vfioid=igd

and in target vm, use qemu paramter
-device
vfio-pci,sysfsdev=/sys/bus/pci/devices/:00:02.0/
5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd,vfioid=igd

Signed-off-by: Zhao Yan 
---
 hw/vfio/pci.c | 8 +++-
 include/hw/vfio/vfio-common.h | 1 +
 memory.c  | 4 
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c0cb1ec289..7bc2ed0752 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2533,7 +2533,12 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, 
Error **errp)
 }
 
 for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
-char *name = g_strdup_printf("%s BAR %d", vbasedev->name, i);
+char *name;
+if (vbasedev->vfioid) {
+name = g_strdup_printf("%s BAR %d", vbasedev->vfioid, i);
+} else {
+name = g_strdup_printf("%s BAR %d", vbasedev->name, i);
+}
 
 ret = vfio_region_setup(OBJECT(vdev), vbasedev,
 >bars[i].region, i, name);
@@ -3180,6 +3185,7 @@ static void vfio_instance_init(Object *obj)
 static Property vfio_pci_dev_properties[] = {
 DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
 DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev),
+DEFINE_PROP_STRING("vfioid", VFIOPCIDevice, vbasedev.vfioid),
 DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
 display, ON_OFF_AUTO_OFF),
 DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1b434d02f6..84bab94f52 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -108,6 +108,7 @@ typedef struct VFIODevice {
 struct VFIOGroup *group;
 char *sysfsdev;
 char *name;
+char *vfioid;
 DeviceState *dev;
 int fd;
 int type;
diff --git a/memory.c b/memory.c
index d14c6dec1d..dbb29fa989 100644
--- a/memory.c
+++ b/memory.c
@@ -1588,6 +1588,7 @@ void memory_region_init_ram_ptr(MemoryRegion *mr,
 uint64_t size,
 void *ptr)
 {
+DeviceState *owner_dev;
 memory_region_init(mr, owner, name, size);
 mr->ram = true;
 mr->terminates = true;
@@ -1597,6 +1598,9 @@ void memory_region_init_ram_ptr(MemoryRegion *mr,
 /* qemu_ram_alloc_from_ptr cannot fail with ptr != NULL.  */
 assert(ptr != NULL);
 mr->ram_block = qemu_ram_alloc_from_ptr(size, ptr, mr, _fatal);
+
+owner_dev = DEVICE(owner);
+vmstate_register_ram(mr, owner_dev);
 }
 
 void memory_region_init_ram_device_ptr(MemoryRegion *mr,
-- 
2.17.1




Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices

2018-12-18 Thread Zhao Yan
right, a capabilities field in struct vfio_device_migration_info can avoid
populating iteration APIs and migration states into every vendor drivers
who actually may not requires those APIs and simply do nothing or return
value 0 in response to those APIs.

struct vfio_device_migration_info {
__u32 device_state; /* VFIO device state */
+ __u32 capabilities;/* VFIO device capabilities */
struct {
__u64 precopy_only;
__u64 compatible;
__u64 postcopy_only;
__u64 threshold_size;
} pending;  
 ...
};
 
So, only for devices who need iteration APIs, like GPU with standalone
video memory, can set flag VFIO_MIGRATION_HAS_ITERTATION to this
capabilities field. Then callbacks like save_live_iterate(),
is_active_iterate(), save_live_pending() will check the flag
VFIO_MIGRATION_HAS_ITERTATION in capabilities field and send requests
into vendor driver.

But, for simple devices who only use system memory, like IGD and NIC,
will not set the flag VFIO_MIGRATION_HAS_ITERTATION, and as a result, no
need to handle requests like "Get buffer", "Set buffer", "Get pending
bytes" triggered by QEMU iteration callbacks. And therefore, detailed
migration states are not cared for vendor drivers for these devices.

Thanks to Gonglei for providing this idea and details.
Free free to give your comments to the above description.


On Mon, Dec 17, 2018 at 11:19:49AM +, Gonglei (Arei) wrote:
> Hi,
> 
> It's great to see this patch series, which is a very important step, although 
> currently only consider GPU mdev devices to support hot migration. 
> 
> However, this is based on the VFIO framework after all, so we expect 
> that we can make this live migration framework more general.
> 
> For example, the vfio_save_pending() callback is used to obtain device
> memory (such as GPU memory), but if the device (such as network card) 
> has no special proprietary memory, but only system memory? 
> It is too much to perform a null operation for this kind of device by writing
> memory to the vendor driver of kernel space. 
> 
> I think we can acquire the capability from the vendor driver before using 
> this. 
> If there is device memory that needs iterative copying, the vendor driver 
> return
> ture, otherwise return false. Then QEMU implement the specific logic, 
> otherwise return directly. Just like getting the capability list of KVM
> module, can we?
> 
> 
> Regards,
> -Gonglei
> 
> 
> > -Original Message-
> > From: Qemu-devel
> > [mailto:qemu-devel-bounces+arei.gonglei=huawei@nongnu.org] On
> > Behalf Of Kirti Wankhede
> > Sent: Wednesday, November 21, 2018 4:40 AM
> > To: alex.william...@redhat.com; c...@nvidia.com
> > Cc: zhengxiao...@alibaba-inc.com; kevin.t...@intel.com; yi.l@intel.com;
> > eskul...@redhat.com; ziye.y...@intel.com; qemu-devel@nongnu.org;
> > coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > a...@ozlabs.ru; Kirti Wankhede ;
> > eau...@redhat.com; fel...@nutanix.com; jonathan.dav...@nutanix.com;
> > changpeng@intel.com; ken@amd.com
> > Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
> > 
> > - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> > - Added SaveVMHandlers and implemented all basic functions required for live
> >   migration.
> > - Added VM state change handler to know running or stopped state of VM.
> > - Added migration state change notifier to get notification on migration 
> > state
> >   change. This state is translated to VFIO device state and conveyed to 
> > vendor
> >   driver.
> > - VFIO device supportd migration or not is decided based of migration region
> >   query. If migration region query is successful then migration is supported
> >   else migration is blocked.
> > - Structure vfio_device_migration_info is mapped at 0th offset of migration
> >   region and should always trapped by VFIO device's driver. Added both type 
> > of
> >   access support, trapped or mmapped, for data section of the region.
> > - To save device state, read data offset and size using structure
> >   vfio_device_migration_info.data, accordingly copy data from the region.
> > - To restore device state, write data offset and size in the structure and 
> > write
> >   data in the region.
> > - To get dirty page bitmap, write start address and pfn count then read 
> > count of
> >   pfns copied and accordingly read those from the rest of the region or
> > mmaped
> >   part of the region. This copy is iterated till page bitmap for all 
> > requested
> >   pfns are copied.
> > 
> > Signed-off-by: Kirti Wankhede 
> > Reviewed-by: Neo Jia 
> > ---
> >  hw/vfio/Makefile.objs |   2 +-
> >  hw/vfio/migration.c   | 729
> > ++
> >  include/hw/vfio/vfio-common.h |  23 ++
> >  3 files changed, 753 insertions(+), 1 

[Qemu-devel] [PATCH v4] xen/pt: allow passthrough of devices with bogus interrupt pin

2018-12-05 Thread Zhao Yan
For some pci device, even its PCI_INTERRUPT_PIN is not 0, it actually
doesn't support INTx mode, so its machine irq read from host sysfs is 0.
In that case, report PCI_INTERRUPT_PIN as 0 to guest and let passthrough
continue.

Cc: Roger Pau Monné 
Cc: Jan Beulich 
Reviewed-by: Roger Pau Monné 
Signed-off-by: Zhao Yan 

---
v2: fix some coding style issue
v3:
   1. let subject be more descriptive (roger)
   2. disable INTx assertion if machine irq is 0.(roger)
   3. in xen_pt_irqpin_reg_init(), drop the else branch as the default
value for *data is 0. (roger)
v4:
drop setting machine_irq as its default value is 0. (roger)
---
 hw/xen/xen_pt.c | 6 ++
 hw/xen/xen_pt_config_init.c | 4 +++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
index e5a6eff..82d7375 100644
--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -847,6 +847,12 @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
 }
 
 machine_irq = s->real_device.irq;
+if (machine_irq == 0) {
+XEN_PT_LOG(d, "machine irq is 0\n");
+cmd |= PCI_COMMAND_INTX_DISABLE;
+goto out;
+}
+
 rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, );
 if (rc < 0) {
 error_setg_errno(errp, errno, "Mapping machine irq %u to"
diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
index aee31c6..12f71c1 100644
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -300,7 +300,9 @@ static int xen_pt_irqpin_reg_init(XenPCIPassthroughState *s,
   XenPTRegInfo *reg, uint32_t real_offset,
   uint32_t *data)
 {
-*data = xen_pt_pci_read_intx(s);
+if (s->real_device.irq) {
+*data = xen_pt_pci_read_intx(s);
+}
 return 0;
 }
 
-- 
2.7.4




Re: [Qemu-devel] [PATCH v3] xen/pt: allow passthrough of devices with bogus interrupt pin

2018-12-04 Thread Zhao Yan
On Tue, Dec 04, 2018 at 10:30:18AM +0100, Roger Pau Monné wrote:
> On Tue, Dec 04, 2018 at 02:50:49AM -0500, Zhao Yan wrote:
> > For some pci device, even its PCI_INTERRUPT_PIN is not 0, it actually
> > doesn't support INTx mode, so its machine irq read from host sysfs is 0.
> > In that case, report PCI_INTERRUPT_PIN as 0 to guest and let passthrough
> > continue.
> > 
> > Cc: Roger Pau Monné 
> > Cc: Jan Beulich 
> > Signed-off-by: Zhao Yan 
> 
> Reviewed-by: Roger Pau Monné 
> 
> One nit below.
Got it. Thank you!
:)
> 
> > 
> > ---
> > v2: fix some coding style issue
> > v3:
> >1. let subject be more descriptive (roger)
> >2. disable INTx assertion if machine irq is 0.(roger)
> >3. in xen_pt_irqpin_reg_init(), drop the else branch as the default
> > value for *data is 0. (roger)
> > ---
> >  hw/xen/xen_pt.c | 7 +++
> >  hw/xen/xen_pt_config_init.c | 4 +++-
> >  2 files changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
> > index e5a6eff..b563837 100644
> > --- a/hw/xen/xen_pt.c
> > +++ b/hw/xen/xen_pt.c
> > @@ -847,6 +847,13 @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
> >  }
> >  
> >  machine_irq = s->real_device.irq;
> > +if (machine_irq == 0) {
> > +XEN_PT_LOG(d, "machine irq is 0\n");
> > +cmd |= PCI_COMMAND_INTX_DISABLE;
> > +s->machine_irq = 0;
> 
> AFAICT this is already initialized to 0, so you can drop setting
> machine_irq.
> 
> Thanks, Roger.



[Qemu-devel] [PATCH v3] xen/pt: allow passthrough of devices with bogus interrupt pin

2018-12-03 Thread Zhao Yan
For some pci device, even its PCI_INTERRUPT_PIN is not 0, it actually
doesn't support INTx mode, so its machine irq read from host sysfs is 0.
In that case, report PCI_INTERRUPT_PIN as 0 to guest and let passthrough
continue.

Cc: Roger Pau Monné 
Cc: Jan Beulich 
Signed-off-by: Zhao Yan 

---
v2: fix some coding style issue
v3:
   1. let subject be more descriptive (roger)
   2. disable INTx assertion if machine irq is 0.(roger)
   3. in xen_pt_irqpin_reg_init(), drop the else branch as the default
value for *data is 0. (roger)
---
 hw/xen/xen_pt.c | 7 +++
 hw/xen/xen_pt_config_init.c | 4 +++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
index e5a6eff..b563837 100644
--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -847,6 +847,13 @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
 }
 
 machine_irq = s->real_device.irq;
+if (machine_irq == 0) {
+XEN_PT_LOG(d, "machine irq is 0\n");
+cmd |= PCI_COMMAND_INTX_DISABLE;
+s->machine_irq = 0;
+goto out;
+}
+
 rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, );
 if (rc < 0) {
 error_setg_errno(errp, errno, "Mapping machine irq %u to"
diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
index aee31c6..12f71c1 100644
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -300,7 +300,9 @@ static int xen_pt_irqpin_reg_init(XenPCIPassthroughState *s,
   XenPTRegInfo *reg, uint32_t real_offset,
   uint32_t *data)
 {
-*data = xen_pt_pci_read_intx(s);
+if (s->real_device.irq) {
+*data = xen_pt_pci_read_intx(s);
+}
 return 0;
 }
 
-- 
2.7.4




Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen/pt: Fix a xen passthrough failure

2018-12-03 Thread Zhao Yan
hi Andrew,

It's a pci device that does not support legacy intx mode, but it accidently
reports PCI_INTERRUPT_PIN as 1, which should be 0 instead.
So, in dom0, the machine irq is 0, which will cause later
xc_physdev_map_pirq() fail and passthrough failure.

Therefore, we treat this case as PCI_INTERRUPT_PIN is 0 and report to guest
the right value (0) of PCI_INTERRUPT_PIN.
Then in guest, it's able to use msi mode and function normally.

On Mon, Dec 03, 2018 at 02:12:58PM +, Andrew Cooper wrote:
> On 03/12/2018 05:04, Zhao Yan wrote:
> > For some pci device, even its PCI_INTERRUPT_PIN is not 0, it actually
> > doesn't support INTx mode, so its machine irq read from host sysfs is 0.
> > In that case, report PCI_INTERRUPT_PIN as 0 to guest and let passthrough
> > continue.
> 
> What causes this problem?  It it a non-PCI compliant device?  Is it a
> device which has legacy lines prohibited?
> 
> ~Andrew




Re: [Qemu-devel] [PATCH v2] xen/pt: Fix a xen passthrough failure

2018-12-03 Thread Zhao Yan
On Mon, Dec 03, 2018 at 12:20:30PM +0100, Roger Pau Monné wrote:
> Hello,
> 
> Thanks for the patch.
> 
> The subject should be more descriptive, "Fix a xen passthrough
> failure" is too generic. How about: "allow passthrough of devices with
> bogus interrupt pin" or something similar.
right, thanks for your suggestion. I'll change the subject to this one.

> 
> On Mon, Dec 03, 2018 at 12:04:38AM -0500, Zhao Yan wrote:
> > For some pci device, even its PCI_INTERRUPT_PIN is not 0, it actually
> > doesn't support INTx mode, so its machine irq read from host sysfs is 0.
> > In that case, report PCI_INTERRUPT_PIN as 0 to guest and let passthrough
> > continue.
> > 
> > v2: fix some coding style issue
> 
> The changelog between versions should be below the '---'.
> 
got it. thanks :)

> > 
> > Cc: Roger Pau Monné 
> > Cc: Jan Beulich 
> > Signed-off-by: Zhao Yan 
> > ---
> >  hw/xen/xen_pt.c | 5 +
> >  hw/xen/xen_pt_config_init.c | 8 +++-
> >  2 files changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
> > index f1f3a3727c..d601c9979c 100644
> > --- a/hw/xen/xen_pt.c
> > +++ b/hw/xen/xen_pt.c
> > @@ -847,6 +847,11 @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
> >  }
> >  
> >  machine_irq = s->real_device.irq;
> > +if (machine_irq == 0) {
> > +XEN_PT_LOG(d, "machine irq is 0\n");
> 
> I would maybe consider disabling INTX assertion here on the command
> register.
> 
ok, I'll add that.
Actually I was vacillating on whether to add the disabling of INTx
assertion, as I thought PCI_INTERRUPT_PIN was already reported 0. 
But adding the disabling is better.


> > +goto out;
> > +}
> > +
> >  rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, );
> >  if (rc < 0) {
> >  error_setg_errno(errp, errno, "Mapping machine irq %u to"
> > diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
> > index 47f9010c75..1007b6c977 100644
> > --- a/hw/xen/xen_pt_config_init.c
> > +++ b/hw/xen/xen_pt_config_init.c
> > @@ -300,7 +300,13 @@ static int 
> > xen_pt_irqpin_reg_init(XenPCIPassthroughState *s,
> >XenPTRegInfo *reg, uint32_t real_offset,
> >uint32_t *data)
> >  {
> > -*data = xen_pt_pci_read_intx(s);
> > +if (s->real_device.irq)
> > +*data = xen_pt_pci_read_intx(s);
> > +else {
> > +XEN_PT_LOG(>dev,
> > +"machine irq is 0, init guest PCI_INTERRUPT_PIN to 0\n");
> > +*data = 0;
> 
> The default value for the register is already zero, so you could drop
> the else branch AFAICT.
ok, I'll drop the else branch. thanks for your review :)


> Thanks, Roger.



[Qemu-devel] [PATCH v2] xen/pt: Fix a xen passthrough failure

2018-12-02 Thread Zhao Yan
For some pci device, even its PCI_INTERRUPT_PIN is not 0, it actually
doesn't support INTx mode, so its machine irq read from host sysfs is 0.
In that case, report PCI_INTERRUPT_PIN as 0 to guest and let passthrough
continue.

v2: fix some coding style issue

Cc: Roger Pau Monné 
Cc: Jan Beulich 
Signed-off-by: Zhao Yan 
---
 hw/xen/xen_pt.c | 5 +
 hw/xen/xen_pt_config_init.c | 8 +++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
index f1f3a3727c..d601c9979c 100644
--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -847,6 +847,11 @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
 }
 
 machine_irq = s->real_device.irq;
+if (machine_irq == 0) {
+XEN_PT_LOG(d, "machine irq is 0\n");
+goto out;
+}
+
 rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, );
 if (rc < 0) {
 error_setg_errno(errp, errno, "Mapping machine irq %u to"
diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
index 47f9010c75..1007b6c977 100644
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -300,7 +300,13 @@ static int xen_pt_irqpin_reg_init(XenPCIPassthroughState 
*s,
   XenPTRegInfo *reg, uint32_t real_offset,
   uint32_t *data)
 {
-*data = xen_pt_pci_read_intx(s);
+if (s->real_device.irq)
+*data = xen_pt_pci_read_intx(s);
+else {
+XEN_PT_LOG(>dev,
+"machine irq is 0, init guest PCI_INTERRUPT_PIN to 0\n");
+*data = 0;
+}
 return 0;
 }
 
-- 
2.17.1




[Qemu-devel] [PATCH] xen/pt: Fix a xen passthrough failure

2018-12-02 Thread Zhao Yan
For some pci device, even its PCI_INTERRUPT_PIN is not 0, it actually
doesn't support INTx mode, so its machine irq read from host sysfs is 0.
In that case, report PCI_INTERRUPT_PIN as 0 to guest and let passthrough
continue.

Cc: Roger Pau Monné 
Cc: Jan Beulich 
Signed-off-by: Zhao Yan 
---
 hw/xen/xen_pt.c | 5 +
 hw/xen/xen_pt_config_init.c | 7 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
index f1f3a3727c..de63cb8e94 100644
--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -847,6 +847,11 @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
 }
 
 machine_irq = s->real_device.irq;
+if(machine_irq == 0) {
+XEN_PT_LOG(d, "machine irq is 0\n");
+goto out;
+}
+
 rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, );
 if (rc < 0) {
 error_setg_errno(errp, errno, "Mapping machine irq %u to"
diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
index 47f9010c75..8baddbed90 100644
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -300,7 +300,12 @@ static int xen_pt_irqpin_reg_init(XenPCIPassthroughState 
*s,
   XenPTRegInfo *reg, uint32_t real_offset,
   uint32_t *data)
 {
-*data = xen_pt_pci_read_intx(s);
+if (s->real_device.irq)
+*data = xen_pt_pci_read_intx(s);
+else {
+XEN_PT_LOG(>dev, "machine irq is 0, init guest PCI_INTERRUPT_PIN to 
0\n");
+*data = 0;
+}
 return 0;
 }
 
-- 
2.17.1




Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices

2018-11-29 Thread Zhao Yan
On Wed, Nov 21, 2018 at 04:39:41AM +0800, Kirti Wankhede wrote:
> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> - VFIO device supportd migration or not is decided based of migration region
>   query. If migration region query is successful then migration is supported
>   else migration is blocked.
> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>   region and should always trapped by VFIO device's driver. Added both type of
>   access support, trapped or mmapped, for data section of the region.
> - To save device state, read data offset and size using structure
>   vfio_device_migration_info.data, accordingly copy data from the region.
> - To restore device state, write data offset and size in the structure and 
> write
>   data in the region.
> - To get dirty page bitmap, write start address and pfn count then read count 
> of
>   pfns copied and accordingly read those from the rest of the region or mmaped
>   part of the region. This copy is iterated till page bitmap for all requested
>   pfns are copied.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  hw/vfio/Makefile.objs |   2 +-
>  hw/vfio/migration.c   | 729 
> ++
>  include/hw/vfio/vfio-common.h |  23 ++
>  3 files changed, 753 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index a2e7a0a7cf02..2cf2ba1440f2 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,5 +1,5 @@
>  ifeq ($(CONFIG_LINUX), y)
> -obj-$(CONFIG_SOFTMMU) += common.o
> +obj-$(CONFIG_SOFTMMU) += common.o migration.o
>  obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index ..717fb63e4f43
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,729 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2018
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include 
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +
> +/*
> + * Flags used as delimiter:
> + * 0x => MSB 32-bit all 1s
> + * 0xef10 => emulated (virtual) function IO
> + * 0x => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE  (0xef11ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xef12ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
> +{
> +VFIOMigration *migration = vbasedev->migration;
> +
> +if (!migration) {
> +return;
> +}
> +
> +if (migration->region.buffer.size) {
> +vfio_region_exit(>region.buffer);
> +vfio_region_finalize(>region.buffer);
> +}
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev)
> +{
> +VFIOMigration *migration = vbasedev->migration;
> +Object *obj = NULL;
> +int ret = -EINVAL;
> +
> +if (!migration) {
> +return ret;
> +}
> +
> +/* Migration support added for PCI device only */
> +if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +obj = vfio_pci_get_object(vbasedev);
> +}
> +
> +if (!obj) {
> +return ret;
> +}
> +
> +ret = vfio_region_setup(obj, vbasedev, >region.buffer,
> +migration->region.index, "migration");
> +if (ret) {
> +error_report("Failed to setup VFIO migration region %d: %s",
> +  migration->region.index, strerror(-ret));
> +goto err;
> +}
> +
> +if (!migration->region.buffer.size) {
> +ret = -EINVAL;
> +error_report("Invalid region size of VFIO migration region %d: %s",
> + migration->region.index, strerror(-ret));
> +goto err;
> +}
> +
> +if (migration->region.buffer.mmaps) {
> +ret = vfio_region_mmap(>region.buffer);
> +if (ret) {
> +error_report("Failed to mmap VFIO migration region %d: %s",
> + migration->region.index, strerror(-ret));
> 

Re: [Qemu-devel] [PATCH] Xen PCI passthrough: fix passthrough failure when irq map failure

2018-11-25 Thread Zhao Yan
Hi Roger and Jan,
Thanks for your review. You are right.
I'll try the way to report PCI_INTERRUPT_PIN as 0 to the guest.

Thanks
Yan

On Fri, Nov 23, 2018 at 03:26:44AM -0700, Jan Beulich wrote:
> >>> On 23.11.18 at 11:19,  wrote:
> > Adding Jan in case he has an opinion on my reply below.
> 
> I agree, fwiw.
> 
> Jan
> 
> > On Fri, Nov 23, 2018 at 12:04:51AM -0500, Zhao Yan wrote:
> >> On Thu, Nov 22, 2018 at 03:18:05PM +0100, Roger Pau Monné wrote:
> >> > On Thu, Nov 22, 2018 at 08:11:20AM -0500, Zhao Yan wrote:
> >> > > On Thu, Oct 18, 2018 at 03:56:36PM +0100, Roger Pau Monné wrote:
> >> > > > On Thu, Oct 18, 2018 at 08:22:41AM +, Zhao, Yan Y wrote:
> >> > > > > Hi
> >> > > > > The background for this patch is that: for some pci device, even 
> >> > > > > it's PCI_INTERRUPT_PIN is not 0, it
> actually does not support INTx mode, so we should just report error, disable 
> INTx mode and continue the passthrough.
> >> > > > > However, the commit 5a11d0f7 regards this as error condition and 
> >> > > > > let qemu quit passthrough, which is too
> rigorous.
> >> > > > > 
> >> > > > > Error message is below:
> >> > > > > libxl: error: libxl_qmp.c:287:qmp_handle_error_response: Domain 
> >> > > > > 2:received an error message from QMP server:
> Mapping machine irq 0 to pirq -1 failed: Operation not permitted
> >> > > > 
> >> > > > I'm having issues figuring out what's happening here.
> >> > > > s->real_device.irq is 0, yet the PCI config space read of
> >> > > > PCI_INTERRUPT_PIN returns something different than 0.
> >> > > > 
> >> > > > AFAICT this is due to some kind of error in Linux, so that even when
> >> > > > the device is supposed to have a valid IRQ the sysfs node it is set 
> >> > > > to
> >> > > > 0, do you know the actual underlying cause of this?
> >> > > > 
> >> > > > Thanks, Roger.
> >> > > Hi Roger
> >> > > Sorry for the later reply, I just missed this mail...
> >> > > On my side, it's because the hardware actually does not support INTx 
> >> > > mode,
> >> > > but its configuration space does not report PCI_INTERRUPT_PIN to 0. 
> >> > > It's a
> >> > > hardware bug, but previous version of qemu can tolerate it, switch to 
> >> > > MSI
> >> > > and make passthrough work.
> >> > 
> >> > Then I think it would be better to check both PCI_INTERRUPT_PIN and
> >> > s->real_device.irq before attempting to map the IRQ.
> >> > 
> >> > Making the error non-fatal would mean that a device with a valid IRQ
> >> > could fail to be setup correctly but the guest will still be created,
> >> > and things won't go as expected when the guest attempts to use it.
> >> > 
> >> > Thanks, Roger.
> >> hi roger
> >> thanks for your sugguestion. it's right that "s->real_device.irq" is 
> >> needed to be checked before mapping, like if
> it's 0.
> >> but on the other hand, maybe xc_physdev_map_pirq() itself can serve as a 
> >> checking of "s->real_device.irq" ?
> >> like in our case, it will fail and return -EPERM.
> >> then error hanling is still conducted ==>set INTX_DISABLE flag, 
> >> eventhrough the error is not fatal.
> >> 
> >> machine_irq = s->real_device.irq;
> >> rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, );
> >> if (rc < 0) {
> >> error_setg_errno(errp, errno, "Mapping machine irq %u to"
> >>  " pirq %i failed", machine_irq, pirq);
> >> 
> >> /* Disable PCI intx assertion (turn on bit10 of devctl) */
> >> cmd |= PCI_COMMAND_INTX_DISABLE;
> >> machine_irq = 0;
> >> s->machine_irq = 0;
> >> So, do you think it's all right just converting fatal error to non-fatal?
> > 
> > As I said above, I think it would be better to leave the error as
> > fatal and avoid attempting a xc_physdev_map_pirq with a machine_irq ==
> > 0, which will fail.
> > 
> > If we really want to go down the route of making the error non-fatal,
> > I think you will also have to report PCI_INTERRUPT_PIN as 0 to the
> > guest, so that it's clear to the guest that the device doesn't have
> > legacy interrupt support.
> > 
> > Exposing a device with PCI_INTERRUPT_PIN != 0 but then not allowing
> > the guest to clear PCI_COMMAND_INTX_DISABLE is likely bogus.
> > 
> > Thanks, Roger.
> 
> 
> 



Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface

2018-11-22 Thread Zhao Yan
On Wed, Nov 21, 2018 at 04:39:39AM +0800, Kirti Wankhede wrote:
> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined actions and members of structure usage for each action:
> * To convey VFIO device state to be transitioned to.
> * To get pending bytes yet to be migrated for VFIO device
> * To ask driver to write data to migration region and return number of 
> bytes
>   written in the region
> * In migration resume path, user space app writes to migration region and
>   communicates it to vendor driver.
> * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  linux-headers/linux/vfio.h | 130 
> +
>  1 file changed, 130 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..a6e45cb2cae2 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
> 
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION (1 << 30)
> +#define VFIO_REGION_SUBTYPE_MIGRATION  (1)
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> mmapped
>   * which allows direct access to non-MSIX registers which happened to be 
> within
> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
> 
>  #define VFIO_DEVICE_IOEVENTFD  _IO(VFIO_TYPE, VFIO_BASE + 16)
> 
> +/**
> + * VFIO device states :
> + * VFIO User space application should set the device state to indicate vendor
> + * driver in which state the VFIO device should transitioned.
> + * - VFIO_DEVICE_STATE_NONE:
> + *   State when VFIO device is initialized but not yet running.
> + * - VFIO_DEVICE_STATE_RUNNING:
> + *   Transition VFIO device in running state, that is, user space 
> application or
> + *   VM is active.
> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.
> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are running,
> + *   transition VFIO device in pre-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> + *   When VFIO user space application or VM is stopped and vCPUs are halted,
> + *   transition VFIO device in stop-and-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> + *   When VFIO user space application has copied data provided by vendor 
> driver.
> + *   This state is used by vendor driver to clean up all software state that 
> was
> + *   setup during MIGRATION_SETUP state.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> + *   Transition VFIO device to resume state, that is, start resuming VFIO 
> device
> + *   when user space application or VM is not running and vCPUs are halted.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> + *   When user space application completes iterations of providing device 
> state
> + *   data, transition device in resume completed state.
> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> + *   Migration process failed due to some reason, transition device to failed
> + *   state. If migration process fails while saving at source, resume device 
> at
> + *   source. If migration process fails while resuming application or VM at
> + *   destination, stop restoration at destination and resume at source.
> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> + *   User space application has cancelled migration process either for some
> + *   known reason or due to user's intervention. Transition device to 
> Cancelled
> + *   state, that is, resume device state as it was during running state at
> + *   source.
> + */
> +
> +enum {
> +VFIO_DEVICE_STATE_NONE,
> +VFIO_DEVICE_STATE_RUNNING,
> +VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> +VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> +VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related 
> migration
> + * information.
> + *
> + * Action Set state:
> + *  To tell vendor driver the state VFIO device should be transitioned 
> to.
> + *  device_state [input] : 

Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices

2018-11-22 Thread Zhao Yan
On Fri, Nov 23, 2018 at 02:51:39AM +0530, Kirti Wankhede wrote:
> 
> 
> On 11/21/2018 1:09 PM, Zhao, Yan Y wrote:
> > 
> > 
> >> -Original Message-
> >> From: Qemu-devel [mailto:qemu-devel-
> >> bounces+yan.y.zhao=intel@nongnu.org] On Behalf Of Kirti Wankhede
> >> Sent: Wednesday, November 21, 2018 4:40 AM
> >> To: alex.william...@redhat.com; c...@nvidia.com
> >> Cc: zhengxiao...@alibaba-inc.com; Tian, Kevin ; Liu, 
> >> Yi L
> >> ; eskul...@redhat.com; Yang, Ziye 
> >> ;
> >> qemu-devel@nongnu.org; coh...@redhat.com; shuangtai@alibaba-inc.com;
> >> dgilb...@redhat.com; Wang, Zhi A ;
> >> mlevi...@redhat.com; pa...@linux.ibm.com; a...@ozlabs.ru; Kirti Wankhede
> >> ; eau...@redhat.com; fel...@nutanix.com;
> >> jonathan.dav...@nutanix.com; Liu, Changpeng ;
> >> ken@amd.com
> >> Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
> >>
> >> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> >> - Added SaveVMHandlers and implemented all basic functions required for 
> >> live
> >>   migration.
> >> - Added VM state change handler to know running or stopped state of VM.
> >> - Added migration state change notifier to get notification on migration 
> >> state
> >>   change. This state is translated to VFIO device state and conveyed to 
> >> vendor
> >>   driver.
> >> - VFIO device supportd migration or not is decided based of migration 
> >> region
> >>   query. If migration region query is successful then migration is 
> >> supported
> >>   else migration is blocked.
> >> - Structure vfio_device_migration_info is mapped at 0th offset of migration
> >>   region and should always trapped by VFIO device's driver. Added both 
> >> type of
> >>   access support, trapped or mmapped, for data section of the region.
> >> - To save device state, read data offset and size using structure
> >>   vfio_device_migration_info.data, accordingly copy data from the region.
> >> - To restore device state, write data offset and size in the structure and 
> >> write
> >>   data in the region.
> >> - To get dirty page bitmap, write start address and pfn count then read 
> >> count of
> >>   pfns copied and accordingly read those from the rest of the region or 
> >> mmaped
> >>   part of the region. This copy is iterated till page bitmap for all 
> >> requested
> >>   pfns are copied.
> >>
> >> Signed-off-by: Kirti Wankhede 
> >> Reviewed-by: Neo Jia 
> >> ---
> >>  hw/vfio/Makefile.objs |   2 +-
> >>  hw/vfio/migration.c   | 729
> >> ++
> >>  include/hw/vfio/vfio-common.h |  23 ++
> >>  3 files changed, 753 insertions(+), 1 deletion(-)  create mode 100644
> >> hw/vfio/migration.c
> >>
> >> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs index
> >> a2e7a0a7cf02..2cf2ba1440f2 100644
> >> --- a/hw/vfio/Makefile.objs
> >> +++ b/hw/vfio/Makefile.objs
> >> @@ -1,5 +1,5 @@
> >>  ifeq ($(CONFIG_LINUX), y)
> >> -obj-$(CONFIG_SOFTMMU) += common.o
> >> +obj-$(CONFIG_SOFTMMU) += common.o migration.o
> >>  obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> >>  obj-$(CONFIG_VFIO_CCW) += ccw.o
> >>  obj-$(CONFIG_SOFTMMU) += platform.o
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644
> >> index ..717fb63e4f43
> >> --- /dev/null
> >> +++ b/hw/vfio/migration.c
> >> @@ -0,0 +1,729 @@
> >> +/*
> >> + * Migration support for VFIO devices
> >> + *
> >> + * Copyright NVIDIA, Inc. 2018
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL, version 2. See
> >> + * the COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include 
> >> +
> >> +#include "hw/vfio/vfio-common.h"
> >> +#include "cpu.h"
> >> +#include "migration/migration.h"
> >> +#include "migration/qemu-file.h"
> >> +#include "migration/register.h"
> >> +#include "migration/blocker.h"
> >> +#include "migration/misc.h"
> >> +#include "qapi/error.h"
> >> +#include "exec/ramlist.h"
> >> +#i

Re: [Qemu-devel] [PATCH] Xen PCI passthrough: fix passthrough failure when irq map failure

2018-11-22 Thread Zhao Yan
On Thu, Nov 22, 2018 at 03:18:05PM +0100, Roger Pau Monné wrote:
> On Thu, Nov 22, 2018 at 08:11:20AM -0500, Zhao Yan wrote:
> > On Thu, Oct 18, 2018 at 03:56:36PM +0100, Roger Pau Monné wrote:
> > > On Thu, Oct 18, 2018 at 08:22:41AM +, Zhao, Yan Y wrote:
> > > > Hi
> > > > The background for this patch is that: for some pci device, even it's 
> > > > PCI_INTERRUPT_PIN is not 0, it actually does not support INTx mode, so 
> > > > we should just report error, disable INTx mode and continue the 
> > > > passthrough.
> > > > However, the commit 5a11d0f7 regards this as error condition and let 
> > > > qemu quit passthrough, which is too rigorous.
> > > > 
> > > > Error message is below:
> > > > libxl: error: libxl_qmp.c:287:qmp_handle_error_response: Domain 
> > > > 2:received an error message from QMP server: Mapping machine irq 0 to 
> > > > pirq -1 failed: Operation not permitted
> > > 
> > > I'm having issues figuring out what's happening here.
> > > s->real_device.irq is 0, yet the PCI config space read of
> > > PCI_INTERRUPT_PIN returns something different than 0.
> > > 
> > > AFAICT this is due to some kind of error in Linux, so that even when
> > > the device is supposed to have a valid IRQ the sysfs node it is set to
> > > 0, do you know the actual underlying cause of this?
> > > 
> > > Thanks, Roger.
> > Hi Roger
> > Sorry for the later reply, I just missed this mail...
> > On my side, it's because the hardware actually does not support INTx mode,
> > but its configuration space does not report PCI_INTERRUPT_PIN to 0. It's a
> > hardware bug, but previous version of qemu can tolerate it, switch to MSI
> > and make passthrough work.
> 
> Then I think it would be better to check both PCI_INTERRUPT_PIN and
> s->real_device.irq before attempting to map the IRQ.
> 
> Making the error non-fatal would mean that a device with a valid IRQ
> could fail to be setup correctly but the guest will still be created,
> and things won't go as expected when the guest attempts to use it.
> 
> Thanks, Roger.
hi roger
thanks for your sugguestion. it's right that "s->real_device.irq" is needed to 
be checked before mapping, like if it's 0.
but on the other hand, maybe xc_physdev_map_pirq() itself can serve as a 
checking of "s->real_device.irq" ?
like in our case, it will fail and return -EPERM.
then error hanling is still conducted ==>set INTX_DISABLE flag, eventhrough the 
error is not fatal.

machine_irq = s->real_device.irq;
rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, );
if (rc < 0) {
error_setg_errno(errp, errno, "Mapping machine irq %u to"
 " pirq %i failed", machine_irq, pirq);

/* Disable PCI intx assertion (turn on bit10 of devctl) */
cmd |= PCI_COMMAND_INTX_DISABLE;
machine_irq = 0;
s->machine_irq = 0;
So, do you think it's all right just converting fatal error to non-fatal?


Thanks
Yan




Re: [Qemu-devel] [PATCH] Xen PCI passthrough: fix passthrough failure when irq map failure

2018-11-22 Thread Zhao Yan
On Thu, Oct 18, 2018 at 03:56:36PM +0100, Roger Pau Monné wrote:
> On Thu, Oct 18, 2018 at 08:22:41AM +0000, Zhao, Yan Y wrote:
> > Hi
> > The background for this patch is that: for some pci device, even it's 
> > PCI_INTERRUPT_PIN is not 0, it actually does not support INTx mode, so we 
> > should just report error, disable INTx mode and continue the passthrough.
> > However, the commit 5a11d0f7 regards this as error condition and let qemu 
> > quit passthrough, which is too rigorous.
> > 
> > Error message is below:
> > libxl: error: libxl_qmp.c:287:qmp_handle_error_response: Domain 2:received 
> > an error message from QMP server: Mapping machine irq 0 to pirq -1 failed: 
> > Operation not permitted
> 
> I'm having issues figuring out what's happening here.
> s->real_device.irq is 0, yet the PCI config space read of
> PCI_INTERRUPT_PIN returns something different than 0.
> 
> AFAICT this is due to some kind of error in Linux, so that even when
> the device is supposed to have a valid IRQ the sysfs node it is set to
> 0, do you know the actual underlying cause of this?
> 
> Thanks, Roger.
Hi Roger
Sorry for the later reply, I just missed this mail...
On my side, it's because the hardware actually does not support INTx mode,
but its configuration space does not report PCI_INTERRUPT_PIN to 0. It's a
hardware bug, but previous version of qemu can tolerate it, switch to MSI
and make passthrough work.

Thanks
Yan



Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices

2018-11-22 Thread Zhao, Yan Y



> -Original Message-
> From: Qemu-devel [mailto:qemu-devel-
> bounces+yan.y.zhao=intel@nongnu.org] On Behalf Of Kirti Wankhede
> Sent: Wednesday, November 21, 2018 4:40 AM
> To: alex.william...@redhat.com; c...@nvidia.com
> Cc: zhengxiao...@alibaba-inc.com; Tian, Kevin ; Liu, Yi 
> L
> ; eskul...@redhat.com; Yang, Ziye ;
> qemu-devel@nongnu.org; coh...@redhat.com; shuangtai@alibaba-inc.com;
> dgilb...@redhat.com; Wang, Zhi A ;
> mlevi...@redhat.com; pa...@linux.ibm.com; a...@ozlabs.ru; Kirti Wankhede
> ; eau...@redhat.com; fel...@nutanix.com;
> jonathan.dav...@nutanix.com; Liu, Changpeng ;
> ken@amd.com
> Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
> 
> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> - VFIO device supportd migration or not is decided based of migration region
>   query. If migration region query is successful then migration is supported
>   else migration is blocked.
> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>   region and should always trapped by VFIO device's driver. Added both type of
>   access support, trapped or mmapped, for data section of the region.
> - To save device state, read data offset and size using structure
>   vfio_device_migration_info.data, accordingly copy data from the region.
> - To restore device state, write data offset and size in the structure and 
> write
>   data in the region.
> - To get dirty page bitmap, write start address and pfn count then read count 
> of
>   pfns copied and accordingly read those from the rest of the region or mmaped
>   part of the region. This copy is iterated till page bitmap for all requested
>   pfns are copied.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  hw/vfio/Makefile.objs |   2 +-
>  hw/vfio/migration.c   | 729
> ++
>  include/hw/vfio/vfio-common.h |  23 ++
>  3 files changed, 753 insertions(+), 1 deletion(-)  create mode 100644
> hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs index
> a2e7a0a7cf02..2cf2ba1440f2 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,5 +1,5 @@
>  ifeq ($(CONFIG_LINUX), y)
> -obj-$(CONFIG_SOFTMMU) += common.o
> +obj-$(CONFIG_SOFTMMU) += common.o migration.o
>  obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644
> index ..717fb63e4f43
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,729 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2018
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include 
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +
> +/*
> + * Flags used as delimiter:
> + * 0x => MSB 32-bit all 1s
> + * 0xef10 => emulated (virtual) function IO
> + * 0x => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE  (0xef11ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xef12ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev) {
> +VFIOMigration *migration = vbasedev->migration;
> +
> +if (!migration) {
> +return;
> +}
> +
> +if (migration->region.buffer.size) {
> +vfio_region_exit(>region.buffer);
> +vfio_region_finalize(>region.buffer);
> +}
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev) {
> +VFIOMigration *migration = vbasedev->migration;
> +Object *obj = NULL;
> +int ret = -EINVAL;
> +
> +if (!migration) {
> +return ret;
> +}
> +
> +/* Migration support added for PCI device only */
> +if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +obj = vfio_pci_get_object(vbasedev);
> +}
> +
> +if (!obj) {
> +return ret;
> +}
> +
> +ret = vfio_region_setup(obj, vbasedev, >region.buffer,
> +migration->region.index, "migration");
> +if (ret) {
> +error_report("Failed to setup VFIO 

Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices

2018-11-20 Thread Zhao, Yan Y



> -Original Message-
> From: Qemu-devel [mailto:qemu-devel-
> bounces+yan.y.zhao=intel@nongnu.org] On Behalf Of Kirti Wankhede
> Sent: Wednesday, November 21, 2018 4:40 AM
> To: alex.william...@redhat.com; c...@nvidia.com
> Cc: zhengxiao...@alibaba-inc.com; Tian, Kevin ; Liu, Yi 
> L
> ; eskul...@redhat.com; Yang, Ziye ;
> qemu-devel@nongnu.org; coh...@redhat.com; shuangtai@alibaba-inc.com;
> dgilb...@redhat.com; Wang, Zhi A ;
> mlevi...@redhat.com; pa...@linux.ibm.com; a...@ozlabs.ru; Kirti Wankhede
> ; eau...@redhat.com; fel...@nutanix.com;
> jonathan.dav...@nutanix.com; Liu, Changpeng ;
> ken@amd.com
> Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
> 
> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> - VFIO device supportd migration or not is decided based of migration region
>   query. If migration region query is successful then migration is supported
>   else migration is blocked.
> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>   region and should always trapped by VFIO device's driver. Added both type of
>   access support, trapped or mmapped, for data section of the region.
> - To save device state, read data offset and size using structure
>   vfio_device_migration_info.data, accordingly copy data from the region.
> - To restore device state, write data offset and size in the structure and 
> write
>   data in the region.
> - To get dirty page bitmap, write start address and pfn count then read count 
> of
>   pfns copied and accordingly read those from the rest of the region or mmaped
>   part of the region. This copy is iterated till page bitmap for all requested
>   pfns are copied.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  hw/vfio/Makefile.objs |   2 +-
>  hw/vfio/migration.c   | 729
> ++
>  include/hw/vfio/vfio-common.h |  23 ++
>  3 files changed, 753 insertions(+), 1 deletion(-)  create mode 100644
> hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs index
> a2e7a0a7cf02..2cf2ba1440f2 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,5 +1,5 @@
>  ifeq ($(CONFIG_LINUX), y)
> -obj-$(CONFIG_SOFTMMU) += common.o
> +obj-$(CONFIG_SOFTMMU) += common.o migration.o
>  obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644
> index ..717fb63e4f43
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,729 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2018
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include 
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +
> +/*
> + * Flags used as delimiter:
> + * 0x => MSB 32-bit all 1s
> + * 0xef10 => emulated (virtual) function IO
> + * 0x => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE  (0xef11ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xef12ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev) {
> +VFIOMigration *migration = vbasedev->migration;
> +
> +if (!migration) {
> +return;
> +}
> +
> +if (migration->region.buffer.size) {
> +vfio_region_exit(>region.buffer);
> +vfio_region_finalize(>region.buffer);
> +}
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev) {
> +VFIOMigration *migration = vbasedev->migration;
> +Object *obj = NULL;
> +int ret = -EINVAL;
> +
> +if (!migration) {
> +return ret;
> +}
> +
> +/* Migration support added for PCI device only */
> +if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +obj = vfio_pci_get_object(vbasedev);
> +}
> +
> +if (!obj) {
> +return ret;
> +}
> +
> +ret = vfio_region_setup(obj, vbasedev, >region.buffer,
> +migration->region.index, "migration");
> +if (ret) {
> +error_report("Failed to setup VFIO 

Re: [Qemu-devel] [PATCH] Xen PCI passthrough: fix passthrough failure when irq map failure

2018-10-18 Thread Zhao, Yan Y
Hi
The background for this patch is that: for some pci device, even it's 
PCI_INTERRUPT_PIN is not 0, it actually does not support INTx mode, so we 
should just report error, disable INTx mode and continue the passthrough.
However, the commit 5a11d0f7 regards this as error condition and let qemu quit 
passthrough, which is too rigorous.

Error message is below:
libxl: error: libxl_qmp.c:287:qmp_handle_error_response: Domain 2:received an 
error message from QMP server: Mapping machine irq 0 to pirq -1 failed: 
Operation not permitted
libxl: error: libxl_pci.c:1306:libxl__add_pcidevs: Domain 
2:libxl_device_pci_add failed: -3
libxl: error: libxl_create.c:1458:domcreate_attach_devices: Domain 2:unable to 
add pci devices
libxl: error: libxl_domain.c:1003:libxl__destroy_domid: Domain  2:Non-existant 
domain
libxl: error: libxl_domain.c:962:domain_destroy_callback: Domain  2:Unable to 
destroy guest
libxl: error: libxl_domain.c:889:domain_destroy_cb: Domain 2:Destruction of 
domain failed

After partially revert 5a11d0f7, the device can be passed through into domU and 
running quite well using msi mode.

> -Original Message-
> From: Zhao, Yan Y
> Sent: Tuesday, October 16, 2018 10:15 AM
> To: qemu-devel@nongnu.org; sstabell...@kernel.org;
> anthony.per...@citrix.com; xen-de...@lists.xenproject.org
> Cc: Zhao, Yan Y 
> Subject: [PATCH] Xen PCI passthrough: fix passthrough failure when irq map
> failure
> 
> Commit 5a11d0f7 mistakenly converted a log message into an error condition
> when irq map is failed for the pci device being passed through. Revert that 
> part
> of the commit.
> 
> Signed-off-by: Zhao Yan 
> ---
>  hw/xen/xen_pt.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c index e5a6eff44f..840fd0f748
> 100644
> --- a/hw/xen/xen_pt.c
> +++ b/hw/xen/xen_pt.c
> @@ -849,7 +849,7 @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
>  machine_irq = s->real_device.irq;
>  rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, );
>  if (rc < 0) {
> -error_setg_errno(errp, errno, "Mapping machine irq %u to"
> +XEN_PT_ERR(d, "Mapping machine irq %u to"
>   " pirq %i failed", machine_irq, pirq);
> 
>  /* Disable PCI intx assertion (turn on bit10 of devctl) */ @@ -871,7 
> +871,7
> @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
> PCI_SLOT(d->devfn),
> e_intx);
>  if (rc < 0) {
> -error_setg_errno(errp, errno, "Binding of interrupt %u failed",
> +XEN_PT_ERR(d, "Binding of interrupt %u failed",
>   e_intx);
> 
>  /* Disable PCI intx assertion (turn on bit10 of devctl) */
> --
> 2.17.1




[Qemu-devel] [PATCH] Xen PCI passthrough: fix passthrough failure when irq map failure

2018-10-15 Thread Zhao Yan
Commit 5a11d0f7 mistakenly converted a log message into an error
condition when irq map is failed for the pci device being
passed through. Revert that part of the commit.

Signed-off-by: Zhao Yan 
---
 hw/xen/xen_pt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
index e5a6eff44f..840fd0f748 100644
--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -849,7 +849,7 @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
 machine_irq = s->real_device.irq;
 rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, );
 if (rc < 0) {
-error_setg_errno(errp, errno, "Mapping machine irq %u to"
+XEN_PT_ERR(d, "Mapping machine irq %u to"
  " pirq %i failed", machine_irq, pirq);
 
 /* Disable PCI intx assertion (turn on bit10 of devctl) */
@@ -871,7 +871,7 @@ static void xen_pt_realize(PCIDevice *d, Error **errp)
PCI_SLOT(d->devfn),
e_intx);
 if (rc < 0) {
-error_setg_errno(errp, errno, "Binding of interrupt %u failed",
+XEN_PT_ERR(d, "Binding of interrupt %u failed",
  e_intx);
 
 /* Disable PCI intx assertion (turn on bit10 of devctl) */
-- 
2.17.1